Let's bootstrap! Recall that bootstrapping is sampling with replacement from a dataset to produce a new dataset of the same size. Random forests use bootstrapping to guard against overfitting: each tree is trained on a different bootstrap sample. Bootstrapping also has wide application in other areas of statistics; let's see two of them.

1) Produce a bootstrapped estimate of the median and 95 percent confidence interval over the median of the dependent variable in the attached dataset.

2) Use the attached data to run the linear model y = xb. Produce bootstrapped estimates of the model parameters, b, and a 95% confidence interval over them.
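
Before tackling these, here is a minimal sketch of what a single bootstrap resample looks like. The helper name `bootstrap_resample`, the toy array, and the seed are all assumptions made for illustration; none of them come from the attached dataset.

```python
import numpy as np

def bootstrap_resample(values, rng):
    """Draw one bootstrap resample: same size as the input, sampled with replacement."""
    values = np.asarray(values)
    # Random row positions in [0, len(values)); positions may repeat.
    idx = rng.integers(0, len(values), size=len(values))
    return values[idx]

rng = np.random.default_rng(0)               # arbitrary fixed seed for repeatability
toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # toy data, not from Boot.csv
resample = bootstrap_resample(toy, rng)
print(resample)
```

Because the draw is with replacement, some original values typically appear more than once in the resample and others not at all.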

**How does bootstrapping guard against overfitting?**

In [82]:
import pandas as pd
import numpy as np
import scipy.stats
import scipy as sp
import statsmodels.api as sm

In [2]:
df = pd.read_csv("/Users/hasanhaq/chi17_ds1/class_lectures/week06-mcnulty3/01-mle/Boot.csv")

In [6]:
df

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,Intercept,y
0,0.485176,0.809888,-0.900321,-0.024176,-1.173217,1,-1.517941
1,0.187372,2.440924,1.417704,-1.032075,0.090030,1,-0.919732
2,-0.174521,-0.641742,0.732103,0.488829,0.108709,1,1.883271
3,0.209709,-2.041166,0.058982,-0.451210,0.371843,1,0.313504
4,1.020862,-0.538123,0.664033,-0.422197,-0.336070,1,-0.516607
5,1.504307,0.599526,-0.686798,-0.705274,0.167085,1,-3.317442
6,0.948465,-0.473493,-0.592526,0.423172,-0.979033,1,-2.765445
7,-2.129218,1.728652,-0.177289,0.909394,1.014291,1,3.019595
8,0.923084,0.475517,0.599027,1.050015,0.102545,1,-1.662854
9,0.525010,0.782340,0.401453,1.355156,-0.562004,1,0.637847


### 1) Produce a bootstrapped estimate of the median and 95 percent confidence interval over the median of the dependent variable in the attached dataset.

In [80]:
df.sample(n=1000, replace=True, random_state=100)['y']

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,Intercept,y
520,0.800387,-0.431371,-1.291179,-1.016542,-1.075980,1,-3.858997
792,0.898458,-1.153093,-0.159555,-1.612104,-0.768983,1,-2.619429
835,0.445839,0.607006,-0.311470,0.522355,0.885619,1,-1.551964
871,0.098462,-0.218427,1.010482,-1.351074,1.847468,1,1.633654
855,2.327368,-0.447824,-1.753648,1.007483,0.078716,1,-4.102526
79,-0.442532,-0.110125,-1.153779,-0.363712,-1.819046,1,-1.247125
944,0.694470,-0.253692,0.042692,0.894759,0.126002,1,-0.511730
906,0.182521,-1.167277,-0.756297,-0.815814,0.166614,1,-2.838019
350,0.640359,0.381560,1.090414,0.146267,0.987397,1,0.667801
948,0.035252,-0.043193,-0.601774,1.016977,0.198649,1,-0.685369


In [88]:
# Median of a single bootstrap resample of y (2000 values drawn with replacement)
np.median(df['y'].sample(n=2000, replace=True))

-0.36899526792311343

In [85]:
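def median_confidence_interval(data, confidence=0.95):
    """Return (median, lower, upper) for a t-based interval around the median.

    Note: the half-width uses the standard error of the mean and a t quantile,
    centered on the sample median. It does not resample the data.
    """
    a = 1.0*np.array(data)
    n = len(a)
    m, se = np.median(a), scipy.stats.sem(a)
    # Use the public ppf method rather than the private _ppf
    h = se * scipy.stats.t.ppf((1+confidence)/2., n-1)
    return m, m-h, m+h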

In [86]:
median_confidence_interval(df['y'])

(-0.407439743473898, -0.52543425711242642, -0.28944522983536958)
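
The interval above comes from the standard error of the mean and a t quantile, so nothing is actually resampled. A percentile bootstrap for the median could look like the sketch below. The function name `bootstrap_median_ci`, the number of resamples, the seeds, and the synthetic `y_demo` array (a stand-in for `df['y']`) are assumptions for illustration, so the printed numbers will not match the notebook's data.

```python
import numpy as np

def bootstrap_median_ci(data, n_boot=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap estimate and confidence interval for the median."""
    a = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample of the same size as the data, with replacement.
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    boot_medians = np.median(a[idx], axis=1)
    alpha = 1.0 - confidence
    lower, upper = np.percentile(
        boot_medians, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    # Point estimate: the mean of the bootstrapped medians.
    return boot_medians.mean(), lower, upper

# Synthetic stand-in for df['y']; replace with the real column in the notebook.
y_demo = np.random.default_rng(1).normal(loc=-0.4, scale=2.0, size=1000)
est, lo, hi = bootstrap_median_ci(y_demo)
print(est, lo, hi)
```

In the notebook this would be called as `bootstrap_median_ci(df['y'])`. The percentile method reads the interval directly off the empirical distribution of resampled medians, so it makes no normality assumption about the median.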

### 2) Use the attached data to run the linear model y = xb. Produce bootstrapped estimates of the model parameters, b, and a 95% confidence interval over them.

In [89]:
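# Fit regression model on one bootstrap resample.
# Draw the resample once so that X and y are guaranteed to come from the same rows.
boot = df.sample(n=2000, replace=True, random_state=100)
# The Intercept column is not included, so no constant term is fitted.
X = boot[['x_1', 'x_2', 'x_3', 'x_4', 'x_5']]
y = boot['y']

results = sm.OLS(y,X).fit()
# Inspect the results
results.summary()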

Dep. Variable:,y,R-squared:,0.68
Model:,OLS,Adj. R-squared:,0.679
Method:,Least Squares,F-statistic:,847.3
Date:,"Mon, 20 Feb 2017",Prob (F-statistic):,0.0
Time:,10:17:45,Log-Likelihood:,-2990.3
No. Observations:,2000,AIC:,5991.0
Df Residuals:,1995,BIC:,6019.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,95% CI lower,95% CI upper
x_1,-1.4078,0.025,-56.706,0.000,-1.456,-1.359
x_2,-0.2657,0.026,-10.222,0.000,-0.317,-0.215
x_3,0.5312,0.025,21.393,0.000,0.482,0.580
x_4,0.4047,0.024,16.757,0.000,0.357,0.452
x_5,0.2761,0.025,11.062,0.000,0.227,0.325

Omnibus:,0.636,Durbin-Watson:,1.765
Prob(Omnibus):,0.727,Jarque-Bera (JB):,0.584
Skew:,0.039,Prob(JB):,0.747
Kurtosis:,3.03,Cond. No.,1.14
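
The fit above uses a single resample, and its confidence intervals are the usual analytic OLS ones. A bootstrapped estimate of b repeats the resample-and-fit step many times and summarizes the spread of the fitted coefficients. The sketch below resamples whole rows so each (x, y) pair stays together. To stay self-contained it uses `np.linalg.lstsq` in place of `sm.OLS` (both give the same least squares coefficients) and synthetic data in place of Boot.csv; the names `bootstrap_ols`, `X_demo`, `y_demo`, and `beta_true` are assumptions for illustration.

```python
import numpy as np

def bootstrap_ols(X, y, n_boot=500, confidence=0.95, seed=0):
    """Bootstrap least squares coefficients by resampling rows with replacement."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        # Same row indices for X and y keep each observation intact.
        idx = rng.integers(0, n, size=n)
        coefs[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    alpha = 1.0 - confidence
    lower, upper = np.percentile(
        coefs, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return coefs.mean(axis=0), lower, upper

# Synthetic stand-in for the notebook's data (not Boot.csv).
demo_rng = np.random.default_rng(42)
beta_true = np.array([-1.5, 0.5, 0.3])
X_demo = demo_rng.normal(size=(500, 3))
y_demo = X_demo @ beta_true + demo_rng.normal(size=500)

coef_est, coef_lo, coef_hi = bootstrap_ols(X_demo, y_demo)
for name, e, l, h in zip(["x_1", "x_2", "x_3"], coef_est, coef_lo, coef_hi):
    print(f"{name}: {e:.3f} [{l:.3f}, {h:.3f}]")
```

In the notebook, the call would be something like `bootstrap_ols(df[['x_1', 'x_2', 'x_3', 'x_4', 'x_5']], df['y'])`, optionally adding the `Intercept` column to the predictors if a constant term is wanted.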
