### Colab Activity 20.1: Implementing Bootstrapping

**Expected Time = 30 minutes**


This activity focuses on constructing bootstrapped samples with `pandas`. You will take samples with replacement from the data and build logistic regression models on these samples. This is a first step toward the random forest, which uses decision trees rather than a logistic regressor as the base model.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)


In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

In [6]:
df = pd.read_csv('data/fetal.zip', compression = 'zip')

In [7]:
X, y = df.drop('fetal_health', axis = 1), df['fetal_health']

In [8]:
X.sample(10, random_state = 42)

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_width,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency
282,133.0,0.002,0.01,0.003,0.002,0.0,0.0,46.0,1.1,0.0,...,69.0,95.0,164.0,5.0,0.0,139.0,135.0,138.0,9.0,0.0
1999,125.0,0.0,0.001,0.009,0.008,0.0,0.0,62.0,1.7,0.0,...,72.0,68.0,140.0,5.0,0.0,130.0,116.0,125.0,29.0,1.0
1709,131.0,0.004,0.003,0.004,0.005,0.0,0.001,60.0,2.1,0.0,...,90.0,78.0,168.0,8.0,0.0,133.0,127.0,132.0,21.0,0.0
988,131.0,0.011,0.0,0.005,0.0,0.0,0.0,29.0,1.3,0.0,...,89.0,82.0,171.0,8.0,0.0,143.0,145.0,145.0,9.0,1.0
2018,125.0,0.0,0.0,0.008,0.007,0.0,0.001,64.0,1.3,0.0,...,77.0,78.0,155.0,4.0,0.0,114.0,111.0,114.0,7.0,0.0
297,148.0,0.0,0.012,0.0,0.0,0.0,0.0,75.0,0.2,84.0,...,7.0,145.0,152.0,1.0,0.0,148.0,148.0,149.0,0.0,0.0
1737,134.0,0.005,0.001,0.007,0.005,0.0,0.0,61.0,1.1,0.0,...,83.0,90.0,173.0,5.0,0.0,142.0,143.0,147.0,17.0,1.0
651,123.0,0.0,0.0,0.0,0.0,0.0,0.0,74.0,0.3,90.0,...,9.0,120.0,129.0,2.0,0.0,123.0,124.0,125.0,0.0,0.0
70,144.0,0.001,0.0,0.005,0.0,0.0,0.0,45.0,0.8,2.0,...,30.0,138.0,168.0,3.0,0.0,162.0,157.0,160.0,5.0,1.0
290,144.0,0.0,0.005,0.0,0.0,0.0,0.0,65.0,0.4,21.0,...,27.0,129.0,156.0,2.0,0.0,150.0,146.0,148.0,3.0,1.0


[Back to top](#-Index)

### Problem 1

#### Taking Samples with Replacement


Complete the starter code for the `for` loop below as instructed:

- Use the `.sample()` method on the data `X` to take samples of size 20 with replacement. Use the iteration index `i` as the `random_state` for each sample and set the argument `replace` equal to `True`.
- Calculate the mean of each `sample` and append it to the list `means`.
- Create a DataFrame named `sample_means` where each row holds the column means of one sample from `X`.

In [9]:

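means = []
for i in range(100):
    # Draw 20 rows with replacement; seeding with i makes each sample reproducible
    sample = X.sample(n=20, random_state=i, replace=True)
    # sample.mean() is a Series holding one mean per column
    means.append(sample.mean())
# Each Series becomes one row, so the result has 100 rows
sample_means = pd.DataFrame(means)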



### ANSWER CHECK
sample_means.head()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_width,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency
0,130.85,0.0025,0.00155,0.0042,0.00355,0.0,0.0,53.75,1.295,10.9,...,81.15,83.5,164.65,5.6,0.45,134.3,130.5,134.85,16.85,0.15
1,133.1,0.00355,0.0012,0.0044,0.00185,0.0,0.00025,41.1,1.395,8.0,...,77.3,84.65,161.95,4.65,0.4,136.4,133.6,137.45,13.45,0.4
2,131.85,0.00225,0.0093,0.00485,0.00295,0.0,0.0004,34.8,1.955,6.85,...,74.05,89.8,163.85,4.05,0.25,133.85,129.55,132.85,26.25,0.2
3,132.6,0.00425,0.00325,0.00455,0.0023,0.0,0.0002,48.1,1.49,5.15,...,75.25,90.95,166.2,4.1,0.4,140.8,135.6,140.2,20.05,0.45
4,134.9,0.00175,0.00335,0.00495,0.00245,0.0,5e-05,40.3,1.31,13.4,...,80.5,83.85,164.35,4.95,0.45,139.05,134.5,138.5,13.0,0.65
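
To see why the sample means vary from row to row, the sketch below repeats the sampling loop on synthetic data and compares the spread of the sample means with the textbook standard error, sigma divided by the square root of `n`. The DataFrame `data` and its column names `feature_a` and `feature_b` are hypothetical stand-ins, not part of the fetal health dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table; the column names are hypothetical.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature_a": rng.normal(loc=130.0, scale=10.0, size=2000),
    "feature_b": rng.normal(loc=1.3, scale=0.9, size=2000),
})

n = 20  # sample size used in Problem 1

# One row of column means per sample drawn with replacement
boot_means = pd.DataFrame(
    [data.sample(n=n, replace=True, random_state=i).mean() for i in range(100)]
)

# Spread of the sample means next to sigma / sqrt(n) for each column
summary = pd.DataFrame({
    "std_of_sample_means": boot_means.std(),
    "sigma_over_sqrt_n": data.std() / np.sqrt(n),
})
print(summary)
```

The two columns of `summary` should be of similar size, though with only 100 samples the agreement is approximate.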


[Back to top](#-Index)

### Problem 2

#### Models on samples



Anticipating bootstrap aggregation, build a logistic regression model on each iteration's sample by following the steps below:

- Create `100` bootstrap samples of size `100` from the original dataset `df` using the `sample` method with `random_state` equal to `i` and `replace` equal to `True`.
- Split each sample into features `X` and target `y`. `X` will contain all columns of `df` except `fetal_health`. `y` will be equal to the `fetal_health` column.
- Instantiate a logistic regression model with `random_state` equal to `42` and fit it to `X` and `y`.
- Store the coefficients from each fitted model in a list `coefs`.


Finally, outside the `for` loop, create a DataFrame `coef_df` from the collected coefficients, with the column names matching those of the feature set `X`.



<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>baseline value</th>      <th>accelerations</th>      <th>fetal_movement</th>      <th>uterine_contractions</th>      <th>light_decelerations</th>      <th>severe_decelerations</th>      <th>prolongued_decelerations</th>      <th>abnormal_short_term_variability</th>      <th>mean_value_of_short_term_variability</th>      <th>percentage_of_time_with_abnormal_long_term_variability</th>      <th>mean_value_of_long_term_variability</th>      <th>histogram_width</th>      <th>histogram_min</th>      <th>histogram_max</th>      <th>histogram_number_of_peaks</th>      <th>histogram_number_of_zeroes</th>      <th>histogram_mode</th>      <th>histogram_mean</th>      <th>histogram_median</th>      <th>histogram_variance</th>      <th>histogram_tendency</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>-0.148653</td>      <td>0.000397</td>      <td>-0.000296</td>      <td>0.000575</td>      <td>0.000360</td>      <td>0.000000e+00</td>      <td>-0.000130</td>      <td>-0.326384</td>      <td>0.101630</td>      <td>-0.211499</td>      <td>-0.013581</td>      <td>-0.056798</td>      <td>0.182274</td>      <td>0.125475</td>      <td>-0.021006</td>      <td>-0.016422</td>      <td>-0.033148</td>      <td>0.179841</td>      <td>0.026306</td>      <td>-0.127317</td>      <td>-0.163619</td>    </tr>    <tr>      <th>1</th>      <td>-0.061691</td>      <td>0.000470</td>      <td>0.000546</td>      <td>0.000407</td>      <td>0.000115</td>      <td>-6.669241e-07</td>      <td>-0.000151</td>      <td>-0.353678</td>      <td>0.018719</td>      <td>-0.030565</td>      <td>-0.063293</td>      <td>0.105792</td>      <td>0.058392</td>      <td>0.164184</td>      <td>-0.175497</td>      <td>-0.044569</td>      <td>-0.253011</td>      <td>0.240469</td>      <td>0.017689</td>      <td>-0.159757</td>      <td>-0.078947</td>    </tr>    <tr>      <th>2</th>      
<td>-0.075481</td>      <td>0.000434</td>      <td>-0.000229</td>      <td>0.000297</td>      <td>0.000023</td>      <td>0.000000e+00</td>      <td>-0.000075</td>      <td>-0.323813</td>      <td>0.092076</td>      <td>-0.143590</td>      <td>-0.075241</td>      <td>0.005551</td>      <td>0.065393</td>      <td>0.070944</td>      <td>-0.032991</td>      <td>-0.144008</td>      <td>0.071054</td>      <td>-0.055390</td>      <td>0.134176</td>      <td>-0.051194</td>      <td>-0.000221</td>    </tr>    <tr>      <th>3</th>      <td>-0.130622</td>      <td>0.001060</td>      <td>-0.003256</td>      <td>0.000483</td>      <td>0.000370</td>      <td>0.000000e+00</td>      <td>-0.000067</td>      <td>-0.274047</td>      <td>0.164185</td>      <td>-0.206988</td>      <td>0.014520</td>      <td>0.078620</td>      <td>-0.122271</td>      <td>-0.043652</td>      <td>-0.082847</td>      <td>0.120812</td>      <td>0.180340</td>      <td>-0.121190</td>      <td>0.433346</td>      <td>-0.212897</td>      <td>0.001096</td>    </tr>    <tr>      <th>4</th>      <td>-0.073390</td>      <td>0.000639</td>      <td>-0.000244</td>      <td>0.002119</td>      <td>0.000060</td>      <td>0.000000e+00</td>      <td>-0.000014</td>      <td>-0.508164</td>      <td>0.152837</td>      <td>-0.303871</td>      <td>0.017631</td>      <td>0.005860</td>      <td>0.065243</td>      <td>0.071102</td>      <td>-0.037210</td>      <td>-0.435003</td>      <td>0.116637</td>      <td>0.077635</td>      <td>0.117680</td>      <td>-0.090842</td>      <td>-0.008462</td>    </tr>  </tbody></table>


In [10]:

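coefs = []
for i in range(100):
    # Bootstrap sample of 100 rows, features and target drawn together
    sample = df.sample(n=100, random_state=i, replace=True)
    X_sample = sample.drop('fetal_health', axis=1)
    y_sample = sample['fetal_health']
    model = LogisticRegression(random_state=42)
    model.fit(X_sample, y_sample)
    # model.coef_ is 2-D; keep row 0 (the first class when the target is multiclass)
    coefs.append(model.coef_[0])
# One row of coefficients per bootstrap model, labeled with the feature names
coef_df = pd.DataFrame(coefs, columns=X.columns)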




### ANSWER CHECK
coef_df.head()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_width,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency
0,-0.147564,0.000393,-0.000294,0.000569,0.000356,0.0,-0.000129,-0.323361,0.100511,-0.206871,...,-0.056691,0.179133,0.122442,-0.018738,-0.016571,-0.030549,0.178046,0.026797,-0.125219,-0.161631
1,-0.061691,0.00047,0.000546,0.000407,0.000115,-6.66928e-07,-0.000151,-0.35368,0.018719,-0.030565,...,0.105792,0.058393,0.164185,-0.175498,-0.04457,-0.253013,0.240471,0.017689,-0.159758,-0.078947
2,-0.075482,0.000434,-0.000229,0.000297,2.3e-05,0.0,-7.5e-05,-0.323815,0.092076,-0.143591,...,0.005551,0.065393,0.070944,-0.032991,-0.144009,0.071055,-0.055391,0.134177,-0.051195,-0.000221
3,-0.130622,0.00106,-0.003256,0.000483,0.00037,0.0,-6.7e-05,-0.274043,0.164183,-0.206985,...,0.078619,-0.122269,-0.043651,-0.082845,0.12081,0.180337,-0.121187,0.433341,-0.212894,0.001096
4,-0.072985,0.000661,-0.000261,0.002213,6.1e-05,0.0,-1.5e-05,-0.514371,0.160097,-0.310162,...,0.00483,0.065848,0.070678,-0.031314,-0.449564,0.117808,0.077018,0.122154,-0.091711,-0.008422
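
One way to read a table of bootstrapped coefficients is to summarize how much each coefficient moves across models. The sketch below is a minimal illustration on synthetic binary data from `make_classification`, not the fetal health data; the names `demo`, `coef_demo`, `intervals`, and the columns `f0` to `f3` are hypothetical. It computes a rough 95% percentile interval for each coefficient.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the real table (an assumption).
features = ["f0", "f1", "f2", "f3"]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo = pd.DataFrame(X_demo, columns=features)
demo["target"] = y_demo

coefs = []
for i in range(100):
    # Bootstrap sample: 100 rows drawn with replacement
    sample = demo.sample(n=100, replace=True, random_state=i)
    model = LogisticRegression(max_iter=1000)
    model.fit(sample[features], sample["target"])
    coefs.append(model.coef_[0])

coef_demo = pd.DataFrame(coefs, columns=features)

# The 2.5th and 97.5th percentiles bound a rough 95% interval per coefficient
intervals = coef_demo.quantile([0.025, 0.975]).T
intervals.columns = ["lower", "upper"]
intervals["mean"] = coef_demo.mean()
print(intervals)
```

A coefficient whose interval straddles zero is one whose sign is not stable across bootstrap samples; applying the same summary to `coef_df` is left as an optional exercise.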


[Back to top](#-Index)

### Problem 3

#### Correlation of Models



All of the models are built on the same features of the data. Do you think these models are highly correlated? Assign your answer, "yes" or "no", as a string to `ans3` below.


In [11]:

ans3 = "yes"



### ANSWER CHECK
print(ans3)

yes
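
The answer can also be checked empirically. The sketch below is one possible way to do so, using synthetic data rather than the fetal health table: it fits 20 logistic regression models on bootstrap samples of the same training set, predicts probabilities on a shared holdout set, and measures the pairwise correlation between the models' predictions. The split sizes, the number of models, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption, not the course dataset).
X_all, y_all = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, y_train = X_all[:700], y_all[:700]
X_holdout = X_all[700:]

rng = np.random.default_rng(0)
probs = []
for _ in range(20):
    # Bootstrap: draw row indices with replacement from the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    # Predicted probability of class 1 on the shared holdout rows
    probs.append(model.predict_proba(X_holdout)[:, 1])

# 20 x 20 correlation matrix, one row and column per model
corr = np.corrcoef(np.array(probs))
off_diag = corr[~np.eye(len(probs), dtype=bool)]
print(f"mean pairwise correlation: {off_diag.mean():.3f}")
```

Because every model sees all of the features and largely overlapping rows, their predictions tend to move together. Reducing that correlation is the motivation for the random feature selection used by random forests.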
