<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lesson 6.02 - Bootstrapping and Bagging

## Importing libraries


We'll need the following libraries for today's lecture:
- `pandas`
- `numpy`
- `DecisionTreeClassifier` from `sklearn`'s `tree` module
- `BaggingClassifier` from `sklearn`'s `ensemble` module
- `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module
- `accuracy_score` from `sklearn`'s `metrics` module

In [99]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import tree

## Load the Data

We'll be using the `Heart.csv` from the [ISLR Website](https://www-bcf.usc.edu/~gareth/ISL/). There's a copy in this repo under `./datasets/Heart.csv`.

In [6]:
# Read in the Heart .csv data.
df = pd.read_csv('./datasets/Heart.csv')

# Drop the `Unnamed: 0` column.
df.drop(columns=['Unnamed: 0'], inplace=True)

In [7]:
# Check the first few rows to make sure we dropped the column properly.
df.head()

,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No


## Data cleaning: Drop rows with null values

In [8]:
# Check the shape of the data.
df.shape

(303, 14)

In [9]:
# How much missing data do we have?
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
Age          303 non-null int64
Sex          303 non-null int64
ChestPain    303 non-null object
RestBP       303 non-null int64
Chol         303 non-null int64
Fbs          303 non-null int64
RestECG      303 non-null int64
MaxHR        303 non-null int64
ExAng        303 non-null int64
Oldpeak      303 non-null float64
Slope        303 non-null int64
Ca           299 non-null float64
Thal         301 non-null object
AHD          303 non-null object
dtypes: float64(2), int64(9), object(3)
memory usage: 33.2+ KB


Minimal nulls in Ca, Thal.

In [11]:
# see those nulls
df[df['Ca'].isnull()|df['Thal'].isnull()]

,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
87,53,0,nonanginal,128,216,0,2,115,0,0.0,1,0.0,,No
166,52,1,nonanginal,138,223,0,0,169,0,0.0,1,,normal,No
192,43,1,asymptomatic,132,247,1,2,143,1,0.1,2,,reversable,Yes
266,52,1,asymptomatic,128,204,1,0,156,1,1.0,2,0.0,,Yes
287,58,1,nontypical,125,220,0,0,144,0,0.4,2,,reversable,No
302,38,1,nonanginal,138,175,0,0,173,0,0.0,1,,normal,No


In [13]:
# Drop NAs.
df.dropna(inplace=True)

In [16]:
# Confirm all missing data is dropped.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 0 to 301
Data columns (total 14 columns):
Age          297 non-null int64
Sex          297 non-null int64
ChestPain    297 non-null object
RestBP       297 non-null int64
Chol         297 non-null int64
Fbs          297 non-null int64
RestECG      297 non-null int64
MaxHR        297 non-null int64
ExAng        297 non-null int64
Oldpeak      297 non-null float64
Slope        297 non-null int64
Ca           297 non-null float64
Thal         297 non-null object
AHD          297 non-null object
dtypes: float64(2), int64(9), object(3)
memory usage: 34.8+ KB


All nulls dropped.

In [17]:
# What's the shape of our data now?
df.shape

(297, 14)

## Feature Engineering

In [25]:
# Create dummies for the `ChestPain`, `Thal`, and `AHD` columns.
# Be sure to set `drop_first=True`.
df = pd.get_dummies(df,columns=['ChestPain','Thal','AHD'], drop_first=True)

# Confirm we did this correctly.
df.head()

,Age,Sex,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,ChestPain_nonanginal,ChestPain_nontypical,ChestPain_typical,Thal_normal,Thal_reversable,AHD_Yes
0,63,1,145,233,1,2,150,0,2.3,3,0.0,0,0,1,0,0,0
1,67,1,160,286,0,2,108,1,1.5,2,3.0,0,0,0,1,0,1
2,67,1,120,229,0,2,129,1,2.6,2,2.0,0,0,0,0,1,1
3,37,1,130,250,0,0,187,0,3.5,3,0.0,1,0,0,1,0,0
4,41,0,130,204,0,2,172,0,1.4,1,0.0,0,1,0,1,0,0
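To see what `drop_first=True` does, here is a minimal sketch on a toy DataFrame. The toy values are made up for illustration and are not rows from the Heart data. With three levels in a column, `get_dummies` would normally create three indicator columns; dropping the first one removes the redundant column, since a row with zeros in the remaining columns must belong to the dropped level.

```python
import pandas as pd

# Toy frame (hypothetical values) with one three-level categorical column.
toy = pd.DataFrame({"Thal": ["fixed", "normal", "reversable", "normal"]})

# Without drop_first, every level gets its own indicator column.
full = pd.get_dummies(toy, columns=["Thal"])

# With drop_first=True, the first level ("fixed") becomes the implicit baseline.
reduced = pd.get_dummies(toy, columns=["Thal"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

This matches the output above, where `Thal_fixed`, `ChestPain_asymptomatic`, and `AHD_No` do not appear as columns.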


## Model Prep: Create `X` and `y` variables

Our target column will be `AHD_Yes`: 
- 1 means the patient has heart disease
- 0 means they aren't diagnosed with heart disease

In [26]:

X = df.drop('AHD_Yes', axis='columns')

y = df['AHD_Yes']

In [37]:
# What is the accuracy of our baseline model?
# y.mean() gives 0.461279 (the share of class 1), which is equivalent to...
y.value_counts(normalize=True)

0    0.538721
1    0.461279
Name: AHD_Yes, dtype: float64

<details><summary>What does a false positive mean in this case?</summary>
    
- A false positive is someone **falsely** predicted as being in the **positive** class.
- This is someone we incorrectly think has heart disease.
- Incorrectly predicting someone to have heart disease is bad... but it _might_ be worse to incorrectly predict that someone is healthy!
</details>
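As a rough illustration of how false positives and false negatives could be counted, the sketch below uses `confusion_matrix` from `sklearn.metrics` on a small set of made-up labels. The arrays `y_true` and `y_pred` are hypothetical and are not predictions from this lesson's model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"False positives (healthy, predicted sick): {fp}")
print(f"False negatives (sick, predicted healthy): {fn}")
```

The same call would work on `y_test` and a fitted model's predictions once we have them.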

## Model Prep: Train/Test Split

In [28]:
# Split data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

## Model Instantiation

In [92]:
# Instantiate `DecisionTreeClassifier` object.
t = DecisionTreeClassifier(random_state=42)

## Model Evaluation

In [93]:
# Get a cross_val_score for our tree.
cross_val_score(t, X_train, y_train, cv=5).mean()

0.7296969696969697

In [101]:
# Fit and score on the training data.
model = t.fit(X_train, y_train)
t.score(X_train, y_train)

1.0

In [95]:
# Score on the testing data.
t.score(X_test, y_test)

0.72

In [96]:
# view the probabilities
t.predict_proba(X_test)[:10]

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

In [97]:
# see which features are most important
pd.DataFrame(t.feature_importances_,X_train.columns)

,0
Age,0.138139
Sex,0.023288
RestBP,0.106381
Chol,0.089778
Fbs,0.0
RestECG,0.013603
MaxHR,0.037786
ExAng,0.013301
Oldpeak,0.094052
Slope,0.0


In [102]:
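# Plot the tree. (`plot_tree` requires scikit-learn 0.21 or later; the error below comes from an older version.)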
tree.plot_tree(model)

AttributeError: module 'sklearn.tree' has no attribute 'plot_tree'

<details><summary> Where do decision trees tend to fall on the Bias/Variance spectrum?</summary>
    
- Decision trees very easily overfit.
- They tend to suffer from **high error due to variance**.
</details>
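One way to see this overfitting is to compare a fully grown tree with a depth-limited one. The sketch below uses synthetic data from `make_classification` rather than the Heart data, so the dataset and the names `X_demo`, `deep`, and `shallow` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (an assumption, not the Heart data).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          random_state=0, stratify=y_demo)

# A fully grown tree versus a tree capped at depth 3.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

deep_train = deep.score(X_tr, y_tr)
shallow_train = shallow.score(X_tr, y_tr)

print("deep    train/test:", deep_train, deep.score(X_te, y_te))
print("shallow train/test:", shallow_train, shallow.score(X_te, y_te))
```

The fully grown tree memorizes its training set, which mirrors the 1.0 training score and much lower test score we saw above.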

## Introduction to Ensemble Methods
We can list out the different types of models we've built thus far:
- Linear Regression
- Logistic Regression
- $k$-Nearest Neighbors
- Naive Bayes Classification _(maybe)_

If we want to use any of these models, we follow the same type of process.
1. Based on our problem, we identify which model to use. (Is our problem classification or regression? Do we want an interpretable model?)
2. Fit the model using the training data.
3. Use the fit model to generate predictions.
4. Evaluate our model's performance and, if necessary, return to step 2 and make changes.

So far, we've always had exactly one model. Today, however, we're going to talk about **ensemble methods**. Mentally, you should think about this as if we build multiple models and then aggregate their results in some way.

## Why would we build an "ensemble model?"

Our goal is to estimate $f$, the true function. (Think about $f$ as the **true process** that dictates Ames housing prices.)

We can come up with different hypotheses $h_1$, $h_2$, and so on to get as close to $f$ as possible. (Think about $h_1$ as the model you built to predict $f$, think of $h_2$ as the model your neighbor built to predict $f$, and so on.)

![](./assets/Ensemble.png)


### (Advanced) Three Benefits: Statistical, Computational, Representational
- The **statistical** benefit to ensemble methods: By building one model, our predictions are almost certainly going to be wrong. Predictions from one model might overestimate housing prices; predictions from another model might underestimate housing prices. By "averaging" predictions from multiple models, we'll see that we can often cancel our errors out and get closer to the true function $f$.
- The **computational** benefit to ensemble methods: It might be impossible to develop one model that globally optimizes our objective function. (Remember that CARTs reach locally optimal solutions that aren't guaranteed to be globally optimal.) In these cases, it may be **impossible** for one CART to arrive at the true function $f$. However, generating many different models and averaging their predictions may allow us to get results that are closer to the global optimum than any individual model.
- The **representational** benefit to ensemble methods: Even if we had all the data and all the computer power in the world, it might be impossible for one model to **exactly** equal $f$. For example, a linear regression model can never model a relationship where a one-unit change in $X$ is associated with some *different* change in $Y$ based on the value of $X$. All models have some shortcomings. (See [the no free lunch theorems](https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization).) While individual models have shortcomings, by creating multiple models and aggregating their predictions, we can actually create predictions that represent something that one model cannot ever represent.

We can summarize this as the **wisdom of the crowd**.
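The statistical benefit can be sketched with a small simulation. It assumes that each guess equals the true value plus independent, unbiased noise, which is an idealization: the errors of real models are usually correlated, so the gain from averaging is smaller in practice. The true value and noise scale below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1000  # hypothetical true quantity we are trying to guess

# Each guess is the truth plus independent, zero-mean noise (an assumption).
guesses = true_value + rng.normal(loc=0, scale=200, size=1000)

# How far off is a typical individual, and how far off is the crowd's average?
individual_errors = np.abs(guesses - true_value)
crowd_error = abs(guesses.mean() - true_value)

print("Mean individual error:", individual_errors.mean())
print("Error of the averaged guess:", crowd_error)
```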

## Wisdom of the Crowd: Guess the weight of Penelope

![](./assets/penelope.jpg)

[Image source: https://www.npr.org](https://www.npr.org/sections/money/2015/07/17/422881071/how-much-does-this-cow-weigh)

## Ensemble predictions

Let's mimic the "wisdom of the crowd" by creating several decision trees and averaging their predictions on the test set.

In [49]:
# What is this line doing?
predictions = pd.DataFrame(index=X_test.index)
# We create an empty DataFrame that shares the test set's index, then populate it with one new column of predictions per tree.

In [51]:
predictions.head()

113
195
64
27
245


In [55]:
# Generate ten decision trees.
for i in range(10):
    # Instantiate decision trees.
    t = DecisionTreeClassifier()
    # Fit to our training data.
    t.fit(X_train, y_train)
    
    # Put predictions in dataframe.
    predictions[f'Tree {i}'] = t.predict(X_test)

predictions.tail()

,Tree 0,Tree 1,Tree 2,Tree 3,Tree 4,Tree 5,Tree 6,Tree 7,Tree 8,Tree 9
93,1,1,1,1,1,1,1,1,1,1
133,0,0,0,0,0,0,0,0,0,0
33,0,0,0,0,0,0,0,0,0,0
20,1,1,0,1,1,0,1,0,1,0
76,1,1,1,1,1,1,1,1,1,1


In [56]:
# Generate aggregated predicted probabilities.
probs = predictions.mean(axis='columns')  #for each row, get the mean across all columns

In [58]:
# Check out probs.
print([x for x in probs])

[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.8, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.8, 0.5, 0.4, 1.0, 0.0, 0.6, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.3, 0.0, 0.0, 1.0, 0.0, 0.0, 0.6, 1.0]


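Notice that some probabilities are neither 0 nor 1. These are the 'contentious' observations, where the ten trees disagree with one another. Every tree saw exactly the same training data, so the disagreement comes only from random tie-breaking between equally good splits, which suggests the model has not learned these observations reliably. They are worth a closer look when trying to improve the model. Below, we simply threshold the averaged votes at 0.5 to get a final class prediction.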

In [59]:
accuracy_score(y_test, (probs > .5).astype(int))

0.6666666666666666

## Ensemble evaluation?

<details><summary>Why didn't our score improve?</summary>

- Because we fit the same model on the same data ten times!
</details>
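A minimal sketch of this point, using synthetic data as a stand-in for the Heart data: when every tree is fit on the same rows with the same `random_state`, all of the trees make identical predictions, so averaging them adds nothing. (In the loop above no `random_state` was set, so the trees differed only through random tie-breaking between equally good splits.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=1)

preds = []
for _ in range(5):
    # Same data, same seed: each tree is an exact copy of the last.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_demo[:150], y_demo[:150])
    preds.append(clf.predict(X_demo[150:]))

preds = np.array(preds)
all_identical = bool((preds == preds[0]).all())
print("All five trees agree on every observation:", all_identical)
```

To get trees that genuinely differ, each one needs to see different training data.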

## Bootstrapping

Let's get started actually making ensemble predictions. However, in order to do that, we'll need to introduce the idea of bootstrapping, or **random sampling with replacement.**

### Summary
When bootstrapping in order to fit multiple estimators, we want to:
- Take a sample of size $n$.
- With replacement.
- From our original data.

<details><summary>Why do you think we want to take a sample of size n?</summary>
    
- Because we want our estimators to be fit on data of the same size!
- If our original data had $n = 1,000$ and we bootstrapped a sample of size 50, an estimator fit on those 50 observations would have seen far less of the data and would probably look very, very different from an estimator fit on 1,000 observations.
</details>

<details><summary>Why do you think we want to sample with replacement?</summary>
    
- If we didn't sample with replacement, we'd just get identical samples of size $n$. (These would be copies of our original data!)
</details>
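The difference between sampling with and without replacement can be checked directly. The sketch below uses NumPy on a stand-in array of row labels (the array and variable names are illustrative). Sampling $n$ rows without replacement only reshuffles the data, while a bootstrap sample repeats some rows and omits others. For large $n$, the chance that a given row is left out is $(1 - 1/n)^n \approx 1/e$, so roughly 63% of the original rows appear in each bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
data = np.arange(n)  # stand-in for the row labels of a training set

# Without replacement: every row appears exactly once (just a shuffle).
without = rng.choice(data, size=n, replace=False)
same_rows = np.array_equal(np.sort(without), data)

# With replacement: some rows repeat, others never appear.
with_repl = rng.choice(data, size=n, replace=True)
unique_frac = np.unique(with_repl).size / n

print("Sampling without replacement returns the same rows:", same_rows)
print("Fraction of distinct rows in the bootstrap sample:", unique_frac)
```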

Bootstrapping with `pandas`:

In [61]:
X_train.sample(n=X_train.shape[0], #we want same no. of rows as original data
               replace=True).head()

,Age,Sex,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,ChestPain_nonanginal,ChestPain_nontypical,ChestPain_typical,Thal_normal,Thal_reversable
59,51,1,125,213,0,2,125,1,1.4,1,1.0,0,0,1,1,0
234,54,0,160,201,0,0,163,0,0.0,1,1.0,1,0,0,1,0
50,41,0,105,198,0,0,168,0,0.0,1,1.0,0,1,0,1,0
117,35,0,138,183,0,0,182,0,1.4,1,0.0,0,0,0,1,0
10,57,1,140,192,0,0,148,0,0.4,2,0.0,0,0,0,0,0


## Bagging: Bootstrap Aggregating

As we have seen, decision trees are very powerful machine learning models. However, decision trees have some limitations. In particular, trees that are grown very deep tend to learn highly irregular patterns (a.k.a. they overfit their training sets). 

Bagging (bootstrap aggregating) mitigates this problem by exposing different trees to different sub-samples of the training set.

The process for creating bagged decision trees is as follows:
1. From the original data of size $n$, bootstrap $k$ samples each of size $n$ (with replacement!).
2. Build a decision tree on each bootstrapped sample.
3. Make predictions by passing a test observation through all $k$ trees and developing one aggregate prediction for that observation.

![](./assets/Ensemble.png)

### What do you mean by "aggregate prediction?"
As with all of our modeling techniques, we want to make sure that we can come up with one final prediction for our observation. (Building 1,000 trees and coming up with 1,000 predictions for one observation probably wouldn't be very helpful.)

Suppose we want to predict whether or not a Reddit post is going to go viral, where `1` indicates viral and `0` indicates non-viral. We build 100 decision trees. Given a new Reddit post labeled `X_test`, we pass these features into all 100 decision trees.
- 70 of the trees predict that the post in `X_test` will go viral.
- 30 of the trees predict that the post in `X_test` will not go viral.

**What would you expect `.predict(X_test)` and `.predict_proba(X_test)` to do?**
<details><summary>What might you expect .predict(X_test) to do?</summary>

- `.predict(X_test)` should output a 1, predicting that the post will go viral.

</details>

<details><summary>What might you expect .predict_proba(X_test) to do?</summary>

- `.predict_proba(X_test)` should output 0.7, indicating the probability of the post going viral is 70%.
</details>
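The Reddit example can be written out as a short sketch. The `votes` array below is hypothetical: it stands for the 100 individual tree predictions for a single post. Note that `sklearn`'s `BaggingClassifier` averages the base estimators' predicted probabilities when they are available, rather than counting hard votes; for fully grown trees whose leaves are pure, the two approaches give the same answer.

```python
import numpy as np

# Hypothetical votes from 100 trees for one post: 70 say viral, 30 say not.
votes = np.array([1] * 70 + [0] * 30)

# The aggregated "probability" is the share of trees voting for class 1.
proba_viral = votes.mean()

# The aggregated class prediction is the majority vote.
prediction = int(proba_viral >= 0.5)

print("Aggregated probability of viral:", proba_viral)
print("Aggregated prediction:", prediction)
```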


## Bagging Classifier using a `for` loop

In the cell below, we'll create an ensemble of trees like before, except this time we'll train each tree to a **bootstrapped** sample of the training data.

In [71]:
# Instantiate dataframe.
predictions = pd.DataFrame(index=X_test.index)

# Generate ten decision trees.
for i in range(10):
    
    # Bootstrap X data.
    # Should we add a random seed? No. We want a different sample for each tree.
    X_sample = X_train.sample(n=X_train.shape[0],replace=True)
    
    # Get y data that matches the X data (label-based lookup, repeats included).
    y_sample = y_train.loc[X_sample.index]
    
    # Instantiate decision trees.
    t = DecisionTreeClassifier()
    
    # Fit to this current sample data.
    t.fit(X_sample, y_sample)
    
    # Put predictions in dataframe.
    predictions[f'Tree {i}'] = t.predict(X_test)

predictions.head(20)

,Tree 0,Tree 1,Tree 2,Tree 3,Tree 4,Tree 5,Tree 6,Tree 7,Tree 8,Tree 9
113,1,1,1,1,1,1,0,1,1,1
195,1,1,1,1,1,1,0,1,1,0
64,1,1,1,0,1,1,1,1,1,1
27,0,0,1,1,0,1,1,1,1,0
245,1,0,1,1,0,0,0,0,0,0
210,0,0,0,0,0,0,0,0,0,0
221,1,1,0,1,1,0,1,1,1,0
1,1,1,1,1,1,1,1,1,1,0
116,0,0,0,1,0,0,0,0,0,0
157,1,1,1,0,1,1,1,1,1,1


In [72]:
# Generate aggregated predicted probabilities.
predictions.mean(axis='columns')   #for each row, get the mean across all columns

113    0.9
195    0.8
64     0.9
27     0.6
245    0.3
210    0.0
221    0.7
1      0.9
116    0.1
157    0.9
194    0.2
190    0.0
298    0.6
103    0.4
278    0.5
5      0.2
102    0.6
289    0.3
187    0.3
21     0.5
90     0.4
66     0.5
104    0.6
162    0.1
297    0.4
232    0.2
56     0.7
252    0.9
155    0.6
34     0.0
      ... 
265    0.4
135    0.0
169    0.0
236    0.9
163    0.1
214    0.6
23     0.8
172    0.2
272    1.0
300    0.9
275    0.3
255    0.0
198    0.0
242    0.0
270    1.0
280    1.0
137    0.8
281    0.0
3      0.9
132    0.0
183    0.7
62     1.0
168    0.8
70     0.0
295    0.0
93     0.2
133    0.0
33     0.5
20     0.5
76     1.0
Length: 75, dtype: float64

In [74]:
# Generate aggregated predicted probabilities as probs.
probs = predictions.mean(axis='columns')

In [75]:
accuracy_score(y_test, (probs >= .5).astype(int))

0.7866666666666666

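This 'multiple bootstrapped datasets, multiple trees' approach scored higher here (about 0.79) than both the earlier 'single dataset, single tree' approach (0.72) and the 'single dataset, multiple trees' approach (about 0.67). Because the bootstrap samples are unseeded, the exact score will vary from run to run.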

## Bagging Classifier using `sklearn`

[BaggingClassifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)

In the cell below, create an instance of `BaggingClassifier` and score it on the test set. You should get a similar score to the one in the previous step.

In [78]:
# Instantiate BaggingClassifier.
# Note: in scikit-learn 1.2 and later, `base_estimator` is named `estimator`.
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=42)

# Fit BaggingClassifier.
bag.fit(X_train, y_train)

# Score BaggingClassifier.
print(bag.score(X_train, y_train))
print(bag.score(X_test, y_test))

0.9774774774774775
0.7733333333333333
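A single train/test split gives a fairly noisy comparison, so one option is to compare a single tree with a bagged ensemble using cross-validation. The sketch below is only an illustration: it uses synthetic data from `make_classification` rather than the Heart data, and `n_estimators=50` is an arbitrary choice. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (an assumption, not the Heart data).
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=5, flip_y=0.1,
                                     random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)

# First positional argument is the base estimator in old and new versions.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0)

# Mean 5-fold cross-validated accuracy for each model.
tree_cv = cross_val_score(single_tree, X_demo, y_demo, cv=5).mean()
bag_cv = cross_val_score(bagged, X_demo, y_demo, cv=5).mean()

print("Single tree CV accuracy: ", tree_cv)
print("Bagged trees CV accuracy:", bag_cv)
```

The same two `cross_val_score` calls could be run on `X_train` and `y_train` from this lesson to compare the models on the Heart data.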
