<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Ensemble Methods - Decision Trees and Bagging


---

## LEARNING OBJECTIVES


### Core

- Explain the power of using ensemble classifiers
- Know the difference between a base classifier and an ensemble classifier
- Describe how bagging creates different training sets with bootstrapping and fits a base classifier to each of them
- Use the bagging classifier in sklearn


### Target

- Describe
    - The statistical problem
    - The computational problem
    - The representational problem
- Describe how the bagging classifier makes predictions through a majority vote

### Stretch

- Describe how the bagging classifier can be stronger than any of its ensemble members
- Be aware that to make an improvement the individual classifiers have to beat the baseline and should not all make the same predictions

<h1>Lesson Guide<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#LEARNING-OBJECTIVES" data-toc-modified-id="LEARNING-OBJECTIVES-1">LEARNING OBJECTIVES</a></span><ul class="toc-item"><li><span><a href="#Core" data-toc-modified-id="Core-1.1">Core</a></span></li><li><span><a href="#Target" data-toc-modified-id="Target-1.2">Target</a></span></li><li><span><a href="#Stretch" data-toc-modified-id="Stretch-1.3">Stretch</a></span></li></ul></li><li><span><a href="#Introduction" data-toc-modified-id="Introduction-2">Introduction</a></span></li><li><span><a href="#Ensembles" data-toc-modified-id="Ensembles-3">Ensembles</a></span></li><li><span><a href="#The-Hypothesis-space" data-toc-modified-id="The-Hypothesis-space-4">The Hypothesis space</a></span><ul class="toc-item"><li><span><a href="#The-Statistical-Problem" data-toc-modified-id="The-Statistical-Problem-4.1">The Statistical Problem</a></span></li><li><span><a href="#The-Computational-Problem" data-toc-modified-id="The-Computational-Problem-4.2">The Computational Problem</a></span></li><li><span><a href="#The-Representational-Problem" data-toc-modified-id="The-Representational-Problem-4.3">The Representational Problem</a></span></li><li><span><a href="#Characteristics-of-Ensemble-methods" data-toc-modified-id="Characteristics-of-Ensemble-methods-4.4">Characteristics of Ensemble methods</a></span></li></ul></li><li><span><a href="#Bagging" data-toc-modified-id="Bagging-5">Bagging</a></span></li><li><span><a href="#Bagging-Classifier-in-Scikit-Learn" data-toc-modified-id="Bagging-Classifier-in-Scikit-Learn-6">Bagging Classifier in Scikit Learn</a></span><ul class="toc-item"><li><span><a href="#1.-Import-the-car-evaluation-data" data-toc-modified-id="1.-Import-the-car-evaluation-data-6.1">1. Import the car evaluation data</a></span></li><li><span><a href="#2.-Encode-the-features-properly" data-toc-modified-id="2.-Encode-the-features-properly-6.2">2. 
Encode the features properly</a></span></li><li><span><a href="#3.-Create-a-train-test-split-and-cross-validate-a-KNN-classifier" data-toc-modified-id="3.-Create-a-train-test-split-and-cross-validate-a-KNN-classifier-6.3">3. Create a train-test split and cross-validate a KNN classifier</a></span></li><li><span><a href="#4.-Research-and-describe-the-max_samples-and-max_features-hyperparameters-of-the-bagging-classifier" data-toc-modified-id="4.-Research-and-describe-the-max_samples-and-max_features-hyperparameters-of-the-bagging-classifier-6.4">4. Research and describe the <code>max_samples</code> and <code>max_features</code> hyperparameters of the bagging classifier</a></span></li><li><span><a href="#5.-Fit-a-BaggingClassifier-with-a-KNN-base-estimator" data-toc-modified-id="5.-Fit-a-BaggingClassifier-with-a-KNN-base-estimator-6.5">5. Fit a <code>BaggingClassifier</code> with a KNN base estimator</a></span></li><li><span><a href="#6.-Cross-validate-a-decision-tree-classifier" data-toc-modified-id="6.-Cross-validate-a-decision-tree-classifier-6.6">6. Cross-validate a decision tree classifier</a></span></li><li><span><a href="#7.-Fit-a-BaggingClassifier-with-a-decision-tree-base-estimator" data-toc-modified-id="7.-Fit-a-BaggingClassifier-with-a-decision-tree-base-estimator-6.7">7. Fit a <code>BaggingClassifier</code> with a decision tree base estimator</a></span></li><li><span><a href="#8.--Of-the-Hypothesis-Space-problems-we-discussed-earlier,-which-ones-are-solved-by-bagging?" data-toc-modified-id="8.--Of-the-Hypothesis-Space-problems-we-discussed-earlier,-which-ones-are-solved-by-bagging?-6.8">8.  Of the Hypothesis Space problems we discussed earlier, which ones are solved by bagging?</a></span><ul class="toc-item"><li><span><a href="#--Statistical?" data-toc-modified-id="--Statistical?-6.8.1">- Statistical?</a></span></li><li><span><a href="#--Computational?" 
data-toc-modified-id="--Computational?-6.8.2">- Computational?</a></span></li><li><span><a href="#--Representational?" data-toc-modified-id="--Representational?-6.8.3">- Representational?</a></span></li></ul></li><li><span><a href="#Bonus:-Tune-the-bagging-classifiers-with-grid-search" data-toc-modified-id="Bonus:-Tune-the-bagging-classifiers-with-grid-search-6.9">Bonus: Tune the bagging classifiers with grid search</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-7">Conclusion</a></span></li><li><span><a href="#ADDITIONAL-RESOURCES" data-toc-modified-id="ADDITIONAL-RESOURCES-8">ADDITIONAL RESOURCES</a></span></li></ul></div>

## Introduction

In the previous lessons we learned about Decision Trees as powerful tools for model building. Today we will learn about ensemble techniques: ways to combine different models in order to obtain a more powerful model.

Before we dive into the subject, let's recap a few things learned so far:

* What classifiers have we learned about so far? Which one is your favourite?

* How did we assess the "goodness" of a particular model?

* How could we improve the performance of a model?

## Ensembles
Ensemble techniques are supervised learning methods to improve predictive accuracy by combining several base models in order to enlarge the space of possible hypotheses to represent our data. Ensembles are often much more accurate than the base classifiers that compose them.

Two families of ensemble methods are usually distinguished:

- In **averaging methods**, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimators because its variance is reduced.

Examples of this family include **Bagging methods** and **Random Forests**.

- The other family of ensemble methods is **boosting methods**, where base estimators are built sequentially and each one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. We will discuss these in a future lesson.

Examples of this family include **AdaBoost** and **Gradient Tree Boosting**.
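To see why averaging reduces variance, here is a minimal simulation. It assumes (unrealistically) that the base estimators make independent errors with equal variance; real base models are correlated, so the reduction is smaller in practice. The names `single` and `averaged` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 repeated "predictions" of a true value of 1.0 by one noisy estimator
single = rng.normal(loc=1.0, scale=1.0, size=10_000)

# The same experiment, but each prediction is the average of 25 independent estimators
averaged = rng.normal(loc=1.0, scale=1.0, size=(10_000, 25)).mean(axis=1)

# Under the independence assumption the variance shrinks by roughly a factor of 25
print("variance of a single estimator:", single.var())
print("variance of the 25-model average:", averaged.var())
```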

![Ensemble](./assets/images/Ensemble.png)


- When might this be useful?
- How will forming ensembles affect model interpretability?
- Can you think of a business case where we may want to get a very, very accurate model (despite it being more complex)?

## The Hypothesis space

In any supervised learning task, our goal is to approximate the true classification function $f$ by learning a classifier $h$ (our estimate of $f$). In other words, we search a certain hypothesis space for the function that best describes the relationship between our features and the target.

* Can you give an example of how this search is performed using one of the classifiers you know?


* What reasons could be preventing our hypothesis from reaching a perfect score?

A base classifier may fail to approximate the true classification function well for reasons of three kinds:

- statistical
- computational
- representational

### The Statistical Problem
If the amount of training data available is small, the base classifier will have difficulty converging to $f$.

An ensemble classifier can mitigate this problem by "averaging out" base classifier predictions to improve convergence. This can be pictorially represented as a search in a space where multiple partial perspectives are combined to obtain a better picture of the goal.

![Statistical Problem](./assets/images/statistical.png)

(source: [T. Dietterich: Ensemble Methods in Machine Learning](http://www.cs.iastate.edu/~jtian/cs573/Papers/Dietterich-ensemble-00.pdf))

The true function $f$ is best approximated as an average of the base classifiers.

### The Computational Problem
Even with sufficient training data, it may still be computationally difficult to find the best classifier $h$.

For example, if our base classifier is a decision tree, an exhaustive search of the hypothesis space of all possible classifiers is extremely complex (NP-complete).

In fact this is why we used a heuristic algorithm (greedy search).

An ensemble composed of several _Base Classifiers_ with different starting points can provide a better approximation to $f$ than any individual _Base Classifier_.

![Computational Problem](./assets/images/computational.png)

The true function $f$ is often best approximated by using several starting points to explore the hypothesis space.

### The Representational Problem
Sometimes $f$ cannot be expressed by any hypothesis in our hypothesis space. To illustrate this, suppose we use a decision tree as our base classifier. A decision tree works by forming a rectilinear partition of the feature space, i.e. it always cuts at a fixed value along a single feature.

![Decision Tree boundary](./assets/images/dtcut.png)

But what if $f$ is a diagonal line?

Then it cannot be represented by finitely many rectilinear segments, and therefore the true decision boundary cannot be obtained by a decision tree classifier.

However, it may still be possible to approximate $f$ or even to expand the space of representable functions using ensemble methods.

![Representational Problem](./assets/images/representational.png)
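As a rough illustration, the sketch below builds a synthetic problem whose true boundary is the diagonal $x_2 = x_1$ and fits both a shallow tree and a bagged ensemble of shallow trees. The data, the tree depth, and the number of estimators are assumptions chosen for the demo, and the scores will change with them.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Points in the unit square, labelled by which side of the diagonal they fall on
points = rng.uniform(0, 1, size=(2000, 2))
labels = (points[:, 1] > points[:, 0]).astype(int)
train, test = slice(0, 1000), slice(1000, 2000)

# A single shallow tree can only draw a coarse staircase along the diagonal
single_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
single_tree.fit(points[train], labels[train])

# Each bagged tree draws a slightly different staircase; the vote blends them
bagged_trees = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                                 n_estimators=100, random_state=0)
bagged_trees.fit(points[train], labels[train])

single_score = single_tree.score(points[test], labels[test])
bagged_score = bagged_trees.score(points[test], labels[test])
print("single tree:", single_score)
print("bagged trees:", bagged_score)
```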

### Characteristics of Ensemble methods

In order for an ensemble classifier to outperform a single base classifier, the following conditions must be met:

- **accuracy**: base classifiers outperform random guessing
- **diversity**: misclassifications must occur on different training examples

![Ensemble performance](./assets/images/ensemble_performance.png)
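The two conditions can be made concrete with a small calculation. The sketch below assumes a binary problem and base classifiers that err independently, each with the same accuracy `p`; the helper `majority_vote_accuracy` is a hypothetical name introduced here for illustration.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p: float) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, gives the right answer."""
    k_min = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


print(round(majority_vote_accuracy(3, 0.6), 3))  # 0.648
print(majority_vote_accuracy(25, 0.6))  # better than a single classifier
print(majority_vote_accuracy(25, 0.4))  # worse than a single classifier
```

When `p` is above 0.5 the vote helps, and when it is below 0.5 the vote makes things worse, which is why the base classifiers need to beat random guessing. If all classifiers made identical mistakes, the independence assumption would fail and the vote would gain nothing.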


* What base classifiers would you combine to have different perspectives?

## Bagging

- _Bagging_, or _bootstrap aggregating_, is a method that manipulates the training set by resampling.
- We learn $k$ base classifiers on $k$ different samples of training data.  
- These samples are created independently by resampling the training data with uniform weights (i.e. a uniform sampling distribution), so every observation is equally likely to be drawn. When predicting, each model in the ensemble votes with equal weight.
- In order to reduce model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. 
- As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.

|Original|1|2|3|4|5|6|7|8|
|----|----|----|----|----|----|----|----|----|
|Training set 1|2|7|8|3|7|6|3|1|
|Training set 2|7|8|5|6|4|2|7|1|
|Training set 3|3|6|2|7|5|6|2|2|
|Training set 4|4|5|1|4|6|4|3|8|

- Given a standard training set $D$ of size $n$, bagging generates $m$ new training sets $D_i$, each of size $n'$, by sampling from $D$ uniformly and with replacement. 
- By sampling with replacement, some observations may be repeated in each $D_i$. 
- The $m$ models are fitted using the above $m$ samples and combined by averaging the output (for regression) or voting (for classification).
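The resampling step can be sketched with NumPy as follows. The eight-item training set mirrors the table above; the seed and variable names are assumptions for the demo, so the drawn numbers will differ from the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# The original training set D, with n = 8 observations
original = np.arange(1, 9)

# m = 4 bootstrap training sets, each of size n, drawn uniformly with replacement
bootstrap_sets = [rng.choice(original, size=original.size, replace=True)
                  for _ in range(4)]
for i, sample in enumerate(bootstrap_sets, start=1):
    print(f"Training set {i}:", sample)

# For large n, a bootstrap sample contains roughly 1 - 1/e (about 63%)
# of the distinct original observations
n = 10_000
unique_fraction = np.unique(rng.integers(0, n, size=n)).size / n
print("fraction of distinct observations:", unique_fraction)
```

The observations left out of a given bootstrap sample are its out-of-bag observations.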

Bagging reduces the variance in our generalization error by aggregating multiple base classifiers together (provided they satisfy our earlier requirements).
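To make the aggregation step concrete, here is a hand-rolled bagging sketch with a hard majority vote on a binary problem. The synthetic data and all names (`X_demo`, `y_demo`, `trees`, `votes`) are assumptions for illustration; scikit-learn's own `BaggingClassifier` may differ in its details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for this demo
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n = len(X_demo)

# Fit one fully grown tree per bootstrap sample
trees = []
for _ in range(25):
    idx = rng.integers(0, n, size=n)  # sample row indices with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Collect each tree's predictions: one row per tree, one column per observation
votes = np.stack([tree.predict(X_demo) for tree in trees])

# With labels 0/1 and an odd number of trees, the majority vote is class 1
# whenever more than half of the trees predict 1
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (majority == y_demo).mean())
```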

If the base classifier is stable then the ensemble error is primarily due to bias, and bagging may not be effective.

Since each sample of training data is equally likely, bagging is not very susceptible to overfitting with noisy data.

As they provide a way to reduce overfitting, **bagging** methods work best with strong and complex models (e.g., **fully developed decision trees**), in contrast with **boosting** methods which usually work best with weak models (e.g., **shallow decision trees**).

* Can you propose another sample to add to those above? Call out the numbers you would include.

## Bagging Classifier in Scikit Learn 

In scikit-learn, bagging methods are offered as a unified `BaggingClassifier` meta-estimator (resp. `BaggingRegressor`), taking as input a user-specified base estimator along with parameters specifying the strategy to draw random subsets. In particular, `max_samples` and `max_features` control the size of the subsets (in terms of samples and features), while `bootstrap` and `bootstrap_features` control whether samples and features are drawn with or without replacement. When using a subset of the available samples the generalization error can be estimated with the out-of-bag samples by setting `oob_score=True`.
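As a hedged example of the out-of-bag estimate, the sketch below fits a `BaggingClassifier` with `oob_score=True` on synthetic data. The dataset and the hyperparameter values are assumptions made for the demo, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for this demo
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

bag_oob = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator, passed positionally
    n_estimators=50,
    max_samples=0.8,           # each tree sees a bootstrap sample of 80% of the rows
    bootstrap=True,            # rows are drawn with replacement
    oob_score=True,            # score each row using only trees that did not see it
    random_state=0,
)
bag_oob.fit(X_demo, y_demo)

# Accuracy estimated on out-of-bag rows, without a separate validation set
print("out-of-bag accuracy:", bag_oob.oob_score_)
```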

As an example, we will compare the performance of a simple KNN classifier versus the Bagging Classifier on the car acceptability dataset.

### 1. Import the car evaluation data

Use `acceptability` as the target variable.

In [1]:
import pandas as pd

df = pd.read_csv('../../../../resource-datasets/car_evaluation/car.csv')

In [2]:
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [3]:
df.shape

(1728, 7)

In [5]:
y = df.pop("acceptability")

In [6]:
y.head()

0    unacc
1    unacc
2    unacc
3    unacc
4    unacc
Name: acceptability, dtype: object

In [8]:
X = df

In [9]:
X.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,vhigh,vhigh,2,2,small,low
1,vhigh,vhigh,2,2,small,med
2,vhigh,vhigh,2,2,small,high
3,vhigh,vhigh,2,2,med,low
4,vhigh,vhigh,2,2,med,med


### 2. Encode the features properly

In [10]:
X = pd.get_dummies(X,drop_first=True)


In [66]:
X.shape

(1728, 15)
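To show what `drop_first=True` does, here is a small sketch on a made-up frame (the `toy` data is an assumption for illustration, not part of the car dataset). Each categorical column with $k$ levels becomes $k - 1$ indicator columns, with the first level in sorted order acting as the reference.

```python
import pandas as pd

# A tiny hypothetical frame with two categorical columns
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "low"],
    "doors": ["2", "3", "2", "5more"],
})

# One indicator column per level, minus the first level of each feature
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
# ['safety_low', 'safety_med', 'doors_3', 'doors_5more']
```

Dropping the first level avoids perfectly redundant columns: a row with zeros in all of a feature's indicators belongs to the dropped level.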

### 3. Create a train-test split and cross-validate a KNN classifier

In [13]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

In [15]:
knn = KNeighborsClassifier()

In [18]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [19]:
print(knn.score(X_train, y_train))
print(cross_val_score(knn, X_train, y_train, cv=5).mean())
print(knn.score(X_test, y_test))

0.9095513748191028
0.8545440276251766
0.8526011560693642


In [67]:
y_test.value_counts(normalize=True)

unacc    0.751445
acc      0.170520
good     0.040462
vgood    0.037572
Name: acceptability, dtype: float64
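The class proportions give the baseline accuracy: always predicting the most frequent class. A minimal sketch of that comparison follows, using made-up labels (an assumption for the demo) rather than the real test set.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 of 8 belong to the majority class
labels = pd.Series(["unacc"] * 6 + ["acc"] * 2)

# Baseline accuracy is the share of the most frequent class
baseline = labels.value_counts(normalize=True).max()
print(baseline)  # 0.75

# DummyClassifier reproduces the same number; the features are ignored
placeholder_features = np.zeros((len(labels), 1))
dummy = DummyClassifier(strategy="most_frequent").fit(placeholder_features, labels)
dummy_score = dummy.score(placeholder_features, labels)
print(dummy_score)  # 0.75
```

Any classifier, single or ensemble, should be judged against this number rather than against zero.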

### 4. Research and describe the `max_samples` and `max_features` hyperparameters of the bagging classifier

The `BaggingClassifier` meta-estimator has several parameters.

Look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) for a detailed description of each and find out what `max_samples` and `max_features` do.

In [None]:
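# max_samples: the number (int) or fraction (float) of training samples drawn to fit each base estimator.
# max_features: the number (int) or fraction (float) of features drawn to fit each base estimator.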

In [21]:
from sklearn.ensemble import BaggingClassifier

In [72]:
bag = BaggingClassifier(KNeighborsClassifier(),
                        max_samples=0.5, max_features=0.5,
                       n_estimators=100)

# Using only half of the features increases the variation between the models. This is especially important for decision trees.

In [73]:
bag.fit(X_train,y_train)

BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=0.5,
         max_samples=0.5, n_estimators=1000, n_jobs=None, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [74]:
print(bag.score(X_train, y_train))
print(cross_val_score(bag, X_train, y_train, cv=5).mean())
print(bag.score(X_test, y_test))

0.711287988422576
0.6953696436980066
0.7601156069364162


In [75]:
bag1 = BaggingClassifier(KNeighborsClassifier(),
                        max_samples=0.9, max_features=0.9,n_estimators=100)

In [76]:
bag1.fit(X_train,y_train)

BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=0.9,
         max_samples=0.9, n_estimators=100, n_jobs=None, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [77]:
print(bag1.score(X_train, y_train))
print(cross_val_score(bag1, X_train, y_train, cv=5).mean())
print(bag1.score(X_test, y_test))

0.9450072358900145
0.8357374561816565
0.8526011560693642
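One way to check what `max_samples` and `max_features` actually do is to inspect a fitted ensemble. The sketch below uses synthetic data (an assumption for the demo) and reads the `estimators_samples_` and `estimators_features_` attributes, which in recent scikit-learn versions hold the row and column indices drawn for each base estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 200 rows and 10 features, used only for this demo
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

bag_demo = BaggingClassifier(KNeighborsClassifier(),
                             max_samples=0.5, max_features=0.5,
                             n_estimators=10, random_state=0)
bag_demo.fit(X_demo, y_demo)

# Each base estimator was fitted on half of the rows and half of the columns
n_rows_drawn = len(bag_demo.estimators_samples_[0])
n_cols_drawn = len(bag_demo.estimators_features_[0])
print("rows drawn for the first estimator:", n_rows_drawn)
print("features drawn for the first estimator:", n_cols_drawn)
print("feature indices:", sorted(bag_demo.estimators_features_[0]))
```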


### 5. Fit a `BaggingClassifier` with a KNN base estimator

In [None]:
from sklearn.ensemble import BaggingClassifier

### 6. Cross-validate a decision tree classifier 

In [32]:
from sklearn.tree import DecisionTreeClassifier

In [34]:
classifier = DecisionTreeClassifier(criterion='gini',
                                    max_depth=None, 
                                    random_state=1)
classifier.fit(X_train, y_train)
print(classifier.score(X_train, y_train))
print(cross_val_score(classifier, X_train, y_train, cv=5).mean())
print(classifier.score(X_test, y_test))



1.0
0.9182258148903888
0.9393063583815029


### 7. Fit a `BaggingClassifier` with a decision tree base estimator

In [81]:
bag_dtree = BaggingClassifier(
    DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=1),
    max_samples=1.0, max_features=1.0, n_estimators=500)

In [82]:
bag_dtree.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=500, n_jobs=None, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [83]:
print(bag_dtree.score(X_train, y_train))
print(cross_val_score(bag_dtree, X_train, y_train, cv=5).mean())
print(bag_dtree.score(X_test, y_test))

1.0
0.9363103646732591
0.9335260115606936


### 8.  Of the Hypothesis Space problems we discussed earlier, which ones are solved by bagging?

#### - Statistical?
#### - Computational?
#### - Representational?

In [None]:
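# Bagging mainly addresses the statistical problem: averaging models fit on
# different bootstrap samples reduces variance. It does not remove the
# computational or representational problems, although the ensemble discussion
# earlier in the lesson suggests it can ease them.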

### Bonus: Tune the bagging classifiers with grid search

In [41]:
from sklearn.model_selection import GridSearchCV

In [42]:
bag_try = BaggingClassifier(KNeighborsClassifier(),
                        max_samples=0.9, max_features=0.9)

In [43]:
param_grid_bag_try = {
    'max_samples': [0.1, 0.3, 0.5, 0.7, 0.9],
    'max_features': [0.1, 0.3, 0.5, 0.7, 0.9],
}

In [44]:
gs_bag_try = GridSearchCV(bag_try, param_grid=param_grid_bag_try)

In [46]:
gs_bag_try.fit(X_train, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=0.9,
         max_samples=0.9, n_estimators=10, n_jobs=None, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_samples': [0.1, 0.3, 0.5, 0.7, 0.9], 'max_features': [0.1, 0.3, 0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [47]:
gs_bag_try.fit(X_train, y_train)
print(gs_bag_try.score(X_train, y_train))
print(gs_bag_try.best_score_)
print(gs_bag_try.score(X_test, y_test))




0.8610709117221418
0.8176555716353111
0.8236994219653179


In [48]:
gs_bag_try.best_params_

{'max_features': 0.9, 'max_samples': 0.5}

In [49]:
bag_dtree = BaggingClassifier(
    DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=1))


In [61]:
param_grid_bag_dtree = {
    'max_samples': [0.1, 0.3, 0.5, 0.7, 0.9],
    'max_features': [0.1, 0.3, 0.5, 0.7, 0.9],
}

In [62]:
gs_bag_dtree = GridSearchCV(bag_dtree, param_grid=param_grid_bag_dtree)

In [63]:
gs_bag_dtree.fit(X_train, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            ...stimators=10, n_jobs=None, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_samples': [0.1, 0.3, 0.5, 0.7, 0.9], 'max_features': [0.1, 0.3, 0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [64]:
print(gs_bag_dtree.score(X_train, y_train))
print(gs_bag_dtree.best_score_)
print(gs_bag_dtree.score(X_test, y_test))


0.9725036179450073
0.8914616497829233
0.869942196531792


In [65]:
gs_bag_dtree.best_params_

{'max_features': 0.9, 'max_samples': 0.7}
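Beyond `best_params_`, it can help to look at the whole grid. The sketch below runs a small search on synthetic data (an assumption for the demo) with `cv` and `random_state` set explicitly so that runs are comparable, then tabulates `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for this demo
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    BaggingClassifier(DecisionTreeClassifier(random_state=0),
                      n_estimators=20, random_state=0),
    param_grid={"max_samples": [0.5, 0.9], "max_features": [0.5, 0.9]},
    cv=5,  # explicit number of folds
)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best mean cross-validated score first
results = pd.DataFrame(grid.cv_results_)[
    ["param_max_samples", "param_max_features", "mean_test_score"]
].sort_values("mean_test_score", ascending=False)
print(results)
```

Note that the grids above never include `max_samples=1.0` or `max_features=1.0` and leave `n_estimators` at its default, so the tuned ensemble is not guaranteed to beat the earlier hand-set one.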

## Conclusion 

In this lesson we learned about ensemble models and bagging classifiers, and how they can improve on individual base models by better approximating the true function $f$ in a supervised learning problem.

## ADDITIONAL RESOURCES

- [Ensemble models on wikipedia](https://en.wikipedia.org/wiki/Ensemble_learning)
- [Bagging on wikipedia](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
- [Ensemble methods on Scikit Learn](http://scikit-learn.org/stable/modules/ensemble.html)
- [Bagging Classifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
- [Bias-Variance Decomposition Scikit Learn Example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html#example-ensemble-plot-bias-variance-py)
- [T. Dietterich: Ensemble Methods in Machine Learning](http://www.cs.iastate.edu/~jtian/cs573/Papers/Dietterich-ensemble-00.pdf)
- [N.C. Oza and K. Tumer: Classifier Ensembles: Select Real World Applications](https://www.researchgate.net/profile/Nikunj_Oza/publication/222425707_Classifier_ensembles_Select_real-world_applications/links/0c960514cd67f0fdde000000/Classifier-ensembles-Select-real-world-applications.pdf)
- [KDNuggets article 1](http://www.kdnuggets.com/2016/02/ensemble-methods-techniques-produce-improved-machine-learning.html)
- [KDNuggets article 2](http://www.kdnuggets.com/faq/simple-data-mining-case-study.html)
- [Ensemble Methods book](http://www.amazon.com/dp/1608452840)