# Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning.

In this chapter we explore different ensemble methods for combining several models into one. By aggregating the predictions of a group of predictors, we can often get better predictions than with the best individual predictor.

## Voting Classifiers

Suppose we have trained several classifiers. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a _hard voting_ classifier.

This voting classifier often achieves a higher accuracy than the best classifier in the ensemble.

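Ensemble methods work best when the predictors are as independent from one another as possible, for example when they are trained with very different algorithms, so that they tend to make different types of errors.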
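The effect of independence can be illustrated with a small simulation. The sketch below assumes 1,001 hypothetical classifiers that are each correct only 51% of the time and whose errors are fully independent (an idealized assumption, since real classifiers trained on the same data are correlated), and estimates how often their majority vote is correct.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1001        # odd, so a strict majority always exists
individual_accuracy = 0.51  # each hypothetical classifier is barely better than chance
n_trials = 10_000           # number of simulated instances

# For each instance, count how many of the independent classifiers are correct.
n_correct = rng.binomial(n_classifiers, individual_accuracy, size=n_trials)

# The hard vote is correct when more than half of the classifiers are correct.
ensemble_accuracy = np.mean(n_correct > n_classifiers // 2)

print(f"individual accuracy:    {individual_accuracy:.2f}")
print(f"majority-vote accuracy: {ensemble_accuracy:.2f}")
```

The gain shrinks as the errors become correlated, which is why diverse models help.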

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="liblinear", random_state=42) 
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42) 
svm_clf = SVC(gamma="auto", random_state=42)
voting_clf = VotingClassifier(
    estimators=[
        ('lr', log_clf), 
        ('rf', rnd_clf), 
        ('svc', svm_clf)], 
    voting='hard'
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=42,
                                                 solver='liblinear', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gin...
                                        

In [3]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896


If all classifiers are able to estimate class probabilities (i.e. they have a predict_proba() method), we can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called _soft voting_.
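The difference between the two voting schemes can be seen on a single instance. The probabilities below are made up for illustration: two classifiers lean weakly towards class 0, while one is very confident in class 1.

```python
import numpy as np

# Hypothetical predicted probabilities for one instance:
# one row per classifier, one column per class (class 0, class 1).
probas = np.array([
    [0.55, 0.45],
    [0.55, 0.45],
    [0.05, 0.95],
])

# Hard voting: each classifier votes for its most likely class.
votes = probas.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_vote = mean_proba.argmax()

print("hard vote:", hard_vote)
print("soft vote:", soft_vote)
```

Soft voting gives more weight to highly confident votes, so here the two schemes disagree.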

In [4]:
log_clf = LogisticRegression(solver='liblinear', random_state=42) 
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42) 
svm_clf = SVC(gamma="auto", probability=True, random_state=42)  # probability=True: SVC does not estimate class probabilities by default
voting_clf = VotingClassifier(
    estimators=[
        ('lr', log_clf), 
        ('rf', rnd_clf), 
        ('svc', svm_clf)], 
    voting='soft'
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=42,
                                                 solver='liblinear', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gin...
                                        

In [5]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912


## Bagging and Pasting

Instead of using a diverse set of classifiers, another approach is to use the same training algorithm for every predictor but to train each one on a different random subset of the training set.

When sampling is performed with replacement, this method is called __bagging__. It allows a training instance to be sampled several times for the same predictor.

When sampling is performed without replacement, it is called __pasting__.
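The two sampling schemes can be sketched with NumPy. The sizes used here (1,000 instances, a full-size bagging sample and a half-size pasting sample) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 1000
indices = np.arange(n_instances)

# Bagging: draw with replacement, so an instance can appear several times.
bagging_sample = rng.choice(indices, size=n_instances, replace=True)

# Pasting: draw without replacement, so every drawn instance is distinct.
pasting_sample = rng.choice(indices, size=n_instances // 2, replace=False)

n_unique_bagging = np.unique(bagging_sample).size
n_unique_pasting = np.unique(pasting_sample).size

print("distinct instances in the bagging sample:", n_unique_bagging)
print("distinct instances in the pasting sample:", n_unique_pasting)
```

With a bagging sample as large as the training set, roughly 63% of the instances are expected to be drawn at least once; the rest are left out for that predictor.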

After training, the ensemble can make a prediction for a new instance by aggregating the predictions of all predictors.

The code below trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement. The BaggingClassifier automatically performs soft voting instead of hard voting when the base classifier can estimate class probabilities, which is the case for Decision Trees.

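Overall, bagging often results in better models than pasting: sampling with replacement makes the training subsets more diverse, so the predictors end up less correlated.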

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), 
    n_estimators=500,
    max_samples=100, 
    bootstrap=True, 
    n_jobs=-1, 
    random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
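To judge what the ensemble buys us, it can be compared with a single Decision Tree trained on the same data. The following is a minimal, self-contained sketch that regenerates the moons dataset used above; the variable names `tree_acc` and `bag_acc` are introduced here for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: one fully grown Decision Tree.
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

# Bagging ensemble of 500 trees, each trained on 100 bootstrapped instances.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    random_state=42)
bag_clf.fit(X_train, y_train)

tree_acc = accuracy_score(y_test, tree_clf.predict(X_test))
bag_acc = accuracy_score(y_test, bag_clf.predict(X_test))
print("single tree:", tree_acc)
print("bagging:    ", bag_acc)
```

The ensemble typically generalizes better than the single tree because averaging many trees reduces variance, but the exact numbers depend on the data split and the scikit-learn version.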

### Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features.

So each predictor will be trained on a random subset of the input features.

Sampling both training instances and features is called the Random Patches method.

Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method.
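Both methods map directly onto BaggingClassifier hyperparameters. The sketch below uses a synthetic 20-feature dataset from make_classification and arbitrary sampling ratios (0.7 and 0.5), chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Random Patches: sample both the training instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42)
patches_clf.fit(X, y)

# Random Subspaces: keep all training instances, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42)
subspaces_clf.fit(X, y)

# Each fitted predictor records which feature columns it was trained on.
print(subspaces_clf.estimators_features_[0])
```

Sampling features is mostly useful for high-dimensional inputs, where it adds diversity among the predictors.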

## Random Forests

Random Forest is an ensemble of Decision Trees, trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set.

In [7]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
# Predict with the forest itself before scoring it
y_pred_rf = rnd_clf.predict(X_test)
print(rnd_clf.__class__.__name__, accuracy_score(y_test, y_pred_rf))


The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.

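The code below builds a roughly equivalent classifier from a BaggingClassifier: each tree considers only a random subset of the features at every split.

In [8]:
bag_clf = BaggingClassifier(
    # max_features="sqrt" restricts every split to a random subset of features,
    # which mirrors RandomForestClassifier's default for classification
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=True,
    n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)  # predict with the newly fitted ensemble
print(bag_clf.__class__.__name__, accuracy_score(y_test, y_pred_bag))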


## Feature Importance

When we look at a single Decision Tree, important features are likely to appear closer to the root of the tree, while unimportant features will often appear closer to the leaves.

Scikit-Learn estimates a feature's importance by measuring how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest. The scores are then scaled so that they sum to 1.

In [9]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10166144493702116
sepal width (cm) 0.025918171386428428
petal length (cm) 0.44530247338853657
petal width (cm) 0.4271179102880139
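The forest-level scores can be related to the individual trees. The sketch below assumes that scikit-learn's forest averages the per-tree feature_importances_ and renormalizes the result, and checks that assumption numerically on the iris data; the names `forest`, `per_tree` and `averaged` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

# Per-tree importances: one row per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees and renormalize so the scores sum to 1.
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()

for name, score in zip(iris["feature_names"], averaged):
    print(name, round(score, 3))
print("matches forest.feature_importances_:",
      np.allclose(averaged, forest.feature_importances_))
```

Because the scores are relative and sum to 1, they rank features within one model rather than measuring importance on an absolute scale.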


## Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. 
The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. 

### AdaBoost

In AdaBoost, each new predictor tries to correct its predecessor by paying a bit more attention to the training instances that the predecessor underfitted.

To build an AdaBoost classifier, a first base classifier is trained and used to make predictions on the training set. The relative weight of the misclassified training instances is then increased.
Next, a second classifier is trained using the updated weights; it again makes predictions on the training set, the weights are updated again, and so on for each subsequent classifier.
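One round of the weight update can be worked through by hand. The sketch below assumes a common formulation of AdaBoost in which the predictor weight is learning_rate * log((1 - r) / r), with r the weighted error rate, and misclassified instances are boosted by exp(predictor weight); scikit-learn's implementation differs in its details. The labels and predictions are made up for illustration.

```python
import numpy as np

learning_rate = 1.0

# Hypothetical labels and predictions of the first weak learner (one mistake).
y_true = np.array([1, 1, 0, 0, 1])
stump_pred = np.array([1, 0, 0, 0, 1])

# Every training instance starts with the same weight.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error rate of this predictor.
misclassified = stump_pred != y_true
error_rate = weights[misclassified].sum() / weights.sum()

# Predictor weight: the more accurate the predictor, the larger its weight.
predictor_weight = learning_rate * np.log((1 - error_rate) / error_rate)

# Boost the misclassified instances, then normalize the weights.
weights[misclassified] *= np.exp(predictor_weight)
weights /= weights.sum()

print("error rate:", error_rate)
print("updated weights:", weights)
```

After this round the single misclassified instance carries half of the total weight, so the next predictor is pushed to classify it correctly.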

In [10]:
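from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stumps as weak learners
    n_estimators=200,
    algorithm="SAMME.R",  # deprecated in recent scikit-learn releases; omit it there
    learning_rate=0.5
)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)  # predict with the boosted ensemble before scoring it
print(ada_clf.__class__.__name__, accuracy_score(y_test, y_pred_ada))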


### Gradient Boosting

Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
Instead of tweaking the instance weights at every iteration as AdaBoost does, Gradient Boosting fits each new predictor to the residual errors made by the previous predictor.

In [11]:
from sklearn.tree import DecisionTreeRegressor
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:

In [12]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=42, splitter='best')

Then we train a third regressor on the residual errors made by the second predictor:

In [13]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2) 
tree_reg3.fit(X, y3)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [14]:
X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

array([0.75026781])
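The manual three-tree procedure generalizes to a loop. The following is a minimal sketch that regenerates similar noisy quadratic data; the helper `ensemble_predict` and the list `training_mse` are names introduced here for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

n_trees = 3
trees = []
training_mse = []
residual = y.copy()

for _ in range(n_trees):
    # Fit the next tree to what the current ensemble still gets wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)
    trees.append(tree)
    residual = residual - tree.predict(X)
    training_mse.append(np.mean(residual ** 2))

def ensemble_predict(X_new):
    """Sum the predictions of all trees fitted so far."""
    return sum(tree.predict(X_new) for tree in trees)

print("training MSE after each tree:", training_mse)
print("prediction at x=0.4:", ensemble_predict(np.array([[0.4]])))
```

Each added tree can only lower (or keep) the training error, which is why the number of trees has to be limited or tuned to avoid overfitting.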

In [15]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=1.0, loss='ls', max_depth=2,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=3,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

## Stacking

Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, we can train a model to perform this aggregation.

The final predictor, also called a blender or a meta learner, takes the predictions of the previous predictors as inputs and makes the final prediction.

A common approach to training the blender is to use a hold-out set: a part of the training set that none of the initial predictors saw during training. The initial predictors make predictions on the hold-out instances, and the blender is trained on those predictions.
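The hold-out approach can be sketched in a few lines. The choice of base predictors, the 50/50 split of the training set, and the helper `to_meta_features` are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Split the training set: one half for the base predictors, one half held out.
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42)

base_predictors = [
    LogisticRegression(),
    RandomForestClassifier(n_estimators=50, random_state=42),
    SVC(random_state=42),
]
for predictor in base_predictors:
    predictor.fit(X_base, y_base)

def to_meta_features(X_in):
    """Stack the base predictions column-wise: one column per base predictor."""
    return np.column_stack([p.predict(X_in) for p in base_predictors])

# The blender learns from predictions made on instances the base predictors never saw.
blender = LogisticRegression()
blender.fit(to_meta_features(X_hold), y_hold)

stack_acc = accuracy_score(y_test, blender.predict(to_meta_features(X_test)))
print("stacking accuracy:", stack_acc)
```

Training the blender on hold-out predictions keeps it from simply learning how well the base predictors memorized their own training data.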