# Ensemble Learning

If we aggregate the predictions of a group of predictors (such as classifiers or regressors), we will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

For example, we can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, we obtain the predictions of all the individual trees, then predict the class that gets the most votes.

## Voting Classifier

Suppose we have trained a few classifiers. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes.

This majority-vote classifier is called a hard voting classifier.

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators = [('lr', log_clf),('rf', rnd_clf),('svc', svm_clf)],
    voting = 'hard'
)

In [4]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

moon_dataset = make_moons(n_samples=10000, noise=0.4)
X_train, X_test, y_train, y_test = train_test_split(moon_dataset[0], moon_dataset[1])

In [5]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

In [6]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8268
RandomForestClassifier 0.8336
SVC 0.8496
VotingClassifier 0.8472


## Soft Voting

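If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then we can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. To use it, set voting='soft' in the VotingClassifier. Note that SVC does not estimate class probabilities by default; its probability hyperparameter must be set to True to give it a predict_proba() method.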
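As a minimal sketch of soft voting, the snippet below builds the same three kinds of classifier as above on a smaller moons dataset. The dataset size, `n_estimators=50`, the `random_state` values, and the variable names are illustrative assumptions, not values from the original notebook.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small synthetic dataset; sizes and seeds are illustrative choices.
X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        # SVC only exposes predict_proba() when probability=True.
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average class probabilities instead of counting votes
)
soft_voting_clf.fit(X_train, y_train)

# Averaged class probabilities for each test instance, shape (n_test, 2).
proba = soft_voting_clf.predict_proba(X_test)
soft_accuracy = soft_voting_clf.score(X_test, y_test)
print("soft voting accuracy:", soft_accuracy)
```

Soft voting gives more weight to highly confident votes, so it is worth comparing against hard voting on the same split rather than assuming one will win.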

## Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms. Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called "bagging" (short for bootstrap aggregating). When sampling is performed without replacement, it is called "pasting".

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
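The difference between the two sampling schemes can be sketched with NumPy index sampling. The sizes and seed below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, n_sampled = 10, 6  # illustrative sizes

# Bagging: sample indices with replacement, so duplicates are possible.
bagging_idx = rng.choice(n_instances, size=n_sampled, replace=True)

# Pasting: sample without replacement, so each index appears at most once.
pasting_idx = rng.choice(n_instances, size=n_sampled, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

Each predictor in the ensemble would draw its own set of indices this way, which is why an instance can appear in the training sets of several predictors under either scheme.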

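Once all predictors are trained, the ensemble can make a prediction for a new instance by aggregating the predictions of all predictors. The aggregation function is typically the statistical mode for classification (i.e., the most frequent prediction, as in a hard voting classifier) or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.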
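The two aggregation functions can be illustrated on a few hand-written predictions. The arrays below are hypothetical values chosen for the example, with one row per predictor and one column per instance.

```python
import numpy as np

# Hypothetical class predictions from 3 classifiers for 3 instances.
class_preds = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])

# Classification: statistical mode per instance (majority vote).
majority = np.array(
    [np.bincount(col, minlength=2).argmax() for col in class_preds.T]
)

# Hypothetical regression predictions from 3 regressors for 2 instances.
reg_preds = np.array([
    [2.0, 3.0],
    [4.0, 5.0],
    [3.0, 4.0],
])

# Regression: average per instance.
average = reg_preds.mean(axis=0)

print(majority)  # [0 1 0]
print(average)   # [3. 4.]
```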

## Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for regression).

The following code trains an ensemble of 500 Decision Tree classifiers. Each is trained on 100 training instances randomly sampled from the training set with replacement. Setting n_jobs=-1 tells Scikit-Learn to use all available CPU cores, and oob_score=True requests the out-of-bag evaluation discussed below.

This is an example of bagging; for pasting, set bootstrap=False.

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True
)
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1, oob_score=True)

BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.

## Out of Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. The training instances that are not sampled for a predictor are called its out-of-bag (oob) instances. Note that they are not the same for every predictor.
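To get a feel for how many instances end up out of bag, here is a small simulation of a single bootstrap sample. It assumes the sample size equals the training set size m, in which case roughly 1 - 1/e ≈ 63% of instances are sampled and about 37% are left out. The value of m and the seed are illustrative; with a much smaller max_samples, as in the BaggingClassifier above, the oob fraction per predictor is far larger.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # illustrative training set size

# One bootstrap sample of size m, drawn with replacement.
sampled_idx = rng.choice(m, size=m, replace=True)

# Out-of-bag instances are those never drawn for this predictor.
oob_mask = np.ones(m, dtype=bool)
oob_mask[sampled_idx] = False

oob_fraction = oob_mask.mean()
print(f"oob fraction: {oob_fraction:.3f}")
```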

Since a predictor never sees its oob instances during training, it can be evaluated on these instances without the need for a separate validation set. We can evaluate the ensemble itself by averaging the oob evaluations of each predictor.

In Scikit-Learn, we can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training. The resulting evaluation score is available through the oob_score_ attribute:

In [11]:
bag_clf.oob_score_

0.8589333333333333

Let's check how close this estimate is to the accuracy on the test set, using accuracy_score:

In [12]:
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8536

The oob decision function for each training instance is also available through the oob_decision_function_ attribute. In this case, since the base estimator has a predict_proba() method, it returns the class probabilities for each training instance.

For example, it estimates that the first training instance has a 94.4% probability of belonging to the positive class and a 5.6% probability of belonging to the negative class:

In [13]:
bag_clf.oob_decision_function_

array([[0.05645161, 0.94354839],
       [0.01417004, 0.98582996],
       [0.91549296, 0.08450704],
       ...,
       [0.01622718, 0.98377282],
       [0.38742394, 0.61257606],
       [0.17611336, 0.82388664]])

## Random Patches method

Sampling both training instances and features is called the Random Patches method.

The BaggingClassifier class supports sampling features as well. Feature sampling is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features. This technique is particularly useful for high-dimensional inputs (such as images).

## Random Subspaces method

Keeping all training instances (by setting bootstrap=False and max_samples=1.0) but sampling features (by setting bootstrap_features=True and/or max_features to a value smaller than 1.0) is called the Random Subspaces method.

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
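The two configurations might look like the sketch below. The synthetic dataset from make_classification, the 0.5 sampling ratios, n_estimators=50, and the variable names are all illustrative assumptions; only the hyperparameter names come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset with 20 features so that feature sampling matters.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.5, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

# Random Subspaces: keep all training instances, sample features only.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42,
)

patches_clf.fit(X, y)
subspaces_clf.fit(X, y)

# estimators_features_ holds the feature indices used by each predictor.
print(subspaces_clf.estimators_features_[0])
```

After fitting, the estimators_features_ attribute can be inspected to confirm that each tree only saw a subset of the 20 input features.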