# Chapter 7. Ensemble Learning and Random Forests

Suppose you pose a complex question to thousands of random people, then
aggregate their answers. In many cases you will find that this aggregated
answer is better than an expert’s answer. This is called the wisdom of the
crowd. Similarly, if you aggregate the predictions of a group of predictors
(such as classifiers or regressors), you will often get better predictions than
with the best individual predictor. A group of predictors is called an
ensemble; thus, this technique is called **ensemble learning**, and an ensemble
learning algorithm is called an **ensemble method**.

## Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80%
accuracy. You may have a logistic regression classifier, an SVM classifier, a
random forest classifier, a k-nearest neighbors classifier, and perhaps a few
more. A very simple way to create an even better classifier is to aggregate the
predictions of each classifier: the class that gets the most votes is the
ensemble’s prediction. This majority-vote classifier is called a hard voting
classifier.

Somewhat surprisingly, this voting classifier often achieves a higher accuracy
than the best classifier in the ensemble. In fact, even if each classifier is a
weak learner (meaning it does only slightly better than random guessing), the
ensemble can still be a strong learner (achieving high accuracy), provided
there are a sufficient number of weak learners in the ensemble and they are
sufficiently diverse.
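
The effect of ensemble size can be checked with a quick calculation. The sketch below assumes the classifiers make perfectly independent errors (rarely true in practice, since they are usually trained on the same data) and computes the probability that a majority of n classifiers, each correct with probability p, votes for the right class. The function name `majority_vote_accuracy` is purely illustrative.

```python
import math


def majority_vote_accuracy(n: int, p: float) -> float:
    """Probability that more than half of n independent voters are right,
    when each voter is right with probability p (binomial tail)."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        # log C(n, k), computed in log space to avoid overflow for large n
        log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        total += math.exp(log_comb + k * log_p + (n - k) * log_q)
    return total


# Assumption: every weak learner is right 51% of the time, independently
for n in (1, 101, 1001, 10001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

Under this idealized independence assumption the majority-vote accuracy climbs steadily with n, which is why diversity among the predictors matters so much: correlated errors break the assumption and shrink the gain.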

If all classifiers are able to estimate class probabilities (i.e., if they all have a
predict_proba() method), then you can tell Scikit-Learn to predict the class
with the highest class probability, averaged over all the individual classifiers.
This is called soft voting. It often achieves higher performance than hard
voting because it gives more weight to highly confident votes. All you need
to do is set the voting classifier’s voting hyperparameter to "soft", and ensure
that all classifiers can estimate class probabilities.


```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy dataset: two interleaving half-circles with Gaussian noise
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting is the default (voting="hard")
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)

# Soft voting: SVC only exposes predict_proba() when probability=True
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)  # refit so the change takes effect
voting_clf.score(X_test, y_test)
```
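
To see why soft voting can disagree with hard voting, it may help to work through one instance by hand. The probabilities below are made up for illustration; they stand in for the `predict_proba()` outputs of three hypothetical classifiers.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for ONE instance;
# columns are [P(class 0), P(class 1)]
probas = np.array([
    [0.90, 0.10],  # very confident in class 0
    [0.40, 0.60],  # mildly prefers class 1
    [0.45, 0.55],  # mildly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard:", hard_prediction, "soft:", soft_prediction, "mean proba:", mean_proba)
```

Here two lukewarm votes outnumber one confident vote under hard voting, while soft voting lets the confident classifier tip the average toward class 0.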

## Bagging and Pasting

One way to get a diverse set of predictors is to use very different training
algorithms, as in a voting classifier. Another approach is to use the same
training algorithm for every predictor but train them on different random
subsets of the training set. When sampling is performed *with replacement*,
this method is called **bagging** (short for bootstrap aggregating). When
sampling is performed *without replacement*, it is called **pasting**.


Once all predictors are trained, the ensemble can make a prediction for a new
instance by simply aggregating the predictions of all predictors. The
aggregation function is typically the statistical mode for classification (i.e.,
the most frequent prediction, just like with a hard voting classifier), or the
average for regression. Each individual predictor has a higher bias than if it
were trained on the original training set, but aggregation reduces both bias
and variance. Generally, the net result is that the ensemble has a similar
bias but a lower variance than a single predictor trained on the original
training set.
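
As a minimal sketch, both variants can be tried with Scikit-Learn's `BaggingClassifier`, where the `bootstrap` hyperparameter switches between sampling with and without replacement. The dataset, the number of estimators, and `max_samples=100` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree is trained on 100 instances drawn WITH replacement
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)

# Pasting: same setup, but instances are drawn WITHOUT replacement
paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                              max_samples=100, bootstrap=False, random_state=42)
paste_clf.fit(X_train, y_train)

bag_score = bag_clf.score(X_test, y_test)
paste_score = paste_clf.score(X_test, y_test)
print("bagging:", bag_score, "pasting:", paste_score)
```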

### Out-of-bag Evaluation

With bagging, some training instances may be sampled several times for any
given predictor, while others may not be sampled at all. When each predictor
draws m instances with replacement from a training set of size m (the default
for BaggingClassifier), it can be shown mathematically that only about 63% of
the training instances are sampled on average for each predictor. The
remaining 37% of the training instances that are not sampled are called
out-of-bag (OOB) instances. Note that they are not the same 37% for all
predictors.
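
The 63% figure can be checked numerically. When m instances are drawn with replacement from a set of size m, a given instance is picked at least once with probability 1 - (1 - 1/m)^m, which approaches 1 - 1/e (about 0.632) as m grows. The sketch below compares that closed form with a simulation; the value of `m` is an arbitrary choice.

```python
import numpy as np

m = 10_000  # training-set size (arbitrary choice for the simulation)
rng = np.random.default_rng(42)

# Closed form: P(instance is sampled at least once) = 1 - (1 - 1/m)^m
expected_fraction = 1 - (1 - 1 / m) ** m

# Simulation: draw m indices with replacement and count the distinct ones
sample = rng.integers(0, m, size=m)
observed_fraction = np.unique(sample).size / m

print("closed form:", expected_fraction)
print("simulated:  ", observed_fraction)
print("limit 1-1/e:", 1 - np.exp(-1))
```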

A bagging ensemble can be evaluated using OOB instances, without the need
for a separate validation set: indeed, if there are enough estimators, then each
instance in the training set will likely be an OOB instance of several
estimators, so these estimators can be used to make a fair ensemble prediction
for that instance. Once you have a prediction for each instance, you can
compute the ensemble’s prediction accuracy.
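
In Scikit-Learn, this evaluation can be requested by setting `oob_score=True` on a `BaggingClassifier`; the result is then available in the `oob_score_` attribute after training. The sketch below uses an arbitrary toy dataset and compares the OOB estimate with the accuracy on a held-out test set.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# oob_score=True asks for an automatic OOB evaluation after training
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            oob_score=True, random_state=42)
bag_clf.fit(X_train, y_train)

oob_accuracy = bag_clf.oob_score_              # estimated from OOB instances only
test_accuracy = bag_clf.score(X_test, y_test)  # measured on held-out data
print("OOB estimate:", oob_accuracy, "test accuracy:", test_accuracy)

# Per-instance OOB class probabilities (one row per training instance)
print(bag_clf.oob_decision_function_[:3])
```

The two numbers should usually be close, but the OOB score is only an estimate and can be slightly pessimistic or optimistic on a given dataset.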

### Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Sampling
is controlled by two hyperparameters: max_features and bootstrap_features.
They work the same way as max_samples and bootstrap, but for feature
sampling instead of instance sampling. Thus, each predictor will be trained
on a random subset of the input features.

This technique is particularly useful when you are dealing with 
high-dimensional inputs (such as images), as it can considerably speed up training.
Sampling both training instances and features is called the **random patches
method**. Keeping all training instances (by setting bootstrap=False and
max_samples=1.0) but sampling features (by setting bootstrap_features to
True and/or max_features to a value smaller than 1.0) is called the **random
subspaces method**.
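
The two configurations can be sketched as follows. The synthetic 20-feature dataset and the sampling ratios are assumptions chosen only to make the example small; they are not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20 features (a small stand-in for high-dimensional inputs)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

# Random patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42)
patches_clf.fit(X, y)

# Random subspaces: keep all training instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42)
subspaces_clf.fit(X, y)

# Each predictor records the indices of the features it was trained on
print(patches_clf.estimators_features_[0])
print(subspaces_clf.estimators_features_[0])
```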


## Random Forests

As we have discussed, a random forest is an ensemble of decision trees,
generally trained via the bagging method (or sometimes pasting), typically
with max_samples set to the size of the training set.

The random forest algorithm introduces extra randomness when growing
trees; instead of searching for the very best feature when splitting a node (see
Chapter 6), it searches for the best feature among a random subset of features.
By default, it samples √n features (where n is the total number of features).
The algorithm results in greater tree diversity, which (again) trades a higher
bias for a lower variance, generally yielding an overall better model.
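
A short sketch of this idea: `RandomForestClassifier` can be compared with a bagging ensemble of decision trees that each consider only a random subset of features at every split (`max_features="sqrt"`). The two are roughly, not exactly, equivalent, and the hyperparameter values below are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random forest: bagged trees plus random feature subsets at each split
rnd_clf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=16,
                                 random_state=42)
rnd_clf.fit(X_train, y_train)

# Roughly equivalent ensemble built by hand from BaggingClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, random_state=42)
bag_clf.fit(X_train, y_train)

rf_score = rnd_clf.score(X_test, y_test)
bag_score = bag_clf.score(X_test, y_test)
print("random forest:", rf_score, "bagged trees:", bag_score)
```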

### Feature Importance

Yet another great quality of random forests is that they make it easy to
measure the relative importance of each feature. Scikit-Learn measures a
feature’s importance by looking at how much the tree nodes that use that
feature reduce impurity on average, across all trees in the forest. More
precisely, it is a weighted average, where each node’s weight is equal to the
number of training samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training,
then it scales the results so that the sum of all importances is equal to 1. 
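
These scores are exposed through the `feature_importances_` attribute. As a small illustration, the sketch below trains a forest on the iris dataset (chosen here only because it ships with Scikit-Learn) and prints one score per feature.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# Map each feature name to its importance score
importances = dict(zip(iris.feature_names, rnd_clf.feature_importances_))
for name, score in importances.items():
    print(f"{name}: {score:.2f}")

# The scores are normalized, so they add up to 1
print("sum:", sum(importances.values()))
```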



## Boosting

Boosting (originally called hypothesis boosting) refers to any ensemble
method that can combine several weak learners into a strong learner. The
general idea of most boosting methods is to train predictors sequentially, each
trying to correct its predecessor.

### AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more
attention to the training instances that the predecessor underfit. This results in
new predictors focusing more and more on the hard cases. This is the
technique used by AdaBoost.

For example, when training an AdaBoost classifier, the algorithm first trains
a base classifier (such as a decision tree) and uses it to make predictions on
the training set. The algorithm then increases the relative weight of
misclassified training instances. Then it trains a second classifier, using the
updated weights, and again makes predictions on the training set, updates the
instance weights, and so on.
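
A minimal sketch with Scikit-Learn's `AdaBoostClassifier`, using decision stumps (trees with `max_depth=1`) as base classifiers. The dataset, `n_estimators=50`, and `learning_rate=0.5` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single decision stump, for comparison
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train)

# AdaBoost trains stumps sequentially, reweighting the training instances
# after each one so that misclassified instances get more attention
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, learning_rate=0.5,
                             random_state=42)
ada_clf.fit(X_train, y_train)

stump_score = stump.score(X_test, y_test)
ada_score = ada_clf.score(X_test, y_test)
print("single stump:", stump_score, "AdaBoost:", ada_score)
```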

### Gradient Boosting

Just like AdaBoost, gradient boosting works by sequentially adding predictors to an
ensemble, each one correcting its predecessor. However, instead of tweaking
the instance weights at every iteration like AdaBoost does, this method tries
to fit the new predictor to the residual errors made by the previous predictor.
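
The residual-fitting idea can be sketched by hand with three small regression trees. The noisy quadratic dataset and the tree depth are assumptions made for illustration; in practice a class such as `GradientBoostingRegressor` automates this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic toy data (arbitrary choice)
rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

# First tree is fit to the targets
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Second tree is fit to the residual errors of the first
residuals1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals1)

# Third tree is fit to the residual errors left by the second
residuals2 = residuals1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals2)


def ensemble_predict(X_new):
    """The ensemble's prediction is the sum of all trees' predictions."""
    return sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))


mse_first = np.mean((y - tree1.predict(X)) ** 2)
mse_ensemble = np.mean((y - ensemble_predict(X)) ** 2)
print("training MSE, first tree:", mse_first)
print("training MSE, three trees:", mse_ensemble)
```

Each added tree lowers the training error here; whether it also improves generalization depends on the data and on how many trees are added.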

## Stacking

Stacking is based on a simple idea: instead of
using trivial functions (such as hard voting) to aggregate the predictions of all
predictors in an ensemble, why don’t we train a model to perform this
aggregation?

For example, suppose we have three predictors, and each produces a different
output for the same instance. The goal is to combine these outputs into a
final prediction. Rather than using a trivial function (such as hard or soft
voting), we train another model, often called a blender or meta learner, that
takes the predictions of the previous models as inputs and produces the final
output.
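
A minimal sketch using Scikit-Learn's `StackingClassifier`. The choice of base predictors, the logistic regression used as the final model, and `cv=5` are assumptions for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42)),
    ],
    # The final model is trained on the base predictors' outputs
    final_estimator=LogisticRegression(random_state=42),
    # Base predictors' outputs are produced with 5-fold cross-validation,
    # so the final model never sees predictions made on training folds
    cv=5,
)
stacking_clf.fit(X_train, y_train)

stacking_score = stacking_clf.score(X_test, y_test)
print("stacking accuracy:", stacking_score)
```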