# 7. Ensemble Learning and Random Forests

Aggregating the predictions of several predictors is called **Ensemble Learning**. Much like the phenomenon known as the _wisdom of the crowd_, the aggregated model often turns out to be more effective than any individual predictor in it.

**Note**: Ensemble methods work best when the predictors are as independent from one another as possible.

### Voting Classifiers

A simple way to build a better classifier is to aggregate the predictions of several classifiers. With **hard voting**, the ensemble predicts the class that gets the most votes. With **soft voting**, it predicts the class with the highest class probability, averaged over all the individual classifiers. Soft voting requires every classifier to be able to estimate class probabilities.
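The two voting modes can be sketched with Scikit-Learn's `VotingClassifier`. This is a minimal illustration, not a tuned setup: the moons dataset, the three classifiers, and every hyperparameter are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy dataset (an assumption, used only for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def make_estimators():
    """Return fresh copies of three different classifiers.

    All three expose predict_proba, which soft voting needs.
    """
    return [
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("nb", GaussianNB()),
    ]


# Hard voting: majority vote over the predicted classes
hard_clf = VotingClassifier(estimators=make_estimators(), voting="hard")
# Soft voting: argmax of the averaged class probabilities
soft_clf = VotingClassifier(estimators=make_estimators(), voting="soft")

hard_clf.fit(X_train, y_train)
soft_clf.fit(X_train, y_train)

hard_acc = hard_clf.score(X_test, y_test)
soft_acc = soft_clf.score(X_test, y_test)
print("hard voting accuracy:", hard_acc)
print("soft voting accuracy:", soft_acc)
```

Which mode scores higher depends on the data and on how well calibrated the individual probability estimates are, so it is worth comparing both.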

### Bagging and Pasting

The method we discussed above involves using different classifiers on the same dataset. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set.

This second method has two variants:
1. **Bagging** (short for bootstrap aggregating): sampling with replacement
2. **Pasting**: sampling without replacement

Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
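In Scikit-Learn, both variants can be sketched with `BaggingClassifier`, where the `bootstrap` flag switches between sampling with and without replacement. The dataset, the number of trees, and `max_samples=100` are assumptions picked for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (an assumption, used only for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree sees 100 instances drawn WITH replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=200,
    max_samples=100,
    bootstrap=True,
    random_state=42,
)

# Pasting: each tree sees 100 instances drawn WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=200,
    max_samples=100,
    bootstrap=False,
    random_state=42,
)

bag_clf.fit(X_train, y_train)
paste_clf.fit(X_train, y_train)

bag_acc = bag_clf.score(X_test, y_test)
paste_acc = paste_clf.score(X_test, y_test)
print("bagging accuracy:", bag_acc)
print("pasting accuracy:", paste_acc)
```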

#### Out-of-bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. When each predictor draws as many instances as the training set contains, about 37% of the training instances are, on average, never sampled for that predictor (they are left **out-of-bag**, so to speak). Since a predictor never sees its out-of-bag instances during training, we can use them to evaluate it without setting aside a separate validation set.
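The 37% figure follows from a short calculation. As a sketch, assume each predictor draws $m$ instances uniformly with replacement from a training set of size $m$ (the symbol $m$ is introduced here for illustration). A given instance is missed by a single draw with probability $1 - 1/m$, so the probability that it is never drawn is:

$$
\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m} = e^{-1} \approx 0.368
$$

Each predictor therefore trains on roughly 63% of the distinct instances, and the remaining 37% or so are out-of-bag for it.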

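In Scikit-Learn, simply pass `oob_score=True` when creating the bagging classifier. After training, the resulting score is available in its `oob_score_` attribute.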
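A minimal sketch of out-of-bag evaluation, compared against a held-out test set. The dataset and hyperparameters are assumptions for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (an assumption, used only for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=200,
    bootstrap=True,   # out-of-bag scoring only makes sense with bagging
    oob_score=True,   # evaluate each instance with the trees that never saw it
    random_state=42,
)
bag_clf.fit(X_train, y_train)

oob_acc = bag_clf.oob_score_               # accuracy estimated from out-of-bag instances
test_acc = bag_clf.score(X_test, y_test)   # accuracy on data held out explicitly
print("out-of-bag accuracy:", oob_acc)
print("test accuracy:", test_acc)

# Per-instance class probabilities estimated from the out-of-bag trees
oob_proba = bag_clf.oob_decision_function_
```

The out-of-bag estimate is usually close to the test accuracy, though the two need not match exactly.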

#### Random Patches and Random Subspaces

Sometimes, especially when dealing with high-dimensional inputs, it may be helpful to sample a subset of **features** rather than training instances.
* Sampling both training instances and features is called the **Random Patches** method
* Keeping all training instances and sampling features is called the **Random Subspaces** method
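Both methods can be sketched with `BaggingClassifier`, which controls feature sampling through `max_features` and `bootstrap_features`, in the same way `max_samples` and `bootstrap` control instance sampling. The synthetic dataset and the sampling ratios below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 20-feature dataset (an assumption, used only for illustration)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=42
)

# Random Subspaces: keep every training instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=False,
    max_samples=1.0,          # all instances, without replacement
    bootstrap_features=True,
    max_features=0.5,         # each tree sees 10 of the 20 features
    random_state=42,
)

# Random Patches: sample both the training instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,
    max_samples=0.7,
    bootstrap_features=True,
    max_features=0.5,
    random_state=42,
)

subspaces_clf.fit(X, y)
patches_clf.fit(X, y)

# estimators_features_ lists the feature indices drawn for each tree
print("features drawn for the first tree:", subspaces_clf.estimators_features_[0])
```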

### Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained using bagging, typically with `max_samples` set to the size of the training set.

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance.
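To make the link with bagging concrete, the sketch below trains a `RandomForestClassifier` next to a `BaggingClassifier` of Decision Trees that each consider a random subset of features at every split (`max_features="sqrt"`). The two are only roughly equivalent, and the dataset and hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (an assumption, used only for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A Random Forest of 200 small trees
rnd_clf = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42
)

# Roughly the same idea built by hand: bagged trees that pick the best
# split among a random subset of features at each node
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200,
    random_state=42,
)

rnd_clf.fit(X_train, y_train)
bag_clf.fit(X_train, y_train)

# Fraction of test instances on which the two ensembles agree
agreement = np.mean(rnd_clf.predict(X_test) == bag_clf.predict(X_test))
print("agreement between the two ensembles:", agreement)
```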

#### Extra-Trees

It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. A forest of such trees is called an **Extremely Randomized Trees** ensemble, or **Extra-Trees** for short.

Once again, this trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
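Scikit-Learn's `ExtraTreesClassifier` has the same interface as `RandomForestClassifier`, so the two are easy to compare. The sketch below is illustrative only: the dataset and hyperparameters are assumptions, and neither model is tuned.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset (an assumption, used only for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random Forest: searches for the best threshold for each candidate feature
rnd_clf = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42
)
# Extra-Trees: draws a random threshold for each candidate feature
extra_clf = ExtraTreesClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42
)

rnd_clf.fit(X_train, y_train)
extra_clf.fit(X_train, y_train)

rnd_acc = rnd_clf.score(X_test, y_test)
extra_acc = extra_clf.score(X_test, y_test)
print("Random Forest accuracy:", rnd_acc)
print("Extra-Trees accuracy:", extra_acc)
```

It is hard to tell in advance which of the two will perform better on a given problem, so the usual approach is to try both and compare them with cross-validation.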

#### Feature Importance

Another great property of Random Forests is that they make it easy to measure the relative importance of each feature. This is easiest to see with an example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Train a 500-tree forest on all four iris features, using every CPU core
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

# feature_importances_ holds one score per feature; the scores sum to 1
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
```

```
sepal length (cm) 0.09600708420887698
sepal width (cm) 0.023022717399856896
petal length (cm) 0.43234024077555183
petal width (cm) 0.4486299576157143
```


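In this run, petal length and petal width together account for about 88% of the total importance. Random Forests are therefore very useful when we need a quick understanding of which features matter, for example before performing **feature selection**.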
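One possible way to turn these importances into a feature selection step is Scikit-Learn's `SelectFromModel`. This is a sketch under assumptions: the `"mean"` threshold (keep features whose importance is above the average) is just one reasonable choice, not a recommendation from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris["data"], iris["target"]

# Keep only the features whose importance exceeds the mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42),
    threshold="mean",
)
X_reduced = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original features
kept = [
    name
    for name, keep in zip(iris["feature_names"], selector.get_support())
    if keep
]
print("kept features:", kept)
print("reduced shape:", X_reduced.shape)
```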

### Boosting

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. 
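The sequential idea can be sketched by hand for a regression task: each new tree is fitted on the residual errors left by the trees before it. This residual-fitting loop is in the spirit of gradient boosting (it is not AdaBoost), and the synthetic data, tree depth, and number of stages are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic data (an assumption, used only for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

predictors = []
mse_per_stage = []
residual = y.copy()

for _ in range(3):
    # Each tree is trained on what its predecessors got wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)
    predictors.append(tree)

    # Update the remaining error and record the training MSE so far
    residual = residual - tree.predict(X)
    mse_per_stage.append(float(np.mean(residual ** 2)))


def ensemble_predict(X_new):
    """Predict by summing the outputs of all the trees."""
    return sum(tree.predict(X_new) for tree in predictors)


print("training MSE after each stage:", mse_per_stage)
```

The training error can only go down or stay flat as trees are added, so in practice the number of stages is chosen using held-out data to avoid overfitting.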

We will cover two of them: Adaptive Boosting (**AdaBoost**) and **Gradient Boosting**.

### AdaBoost

Adaptive boosting