# Chapter 7: Ensemble Learning and Random Forests

If you ask thousands of random people a complex question and aggregate their answers, the aggregated answer is often better than a single expert's answer. This is called the *wisdom of the crowd*.

Similarly, aggregating the predictions of a group of predictors (here, predictors are models: classifiers or regressors) often gives better predictions than even the best individual predictor.

A group of predictors is called an **Ensemble**, thus we have **Ensemble Learning**. An Ensemble Learning algorithm is called an **Ensemble method**.

For example, we can train a group of Decision Trees on different subsets of a training set, make predictions, then select the class which gets the most "votes" from the group of trees. An ensemble of Decision Trees is called a **Random Forest** and is one of the most powerful ML algorithms available; Ensemble methods frequently win ML competitions.

This chapter covers:
 - Voting Classifiers
 - Bagging and Pasting
 - Random Patches and Random Subspaces
 - Random Forests
 - Boosting
 - Stacking

## Voting Classifiers
Suppose we have trained a few different classifiers, each one achieving around 80% accuracy: a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and so on.

A simple way to get an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. A majority voting classifier is called a *Hard Voting* classifier.

The voting classifier often achieves a higher accuracy than the best individual classifier in the ensemble.

Even if every classifier in the ensemble is a *weak learner* (one that does only slightly better than random guessing), the ensemble can still be a *strong learner* (one that achieves high accuracy), provided there are enough weak learners and they are sufficiently diverse. *Ensemble methods work best when the predictors are as independent as possible from one another*, because independent predictors tend to make different kinds of errors; one way to get there is to train them with very different algorithms.
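The effect of combining many weak voters can be checked with a short calculation. The sketch below is an illustration under an idealized assumption: every voter is independent and has the same accuracy, which classifiers trained on the same data do not satisfy. `majority_vote_accuracy` is a helper name introduced here, not a library function.

```python
from math import exp, lgamma, log


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct.

    Assumes each voter is right with probability p_correct (0 < p_correct < 1),
    independently of the others (a binomial model).
    """
    log_p, log_q = log(p_correct), log(1.0 - p_correct)
    total = 0.0
    for k in range(n_voters // 2 + 1, n_voters + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), computed in log space
        # to avoid overflow when n is large
        log_term = (lgamma(n_voters + 1) - lgamma(k + 1)
                    - lgamma(n_voters - k + 1)
                    + k * log_p + (n_voters - k) * log_q)
        total += exp(log_term)
    return total


# Each voter is right only 51% of the time; watch the majority improve with size
for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The majority's accuracy grows with the number of voters only because the voters are assumed independent; correlated errors in a real ensemble shrink this gain.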

The code below creates and trains a hard voting classifier from three diverse classifiers:

```
# Imports
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Example data (an assumption; any labeled classification dataset works)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 3 different (and diverse) classifiers
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# Ensemble method: VotingClassifier
voting_clf = VotingClassifier(estimators=[('lr', log_clf),
                                          ('rf', rnd_clf), ('svc', svm_clf)],
                              voting='hard')

# Fit each classifier (the loop also fits voting_clf) and compare test accuracy
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
```

When running this code, the `voting_clf` typically outperforms the individual classifiers at least slightly, although this is not guaranteed for every dataset or train/test split.

If all classifiers can estimate class probabilities (i.e., they have a `predict_proba()` method), then Scikit-Learn can predict the class with the highest class probability, averaged over all the individual classifiers. This is called **soft voting**. Soft voting **often achieves higher accuracy than hard voting because it gives more weight to highly confident votes**.
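To see how hard and soft voting can disagree, here is a small hand-made example. The probabilities are made up for illustration and do not come from trained models.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE instance from three classifiers.
# Columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.45, 0.55],   # classifier A: weakly prefers class 1
    [0.45, 0.55],   # classifier B: weakly prefers class 1
    [0.90, 0.10],   # classifier C: strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_vote = mean_proba.argmax()

print("hard vote:", hard_vote)
print("mean probabilities:", mean_proba)
print("soft vote:", soft_vote)
```

Two lukewarm votes for class 1 win the hard vote, but the single confident vote for class 0 pulls the averaged probability to 0.6 and wins the soft vote.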

To *use soft voting*, replace `voting = 'hard'` with `voting = 'soft'`

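Note: `SVC` does not output probabilities by default; set the hyperparameter `probability=True`. This makes `SVC` use cross-validation to estimate class probabilities (which slows down training) and adds a `predict_proba()` method.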
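Putting the two notes above together, a soft voting classifier could look like the sketch below. The `make_moons` dataset, the split, and the `random_state` values are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data (an assumption for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # probability=True is required so that SVC exposes predict_proba()
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)

soft_accuracy = accuracy_score(y_test, soft_voting_clf.predict(X_test))
# With voting="soft", the ensemble exposes the averaged class probabilities
proba = soft_voting_clf.predict_proba(X_test[:3])
print("soft voting accuracy:", soft_accuracy)
print(proba)
```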

## Bagging and Pasting
With Voting Classifiers, we used diverse (very different) training algorithms (Support vector, Logistic regression, and random forest in the above code).

Another approach is to use many of the same training algorithm (predictor), but on different subsets of the training set. 

**Bagging**, short for *bootstrap aggregating*, samples each predictor's training subset *with replacement*.

**Pasting** samples each predictor's training subset *without replacement*.

**Both bagging and pasting allow a training instance to be sampled several times across multiple predictors, but only bagging allows a training instance to be sampled several times for the same predictor (hence "with replacement")**.
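The difference between the two sampling schemes can be illustrated with NumPy alone. This is a minimal sketch; the set size and subset sizes are arbitrary values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 1000     # size of a hypothetical training set
subset_size = 1000     # number of instances drawn for one predictor

# Bagging: draw indices WITH replacement, so duplicates can occur
bagging_idx = rng.choice(n_instances, size=subset_size, replace=True)

# Pasting: draw indices WITHOUT replacement, so every index is distinct
pasting_idx = rng.choice(n_instances, size=subset_size // 2, replace=False)

n_unique_bagging = np.unique(bagging_idx).size
n_unique_pasting = np.unique(pasting_idx).size
print("bagging:", bagging_idx.size, "draws,", n_unique_bagging, "distinct instances")
print("pasting:", pasting_idx.size, "draws,", n_unique_pasting, "distinct instances")
```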

Once all predictors are trained, the ensemble makes the prediction by aggregating the individual predictions. The aggregation function is usually the *statistical mode* (the most frequent prediction just like Hard Voting), or the average for a Regression task.
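To make the train-then-aggregate procedure concrete, the sketch below builds a small bagging ensemble by hand and aggregates with the statistical mode. It is an illustration only (in practice `BaggingClassifier` does this for us); the dataset and the number of trees are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_estimators, n_train = 50, len(X_train)

# Train each tree on its own bootstrap sample (sampling with replacement)
trees = []
for i in range(n_estimators):
    idx = rng.choice(n_train, size=n_train, replace=True)
    tree = DecisionTreeClassifier(random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# One row of predictions per tree, one column per test instance
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Aggregate with the statistical mode: the most frequent label in each column
y_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])

ensemble_accuracy = accuracy_score(y_test, y_pred)
single_tree_accuracy = accuracy_score(y_test, trees[0].predict(X_test))
print("single tree:", single_tree_accuracy)
print("ensemble:   ", ensemble_accuracy)
```

For a regression task, the mode would be replaced by `all_preds.mean(axis=0)`.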

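Each individual predictor has a higher bias than if it were trained on the full training set, but aggregation reduces both bias and variance. The net result is generally an ensemble with a similar bias but a lower variance than a single predictor trained on the full training set.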

With bagging and pasting, training and prediction can both be done in parallel, on different CPU cores or even different servers, because the predictors are independent of one another. This is one reason these methods scale so well.

### Bagging and Pasting in Scikit-Learn
Scikit-Learn uses the same class, `BaggingClassifier` (or `BaggingRegressor` for regression), for both bagging and pasting.

```
# Imports
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging/Pasting class
bag_clf = BaggingClassifier(
 DecisionTreeClassifier(), n_estimators=500,
 max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
```

In the above code for the `BaggingClassifier()`:
 - `DecisionTreeClassifier()` is the base predictor
 - `n_estimators=500` trains an ensemble of 500 Decision Trees
 - `max_samples=100` means each of the 500 predictors is trained on 100 instances sampled from the training set
 - `bootstrap=True` means bagging is used (sampling with replacement)
  - `bootstrap=False` means pasting is used (sampling without replacement)
 - `n_jobs=-1` means Scikit-Learn will use all available CPU cores
 
BaggingClassifier *automatically performs soft voting if the base classifier has a predict_proba() method*.

Note: Bagging generally results in better models, but if time and CPU budget allow, it is worth evaluating both bagging and pasting with cross-validation and keeping whichever works best.
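A minimal way to run that comparison is sketched below. The dataset, the number of trees, and the 5-fold setup are assumptions made for the example, and the resulting scores say nothing about other datasets.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

scores = {}
for name, use_bootstrap in (("bagging", True), ("pasting", False)):
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=100,
        max_samples=100,          # each tree is trained on 100 instances
        bootstrap=use_bootstrap,  # True = bagging, False = pasting
        random_state=42,
    )
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```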

### Out of Bag Evaluation
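With bagging, some training instances may be sampled several times for a given predictor, while others may not be sampled at all. When each bootstrap sample is as large as the training set (the `BaggingClassifier` default), only about 63% of the training instances are sampled on average for each predictor. The remaining ~37% are that predictor's **out-of-bag (oob) instances**, and they are a different 37% for each predictor. Because a predictor never sees its own *out-of-bag* instances during training, it can be evaluated on them, and averaging these evaluations over the ensemble gives an estimate of its performance without needing a separate validation set.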
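How many instances end up out-of-bag for one predictor can be worked out directly. If a bootstrap sample draws m instances with replacement from a training set of size m, a given instance is missed by all m draws with probability (1 - 1/m)^m, which approaches exp(-1) ≈ 0.37 as m grows. The sketch below checks this numerically and with one simulated bootstrap sample; the sizes and the seed are arbitrary.

```python
import math

import numpy as np

# Probability that a given instance is never drawn in m draws with replacement
for m in (10, 100, 1000, 10_000):
    print(m, round((1 - 1 / m) ** m, 4))
print("limit exp(-1):", round(math.exp(-1), 4))

# Empirical check: one bootstrap sample of the same size as the training set
rng = np.random.default_rng(0)
m = 10_000
sample_idx = rng.choice(m, size=m, replace=True)
oob_fraction = 1 - np.unique(sample_idx).size / m
print("simulated out-of-bag fraction:", round(oob_fraction, 4))
```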

With the BaggingClassifier() use the `oob_score = True` parameter to automatically evaluate on the out-of-bag instances after training. The oob_score is recoverable with the `.oob_score_` attribute.

```
# Note the oob_score=True parameter
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500, bootstrap=True,
                            n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)    # Fitting automatically evaluates on the oob instances
bag_clf.oob_score_               # Evaluation score on the oob instances
```

The `oob_decision_function_` attribute gives the OOB decision function for each training instance. Since Decision Trees have a `predict_proba()` method, it returns class probabilities.

```
bag_clf.oob_decision_function_
```
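As a rough check that the OOB score is a reasonable stand-in for a validation score, we can compare it with the accuracy on a held-out test set. This sketch uses an assumed toy dataset and fewer trees than above to keep it quick; the two numbers should be in the same ballpark but will not match exactly.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=200,
    bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, bag_clf.predict(X_test))
print("OOB score:    ", bag_clf.oob_score_)
print("Test accuracy:", test_accuracy)

# One row per training instance, one column per class
oob_proba = bag_clf.oob_decision_function_
print(oob_proba.shape)
print(oob_proba[:3])
```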

## Random Patches and Random Subspaces
`BaggingClassifier` also supports sampling the features, not just the instances.

Feature sampling is controlled by two hyperparameters, `max_features` and `bootstrap_features`, which work the same way as `max_samples` and `bootstrap` but apply to features instead of instances. Thus, each predictor (model) is trained on a random subset of the input features.

This is particularly useful when dealing with high-dimensional inputs, like images.

Sampling both instances and features is called the **Random Patches method**.

Keeping all training instances for every predictor (`bootstrap=False` and `max_samples=1.0`) but sampling features (`bootstrap_features=True` and/or `max_features` < 1.0) is called the **Random Subspaces method**.

Note: `max_samples=1.0` means each predictor draws as many instances as there are in the training set; combined with `bootstrap=False` (no replacement), every predictor sees each training instance exactly once.

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
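The two methods differ only in how `BaggingClassifier` is configured. The sketch below shows one possible configuration of each; the synthetic 20-feature dataset and the specific ratios (0.5, 0.7) are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20 features (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=42)

# Random Subspaces: keep every instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

# Random Patches: sample both the instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

for name, clf in (("random subspaces", subspaces_clf),
                  ("random patches", patches_clf)):
    clf.fit(X, y)
    # estimators_features_ holds the feature indices drawn for each predictor
    print(name, "-> features drawn for the first predictor:",
          clf.estimators_features_[0])
```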

## Random Forests
Recall a Random Forest is an Ensemble of Decision Trees. Random Forests generally use bagging, but sometimes pasting. 

The `RandomForestClassifier()` (or `RandomForestRegressor()` for regression) is more convenient and is optimized for Decision Trees, so it is generally preferable to building a `BaggingClassifier()` around a `DecisionTreeClassifier()`.

The following code trains 500 Decision Trees, each limited to a maximum of 16 leaf nodes.

```
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
```

With a few exceptions, the RandomForestClassifier has all the hyperparameters of the DecisionTreeClassifier to control how the tree grows, and BaggingClassifier to control the ensemble itself.

Instead of searching for the very best feature when splitting a node, the Random Forest algorithm searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance and generally yields a better model overall.

The code below is "roughly equivalent to a RandomForestClassifier":

```
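# max_features="sqrt" makes each split consider a random subset of features,
# mirroring RandomForestClassifier's per-split feature sampling
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)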
```
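One way to sanity-check the "roughly equivalent" claim is to train both ensembles on the same data and count how often their predictions agree. In this sketch the bagged trees sample features at each split via `max_features="sqrt"`, which is an assumption chosen to mirror the forest's behavior; the dataset and tree counts are also assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rnd_clf = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42)

# Bagged trees that consider a random subset of features at each split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, max_samples=1.0, bootstrap=True, random_state=42)

y_pred_rf = rnd_clf.fit(X_train, y_train).predict(X_test)
y_pred_bag = bag_clf.fit(X_train, y_train).predict(X_test)

# Fraction of test instances on which both ensembles predict the same class
agreement = (y_pred_rf == y_pred_bag).mean()
rf_accuracy = accuracy_score(y_test, y_pred_rf)
bag_accuracy = accuracy_score(y_test, y_pred_bag)
print("agreement:", agreement)
print("forest accuracy:", rf_accuracy, "| bagging accuracy:", bag_accuracy)
```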

### Extra Trees
When a tree in a Random Forest is grown, only a random subset of the features is considered at each node, and the algorithm still searches for the best possible threshold for each candidate feature. We can make the trees even more random by using a random threshold for each candidate feature instead of searching for the best one. A forest of such trees is called an **Extremely Randomized Trees ensemble**, aka **"Extra Trees"**.

Once again, this trades more bias for a lower variance. **Extra Trees are also much faster to train than regular Random Forests** because finding the best possible threshold for each feature at every node is one of the most time-consuming parts of growing a tree; picking a random threshold is far cheaper.

The `ExtraTreesClassifier` and `ExtraTreesRegressor` classes have an API identical to that of their Random Forest equivalents and take largely the same hyperparameters.

*It is difficult to tell in advance whether ExtraTrees or RandomForest will perform better on a given problem; generally the only way to know is to try both and compare them with cross-validation (tuning hyperparameters with `GridSearchCV`).*
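A quick side-by-side comparison could look like the sketch below. It uses plain cross-validation with fixed hyperparameters on an assumed toy dataset, so it only shows the mechanics; a fair comparison would also tune each model.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data (an assumption for illustration)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

candidates = {
    "RandomForest": RandomForestClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
}

# Mean accuracy over 5 cross-validation folds for each candidate
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```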

### Feature Importance
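Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest. It is a weighted average, where each node's weight is equal to the number of training samples associated with it.

Scikit-Learn computes this score automatically for each feature after training and scales the results so that the importances sum to 1. They are available in the `feature_importances_` attribute.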

Below is an example on the iris dataset:

```
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

# Print each feature name next to its importance score
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
```