# Chapter 7: Ensemble Learning and Random Forests

If you put a question to a crowd of thousands of people and aggregate their answers, the aggregated answer is often better than a single expert's answer. This is called the *wisdom of the crowd*.

Similarly, if we aggregate the predictions of a group of predictors (here, predictors are models: classifiers or regressors), we often get better predictions than with the best individual predictor.

A group of predictors is called an **Ensemble**, thus we have **Ensemble Learning**. An Ensemble Learning algorithm is called an **Ensemble method**.

For example, we can train a group of Decision Trees on different subsets of a training set, make predictions, then select the class which gets the most "votes" from the group of trees. An ensemble of Decision Trees is called a **Random Forest** and is one of the most powerful ML algorithms available. Ensemble methods frequently win ML competitions.

This chapter covers:
 - Bagging
 - Boosting
 - Stacking
 - etc.

## Voting Classifiers
Suppose we have trained several different classifiers, each achieving around 80% accuracy: a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and so on.

A simple way to get an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. A majority voting classifier is called a *Hard Voting* classifier.

The voting classifier often achieves a higher accuracy than the best individual classifier in the ensemble.

Even if each classifier is a *weak learner* (only slightly better than random guessing), the ensemble can still be a *strong learner* (achieving high accuracy), provided there are enough weak learners and they are sufficiently diverse. *Ensemble methods work best when the predictors are as independent as possible from one another*, because independent predictors tend to make different kinds of errors.

The code below creates and trains a voting classifier from three diverse classifiers:
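The weak-learner claim can be checked with a little arithmetic. The sketch below assumes every classifier is correct with the same probability and that their errors are fully independent (an idealization; real classifiers trained on the same data make correlated errors, so actual gains are smaller). The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import exp, lgamma, log


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p_correct (binomial tail)."""
    total = 0.0
    for k in range(n_voters // 2 + 1, n_voters + 1):
        # Binomial pmf computed in log space to avoid overflow for large n
        log_pmf = (lgamma(n_voters + 1) - lgamma(k + 1) - lgamma(n_voters - k + 1)
                   + k * log(p_correct) + (n_voters - k) * log(1 - p_correct))
        total += exp(log_pmf)
    return total


# Each weak learner is right only 51% of the time
for n in (1, 100, 1000, 10000):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

Under these assumptions the majority vote becomes more accurate as the number of voters grows, which is why both the number of weak learners and their diversity matter.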

```
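from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: these notes never define X_train / X_test, so a toy dataset stands in
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 3 different (and diverse) classifiers
log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42)

# Ensemble method: VotingClassifier (majority vote)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

# Fit each classifier (the ensemble last) and compare test-set accuracy
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))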
```

When running this code, `voting_clf` typically outperforms each individual classifier, at least slightly, though the exact scores depend on the dataset and the random seed.

If all classifiers can estimate class probabilities (that is, they have a `predict_proba()` method), then Scikit-Learn can predict the class with the highest probability, averaged over the individual classifiers. This is called **soft voting**. Soft voting **often achieves higher accuracy than hard voting because it gives more weight to highly confident votes**.
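To see how the two voting schemes can disagree, here is a small NumPy sketch for a single instance. The probability values are made up for illustration: two classifiers lean slightly toward class 0, while one is very confident in class 1.

```python
import numpy as np

# Hypothetical P(class 1) estimates from three classifiers for one instance
proba_class1 = np.array([0.45, 0.45, 0.90])
probas = np.column_stack([1 - proba_class1, proba_class1])  # shape (3, 2)

# Hard voting: each classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_winner = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class
soft_winner = probas.mean(axis=0).argmax()

print("hard voting picks class", hard_winner)
print("soft voting picks class", soft_winner)
```

Hard voting counts two lukewarm votes against one, whereas soft voting lets the single confident classifier pull the averaged probability over to its side.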

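To *use soft voting*, replace `voting='hard'` with `voting='soft'` and make sure every classifier can estimate class probabilities.

Note: `SVC` does not output probabilities by default; it needs the hyperparameter `probability=True`. This makes `SVC` use cross-validation internally to estimate class probabilities, which slows down training.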
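A minimal sketch of the soft-voting variant follows. The moons toy dataset and the `random_state` values are assumptions made so the snippet runs on its own; they are not part of the notes above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed stand-in dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(random_state=42)),
                ('rf', RandomForestClassifier(random_state=42)),
                # probability=True gives SVC a predict_proba() method
                ('svc', SVC(probability=True, random_state=42))],
    voting='soft')
soft_voting_clf.fit(X_train, y_train)

soft_acc = soft_voting_clf.score(X_test, y_test)
avg_proba = soft_voting_clf.predict_proba(X_test[:3])  # averaged class probabilities
print("soft voting accuracy:", soft_acc)
print(avg_proba)
```

With `voting='soft'` the ensemble itself exposes `predict_proba()`, which returns the averaged probabilities used to pick the class.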

## Bagging and Pasting
With Voting Classifiers, we used diverse (very different) training algorithms (Support Vector Machine, Logistic Regression, and Random Forest in the code above).

Another approach is to use the same training algorithm for every predictor, but train each predictor on a different random subset of the training set.

**Bagging**, short for *Bootstrap aggregating*, is taking subsets of the training set *with replacement*.

**Pasting** is taking subsets of the training set *without replacement*.

**Both bagging and pasting allow a training instance to be sampled several times across multiple predictors, but only bagging allows a training instance to be sampled several times for the same predictor**.
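The difference between the two sampling schemes can be sketched with NumPy index sampling. The sizes below (100 instances, 60 per sample, 3 predictors) are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances, sample_size, n_predictors = 100, 60, 3

# Bagging: each predictor's sample is drawn WITH replacement
bagging_samples = [rng.choice(n_instances, size=sample_size, replace=True)
                   for _ in range(n_predictors)]
# Pasting: each predictor's sample is drawn WITHOUT replacement
pasting_samples = [rng.choice(n_instances, size=sample_size, replace=False)
                   for _ in range(n_predictors)]

for name, samples in (("bagging", bagging_samples), ("pasting", pasting_samples)):
    for i, idx in enumerate(samples):
        n_repeats = sample_size - len(np.unique(idx))  # repeats inside one sample
        print(f"{name} predictor {i}: {n_repeats} repeated instances")

# Under both schemes, the same instance can appear in several predictors' samples
shared = np.intersect1d(pasting_samples[0], pasting_samples[1])
print("instances shared by pasting predictors 0 and 1:", len(shared))
```

Pasting never repeats an instance within one predictor's sample, but two pasting samples can still overlap with each other.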

Once all predictors are trained, the ensemble makes a prediction by aggregating the individual predictions. The aggregation function is usually the *statistical mode* for classification (the most frequent prediction, just like Hard Voting), or the average for a Regression task.

Each individual predictor has a higher bias than if it were trained on the full training set, but aggregation reduces both bias and variance. The net result is that the ensemble generally has a similar bias but a lower variance than a single predictor trained on the full training set.

Predictors can be trained in parallel on different CPU cores or even different servers, and predictions can be made in parallel as well. This is one reason bagging and pasting scale so well.
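Both aggregation functions are short to write by hand. The sketch below uses made-up prediction arrays (rows are predictors, columns are instances), and `majority_vote` is a helper name introduced here for illustration.

```python
import numpy as np

# Hypothetical class predictions: 3 predictors (rows) x 4 instances (columns)
class_preds = np.array([[0, 1, 2, 1],
                        [0, 2, 2, 1],
                        [1, 1, 2, 0]])


def majority_vote(preds):
    """Most frequent class label per column (ties go to the lowest label)."""
    n_classes = preds.max() + 1
    # counts has shape (n_classes, n_instances)
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)


# Hypothetical regression predictions: 3 predictors x 2 instances
reg_preds = np.array([[2.0, 3.0],
                      [4.0, 5.0],
                      [6.0, 7.0]])

print(majority_vote(class_preds))  # statistical mode per instance
print(reg_preds.mean(axis=0))      # average per instance
```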
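As a rough sketch of why this parallelizes so easily, the snippet below trains each tree on its own bootstrap sample inside a thread pool and then combines the predictions by majority vote. The dataset, the ensemble size, and the use of Python's `ThreadPoolExecutor` are all assumptions made for illustration; this is a hand-rolled version, not how Scikit-Learn implements it (its ensemble estimators expose an `n_jobs` parameter for this purpose).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_estimators, sample_size = 101, 100  # odd count avoids tied votes

# Draw every bootstrap sample up front so worker threads share no RNG state
bootstrap_indices = [rng.choice(len(X_train), size=sample_size, replace=True)
                     for _ in range(n_estimators)]


def fit_tree(idx):
    """Train one tree on one bootstrap sample; no other tree is needed."""
    return DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])


with ThreadPoolExecutor(max_workers=4) as pool:
    trees = list(pool.map(fit_tree, bootstrap_indices))
    all_preds = np.array(list(pool.map(lambda t: t.predict(X_test), trees)))

# Majority vote for binary 0/1 labels
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagged_acc:.3f}")
```

Because each call to `fit_tree` depends only on its own sample, the work can be split across cores or machines without any coordination until the final aggregation step.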

### Bagging and Pasting in Scikit-Learn
