# Ensemble learning and Random Forests

## Voting Classifiers
* Train several classifiers on the training set (either the full set or subsets of it, more on that later) and combine them into an ensemble classifier, e.g. by predicting the mode (the most frequent class among the individual predictions)
* Typically performs better than the individual classifiers - this follows from the law of large numbers, provided each classifier is better than random guessing and their errors are sufficiently independent (under those conditions, the more classifiers the better)
* But training all of them on the same training set is problematic because the individual classifiers' errors tend to be correlated (not independent) - the classifiers make the same kinds of mistakes, especially when they all implement the same learning algorithm
* Regression works similarly and typically uses the mean instead of the statistical mode for the final output
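
The law-of-large-numbers argument above can be illustrated with a small simulation. This is only a sketch under an idealized assumption: every classifier is correct with the same probability (`p_correct`, a made-up value) and the classifiers err independently of each other, which real classifiers trained on the same data rarely do.

```python
import numpy as np

rng = np.random.default_rng(42)

p_correct = 0.55   # assumed accuracy of each individual classifier (hypothetical)
n_trials = 2_000   # number of simulated test instances


def majority_vote_accuracy(n_classifiers):
    """Fraction of instances on which a majority of independent classifiers is right."""
    # correct[i, j] is True when classifier j is right on instance i
    correct = rng.random((n_trials, n_classifiers)) < p_correct
    # The hard-voting ensemble is right when more than half of its members are right
    return (correct.sum(axis=1) > n_classifiers / 2).mean()


# Odd ensemble sizes avoid ties in the vote
accuracies = {n: majority_vote_accuracy(n) for n in (1, 101, 1001)}
for n, acc in accuracies.items():
    print(f"{n:>5} classifiers: majority-vote accuracy {acc:.3f}")
```

The ensemble accuracy grows with the number of voters only because the simulated errors are independent; correlated errors would shrink the gain.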

In [4]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=10_000, noise=0.4)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svc_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svc_clf)],
    voting='hard',
)

for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8225
RandomForestClassifier 0.833
SVC 0.8575
VotingClassifier 0.852


## Bagging and Pasting
* The idea here is to reduce correlation by training each predictor (all of one type) on a different random subset of the training set
* When the subsets are sampled *with replacement* we call the technique **bagging** (short for *bootstrap aggregating*; in statistics, resampling with replacement is called *bootstrapping*), when they are sampled *without replacement* we call it **pasting**
* The aggregate model should in theory generalize better - each individual predictor has a higher bias than one trained on the full training set (because it sees only a subset), but aggregation reduces both bias and variance, so the ensemble typically ends up with a similar bias and a lower variance than a single predictor trained on the full training set
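
To make the sampling and aggregation steps concrete, here is a hand-rolled sketch of bagging with decision trees. It is not how `BaggingClassifier` is implemented; the dataset size, `subset_size`, the number of trees and the hard-voting aggregation are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=2_000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

n_estimators, subset_size = 101, 100  # odd number of trees avoids voting ties
with_replacement = True               # True = bagging, False = pasting

trees = []
for _ in range(n_estimators):
    # Draw one random training subset per predictor
    idx = rng.choice(len(X_train), size=subset_size, replace=with_replacement)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregate by the mode; labels are 0/1, so the mode is 1 when the mean vote exceeds 0.5
votes = np.stack([tree.predict(X_test) for tree in trees])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

ensemble_accuracy = (y_pred == y_test).mean()
mean_single_tree_accuracy = np.mean(
    [(tree.predict(X_test) == y_test).mean() for tree in trees]
)
print("ensemble:", ensemble_accuracy)
print("average single tree:", mean_single_tree_accuracy)
```

Flipping `with_replacement` to `False` turns the same loop into pasting.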

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
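    # Base estimator passed positionally: the keyword was renamed from
    # `base_estimator` to `estimator` in newer scikit-learn releases.
    DecisionTreeClassifier(),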
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.854

### Out-of-Bag Evaluation
Because bagging samples each training subset with replacement, only about 63% of the $m$ training instances are, on average, sampled for any given predictor (when each subset has size $m$). The remaining roughly 37%, which that predictor never sees during training, are called its *out-of-bag* (OOB) instances. Note that they are not the same 37% for every predictor.
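
The 63% figure comes from the fact that the probability of an instance being drawn at least once in $m$ draws with replacement is $1 - (1 - 1/m)^m$, which approaches $1 - 1/e \approx 0.632$ as $m$ grows. A quick numerical check, with an arbitrary choice of `m`:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000  # training set size (arbitrary, for illustration)

# Probability that a given instance is drawn at least once in m draws
expected_fraction = 1 - (1 - 1 / m) ** m
limit = 1 - np.exp(-1)

# Empirical check: draw one bootstrap sample and count the distinct instances
bootstrap_sample = rng.integers(0, m, size=m)
observed_fraction = len(np.unique(bootstrap_sample)) / m

print(expected_fraction, limit, observed_fraction)
```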

The idea here is that we can use OOB instances as a free validation set: each predictor can be evaluated on its own OOB instances, which it has never seen, and it is likely that each training instance will be OOB for several estimators. *Scikit-Learn* can automatically compute the OOB score during training.

In [6]:
bag_clf = BaggingClassifier(
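    # Base estimator passed positionally: the keyword was renamed from
    # `base_estimator` to `estimator` in newer scikit-learn releases.
    DecisionTreeClassifier(),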
    n_estimators=500,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True,
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.841875

In [7]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.827

In [9]:
bag_clf.oob_decision_function_

array([[0.515625  , 0.484375  ],
       [0.        , 1.        ],
       [0.41847826, 0.58152174],
       ...,
       [0.        , 1.        ],
       [0.26842105, 0.73157895],
       [1.        , 0.        ]])
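
Each row of `oob_decision_function_` holds the class probabilities for one training instance, averaged over the estimators for which that instance was out-of-bag. As a sketch (on a smaller, seeded dataset that is an assumption of this example, not the data used above), the OOB score can be recovered by taking the most probable class per row:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1_000, noise=0.4, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator, passed positionally
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
bag.fit(X, y)

# One row per training instance, one column per class
oob_proba = bag.oob_decision_function_

# Predict the most probable class for each instance and compare with the labels
oob_pred = bag.classes_[np.argmax(oob_proba, axis=1)]
manual_oob_score = np.mean(oob_pred == y)

print(manual_oob_score, bag.oob_score_)
```

With enough estimators every instance is out-of-bag for at least one of them; with very few estimators some rows may be undefined.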

### Random Patches and Random Subspaces

Another variant (or rather extension) of a bagging classifier samples features as well as training instances.
* *Random patches* - sample both features and instances
* *Random subspaces* - sample features but keep all instances (set `bootstrap=False` and `max_samples=1.0` to keep every instance, and set `bootstrap_features=True` and/or `max_features` to less than 1.0 to sample features)

In *Scikit-Learn* one can control feature sampling with `bootstrap_features` and `max_features`, which work similarly to `bootstrap` and `max_samples`.
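
A minimal sketch of both variants follows. The synthetic dataset from `make_classification` and the 0.5 sampling fractions are assumptions made for illustration (the moons data has only two features, so feature sampling is not very interesting there).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Random patches: sample both instances and features
random_patches = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=True, max_features=0.5,
    random_state=0,
).fit(X, y)

# Random subspaces: keep all instances, sample only features
random_subspaces = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=False, max_features=0.5,
    random_state=0,
).fit(X, y)

# Feature indices seen by the first predictor of each ensemble
print(random_patches.estimators_features_[0])
print(random_subspaces.estimators_features_[0])

# Number of distinct instances seen by each random-subspaces predictor
distinct_instances = [len(set(s)) for s in random_subspaces.estimators_samples_]
```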

## Random Forests

In *Scikit-Learn* it's basically a convenience API for a bagging (sometimes pasting) classifier/regressor built from decision trees.

In [11]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

# Roughly equivalent to:
# bag_clf = BaggingClassifier(
#     DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16),
#     n_estimators=500,
#     max_samples=1.0,
#     bootstrap=True,
#     n_jobs=-1,
# )

y_pred_rf = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred_rf)

0.851
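
One way to see how close the two formulations are is to fit both and measure how often their predictions agree. This is a sketch on a smaller, seeded copy of the moons data; the sizes and seeds are assumptions, and the two models are not expected to match exactly because their randomness is drawn differently.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=2_000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=0
)
bagged_trees = BaggingClassifier(
    # Trees that consider a random subset of features at each split
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200,
    max_samples=1.0,
    bootstrap=True,
    random_state=0,
)

forest.fit(X_train, y_train)
bagged_trees.fit(X_train, y_train)

# Fraction of test instances on which both ensembles predict the same class
agreement = np.mean(forest.predict(X_test) == bagged_trees.predict(X_test))
print("agreement:", agreement)
```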

### Feature Importance

Random forests have the nice property that they can easily estimate feature importance by looking at how much the tree nodes that use a feature reduce impurity on average (over all trees in the forest, with each node weighted by the number of training samples associated with it).

In [12]:
from sklearn.datasets import load_iris

iris = load_iris()

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])

for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10262207841339653
sepal width (cm) 0.021958038468146413
petal length (cm) 0.44384344969893047
petal width (cm) 0.4315764334195266
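
The forest-level scores can be reproduced from the individual trees. As a sketch (with a seeded forest, which is an assumption of this example, so the numbers differ from the output above), averaging each tree's `feature_importances_` and normalizing the result to sum to 1 gives the forest's `feature_importances_`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris["data"], iris["target"])

# One row of importances per tree, one column per feature
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then normalize so the scores sum to 1
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()

for name, manual, builtin in zip(
    iris["feature_names"], averaged, forest.feature_importances_
):
    print(name, manual, builtin)
```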
