# Ensemble Learning and Random Forests

### Ensemble Learning

Aggregating the predictions of a group of predictors often gives better results than the best individual predictor (the wisdom of the crowd). A group of predictors is called an ensemble, so this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble Method.

Example of an Ensemble Method: a group of Decision Tree classifiers, each trained on a different random subset of the training data. To make a prediction, gather the predictions of all the individual trees and predict the class that gets the most votes. Such an ensemble of Decision Trees is known as a **Random Forest**, and it is one of the most powerful ML algorithms available today.

### Voting Classifiers

#### Hard Vote Classifier

A classifier that aggregates the predictions of several diverse classifiers and predicts the class with the most votes (**majority vote**) is called a Hard Voting Classifier. This voting classifier often achieves higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (only slightly better than random guessing), the ensemble can be a strong learner, provided there are enough weak learners and they are sufficiently diverse.

#### Why do Ensembles work well?

Due to the **Law of Large Numbers**, the more predictors there are, the higher the probability that the majority vote picks the right class. However, this holds only if the classifiers are perfectly independent and make uncorrelated errors, which is rarely the case when they are trained on the same data, since they are then likely to make the same types of errors.
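The effect can be checked with a small simulation. The sketch below is only an idealized illustration: it assumes 1,000 perfectly independent classifiers that are each right 51% of the time (both numbers are arbitrary choices), which real classifiers trained on the same data will not satisfy.

```python
import numpy as np

rng = np.random.default_rng(42)

n_trials = 2000        # number of instances to classify
n_classifiers = 1000   # independent weak classifiers (idealized assumption)
p_correct = 0.51       # each one is only slightly better than random guessing

# correct[i, j] is True when classifier j gets instance i right
correct = rng.random((n_trials, n_classifiers)) < p_correct

# The majority vote is right when more than half of the classifiers are right
majority_correct = correct.sum(axis=1) > n_classifiers / 2

single_accuracy = correct[:, 0].mean()
ensemble_accuracy = majority_correct.mean()
print(f"single classifier: {single_accuracy:.3f}")
print(f"majority vote:     {ensemble_accuracy:.3f}")
```

Under these assumptions the majority vote is right far more often than any single classifier. Correlated errors would shrink that gap.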

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [29]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [30]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                             

In [31]:
from sklearn.metrics import accuracy_score

In [32]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.888
SVC 0.896
VotingClassifier 0.904


#### Soft Voting

If all the classifiers have a predict_proba method, sklearn can predict the class with the highest probability, averaged over all the individual classifiers. This is called **Soft Voting**. It often outperforms Hard Voting because it gives more weight to highly confident votes. To use it, set voting='soft' and make sure every classifier has predict_proba. This is not the case for SVC by default: its probability hyperparameter must be set to True (which slows down training).

In [35]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')

In [36]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                             

In [37]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.904
SVC 0.896
VotingClassifier 0.92
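A toy calculation shows how the two voting schemes can disagree. The probabilities below are made up for illustration and do not come from the classifiers trained above.

```python
import numpy as np

# Hypothetical class probabilities for ONE instance from three classifiers
# (columns are class 0 and class 1)
probas = np.array([
    [0.55, 0.45],   # weakly prefers class 0
    [0.60, 0.40],   # weakly prefers class 0
    [0.05, 0.95],   # strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, mean_proba)
```

Hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the single confident classifier pulls the averaged probabilities to 0.4 and 0.6.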


## Bagging and Pasting

One way of getting a diverse set of classifiers is to use the same training algorithm for every predictor and train each one on a different random subset of the training set.

- When sampling is performed with replacement this method is called **Bagging** (Bootstrap Aggregating).
- When sampling is performed without replacement this method is called **Pasting**.

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
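The difference between the two sampling schemes can be sketched with NumPy index sampling. The toy sizes (10 instances, 8 draws per predictor) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances = 10     # size of a toy training set
max_samples = 8      # instances drawn for each predictor

# Bagging: sample WITH replacement, so an index may repeat within one predictor
bagging_idx = rng.choice(n_instances, size=max_samples, replace=True)

# Pasting: sample WITHOUT replacement, so indices are unique within one predictor
pasting_idx = rng.choice(n_instances, size=max_samples, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```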

Each individual predictor has a higher bias than if it were trained on the original dataset, but aggregation reduces both bias and variance. Usually, the ensemble has a similar bias but a lower variance than a single predictor trained on the original dataset.

The predictors can be trained in parallel on multiple CPU cores or even on multiple servers.

In [38]:
from sklearn.ensemble import BaggingClassifier  # BaggingRegressor for Regression
from sklearn.tree import DecisionTreeClassifier

In [39]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)
# Trains 500 Decision Trees, each on 100 training instances randomly sampled with replacement (bagging; set bootstrap=False for pasting).
# n_jobs is the number of CPU cores to use (-1 => all available). Soft voting is used if the base model has predict_proba.

In [40]:
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

### Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for a given predictor, whereas others may not be sampled at all. When each predictor draws as many instances as the training set contains (with replacement), only about 63% of the training instances are sampled on average for each predictor, so the remaining 37% are never seen by it. Note that they are not the same 37% for all predictors. These unsampled instances are called Out-of-Bag (oob) instances.
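The 63% figure follows from the probability that an instance is drawn at least once in m draws with replacement, 1 - (1 - 1/m)^m, which tends to 1 - 1/e as m grows. A short check, assuming a hypothetical training set of 10,000 instances:

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # training set size; each predictor draws m instances with replacement

# Probability that a given instance is drawn at least once in m draws
analytic = 1 - (1 - 1 / m) ** m

# Empirical check: fraction of distinct instances in one bootstrap sample
bootstrap_idx = rng.integers(0, m, size=m)
sampled_fraction = np.unique(bootstrap_idx).size / m

print(f"analytic:      {analytic:.4f}")
print(f"simulated:     {sampled_fraction:.4f}")
print(f"limit 1 - 1/e: {1 - np.exp(-1):.4f}")
```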

Since a predictor never sees its oob instances during training, they can be used as a validation set. In sklearn, set oob_score=True to get this evaluation automatically after training.

In [44]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,


In [45]:
bag_clf.oob_score_

0.9253333333333333

In [46]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

In [50]:
bag_clf.oob_decision_function_[:10]  # decision function which shows the probability predicted for each class

array([[0.34139785, 0.65860215],
       [0.390625  , 0.609375  ],
       [1.        , 0.        ],
       [0.0025    , 0.9975    ],
       [0.00787402, 0.99212598],
       [0.0754717 , 0.9245283 ],
       [0.35805627, 0.64194373],
       [0.06958763, 0.93041237],
       [0.94750656, 0.05249344],
       [0.82857143, 0.17142857]])

### Random Patches and Random Subspaces

BaggingClassifier also supports sampling **features**. This is controlled by the max_features and bootstrap_features hyperparameters (the counterparts of max_samples and bootstrap for instances). Each predictor is then trained on a random subset of the input features. This is useful when dealing with high-dimensional data (like images).

- Sampling both features and instances is called the **Random Patches Method**.
- Sampling only features and keeping all the instances (by setting bootstrap=False, max_samples=1.0, bootstrap_features=True and max_features to less than 1.0) is called the **Random Subspaces Method**.

Sampling features results in more predictor diversity, trading a bit more bias for lower variance.
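The two methods differ only in a few BaggingClassifier hyperparameters. The sketch below is a minimal illustration: the synthetic 20-feature dataset, the 50 trees and the sampling fractions are all arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20 features, used here only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Random Subspaces: every tree sees all instances but only half of the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,            # keep all training instances
    bootstrap_features=True, max_features=0.5,   # sample the features
    random_state=42)
subspaces_clf.fit(X_demo, y_demo)

# Random Patches: sample both instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
patches_clf.fit(X_demo, y_demo)

# estimators_features_ lists the feature indices each tree was trained on
print(subspaces_clf.estimators_features_[0])
```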

### Random Forests

Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (sometimes pasting), typically with max_samples set to the size of the training set.

Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, the RandomForestClassifier class can be used, as it is more convenient and optimized for Decision Trees.

Random Forest Hyperparameters = Decision Tree Hyperparameters (mostly) + Bagging Classifier Hyperparameters

The code below trains a Random Forest of 500 trees, each limited to a maximum of 16 leaf nodes, using all available CPU cores.

In [51]:
from sklearn.ensemble import RandomForestClassifier

In [52]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=16, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [53]:
y_pred = rnd_clf.predict(X_test)

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which again trades a higher bias for a lower variance. A roughly equivalent ensemble can be built with BaggingClassifier:

In [54]:
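# splitter=True is not a valid value; max_features='sqrt' makes each split consider a random subset of features
bag_clf = BaggingClassifier(DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16), n_estimators=500, n_jobs=-1, max_samples=1.0, bootstrap=True)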

### Extra-Trees

When growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible threshold.

A forest of such trees is called an Extremely Randomized Trees ensemble (Extra-Trees for short). Extra-Trees are usually faster to train than Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming parts of growing a tree.

Sklearn's ExtraTreesClassifier can be used for this.
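A minimal usage sketch follows. It regenerates the moons data under new variable names so that it runs on its own. The number of trees and the leaf limit are illustrative choices, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, random_state=42)

# Same API as RandomForestClassifier, but split thresholds are drawn at random
ext_clf = ExtraTreesClassifier(n_estimators=200, max_leaf_nodes=16, random_state=42)
ext_clf.fit(X_tr, y_tr)

ext_accuracy = ext_clf.score(X_te, y_te)
print(ext_accuracy)
```

Whether Extra-Trees or a Random Forest performs better on a given dataset is hard to tell in advance, so comparing both with cross-validation is a reasonable approach.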

### Feature Importance

Random Forests make it easy to measure feature importance. Sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (a weighted average across all the trees in the forest, where each node's weight is equal to the number of training samples associated with it). The scores are scaled so that they sum to 1 and are available in the feature_importances_ attribute after training.

In [55]:
from sklearn.datasets import load_iris

In [56]:
iris = load_iris()

In [59]:
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [62]:
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10783107465587839
sepal width (cm) 0.026495191952894067
petal length (cm) 0.433940281584413
petal width (cm) 0.4317334518068144


## Boosting

Boosting refers to any ensemble method that combines several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

Examples: AdaBoost and Gradient Boosting

### AdaBoost (Adaptive Boosting)

AdaBoost corrects its predecessor by paying more attention to the training instances that the predecessor underfitted, so new predictors focus more and more on the hard cases. To do this, AdaBoost maintains a relative weight for each training instance and increases the weights of the instances that the current predictor misclassifies. These weights are updated after each predictor is trained. To make a prediction, AdaBoost collects the predictions of all predictors and weighs them using predictor weights that depend on each predictor's accuracy on the weighted training set.
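One round of the weight update can be worked through by hand. The sketch below follows the standard AdaBoost formulation (weighted error rate, predictor weight, then boosted and normalized instance weights). The five labels and predictions are made-up values, and the variable names are illustrative only.

```python
import numpy as np

# Toy setup: 5 training instances with equal initial weights
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])   # the first predictor gets instance 1 wrong
weights = np.full(len(y_true), 1 / len(y_true))
learning_rate = 1.0

misclassified = y_pred != y_true

# Weighted error rate of this predictor
error_rate = weights[misclassified].sum() / weights.sum()

# Predictor weight: large when the predictor is accurate
alpha = learning_rate * np.log((1 - error_rate) / error_rate)

# Boost the weights of the misclassified instances, then normalize
weights[misclassified] *= np.exp(alpha)
weights /= weights.sum()

print("error rate:", error_rate)
print("predictor weight:", alpha)
print("updated instance weights:", weights)
```

With an error rate of 0.2 the predictor weight is log(4), and after normalization the misclassified instance carries half of the total weight (0.5) while the other four carry 0.125 each, so the next predictor concentrates on it.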

Sklearn uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function).

SAMME.R (the R stands for Real) relies on class probabilities rather than class predictions and generally performs better.

In [63]:
from sklearn.ensemble import AdaBoostClassifier  # or AdaBoostRegressor

The default base estimator is a Decision Stump (i.e. a Decision Tree with max_depth=1), and in the sklearn version used here the default algorithm is SAMME.R.

In [64]:
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, algorithm='SAMME.R', learning_rate=0.5)

In [65]:
ada_clf.fit(X, y)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                          

**To regularize an overfitting AdaBoost ensemble, reduce the number of estimators (n_estimators) or regularize the base estimator more strongly.**

### Gradient Boosting

Similar to AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one trying to correct its predecessor. However, instead of tweaking the instance weights at every iteration, this method fits the new predictor to the **Residual Errors** made by the previous predictor.

In [69]:
from sklearn.tree import DecisionTreeRegressor

In [70]:
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=2,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [71]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=2,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [72]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)  # fit the third tree on the residuals of the second, not on y2 again

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=2,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [74]:
y_pred = sum(tree.predict(X) for tree in (tree_reg1, tree_reg2, tree_reg3))

#### Directly using Sklearn

In [75]:
from sklearn.ensemble import GradientBoostingRegressor

In [77]:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=1.0, loss='ls', max_depth=2,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=3,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

The learning_rate hyperparameter scales the contribution of each tree. If it is set to a low value such as 0.1, more trees are needed in the ensemble to fit the training set, but the predictions will usually generalize better. This regularization technique is called **Shrinkage**.
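To make the role of the learning rate concrete, here is a hand-rolled boosting loop in which every tree's prediction is multiplied by the learning rate before being added to the ensemble. It is a simplified sketch for squared-error loss on a made-up noisy quadratic dataset, not a reproduction of sklearn's internal implementation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(200, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean target
train_errors = []

for _ in range(50):
    residuals = y_demo - prediction                # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # shrunken contribution
    train_errors.append(mean_squared_error(y_demo, prediction))

print(train_errors[0], train_errors[-1])
```

With a small learning rate each tree removes only a fraction of the remaining residual, so the training error falls slowly and many trees are needed.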

To find the optimal number of trees, early stopping can be used (discussed in chap 4). A simple way to implement it is the staged_predict() method.

In [78]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [82]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [83]:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=2,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=120,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

_staged_predict() returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, and so on)_

In [84]:
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimator = np.argmin(errors) + 1

In [85]:
gbrt_bst = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimator)
gbrt_bst.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=2,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=86,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

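It is also possible to actually stop training early instead of first training a large number of trees and then looking back for the optimal number. Setting warm_start=True makes sklearn keep the existing trees when fit() is called again, which allows incremental training.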

In [87]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float('inf')
error_going_up = 0

for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_err = mean_squared_error(y_val, y_pred)
    if val_err < min_val_error:
        min_val_error = val_err
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping: no improvement for 5 consecutive iterations

Sklearn's GradientBoostingRegressor also has a subsample hyperparameter, which specifies the fraction of training instances used to train each tree. For example, with subsample=0.25 each tree is trained on 25% of the training instances, selected randomly. This trades a higher bias for a lower variance and speeds up training considerably. The technique is called **Stochastic Gradient Boosting**.
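A minimal sketch of Stochastic Gradient Boosting follows. It also uses sklearn's built-in early stopping (n_iter_no_change with validation_fraction) as an alternative to the manual loop. The dataset is a made-up noisy quadratic and every hyperparameter value is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical noisy quadratic data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(300, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(300)

sgbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=500, learning_rate=0.1,
    subsample=0.25,           # each tree sees a random 25% of the training instances
    n_iter_no_change=5,       # stop when the validation score stops improving
    validation_fraction=0.1,  # share of the training set held out for that check
    random_state=42)
sgbrt.fit(X_demo, y_demo)

# n_estimators_ is the number of trees kept after early stopping
print("trees actually trained:", sgbrt.n_estimators_)
```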

### XGBoost

XGBoost, which stands for Extreme Gradient Boosting, is an optimised implementation of Gradient Boosting. It is available as a Python [library](https://github.com/dmlc/xgboost). It is very popular and is often used in winning entries of ML competitions.

In [90]:
import xgboost

In [91]:
xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

## Stacking

Stacking (Stacked Generalization) is based on a simple idea: instead of using a trivial function (like hard voting) to aggregate the predictions of all the predictors in an ensemble, train a model to perform this aggregation. The model that does the aggregation is known as a blender or meta-learner. It takes the predictions of all the models in the ensemble as inputs and outputs the final prediction.

To train the blender, a common approach is to use a **hold-out** set. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer. These predictors then make predictions on the second (held-out) subset, and those predictions serve as the input features for training the second layer (the blender), with the original target values as labels.
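The hold-out procedure can be sketched by hand as follows. The choice of base models, the 50/50 split, and the use of class-1 probabilities as blender inputs are assumptions made for this illustration. The helper layer1_features is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_rest, X_te, y_rest, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Split the training data into two subsets, one per layer
X_layer1, X_hold, y_layer1, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Layer 1: train the base predictors on the first subset only
base_models = [
    LogisticRegression(),
    RandomForestClassifier(n_estimators=100, random_state=42),
    SVC(probability=True, random_state=42),
]
for model in base_models:
    model.fit(X_layer1, y_layer1)

def layer1_features(X):
    """Use each base model's class-1 probability as one input column for the blender."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])

# Layer 2: train the blender on the base models' predictions for the held-out subset
blender = LogisticRegression()
blender.fit(layer1_features(X_hold), y_hold)

blender_accuracy = blender.score(layer1_features(X_te), y_te)
print(blender_accuracy)
```

Because the base models never saw the held-out subset, their predictions on it are "clean", which is what keeps the blender from learning to trust overfitted outputs.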

Multiple layers of blenders can also be used. The training set then has to be split into n + 1 subsets (n = number of blender layers), one for each layer.

Older sklearn releases did not support stacking directly, so third-party libraries were needed. Since version 0.22, sklearn provides the StackingClassifier and StackingRegressor classes.
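As a hedged illustration, sklearn 0.22 and later include a StackingClassifier, which trains the blender on cross-validated predictions rather than on a single hold-out set. The base models and the 5-fold setting below are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=42)

stack_clf = StackingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('svc', SVC(random_state=42))],
    final_estimator=LogisticRegression(),  # the blender
    cv=5)                                  # out-of-fold predictions train the blender
stack_clf.fit(X_tr, y_tr)

stack_accuracy = stack_clf.score(X_te, y_te)
print(stack_accuracy)
```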