# Chapter 7. Ensemble Learning and Random Forests
* **If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor**
* **A group of predictors is called an *ensemble*, this technique is called *Ensemble Learning*, and an Ensemble Learning algorithm is called an *Ensemble method***
* ***Random Forest:* You can train a group of Decision Tree classifiers, each on a different random subset of the training set, and then predict the class that gets the most votes.**
* **You will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor.**

## Voting Classifiers
* ***hard voting* classifier: A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes**
* **This voting classifier often achieves a higher accuracy than the best classifier in the ensemble**
* **Strictly, this only holds if all classifiers are perfectly independent, making uncorrelated errors**
* **Because most classifiers are trained on the same data, they are likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble's accuracy**
* **One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy**
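
As a minimal sketch of hard voting itself, independent of SK-Learn, the snippet below takes the labels predicted by three classifiers and keeps the most common label for every instance. The `hard_vote` helper and the prediction lists are made up for illustration.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label for every instance.

    predictions: one list of predicted labels per classifier, all of the
    same length (hypothetical helper, not an SK-Learn API).
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Made-up predictions of three classifiers on four instances
preds_a = [0, 1, 1, 0]
preds_b = [0, 1, 0, 0]
preds_c = [1, 1, 1, 0]

ensemble_pred = hard_vote([preds_a, preds_b, preds_c])
print(ensemble_pred)  # [0, 1, 1, 0]
```

On the third instance one classifier disagrees, but the other two outvote it, which is the mechanism that lets the ensemble correct individual mistakes when errors are not shared.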

**The following code creates and trains a voting classifier in SK-Learn, composed of three diverse classifiers**

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)



In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [2]:
log_clf=LogisticRegression()
rnd_clf=RandomForestClassifier()
svm_clf=SVC()

In [3]:
voting_clf=VotingClassifier(
    estimators=[('lr',log_clf),('rf',rnd_clf),('svc',svm_clf)],
    voting='hard'
)

In [5]:
voting_clf.fit(X_train,y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

**Look at each classifier's accuracy on the test set**

In [7]:
from sklearn.metrics import accuracy_score
for clf in (log_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    print(clf.__class__.__name__,accuracy_score(y_test,y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


***soft voting:* If all classifiers are able to estimate class probabilities, then you can tell SK-Learn to predict the class with the highest class probability, averaged over all the individual classifiers**
* **Often achieves higher performance than hard voting because it gives more weight to highly confident votes**
* **All you need to do is replace** voting="hard" **with** voting="soft" **and ensure that all classifiers can estimate class probabilities, i.e., that they all have a** predict_proba() **method**
* **The** SVC **class does not estimate class probabilities by default, so you need to set its** probability **hyperparameter to** True
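
To see why soft voting can disagree with hard voting, here is a small sketch with made-up class probabilities for a single instance. The numbers are illustrative assumptions, not outputs of the classifiers trained above.

```python
import numpy as np

# Made-up class probabilities [P(class 0), P(class 1)] from three classifiers
probas = np.array([
    [0.55, 0.45],   # weakly prefers class 0
    [0.55, 0.45],   # weakly prefers class 0
    [0.10, 0.90],   # strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_pred = np.bincount(hard_votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_pred = mean_proba.argmax()

print("hard:", hard_pred)   # hard: 0
print("soft:", soft_pred)   # soft: 1
```

The two hesitant votes win under hard voting, while the single confident vote pulls the averaged probability of class 1 up to 0.6 under soft voting.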

## Bagging and Pasting
* **Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set**
* **When sampling is performed *with* replacement, this method is called *bagging* (short for *bootstrap aggregating*)**
* **When sampling is performed *without* replacement, it is called *pasting***
* **Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.**
* **Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors**
* **The aggregation function is typically the *statistical mode* for classification (i.e., the most frequent prediction, just like a hard voting classifier), or the average for regression**
* **Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set**
* **Predictors can all be trained in parallel, via different CPU cores or even different servers.**
* **Similarly, predictions can be made in parallel, which is why bagging and pasting scale very well**
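
The difference between the two sampling schemes can be sketched with NumPy alone. The toy sizes and the made-up votes below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10   # size of a toy training set
sample_size = 6    # instances drawn for each predictor

# Bagging: sample WITH replacement, so an index may repeat within one sample
bagging_idx = rng.choice(n_instances, size=sample_size, replace=True)

# Pasting: sample WITHOUT replacement, so every index in a sample is unique
pasting_idx = rng.choice(n_instances, size=sample_size, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))

# Aggregation for classification: the statistical mode of the predictions
votes = np.array([1, 0, 1, 1, 0])   # made-up predictions of five predictors
mode = np.bincount(votes).argmax()
print("mode:", mode)   # mode: 1
```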

### Bagging and Pasting in Scikit-Learn
* **SK-Learn offers a simple API for both bagging and pasting with the** BaggingClassifier **class (or** BaggingRegressor **for regression).**
* **For bagging, use** bootstrap=True. **If you want to use pasting instead, set** bootstrap=False
* **The** n_jobs **parameter tells SK-Learn the number of CPU cores to use for training and predictions (-1 tells SK-Learn to use all available cores)**
* **The** BaggingClassifier **automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a** predict_proba() **method), which is the case with Decision Tree classifiers**

In [9]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [10]:
#contains an ensemble of 500 Decision Tree classifiers
bag_clf=BaggingClassifier(
    DecisionTreeClassifier(),n_estimators=500,
    max_samples=100,bootstrap=True,n_jobs=-1
)

In [11]:
bag_clf.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1)

In [12]:
y_pred=bag_clf.predict(X_test)

In [14]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.904


* **Compared with a single Decision Tree, the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but its decision boundary is less irregular)**
* **Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble's variance is reduced**

### Out-of-Bag Evaluation
* **By default, a** BaggingClassifier **samples *m* training instances with replacement, where *m* is the size of the training set.**
* **This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% that are not sampled are called *out-of-bag* (oob) instances, and they are not the same 37% for every predictor**
* **Since a predictor never sees its oob instances during training, a bagging ensemble can be evaluated on them, without the need for a separate validation set**
* **In SK-Learn, you can set** oob_score=True **when creating a bagging classifier to request an automatic oob evaluation after training**
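
The 63% figure can be checked with a short calculation. The chance that a given instance is never picked in *m* draws with replacement is (1 - 1/m)^m, so the sampled fraction is one minus that, and it approaches 1 - 1/e as *m* grows. The sketch below only evaluates this formula for a few sizes.

```python
import math

# Probability that a given instance is picked at least once in m draws
# with replacement from a training set of size m
for m in (10, 100, 1000, 10000):
    sampled_fraction = 1 - (1 - 1 / m) ** m
    print(m, round(sampled_fraction, 4))

limit = 1 - math.exp(-1)   # value the fraction approaches as m grows
print("limit:", round(limit, 4))   # limit: 0.6321
```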

In [15]:
bag_clf=BaggingClassifier(
    DecisionTreeClassifier(),n_estimators=500,
    bootstrap=True, n_jobs=-1,oob_score=True
)

In [16]:
bag_clf.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=500,
                  n_jobs=-1, oob_score=True)

**According to this oob evaluation, this** BaggingClassifier **is likely to achieve about 89.6% accuracy on the test set**

In [17]:
bag_clf.oob_score_

0.896

In [19]:
from sklearn.metrics import accuracy_score

In [20]:
y_pred=bag_clf.predict(X_test)

In [21]:
accuracy_score(y_test,y_pred)

0.912

**The oob decision function for each training instance is also available through the** oob_decision_function_ **variable. In this case (since the base estimator has a** predict_proba() **method), the decision function returns the class probabilities for each training instance**

In [22]:
bag_clf.oob_decision_function_

array([[0.34391534, 0.65608466],
       [0.32160804, 0.67839196],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.04918033, 0.95081967],
       [0.33870968, 0.66129032],
       [0.01176471, 0.98823529],
       [0.99404762, 0.00595238],
       [0.98324022, 0.01675978],
       [0.74731183, 0.25268817],
       [0.00546448, 0.99453552],
       [0.785     , 0.215     ],
       [0.84065934, 0.15934066],
       [0.95321637, 0.04678363],
       [0.05319149, 0.94680851],
       [0.        , 1.        ],
       [0.98536585, 0.01463415],
       [0.95698925, 0.04301075],
       [0.99494949, 0.00505051],
       [0.04      , 0.96      ],
       [0.37222222, 0.62777778],
       [0.90322581, 0.09677419],
       [1.        , 0.        ],
       [0.96891192, 0.03108808],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.64044944, 0.35955056],
       [0.

## Random Patches and Random Subspaces
* **The** BaggingClassifier **class supports sampling the features as well**
* **Sampling is controlled by two hyperparameters:** max_features **and** bootstrap_features
* **They work the same way as** max_samples **and** bootstrap, **but for feature sampling instead of instance sampling**
* **Each predictor will be trained on a random subset of the input features**
* **This technique is particularly useful when you are dealing with high-dimensional inputs**
* ***Random Patches method:* Sampling both training instances and features**
* ***Random Subspaces method:* Keeping all training instances (by setting** bootstrap=False **and** max_samples=1.0 **) but sampling features (by setting** bootstrap_features=True **and/or** max_features **to a value smaller than 1.0)**
* **Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance**
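
A minimal sketch of the Random Subspaces setup follows. It assumes a synthetic 20-feature dataset from `make_classification`, since the moons data used earlier has only two features. Every tree sees all instances but only ten feature draws.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 20 features (illustration only)
X_hd, y_hd = make_classification(n_samples=300, n_features=20, random_state=42)

# Random Subspaces: keep every training instance, sample the features
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,           # all instances, no resampling
    bootstrap_features=True, max_features=0.5,  # 10 feature draws per tree
    random_state=42,
)
subspace_clf.fit(X_hd, y_hd)

# Feature indices drawn for the first tree (repeats are possible because
# features are drawn with replacement)
print(subspace_clf.estimators_features_[0])
```

Switching `bootstrap` back to `True` or lowering `max_samples` would turn this into the Random Patches method.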

## Random Forests
**A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with** max_samples **set to the size of the training set**

In [23]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf=RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1)

In [24]:
rnd_clf.fit(X_train,y_train)

RandomForestClassifier(max_leaf_nodes=16, n_estimators=500, n_jobs=-1)

In [25]:
y_pred_rf=rnd_clf.predict(X_test)

In [26]:
from sklearn.metrics import accuracy_score

In [27]:
accuracy_score(y_test,y_pred_rf)

0.912

* **With a few exceptions, a** RandomForestClassifier **has all the hyperparameters of a** DecisionTreeClassifier **(to control how trees are grown), plus all the hyperparameters of a** BaggingClassifier **to control the ensemble itself**
* **The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features**
* **This results in greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model**

In [28]:
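# Roughly equivalent to the previous RandomForestClassifier
bag_clf=BaggingClassifier(
    # "sqrt" is what "auto" meant for classifiers; recent SK-Learn releases no longer accept "auto"
    DecisionTreeClassifier(max_features="sqrt",max_leaf_nodes=16),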
    n_estimators=500,max_samples=1.0,bootstrap=True,n_jobs=-1
)

In [29]:
bag_clf.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(max_features='auto',
                                                        max_leaf_nodes=16),
                  n_estimators=500, n_jobs=-1)

In [30]:
y_pred_bc=bag_clf.predict(X_test)

In [31]:
from sklearn.metrics import accuracy_score

In [32]:
accuracy_score(y_test,y_pred_bc)

0.912

### Extra-Trees
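* **When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting.**
* **It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. A forest of such extremely random trees is called an *Extremely Randomized Trees* ensemble (*Extra-Trees* for short)**
* **This technique trades more bias for a lower variance.**
* **It also makes Extra-Trees much faster to train than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming parts of growing a tree**
* **You can create an Extra-Trees classifier using SK-Learn's** ExtraTreesClassifier. **Its API is identical to that of the** RandomForestClassifier
* **The** ExtraTreesRegressor **class has the same API as the** RandomForestRegressor
* **It is hard to tell in advance whether a** RandomForestClassifier **will perform better or worse than an** ExtraTreesClassifier. **Generally, the only way to know is to try both and compare them using cross-validation (tuning the hyperparameters using grid search)**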
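
A minimal sketch of such a comparison is shown below, assuming the same kind of moons data as earlier and 5-fold cross-validation with untuned hyperparameters. The hyperparameter values are illustrative choices, and a real comparison would tune both models first.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)

candidates = {
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
    "ExtraTreesClassifier": ExtraTreesClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
}

# Mean cross-validated accuracy of each candidate
mean_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_m, y_m, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(name, round(mean_scores[name], 3))
```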

### Feature Importance
* **Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature**
* **SK-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest).**
* **SK-Learn automatically computes this score for each feature after training, then it scales the results so that the sum of all importances is equal to 1**
* **You can access the result using the** feature_importances_ **variable**

In [33]:
from sklearn.datasets import load_iris

In [34]:
iris=load_iris()

In [36]:
rnd_clf=RandomForestClassifier(n_estimators=500,n_jobs=-1)

In [37]:
rnd_clf.fit(iris["data"],iris['target'])

RandomForestClassifier(n_estimators=500, n_jobs=-1)

In [39]:
for name,score in zip(iris["feature_names"],rnd_clf.feature_importances_):
    print(name,score)

sepal length (cm) 0.10358326325367606
sepal width (cm) 0.022052524481100653
petal length (cm) 0.4496100165010137
petal width (cm) 0.4247541957642097


**Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection**

## Boosting
***Boosting* refers to any Ensemble method that can combine several weak learners into a strong learner**<br>
**The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor**

### AdaBoost
* **A new predictor corrects its predecessor's mistakes by paying a bit more attention to the training instances that the predecessor underfitted**
* **This results in new predictors focusing more and more on the hard cases**
* **AdaBoost Algorithm**
    * **When training an AdaBoost classifier, the algorithm first trains a base classifier and uses it to make predictions on the training set**
    * **The algorithm then increases the relative weight of misclassified training instances**
    * **Then it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on**
* **Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set**
* **There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting**
* **SK-Learn uses a multiclass version of AdaBoost called *SAMME*. If the predictors can estimate class probabilities (i.e., if they have a** predict_proba() **method), SK-Learn can use a variant of SAMME called *SAMME.R*, which relies on class probabilities rather than predictions and generally performs better**
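
The reweighting step can be sketched for a single round. The notes above do not give the update formulas, so this assumes one common formulation of AdaBoost: the predictor weight is `alpha = learning_rate * log((1 - r) / r)`, where `r` is the weighted error rate, and the weights of misclassified instances are multiplied by `exp(alpha)` before all weights are normalized. The labels and predictions are made up.

```python
import numpy as np

# Made-up labels and predictions of the first classifier (one mistake, index 1)
y_true = np.array([1, 1, 0, 0, 1])
y_pred_round1 = np.array([1, 0, 0, 0, 1])

learning_rate = 1.0
weights = np.full(len(y_true), 1 / len(y_true))   # start with equal weights

misclassified = y_pred_round1 != y_true

# Weighted error rate of this predictor
r = weights[misclassified].sum() / weights.sum()

# Predictor weight: larger for more accurate predictors (assumed formulation)
alpha = learning_rate * np.log((1 - r) / r)

# Boost the misclassified instances, then normalize so the weights sum to 1
weights[misclassified] *= np.exp(alpha)
weights /= weights.sum()

print("error rate:", r)
print("predictor weight:", round(alpha, 3))
print("new instance weights:", weights)
```

After this round the single misclassified instance carries half of the total weight, so the next classifier is pushed to get it right.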

**The following code trains an AdaBoost classifier based on *200 Decision Stumps* using SK-Learn's** AdaBoostClassifier **class**
* **A Decision Stump is a Decision Tree with** max_depth=1, **i.e., a tree composed of a single decision node plus two leaf nodes**
* **This is the default base estimator for the** AdaBoostClassifier **class**
* **If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator**

In [40]:
from sklearn.ensemble import AdaBoostClassifier

In [42]:
ada_clf=AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),n_estimators=200,
    algorithm="SAMME.R",learning_rate=0.5
)

### Gradient Boosting
* **Like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor**
* **Instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the *residual errors* made by the previous predictor**

In [54]:
import numpy as np
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [55]:
from sklearn.tree import DecisionTreeRegressor

In [56]:
tree_reg1=DecisionTreeRegressor(max_depth=2)

In [57]:
tree_reg1.fit(X,y)

DecisionTreeRegressor(max_depth=2)

**Train a second** DecisionTreeRegressor **on the residual errors made by the first predictor**

In [58]:
y2=y-tree_reg1.predict(X)

In [59]:
tree_reg2=DecisionTreeRegressor(max_depth=2)

In [60]:
tree_reg2.fit(X,y2)

DecisionTreeRegressor(max_depth=2)

**Train a third regressor on the residual errors made by the second predictor**

In [61]:
y3=y2-tree_reg2.predict(X)

In [62]:
tree_reg3=DecisionTreeRegressor(max_depth=3)

In [63]:
tree_reg3.fit(X,y3)

DecisionTreeRegressor(max_depth=3)

**We now have an ensemble containing three trees. It can make predictions on a new instance by simply adding up the predictions of all the trees**

In [65]:
X_new = np.array([[0.8]])

In [66]:
y_pred=sum(tree.predict(X_new) for tree in [tree_reg1,tree_reg2,tree_reg3])

In [67]:
y_pred

array([0.75026781])

* **The ensemble's predictions gradually get better as trees are added to the ensemble**
* **A simpler way to train such an ensemble is to use SK-Learn's** GradientBoostingRegressor **class. It has hyperparameters to control the growth of Decision Trees (e.g.,** max_depth, min_samples_leaf **), as well as hyperparameters to control the ensemble training, such as the number of trees (** n_estimators **)**
* **Shrinkage: The** learning_rate **parameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better**
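
The manual three-tree procedure and the effect of shrinkage can be sketched as a loop. This is a simplified illustration and not SK-Learn's implementation, which among other things starts from an initial constant prediction. The helper names `fit_boosted_trees` and `predict_boosted` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic data, similar in spirit to the dataset used above
rng = np.random.default_rng(42)
X_gb = rng.random((100, 1)) - 0.5
y_gb = 3 * X_gb[:, 0] ** 2 + 0.05 * rng.standard_normal(100)


def fit_boosted_trees(X, y, n_trees, learning_rate, max_depth=2):
    """Fit each new tree to the residual errors left by the previous ones."""
    trees, residual = [], y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        # Shrinkage: only a fraction of each tree's prediction is removed
        residual = residual - learning_rate * tree.predict(X)
        trees.append(tree)
    return trees


def predict_boosted(trees, X, learning_rate):
    """Sum the (shrunken) predictions of all trees."""
    return learning_rate * sum(tree.predict(X) for tree in trees)


learning_rate = 0.1
trees_few = fit_boosted_trees(X_gb, y_gb, n_trees=3, learning_rate=learning_rate)
trees_many = fit_boosted_trees(X_gb, y_gb, n_trees=50, learning_rate=learning_rate)

mse_few = np.mean((y_gb - predict_boosted(trees_few, X_gb, learning_rate)) ** 2)
mse_many = np.mean((y_gb - predict_boosted(trees_many, X_gb, learning_rate)) ** 2)
print("training MSE with 3 trees:", mse_few)
print("training MSE with 50 trees:", mse_many)
```

With a learning rate of 0.1, three trees barely move the predictions, which is why a low learning rate needs many more trees to fit the training set.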

In [68]:
from sklearn.ensemble import GradientBoostingRegressor

In [69]:
gbrt=GradientBoostingRegressor(max_depth=2,n_estimators=3,learning_rate=1.0)

* **In order to find the optimal number of trees, you can use early stopping.**
* **A simple way to implement this is to use the** staged_predict() **method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.)**

In [70]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [71]:
X_train,X_val,y_train,y_val=train_test_split(X,y)

In [72]:
gbrt=GradientBoostingRegressor(max_depth=2,n_estimators=120)

In [73]:
gbrt.fit(X_train,y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=120)

In [74]:
errors=[mean_squared_error(y_val,y_pred)
       for y_pred in gbrt.staged_predict(X_val)]

In [75]:
best_n_estimators=np.argmin(errors)+1

In [76]:
gbrt_best=GradientBoostingRegressor(max_depth=2,n_estimators=best_n_estimators)

In [77]:
gbrt_best.fit(X_train,y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=53)

* **It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number)**
* **You can do so by setting** warm_start=True, **which makes SK-Learn keep existing trees when the** fit() **method is called, allowing incremental training**

**The following code stops training when the validation error does not improve for five iterations in a row**

In [78]:
gbrt=GradientBoostingRegressor(max_depth=2,warm_start=True)

In [79]:
min_val_error=float("inf")

In [80]:
error_going_up=0

In [81]:
for n_estimators in range(1,120):
    gbrt.n_estimators=n_estimators
    gbrt.fit(X_train,y_train)
    y_pred=gbrt.predict(X_val)
    val_error=mean_squared_error(y_val,y_pred)
    if val_error<min_val_error:
        min_val_error=val_error
        error_going_up=0
    else:
        error_going_up+=1
        if error_going_up==5:
            break #early stopping
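
As an alternative sketch, `GradientBoostingRegressor` also exposes the `n_iter_no_change` and `validation_fraction` hyperparameters, which let it hold out part of the training data and stop by itself. The stopping rule is not identical to the loop above (it also depends on the `tol` hyperparameter), and the data here is synthetic and chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic noisy quadratic data (illustration only)
rng = np.random.default_rng(42)
X_es = rng.random((200, 1)) - 0.5
y_es = 3 * X_es[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

gbrt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,          # upper bound on the number of trees
    validation_fraction=0.2,   # share of the data held out internally
    n_iter_no_change=5,        # stop after 5 iterations without improvement
    random_state=42,
)
gbrt_es.fit(X_es, y_es)

# Number of trees actually trained before stopping
print(gbrt_es.n_estimators_)
```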

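**Stochastic Gradient Boosting:** GradientBoostingRegressor **also supports a** subsample **hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, with** subsample=0.25, **each tree is trained on 25% of the training instances, selected randomly**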
* **This technique trades a higher bias for a lower variance. It also speeds up training considerably**
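
A minimal sketch of Stochastic Gradient Boosting, assuming synthetic quadratic data and illustrative hyperparameter values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic noisy quadratic data (illustration only)
rng = np.random.default_rng(42)
X_sgb = rng.random((200, 1)) - 0.5
y_sgb = 3 * X_sgb[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

# Each of the 100 trees is trained on a random 25% of the training instances
gbrt_sgb = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42
)
gbrt_sgb.fit(X_sgb, y_sgb)

print(gbrt_sgb.predict(X_sgb[:3]))
```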

**An optimized implementation of Gradient Boosting is available in the popular Python library XGBoost, which stands for Extreme Gradient Boosting.**
* **The interface is similar to SK-Learn**

In [88]:
import xgboost

In [89]:
xgb_reg=xgboost.XGBRegressor()

In [90]:
xgb_reg.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [91]:
y_pred=xgb_reg.predict(X_val)

**XGBoost can automatically take care of early stopping**

In [92]:
xgb_reg.fit(X_train,y_train,
           eval_set=[(X_val,y_val)],early_stopping_rounds=2)

[0]	validation_0-rmse:0.21083
Will train until validation_0-rmse hasn't improved in 2 rounds.
[1]	validation_0-rmse:0.14897
[2]	validation_0-rmse:0.11373
[3]	validation_0-rmse:0.09167
[4]	validation_0-rmse:0.07759
[5]	validation_0-rmse:0.06909
[6]	validation_0-rmse:0.06445
[7]	validation_0-rmse:0.06278
[8]	validation_0-rmse:0.06152
[9]	validation_0-rmse:0.06114
[10]	validation_0-rmse:0.06077
[11]	validation_0-rmse:0.06146
[12]	validation_0-rmse:0.06154
Stopping. Best iteration:
[10]	validation_0-rmse:0.06077



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [93]:
y_pred=xgb_reg.predict(X_val)

In [94]:
y_pred

array([ 0.1885474 ,  0.5384966 , -0.01354021,  0.58057415,  0.1885474 ,
        0.0152958 ,  0.67034435,  0.0152958 ,  0.12898493,  0.1885474 ,
        0.0152958 ,  0.1643205 ,  0.5021704 ,  0.15496236,  0.28682178,
        0.55297387,  0.47523323,  0.35735914,  0.68745124,  0.47523323,
        0.57613474,  0.12898493,  0.33966637,  0.1885474 ,  0.1643205 ],
      dtype=float32)

## Stacking
* **Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why not train a model to perform this aggregation? This final model is called a *blender*, or *meta learner***
* **To train the blender, a common approach is to use a hold-out set**
    1. **First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer**
    1. **Next, the first layer's predictors are used to make predictions on the second (held-out) subset. The predictors never saw these instances during training, so the predictions are "clean"**
    1. **We can create a new training set using these predicted values as input features, and keeping the target values**
    1. **The blender is trained on this new training set, so it learns to predict the target value, given the first layer's predictions**
    1. **It is possible to train several different blenders this way, to get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer)**
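
The hold-out procedure with a single blender can be sketched by hand as follows. The choice of first-layer classifiers, the Logistic Regression blender, the 50/50 split and the helper name `first_layer_predictions` are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_s, y_s = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_rest, X_test_s, y_rest, y_test_s = train_test_split(X_s, y_s, random_state=42)

# Step 1: split the training set into a first-layer subset and a hold-out subset
X_layer1, X_holdout, y_layer1, y_holdout = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

# Train the first-layer predictors on the first subset only
first_layer = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    SVC(random_state=42),
    LogisticRegression(),
]
for clf in first_layer:
    clf.fit(X_layer1, y_layer1)


def first_layer_predictions(X):
    """Stack each predictor's output as one input feature for the blender."""
    return np.column_stack([clf.predict(X) for clf in first_layer])


# Steps 2-4: predict on the hold-out subset and train the blender on the result
blender = LogisticRegression()
blender.fit(first_layer_predictions(X_holdout), y_holdout)

# New instances pass through the first layer, then through the blender
y_pred_stack = blender.predict(first_layer_predictions(X_test_s))
stack_accuracy = (y_pred_stack == y_test_s).mean()
print("stacking accuracy:", stack_accuracy)
```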