## Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about $80$% accuracy. You may have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more (see figure below).

![](training_diverse_classifiers.png)

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a *hard voting* classifier (see figure below).

![](hard_voting_classifier_predictions.png)

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a *weak learner* (meaning it does only slightly better than random guessing), the ensemble can still be a *strong learner* (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.

How is this possible? The following analogy can help shed some light on this mystery. Suppose you have a slightly biased coin that has a $51$% chance of coming up heads, and a $49$% chance of coming up tails. If you toss it $1000$ times, you will generally get more or less $510$ heads and $490$ tails, and hence a majority of heads. If you do the math, you will find that the probability of obtaining a majority of heads after $1000$ tosses is close to $75$%. The more you toss the coin, the higher the probability (e.g., with $10000$ tosses, the probability climbs over $97$%). This is due to the *law of large numbers*: as you keep tossing the coin, the ratio of heads gets closer and closer to the probability of heads ($51$%). The figure below shows $10$ series of biased coin tosses. You can see that as the number of tosses increases, the ratio of heads approaches $51$%. Eventually all $10$ series end up so close to $51$% that they are consistently above $50$%.

![](law_of_large_numbers.png)
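The coin-toss probabilities can be checked directly from the binomial distribution. The following is a minimal sketch using only the standard library; the helper name `prob_majority_heads` is made up for this illustration. It computes the probability of a strict majority of heads, so the figure for $1000$ tosses shifts slightly depending on whether an exact tie is counted as a majority.

```python
import math

def prob_majority_heads(n_tosses: int, p_heads: float) -> float:
    """Probability of strictly more heads than tails in n_tosses independent tosses."""
    log_p, log_q = math.log(p_heads), math.log(1.0 - p_heads)
    total = 0.0
    # Sum the binomial probabilities of every head count above n_tosses / 2.
    # The terms are computed in log space to avoid overflow for large n_tosses.
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        log_pmf = (math.lgamma(n_tosses + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_tosses - k + 1)
                   + k * log_p + (n_tosses - k) * log_q)
        total += math.exp(log_pmf)
    return total

for n in (1_000, 10_000):
    print(n, round(prob_majority_heads(n, 0.51), 3))
```

The probability grows with the number of tosses, which is the same effect the figure shows for the running ratio of heads.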

Similarly, suppose you build an ensemble containing $1000$ classifiers that are individually correct only $51$% of the time (barely better than random guessing). If you predict the majority voted class, you can hope for up to $75$% accuracy! However, this is only true if all the classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case since they are trained on the same data. They are likely to make the same type of errors, so there will be many majority votes for the wrong class, reducing the ensemble's accuracy.

Ensemble methods work best when the predictors are as independent
from one another as possible. One way to get diverse classifiers
is to train them using very different algorithms. This increases the
chance that they will make very different types of errors, improving
the ensemble’s accuracy.
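To see why independence matters, here is a small simulation sketch. It is not a model of any real classifier: `shared_fraction` is a hypothetical knob that makes all classifiers succeed or fail together on a given fraction of instances, which is one crude way to mimic correlated errors.

```python
import random

def ensemble_accuracy(n_classifiers, n_instances, p_correct, shared_fraction, rng):
    """Estimate majority-vote accuracy by simulation.

    shared_fraction is the probability that, on a given instance, every
    classifier copies one shared outcome instead of voting independently.
    """
    n_right = 0
    for _ in range(n_instances):
        if rng.random() < shared_fraction:
            # Fully correlated case: all classifiers are right or wrong together.
            correct_votes = n_classifiers if rng.random() < p_correct else 0
        else:
            # Independent case: each classifier is right with probability p_correct.
            correct_votes = sum(rng.random() < p_correct for _ in range(n_classifiers))
        if correct_votes > n_classifiers / 2:
            n_right += 1
    return n_right / n_instances

rng = random.Random(42)
independent = ensemble_accuracy(1001, 1000, 0.51, 0.0, rng)
correlated = ensemble_accuracy(1001, 1000, 0.51, 0.5, rng)
print(f"independent: {independent:.3f}  correlated: {correlated:.3f}")
```

Under these assumptions the majority vote gains much less when half of the instances produce fully correlated votes.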

The following code creates and trains a voting classifier in Scikit-Learn, composed of three diverse classifiers (the training set is the moons dataset):

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [3]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

In [4]:
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

Let's look at each classifier's accuracy on the test set:

In [5]:
from sklearn.metrics import accuracy_score

In [6]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.896
VotingClassifier 0.888


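In this run, the voting classifier ($88.8$%) outperforms Logistic Regression and the Random Forest but lands just below the SVC ($89.6$%). Results vary from run to run because the Random Forest is not seeded, and a voting classifier often, though not always, beats every individual classifier.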

If all classifiers are able to estimate class probabilities (that is, they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called *soft voting*. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting = "hard" with voting = "soft" and ensure that all classifiers can estimate class probabilities. This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method). If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over $91.2$% accuracy!
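The difference between the two voting schemes is easy to see on a single instance. The sketch below uses made-up probability estimates from three classifiers (they are not outputs of the models trained above) and computes both the hard and the soft vote by hand.

```python
from collections import Counter

# Hypothetical estimates [P(class 0), P(class 1)] from three classifiers
# for one instance.
probas = [
    [0.55, 0.45],
    [0.60, 0.40],
    [0.10, 0.90],
]

# Hard voting: each classifier votes for its most likely class; the majority wins.
votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the highest average.
n_classes = len(probas[0])
avg = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
soft_prediction = max(range(n_classes), key=avg.__getitem__)

print("hard:", hard_prediction, "soft:", soft_prediction, "averages:", avg)
```

Two classifiers lean weakly toward class 0 while the third is very confident in class 1, so the hard vote picks class 0 and the soft vote picks class 1.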

## Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called *bagging* (short for bootstrap aggregating). When sampling is performed without replacement, it is called *pasting*.

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor. This sampling and training process is represented in the figure below.

![](pasting_bagging_training_set_sampling_training.png)

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the *statistical mode* (that is, the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
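The two sampling schemes and the mode aggregation can be sketched with the standard library alone. The index list and the predictor outputs below are placeholders invented for the illustration.

```python
import random
from statistics import mode

rng = random.Random(0)
training_indices = list(range(10))   # stand-in for a 10-instance training set
subset_size = 6

# Bagging: sample WITH replacement, so an index may repeat within one subset.
bagging_subset = rng.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every index in a subset is distinct.
pasting_subset = rng.sample(training_indices, k=subset_size)

# Aggregation for classification: the statistical mode of the predictors' outputs.
predictions = [1, 0, 1, 1, 0]   # hypothetical outputs of five predictors
ensemble_prediction = mode(predictions)

print(bagging_subset, pasting_subset, ensemble_prediction)
```

In both schemes a fresh subset is drawn for every predictor, so the same instance can still appear in the subsets of several predictors.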

As you can see in the figure above, predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel.
This is one of the reasons why bagging and pasting are such popular methods: they scale very well.


## Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for regression). The following code trains an ensemble of $500$ Decision Tree classifiers, each trained on $100$ training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap = False). The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [8]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

The BaggingClassifier automatically performs soft voting
instead of hard voting if the base classifier can estimate class probabilities
(i.e., if it has a predict_proba() method), which is the case
with Decision Tree classifiers.

The figure below compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of $500$ trees (from the preceding code), both trained on the moons dataset. As you can see, the ensemble's predictions will likely generalize much better than the single Decision Tree's predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular).

![](single_decision_tree_versus_bagging_ensemble_500_trees.png)

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best.

## Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default a BaggingClassifier samples $m$ training instances with replacement (bootstrap = True), where $m$ is the size of the training set. This means that only about $63$% of the training instances are sampled on average for each predictor. The remaining $37$% of the training instances that are not sampled are called *out-of-bag* (oob) instances. Note that they are not the same $37$% for all predictors.
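The $63$% figure follows from the chance that a given instance is missed by all $m$ draws, $(1 - 1/m)^m$, which approaches $e^{-1} \approx 0.37$ as $m$ grows. The sketch below compares that formula with one simulated bootstrap sample; the function name is illustrative.

```python
import math
import random

def expected_sampled_fraction(m: int) -> float:
    """Expected fraction of distinct instances in a bootstrap sample of size m."""
    # An instance is left out when all m draws miss it: (1 - 1/m) ** m.
    return 1.0 - (1.0 - 1.0 / m) ** m

rng = random.Random(42)
m = 10_000
bootstrap = [rng.randrange(m) for _ in range(m)]   # m draws with replacement
observed_fraction = len(set(bootstrap)) / m

print(expected_sampled_fraction(m), 1 - math.exp(-1), observed_fraction)
```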

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.

In Scikit-Learn you can set oob_score = True when creating a BaggingClassifier to request an automatic oob evaluation after training. The following code demonstrates this. The resulting evaluation score is available through the oob_score_ variable:

In [9]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=500,
                  n_jobs=-1, oob_score=True)

In [10]:
bag_clf.oob_score_

0.9013333333333333

According to this oob evaluation, this BaggingClassifier is likely to achieve about $90.1$% accuracy on the test set. Let's verify this:

In [11]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

Close enough!

The oob decision function for each training instance is also available through the oob_decision_function_ variable. In this case (since the base estimator has a predict_proba() method) the decision function returns the class probabilities for each training instance. For example, the oob evaluation estimates that the first training instance has a $64.62$% probability of belonging to the positive class (and $35.38$% of belonging to the negative class):

In [12]:
bag_clf.oob_decision_function_

array([[0.35384615, 0.64615385],
       [0.35233161, 0.64766839],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.10650888, 0.89349112],
       [0.39086294, 0.60913706],
       [0.02824859, 0.97175141],
       [1.        , 0.        ],
       [0.96855346, 0.03144654],
       [0.75409836, 0.24590164],
       [0.005     , 0.995     ],
       [0.78918919, 0.21081081],
       [0.86082474, 0.13917526],
       [0.95628415, 0.04371585],
       [0.05405405, 0.94594595],
       [0.        , 1.        ],
       [0.98984772, 0.01015228],
       [0.96666667, 0.03333333],
       [0.99473684, 0.00526316],
       [0.00990099, 0.99009901],
       [0.34054054, 0.65945946],
       [0.89893617, 0.10106383],
       [1.        , 0.        ],
       [0.97073171, 0.02926829],
       [0.        , 1.        ],
       [0.98984772, 0.01015228],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.69186047, 0.30813953],
       ...])

## Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

This is particularly useful when you are dealing with high-dimensional inputs (such as images). Sampling both training instances and features is called the *Random Patches method*. Keeping all the training instances (that is, bootstrap = False and max_samples = 1.0) but sampling features (that is, bootstrap_features = True and/or max_features smaller than 1.0) is called the *Random Subspaces method*.

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
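As a rough sketch of how these settings map onto BaggingClassifier, the following trains one Random Subspaces ensemble and one Random Patches ensemble. The synthetic 20-feature dataset, the ensemble size, and the sampling fractions are arbitrary choices made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 20-feature dataset, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Random Subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,            # all instances, no resampling
    max_features=0.5, bootstrap_features=True,   # 10 of 20 features, drawn with replacement
    random_state=42)
subspaces_clf.fit(X_demo, y_demo)

# Random Patches: sample both the training instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42)
patches_clf.fit(X_demo, y_demo)

# Each fitted predictor records which feature indices it was given.
print(subspaces_clf.estimators_features_[0])
```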

## Random Forests

As we have discussed, a Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees (similarly, there is a RandomForestRegressor class for regression tasks). The following code trains a Random Forest classifier with $500$ trees (each limited to a maximum of $16$ leaf nodes), using all available CPU cores:

In [13]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators = 500, max_leaf_nodes = 16, n_jobs = -1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model. The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:


In [14]:
bag_clf = BaggingClassifier(
    # max_features="sqrt": consider only a random subset of features at each split
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

## Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting (as discussed earlier). It is possible to make
trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).

A forest of such extremely random trees is simply called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this trades more bias for a
lower variance. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one
of the most time-consuming tasks of growing a tree.

You can create an Extra-Trees classifier using Scikit-Learn’s ExtraTreesClassifier
class. Its API is identical to the RandomForestClassifier class. Similarly, the
ExtraTreesRegressor class has the same API as the RandomForestRegressor class.

It is hard to tell in advance whether a RandomForestClassifier
will perform better or worse than an ExtraTreesClassifier. Generally,
the only way to know is to try both and compare them using
cross-validation (and tuning the hyperparameters using grid
search).
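A comparison along those lines might look like the sketch below. It regenerates a moons dataset so that the block is self-contained, and the number of trees and folds are arbitrary choices; no hyperparameter tuning is done here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Moons data regenerated here so the example runs on its own.
X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)

candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_scores = {name: cross_val_score(clf, X_demo, y_demo, cv=5).mean()
             for name, clf in candidates.items()}

for name, score in cv_scores.items():
    print(name, round(score, 3))
```

Which model comes out ahead depends on the dataset and the hyperparameters, so the printed scores are only meaningful for this particular setup.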

## Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by
looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each
node’s weight is equal to the number of training samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the
result using the feature_importances_ variable. For example, the following code trains a RandomForestClassifier on the iris dataset and outputs each feature’s importance. It seems that the most important features are the
petal width (45%) and petal length (43%), while sepal length and width are rather unimportant in comparison (10% and 2%, respectively).

In [15]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators = 500, n_jobs = -1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09897975576214006
sepal width (cm) 0.024563451621514045
petal length (cm) 0.42810844826272715
petal width (cm) 0.4483483443536187


Similarly, if you train a Random Forest classifier on the MNIST dataset and plot each pixel's importance, you get the image represented in the figure below.

![](mnist_pixel_importance_random_forest_classifier.png)

Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.

## Boosting

Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are *AdaBoost* (short for Adaptive Boosting) and Gradient Boosting. Let's start with AdaBoost.

## AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of the misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, the weights are updated, and so on (see figure below).

![](adaboost_sequential_training_instance_weight_updates.png)

The figure below shows the decision boundaries of five consecutive predictors on the moons dataset (in this example, each predictor is a highly regularized SVM classifier with an RBF kernel). The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on. The plot on the right represents the same sequence of predictors except that the learning rate is halved (that is, the misclassified instance weights are boosted half as much at every iteration). As you can see, this sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

![](decision_boundaries_consecutive_predictors.png)

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been
trained and evaluated. As a result, it does not scale as well as bagging or pasting.

Let's take a closer look at the AdaBoost algorithm. Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$. A first predictor is trained and its weighted error rate $r_1$ is computed on the training set (see the equation below):

$$r_j = \frac{\sum_{i=1}^{m} w^{(i)} \ (\hat{y}_{j}^{(i)} \neq y^{(i)})}{\sum_{i=1}^{m} w^{(i)}}$$

where $\hat{y}_{j}^{(i)}$ is the $j$th predictor's prediction for the $i$th instance, and the parenthesized condition counts as $1$ when it holds and $0$ otherwise.

The predictor's weight $\alpha_j$ is then computed using the below equation

$$\alpha_j = \eta \log (\frac{1-r_j}{r_j})$$

where $\eta$ is the learning rate hyperparameter (defaults to $1$). The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However, if it is most often wrong (that is, less accurate than random guessing), then its weight will be negative.

Next, the instance weights are updated using the equation below: the misclassified instances are boosted.

for $i=1,2, \dots, m$

$$w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_{j}^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{otherwise} \end{cases}$$

Then all instance weights are normalized (that is, divided by $\sum_{i=1}^{m} w^{(i)}$).

Finally, a new predictor is trained using the updated weights, and the whole process is repeated (the new predictor's weight is computed, the instance weights are updated, then another predictor is trained, and so on). The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
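One round of these updates can be worked through by hand. The sketch below applies the three equations above to a made-up set of five labels and predictions; the function name `adaboost_round` is illustrative, and it assumes the weighted error rate lies strictly between $0$ and $1$.

```python
import math

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """One AdaBoost update: returns (r_j, alpha_j, new normalized weights).

    Assumes 0 < r_j < 1; a perfect predictor would need special handling.
    """
    misclassified = [yt != yp for yt, yp in zip(y_true, y_pred)]
    # Weighted error rate r_j: weight of the mistakes over the total weight.
    r = sum(w for w, miss in zip(weights, misclassified) if miss) / sum(weights)
    # Predictor weight alpha_j.
    alpha = learning_rate * math.log((1.0 - r) / r)
    # Boost the weights of the misclassified instances, then normalize.
    boosted = [w * math.exp(alpha) if miss else w
               for w, miss in zip(weights, misclassified)]
    total = sum(boosted)
    return r, alpha, [w / total for w in boosted]

# Hypothetical labels and predictions for m = 5 instances; one mistake (index 3).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]
weights = [1 / 5] * 5

r, alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(r, alpha, new_weights)
```

With one mistake out of five equally weighted instances, the error rate is $0.2$ and $\alpha = \log 4$, so after normalization the single misclassified instance carries half of the total weight.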

To make predictions, AdaBoost simply computes the predictions of all the predictors and weights them using the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of the weighted votes (see the equation below):

$$\hat{y}(x) = \text{argmax}_{k} \sum_{j=1}^{N} \alpha_j \ (\hat{y}_j (x) = k)$$

where $N$ is the number of predictors.
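The weighted vote can be sketched in a few lines. The votes and predictor weights below are invented for the example, and `adaboost_predict` is an illustrative helper, not a Scikit-Learn function.

```python
from collections import defaultdict

def adaboost_predict(predictions, alphas):
    """Return the class with the largest total predictor weight."""
    score = defaultdict(float)
    for y_hat, alpha in zip(predictions, alphas):
        score[y_hat] += alpha      # add alpha_j to the class predictor j voted for
    return max(score, key=score.get)

# Hypothetical votes of N = 3 predictors for one instance, with their weights.
predictions = [0, 1, 1]
alphas = [1.4, 0.5, 0.6]

predicted_class = adaboost_predict(predictions, alphas)
print(predicted_class)
```

Here class 0 wins with a total weight of $1.4$ against $1.1$, even though two of the three predictors voted for class 1.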

Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally
performs better.

The following code trains an AdaBoost classifier based on 200 Decision Stumps using
Scikit-Learn’s AdaBoostClassifier class (as you might expect, there is also an
AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1, in
other words, a tree composed of a single decision node plus two leaf nodes. This is
the default base estimator for the AdaBoostClassifier class:

In [16]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

If your AdaBoost ensemble is overfitting the training set, you can
try reducing the number of estimators or more strongly regularizing
the base estimator.

## Gradient Boosting

Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Let’s go through a simple regression example using Decision Trees as the base predictors
(of course Gradient Boosting also works great with classification tasks). This is
called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let’s
fit a DecisionTreeRegressor to the training set (for example, a noisy quadratic training
set):

In [18]:
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [19]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth = 2)
tree_reg1.fit(X,y)

DecisionTreeRegressor(max_depth=2)

Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:

In [20]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X,y2)

DecisionTreeRegressor(max_depth=2)

Then we train a third regressor on the residual errors made by the second predictor:

In [22]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth = 2)
tree_reg3.fit(X,y3)

DecisionTreeRegressor(max_depth=2)

Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all trees:

In [25]:
X_new = np.array([[0.8]])

In [26]:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

The figure below represents the predictions of these three trees in the left column, and the ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree’s predictions. In the second
row, a new tree is trained on the residual errors of the first tree. On the right you can see that the ensemble’s predictions are equal to the sum of the predictions of the first
two trees. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble’s predictions gradually get better as
trees are added to the ensemble.

![](gradient_boosting.png)

A simpler way to train GBRT ensembles is to use Scikit-Learn’s
GradientBoostingRegressor class. Much like the RandomForestRegressor class, it has hyperparameters to
control the growth of Decision Trees (e.g., max_depth, min_samples_leaf, and so on),
as well as hyperparameters to control the ensemble training, such as the number of
trees (n_estimators). The following code creates the same ensemble as the previous
one:

In [27]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as $0.1$, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called *shrinkage*. The figure below shows two GBRT ensembles trained with a low learning rate: the one on the left does not have enough trees to fit the training set, while the one on the right has too many trees and overfits the training set.

![](gbrt_ensembles_not_enough_predictors.png)

In order to find the optimal number of trees, you can use early stopping. A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with $120$ trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

In [28]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

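# staged_predict() yields the ensemble's predictions after 1, 2, ..., 120 trees
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]

# argmin is 0-based, while the first stage already uses one tree
bst_n_estimators = np.argmin(errors) + 1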

gbrt_best = GradientBoostingRegressor(max_depth=2,n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=84)

The validation errors are represented on the left of the figure below, and the best model's predictions are represented on the right.

![](tuning_the_number_of_trees_early_stopping.png)

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:

In [29]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

In [30]:
print(gbrt.n_estimators)

69


In [31]:
print("Minimum validation MSE:", min_val_error)

Minimum validation MSE: 0.002750279033345716
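
As an alternative to the manual loop, GradientBoostingRegressor can also stop early on its own through its n_iter_no_change and validation_fraction hyperparameters. The sketch below regenerates a noisy quadratic dataset so that it runs on its own; the dataset size, learning rate, and patience of five iterations are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data, regenerated here so the block is self-contained.
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 1) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.randn(200)

gbrt_es = GradientBoostingRegressor(
    max_depth=2, n_estimators=500, learning_rate=0.1,
    validation_fraction=0.2,   # hold out 20% of the training data internally
    n_iter_no_change=5,        # stop after 5 iterations without improvement (beyond tol)
    random_state=42)
gbrt_es.fit(X_demo, y_demo)

# n_estimators_ is the number of trees that were actually fitted.
print(gbrt_es.n_estimators_)
```

Note that this built-in mechanism carves its validation set out of the data passed to fit(), whereas the loop above uses a validation set you split off yourself.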


The GradientBoostingRegressor class also supports a subsample hyperparameter,
which specifies the fraction of training instances to be used for training each tree. For
example, if subsample=0.25, then each tree is trained on 25% of the training instances,
selected randomly. As you can probably guess by now, this trades a higher bias
for a lower variance. It also speeds up training considerably. This technique is called
Stochastic Gradient Boosting.

It is possible to use Gradient Boosting with other cost functions.
This is controlled by the loss hyperparameter (see Scikit-Learn’s
documentation for more details).

For XGBoost, use Google Colab.

## Stacking 

Skipping for now!