# CHAPTER 7



# Ensemble Learning and Random Forests

1. If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor.
2. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

1. For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set.
2. To make predictions, you just obtain the predictions of all individual trees, then predict the class that gets the most votes.
3. Such an ensemble of Decision Trees is called a Random Forest.

Popular Ensemble methods include:
1. bagging,
2. boosting,
3. stacking, and a few others.

# Hard Voting Classifier
1. Aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.
2. Suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing). If you predict the majority-voted class, you can hope for up to 75% accuracy!
3. However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case when they are trained on the same data. They are likely to make the same types of errors, which reduces the ensemble’s accuracy.
4. Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.
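
The "up to 75%" figure can be sanity-checked with the binomial distribution. The following is a minimal sketch, assuming perfectly independent classifiers; the helper name `majority_vote_accuracy` is hypothetical, and a tie is counted here as a wrong prediction.

```python
import math

def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that strictly more than half of n independent
    classifiers (each correct with probability p_correct) are right."""
    total = 0.0
    for k in range(n_classifiers // 2 + 1, n_classifiers + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), computed in log space
        # to avoid overflow/underflow for large n
        log_term = (math.lgamma(n_classifiers + 1)
                    - math.lgamma(k + 1)
                    - math.lgamma(n_classifiers - k + 1)
                    + k * math.log(p_correct)
                    + (n_classifiers - k) * math.log(1 - p_correct))
        total += math.exp(log_term)
    return total

acc_1000 = majority_vote_accuracy(1000, 0.51)
acc_10000 = majority_vote_accuracy(10000, 0.51)
print("1,000 classifiers:", acc_1000)
print("10,000 classifiers:", acc_10000)
```

Counting a 500/500 tie as correct would push the first value slightly higher. Either way the estimate only holds under the independence assumption, which real classifiers trained on the same data do not satisfy.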


The following code creates and trains a voting classifier in Scikit-Learn, composed of
three diverse classifiers (the training set is the moons dataset):


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Note: solver="lbfgs", n_estimators=100, and gamma="scale" are the default values in recent Scikit-Learn versions, so the code below only passes n_estimators=100 explicitly.

In [12]:
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# hard voting
lg = LogisticRegression(random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
sc = SVC(random_state=42)

vc = VotingClassifier(
    estimators=[('lg', lg), ('rf', rf), ('sc', sc)],
    voting='hard')


vc.fit(X_train, y_train)

VotingClassifier(estimators=[('lg', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('sc', SVC(random_state=42))])

In [15]:
#check accuracy
from sklearn.metrics import accuracy_score
for i in (lg,rf,sc,vc):
    i.fit(X_train, y_train)
    y_pred = i.predict(X_test)
    print(i.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


# Soft Voting Classifier

1. If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting.
2. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
3. To use it, replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities.
4. This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method).
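
To see why confident votes can change the outcome, here is a small sketch with made-up probabilities for a single instance (the numbers are purely illustrative, not taken from the classifiers in this notebook):

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for ONE instance
# (columns: probability of class 0, probability of class 1)
probas = np.array([
    [0.55, 0.45],   # classifier A: weakly prefers class 0
    [0.60, 0.40],   # classifier B: weakly prefers class 0
    [0.05, 0.95],   # classifier C: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_prediction = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the class probabilities, then pick the largest
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction, "with mean probabilities", mean_probas)
```

Two lukewarm votes win under hard voting, while the single highly confident vote dominates the average under soft voting.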

In [16]:
# soft voting
lg = LogisticRegression(random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
sc = SVC(random_state=42, probability=True)

vc = VotingClassifier(
    estimators=[('lg', lg), ('rf', rf), ('sc', sc)],
    voting='soft')


vc.fit(X_train, y_train)

VotingClassifier(estimators=[('lg', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('sc', SVC(probability=True, random_state=42))],
                 voting='soft')

In [17]:
#check accuracy
from sklearn.metrics import accuracy_score
for i in (lg,rf,sc,vc):
    i.fit(X_train, y_train)
    y_pred = i.predict(X_test)
    print(i.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92


# Bagging and Pasting

1. One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed.
2. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set.
3. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).
4. When sampling is performed without replacement, it is called pasting.
5. Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
6. In statistics, resampling with replacement is called bootstrapping.
7. Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors.
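
The difference between the two sampling schemes can be illustrated on a toy index set. This is only a sketch with NumPy; the sizes (10 instances, subsets of 6) are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10   # toy training set: instance indices 0..9

# Bagging: sample WITH replacement, so an index may repeat within one subset
bagging_idx = rng.choice(n_instances, size=6, replace=True)

# Pasting: sample WITHOUT replacement, so every index in a subset is distinct
pasting_idx = rng.choice(n_instances, size=6, replace=False)

print("bagging subset:", np.sort(bagging_idx))
print("pasting subset:", np.sort(pasting_idx))
```

Drawing a fresh subset for each predictor is what lets the same instance appear in several predictors' training sets under both schemes.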

Benefits of bagging and pasting

1. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.
2. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
3. Predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel.
4. This is one of the reasons why bagging and pasting are such popular methods: they scale very well.


Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for regression).

1. The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap=False).
2. The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):


In [23]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bg = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100, 
    bootstrap=True, n_jobs=-1)
bg.fit(X_train,y_train)
y_pred = bg.predict(X_test)

The BaggingClassifier automatically performs soft voting
instead of hard voting if the base classifier can estimate class probabilities
(i.e., if it has a predict_proba() method), which is the case
with Decision Tree classifiers.

In [24]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.936


In [25]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

0.856


As you can see, the ensemble’s predictions will likely
generalize much better than the single Decision Tree’s predictions: the ensemble has a
comparable bias but a smaller variance (it makes roughly the same number of errors
on the training set, but the decision boundary is less irregular)

Bootstrapping introduces a bit more diversity in the subsets that each predictor is
trained on, so bagging ends up with a slightly higher bias than pasting, but this also
means that predictors end up being less correlated, so the ensemble’s variance is
reduced. Overall, bagging often results in better models, which explains why it is
generally preferred.

In [29]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bg = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=375, 
    bootstrap=True, n_jobs=-1)
bg.fit(X_train,y_train)
y_pred = bg.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.912


In [27]:
X_train.shape

(375, 2)

# Out-of-Bag Evaluation


1. With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all.
2. By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set.
3. This means that only about 63% of the training instances are sampled on average for each predictor.

Out-of-bag (oob) instances

4. The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors.
5. Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set.
6. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.
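
The 63% figure quoted above follows from the chance that an instance is picked at least once in m draws with replacement, 1 - (1 - 1/m)^m, which approaches 1 - 1/e as m grows. A short sketch that checks this both analytically and by simulation (m = 375 matches the training set size used in this notebook; the number of simulated bootstrap samples is an arbitrary choice):

```python
import math
import numpy as np

m = 375   # size of the training set used in this notebook

# Analytical probability that a given instance appears in a bootstrap sample
expected_fraction = 1 - (1 - 1 / m) ** m

# Simulation: draw many bootstrap samples and measure the fraction of
# distinct instances that each one contains
rng = np.random.default_rng(0)
fractions = [len(np.unique(rng.integers(0, m, size=m))) / m
             for _ in range(1000)]
simulated_fraction = float(np.mean(fractions))

print("analytical:", expected_fraction)
print("simulated: ", simulated_fraction)
print("limit 1 - 1/e:", 1 - math.exp(-1))
```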

In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to
request an automatic oob evaluation after training:



In [31]:
bg = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,oob_score=True,
    bootstrap=True, n_jobs=-1)
bg.fit(X_train,y_train)

bg.oob_score_

0.9013333333333333

In [32]:
#test accuracy
y_pred = bg.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.904


1. The oob decision function for each training instance is also available through the oob_decision_function_ variable.
2. In this case (since the base estimator has a predict_proba() method) the decision function returns the class probabilities for each training instance.

In [33]:
bg.oob_decision_function_

array([[0.35449735, 0.64550265],
       [0.34730539, 0.65269461],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.00529101, 0.99470899],
       [0.09659091, 0.90340909],
       [0.34078212, 0.65921788],
       [0.        , 1.        ],
       [0.98387097, 0.01612903],
       [0.96756757, 0.03243243],
       [0.77966102, 0.22033898],
       [0.        , 1.        ],
       [0.7761194 , 0.2238806 ],
       [0.87165775, 0.12834225],
       [0.95027624, 0.04972376],
       [0.03743316, 0.96256684],
       [0.        , 1.        ],
       [0.97969543, 0.02030457],
       [0.96      , 0.04      ],
       [0.98802395, 0.01197605],
       [0.03092784, 0.96907216],
       [0.35911602, 0.64088398],
       [0.92307692, 0.07692308],
       [1.        , 0.        ],
       [0.97927461, 0.02072539],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.63068182, 0.36931818],
       [0.

# Random Patches and Random Subspaces

1. The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features.
2. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.
3. This is particularly useful when you are dealing with high-dimensional inputs (such as images).
4. Sampling both training instances and features is called the Random Patches method.
5. Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method.
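
The two methods differ only in how the BaggingClassifier hyperparameters are set. The sketch below uses a synthetic 20-feature dataset (an assumption for the illustration, since the moons data has only two features) and small ensembles to keep it quick; the 0.5 fractions are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with enough features for feature sampling to matter
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Random Patches: sample both training instances and features
random_patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10,
    max_samples=0.5, bootstrap=True,
    max_features=0.5, bootstrap_features=False,
    random_state=42)

# Random Subspaces: keep all training instances, sample only the features
random_subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42)

for name, model in [("patches", random_patches), ("subspaces", random_subspaces)]:
    model.fit(X_demo, y_demo)
    # estimators_features_ lists the feature indices seen by each predictor
    print(name, "features per predictor:",
          {len(f) for f in model.estimators_features_})
```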


In [38]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bg = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=1.0, 
    bootstrap=False, n_jobs=-1,bootstrap_features=True,max_features=1.0 )
bg.fit(X_train,y_train)
y_pred = bg.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.848


Sampling features results in even more predictor diversity, trading a bit more bias for
a lower variance.


# Random Forests

1. Random forests provide an improvement over bagged trees by way of a small random tweak that decorrelates the trees.
2. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
3. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p; that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (4 out of the 13 for the Heart data).
4. In other words, in building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors.
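
As a small illustration of the m ≈ √p rule, the sketch below computes m for p = 13 and draws one random set of split candidates. It only mimics the candidate selection at a single split; it is not how any particular library implements it.

```python
import math
import numpy as np

p = 13                       # total number of predictors (as in the Heart data)
m = round(math.sqrt(p))      # number of split candidates per split

rng = np.random.default_rng(0)
# A fresh set of m candidate predictors would be drawn like this at every split
candidates = rng.choice(p, size=m, replace=False)

print("m =", m)
print("candidate predictor indices for this split:", np.sort(candidates))
```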

1. A Random Forest is generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set.
2. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees (similarly, there is a RandomForestRegressor class for regression tasks). The BaggingClassifier class remains useful if you want a bag of something other than Decision Trees.
3. With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.
4. There are a few notable exceptions: splitter is absent (the trees always search for the best split, but only among a random subset of features), presort is absent (forced to False), max_samples is absent in older versions (forced to 1.0), although recent Scikit-Learn releases expose it, and base_estimator is absent (forced to DecisionTreeClassifier with the provided hyperparameters).


1. The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.
2. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.

In [39]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred_rf))


0.92


The following BaggingClassifier is
roughly equivalent to the previous RandomForestClassifier:

In [43]:
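# Note: splitter="random" also randomizes the split thresholds, which is closer to
# Extra-Trees; DecisionTreeClassifier(max_features="sqrt") would mirror a Random
# Forest more closely. The settings below are the ones that produced the output shown.
bg = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1,
    bootstrap_features=True, max_features=1.0)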
bg.fit(X_train, y_train)

y_pred_rf = bg.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred_rf))

0.912


# Extra-Trees
1. When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting (as discussed earlier). It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do). A forest of such extremely random trees is simply called an Extremely Randomized Trees ensemble (or Extra-Trees for short).
2. Once again, this trades more bias for a lower variance.
3. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

You can create an Extra-Trees classifier using Scikit-Learn’s ExtraTreesClassifier
class.

It is hard to tell in advance whether a RandomForestClassifier
will perform better or worse than an ExtraTreesClassifier. Generally,
the only way to know is to try both and compare them using
cross-validation (and tuning the hyperparameters using grid
search).


In [48]:
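# Note: sklearn.tree.ExtraTreeClassifier is a single extremely randomized tree;
# the ensemble discussed above is sklearn.ensemble.ExtraTreesClassifier.
from sklearn.tree import ExtraTreeClassifier
et = ExtraTreeClassifier(max_leaf_nodes=16)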
et.fit(X_train, y_train)

y_pred_rf = et.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred_rf))

0.816
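
The cell above fits a single extremely randomized tree. To compare the two ensembles as suggested earlier, one option is a cross-validated comparison on the same moons data. This is a sketch only: the fold count and the hyperparameter values (100 trees, 16 leaf nodes) are assumptions for the illustration, and no tuning is performed.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)

models = {
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
    "ExtraTreesClassifier": ExtraTreesClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    cv_scores[name] = cross_val_score(
        model, X_moons, y_moons, cv=5, scoring="accuracy").mean()
    print(name, cv_scores[name])
```

Which of the two comes out ahead depends on the dataset and the hyperparameters, so the printed scores should be read as one data point rather than a general ranking.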


# Feature Importance
1. Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature.
2. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it.
3. You can access the result using the feature_importances_ variable.


In [51]:
from sklearn.datasets import load_iris
iris = load_iris()
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], (rf.feature_importances_*100)):
    print(name, score,'%')

sepal length (cm) 11.249225099876375 %
sepal width (cm) 2.311928828251033 %
petal length (cm) 44.10304643639577 %
petal width (cm) 42.33579963547682 %


It seems that the most important features are the
petal length (44%) and width (42%), while sepal length and width are rather
unimportant in comparison (11% and 2%, respectively).

Random Forests are very handy to get a quick understanding of what features
actually matter, in particular if you need to perform feature selection.
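
One way to turn these importances into a feature selection step is Scikit-Learn's SelectFromModel transformer. The sketch below keeps the features whose importance is at least the mean importance; that threshold and the forest size are arbitrary choices for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit a forest and keep only the features whose importance is >= the mean
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="mean")
X_reduced = selector.fit_transform(iris["data"], iris["target"])

# get_support() returns a boolean mask over the original features
kept = [name for name, keep in zip(iris["feature_names"], selector.get_support())
        if keep]
importance_sum = selector.estimator_.feature_importances_.sum()

print("kept features:", kept)
print("importances sum to:", importance_sum)
```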

# Boosting

1. Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner.
2. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

# AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention
to the training instances that the predecessor underfitted.
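
"Paying more attention" is usually implemented by increasing the weights of the misclassified instances. The sketch below runs one round of the standard AdaBoost weight update on made-up labels and predictions; the data and the learning rate of 1.0 are assumptions for the illustration, and Scikit-Learn's own implementation differs in its details.

```python
import numpy as np

# Hypothetical labels and predictions of the first predictor (one mistake)
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])

# All instances start with the same weight
weights = np.full(len(y_true), 1 / len(y_true))
learning_rate = 1.0

misclassified = y_pred != y_true

# Weighted error rate of the predictor
r = weights[misclassified].sum() / weights.sum()

# Predictor weight: the more accurate the predictor, the larger alpha
alpha = learning_rate * np.log((1 - r) / r)

# Boost the weights of the misclassified instances, then normalize
weights[misclassified] *= np.exp(alpha)
weights /= weights.sum()

print("error rate:", r)
print("predictor weight:", alpha)
print("updated instance weights:", weights)
```

The next predictor is then trained with these updated weights, so the instance the first predictor got wrong counts for much more than each of the others.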

There is one important drawback to this sequential learning technique:
it cannot be parallelized (or only partially), since each predictor
can only be trained after the previous predictor has been
trained and evaluated. As a result, it does not scale as well as bagging
or pasting.

Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (which
stands for Stagewise Additive Modeling using a Multiclass Exponential loss function).
When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the
predictors can estimate class probabilities (i.e., if they have a predict_proba()
method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands
for “Real”), which relies on class probabilities rather than predictions and generally
performs better.

In [8]:
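from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 decision stumps (max_depth=1) trained sequentially.
# Note: recent Scikit-Learn releases deprecate algorithm="SAMME.R";
# if this argument raises an error, drop it and use the default.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)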


AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

If your AdaBoost ensemble is overfitting the training set, you can
try reducing the number of estimators or more strongly regularizing
the base estimator.

# Gradient Boosting
Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost,
Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every
iteration like AdaBoost does, this method tries to fit the new predictor to the residual
errors made by the previous predictor.

Let’s go through a simple regression example using Decision Trees as the base predictors
(of course Gradient Boosting also works great with regression tasks). This is
called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT).
Let’s create a simple quadratic dataset:

In [9]:
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [11]:
#Now let's train a decision tree regressor on this dataset:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

DecisionTreeRegressor(max_depth=2, random_state=42)

In [12]:
#Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:
y2 = y- tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(max_depth=2)

In [13]:
#Then we train a third regressor on the residual errors made by the second predictor:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)


DecisionTreeRegressor(max_depth=2)

Now we have an ensemble containing three trees. It can make predictions on a new
instance simply by adding up the predictions of all the trees:

In [14]:
X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

array([0.75026781])

A simpler way to train GBRT ensembles is to use Scikit-Learn’s GradientBoostingRegressor
class. Much like the RandomForestRegressor class, it has hyperparameters to
control the growth of Decision Trees (e.g., max_depth, min_samples_leaf, and so on),
as well as hyperparameters to control the ensemble training, such as the number of
trees (n_estimators). The following code creates the same ensemble as the previous
one:


In [15]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3,
                          random_state=42)

In [16]:
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)

GradientBoostingRegressor(max_depth=2, n_estimators=200, random_state=42)

The learning_rate hyperparameter scales the contribution of each tree. If you set it
to a low value, such as 0.1, you will need more trees in the ensemble to fit the training
set, but the predictions will usually generalize better. This is a regularization technique
called shrinkage. Figure 7-10 shows two GBRT ensembles trained with a low
learning rate: the one on the left does not have enough trees to fit the training set,
while the one on the right has too many trees and overfits the training set.
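
The effect of shrinkage on the number of trees needed can be checked numerically. The sketch below rebuilds the quadratic dataset from earlier and compares the training error of two low-learning-rate ensembles; the tree counts (3 and 200) are chosen only to make the contrast obvious.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Same quadratic dataset as earlier in this section
np.random.seed(42)
X_quad = np.random.rand(100, 1) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Low learning rate with too few trees: the ensemble underfits
gbrt_few = GradientBoostingRegressor(
    max_depth=2, n_estimators=3, learning_rate=0.1, random_state=42)
gbrt_few.fit(X_quad, y_quad)

# Same learning rate with many more trees: the training set is fit closely
gbrt_many = GradientBoostingRegressor(
    max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_many.fit(X_quad, y_quad)

mse_few = mean_squared_error(y_quad, gbrt_few.predict(X_quad))
mse_many = mean_squared_error(y_quad, gbrt_many.predict(X_quad))
print("training MSE with 3 trees:  ", mse_few)
print("training MSE with 200 trees:", mse_many)
```

A low training error with many trees says nothing about generalization, which is why a validation set is needed to pick the number of trees.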

Gradient Boosting with Early Stopping

In order to find the optimal number of trees, you can use early stopping (see Chapter 4).
A simple way to implement this is to use the staged_predict() method: it
returns an iterator over the predictions made by the ensemble at each stage of training
(with one tree, two trees, etc.).

In [17]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)



GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)

In [22]:
for i in gbrt.staged_predict(X_val):
    print(i)
#it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.)

[0.25042018 0.25042018 0.25042018 0.25042018 0.25042018 0.25042018
 0.25042018 0.25042018 0.30371637 0.29018668 0.25042018 0.25042018
 0.25042018 0.25042018 0.30371637 0.25042018 0.25042018 0.25042018
 0.30371637 0.25042018 0.25042018 0.30371637 0.25042018 0.28211232
 0.29018668]
[0.23313281 0.23313281 0.23313281 0.23313281 0.23313281 0.23313281
 0.23313281 0.23313281 0.33858247 0.31049451 0.26238205 0.27072802
 0.26238205 0.23313281 0.33858247 0.23313281 0.23313281 0.23313281
 0.33858247 0.23313281 0.23313281 0.33858247 0.23313281 0.29407419
 0.31049451]
[0.21915515 0.21915515 0.21915515 0.21915515 0.21915515 0.21915515
 0.21915515 0.21915515 0.36996196 0.33446413 0.27314773 0.25675036
 0.27314773 0.21915515 0.36996196 0.21915515 0.21915515 0.21915515
 0.36996196 0.21915515 0.21915515 0.36996196 0.21915515 0.30483988
 0.33446413]
[0.20444938 0.20444938 0.20444938 0.20444938 0.20444938 0.20444938
 0.20444938 0.20444938 0.39820349 0.35135615 0.28268598 0.27364238
 0.28268598 0.20444938 

  0.67455915]
[ 0.18034942  0.18034942 -0.0276359   0.18546622  0.05225241 -0.01262025
 -0.01262025  0.0915815   0.59943821  0.58102888  0.37273007  0.28334965
  0.37273007  0.0158191   0.57840777  0.09818162  0.12274515  0.03777676
  0.57840777 -0.01262025  0.18034942  0.58231348  0.23166132  0.43585625
  0.67458507]
[ 0.18041212  0.18041212 -0.0275732   0.18552891  0.0523151  -0.01255755
 -0.01255755  0.0908784   0.59937013  0.58109157  0.37266198  0.28341234
  0.37266198  0.0158818   0.57833969  0.09747851  0.12280784  0.03783945
  0.57833969 -0.01255755  0.18041212  0.5822454   0.23095821  0.43578817
  0.67464776]
[ 0.1804092   0.1804092  -0.02757612  0.18552599  0.05231218 -0.01256048
 -0.01256048  0.09087548  0.59942281  0.58108865  0.37265906  0.28340942
  0.37265906  0.01587888  0.57839238  0.09747559  0.12280492  0.03783653
  0.57839238 -0.01256048  0.1804092   0.58489739  0.23095529  0.43290457
  0.67464484]
[ 0.18039922  0.18039922 -0.0275861   0.18551602  0.0523022  -0.0125

In [23]:
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# argmin is a 0-based index, so add 1 to get the number of trees
bst_n_estimators = int(np.argmin(errors)) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=56, random_state=42)

It is also possible to implement early stopping by actually stopping training early
(instead of training a large number of trees first and then looking back to find the
optimal number). You can do so by setting warm_start=True, which makes Scikit-Learn
keep existing trees when the fit() method is called, allowing incremental
training. The following code stops training when the validation error does not
improve for five iterations in a row:

In [24]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

In [25]:
print(gbrt.n_estimators)

61


In [26]:
print("Minimum validation MSE:", min_val_error)

Minimum validation MSE: 0.002712853325235463


# Stochastic Gradient Boosting
The GradientBoostingRegressor class also supports a subsample hyperparameter,
which specifies the fraction of training instances to be used for training each tree. For
example, if subsample=0.25, then each tree is trained on 25% of the training instances,
selected randomly. As you can probably guess by now, this trades a higher bias
for a lower variance. It also speeds up training considerably. This technique is called
Stochastic Gradient Boosting.
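
A minimal sketch of Stochastic Gradient Boosting on the quadratic dataset used earlier; the number of trees is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Same quadratic dataset as in the Gradient Boosting section
np.random.seed(42)
X_quad = np.random.rand(100, 1) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Each tree is trained on a random 25% of the training instances
sgb = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42)
sgb.fit(X_quad, y_quad)

predictions = sgb.predict(X_quad)

# With subsample < 1.0, the instances left out at each stage are used to
# track the out-of-bag improvement in loss, one value per tree
print("number of oob improvement values:", sgb.oob_improvement_.shape[0])
```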

# XGBoost
It is worth noting that an optimized implementation of Gradient Boosting is available
in the popular Python library XGBoost, which stands for Extreme Gradient Boosting.

In [30]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.6.1-py3-none-win_amd64.whl (125.4 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.6.1


In [31]:
import xgboost
xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

val_error = mean_squared_error(y_val, y_pred)
print("Validation MSE:", val_error)

Validation MSE: 0.004000408205406276


In [32]:
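# XGBoost also offers several nice features, such as automatically taking care of early stopping.
# Note: this call matches the xgboost 1.6.1 installed above; newer releases may expect
# early_stopping_rounds to be passed to the XGBRegressor constructor instead of fit().
xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)], early_stopping_rounds=2)
y_pred = xgb_reg.predict(X_val)
val_error = mean_squared_error(y_val, y_pred)
print("Validation MSE:", val_error)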



[0]	validation_0-rmse:0.22834
[1]	validation_0-rmse:0.16224
[2]	validation_0-rmse:0.11843
[3]	validation_0-rmse:0.08760
[4]	validation_0-rmse:0.06848
[5]	validation_0-rmse:0.05709
[6]	validation_0-rmse:0.05297
[7]	validation_0-rmse:0.05129
[8]	validation_0-rmse:0.05155
Validation MSE: 0.002630868681577655


In [33]:
%timeit xgboost.XGBRegressor().fit(X_train, y_train)

425 ms ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]:
%timeit GradientBoostingRegressor().fit(X_train, y_train)

119 ms ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
