## Ensemble Learning
If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will
often get better predictions than with the best individual predictor. A group of predictors
is called an **ensemble**; thus, this technique is called **Ensemble Learning**, and an
Ensemble Learning algorithm is called an **Ensemble method**.

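For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set, and aggregate their predictions. Such an ensemble of Decision Trees is called a **Random Forest**.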

Popular Ensemble methods include **bagging, boosting, and stacking**.

## Voting Classifiers

<br>

<img src="images/hard_voting_clf.jpg" width='600' />

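A **hard voting classifier** aggregates the predictions of several classifiers and predicts the class that gets the most votes. It often achieves a higher accuracy than the
best classifier in the ensemble.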

Suppose you build an ensemble containing 1,000 classifiers that are individually
correct only 51% of the time (barely better than random guessing). If you predict
the majority-voted class, you can hope for up to 75% accuracy! However, this is
only true if all classifiers are perfectly independent, making uncorrelated errors,
which is clearly not the case when they are trained on the same data. Such classifiers are likely to
make the same types of errors, so there will be many majority votes for the wrong
class, reducing the ensemble’s accuracy.
- Ensemble methods work best when the predictors are as independent
from one another as possible. One way to get diverse classifiers
is to train them using very different algorithms. This increases the
chance that they will make very different types of errors, improving
the ensemble’s accuracy.
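
The majority-vote argument above can be checked with a small calculation. The sketch below assumes perfectly independent classifiers and counts only strict majorities as correct (a tie counts as a miss); the helper name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that more than half of n independent voters are correct."""
    # Smallest number of correct votes that forms a strict majority.
    first_majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct) ** (n_classifiers - k)
        for k in range(first_majority, n_classifiers + 1)
    )


# Accuracy of the majority vote as the ensemble grows (each voter is right 51% of the time).
for n in (1, 11, 101, 1000):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The gain comes entirely from the independence assumption; correlated errors, as discussed above, shrink it.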

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier


In [3]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

In [4]:
voting_clf = VotingClassifier(estimators= [('lr', log_clf), ('rf', rnd_clf), ('svm', svm_clf)], 
                              voting='hard')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svm', SVC())])

In [5]:
# Let’s look at each classifier’s accuracy on the test set

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.896
VotingClassifier 0.904


If all classifiers are able to estimate class probabilities (i.e., they have a
predict_proba() method), then we can predict the class with the
highest class probability, averaged over all the individual classifiers. This is called **soft
voting.** <br>
It often achieves higher performance than hard voting because it gives more
weight to highly confident votes. Note that SVC does not estimate class probabilities by default; it needs to be created with probability=True to get a predict_proba() method.
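
A minimal sketch of soft voting, assuming the same moons dataset used earlier in these notes. The `random_state` values and the variable names `soft_voting_clf` and `soft_accuracy` are choices made for this example; `probability=True` is set so that the SVC exposes `predict_proba()`.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Every estimator must be able to output class probabilities for soft voting.
soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)

soft_accuracy = accuracy_score(y_test, soft_voting_clf.predict(X_test))
print("soft voting accuracy:", soft_accuracy)
```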


## Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms. <br>
Another approach is to use the same training algorithm for every predictor, but to train them on **different random subsets of the training set.**
- When sampling is performed with replacement, this method is called **bagging** (short for
bootstrap aggregating). <br>
- When sampling is performed without replacement, it is called **pasting.**

**Sampling with replacement:** The two sampled values are independent: the outcome of the first draw does not affect the probability of the outcome of the second draw. Mathematically, this means that the covariance between the two is zero.

<br>

When we **sample without replacement**, the items in the sample are dependent because the outcome of one random draw is affected by the previous draw.
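
The difference between the two sampling schemes can be seen on a toy set of ten instance indices. This is only an illustration with NumPy; the names `bagging_sample` and `pasting_sample` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
training_indices = np.arange(10)

# Bagging: draw with replacement, so the same instance may appear several times.
bagging_sample = rng.choice(training_indices, size=10, replace=True)

# Pasting: draw without replacement, so every drawn instance is distinct.
pasting_sample = rng.choice(training_indices, size=6, replace=False)

print("with replacement:   ", np.sort(bagging_sample))
print("without replacement:", np.sort(pasting_sample))
```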

<img src="images/bagging_1.png" width='700' />

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

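# The base estimator is passed positionally: its keyword was renamed from
# base_estimator to estimator in newer Scikit-Learn releases.
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500,
                            max_samples=100,
                            bootstrap=True,
                            n_jobs=-1)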

In [7]:
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [8]:
y_pred

array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)

The BaggingClassifier automatically performs soft voting
instead of hard voting if the base classifier can estimate class probabilities
(i.e., if it has a predict_proba() method), which is the case
with Decision Tree classifiers.

## Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor,
while others may not be sampled at all. <br>
The training instances that are not sampled for a given predictor are called out-of-bag (oob) instances; they are not the same instances for every predictor. <br>
Since a predictor never sees its oob instances during training, it can be evaluated on
these instances, without the need for a separate validation set. We can evaluate the
ensemble itself by averaging out the oob evaluations of each predictor.
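
How many instances end up out-of-bag? A quick simulation, assuming a bootstrap sample as large as the training set (m draws with replacement from m instances). The probability that a given instance is never drawn is (1 - 1/m)^m, which tends to exp(-1) as m grows, so roughly a third of the instances are left out for each predictor.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # size of a hypothetical training set

# One bootstrap sample: m indices drawn with replacement.
bootstrap_indices = rng.integers(0, m, size=m)

sampled_fraction = len(np.unique(bootstrap_indices)) / m
oob_fraction = 1 - sampled_fraction

# Theoretical probability that a given instance is never drawn.
expected_oob = (1 - 1 / m) ** m

print("simulated oob fraction:", oob_fraction)
print("theoretical oob fraction:", expected_oob)
```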


In [9]:
# automatic oob evaluation after training
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500,
                            bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=500,
                  n_jobs=-1, oob_score=True)

In [10]:
bag_clf.oob_score_

0.8986666666666666

In [11]:
# Let's compare the oob estimate with the accuracy on the test set

from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

In [12]:
# oob decision function for each training instance is also available through the oob_decision_function_ variable.
bag_clf.oob_decision_function_

array([[0.36945813, 0.63054187],
       [0.31638418, 0.68361582],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.10465116, 0.89534884],
       [0.38372093, 0.61627907],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [0.965     , 0.035     ],
       [0.7679558 , 0.2320442 ],
       [0.        , 1.        ],
       [0.71098266, 0.28901734],
       [0.82681564, 0.17318436],
       [0.99456522, 0.00543478],
       [0.06451613, 0.93548387],
       [0.        , 1.        ],
       [0.98882682, 0.01117318],
       [0.94565217, 0.05434783],
       [0.98857143, 0.01142857],
       [0.01098901, 0.98901099],
       [0.32085561, 0.67914439],
       [0.92105263, 0.07894737],
       [1.        , 0.        ],
       [0.98492462, 0.01507538],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.65555556, 0.34444444],
       [0.

## Random Patches and Random Subspaces
The BaggingClassifier class supports sampling the features as well. This is controlled
by two hyperparameters:
- max_features
- bootstrap_features

Thus, each predictor will be trained on a random subset of the input features.
This is particularly useful when you are dealing with high-dimensional inputs (such
as images). <br>
Sampling both training instances and features is called the **Random
Patches method**. <br>
Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the **Random Subspaces method**.
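
The two configurations described above might look as follows. This is a sketch on a synthetic 20-feature dataset generated with `make_classification`; the dataset, the sampling ratios, and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_hd, y_hd = make_classification(n_samples=300, n_features=20,
                                 n_informative=5, random_state=42)

# Random Subspaces: keep every training instance, sample only the features.
random_subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
random_subspaces_clf.fit(X_hd, y_hd)

# Random Patches: sample both the training instances and the features.
random_patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
random_patches_clf.fit(X_hd, y_hd)

# Feature indices seen by the first predictor of each ensemble.
print(random_subspaces_clf.estimators_features_[0])
print(random_patches_clf.estimators_features_[0])
```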

## Random Forests
A Random Forest is an ensemble of Decision Trees, generally
trained via the bagging method (or sometimes pasting).
Instead of building a **BaggingClassifier** and passing
it a **DecisionTreeClassifier**, you can use the **RandomForestClassifier**
class, which is more convenient and optimized for Decision Trees. (The BaggingClassifier class remains useful if you want a bag of something other than Decision Trees.)

In [13]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

With a few exceptions, a **RandomForestClassifier** has all the hyperparameters of a
**DecisionTreeClassifier** (to control how trees are grown), plus all the hyperparameters
of a **BaggingClassifier** to control the ensemble itself.
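
To make the relationship concrete, the sketch below builds a bagging ensemble that is roughly comparable to a Random Forest by letting each tree consider a random subset of features at every split (`max_features="sqrt"`). It is an approximation for illustration, not a statement about how RandomForestClassifier is implemented; the dataset and names are assumptions for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest_clf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=16,
                                    random_state=42)
forest_clf.fit(X_train, y_train)

# Bagged trees that also randomize the features considered at each split.
bagged_trees_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, random_state=42)
bagged_trees_clf.fit(X_train, y_train)

# Fraction of test instances on which the two ensembles agree.
agreement = (forest_clf.predict(X_test) == bagged_trees_clf.predict(X_test)).mean()
print("agreement between the two ensembles:", agreement)
```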

## Feature Importance
Random Forests make it easy to measure the
relative importance of each feature. Scikit-Learn measures a feature’s importance by
looking at how much the tree nodes that use that feature reduce impurity on average
(across all trees in the forest). More precisely, it is a weighted average, where each
node’s weight is equal to the number of training samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training, then it
scales the results so that the sum of all importances is equal to 1. You can access the
result using the **feature_importances_** variable.

In [14]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09120139699453958
sepal width (cm) 0.023989731426464052
petal length (cm) 0.4226600768284621
petal width (cm) 0.4621487947505343


Random Forests are very handy to get a quick understanding of what features
actually matter, in particular if you need to perform feature selection.
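
One possible way to turn these importances into a feature selection step is Scikit-Learn's `SelectFromModel`, sketched below. The `threshold="mean"` setting (keep features whose importance is at least the mean importance) is a choice made for this example, not a recommendation from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit a forest, then keep only the features whose importance reaches the mean.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=42),
    threshold="mean")
X_selected = selector.fit_transform(iris["data"], iris["target"])

kept_features = [name for name, keep
                 in zip(iris["feature_names"], selector.get_support()) if keep]
print("kept features:", kept_features)
print("reduced shape:", X_selected.shape)
```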

# Boosting
Boosting (originally called hypothesis boosting) refers to any Ensemble method that
can combine several weak learners into a strong learner. The general idea of most
boosting methods is to **train predictors sequentially, each trying to correct its predecessor.**

<img src="images/boosting.png" width='700' />

<br>

The most popular boosting methods are:
- AdaBoost (Adaptive Boosting)
- Gradient Boosting




## AdaBoost

One way for a new predictor to correct its predecessor is **to pay a bit more attention
to the training instances that the predecessor underfitted.** This results in new predictors
focusing more and more on the hard cases. This is the technique used by AdaBoost.

<br>

<img src="images/adaboost.jpg" width='700' />

<br>

The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and
so on.
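
The reweighting idea can be sketched by hand for a binary problem. The loop below is a simplified illustration under stated assumptions, not Scikit-Learn's implementation: each stump's vote weight is taken as log((1 - r) / r), where r is its weighted error rate, misclassified instances have their weights multiplied by the exponential of that vote weight, and the weights are then renormalized. The names `stumps`, `alphas`, and `weights` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X)

weights = np.full(m, 1 / m)  # every instance starts with the same weight
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=weights)
    wrong = stump.predict(X) != y

    # Weighted error rate (weights sum to 1); clipped to avoid division by zero.
    r = np.clip(weights[wrong].sum(), 1e-10, 1 - 1e-10)
    alpha = np.log((1 - r) / r)  # vote weight of this stump

    # Boost the weights of the misclassified instances, then renormalize.
    weights[wrong] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all stumps.
votes = np.zeros((m, 2))
for stump, alpha in zip(stumps, alphas):
    votes[np.arange(m), stump.predict(X)] += alpha
y_pred_boosted = votes.argmax(axis=1)

train_accuracy = (y_pred_boosted == y).mean()
print("training accuracy of the boosted stumps:", train_accuracy)
```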

There is one **important drawback** to this sequential learning technique:
**it cannot be parallelized (or only partially),** since each predictor
can only be trained after the previous predictor has been
trained and evaluated. As a result, it does not scale as well as bagging
or pasting.



Scikit-Learn actually uses a multiclass version of AdaBoost called **SAMME** (which
stands for Stagewise Additive Modeling using a Multiclass Exponential loss function).
When there are just two classes, SAMME is equivalent to AdaBoost.

The following code trains an AdaBoost classifier based on 200 **Decision Stumps** using
Scikit-Learn’s AdaBoostClassifier class. A **Decision Stump is a Decision Tree with max_depth=1, in
other words, a tree composed of a single decision node plus two leaf nodes.** This is
the default base estimator for the AdaBoostClassifier class.

In [16]:
from sklearn.ensemble import AdaBoostClassifier

# The algorithm argument is left at its default: "SAMME.R" is deprecated
# in recent Scikit-Learn releases.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5)

ada_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

## Gradient Boosting

Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every
iteration like AdaBoost does, this method tries to fit the new predictor to the residual
errors made by the previous predictor.

Gradient Boosting also works great with regression tasks. When Decision Trees are the base predictors, this is
called **Gradient Tree Boosting**, or **Gradient Boosted Regression Trees (GBRT).**

In [18]:
# Let’s go through a simple regression example using Decision Trees as the base predictors.

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

DecisionTreeRegressor(max_depth=2)

In [19]:
# Now train a second DecisionTreeRegressor on the residual errors made by the first predictor

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(max_depth=2)

In [20]:
# Then we train a third regressor on the residual errors made by the second predictor

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(max_depth=2)

Now we have an ensemble containing three trees. It can make predictions on a new
instance simply by adding up the predictions of all the trees.

In [22]:
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))

<img src="images/gradient_boosting.jpg" width='700' />

The figure above shows the predictions of these three trees in the left column, and the
ensemble’s predictions in the right column. In the first row, the ensemble has just one
tree, so its predictions are exactly the same as the first tree’s predictions. In the second
row, a new tree is trained on the residual errors of the first tree. On the right you can
see that the ensemble’s predictions are equal to the sum of the predictions of the first
two trees. Similarly, in the third row another tree is trained on the residual errors of
the second tree. You can see that the ensemble’s predictions gradually get better as
trees are added to the ensemble.

A simpler way to train GBRT ensembles is to use Scikit-Learn’s **GradientBoostingRegressor** class. Much like the RandomForestRegressor class, it has hyperparameters to
control the growth of Decision Trees (e.g., max_depth, min_samples_leaf, and so on),
as well as hyperparameters to control the ensemble training, such as the number of
trees (n_estimators).

In [23]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

The learning_rate hyperparameter scales the contribution of each tree. If you set it
to a low value, such as 0.1, you will need more trees in the ensemble to fit the training
set, but the predictions will usually generalize better. This is a regularization technique
called **shrinkage.**
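
The effect of shrinkage can be sketched with a hand-rolled residual-fitting loop in the spirit of the three-tree example earlier. This is a simplified illustration on synthetic quadratic data, not the GradientBoostingRegressor implementation; the helper names `fit_boosted_trees` and `predict_boosted` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_quad = rng.uniform(-0.5, 0.5, size=(200, 1))
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.normal(size=200)


def fit_boosted_trees(X, y, n_trees, learning_rate):
    """Fit each tree to the residuals left by the (shrunken) trees before it."""
    trees, residual = [], y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2, random_state=42)
        tree.fit(X, residual)
        # Only a fraction of each tree's prediction is removed from the residual.
        residual = residual - learning_rate * tree.predict(X)
        trees.append(tree)
    return trees


def predict_boosted(trees, X, learning_rate):
    """Sum the trees' predictions, each scaled by the learning rate."""
    return learning_rate * sum(tree.predict(X) for tree in trees)


training_mse = {}
for lr, n_trees in [(1.0, 3), (0.1, 3), (0.1, 200)]:
    trees = fit_boosted_trees(X_quad, y_quad, n_trees, lr)
    y_hat = predict_boosted(trees, X_quad, lr)
    training_mse[(lr, n_trees)] = np.mean((y_quad - y_hat) ** 2)
    print(f"learning_rate={lr}, trees={n_trees}, "
          f"training MSE={training_mse[(lr, n_trees)]:.5f}")
```

With a low learning rate, a handful of trees barely fits the training set, which is why more trees are needed.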

<img src="images/gradient_boosting_2.jpg" width='700' />


The figure above shows two GBRT ensembles trained with a low
learning rate: the one on the left does not have enough trees to fit the training set,
while the one on the right has too many trees and overfits the training set.

In order to find the optimal number of trees, you can use **early stopping**. A simple way to implement this is to use the staged_predict() method: it
returns an iterator over the predictions made by the ensemble at each stage of training
(with one tree, two trees, etc.).

In [24]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)

gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) 
          for y_pred in gbrt.staged_predict(X_val)]

# argmin is zero-based: index i corresponds to an ensemble of i + 1 trees
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)

gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=88)

It is also possible to implement early stopping by actually stopping training early
(instead of training a large number of trees first and then looking back to find the
optimal number). You can do so by setting warm_start=True, which makes Scikit-
Learn keep existing trees when the fit() method is called, allowing incremental
training.

The following code stops training when the validation error does not
improve for five iterations in a row:

In [25]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error = float("inf")
error_going_up = 0

for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break # early stopping
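
GradientBoostingRegressor also has built-in early stopping through its `n_iter_no_change` and `validation_fraction` hyperparameters, which hold out part of the training data internally. The sketch below uses synthetic quadratic data and arbitrary settings chosen for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(500, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=500)

# Stop adding trees once the score on an internal 20% validation split
# has not improved for 5 consecutive iterations.
gbrt_auto = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.1, n_estimators=500,
    n_iter_no_change=5, validation_fraction=0.2, random_state=42)
gbrt_auto.fit(X_demo, y_demo)

print("trees actually trained:", gbrt_auto.n_estimators_)
```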

The **GradientBoostingRegressor** class also supports a **subsample** hyperparameter,
which specifies the fraction of training instances to be used for training each tree. For
example, if subsample=0.25, then each tree is trained on 25% of the training instances,
selected randomly. As you can probably guess by now, this trades a higher bias
for a lower variance. It also speeds up training considerably. **This technique is called
Stochastic Gradient Boosting.**
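
A minimal sketch of Stochastic Gradient Boosting, assuming synthetic quadratic data; the only change from a regular GBRT is the `subsample` argument.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(500, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=500)

# Each tree is trained on a random 25% of the training instances.
gbrt_stochastic = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42)
gbrt_stochastic.fit(X_demo, y_demo)

print("training R^2:", gbrt_stochastic.score(X_demo, y_demo))
```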


It is worth noting that an optimized implementation of Gradient Boosting is available
in the popular Python library **XGBoost**, which stands for **Extreme Gradient Boosting.**

In [27]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

XGBoost also offers several nice features, such as **automatically taking care of early
stopping.**

In [28]:
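# Recent XGBoost releases take early_stopping_rounds in the constructor
# rather than in fit().
xgb_reg = xgboost.XGBRegressor(early_stopping_rounds=2)
xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)])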

y_pred = xgb_reg.predict(X_val)

[0]	validation_0-rmse:0.41028
[1]	validation_0-rmse:0.36708
[2]	validation_0-rmse:0.34373
[3]	validation_0-rmse:0.33505
[4]	validation_0-rmse:0.33242
[5]	validation_0-rmse:0.33519
[6]	validation_0-rmse:0.33227
[7]	validation_0-rmse:0.33541
[8]	validation_0-rmse:0.33165
[9]	validation_0-rmse:0.33338
[10]	validation_0-rmse:0.33470




# Stacking

**Stacking** (short for
stacked generalization) is based on a simple idea: instead of using trivial functions
(such as hard voting) to aggregate the predictions of all predictors in an ensemble,
why don’t we train a model to perform this aggregation?

The first figure below shows such an ensemble performing a regression task on a new instance. Each of the three
predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor
(called a blender, or a meta learner) takes these predictions as inputs and makes the
final prediction (3.0).

<img src="images/stacking_2.jpg" width='500' />

<br>

<img src="images/stacking.jpeg" width='700' />


To train the blender, a common approach is to use a hold-out set.
First, the training set is split into two subsets. The first subset is used to train the
predictors in the first layer.

Next, the first layer predictors are used to make predictions on the second (held-out)
set. This ensures that the predictions are “clean,” since the predictors
never saw these instances during training. Now for each instance in the hold-out set there are three predicted values. We can create a new training set using these predicted
values as input features (which makes this new training set three-dimensional),
and keeping the target values. The blender is trained on this new training set, so it
learns to predict the target value given the first layer’s predictions.
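
The hold-out procedure described above can be sketched by hand. The three first-layer regressors, the linear blender, and the synthetic sine data are all assumptions chosen for this illustration; the helper `first_layer_predictions` is a made-up name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_all = rng.uniform(-3, 3, size=(600, 1))
y_all = np.sin(X_all[:, 0]) + 0.1 * rng.normal(size=600)

# Split the training set: one half for the first layer, one half held out.
X_layer1, X_holdout, y_layer1, y_holdout = train_test_split(
    X_all, y_all, test_size=0.5, random_state=42)

first_layer = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=3, random_state=42),
    KNeighborsRegressor(n_neighbors=5),
]
for predictor in first_layer:
    predictor.fit(X_layer1, y_layer1)


def first_layer_predictions(X_new):
    """One column per first-layer predictor: the blender's input features."""
    return np.column_stack([p.predict(X_new) for p in first_layer])


# "Clean" predictions on instances the first layer never saw during training.
blender_inputs = first_layer_predictions(X_holdout)
blender = LinearRegression().fit(blender_inputs, y_holdout)

# Predicting for a new instance goes through both layers in sequence.
X_new = np.array([[0.5]])
final_prediction = blender.predict(first_layer_predictions(X_new))
print("blended prediction:", final_prediction[0])
```

Scikit-Learn also ships `StackingRegressor` and `StackingClassifier` in `sklearn.ensemble`, which build the blender's training set with cross-validated predictions instead of a single hold-out split.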

<img src="images/stacking_3.jpg" width='400' />

It is actually possible to train several different blenders this way (e.g., one using Linear
Regression, another using Random Forest Regression, and so on): we get a whole
layer of blenders. The trick is to split the training set into three subsets: the first one is
used to train the first layer, the second one is used to create the training set used to
train the second layer (using predictions made by the predictors of the first layer),
and the third one is used to create the training set to train the third layer (using predictions
made by the predictors of the second layer). Once this is done, we can make
a prediction for a new instance by going through each layer sequentially.

<img src="images/stacking_4.jpg" width='400' />
