# Bagging-based estimator

In [None]:
# temporary fix to avoid spurious warning raised in scikit-learn 1.0.0
# it will be solved in scikit-learn 1.0.1
import warnings
warnings.filterwarnings("ignore", message="X has feature names.*")
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")

## Bagging estimator

We saw that increasing the depth of a tree eventually leads to an over-fitted model. One way to avoid choosing a specific depth is to combine several trees.

Let's start by training several trees on slightly different data. Such slightly different datasets can be generated by randomly sampling with replacement. In statistics, this is called a bootstrap sample. We will use the iris dataset to create such an ensemble, keeping some data for training and some left-out data for testing.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

Before training several decision trees, we will train a single tree. However, instead of training this tree on `X_train`, we want to train it on a bootstrap sample. You can use the `np.random.choice` function to sample indices with replacement, as the `bootstrap_idx` helper defined below does. Alternatively, a bootstrap sample can be encoded as a `sample_weight` vector and passed to the `fit` method of the `DecisionTreeClassifier`.
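
The `sample_weight` route can be sketched as follows. This is only an illustration, not the notebook's own helper: the name `bootstrap_sample_weight` is hypothetical, and the sketch assumes that a sample drawn k times simply receives weight k.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def bootstrap_sample_weight(n_samples, rng):
    # Hypothetical helper: draw indices with replacement and count how many
    # times each original sample was selected.
    indices = rng.choice(n_samples, size=n_samples, replace=True)
    return np.bincount(indices, minlength=n_samples)


rng = np.random.default_rng(0)
X_demo, y_demo = load_iris(return_X_y=True)
weights = bootstrap_sample_weight(X_demo.shape[0], rng)

# Samples with a weight of 0 are left out of this bootstrap sample.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_demo, y_demo, sample_weight=weights)
```

The weights always sum to the number of samples, since each draw increments exactly one counter.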

In [None]:
import numpy as np

def bootstrap_idx(X):
    indices = np.random.choice(
        np.arange(X.shape[0]), size=X.shape[0], replace=True
    )
    return indices

In [None]:
bootstrap_idx(X_train)

In [None]:
from collections import Counter
Counter(bootstrap_idx(X_train))

In [None]:
def bootstrap_sample(X, y):
    indices = bootstrap_idx(X)
    return X[indices], y[indices]

In [None]:
X_train_bootstrap, y_train_bootstrap = bootstrap_sample(X_train, y_train)

In [None]:
print(f'Classes distribution in the original data: {Counter(y_train)}')
print(f'Classes distribution in the bootstrap: {Counter(y_train_bootstrap)}')
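
Because sampling is done with replacement, a bootstrap sample does not contain every original sample. The sketch below, which uses a hypothetical dataset size of 10,000, compares the observed fraction of distinct samples with the theoretical value $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # hypothetical dataset size, chosen for illustration

# Draw one bootstrap sample of indices, the same size as the dataset.
indices = rng.choice(n_samples, size=n_samples, replace=True)

# Fraction of distinct original samples that made it into the bootstrap.
unique_fraction = np.unique(indices).size / n_samples
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

print(f"Observed unique fraction: {unique_fraction:.3f}")
print(f"Expected unique fraction: {expected_fraction:.3f}")
```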

<div class="alert alert-success">
    <b>EXERCISE: Create a bagging classifier</b>:
    <br>
    A bagging classifier will train several decision tree classifiers, each of them on a different bootstrap sample.
     <ul>
      <li>
          Create several <tt>DecisionTreeClassifier</tt> and store them in a Python list;
      </li>
      <li>
          Loop over these trees and <tt>fit</tt> them by generating a bootstrap sample using the <tt>bootstrap_sample</tt> function;
      </li>
      <li>
          To predict with this ensemble of trees on new data (the testing set), provide the same set to each tree and call its <tt>predict</tt> method. Aggregate all predictions in a NumPy array;
      </li>
      <li>
          Once the predictions are available, you need to provide a single prediction: retain the class that was predicted most often, which is called a majority vote;
      </li>
      <li>
          Finally, check the accuracy of your model.
      </li>
    </ul>
</div>
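
The majority vote mentioned in the exercise can be illustrated on a small hand-written array. This is a sketch under stated assumptions, not the reference solution: `all_predictions` and `majority_vote` are hypothetical names, and class labels are assumed to be non-negative integers.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per test sample.
all_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])


def majority_vote(predictions):
    # For each column (sample), count the votes per class and keep the most
    # frequent one. Ties go to the smallest label because argmax returns the
    # first maximum.
    return np.apply_along_axis(
        lambda votes: np.bincount(votes).argmax(), axis=0, arr=predictions
    )


ensemble_prediction = majority_vote(all_predictions)
print(ensemble_prediction)
```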

In [None]:
# %load solutions/solution_06.py

In [None]:
# %load solutions/solution_07.py

In [None]:
# %load solutions/solution_08.py

In [None]:
# %load solutions/solution_09.py

In [None]:
# %load solutions/solution_10.py

<div class="alert alert-success">
    <b>EXERCISE: using scikit-learn</b>:
    <br>
    After implementing your own bagging classifier, use a <tt>BaggingClassifier</tt> from scikit-learn to fit the above data.
</div>

In [None]:
# %load solutions/solution_11.py

### Note regarding the base estimator

In the previous section, we used a decision tree as the base estimator of the bagging ensemble. However, bagging can accept any kind of base estimator. We will compare two bagging models: one that uses decision trees and another that uses a linear model with a preprocessing step.

Let's first create a synthetic regression dataset.

In [None]:
import pandas as pd

# create a random number generator that will be used to set the randomness
rng = np.random.RandomState(1)

n_samples = 30
x_min, x_max = -3, 3
x = rng.uniform(x_min, x_max, size=n_samples)
noise = 4.0 * rng.randn(n_samples)
y = x ** 3 - 0.5 * (x + 1) ** 2 + noise
y /= y.std()

data_train = pd.DataFrame(x, columns=["Feature"])
data_test = pd.DataFrame(
    np.linspace(x_max, x_min, num=300), columns=["Feature"])
target_train = pd.Series(y, name="Target")


In [None]:
import seaborn as sns
sns.set_context("poster")

In [None]:
ax = sns.scatterplot(
    x=data_train["Feature"], y=target_train, color="black",
    alpha=0.5
)
_ = ax.set_title("Synthetic regression dataset")

We will first train a `BaggingRegressor` whose base estimators are `DecisionTreeRegressor` instances, which is the default when no base estimator is given.

In [None]:
from sklearn.ensemble import BaggingRegressor

bagged_trees = BaggingRegressor(n_estimators=50, random_state=0)
bagged_trees.fit(data_train, target_train)

We can make a plot showing the predictions of each individual tree and the averaged response given by the bagging regressor.

In [None]:
import matplotlib.pyplot as plt

for tree_idx, tree in enumerate(bagged_trees.estimators_):
    label = "Predictions of individual trees" if tree_idx == 0 else None
    tree_predictions = tree.predict(data_test)
    plt.plot(data_test, tree_predictions, linestyle="--", alpha=0.1,
             color="tab:blue", label=label)

sns.scatterplot(x=data_train["Feature"], y=target_train, color="black",
                alpha=0.5)

bagged_trees_predictions = bagged_trees.predict(data_test)
plt.plot(data_test, bagged_trees_predictions,
         color="tab:orange", label="Predictions of ensemble")
_ = plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
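
To make the averaging concrete, the sketch below checks on a separate small synthetic dataset that the ensemble prediction matches the mean of the individual predictions. The names `X_demo`, `y_demo` and `X_grid` are hypothetical, and the check assumes the default setting where every estimator sees all features.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(30, 1))
y_demo = X_demo.ravel() ** 3 + rng.randn(30)
X_grid = np.linspace(-3, 3, num=50).reshape(-1, 1)

bagging = BaggingRegressor(n_estimators=10, random_state=0)
bagging.fit(X_demo, y_demo)

# Stack the predictions of every fitted tree, then average them by hand.
individual = np.array([est.predict(X_grid) for est in bagging.estimators_])
manual_average = individual.mean(axis=0)

print(np.allclose(manual_average, bagging.predict(X_grid)))
```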

Now, we will show that we can use a model other than a decision tree. We will create a pipeline that uses `PolynomialFeatures` to augment the features, followed by a `Ridge` linear model.

In [None]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

polynomial_regressor = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=4),
    Ridge(alpha=1e-10),
)
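
As a quick illustration of what the polynomial step does to a single feature, the sketch below expands three hand-picked values (the array `x_demo` is hypothetical). With `degree=4`, each value $x$ becomes the row $[1, x, x^2, x^3, x^4]$.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three hypothetical values of a single feature.
x_demo = np.array([[0.0], [0.5], [2.0]])

# Each row is expanded into the powers 0 to 4 of the input value.
expanded = PolynomialFeatures(degree=4).fit_transform(x_demo)
print(expanded)
```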

In [None]:
# Note: `base_estimator` is the parameter name in scikit-learn 1.0; it was
# renamed to `estimator` in scikit-learn 1.2.
bagged_polynomials = BaggingRegressor(
    n_estimators=100, base_estimator=polynomial_regressor, random_state=0
)
bagged_polynomials.fit(data_train, target_train)

# Plot the prediction of each individual polynomial regressor.
for regressor_idx, regressor in enumerate(bagged_polynomials.estimators_):
    label = (
        "Predictions of individual regressors" if regressor_idx == 0 else None
    )
    regressor_predictions = regressor.predict(data_test)
    plt.plot(data_test, regressor_predictions, linestyle="--", alpha=0.1,
             color="tab:blue", label=label)

sns.scatterplot(x=data_train["Feature"], y=target_train, color="black",
                alpha=0.5)

# Plot the averaged prediction of the whole ensemble.
bagged_polynomials_predictions = bagged_polynomials.predict(data_test)
plt.plot(data_test, bagged_polynomials_predictions,
         color="tab:orange", label="Predictions of ensemble")
_ = plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))

We can observe that both base estimators can be used to model our toy example.

## Random Forests

### Random forest classifier

A very popular classifier is the random forest classifier. It is similar to the bagging classifier. In addition to bootstrapping, the random forest considers only a randomly selected subset of features when searching for the best split.
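
The idea of drawing a random subset of candidate features can be sketched with NumPy alone. This is a conceptual illustration, not how scikit-learn implements it internally. It assumes 4 features (as in iris) and a subset size of $\sqrt{F}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4  # e.g. the iris dataset has 4 features
n_candidates = max(1, int(np.sqrt(n_features)))

# At each split, a different random subset of feature indices is considered.
for split_idx in range(3):
    candidates = rng.choice(n_features, size=n_candidates, replace=False)
    print(f"Split {split_idx}: candidate features {sorted(candidates.tolist())}")
```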

<div class="alert alert-success">
    <b>EXERCISE: Create a random forest classifier</b>:
    <br>
    Reuse your previous code, which generated several <tt>DecisionTreeClassifier</tt> instances. Check the list of options of this classifier and modify one of the parameters such that only $\sqrt{F}$ features are used for splitting. $F$ represents the number of features in the dataset.
</div>

<div class="alert alert-success">
    <b>EXERCISE: using scikit-learn</b>:
    <br>
    After implementing your own random forest classifier, use a <tt>RandomForestClassifier</tt> from scikit-learn to fit the above data.
</div>

In [None]:
# %load solutions/solution_12.py

### Random forest regressor

<div class="alert alert-success">
    <b>EXERCISE</b>:
    <br>
    <ul>
        <li>Load the dataset available via <tt>sklearn.datasets.fetch_california_housing</tt>.</li>
        <li>Fit a <tt>RandomForestRegressor</tt> with the default parameters.</li>
        <li>What is the number of features used during the training process?</li>
        <li>What is the difference between a <tt>BaggingRegressor</tt> and a <tt>RandomForestRegressor</tt>?</li>
    </ul>
</div>

In [None]:
# %load solutions/solution_13.py

In [None]:
# %load solutions/solution_14.py

In [None]:
# %load solutions/solution_15.py

In [None]:
# %load solutions/solution_16.py

### Hyperparameters

The hyperparameters affecting the training process are mainly the same as for the decision tree. One can look at the documentation for details. However, since we are dealing with a forest of trees, there is a new parameter, `n_estimators`. We can do a quick exercise to check the effect of modifying this parameter. For this purpose, we will use a validation curve.
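
As a reminder of how `validation_curve` is called and what it returns, here is a minimal sketch on a different estimator and parameter, so it does not give away the exercise. The synthetic dataset and the `max_depth` range are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
param_range = np.array([1, 3, 5, 10])

train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(random_state=0),
    X_demo,
    y_demo,
    param_name="max_depth",
    param_range=param_range,
    cv=5,
)

# One row per parameter value, one column per cross-validation fold.
print(train_scores.shape, test_scores.shape)
print(test_scores.mean(axis=1))
```

Averaging over the columns gives one score per parameter value, and the standard deviation over the columns can serve as an error band when plotting.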

In [None]:
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

<div class="alert alert-success">
    <b>EXERCISE</b>:
    <br>
    <ul>
        <li>Use the <tt>sklearn.model_selection.validation_curve</tt> to compute the train and test scores and thus analyse the impact of the <tt>n_estimators</tt> parameter. You will have to define a range of values for this parameter.</li>
        <li>Plot the train and test scores as well as the confidence intervals.</li>
    </ul>
    What is the impact of increasing the number of trees in the ensemble in terms of statistical performance? Do you think that there is a trade-off with the computational performance?
</div>

In [None]:
# %load solutions/solution_17.py

In [None]:
# %load solutions/solution_18.py

In [None]:
# %load solutions/solution_19.py

The other parameters controlling the overfitting of the individual trees could also be tuned. Sometimes, there is no need to have fully grown trees. However, be aware that in a random forest, trees are generally deep: we deliberately overfit the learners on the bootstrap samples, because this overfitting is mitigated by combining them. Assembling underfitted trees (i.e. shallow trees) might also lead to an underfitted forest.
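
The last point can be checked on a toy problem. The sketch below is an illustration on an assumed noisy sine curve, not a general result: it compares a forest of fully grown trees with a forest of depth-1 trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(3 * X_demo.ravel()) + 0.1 * rng.randn(500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Fully grown trees versus trees limited to a single split.
deep_forest = RandomForestRegressor(
    n_estimators=50, max_depth=None, random_state=0
)
shallow_forest = RandomForestRegressor(
    n_estimators=50, max_depth=1, random_state=0
)

deep_score = deep_forest.fit(X_tr, y_tr).score(X_te, y_te)
shallow_score = shallow_forest.fit(X_tr, y_tr).score(X_te, y_te)

print(f"R2 with fully grown trees: {deep_score:.2f}")
print(f"R2 with depth-1 trees: {shallow_score:.2f}")
```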