In [1]:
# -*- coding: utf-8 -*-

# 📝 Exercise M6.01

The aim of this notebook is to investigate if we can tune the hyperparameters
of a bagging regressor and evaluate the gain obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor`
to its parameter `base_estimator` (renamed `estimator` in scikit-learn 1.2
and later). Train the regressor and evaluate its generalization performance
on the testing set using the mean absolute error.

In [3]:
# Write your code here.
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

model = BaggingRegressor(DecisionTreeRegressor())
model.fit(data_train, target_train)
target_pred = model.predict(data_test)

mae = mean_absolute_error(target_test, target_pred)

print(f'Basic mean absolute error of the prediction: {round(mae, 2)}')

Basic mean absolute error of the prediction: 36.52


Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether you can find a set of parameters that
improves on the default regressor, still using the mean absolute error as
the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>

In [4]:
for param in model.get_params():
    print(param)

base_estimator__ccp_alpha
base_estimator__criterion
base_estimator__max_depth
base_estimator__max_features
base_estimator__max_leaf_nodes
base_estimator__min_impurity_decrease
base_estimator__min_samples_leaf
base_estimator__min_samples_split
base_estimator__min_weight_fraction_leaf
base_estimator__random_state
base_estimator__splitter
base_estimator
bootstrap
bootstrap_features
max_features
max_samples
n_estimators
n_jobs
oob_score
random_state
verbose
warm_start


In [5]:
# Write your code here.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

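# Note: in `max_samples` and `max_features`, the integer 1 is read as
# "1 sample" / "1 feature"; use the float 1.0 to mean 100% of them.
param_grid = {
    'n_estimators': randint(10, 30),
    'max_samples': [0.5, 0.8, 1],
    'max_features': [0.5, 0.8, 1],
    'base_estimator__max_depth': randint(3, 10)
}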

search = RandomizedSearchCV(model, param_grid,
                            n_iter=20, scoring='neg_mean_absolute_error')

search.fit(data_train, target_train)

RandomizedSearchCV(estimator=BaggingRegressor(base_estimator=DecisionTreeRegressor()),
                   n_iter=20,
                   param_distributions={'base_estimator__max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f7bc4fc2580>,
                                        'max_features': [0.5, 0.8, 1],
                                        'max_samples': [0.5, 0.8, 1],
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f7bc4fab970>},
                   scoring='neg_mean_absolute_error')
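
In the search space above, `max_samples` and `max_features` mix floats with the integer `1`. In `BaggingRegressor`, a float is interpreted as a fraction and an integer as an absolute count, so `1` and `1.0` are not equivalent. The following is a minimal sketch of that difference, assuming a small synthetic dataset from `make_regression` instead of the California housing data; the names `bag_int` and `bag_float` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Synthetic data with 8 features (an assumption made for this illustration).
X, y = make_regression(n_samples=200, n_features=8, random_state=0)

# Integer 1: each tree is trained on exactly one feature.
bag_int = BaggingRegressor(n_estimators=5, max_features=1, random_state=0)
bag_int.fit(X, y)

# Float 1.0: each tree is trained on 100% of the features.
bag_float = BaggingRegressor(n_estimators=5, max_features=1.0, random_state=0)
bag_float.fit(X, y)

# Number of features drawn for each tree of the ensemble.
n_feat_int = [len(features) for features in bag_int.estimators_features_]
n_feat_float = [len(features) for features in bag_float.estimators_features_]
print(n_feat_int)
print(n_feat_float)
```

The same rule applies to `max_samples`, which may explain why the candidates drawing `1` for `max_features` score much worse in the results of this notebook.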

In [6]:
import pandas as pd

cols = [f'param_{name}' for name in param_grid.keys()]
cols += ['mean_test_error', 'std_test_error']

cv_results = pd.DataFrame(search.cv_results_)
cv_results['mean_test_error'] = -cv_results['mean_test_score']
cv_results['std_test_error'] = cv_results['std_test_score']
cv_results[cols].sort_values(by='mean_test_error')

| | param_n_estimators | param_max_samples | param_max_features | param_base_estimator__max_depth | mean_test_error | std_test_error |
|---|---|---|---|---|---|---|
| 9 | 24 | 0.5 | 0.8 | 9 | 39.332595 | 0.615869 |
| 16 | 11 | 0.5 | 0.5 | 9 | 47.858667 | 3.794445 |
| 12 | 19 | 0.5 | 0.5 | 5 | 52.72026 | 1.319104 |
| 5 | 15 | 0.5 | 0.5 | 3 | 60.729305 | 2.06836 |
| 18 | 11 | 0.5 | 1.0 | 6 | 74.078768 | 2.175631 |
| 13 | 17 | 0.8 | 1.0 | 4 | 75.13038 | 3.20823 |
| 15 | 25 | 0.5 | 1.0 | 6 | 76.188324 | 2.18075 |
| 4 | 20 | 0.5 | 1.0 | 4 | 76.794597 | 3.266011 |
| 2 | 13 | 0.5 | 1.0 | 6 | 77.529241 | 1.925885 |
| 1 | 28 | 0.5 | 1.0 | 3 | 78.075618 | 2.054251 |
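
Besides sorting `cv_results_`, a fitted `RandomizedSearchCV` exposes the winning candidate directly through `best_params_` and `best_score_`. The sketch below shows this on a small synthetic dataset so that it runs quickly; the data, the reduced search space, and the names `demo_search`, `best_params` and `best_cv_mae` are assumptions made for the illustration, not the setup of this exercise.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (an assumption made for this illustration).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Floats are fractions of the samples / features; 1.0 means all of them.
param_distributions = {
    "n_estimators": randint(5, 15),
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
}

demo_search = RandomizedSearchCV(
    BaggingRegressor(random_state=0),
    param_distributions,
    n_iter=5,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
demo_search.fit(X, y)

# The score is a negated error, so flip the sign to read it as an MAE.
best_params = demo_search.best_params_
best_cv_mae = -demo_search.best_score_
print(best_params)
print(f"Best cross-validated MAE: {best_cv_mae:.2f}")
```

Fixing `random_state` on both the estimator and the search makes the sampled candidates reproducible from one run to the next.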


In [7]:
target_predicted = search.predict(data_test)

print(f'Mean absolute error after tuning of the bagging regressor:\n'
     f'{mean_absolute_error(target_test, target_predicted):.2f}')

Mean absolute error after tuning of the bagging regressor:
37.95


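Tuning did not help here: the tuned model reaches a mean absolute error of
37.95 k$ on the testing set, slightly worse than the 36.52 k$ obtained with
the default bagging regressor. We see that the predictor provided by the
bagging regressor does not need much hyperparameter tuning compared to a
single decision tree.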
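
The comparison with a single decision tree is not run in this notebook. One way to check it is to cross-validate both models with their default hyperparameters, as in the minimal sketch below. It assumes a synthetic dataset and illustrative names (`models`, `cv_mae`), so the numbers it prints say nothing about the California housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption made for this illustration).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Both models keep their default hyperparameters.
models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(random_state=0),
}

cv_mae = {}
for name, estimator in models.items():
    scores = cross_val_score(
        estimator, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    # Scores are negated errors; flip the sign to get the mean MAE.
    cv_mae[name] = -scores.mean()
    print(f"{name}: {cv_mae[name]:.2f}")
```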