# 📃 Solution for Exercise M6.01

The aim of this notebook is to investigate if we can tune the hyperparameters
of a bagging regressor and evaluate the gain obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)

In [2]:
data_train.shape

(10320, 8)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor`
to its parameter `base_estimator` (renamed `estimator` in scikit-learn 1.2,
where `base_estimator` is deprecated). Train the regressor and evaluate its
generalization performance on the testing set using the mean absolute error.

In [3]:
# solution
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

tree = DecisionTreeRegressor()
bagging = BaggingRegressor(base_estimator=tree, n_jobs=2)
bagging.fit(data_train, target_train)
target_predicted = bagging.predict(data_test)
print(f"Basic mean absolute error of the bagging regressor:\n"
      f"{mean_absolute_error(target_test, target_predicted):.2f} k$")



Basic mean absolute error of the bagging regressor:
37.15 k$


Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether they improve on the default regressor, still
using the mean absolute error as the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>

In [4]:
# solution
for param in bagging.get_params().keys():
    print(param)

base_estimator__ccp_alpha
base_estimator__criterion
base_estimator__max_depth
base_estimator__max_features
base_estimator__max_leaf_nodes
base_estimator__min_impurity_decrease
base_estimator__min_samples_leaf
base_estimator__min_samples_split
base_estimator__min_weight_fraction_leaf
base_estimator__random_state
base_estimator__splitter
base_estimator
bootstrap
bootstrap_features
estimator
max_features
max_samples
n_estimators
n_jobs
oob_score
random_state
verbose
warm_start


In [5]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": randint(10, 30),
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "base_estimator__max_depth": randint(3, 10),
}
search = RandomizedSearchCV(
    bagging, param_grid, n_iter=20, scoring="neg_mean_absolute_error"
)
_ = search.fit(data_train, target_train)
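
It may help to check which values the distributions above can produce. `scipy.stats.randint(low, high)` excludes `high`, so `randint(3, 10)` only yields depths from 3 to 9. The sketch below draws from that distribution; the sample size and the seed are arbitrary choices made for illustration.

```python
from scipy.stats import randint

# Same distribution as the one used for "base_estimator__max_depth".
depth_dist = randint(3, 10)

# Arbitrary sample size and seed, chosen only to make the sketch repeatable.
draws = depth_dist.rvs(size=1000, random_state=0)

observed = sorted(set(draws.tolist()))
print(observed)  # distinct depths that were drawn
```

Because the search above does not set `random_state`, the 20 sampled candidates, and therefore the reported errors, can differ from one run to the next.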



In [6]:
RandomizedSearchCV?

Init signature:
RandomizedSearchCV(
    estimator,
    param_distributions,
    *,
    n_iter=10,
    scoring=None,
    n_jobs=None,
    refit=True,
    cv=None,
    verbose=0,
    pre_dispatch='2*n_jobs',
    random_state=None,
    error_score=nan,
    return_train_score=False,
)
Docstring:
Ran

In [15]:
DecisionTreeRegressor?

Init signature:
DecisionTreeRegressor(
    *,
    criterion='squared_error',
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    ccp_alpha=0.0,
)

In [16]:
BaggingRegressor?

Init signature:
BaggingRegressor(
    estimator=None,
    n_estimators=10,
    *,
    max_samples=1.0,
    max_features=1.0,
    bootstrap=True,
    bootstrap_features=False,
    oob_score=False,
    warm_start=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    base_estimator='deprecated',
)


In [7]:
import pandas as pd

columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")

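|    | param_n_estimators | param_max_samples | param_max_features | param_base_estimator__max_depth | mean_test_error | std_test_error |
|----|--------------------|-------------------|--------------------|---------------------------------|-----------------|----------------|
| 10 | 28 | 1.0 | 1.0 | 9 | 39.369892 | 1.091804 |
| 17 | 21 | 1.0 | 0.8 | 8 | 40.335195 | 1.117009 |
| 14 | 14 | 1.0 | 0.8 | 8 | 40.638359 | 0.802786 |
| 13 | 26 | 0.5 | 0.8 | 7 | 42.179621 | 0.942178 |
| 11 | 29 | 1.0 | 0.8 | 7 | 42.642421 | 0.997369 |
| 6  | 19 | 0.8 | 0.8 | 6 | 45.117379 | 1.065416 |
| 0  | 22 | 0.5 | 1.0 | 6 | 45.273058 | 1.209723 |
| 4  | 26 | 1.0 | 0.5 | 8 | 45.396940 | 2.051425 |
| 3  | 16 | 1.0 | 0.8 | 6 | 45.442448 | 1.120276 |
| 2  | 15 | 0.5 | 0.8 | 6 | 45.537053 | 0.684391 |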


In [8]:
target_predicted = search.predict(data_test)
print(f"Mean absolute error after tuning of the bagging regressor:\n"
      f"{mean_absolute_error(target_test, target_predicted):.2f} k$")

Mean absolute error after tuning of the bagging regressor:
39.03 k$


In [9]:
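from sklearn.model_selection import GridSearchCV

# Each list holds explicit candidate values, not (start, stop, step):
# the grid tries n_estimators in {10, 30, 1} and max_depth in {3, 10, 1}.
param_grid = {
    "n_estimators": [10, 30, 1],
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "base_estimator__max_depth": [3, 10, 1],
}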
search = GridSearchCV(
    bagging, param_grid, scoring="neg_mean_absolute_error"
)
_ = search.fit(data_train, target_train)
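
To see how large this grid search is, the candidates can be enumerated with the standard library alone. The sketch below reuses the candidate lists of the grid above and assumes the default 5-fold cross-validation, since `cv` is not set in the call.

```python
from itertools import product

# Same candidate lists as in the grid above.
param_grid = {
    "n_estimators": [10, 30, 1],
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "base_estimator__max_depth": [3, 10, 1],
}

# One dictionary per combination of candidate values.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)
n_fits = n_candidates * 5  # assumption: default 5-fold cross-validation
print(n_candidates, n_fits)  # 81 405
```

With `refit=True`, the default, one more fit of the best candidate on the full training set follows the cross-validated fits.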



In [10]:
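# `score` uses the search's scoring, i.e. the negative mean absolute error
neg_mae = search.score(data_test, target_test)
print(
    f"The test score (negative mean absolute error) of the grid-searched "
    f"model is: {neg_mae:.3f}"
)

The test score (negative mean absolute error) of the grid-searched model is: -39.693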
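
The score printed above is negative because the search was built with `scoring="neg_mean_absolute_error"`: scikit-learn scorers follow a "higher is better" convention, so the error is reported with its sign flipped. A minimal sketch of that relationship in plain Python, using made-up illustrative values rather than data from this notebook:

```python
# Illustrative targets and predictions in k$ (not taken from the dataset).
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

# Mean absolute error: average of the absolute differences.
abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(abs_errors) / len(abs_errors)

# The "neg_" score is the same quantity with the sign flipped.
neg_mae = -mae
print(f"MAE: {mae:.2f} k$, score: {neg_mae:.2f}")
```

Negating the score therefore recovers the error in k$, which is what the `mean_test_error` column does in the result tables.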


In [11]:
print(f"The best set of parameters is: "
      f"{search.best_params_}")

The best set of parameters is: {'base_estimator__max_depth': 10, 'max_features': 0.8, 'max_samples': 0.8, 'n_estimators': 30}


In [12]:
search.cv_results_

{'mean_fit_time': array([0.05200429, 0.14461513, 0.01040087, 0.05998869, 0.1721983 ,
        0.01259518, 0.06698666, 0.15903363, 0.01260347, 0.0525918 ,
        0.14400496, 0.01019287, 0.06559944, 0.19540348, 0.01519852,
        0.07739244, 0.22521229, 0.01859875, 0.06681666, 0.20141802,
        0.0143981 , 0.0870101 , 0.24601107, 0.01659184, 0.09199924,
        0.26973267, 0.02159767, 0.09699507, 0.30459433, 0.01859736,
        0.13361306, 0.39399381, 0.02619781, 0.18032808, 0.52237039,
        0.03479047, 0.14400949, 0.41518521, 0.02740817, 0.18940611,
        0.68914638, 0.04479408, 0.23520036, 0.6502183 , 0.05100007,
        0.22137742, 0.66398931, 0.03819909, 0.25938449, 0.797861  ,
        0.05360622, 0.29959269, 0.90122375, 0.06700692, 0.03320885,
        0.0715889 , 0.00639057, 0.0313993 , 0.08659596, 0.00819831,
        0.0355998 , 0.08759127, 0.00659881, 0.02800255, 0.07361355,
        0.00640616, 0.03239336, 0.10280252, 0.00759697, 0.03601141,
        0.10159655, 0.00940261,

In [13]:
import pandas as pd

columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")

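|     | param_n_estimators | param_max_samples | param_max_features | param_base_estimator__max_depth | mean_test_error | std_test_error |
|-----|--------------------|-------------------|--------------------|---------------------------------|-----------------|----------------|
| 40  | 30  | 0.8 | 0.8 | 10  | 37.863721 | 0.952264 |
| 43  | 30  | 1.0 | 0.8 | 10  | 37.924230 | 1.526065 |
| 52  | 30  | 1.0 | 1.0 | 10  | 38.046621 | 1.159262 |
| 49  | 30  | 0.8 | 1.0 | 10  | 38.152831 | 0.969548 |
| 37  | 30  | 0.5 | 0.8 | 10  | 38.331491 | 1.365070 |
| ... | ... | ... | ... | ... | ...       | ...      |
| 59  | 1   | 0.8 | 0.5 | 1   | 76.198014 | 3.414329 |
| 65  | 1   | 0.5 | 0.8 | 1   | 76.623332 | 4.183858 |
| 68  | 1   | 0.8 | 0.8 | 1   | 76.719683 | 3.965568 |
| 56  | 1   | 0.5 | 0.5 | 1   | 81.535320 | 5.878345 |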


In [14]:
target_predicted = search.predict(data_test)
print(f"Mean absolute error after tuning of the bagging regressor:\n"
      f"{mean_absolute_error(target_test, target_predicted):.2f} k$")

Mean absolute error after tuning of the bagging regressor:
39.69 k$
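
The three test errors reported in this notebook can be put side by side. The sketch below only reuses the printed values (37.15, 39.03 and 39.69 k$) and computes the relative change with respect to the default model; the dictionary labels are chosen here for illustration.

```python
# Test errors in k$, copied from the outputs printed in this notebook.
test_errors = {
    "default bagging": 37.15,
    "randomized search": 39.03,
    "grid search": 39.69,
}

baseline = test_errors["default bagging"]

# Relative change in percent with respect to the default model.
relative_change = {
    name: 100 * (error - baseline) / baseline
    for name, error in test_errors.items()
}

for name, error in test_errors.items():
    print(f"{name}: {error:.2f} k$ ({relative_change[name]:+.1f}% vs default)")
```

These numbers come from a single train-test split, so small differences between them should be read with caution.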


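We see that the predictor provided by the bagging regressor does not need
much hyperparameter tuning compared to a single decision tree: the default
model reaches a test error of 37.15 k$, while the tuned models reach 39.03 k$
(randomized search) and 39.69 k$ (grid search). Neither search improved on
the default, likely because both limited the tree depth to 10 or less,
whereas the default trees are grown without a depth limit.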