# 📝 Exercise M6.01

The aim of this notebook is to investigate whether tuning the hyperparameters
of a bagging regressor improves its generalization performance, and to
evaluate the gain obtained.

We will load the California housing dataset and split it into a training and a
testing set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target from units of 100 k$ to k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5
)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor` to its
parameter `estimator`. Train the regressor and evaluate its generalization
performance on the testing set using the mean absolute error.
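
As a reminder, the mean absolute error is the average of the absolute differences between the true and the predicted targets. The sketch below computes it by hand; the two arrays are made-up values chosen for illustration, not taken from the California housing dataset.

```python
import numpy as np

# Hypothetical true and predicted house values, in k$ (illustrative only).
y_true = np.array([100.0, 250.0, 180.0, 320.0])
y_pred = np.array([110.0, 240.0, 200.0, 300.0])

# Mean absolute error: average magnitude of the residuals.
mae = np.mean(np.abs(y_true - y_pred))
print(f"Mean absolute error: {mae:.2f} k$")
```

Because the error is expressed in the same unit as the target, it can be read directly as a typical prediction error in k$.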

In [6]:
# Write your code here.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

bag_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor()
)

bag_reg.fit(data_train, target_train)

pred_test = bag_reg.predict(data_test)

abs_error = mean_absolute_error(target_test, pred_test)
print(f"Mean absolute error on test dataset: {abs_error:.2f}")

Mean absolute error on test dataset: 36.81


Now, create a `RandomizedSearchCV` instance using the previous model and tune
the important parameters of the bagging regressor. Find the best parameters
and check whether this set of parameters improves on the default regressor,
still using the mean absolute error as the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt> method.</p>
</div>

In [15]:
# Write your code here.
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

param_distributions = {
    "estimator__max_depth": np.arange(5, 50, 1),
    "n_estimators": [10, 20, 30, 40, 50],
}

model_random_search = RandomizedSearchCV(
    bag_reg,
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    verbose=1,
    scoring="neg_mean_absolute_error",
)
model_random_search.fit(data_train, target_train)
#bag_reg.get_params()

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [16]:
test_error = -model_random_search.score(data_test, target_test)

print(f"Mean absolute error of the best model on the test set: {test_error:.2f}")

Mean absolute error of the best model on the test set: 34.72


In [17]:
model_random_search.best_params_

{'n_estimators': 50, 'estimator__max_depth': 45}

In [18]:
bag_reg.get_params()

{'base_estimator': 'deprecated',
 'bootstrap': True,
 'bootstrap_features': False,
 'estimator__ccp_alpha': 0.0,
 'estimator__criterion': 'squared_error',
 'estimator__max_depth': None,
 'estimator__max_features': None,
 'estimator__max_leaf_nodes': None,
 'estimator__min_impurity_decrease': 0.0,
 'estimator__min_samples_leaf': 1,
 'estimator__min_samples_split': 2,
 'estimator__min_weight_fraction_leaf': 0.0,
 'estimator__random_state': None,
 'estimator__splitter': 'best',
 'estimator': DecisionTreeRegressor(),
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}