# 📝 Exercise M6.01

The aim of this notebook is to investigate whether tuning the hyperparameters
of a bagging regressor is worthwhile, and to evaluate the gain obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)

In [15]:
data_train.shape

(10320, 8)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor`
to its parameter `base_estimator`. Train the regressor and evaluate its
generalization performance on the testing set using the mean absolute error.

In [34]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error

# Keep the default tree (no `max_depth`) and the default number of estimators
# here; both are explored later in the randomized search.
# The exercise asks for the mean absolute error, not the mean squared error.
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`.
bagged_trees = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(), n_jobs=2
)
bagged_trees.fit(data_train, target_train)
predictions = bagged_trees.predict(data_test)

bagged_error = mean_absolute_error(target_test, predictions)
bagged_error

36.829958168604655

In [19]:
bagged_trees.best_params_

AttributeError: 'BaggingRegressor' object has no attribute 'best_params_'
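
The `best_params_` attribute is set on a fitted search object such as `RandomizedSearchCV`, not on the estimator being tuned, which explains the error above. A minimal sketch on synthetic data follows; the `make_regression` dataset, the parameter range, and the name `search` are assumptions chosen only to keep the example small and fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem, used only to keep the sketch fast (an assumption,
# not the California housing data of this notebook).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    BaggingRegressor(random_state=0),  # the default base learner is a decision tree
    param_distributions={"n_estimators": randint(5, 15)},
    n_iter=3,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

# The fitted search object carries `best_params_`; the estimator does not.
print(search.best_params_)
print(hasattr(BaggingRegressor(), "best_params_"))
```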

In [47]:
from scipy.stats import randint
next(randint(10, 30))

TypeError: 'rv_frozen' object is not an iterator
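
A frozen distribution such as `randint(10, 30)` is not an iterator, so `next` fails on it. Values are drawn with its `rvs` method instead, which is also the method a randomized search relies on when sampling candidates. A short sketch (the names `dist` and `samples` are illustrative):

```python
from scipy.stats import randint

# Frozen discrete uniform distribution over the integers 10 to 29;
# the upper bound is excluded.
dist = randint(10, 30)

# Draw five values; `random_state` makes the draw reproducible.
samples = dist.rvs(size=5, random_state=0)
print(samples)
```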

Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether any set of parameters improves on the default
regressor, still using the mean absolute error as the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>

In [3]:
bagged_trees.get_params()

{'base_estimator__ccp_alpha': 0.0,
 'base_estimator__criterion': 'squared_error',
 'base_estimator__max_depth': 3,
 'base_estimator__max_features': None,
 'base_estimator__max_leaf_nodes': None,
 'base_estimator__min_impurity_decrease': 0.0,
 'base_estimator__min_samples_leaf': 1,
 'base_estimator__min_samples_split': 2,
 'base_estimator__min_weight_fraction_leaf': 0.0,
 'base_estimator__random_state': None,
 'base_estimator__splitter': 'best',
 'base_estimator': DecisionTreeRegressor(max_depth=3),
 'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [None]:
from scipy.stats import loguniform


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
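
As a quick check of what this helper produces, the sketch below restates the class so that it runs on its own and draws a batch of samples. The bounds `1` and `100` are an arbitrary choice for illustration. Because the underlying distribution is log-uniform, small integers should appear more often than large ones.

```python
from scipy.stats import loguniform


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)


# Draw 1000 integers between 1 and 100 (bounds chosen for illustration).
samples = loguniform_int(1, 100).rvs(size=1000, random_state=0)
print(samples.min(), samples.max())

# Share of samples below 32, roughly the upper third of the range on a log scale.
print((samples < 32).mean())
```

An instance of this class can be used as a value in `param_distributions`, for example for `n_estimators`, since a randomized search only requires an `rvs` method.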

In [8]:
import numpy as np
print(np.linspace(1, 11, num=5))

[ 1.   3.5  6.   8.5 11. ]


In [49]:
[i * 0.1 for i in range(1, 11)]

[0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6000000000000001,
 0.7000000000000001,
 0.8,
 0.9,
 1.0]

In [52]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    # Fractions of the features and samples drawn for each base estimator.
    'max_features': [.5, .8, 1.0],
    'max_samples': [.5, .8, 1.0],
    # randint excludes its upper bound: 10 to 29 estimators, depths 3 to 9.
    # The default unlimited depth (max_depth=None) is therefore never sampled.
    'n_estimators': randint(10, 30),
    'base_estimator__max_depth': randint(3, 10),
    # Other tree parameters that could be searched in the same way:
    # 'base_estimator__max_leaf_nodes', 'base_estimator__min_samples_leaf',
    # 'base_estimator__min_samples_split',
    # 'base_estimator__min_impurity_decrease'.
}

model_randomCV = RandomizedSearchCV(
    bagged_trees,
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_mean_absolute_error",
)

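The comparison below shows that the bagging regressor needs much less
hyperparameter tuning than a single decision tree: in this run, the model
selected by the randomized search reaches a test error of about 39.3 k$,
which does not improve on the 36.8 k$ of the default regressor.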

In [54]:
model_randomCV.fit(data_train, target_train)
predictions2 = model_randomCV.predict(data_test)

bagged_randomCV_error = mean_absolute_error(target_test, predictions2)
bagged_randomCV_error

39.33134336333586
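
To understand a result like this one, it can help to look at every sampled candidate rather than only the final test error. The sketch below shows one way to do so with the `cv_results_` attribute of a fitted search. It runs on a small synthetic dataset so that it stays self-contained; the data, the parameter ranges, and the names `search` and `summary` are assumptions for illustration, not the search performed above.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the California housing data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    BaggingRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(5, 15),
        "max_samples": [0.5, 0.8, 1.0],
    },
    n_iter=4,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
# The scorer returns the negated MAE, so flip the sign to read it as an error.
results["mean_test_error"] = -results["mean_test_score"]

columns = [
    "param_n_estimators",
    "param_max_samples",
    "mean_test_error",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

Applied to `model_randomCV`, the same table would show which sampled values of `base_estimator__max_depth` and `n_estimators` gave the lowest cross-validated error.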