# 📝 Exercise M3.01

The goal is to write an exhaustive search to find the combination of
parameters that maximizes the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code faster
to execute. Once your code works on the small subset, try to change
`train_size` to a larger value (e.g. 0.8 for 80% instead of 20%).
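
To make the effect of `train_size` concrete, here is a minimal sketch on a synthetic array rather than the census data. The arrays `X` and `y`, the 1,000-row size, and the `split_sizes` dictionary are illustrative assumptions and not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption): 1,000 rows, 3 features.
X = np.arange(3000).reshape(1000, 3)
y = np.arange(1000) % 2

split_sizes = {}
for train_size in [0.2, 0.8]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, random_state=42
    )
    # Rows not placed in the training set end up in the test set.
    split_sizes[train_size] = (len(X_train), len(X_test))
    print(
        f"train_size={train_size}: "
        f"{len(X_train)} train rows, {len(X_test)} test rows"
    )
```

A larger training fraction gives the model more rows to learn from, so each fit takes longer, and it leaves fewer rows for the final test.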

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42
)

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [
        (
            "cat_preprocessor",
            categorical_preprocessor,
            selector(dtype_include=object),
        )
    ],
    remainder="passthrough",
)

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(random_state=42)),
    ]
)
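
The `OrdinalEncoder` settings used above decide what happens when a category appears at prediction time that was never seen during `fit`. The sketch below illustrates this on a tiny hand-made column. The category strings and the names `encoder` and `encoded` are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on two known categories (illustrative values only).
encoder.fit(np.array([["Private"], ["State-gov"]], dtype=object))

# Transform a column containing a category that was not seen during fit.
encoded = encoder.transform(
    np.array([["Private"], ["State-gov"], ["Never-seen"]], dtype=object)
)

# Known categories get integer codes in sorted order; the unseen one gets -1.
print(encoded.ravel())
```

Without `handle_unknown="use_encoded_value"`, the default behavior is to raise an error on unseen categories, which can break cross-validation when a rare category is missing from a training fold.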

Using the previously defined model (called `model`) and two nested `for`
loops, search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, set the parameters on the
model and evaluate it with `cross_val_score` on the training set. Use the
following parameter values:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10, 30. This parameter caps the number of
  leaves in each tree, which limits how complex each tree can be.

In [4]:
model.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'classifier', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__verbose_feature_names_out', 'preprocessor__cat_preprocessor', 'preprocessor__cat_preprocessor__categories', 'preprocessor__cat_preprocessor__dtype', 'preprocessor__cat_preprocessor__encoded_missing_value', 'preprocessor__cat_preprocessor__handle_unknown', 'preprocessor__cat_preprocessor__max_categories', 'preprocessor__cat_preprocessor__min_frequency', 'preprocessor__cat_preprocessor__unknown_value', 'classifier__categorical_features', 'classifier__class_weight', 'classifier__early_stopping', 'classifier__interaction_cst', 'classifier__l2_regularization', 'classifier__learning_rate', 'classifier__loss', 'classifier__max_bins', 'classifier__max_depth', 'classifier__max_iter', 'classifier__max_leaf_nodes', 'classifier__min_sample

In [14]:
# Write your code here.
from sklearn.model_selection import cross_val_score

# Exhaustive search: evaluate every (learning_rate, max_leaf_nodes) pair.
for learning_rate in [0.01, 0.1, 1, 10]:
    for max_leaf_nodes in [3, 10, 30]:
        # Pipeline parameters are addressed as "<step name>__<parameter>".
        model.set_params(
            classifier__learning_rate=learning_rate,
            classifier__max_leaf_nodes=max_leaf_nodes,
        )

        # 10-fold cross-validation on the training set only.
        scores = cross_val_score(model, data_train, target_train, cv=10)
        print(
            f"HyperParameters:\n"
            f"learning_rate = {learning_rate}\n"
            f"max_leaf_nodes = {max_leaf_nodes}\n"
            f"\n"
            f"Score: {scores.mean():.3f} +- {scores.std():.3f}"
        )

HyperParameters:
learning_rate = 0.01
max_leaf_nodes = 3

Score: 0.789 +- 0.005
HyperParameters:
learning_rate = 0.01
max_leaf_nodes = 10

Score: 0.813 +- 0.004
HyperParameters:
learning_rate = 0.01
max_leaf_nodes = 30

Score: 0.842 +- 0.006
HyperParameters:
learning_rate = 0.1
max_leaf_nodes = 3

Score: 0.849 +- 0.009
HyperParameters:
learning_rate = 0.1
max_leaf_nodes = 10

Score: 0.861 +- 0.008
HyperParameters:
learning_rate = 0.1
max_leaf_nodes = 30

Score: 0.859 +- 0.007
HyperParameters:
learning_rate = 1
max_leaf_nodes = 3

Score: 0.856 +- 0.010
HyperParameters:
learning_rate = 1
max_leaf_nodes = 10

Score: 0.851 +- 0.011
HyperParameters:
learning_rate = 1
max_leaf_nodes = 30

Score:

Now use the test set to score the model using the best parameters that we
found using cross-validation. You will have to refit the model over the full
training set.

In [15]:
# Write your code here.
# Combination with the highest mean accuracy among the cross-validation
# results reported above.
model.set_params(
    classifier__learning_rate=0.1,
    classifier__max_leaf_nodes=10,
)

# Refit on the full training set, then score once on the held-out test set.
model.fit(data_train, target_train)
test_score = model.score(data_test, target_test)

print(
    f"HyperParameters:\n"
    f"learning_rate = {model.get_params()['classifier__learning_rate']}\n"
    f"max_leaf_nodes = {model.get_params()['classifier__max_leaf_nodes']}\n"
    f"\n"
    f"Test score: {test_score:.3f}"
)
