# Hyper-parameter tuning

First, let's fetch the "titanic" dataset directly from OpenML.

In [None]:
import pandas as pd

In this dataset, missing values are encoded with the character `"?"`. We tell pandas about it when reading the CSV file.

In [None]:
df = pd.read_csv(
    "https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv",
    na_values='?'
)
df.head()
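
To see what `na_values` changes, here is a minimal sketch on a made-up three-row CSV (the `csv_text` content and the column names are illustrative assumptions, not part of the Titanic data). Without the option, the column holding `"?"` is read as text; with it, the column is numeric and the marker becomes `NaN`.

```python
import io

import pandas as pd

# Hypothetical CSV content where "?" marks a missing age.
csv_text = "name,age\nAlice,29\nBob,?\nCarol,41\n"

# Default parsing: "?" cannot be converted to a number, so the column is text.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "?" is treated as a missing value and the column is numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw["age"].dtype)
print(clean["age"].dtype)
print(clean["age"].isna().sum())
```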

The classification task is to predict whether or not a person will survive the Titanic disaster.

In [None]:
X_df = df.drop(columns='survived')
y = df['survived']

We will split the data into a training and a testing set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, random_state=42, stratify=y
)
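
The `stratify=y` argument keeps the class proportions roughly identical in both subsets. A small sketch on a synthetic target (the arrays `X_toy` and `y_toy` are assumptions made up for the illustration) makes this visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

# Fraction of positives in each subset; both should stay close to 0.4.
print(y_tr.mean(), y_te.mean())
```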

## The typical machine-learning pipeline

The Titanic dataset is composed of mixed data types (i.e. numerical and categorical data). Therefore, we need to define a preprocessing pipeline for each data type and use a `ColumnTransformer` to process each type separately.

First, let's define the columns belonging to each data type.

In [None]:
num_cols = ['age', 'fare']
cat_col = ['sex', 'embarked', 'pclass']

Then, define the two preprocessing pipelines.

In [None]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

cat_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
)
num_pipe = SimpleImputer(strategy='mean')

Combine both preprocessors using a `ColumnTransformer`.

In [None]:
from sklearn.compose import ColumnTransformer
preprocessing = ColumnTransformer(
    [('cat_preprocessor', cat_pipe, cat_col),
     ('num_preprocessor', num_pipe, num_cols)]
)
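
To get a feeling for what such a preprocessor outputs, here is a minimal sketch on a made-up two-column frame (the `toy` data and the names `toy_preprocessing` and `unseen` are assumptions for the illustration). It uses the same transformers as above: the categorical column is imputed with the string `"missing"` and then integer-encoded, and the numerical column is imputed with its mean.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with one missing value in each column.
toy = pd.DataFrame({
    "sex": pd.Series(["male", "female", np.nan, "female"], dtype=object),
    "age": [20.0, np.nan, 40.0, 30.0],
})

toy_preprocessing = ColumnTransformer([
    ("cat", make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    ), ["sex"]),
    ("num", SimpleImputer(strategy="mean"), ["age"]),
])

# Output columns follow the order of the transformers: encoded sex, then age.
encoded = toy_preprocessing.fit_transform(toy)
print(encoded)

# A category never seen during fit is encoded with unknown_value (-1).
unseen = pd.DataFrame({
    "sex": pd.Series(["other"], dtype=object),
    "age": [25.0],
})
encoded_unseen = toy_preprocessing.transform(unseen)
print(encoded_unseen)
```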

Finally, let's create a pipeline made of the preprocessor and a random forest classifier.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

model = Pipeline([
    ('preprocessing', preprocessing),
    ('clf', RandomForestClassifier(n_jobs=-1, random_state=42))
])

# Influence of parameter tuning

Machine-learning algorithms expose hyper-parameters that affect the performance of the final model. Scikit-learn provides default values for them. However, these defaults do not necessarily lead to the model with the best performance.

Let's list the available parameters and then set some that may change the performance of the classifier.

In [None]:
model.get_params()

In [None]:
model.set_params(clf__n_estimators=2, clf__max_depth=2)
_ = model.fit(X_train, y_train)
print(f'Accuracy score on the training data: '
      f'{model.score(X_train, y_train):.3f}')
print(f'Accuracy score on the testing data: '
      f'{model.score(X_test, y_test):.3f}')
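
The `clf__n_estimators` style of name follows scikit-learn's `<step>__<parameter>` convention for nested estimators. As a minimal sketch on a separate toy pipeline (the name `toy_model` and the `scale` step are assumptions for the illustration), setting `clf__max_depth` changes the `max_depth` attribute of the step named `clf`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# "clf__max_depth" means: parameter `max_depth` of the step named "clf".
toy_model.set_params(clf__max_depth=2)

# Both views report the same value.
print(toy_model.get_params()["clf__max_depth"])
print(toy_model.named_steps["clf"].max_depth)
```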

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>By analyzing the training and testing scores, what can you say about the model? Is it under- or over-fitting?</li>
    </ul>
</div>

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>What if we don't limit the depth of the trees in the forest?</li>
    </ul>
</div>

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>And what about a forest composed of a large number of trees, each with no depth limit?</li>
    </ul>
</div>

# Use a grid-search instead

The previous approach is really tedious, and we cannot be sure that we covered all relevant cases. Instead, we can run an automatic search that tries every combination of a set of hyper-parameter values and checks the resulting performance of the model. One tool for such an exhaustive search is `GridSearchCV`.

With grid-search, we need to specify the set of values we wish to test for each parameter. `GridSearchCV` then builds a grid with all the possible combinations and evaluates each of them by cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__n_estimators': [5, 50, 100],
    'clf__max_depth': [3, 5, 8, None]
}
grid = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, cv=5)
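
To see how many candidates such a grid contains, one option is scikit-learn's `ParameterGrid` helper, which enumerates the combinations. This is only a sketch for counting; the variable names `candidates` and `n_folds` are assumptions for the illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'clf__n_estimators': [5, 50, 100],
    'clf__max_depth': [3, 5, 8, None]
}

# Every combination of one value per parameter.
candidates = list(ParameterGrid(param_grid))
n_folds = 5

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # number of cross-validation fits
```

The number of combinations is the product of the number of values listed for each parameter, and each combination is fitted once per cross-validation fold.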

The resulting object behaves like a regular estimator and is trained by calling `fit`.

In [None]:
grid.fit(X_train, y_train)

We can check the results of all combinations by looking at the `cv_results_` attribute.

In [None]:
df_results = pd.DataFrame(grid.cv_results_)
columns_to_keep = [
    'param_clf__max_depth',
    'param_clf__n_estimators',
    'mean_test_score',
    'std_test_score',
]
df_results = df_results[columns_to_keep]
df_results.sort_values(by='mean_test_score', ascending=False)
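
Besides `cv_results_`, a fitted search also exposes the best candidate directly. The sketch below uses a small synthetic dataset and a deliberately tiny grid so that it runs quickly; the data and the names `toy_grid`, `X_tr`, `X_te` are assumptions for the illustration, not the Titanic setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical data standing in for a real classification problem.
X_toy, y_toy = make_classification(n_samples=200, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

toy_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 20], "max_depth": [2, None]},
    cv=3,
)
toy_grid.fit(X_tr, y_tr)

# Best combination and its mean cross-validated accuracy.
print(toy_grid.best_params_)
print(f"{toy_grid.best_score_:.3f}")

# By default the best model is refitted on the whole training set,
# so the search object can be scored on held-out data directly.
print(f"{toy_grid.score(X_te, y_te):.3f}")
```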

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>What might be a limitation of using a grid-search with several parameters and several values for each parameter?</li>
    </ul>
</div>

An alternative is `RandomizedSearchCV`. In this case, the parameter values are drawn from predefined distributions. The search draws a fixed number of candidates (`n_iter`) and checks the performance of each of them.

In [None]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'clf__n_estimators': randint(1, 100),
    'clf__max_depth': randint(2, 15),
    'clf__max_features': [1, 2, 3, 4, 5],
    'clf__min_samples_split': [2, 3, 4, 5, 10, 30],
}
search = RandomizedSearchCV(
    model, param_distributions=param_distributions,
    n_iter=20, n_jobs=-1, cv=5, random_state=42
)
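
To inspect what kind of candidates get drawn, scikit-learn's `ParameterSampler` can be used on its own. This is a sketch for illustration (the name `sampled` and the choice of three draws are assumptions). Note that `scipy.stats.randint(low, high)` excludes the upper bound, and that a plain list is sampled uniformly.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'clf__n_estimators': randint(1, 100),
    'clf__max_depth': randint(2, 15),
    'clf__max_features': [1, 2, 3, 4, 5],
    'clf__min_samples_split': [2, 3, 4, 5, 10, 30],
}

# Draw three candidate settings; each one is a plain dict of parameters.
sampled = list(
    ParameterSampler(param_distributions, n_iter=3, random_state=42)
)
for candidate in sampled:
    print(candidate)
```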

In [None]:
_ = search.fit(X_train, y_train)

In [None]:
df_results = pd.DataFrame(search.cv_results_)
columns_to_keep = [
    "param_" + param_name for param_name in param_distributions]
columns_to_keep += [
    'mean_test_score',
    'std_test_score',
]
df_results = df_results[columns_to_keep]
df_results = df_results.sort_values(by="mean_test_score", ascending=False)
df_results.head(5)

In [None]:
df_results.tail(5)

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <p>Build a machine-learning pipeline using a <tt>HistGradientBoostingClassifier</tt> and fine tune your model on the Titanic dataset using a <tt>RandomizedSearchCV</tt>.</p>
    <p>You may want to set the parameter distributions in the following manner:</p>
    <ul>
    <li><tt>learning_rate</tt> with values ranging from 0.001 to 0.5 following a reciprocal distribution.</li>
    <li><tt>l2_regularization</tt> with values ranging from 0.0 to 0.5 following a uniform distribution.</li>
    <li><tt>max_leaf_nodes</tt> with integer values ranging from 5 to 30 following a uniform distribution.</li>
    <li><tt>min_samples_leaf</tt> with integer values ranging from 5 to 30 following a uniform distribution.</li>
    </ul>
</div>
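
As a starting point, the distributions listed above could be expressed with `scipy.stats` as in the sketch below. It is not a full solution: the dictionary name `exercise_distributions` is hypothetical, the upper bounds of the integer ranges are assumed to be inclusive (hence `31`), and the keys still need the pipeline step prefix (for example `clf__`) before being passed to `RandomizedSearchCV`. The reciprocal distribution is available in SciPy as `loguniform`.

```python
from scipy.stats import loguniform, randint, uniform

exercise_distributions = {
    # Reciprocal (log-uniform) distribution between 0.001 and 0.5.
    "learning_rate": loguniform(0.001, 0.5),
    # uniform(loc, scale) covers the interval [loc, loc + scale].
    "l2_regularization": uniform(loc=0.0, scale=0.5),
    # randint excludes the upper bound, so 31 allows the value 30.
    "max_leaf_nodes": randint(5, 31),
    "min_samples_leaf": randint(5, 31),
}

# Draw a few values from each distribution to check the ranges.
ranges = {}
for name, dist in exercise_distributions.items():
    draws = dist.rvs(size=1000, random_state=0)
    ranges[name] = (draws.min(), draws.max())
    print(name, ranges[name])
```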

In [None]:
# TODO