# Introduction to scikit-learn: basic model hyper-parameters tuning

Fitting a predictive model is driven by a training set and by a set of
parameters that are chosen before training rather than learned from the data.
These parameters are called hyper-parameters and are specific to each family
of models. The best hyper-parameter values depend on the dataset at hand, so
they need to be tuned for each problem.

This notebook shows:
* the influence of changing model parameters;
* how to tune these hyper-parameters;
* how to evaluate the model performance together with hyper-parameters
  tuning.
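
To make the distinction concrete, the short sketch below contrasts a
hyper-parameter, set in the constructor before training, with the parameters
learned during `fit`. The synthetic dataset from `make_classification` is an
assumption made for illustration; it is not the census data used in the rest
of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(C=0.5)

# Hyper-parameter: chosen by the user and available before fitting.
print(f"C before fit: {clf.get_params()['C']}")

clf.fit(X, y)

# Learned parameters: they only exist once the model has been fitted.
print(f"Shape of the learned coefficients: {clf.coef_.shape}")
```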

In [1]:
import pandas as pd

df = pd.read_csv(
    "https://www.openml.org/data/get_csv/1595261/adult-census.csv")
# Or use the local copy:
# df = pd.read_csv('../datasets/adult-census.csv')

In [2]:
target_name = "class"
target = df[target_name].to_numpy()
target

array([' <=50K', ' <=50K', ' >50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype=object)

In [3]:
data = df.drop(columns=[target_name, "fnlwgt"])
data.head()

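|   | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|-----|-----------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |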


Once the dataset is loaded, we split it into training and testing sets.

In [4]:
from sklearn.model_selection import train_test_split

df_train, df_test, target_train, target_test = train_test_split(
    data, target, random_state=42)
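
As a quick check of what this split does, the sketch below runs
`train_test_split` on a small synthetic array (an assumption for
illustration). When neither `test_size` nor `train_size` is given, a quarter
of the samples is held out for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with 100 samples and 2 features (illustration only).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With the default settings, 75% of the samples go to the training set.
print(X_train.shape, X_test.shape)
```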

Then, we define a preprocessing pipeline that transforms the numerical and
categorical columns differently.

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

binary_encoding_columns = ['sex']
one_hot_encoding_columns = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'native-country']
scaling_columns = [
    'age', 'capital-gain', 'capital-loss', 'hours-per-week',
    'education-num']

preprocessor = ColumnTransformer([
    ('binary-encoder', OrdinalEncoder(), binary_encoding_columns),
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
     one_hot_encoding_columns),
    ('standard-scaler', StandardScaler(), scaling_columns)])
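
To see what such a preprocessor produces, here is a minimal sketch on a toy
dataframe. The names `toy_train`, `toy_test` and `toy_preprocessor`, as well
as the data, are hypothetical and only serve the illustration. It also shows
the effect of `handle_unknown='ignore'`: a category unseen during `fit` is
encoded as a row of zeros in the one-hot block.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical), with one column per kind of preprocessing.
toy_train = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "workclass": ["Private", "Local-gov", "Private", "State-gov"],
    "age": [25, 38, 28, 44],
})
toy_test = pd.DataFrame({
    "sex": ["Male"],
    "workclass": ["Never-worked"],  # category absent from toy_train
    "age": [30],
})

toy_preprocessor = ColumnTransformer(
    [
        ("binary-encoder", OrdinalEncoder(), ["sex"]),
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"),
         ["workclass"]),
        ("standard-scaler", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array, easier to print
)

train_encoded = toy_preprocessor.fit_transform(toy_train)
test_encoded = toy_preprocessor.transform(toy_test)

# 1 ordinal column + 3 one-hot columns + 1 scaled column
print(train_encoded.shape)
# The one-hot block of the unseen category only contains zeros.
print(test_encoded[0, 1:4])
```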

Finally, we use a linear classifier (i.e. logistic regression) to predict
whether or not a person earns more than 50,000 dollars a year.

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    preprocessor, LogisticRegression(max_iter=1000, solver='lbfgs'))
model.fit(df_train, target_train)
print(f"The accuracy score using a {model.__class__.__name__} is "
      f"{model.score(df_test, target_test):.2f}")

The accuracy score using a Pipeline is 0.86


## The issue of finding the best model parameters

In the previous example, we created a `LogisticRegression` classifier without
setting the parameter `C` explicitly, so its default value was used.

For this classifier, the parameter `C` governs the penalty: it is the inverse
of the regularization strength, so it controls how much our model should
"trust" (or fit) the training data. Small values of `C` mean a stronger
penalty.

The default value of `C` is not guaranteed to give the best performing model
for a given dataset.

We can run a quick experiment by changing the value of `C` and observing the
impact of this parameter on the model performance.

In [7]:
C = 1
model = make_pipeline(
    preprocessor,
    LogisticRegression(C=C, max_iter=1000, solver='lbfgs'))
model.fit(df_train, target_train)
print(f"The accuracy score using a {model.__class__.__name__} is "
      f"{model.score(df_test, target_test):.2f} with C={C}")

The accuracy score using a Pipeline is 0.86 with C=1


In [8]:
C = 1e-5
model = make_pipeline(
    preprocessor,
    LogisticRegression(C=C, max_iter=1000, solver='lbfgs'))
model.fit(df_train, target_train)
print(f"The accuracy score using a {model.__class__.__name__} is "
      f"{model.score(df_test, target_test):.2f} with C={C}")

The accuracy score using a Pipeline is 0.77 with C=1e-05
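
One way to understand the effect of a very small `C` is to look at the learned
coefficients: a stronger penalty shrinks them towards zero, which leaves the
model little room to use the features. The sketch below illustrates this on
synthetic data (an assumption; the exact numbers will differ on the census
data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

coef_norms = {}
for C in (1e-5, 1.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Euclidean norm of the coefficient vector for this value of C
    coef_norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: norm of the coefficients = {coef_norms[C]:.4f}")
```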


## Finding the best model hyper-parameters via exhaustive parameters search

We see that the parameter `C` has a significant impact on the model
performance. This parameter should be tuned to get the best cross-validation
score, so as to avoid over-fitting problems.

In short, we will set the parameter, train our model on some data, and
evaluate the model performance on some left-out data. We then select the
parameter value leading to the best performance on this left-out data, keeping
the testing set aside for the final evaluation.
Scikit-learn provides a `GridSearchCV` estimator which will handle the
cross-validation and hyper-parameter search for us.
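
The procedure described above can be sketched by hand as a loop over candidate
values. This is only a conceptual illustration on synthetic data (an
assumption), not scikit-learn's actual implementation of the search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidate_Cs = (0.1, 1.0, 10.0)
mean_scores = {}
for C in candidate_Cs:
    clf = LogisticRegression(C=C, max_iter=1000)
    # 5-fold cross-validation: train on 4 folds, score on the left-out fold
    scores = cross_val_score(clf, X, y, cv=5)
    mean_scores[C] = scores.mean()

# Keep the candidate with the best mean cross-validation score ...
best_C = max(mean_scores, key=mean_scores.get)
# ... and refit a model with this value on all the available data.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X, y)
print(f"Best C: {best_C} (mean CV accuracy: {mean_scores[best_C]:.3f})")
```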

In [9]:
from sklearn.model_selection import GridSearchCV

model = make_pipeline(
    preprocessor, LogisticRegression(max_iter=1000, solver='lbfgs'))

We need to provide the name of the parameter to be set. We can use the method
`get_params()` to list the parameters of the model which can be set during
the grid-search.

In [10]:
print(
    "The hyper-parameters for a logistic regression model are:")
for param_name in LogisticRegression().get_params().keys():
    print(param_name)

The hyper-parameters for a logistic regression model are:
C
class_weight
dual
fit_intercept
intercept_scaling
l1_ratio
max_iter
multi_class
n_jobs
penalty
random_state
solver
tol
verbose
warm_start


In [11]:
print("The hyper-parameters for the full pipeline are:")
for param_name in model.get_params().keys():
    print(param_name)

The hyper-parameters for the full pipeline are:
memory
steps
verbose
columntransformer
logisticregression
columntransformer__n_jobs
columntransformer__remainder
columntransformer__sparse_threshold
columntransformer__transformer_weights
columntransformer__transformers
columntransformer__verbose
columntransformer__binary-encoder
columntransformer__one-hot-encoder
columntransformer__standard-scaler
columntransformer__binary-encoder__categories
columntransformer__binary-encoder__dtype
columntransformer__one-hot-encoder__categorical_features
columntransformer__one-hot-encoder__categories
columntransformer__one-hot-encoder__drop
columntransformer__one-hot-encoder__dtype
columntransformer__one-hot-encoder__handle_unknown
columntransformer__one-hot-encoder__n_values
columntransformer__one-hot-encoder__sparse
columntransformer__standard-scaler__copy
columntransformer__standard-scaler__with_mean
columntransformer__standard-scaler__with_std
logisticregression__C
logisticregression__class_weig
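
The names listed above follow the `<step name>__<parameter name>` convention,
where `make_pipeline` names each step after its lowercased class name. The
sketch below, using a hypothetical two-step `toy_pipeline`, shows that such a
name can be used directly with `set_params` and `get_params`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical small pipeline, used only to illustrate the naming scheme.
toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated from the class names, in lowercase.
print(list(toy_pipeline.named_steps))

# "<step name>__<parameter name>" reaches a parameter of a nested step.
toy_pipeline.set_params(logisticregression__C=10.0)
print(toy_pipeline.get_params()["logisticregression__C"])
print(toy_pipeline.named_steps["logisticregression"].C)
```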

The parameter `'logisticregression__C'` is the one for which we would like to
try different values. Let us see how to use the `GridSearchCV` estimator to
run such a search.

In [12]:
import time

param_grid = {'logisticregression__C': (0.1, 1.0, 10.0)}
model_grid_search = GridSearchCV(model, param_grid=param_grid,
                                 n_jobs=4, cv=5)
start = time.time()
model_grid_search.fit(df_train, target_train)
elapsed_time = time.time() - start
print(
    f"The accuracy score using a {model_grid_search.__class__.__name__} is "
    f"{model_grid_search.score(df_test, target_test):.2f} in "
    f"{elapsed_time:.3f} seconds")

The accuracy score using a GridSearchCV is 0.86 in 12.139 seconds


The `GridSearchCV` estimator takes a `param_grid` parameter which defines
all the parameter combinations to try. Once the grid-search is fitted, it can
be used as any other predictor by calling `predict` and `predict_proba`.
Internally, it will use the model with the best parameters found during
`fit`. You can inspect these parameters by looking at the `best_params_`
attribute.

In [13]:
print(
    f"The best set of parameters is: {model_grid_search.best_params_}"
)

The best set of parameters is: {'logisticregression__C': 1.0}
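
Beyond `best_params_`, a fitted search also stores the cross-validation score
of every candidate in its `cv_results_` attribute. The sketch below shows one
way to inspect it with pandas; it uses synthetic data and a bare
`LogisticRegression` instead of the full pipeline (both are assumptions made
to keep the example short).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": (0.1, 1.0, 10.0)},
    cv=5,
)
search.fit(X, y)

# One row per candidate, with its mean and standard deviation over the folds
cv_results = pd.DataFrame(search.cv_results_)
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
print(cv_results[columns].sort_values("rank_test_score"))
print(f"Best mean cross-validation score: {search.best_score_:.3f}")
```

Looking at the standard deviation next to the mean helps to judge whether the
difference between two candidates is meaningful.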


With the `GridSearchCV` estimator, the candidate parameter values need to be
specified explicitly. Instead, one could randomly generate the parameter
candidates by sampling from a specified distribution. The
`RandomizedSearchCV` estimator allows for such a stochastic search. It is
used similarly to `GridSearchCV`, but the sampling distributions need to be
specified instead of the parameter values.

In [14]:
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    # uniform(loc, scale) samples from the interval [loc, loc + scale],
    # i.e. C is drawn between 50 and 150 here
    'logisticregression__C': uniform(loc=50, scale=100)}
model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions, n_iter=3,
    n_jobs=4, cv=5)
model_random_search.fit(df_train, target_train)
print(
    f"The accuracy score using a {model_random_search.__class__.__name__} is "
    f"{model_random_search.score(df_test, target_test):.2f}")
print(
    f"The best set of parameters is: {model_random_search.best_params_}"
)

The accuracy score using a RandomizedSearchCV is 0.86
The best set of parameters is: {'logisticregression__C': 106.63992846066772}
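
Since useful values of `C` often span several orders of magnitude, sampling on
a logarithmic scale can be a reasonable alternative to a uniform
distribution. The sketch below first checks the range covered by
`uniform(loc=50, scale=100)`, then runs a search with `scipy.stats.loguniform`.
The range `1e-3` to `1e3`, the synthetic data and the fixed `random_state` are
assumptions chosen for illustration.

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# uniform(loc, scale) draws values from the interval [loc, loc + scale].
uniform_samples = uniform(loc=50, scale=100).rvs(size=1000, random_state=0)
print(f"uniform samples lie in [{uniform_samples.min():.1f}, "
      f"{uniform_samples.max():.1f}]")

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # log-uniform sampling between 1e-3 and 1e3 (assumed range)
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=5,
    cv=5,
    random_state=0,  # makes the sampled candidates reproducible
)
random_search.fit(X, y)
print(f"The best set of parameters is: {random_search.best_params_}")
```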


## Notes on search efficiency

Be aware that scikit-learn sometimes provides `EstimatorCV` classes which
perform the cross-validation internally in a more computationally efficient
way. One example is `LogisticRegressionCV`, which can be used to find the
best `C` more efficiently than what we previously did with `GridSearchCV`.

In [15]:
from sklearn.linear_model import LogisticRegressionCV

# define the different Cs to try out
param_grid = {"C": (0.1, 1.0, 10.0)}

model = make_pipeline(
    preprocessor,
    LogisticRegressionCV(Cs=param_grid['C'], max_iter=1000,
                         solver='lbfgs', n_jobs=4, cv=5))
start = time.time()
model.fit(df_train, target_train)
elapsed_time = time.time() - start
print(f"Time elapsed to train LogisticRegressionCV: "
      f"{elapsed_time:.3f} seconds")

Time elapsed to train LogisticRegressionCV: 5.696 seconds


The `fit` time for the `CV` version of `LogisticRegression` is roughly half
that of the grid-search above (about 5.7 seconds versus 12.1 seconds). This
speed-up is obtained by re-using the values of the coefficients to warm-start
the estimator for the different `C` values.
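
A fitted `LogisticRegressionCV` does not expose `best_params_`; the selected
value is stored in its `C_` attribute and the candidates in `Cs_`. The sketch
below reads them on synthetic data (an assumption). It uses `np.ravel` so that
it works whether `C_` is stored as an array or as a scalar, which may depend
on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidate_Cs = [0.1, 1.0, 10.0]
clf_cv = LogisticRegressionCV(Cs=candidate_Cs, cv=5, max_iter=1000)
clf_cv.fit(X, y)

# np.ravel copes with `C_` being either an array or a scalar.
selected_C = float(np.ravel(clf_cv.C_)[0])
print(f"Candidate values: {clf_cv.Cs_}")
print(f"Selected C: {selected_C}")
```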

## Exercises:

- Build a machine learning pipeline:
  * preprocess the categorical columns using an `OrdinalEncoder` and leave
    the numerical columns as they are.
  * use a `HistGradientBoostingClassifier` as a predictive model.
- Run a hyper-parameter search using `RandomizedSearchCV`, tuning the
  following parameters:
  * `learning_rate` with values ranging from 0.001 to 0.5. You can use
    an exponential distribution to sample the possible values.
  * `l2_regularization` with values ranging from 0 to 0.5. You can use
    a uniform distribution.
  * `max_leaf_nodes` with values ranging from 5 to 30. The values should
    be integers following a uniform distribution.
  * `min_samples_leaf` with values ranging from 5 to 30. The values
    should be integers following a uniform distribution.
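
As a hint for the sampling distributions only (not a solution of the
exercise), the sketch below shows one possible way to express such ranges
with `scipy.stats`. Using `loguniform` for `learning_rate` is an assumption:
it is one reading of the suggestion above, and other distributions are
equally valid. Note that the upper bound of `randint` is exclusive.

```python
from scipy.stats import loguniform, randint, uniform

# Values between 0.001 and 0.5, sampled on a logarithmic scale (assumption).
learning_rate_dist = loguniform(0.001, 0.5)
# uniform(loc, scale) covers the interval [loc, loc + scale] = [0, 0.5].
l2_regularization_dist = uniform(loc=0, scale=0.5)
# randint(low, high) draws integers from low to high - 1, i.e. 5 to 30 here.
max_leaf_nodes_dist = randint(5, 31)
min_samples_leaf_dist = randint(5, 31)

leaf_samples = max_leaf_nodes_dist.rvs(size=1000, random_state=0)
rate_samples = learning_rate_dist.rvs(size=1000, random_state=0)
print(f"max_leaf_nodes samples lie in "
      f"[{leaf_samples.min()}, {leaf_samples.max()}]")
print(f"learning_rate samples lie in "
      f"[{rate_samples.min():.4f}, {rate_samples.max():.4f}]")
```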

In case you run into issues with unknown categories, try to precompute the
list of possible categories ahead of time and pass it explicitly to the
constructor of the encoder:

```python
categories = [data[column].unique()
              for column in data[categorical_columns]]
OrdinalEncoder(categories=categories)
```

## Combining evaluation and hyper-parameters search

Cross-validation was used for searching the best model parameters. We
previously evaluated model performance through cross-validation as well. If
we would like to combine both aspects, we need to perform a "nested"
cross-validation. The "outer" cross-validation is applied to assess the
model, while the "inner" cross-validation sets the hyper-parameters of the
model on the data set provided by the "outer" cross-validation. In practice,
this is equivalent to passing a `GridSearchCV`, `RandomizedSearchCV`, or any
`EstimatorCV` to a `cross_val_score` or `cross_validate` function call.

In [16]:
from sklearn.model_selection import cross_val_score

model = make_pipeline(
    preprocessor,
    LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=5))
score = cross_val_score(model, data, target, n_jobs=4, cv=5)
print(
    f"The accuracy score is: {score.mean():.2f} +- {score.std():.2f}"
)
print(f"The different scores obtained are: \n{score}")

The accuracy score is: 0.85 +- 0.00
The different scores obtained are: 
[0.8512642  0.84962637 0.84797297 0.85298935 0.85503686]


Be aware that the selected hyper-parameters might vary from one outer fold to
another. When analyzing such a model, you should not only look at the overall
model performance but also at the variation of the hyper-parameters.
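
One way to inspect this variation is to ask `cross_validate` to return the
estimator fitted on each outer fold, and then read the hyper-parameter chosen
by each inner search. The sketch below does so with a `GridSearchCV` on
synthetic data; the data and the simplified model (no preprocessing) are
assumptions made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate

# Synthetic data, used only for illustration (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Inner cross-validation: selects C on each outer training set.
inner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": (0.1, 1.0, 10.0)},
    cv=5,
)

# Outer cross-validation: evaluates the whole "search + refit" procedure.
outer_results = cross_validate(
    inner_search, X, y, cv=5, return_estimator=True)

selected_Cs = [
    estimator.best_params_["C"] for estimator in outer_results["estimator"]]
for fold, (C, score) in enumerate(
        zip(selected_Cs, outer_results["test_score"])):
    print(f"Outer fold {fold}: selected C={C}, test accuracy={score:.3f}")
```

If the selected values differ a lot between folds while the scores stay
similar, the model is probably not very sensitive to this hyper-parameter in
the explored range.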