# Hyperparameter Tuning

Scikit-learn offers several approaches to tuning the hyperparameters of models. Let's take a look at how we can use `GridSearchCV` to search exhaustively over a grid of possible hyperparameter combinations.

## Create data

Let's create a dummy binary classification dataset.

In [1]:
import numpy as np
from sklearn.datasets import make_classification

np.random.seed(37)

X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 20,
    'n_informative': 2,
    'n_redundant': 2,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 2,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

X shape = (2000, 20), y shape (2000,)


## Tuning Logistic Regression

Let's try to tune a logistic regression model. The logistic regression model is passed to the grid search as the `estimator`; it is this estimator's hyperparameters that we want to optimize. When tuning hyperparameters, we also need a way to split the data, and here we will use `StratifiedKFold`. Another important input to the grid search is the `param_grid` argument, which is a dictionary specifying the search space of each hyperparameter. Here, our search space is simple: it covers only `C`, the inverse of the regularization strength (smaller values mean stronger regularization). Lastly, we need an optimization criterion, and we specify that through the [scoring argument](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). Because we pass more than one metric, the `refit` argument names the metric used to pick the best candidate.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

p = {
    'solver': 'sag',
    'penalty': 'l2',
    'random_state': 37,
    'max_iter': 100
}
estimator = LogisticRegression(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)

Fitting 5 folds for each of 11 candidates, totalling 55 fits


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=LogisticRegression(random_state=37, solver='sag'),
             n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                               0.9, 1.0]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)

The `best_params_` attribute gives the hyperparameter combination with the best mean cross-validated score for the `refit` metric (`auc` here).

In [3]:
model.best_params_

{'C': 0.4}

The `best_score_` attribute gives the mean cross-validated score of the best candidate.

In [4]:
model.best_score_

0.9644498503712592

To retrieve the estimator that was refit on the full dataset with the best hyperparameters, access `best_estimator_`.

In [5]:
model.best_estimator_

LogisticRegression(C=0.4, random_state=37, solver='sag')
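
Beyond the single best candidate, the fitted search also keeps the scores of every candidate in its `cv_results_` attribute. The following is a minimal sketch, not part of the original notebook, that uses a smaller made-up dataset (the `_demo` names and sizes are assumptions chosen to keep it fast) to show how the per-metric columns line up with `best_score_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic dataset so the sketch runs quickly (sizes are arbitrary).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=37)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=37),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring={'auc': 'roc_auc', 'apr': 'average_precision'},
    refit='auc',
)
search.fit(X_demo, y_demo)

# With multi-metric scoring, cv_results_ holds one set of columns per metric name.
cv_df = pd.DataFrame(search.cv_results_)
columns = ['param_C', 'mean_test_auc', 'rank_test_auc', 'mean_test_apr', 'rank_test_apr']
print(cv_df[columns].sort_values('rank_test_auc'))

# best_score_ is the mean cross-validated score of the refit metric ('auc' here).
print(search.best_score_ == cv_df['mean_test_auc'].max())
```

Sorting by `rank_test_apr` instead shows whether the second metric would have selected a different value of `C`.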

## Tuning Random Forest

Here, we tune a `RandomForestClassifier`.

In [6]:
from sklearn.ensemble import RandomForestClassifier

p = {
    'random_state': 37
}
estimator = RandomForestClassifier(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'criterion': ['gini', 'entropy']
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=RandomForestClassifier(random_state=37), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90,
                                          100]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)

In [7]:
model.best_params_

{'criterion': 'entropy', 'n_estimators': 50}

In [8]:
model.best_score_

0.9763199132478311

In [9]:
model.best_estimator_

RandomForestClassifier(criterion='entropy', n_estimators=50, random_state=37)

## Tuning with a pipeline

Our estimator can also be a pipeline. For each step in the pipeline, we can specify a parameter grid by prefixing the parameter name with the step name and a double underscore, as in `pca__n_components`.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
pca = PCA()
rf = RandomForestClassifier(**{
    'random_state': 37
})
pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

cv = StratifiedKFold(**{
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
})

model = GridSearchCV(**{
    'estimator': pipeline,
    'cv': cv,
    'param_grid': {
        'scaler__feature_range': [(0, 1), (0, 2)],
        'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
        'rf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'rf__criterion': ['gini', 'entropy']
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
})

model.fit(X, y)

Fitting 5 folds for each of 320 candidates, totalling 1600 fits


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('pca', PCA()),
                                       ('rf',
                                        RandomForestClassifier(random_state=37))]),
             n_jobs=-1,
             param_grid={'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
                         'rf__criterion': ['gini', 'entropy'],
                         'rf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80,
                                              90, 100],
                         'scaler__feature_range': [(0, 1), (0, 2)]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)

In [11]:
model.best_params_

{'pca__n_components': 5,
 'rf__criterion': 'entropy',
 'rf__n_estimators': 100,
 'scaler__feature_range': (0, 2)}

In [12]:
model.best_score_

0.9718524009350235

In [13]:
model.best_estimator_

Pipeline(steps=[('scaler', MinMaxScaler(feature_range=(0, 2))),
                ('pca', PCA(n_components=5)),
                ('rf',
                 RandomForestClassifier(criterion='entropy', random_state=37))])
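
The prefixed names used in the parameter grid above follow the `<step>__<param>` pattern. As a small sketch (the variable names `demo_pipeline` and `tunable` are illustrative, not from the original notebook), `get_params()` can be used to list the names a pipeline accepts, and `set_params()` takes the same names.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('pca', PCA()),
    ('rf', RandomForestClassifier(random_state=37)),
])

# get_params() lists every settable name; nested parameters use '<step>__<param>'.
tunable = sorted(k for k in demo_pipeline.get_params() if '__' in k)
print([k for k in tunable if k.startswith('pca__')])

# set_params() accepts the same names that appear as keys in a param_grid.
demo_pipeline.set_params(pca__n_components=5, rf__n_estimators=50)
print(demo_pipeline.named_steps['pca'].n_components)
```

Checking a grid's keys against `get_params()` before launching a long search is a cheap way to catch misspelled parameter names.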

## Validation with tuning

In some cases, you might want to validate the hyperparameter tuning itself as a part of your learning process. Here, we show an example of how to do so. A few things to note in this example:

- The data generated will be multiclass.
- We will implement custom scorers. The average precision score does not natively handle multiclass labels, so we will have to transform the ground-truth labels into a one-hot encoded matrix.

Now let's generate some data.

In [14]:
X, y = make_classification(**{
    'n_samples': 1000,
    'n_features': 10,
    'n_clusters_per_class': 1,
    'n_classes': 3,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

X shape = (1000, 10), y shape (1000,)


Below, we create a `model` that is a grid search based on random forest. Note how we use the `make_scorer()` function to create custom scorers.

In [15]:
from sklearn.metrics import roc_auc_score, average_precision_score, make_scorer
from sklearn.preprocessing import OneHotEncoder

def apr_score(y_true, y_pred, average='micro'):
    encoder = OneHotEncoder()
    Y = encoder.fit_transform(y_true.reshape(-1, 1)).toarray()  # dense ndarray rather than np.matrix
    
    return average_precision_score(Y, y_pred, average=average)

def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })
    
    auc_scorer = make_scorer(
        roc_auc_score, 
        greater_is_better=True, 
        needs_proba=True, 
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='weighted')

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1), (0, 2)],
            'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
            'rf__n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 5,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1
    })
    return model
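
The custom `apr_score` above relies on turning class labels into one indicator column per class. As a minimal sketch of that idea on made-up numbers, the block below uses `label_binarize` (a swapped-in alternative to the `OneHotEncoder` used above, not the notebook's own approach) with the class list given explicitly, so the columns stay aligned with the probability columns even if a class is missing from a fold.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Toy 3-class ground truth and predicted class probabilities (made-up numbers).
y_true_demo = np.array([0, 1, 2, 2, 1, 0])
y_prob_demo = np.array([
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
    [0.20, 0.20, 0.60],
    [0.30, 0.60, 0.10],
    [0.90, 0.05, 0.05],
])

# One column per class; passing classes explicitly fixes the column order.
Y_demo = label_binarize(y_true_demo, classes=[0, 1, 2])
print(Y_demo.shape)

# Each averaging mode reduces the per-class results differently.
ap_by_average = {
    average: average_precision_score(Y_demo, y_prob_demo, average=average)
    for average in ('macro', 'micro', 'weighted')
}
print(ap_by_average)
```

Note that `label_binarize` returns a single column when there are only two classes, so this sketch is meant for the three-or-more-class case.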

Now we can perform stratified k-fold cross-validation with hyperparameter tuning inside each fold: the grid search only ever sees the training portion, and the held-out portion is used to score the tuned model.

In [16]:
import warnings
import pandas as pd

warnings.filterwarnings('ignore')

results = []

for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=10).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]
    
    model = get_model()
    model.fit(X_tr, y_tr)
    
    y_pred = model.predict_proba(X_te)
    
    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')
    
    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })
    
rdf = pd.DataFrame(results)

Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
Fitting 5 folds for each of 448 candidates, totalling 2240 fits


In [17]:
rdf.mean()

auc_ovr         0.998931
auc_ovo         0.998932
apr_macro       0.997529
apr_micro       0.997535
apr_weighted    0.997533
dtype: float64
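
The manual loop above is a form of nested cross-validation. A more compact way to express the same idea, sketched here under assumptions (a smaller made-up dataset, a reduced grid, and the built-in `'roc_auc_ovr'` scorer instead of the custom scorers), is to pass the grid search itself to `cross_val_score`, which refits the whole search on each outer training split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Small 3-class dataset so the sketch runs quickly (sizes are arbitrary).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=37)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)

# Inner loop: selects hyperparameters using only the outer training split.
inner_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=37),
    param_grid={'n_estimators': [5, 10], 'criterion': ['gini', 'entropy']},
    scoring='roc_auc_ovr',
    cv=inner_cv,
)

# Outer loop: scores each tuned model on data the search never saw.
outer_scores = cross_val_score(
    inner_search, X_demo, y_demo, cv=outer_cv, scoring='roc_auc_ovr')
print(outer_scores.mean())
```

The explicit loop remains the better fit when several metrics must be recorded per fold, as in the notebook; `cross_validate` with a dictionary of scorers is another option.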

## tune-sklearn

[tune-sklearn](https://github.com/ray-project/tune-sklearn) is a drop-in replacement for scikit-learn's hyperparameter tuning classes. The project aims to find good hyperparameters in less time by using smarter search strategies, such as the early stopping configured below.

In [18]:
from tune_sklearn import TuneGridSearchCV

def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })
    
    auc_scorer = make_scorer(
        roc_auc_score, 
        greater_is_better=True, 
        needs_proba=True, 
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='weighted')

    model = TuneGridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1)],
            'pca__n_components': [2, 3, 4, 5],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 1,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1,
        'early_stopping': 'MedianStoppingRule',
        'max_iters': 10
    })
    return model

In [19]:
results = []

for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=5).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]
    
    model = get_model()
    model.fit(X_tr, y_tr)
    
    y_pred = model.predict_proba(X_te)
    
    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')
    
    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })
    
rdf = pd.DataFrame(results)

In [20]:
rdf.mean()

auc_ovr         0.998309
auc_ovo         0.998308
apr_macro       0.996224
apr_micro       0.996087
apr_weighted    0.996239
dtype: float64

## Pipelines, column transformers, grid search

### Simple

In [21]:
df = pd.DataFrame({
    'text': ['pizza apple orange', 'potato tomato greens pizza', 'computer monitor', 'mouse keyboard'],
    'hand': ['left', 'right', np.nan, 'left'],
    'gender': ['m', 'f', 'f', 'm'],
    'age': [22.2, 32.3, 44.4, 55.5],
    'y': [1, 1, 0, 0]
})

df

| | text | hand | gender | age | y |
|---|---|---|---|---|---|
| 0 | pizza apple orange | left | m | 22.2 | 1 |
| 1 | potato tomato greens pizza | right | f | 32.3 | 1 |
| 2 | computer monitor | NaN | f | 44.4 | 0 |
| 3 | mouse keyboard | left | m | 55.5 | 0 |


In [22]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

p0 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='')), 
    ('reshape', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
    ('vectorize', CountVectorizer())
])
p1 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')), 
    ('ohe', OneHotEncoder(drop=['left']))
])
p2 = Pipeline(steps=[('ohe', OneHotEncoder(drop=['f']))])
p4 = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scale', StandardScaler())
])

t = ColumnTransformer([
    ('text', p0, [0]),
    ('hand', p1, [1]), 
    ('gender', p2, [2]),
    ('age', p4, [3])
], remainder='drop')


T = t.fit_transform(df)

In [23]:
t_fields = t.named_transformers_['text'].named_steps['vectorize'].get_feature_names()
h_fields = list(t.named_transformers_['hand'].named_steps['ohe'].get_feature_names())
g_fields = list(t.named_transformers_['gender'].named_steps['ohe'].get_feature_names())
o_fields = ['age']

fields = t_fields + h_fields + g_fields + o_fields
fields

['apple',
 'computer',
 'greens',
 'keyboard',
 'monitor',
 'mouse',
 'orange',
 'pizza',
 'potato',
 'tomato',
 'x0_right',
 'x0_m',
 'age']

In [24]:
pd.DataFrame(T, columns=fields)

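| | apple | computer | greens | keyboard | monitor | mouse | orange | pizza | potato | tomato | x0_right | x0_m | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.308967 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | -0.502835 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.462927 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.348874 |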


### With model

In [25]:
p0 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='')), 
    ('reshape', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
    ('vectorize', CountVectorizer())
])
p1 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')), 
    ('ohe', OneHotEncoder(drop=['left']))
])
p2 = Pipeline(steps=[('ohe', OneHotEncoder(drop=['f']))])
p4 = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scale', StandardScaler())
])

t = ColumnTransformer([
    ('text', p0, [0]),
    ('hand', p1, [1]), 
    ('gender', p2, [2]),
    ('age', p4, [3])
], remainder='drop')

m = Pipeline(steps=[
    ('preprocess', t),
    ('regressor', LogisticRegression())
])

X, y = df[[c for c in df.columns if c != 'y']], df['y']

m.fit(X, y);

In [26]:
m.predict_proba(X)[:,1]

array([0.80466472, 0.78462158, 0.24182963, 0.16886457])

In [27]:
pd.concat([
    pd.Series(m.named_steps['regressor'].intercept_, ['intercept']),
    pd.Series(m.named_steps['regressor'].coef_[0], fields)
])

intercept   -0.333244
apple        0.195329
computer    -0.241834
greens       0.215375
keyboard    -0.168860
monitor     -0.241834
mouse       -0.168860
orange       0.195329
pizza        0.410704
potato       0.215375
tomato       0.215375
x0_right     0.215375
x0_m         0.026470
age         -0.703700
dtype: float64
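
The intercept and coefficients listed above connect to the predicted probabilities through the logistic function: the probability of class 1 is the sigmoid of the intercept plus the weighted sum of the transformed features. The sketch below checks this on a tiny made-up dataset (the `_demo` arrays are assumptions, not the notebook's data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up binary problem with two numeric features.
X_demo = np.array([
    [0.0, 1.0], [1.0, 0.5], [2.0, 0.0],
    [3.0, 1.5], [4.0, 1.0], [5.0, 2.0],
])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score: intercept + sum(coefficient * feature value).
z = clf.intercept_[0] + X_demo @ clf.coef_[0]

# The logistic (sigmoid) function maps the score to P(y = 1).
p_manual = 1.0 / (1.0 + np.exp(-z))

# The manual computation should agree with predict_proba for the positive class.
print(np.allclose(p_manual, clf.predict_proba(X_demo)[:, 1]))
```

Because the score is computed on the transformed features, coefficients such as the one for `age` in the table above apply to the standardized value, not the raw age.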

### With grid search

In [28]:
N = 10
df = pd.DataFrame({
    'text': ['pizza apple orange', 'potato tomato greens pizza', 'computer monitor', 'mouse keyboard'] * N,
    'hand': ['left', 'right', np.nan, 'left'] * N,
    'gender': ['m', 'f', 'f', 'm'] * N,
    'age': [22.2, 32.3, 44.4, 55.5] * N,
    'y': [1, 1, 0, 0] * N
})

X, y = df[[c for c in df.columns if c != 'y']], df['y']

df.shape, X.shape, y.shape

((40, 5), (40, 4), (40,))

In [29]:
p0 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='')), 
    ('reshape', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
    ('vectorize', CountVectorizer())
])
p1 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')), 
    ('ohe', OneHotEncoder(drop=['left']))
])
p2 = Pipeline(steps=[('ohe', OneHotEncoder(drop=['f']))])
p4 = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scale', StandardScaler())
])

t = ColumnTransformer([
    ('text', p0, [0]),
    ('hand', p1, [1]), 
    ('gender', p2, [2]),
    ('age', p4, [3])
], remainder='drop')

e = Pipeline(steps=[
    ('preprocess', t),
    ('regressor', LogisticRegression())
])

cv = StratifiedKFold(**{
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
})

m = GridSearchCV(**{
    'estimator': e,
    'cv': cv,
    'param_grid': {
        'regressor__random_state': [29, 37]
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
})

m.fit(X, y);

Fitting 5 folds for each of 2 candidates, totalling 10 fits


In [30]:
m.predict_proba(X)[:,1]

array([0.95717104, 0.94940778, 0.06120815, 0.03221605, 0.95717104,
       0.94940778, 0.06120815, 0.03221605, 0.95717104, 0.94940778,
       0.06120815, 0.03221605, 0.95717104, 0.94940778, 0.06120815,
       0.03221605, 0.95717104, 0.94940778, 0.06120815, 0.03221605,
       0.95717104, 0.94940778, 0.06120815, 0.03221605, 0.95717104,
       0.94940778, 0.06120815, 0.03221605, 0.95717104, 0.94940778,
       0.06120815, 0.03221605, 0.95717104, 0.94940778, 0.06120815,
       0.03221605, 0.95717104, 0.94940778, 0.06120815, 0.03221605])

In [31]:
pd.concat([
    pd.Series(m.best_estimator_.named_steps['regressor'].intercept_, ['intercept']),
    pd.Series(m.best_estimator_.named_steps['regressor'].coef_[0], fields)
])

intercept   -0.796608
apple        0.428280
computer    -0.612042
greens       0.505914
keyboard    -0.322175
monitor     -0.612042
mouse       -0.322175
orange       0.428280
pizza        0.934194
potato       0.505914
tomato       0.505914
x0_right     0.505914
x0_m         0.106105
age         -1.532901
dtype: float64
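
The grid above only varies a parameter of the final step, but the same double-underscore naming reaches through a `ColumnTransformer` into its inner pipelines. The following is a simplified sketch under assumptions: it keeps only the `text` and `age` columns, names the last step `classifier`, selects the text column by name rather than by position, and tunes `binary` and `C`, none of which comes from the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_df = pd.DataFrame({
    'text': ['pizza apple orange', 'potato tomato greens pizza',
             'computer monitor', 'mouse keyboard'] * 10,
    'age': [22.2, 32.3, 44.4, 55.5] * 10,
    'y': [1, 1, 0, 0] * 10,
})
X_demo, y_demo = demo_df[['text', 'age']], demo_df['y']

preprocess = ColumnTransformer([
    # A scalar column name hands CountVectorizer the 1-D input it expects.
    ('text', Pipeline(steps=[('vectorize', CountVectorizer())]), 'text'),
    ('age', Pipeline(steps=[('scale', StandardScaler())]), ['age']),
], remainder='drop')

demo_pipeline = Pipeline(steps=[
    ('preprocess', preprocess),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Names chain as <pipeline step>__<transformer name>__<inner step>__<param>.
param_grid = {
    'preprocess__text__vectorize__binary': [False, True],
    'classifier__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    estimator=demo_pipeline,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=37),
    scoring='roc_auc',
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Tuning preprocessing and model parameters in one grid keeps every candidate's preprocessing fitted only on the training folds.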