# Hyperparameter Tuning

Scikit-learn offers several approaches to tuning the hyperparameters of a model. Let's take a look at how we can use `GridSearchCV` to search exhaustively over a grid of possible hyperparameter combinations.

## Create data

Let's create a dummy binary classification dataset.

In [1]:
import numpy as np
from sklearn.datasets import make_classification

np.random.seed(37)

X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 20,
    'n_informative': 2,
    'n_redundant': 2,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 2,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

X shape = (2000, 20), y shape (2000,)


## Tuning Logistic Regression

Let's try to tune a logistic regression model. The logistic regression model will be referred to as the `estimator`; it is this estimator's hyperparameters that we want to optimize. When tuning hyperparameters, we also need a way to split the data, and here we will use `StratifiedKFold`. Another important input to the grid search is the `param_grid` argument, a dictionary mapping each hyperparameter name to the list of values to try. Here, our search space is simple: it covers only `C`, the inverse of the regularization strength (smaller values mean stronger regularization). Lastly, we need an optimization criterion, which we specify through the [scoring argument](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). Because we pass more than one metric, the `refit` argument names the metric used to select the best candidate.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

p = {
    'solver': 'sag',
    'penalty': 'l2',
    'random_state': 37,
    'max_iter': 100
}
estimator = LogisticRegression(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)

Fitting 5 folds for each of 11 candidates, totalling 55 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  55 | elapsed:    2.9s remaining:    1.5s
[Parallel(n_jobs=-1)]: Done  48 out of  55 | elapsed:    2.9s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done  55 out of  55 | elapsed:    2.9s finished


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=LogisticRegression(random_state=37, solver='sag'),
             n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                               0.9, 1.0]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
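
The fit log above reports 11 candidates and 55 fits. As a minimal sketch of where those numbers come from, `ParameterGrid` can enumerate the same search space; the names `candidates`, `n_candidates` and `n_fits` are introduced here only for illustration. The count covers the cross-validation fits only; the final refit on the full data is one additional fit.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the logistic regression example above.
param_grid = {
    'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}
n_splits = 5  # number of folds used by StratifiedKFold above

# Each candidate is one dictionary of hyperparameter values.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)
n_fits = n_candidates * n_splits

print(f'{n_candidates} candidates x {n_splits} folds = {n_fits} fits')
print(candidates[:3])
```

With more than one hyperparameter in the grid, `ParameterGrid` yields the Cartesian product of the value lists, which is why grids grow quickly as hyperparameters are added.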

The `best_params_` attribute gives the best combination of hyperparameters.

In [3]:
model.best_params_

{'C': 0.4}

The `best_score_` attribute gives the mean cross-validated score of the best candidate for the metric named in `refit` (here, `auc`).

In [4]:
model.best_score_

0.9644498503712592

To retrieve the best estimator found by the search, refit on the full dataset with the best hyperparameters, access `best_estimator_`.

In [5]:
model.best_estimator_

LogisticRegression(C=0.4, random_state=37, solver='sag')
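
The attributes above summarize only the winning candidate. The per-candidate scores for every metric live in `cv_results_`, keyed as `mean_test_<name>` and `rank_test_<name>` for each name in the `scoring` dictionary. The following is a small sketch on a reduced dataset and grid (both chosen here so that it runs quickly, not taken from the example above); `search` is an illustrative variable name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Smaller, assumed dataset and grid so the sketch runs quickly.
X, y = make_classification(n_samples=500, n_features=20, random_state=37)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=37),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring={'auc': 'roc_auc', 'apr': 'average_precision'},
    refit='auc',
)
search.fit(X, y)

# One entry per candidate, for each metric in the scoring dictionary.
results = search.cv_results_
for params, auc, apr, rank in zip(results['params'],
                                  results['mean_test_auc'],
                                  results['mean_test_apr'],
                                  results['rank_test_auc']):
    print(params, round(auc, 4), round(apr, 4), rank)

# best_score_ is the mean test score of the refit metric at best_index_.
best_auc = results['mean_test_auc'][search.best_index_]
print(np.isclose(best_auc, search.best_score_))
```

Looking at the full table is useful when several candidates score almost identically, since the reported best candidate may then differ from its neighbors by less than the fold-to-fold variation.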

## Tuning Random Forest

Here, we tune a `RandomForestClassifier` over the number of trees (`n_estimators`) and the split criterion (`criterion`).

In [6]:
from sklearn.ensemble import RandomForestClassifier

p = {
    'random_state': 37
}
estimator = RandomForestClassifier(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'criterion': ['gini', 'entropy']
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of 100 | elapsed:    1.4s remaining:    1.6s
[Parallel(n_jobs=-1)]: Done  90 out of 100 | elapsed:    3.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    3.4s finished


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=RandomForestClassifier(random_state=37), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90,
                                          100]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)

In [7]:
model.best_params_

{'criterion': 'entropy', 'n_estimators': 50}

In [8]:
model.best_score_

0.9763199132478311

In [9]:
model.best_estimator_

RandomForestClassifier(criterion='entropy', n_estimators=50, random_state=37)

## Validation with tuning

In some cases, you might want to validate the hyperparameter tuning itself as a part of your learning process, an approach often called nested cross-validation. Here is an example of how to do so. A few things to note:

- The data generated will be multiclass.
- We will implement custom scorers. The average precision score does not natively handle multi-class labels, so we have to transform the ground-truth labels into one-hot encoded vectors.
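
The one-hot transformation mentioned in the list can be sketched with `label_binarize`. The labels and probabilities below are made-up toy values, and `y_true`, `y_prob` and `scores` are illustrative names; this is one possible way to do the encoding, not necessarily the one used later in this notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Hypothetical toy labels and predicted class probabilities (3 classes).
y_true = np.array([0, 1, 2, 1, 0, 2])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],  # true class is 1, but class 0 gets the highest probability
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Each label becomes a row with a single 1 in the column of its class.
Y = label_binarize(y_true, classes=[0, 1, 2])
print(Y)

# With one-hot labels the problem is scored as a multilabel one.
scores = {
    average: average_precision_score(Y, y_prob, average=average)
    for average in ['macro', 'micro', 'weighted']
}
print(scores)
```

Note that `label_binarize` returns a single column when there are only two classes, so this sketch applies as written to three or more classes.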

Now let's generate some data.

In [10]:
X, y = make_classification(**{
    'n_samples': 1000,
    'n_features': 10,
    'n_clusters_per_class': 1,
    'n_classes': 3,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

X shape = (1000, 10), y shape (1000,)


Below, we create a `model` that is a grid search over a random forest. Note how we use the `make_scorer()` function to create custom scorers.

In [11]:
from sklearn.metrics import roc_auc_score, average_precision_score, make_scorer

def apr_score(y_true, y_pred, average='micro'):
    # One-hot encode the labels so that average_precision_score can treat the
    # problem as multilabel. Assumes the labels are the integers
    # 0..n_classes-1, matching the columns of y_pred (class probabilities).
    # The number of classes comes from y_pred rather than from a global variable.
    n_clazzes = y_pred.shape[1]
    Y = np.eye(n_clazzes, dtype=int)[np.asarray(y_true)]

    return average_precision_score(Y, y_pred, average=average)

def get_model():
    p = {
        'random_state': 37
    }
    estimator = RandomForestClassifier(**p)

    p = {
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    }
    cv = StratifiedKFold(**p)
    
    # Note: `needs_proba=True` matches the scikit-learn release used for this notebook;
    # newer releases replace it with `response_method='predict_proba'`.
    auc_scorer = make_scorer(roc_auc_score, greater_is_better=True, needs_proba=True, multi_class='ovo')
    apr_scorer_macro = make_scorer(apr_score, greater_is_better=True, needs_proba=True, average='macro')
    apr_scorer_micro = make_scorer(apr_score, greater_is_better=True, needs_proba=True, average='micro')
    apr_scorer_weighted = make_scorer(apr_score, greater_is_better=True, needs_proba=True, average='weighted')

    p = {
        'estimator': estimator,
        'cv': cv,
        'param_grid': {
            'n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
            'criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 0,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1
    }
    model = GridSearchCV(**p)
    return model
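
As an alternative to `make_scorer()`, the `scoring` argument also accepts plain callables with the signature `scorer(estimator, X, y)`. The sketch below uses that form; `micro_apr_scorer` is a hypothetical helper written for this illustration, and the dataset and grid are deliberately small so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import label_binarize


def micro_apr_scorer(estimator, X, y):
    """Hypothetical scorer: micro-averaged average precision on one-hot labels."""
    y_prob = estimator.predict_proba(X)
    # Columns of predict_proba follow estimator.classes_, so encode in that order.
    Y = label_binarize(y, classes=estimator.classes_)
    return average_precision_score(Y, y_prob, average='micro')


# Small, assumed multiclass dataset for the sketch.
X, y = make_classification(n_samples=300, n_features=10, n_clusters_per_class=1,
                           n_classes=3, random_state=37)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=37),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=37),
    param_grid={'n_estimators': [5, 10]},
    scoring={'apr_micro': micro_apr_scorer},
    refit='apr_micro',
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

A callable scorer receives the fitted estimator directly, so it can read `classes_` instead of assuming how the labels are numbered.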

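Now we can perform stratified, k-fold cross-validation while incorporating hyperparameter tuning as a part of the validation process. Each outer fold runs its own grid search on the training portion and is then scored on the held-out portion, so the test data never influences the choice of hyperparameters.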

In [13]:
import pandas as pd

results = []

for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=10).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]
    
    model = get_model()
    model.fit(X_tr, y_tr)
    
    y_pred = model.predict_proba(X_te)
    
    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')
    
    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })
    
rdf = pd.DataFrame(results)

In [14]:
rdf.mean()

auc_ovr         0.998297
auc_ovo         0.998296
apr_macro       0.996420
apr_micro       0.996217
apr_weighted    0.996436
dtype: float64
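
The manual loop above can also be expressed by passing the grid search itself to `cross_validate`, which handles the outer splitting. This is a sketch under simplifying assumptions rather than a reproduction of the run above: it uses a smaller dataset and grid, the built-in `roc_auc_ovr` and `roc_auc_ovo` scorers instead of the custom average precision scorers, and the illustrative names `inner_cv`, `outer_cv` and `scores`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Small, assumed multiclass dataset so the sketch runs quickly.
X, y = make_classification(n_samples=300, n_features=10, n_clusters_per_class=1,
                           n_classes=3, random_state=37)

# Inner folds pick the hyperparameters; outer folds estimate performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=37)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=37),
    cv=inner_cv,
    param_grid={'n_estimators': [5, 10], 'criterion': ['gini', 'entropy']},
    scoring='roc_auc_ovr',
)

# Each outer fold clones and refits the whole grid search on its training part.
scores = cross_validate(
    search, X, y, cv=outer_cv,
    scoring={'auc_ovr': 'roc_auc_ovr', 'auc_ovo': 'roc_auc_ovo'},
)

print(scores['test_auc_ovr'].mean(), scores['test_auc_ovo'].mean())
```

The explicit loop remains the more flexible option when metrics have to be computed from the raw predictions, as with the one-hot average precision used above.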