# Hyperparameter Tuning

Scikit-learn offers several approaches to tuning the hyperparameters of a model. Let's look at how to use `GridSearchCV` to search exhaustively over a grid of possible hyperparameter combinations.

## Create data

Let's create a dummy binary classification dataset.

In [None]:
import numpy as np
from sklearn.datasets import make_classification

np.random.seed(37)

X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 20,
    'n_informative': 2,
    'n_redundant': 2,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 2,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

## Tuning Logistic Regression

Let's try to tune a logistic regression model. The logistic regression model is the `estimator`; its hyperparameters are what we want to optimize. Tuning also requires a way to split the data, and here we use `StratifiedKFold`. Another important input to the grid search is the `param_grid` argument, a dictionary that specifies the search space of each hyperparameter. Our search space is simple: it covers only `C`, the inverse of the regularization strength (smaller values mean stronger regularization). Lastly, we need an optimization criterion, which we specify through the [scoring argument](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). Because we pass more than one scorer, `refit` names the scorer used to select the best model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

p = {
    'solver': 'sag',
    'penalty': 'l2',
    'random_state': 37,
    'max_iter': 100
}
estimator = LogisticRegression(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)

The `best_params_` attribute gives the best combination of hyperparameters.

In [None]:
model.best_params_

The `best_score_` attribute gives the mean cross-validated score of that best combination, measured with the `refit` metric (here, AUC).

In [None]:
model.best_score_

To retrieve the best estimator induced by the search and scoring criteria, access `best_estimator_`.

In [None]:
model.best_estimator_
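
With several scorers, `best_score_` reports only the `refit` metric. The sketch below shows one way to see every metric for every candidate by loading `cv_results_` into a DataFrame. The small dataset and the names `X_demo`, `y_demo`, `search`, `cv_df` and `summary` are illustrative assumptions, chosen so the earlier variables are not overwritten.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small illustrative dataset; the sizes are arbitrary.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=37)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=37),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring={'auc': 'roc_auc', 'apr': 'average_precision'},
    refit='auc',
)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate and one column group per scorer name.
cv_df = pd.DataFrame(search.cv_results_)
summary = cv_df[['param_C', 'mean_test_auc', 'rank_test_auc',
                 'mean_test_apr', 'rank_test_apr']]
print(summary.sort_values('rank_test_auc'))
```

The row at `search.best_index_` is the candidate that `best_params_` and `best_score_` describe.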

## Tuning Random Forest

Here, we tune a `RandomForestClassifier`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

p = {
    'random_state': 37
}
estimator = RandomForestClassifier(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'criterion': ['gini', 'entropy']
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)

In [None]:
model.best_params_

In [None]:
model.best_score_

In [None]:
model.best_estimator_

## Tuning with a pipeline

Our estimator can also be a pipeline. The parameter grid can then cover the hyperparameters of any step, each addressed as `<step name>__<parameter name>`.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
pca = PCA()
rf = RandomForestClassifier(**{
    'random_state': 37
})
pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

cv = StratifiedKFold(**{
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
})

model = GridSearchCV(**{
    'estimator': pipeline,
    'cv': cv,
    'param_grid': {
        'scaler__feature_range': [(0, 1), (0, 2)],
        'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
        'rf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'rf__criterion': ['gini', 'entropy']
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
})

model.fit(X, y)

In [None]:
model.best_params_

In [None]:
model.best_score_

In [None]:
model.best_estimator_
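
The keys used in the pipeline's `param_grid` follow the `<step name>__<parameter name>` convention. As a minimal sketch, the valid keys can be listed with `get_params()` and set directly with `set_params()`; the names `demo_pipeline` and `tunable` are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('pca', PCA()),
    ('rf', RandomForestClassifier(random_state=37)),
])

# Nested parameters are exposed as <step name>__<parameter name>.
tunable = sorted(k for k in demo_pipeline.get_params() if '__' in k)
print(tunable)

# The same names work with set_params, which is what the grid search relies on.
demo_pipeline.set_params(pca__n_components=3, rf__n_estimators=20)
print(demo_pipeline.named_steps['pca'].n_components)
```

A key that is not in this list makes the search fail, so printing the list is a quick check before launching a long run.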

## Validation with tuning

In some cases, you might want to validate the hyperparameter tuning itself as a part of your learning process. This section shows how to do so. Here are some things to note in this example.

- The data generated will be multiclass.
- We will implement custom scorers. The average precision score does not natively handle multiclass labels, so we have to transform the ground truth labels into one-hot encoded vectors.
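
To make the one-hot step concrete, here is a minimal sketch on hand-made data. It uses `label_binarize` rather than the `OneHotEncoder` used later in this section, and the labels and probabilities (`y_true_demo`, `proba_demo`) are made up for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Hypothetical labels for three classes and per-class predicted probabilities.
y_true_demo = np.array([0, 1, 2, 1, 0, 2])
proba_demo = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
])

# One column per class: 1 where the sample belongs to that class, else 0.
Y_demo = label_binarize(y_true_demo, classes=[0, 1, 2])
print(Y_demo.shape)

# With an indicator matrix as ground truth, each averaging mode is available.
scores_demo = {
    avg: average_precision_score(Y_demo, proba_demo, average=avg)
    for avg in ('macro', 'micro', 'weighted')
}
print(scores_demo)
```

Passing `classes` explicitly keeps the column order aligned with the probability columns even if a class is absent from a particular fold.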

Now let's generate some data.

In [None]:
X, y = make_classification(**{
    'n_samples': 1000,
    'n_features': 10,
    'n_clusters_per_class': 1,
    'n_classes': 3,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

Below, we create a `model` that is a grid search over a random forest pipeline. Note how we use the `make_scorer()` function to create custom scorers.

In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score, make_scorer
from sklearn.preprocessing import OneHotEncoder

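def apr_score(y_true, y_pred, average='micro'):
    # One-hot encode the labels so each class becomes its own binary column.
    encoder = OneHotEncoder()
    # toarray() returns a plain ndarray; todense() returns a np.matrix,
    # which recent scikit-learn versions reject.
    Y = encoder.fit_transform(y_true.reshape(-1, 1)).toarray()

    return average_precision_score(Y, y_pred, average=average)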

def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })
    
    auc_scorer = make_scorer(
        roc_auc_score, 
        greater_is_better=True, 
        needs_proba=True, 
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='weighted')

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1), (0, 2)],
            # n_components cannot exceed the number of features (10 here)
            'pca__n_components': [2, 3, 4, 5, 10],
            'rf__n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 5,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1
    })
    return model

Now we can perform stratified k-fold cross-validation with the hyperparameter tuning nested inside each fold: the grid search sees only the training portion, and the held-out portion is used solely to score the tuned model.

In [None]:
import pandas as pd

results = []

for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=10).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]
    
    model = get_model()
    model.fit(X_tr, y_tr)
    
    y_pred = model.predict_proba(X_te)
    
    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')
    
    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })
    
rdf = pd.DataFrame(results)

In [None]:
rdf.mean()
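
When only one metric is needed, the same nested scheme can be written more compactly by passing the grid search itself to `cross_val_score`. This is a minimal sketch, not a replacement for the loop above; the reduced grid, the smaller dataset and the names `inner_search`, `inner_cv`, `outer_cv` and `outer_scores` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Small three-class dataset; the sizes are arbitrary.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_clusters_per_class=1, n_classes=3, random_state=37)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)

# Inner loop: picks hyperparameters using only the outer training portion.
inner_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=37),
    param_grid={'n_estimators': [5, 10]},
    scoring='roc_auc_ovo',
    cv=inner_cv,
)

# Outer loop: refits the whole search per fold and scores it on held-out data.
outer_scores = cross_val_score(
    inner_search, X_demo, y_demo, cv=outer_cv, scoring='roc_auc_ovo')
print(outer_scores.mean())
```

The explicit loop remains the better fit when several metrics must be computed from the same held-out predictions, as is done above.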

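The same validation can be repeated with `TuneGridSearchCV` from the `tune_sklearn` package in place of `GridSearchCV`. Here the search is configured with early stopping and a smaller grid.

In [None]: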
from tune_sklearn import TuneGridSearchCV

def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })
    
    auc_scorer = make_scorer(
        roc_auc_score, 
        greater_is_better=True, 
        needs_proba=True, 
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score, 
        greater_is_better=True, 
        needs_proba=True, 
        average='weighted')

    model = TuneGridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1), (0, 2)],
            'pca__n_components': [2, 3, 4, 5, 6, 7, 8, 9, 10],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 1,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1,
        'early_stopping': 'MedianStoppingRule',
        'max_iters': 10
    })
    return model

In [None]:
results = []

for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=5).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]
    
    model = get_model()
    model.fit(X_tr, y_tr)
    
    y_pred = model.predict_proba(X_te)
    
    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')
    
    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })
    
rdf = pd.DataFrame(results)

In [None]:
rdf.mean()