# Machine learning notes. Hyperparameter tuning 

> 'Introduction on how to tune hyperparameters in machine learning models'


- toc: true
- branch: master
- badges: false
- comments: false
- author: Alexandros Giavaras
- categories: [machine-learning, hyperparameters, sklearn, grid-search, random-search, informed-search]

## Overview

Machine learning models typically involve a set of parameters that is learnt during the training process. However, machine learning models also contain parameters that are not learnt during training and must be specified before the training process; these are called hyperparameters. In this notebook, we will review some commonly used methods to establish good hyperparameter values. In particular, we will review:

- Grid search
- Random search
- Informed search

## Hyperparameter tuning

TODO: write intro

Hyperparameters are typically set before the modeling process begins. For example, the number of clusters is a hyperparameter that the application needs to establish before running a clustering algorithm. Thus, the crucial element that distinguishes parameters from hyperparameters is that the former are learnt by the model whilst the latter are set by the application.

In the sequel, we will differentiate hyperparameters into two categories: hyperparameters that affect the model performance and hyperparameters that do not. An example of the latter category is the number of CPU cores that we want to use when training the model. Although this impacts the training time, it does not impact how the model performs on unseen data, in other words the model's performance.
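As a minimal sketch of a hyperparameter that does not affect performance, the snippet below fits the same random forest with one and with two cores. The choice of ```RandomForestClassifier```, the iris dataset and the helper ```fit_and_predict_proba``` are assumptions made purely for illustration; with a fixed ```random_state``` we expect the predicted probabilities to agree up to floating-point rounding.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


def fit_and_predict_proba(n_jobs):
    """Hypothetical helper: train with the given number of cores."""
    model = RandomForestClassifier(
        n_estimators=50, random_state=0, n_jobs=n_jobs)
    model.fit(X_train, y_train)
    return model.predict_proba(X_test)


proba_one_core = fit_and_predict_proba(n_jobs=1)
proba_two_cores = fit_and_predict_proba(n_jobs=2)

# n_jobs only changes how the work is scheduled, not what is learnt
same_predictions = np.allclose(proba_one_core, proba_two_cores)
print("Same predicted probabilities:", same_predictions)
```

Only hyperparameters of the first category need to be part of a search; settings such as ```n_jobs``` can be chosen according to the available hardware.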

### Setting hyperparameter values

Now that we have the needed definitions out of the way, we turn our attention to the main topic of this notebook: how do we set the optimal values for the hyperparameters we need to tune? Unfortunately, there is no clear answer to this question, since hyperparameters are specific to each algorithm. However, there are some general guidelines and tips that we can follow. Let's review some of the top tips.

First, we need to identify which hyperparameter values are in conflict. For example, if we are using the ```sklearn``` logistic regression model, the ```solver``` and ```penalty``` parameters have options that are in conflict; the ```elasticnet``` penalty is only supported by the ```saga``` solver.
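One way to keep conflicting values apart is to describe the search space as a list of dictionaries, so that each dictionary only contains compatible options. The sketch below uses ```ParameterGrid``` to enumerate the candidates; the specific values for ```C``` and ```l1_ratio``` are assumptions chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Each dictionary is expanded separately, so 'elasticnet' is only
# ever paired with the 'saga' solver.
param_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0, 10.0]},
    {"solver": ["saga"], "penalty": ["elasticnet"],
     "l1_ratio": [0.2, 0.5, 0.8], "C": [0.1, 1.0, 10.0]},
]

candidates = list(ParameterGrid(param_grid))
print("Number of candidates:", len(candidates))

# Sanity check: no candidate combines 'elasticnet' with another solver
conflicts = [c for c in candidates
             if c["penalty"] == "elasticnet" and c["solver"] != "saga"]
print("Conflicting candidates:", len(conflicts))
```

The same list-of-dictionaries form can be passed as ```param_grid``` to ```GridSearchCV```.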

Another point to be aware of is that some hyperparameter values are simply silly. For example, setting the number of clusters equal to one, or equal to the number of points in the dataset, when performing K-means clustering does not sound very meaningful. Similarly, setting the number of neighbors in a kNN algorithm equal to one is not very wise, since each prediction then depends on a single, possibly noisy, training point.
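To see why the two extreme choices for K-means carry no information, consider the following sketch on a small synthetic dataset (the random points and the variable names are assumptions for illustration only). With one cluster every point receives the same label, whereas with one cluster per point the within-cluster sum of squares is essentially zero regardless of the data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic dataset: 20 random points in the plane
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))

# Extreme 1: a single cluster, all points share one label
one_cluster = KMeans(n_clusters=1, n_init=10, random_state=0).fit(points)

# Extreme 2: as many clusters as points, each point is its own cluster
per_point = KMeans(n_clusters=len(points), n_init=10,
                   random_state=0).fit(points)

print("Distinct labels with k=1:", np.unique(one_cluster.labels_))
print("Distinct labels with k=n:", len(np.unique(per_point.labels_)))
print("Inertia with k=n:", per_point.inertia_)
```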

We will review the following methods:

- Learning curves (accuracy vs hyperparameter value) 
- Grid search
- Random search
- Informed search

#### Learning curves

TODO....Learning curves can be used when we have one hyperparameter to tune.
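A minimal sketch of such a curve is given below. Note that ```sklearn``` calls a plot of score against a hyperparameter value a validation curve and provides ```validation_curve``` for it. The decision tree, the iris dataset and the range of ```max_depth``` values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
depths = [1, 2, 3, 4, 5]

# One row per hyperparameter value, one column per cross-validation fold
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy")

mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for depth, train_acc, valid_acc in zip(depths, mean_train, mean_valid):
    print(f"max_depth={depth}: train={train_acc:.3f}, "
          f"validation={valid_acc:.3f}")
```

Plotting the two mean curves against ```depths``` gives the accuracy versus hyperparameter value picture mentioned above.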

#### Grid search

Grid search performs an exhaustive search over specified parameter values for an estimator. With two hyperparameters, we can visualize this as a two-dimensional grid; at each point of the grid, a different combination of parameter values is examined. For example, consider an artificial model with three hyperparameters $a, b$ and $c$. Each of these parameters has the following value sets: $a \in [a_1, a_2, a_3], b \in [b_1, b_2], c \in [c_1, c_2, c_3]$. Grid search performs an exhaustive search by forming all $3 \times 2 \times 3 = 18$ possible triplets and fitting the model using each of them. Let's see how to perform grid search in ```sklearn```. Overall, the steps of using grid search in ```sklearn``` are as follows:

- Choose the algorithm whose hyperparameters we want to tune (```estimator```)
- Define which hyperparameters to tune (```param_grid```)
- Define the range of values for each hyperparameter 
- Decide on the cross-validation scheme to use (```cv```)
- Define the score function to be used when deciding which model is the best (```scoring```)
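Before turning to a full example, the enumeration of triplets for the artificial model with hyperparameters $a, b$ and $c$ can be sketched with ```ParameterGrid```. The string placeholders such as ```"a1"``` are hypothetical stand-ins for actual values.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder value sets for the three artificial hyperparameters
grid = ParameterGrid({
    "a": ["a1", "a2", "a3"],
    "b": ["b1", "b2"],
    "c": ["c1", "c2", "c3"],
})

# Every combination of one value per hyperparameter is a candidate
triplets = [(p["a"], p["b"], p["c"]) for p in grid]
print("Number of triplets:", len(triplets))
```

The number of candidates is the product of the sizes of the value sets, so the cost of grid search grows quickly as more hyperparameters or values are added.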

The following example, taken from the <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py">scikit-learn documentation</a>, shows how to use grid search.

In [38]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

In [33]:
# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images and
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

In [34]:
# Candidate hyperparameter values: one grid per kernel
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]


In [35]:
clf = GridSearchCV(estimator=SVC(), param_grid=tuned_parameters, scoring='precision_macro')
clf.fit(X_train, y_train)    

GridSearchCV(estimator=SVC(),
             param_grid=[{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
                          'kernel': ['rbf']},
                         {'C': [1, 10, 100, 1000], 'kernel': ['linear']}],
             scoring='precision_macro')

In [36]:
print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
    
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"  % (mean, std * 2, params))

print()

print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()

Best parameters set found on development set:

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.959 (+/-0.028) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.983 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.983 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.974 (+/-0.012) for {'C': 1, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 10, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 100, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             

The output from ```GridSearchCV``` can be categorized into three different groups:

- Results log: ```cv_results_```
- Best results: ```best_index_```, ```best_params_``` and ```best_score_```
- Other extra information such as ```refit_time_```, and ```scorer_```
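The sketch below shows how these three groups relate to each other. The small grid and the iris dataset are assumptions used only to keep the example fast; the attribute names are those listed above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [1, 10], "kernel": ["linear", "rbf"]},
    cv=5, scoring="accuracy")
search.fit(X, y)

# Results log: one entry per candidate combination
results = search.cv_results_
print("Candidates evaluated:", len(results["params"]))

# Best results: best_index_ points into the arrays of cv_results_
best = search.best_index_
print("Best params:", results["params"][best])
print("Best mean CV score:", results["mean_test_score"][best])

# Extra information: time spent refitting the best model on all the data
print("Refit time (s):", search.refit_time_)
```

In other words, ```best_params_``` and ```best_score_``` are shortcuts for the row of ```cv_results_``` selected by ```best_index_```.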

#### Random search

TODO

#### Informed search

TODO

## References

1. ```Hyperparameter tuning in Python``` course from <a href="https://www.datacamp.com/">Datacamp</a> 