___
<h1> Machine Learning </h1>
<h2> M. Sc. in Electrical and Computer Engineering </h2>
<h3> Instituto Superior de Engenharia / Universidade do Algarve </h3>

[MEEC](https://ise.ualg.pt/en/curso/1477) / [ISE](https://ise.ualg.pt) / [UAlg](https://www.ualg.pt)

Pedro J. S. Cardoso (pcardoso@ualg.pt)
___

# Grid Search
## Simple Grid Search
In this part we'll do a simple grid search by hand.

In [None]:
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

import pandas as pd
import matplotlib.pyplot as plt

Let us start by doing it without a validation set

In [None]:
digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target, 
                                                    random_state=100)

Find the best parameters according to the training and test sets (there is no validation set yet)

In [None]:
best_score = 0

list_of_params = [.001, 0.01, 0.1, 1, 10, 100]

for gamma in list_of_params:
    for C in list_of_params:
        
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        
        score = svm.score(X_test, y_test)
        
        if score > best_score:
            best_model = svm
            best_score = score
            best_params = {'C': C, 'gamma': gamma}

print(f'Over the test set, the best params are: {best_params} with a score of: {best_score}')

## The Danger of Overfitting the Parameters and the Validation Set

We tried many different parameters and selected the combination with the best accuracy on the test set, but this accuracy won’t necessarily carry over to new data. Because we used the test data to adjust the parameters, we can no longer use it to assess how good the model is.

This is the same reason we needed to split the data into training and test sets in the first place; we need an independent dataset to evaluate, one that was not used to create the model.

Let us start by building a training (60%), a testing (20%), and a validation (20%) set

In [None]:
digits = load_digits()

X_train_validate, X_test, y_train_validate, y_test = train_test_split(digits.data,
                                                    digits.target, 
                                                    test_size=.2,
                                                    random_state=0)

X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate,
                                                    y_train_validate, 
                                                    test_size=.25, # this will give 20% of the original set
                                                    random_state=0)
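
As a quick sanity check of the 60/20/20 split described above, the following sketch repeats the two calls to `train_test_split` and prints the fraction of the data in each part. The `fractions` dictionary is a helper introduced here for illustration only; because the set sizes are rounded to whole samples, the fractions are approximate.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
n_total = len(digits.data)

# First split: hold out 20% of all samples as the test set.
X_train_validate, X_test, y_train_validate, y_test = train_test_split(
    digits.data, digits.target, test_size=.2, random_state=0)

# Second split: 25% of the remaining 80% is 20% of the original data.
X_train, X_validate, y_train, y_validate = train_test_split(
    X_train_validate, y_train_validate, test_size=.25, random_state=0)

# Fraction of the full dataset that ends up in each part.
fractions = {name: len(part) / n_total
             for name, part in [('train', X_train),
                                ('validate', X_validate),
                                ('test', X_test)]}

for name, fraction in fractions.items():
    print(f'{name}: {fraction:.1%} of the data')
```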

Do the same grid search...

In [None]:
best_score = 0

list_of_params = [.001, 0.01, 0.1, 1, 10, 100]

for gamma in list_of_params:
    for C in list_of_params:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_validate, y_validate)
        
        if score > best_score:
            best_score = score
            best_params = {'C': C, 'gamma': gamma}
            
print(f'Over the validation set, the best params are: {best_params} with a score of: {best_score}')   

Now we rebuild the model on the combined training and validation set and test it with the test set: How does it behave on the test set?

In [None]:
svm = SVC(**best_params).fit(X_train_validate, y_train_validate)
score = svm.score(X_test, y_test)

print(f'Over the test set the score is: {score}')

Not bad, eh?

## Grid Search with Cross-Validation

For a better estimate of the generalization performance, instead of using a single split into a training and a validation set, we can use cross-validation to evaluate the performance of each parameter combination.
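
To make clear what cross-validation computes for one parameter combination, here is a minimal sketch that runs the folds by hand and compares the result with `cross_val_score`. It assumes the default behaviour of `cross_val_score` for classifiers with an integer `cv` (stratified folds without shuffling); the values `gamma=0.001` and `C=10` are chosen only for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
svm = SVC(gamma=0.001, C=10)  # illustrative parameter values

# Each fold serves once as the validation set; the other folds are used for training.
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, validate_idx in folds.split(X, y):
    model = clone(svm).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[validate_idx], y[validate_idx]))
manual_scores = np.array(manual_scores)

# The library call performs the same loop internally.
library_scores = cross_val_score(svm, X, y, cv=5)

print(f'manual mean:  {manual_scores.mean():.4f}')
print(f'library mean: {library_scores.mean():.4f}')
```

The mean of these per-fold scores is the number that the grid search compares across parameter combinations.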

In [None]:
digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target, 
                                                    random_state=100)

In [None]:
best_score = 0

labels=[]
values=[]

list_of_params = [.001, 0.01, 0.1, 1, 10, 100]

for gamma in list_of_params:
    for C in list_of_params:
        svm = SVC(gamma=gamma, C=C)
        score_array = cross_val_score(estimator=svm, 
                                      X=X_train, 
                                      y=y_train, 
                                      cv=10)
        
        mean_score = score_array.mean()
        
        labels.append(f'C={C}/gamma={gamma}')
        values.append(score_array)
        
        if mean_score > best_score:
            best_score = mean_score
            best_score_array = score_array
            best_params = {'C': C, 'gamma': gamma}

        
        
print(f'best params are {best_params} with a mean score {best_score} \n(the score values were {best_score_array})')

In [None]:
SVC(**best_params).fit(X_train, y_train).score(X_test, y_test)

The full set of values is

In [None]:
values

Let us plot these values

In [None]:
fig, ax = plt.subplots(figsize=(15,5))

ax.boxplot(values)
ax.set_xticklabels(labels=labels, rotation=75)
plt.ylabel("score")

plt.show()


## GridSearchCV
Because grid search with cross-validation is such a commonly used method to adjust parameters, scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator. 

(https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
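
"In the form of an estimator" means that the fitted search object can itself be used for prediction and scoring. The sketch below, which assumes the default `refit=True` and uses a deliberately small grid (`small_grid`, a name introduced here) to keep it quick, checks that scoring through the search object gives the same result as scoring through its `best_estimator_`.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small illustrative grid; the notebook itself uses a larger one.
small_grid = {'C': [1, 10], 'gamma': [0.001, 0.01]}
search = GridSearchCV(SVC(), param_grid=small_grid, cv=3).fit(X_train, y_train)

# With refit=True (the default), the best combination is refitted on all of
# X_train, and predict/score are delegated to that refitted model.
via_search = search.score(X_test, y_test)
via_best = search.best_estimator_.score(X_test, y_test)
predictions = search.predict(X_test[:5])

print(search.best_params_)
print(via_search, via_best)
print(predictions)
```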

In [None]:
list_of_params = [.001, 0.01, 0.1, 1, 10, 100]
param_grid = {
        'C': list_of_params, 
        'gamma': list_of_params
    } 

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target, 
                                                    random_state=0)
clf = GridSearchCV(estimator=SVC(), 
                 param_grid=param_grid, 
                 cv=5, 
                 return_train_score=True).fit(X_train, y_train)

print(f'best score: { clf.best_score_}\n best params: {clf.best_params_}\n score over test: {clf.score(X_test, y_test)}\n\n Best estimator: {clf.best_estimator_}')

Now, we can use `pandas` to have a better look at the results

In [None]:
results = pd.DataFrame(clf.cv_results_)

In [None]:
results.sort_values(by='mean_test_score', ascending=False).head(10)
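
Another way to look at the results is to arrange the mean scores in a table with one axis per parameter. The following is a sketch under the assumption that only `C` and `gamma` are searched; it uses a small illustrative grid so that it runs quickly, and `score_table` is a name introduced here.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small illustrative grid; the notebook itself uses a larger one.
search = GridSearchCV(SVC(),
                      param_grid={'C': [1.0, 10.0], 'gamma': [0.001, 0.01]},
                      cv=3).fit(X_train, y_train)

results = pd.DataFrame(search.cv_results_)

# The parameter columns may be stored as generic objects; cast them to float.
results['param_C'] = results['param_C'].astype(float)
results['param_gamma'] = results['param_gamma'].astype(float)

# Rows are gamma values, columns are C values, cells are mean CV scores.
score_table = results.pivot(index='param_gamma',
                            columns='param_C',
                            values='mean_test_score')

print(score_table.round(3))
```

The largest cell of this table corresponds to `best_score_`, and its row and column give `best_params_`.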

And now we retrain with the best parameters and score the model on the test set, which was not used during the search

In [None]:
clf = SVC(**clf.best_params_).fit(X_train, y_train)
clf.score(X_test, y_test)

## Search over spaces that are not grids

In [None]:
param_grid = [
                {
                    'kernel': ['rbf'], 
                    'C': [1, 10, 100],
                    'gamma': [0.001, 0.01, 0.1]
                },
                {
                    'kernel': ['linear'],
                    'C': [0.001, 0.01, 0.1, 1, 10, 100]
                }
            ]

clf = GridSearchCV(
            estimator=SVC(), 
            param_grid=param_grid, 
            cv=5, 
            return_train_score=True).fit(X_train, y_train)

print(f'best score: { clf.best_score_}\n best params: {clf.best_params_}\n score over test: {clf.score(X_test, y_test)}\n\n Best estimator: {clf.best_estimator_}')
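
Passing a list of dictionaries makes the search cover the union of several smaller grids, which is useful here because `gamma` is not used by the linear kernel. As a small sketch, `ParameterGrid` from scikit-learn can enumerate the candidate combinations for the same `param_grid` without fitting any model; the names `candidates` and `n_linear` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {
        'kernel': ['rbf'],
        'C': [1, 10, 100],
        'gamma': [0.001, 0.01, 0.1]
    },
    {
        'kernel': ['linear'],
        'C': [0.001, 0.01, 0.1, 1, 10, 100]
    }
]

# Every dictionary is expanded into its own grid; the results are concatenated.
candidates = list(ParameterGrid(param_grid))
n_linear = sum(params['kernel'] == 'linear' for params in candidates)

print(f'{len(candidates)} candidates in total')  # 3 * 3 rbf + 6 linear = 15
print(f'{n_linear} of them use the linear kernel')

for params in candidates:
    print(params)
```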