
# Homework: Grid search for hyperparameter tuning



* Name: Zeyu Li
* Net ID: zl3719

## Introduction

For models with a single hyperparameter controlling bias-variance (for example: $k$ in $k$ nearest neighbors), we used Scikit-learn's `KFold` cross validation to test a range of values for the hyperparameter, and to select the best one.



When we have *multiple* hyperparameters to tune, we can use `GridSearchCV` to select the best *combination* of them.

For example, in this week's lesson (in the notebook on bias and variance of SVM), we saw three ways to tune the bias-variance of an SVM classifier:

* Changing the kernel
* Changing $C$, the inverse of the regularization penalty weight
* For an RBF kernel, changing $\gamma$, the inverse of the kernel bandwidth


To get the best performance from an SVM classifier, we need to find the best *combination* of these hyperparameters.

This notebook shows how to use `GridSearchCV` to tune an SVM classifier.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

import numpy as np
import pandas as pd

## Get the data

We will work with a subset of the MNIST handwritten digits data. First, we will get the data, and assign a small subset of samples to training and test sets.

In [2]:
from sklearn.datasets import fetch_openml

In [3]:
# Download the 70,000-sample MNIST digits (784 pixel features each) from OpenML
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

In [4]:
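# Randomly draw 10,000 training and 3,000 test samples
# (no random_state is set, so the split changes on every run)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=10000, test_size=3000)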

## Run grid search

Then, we will define a *parameter grid* with all the combinations of hyperparameters that we want to test.

In [41]:
param_grid = [
  {'C': [10000000], 'gamma': [0.0000000000001], 'kernel': ['rbf']},
 ]
param_grid

[{'C': [10000000], 'gamma': [1e-13], 'kernel': ['rbf']}]
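
The grid above holds a single candidate. As a minimal sketch of how a grid with several values per hyperparameter expands into combinations, `ParameterGrid` from `sklearn.model_selection` enumerates the same candidates that `GridSearchCV` would fit. The values in `example_grid` are hypothetical, chosen only to illustrate the expansion; they are not recommended settings.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for illustration only; these values are not tuned.
example_grid = [
    {'C': [0.1, 1, 10], 'kernel': ['linear']},
    {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01], 'kernel': ['rbf']},
]

# Expand the grid into one dictionary per hyperparameter combination.
candidates = list(ParameterGrid(example_grid))
for params in candidates:
    print(params)

# 3 linear candidates + 3 * 2 RBF candidates = 9 candidates
print(len(candidates))
```

Using a list of dictionaries keeps `gamma` out of the linear-kernel candidates, where it has no effect.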

We will pass the parameter grid to a `GridSearchCV`, along with the number of CV folds to use. 

Also, we set:

* `verbose` to a large positive number, so that we get plenty of logging output, and
* `refit` to `True`, so that after testing all of the hyperparameter combinations, it will re-fit an SVM classifier with the hyperparameters that had the best mean validation score.


In [42]:
clf = GridSearchCV(SVC(), param_grid, cv=3, refit=True, verbose=100)
clf.fit(X_train, y_train)

Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] C=10000000, gamma=1e-13, kernel=rbf .............................
[CV] . C=10000000, gamma=1e-13, kernel=rbf, score=0.912, total=  16.7s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   16.7s remaining:    0.0s
[CV] C=10000000, gamma=1e-13, kernel=rbf .............................
[CV] . C=10000000, gamma=1e-13, kernel=rbf, score=0.918, total=  17.1s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   33.8s remaining:    0.0s
[CV] C=10000000, gamma=1e-13, kernel=rbf .............................
[CV] . C=10000000, gamma=1e-13, kernel=rbf, score=0.910, total=  17.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   50.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   50.8s finished


GridSearchCV(cv=3, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [10000000], 'gamma': [1e-13],
                          'kernel': ['rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=100)
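
As a small, self-contained sketch of what `refit=True` provides, the example below runs a grid search on Scikit-learn's bundled 8x8 digits data (an assumption made so the example runs quickly; it is not the MNIST subset used in this notebook) with a hypothetical two-candidate grid. After fitting, `best_estimator_` is an `SVC` trained on all of the data passed to `fit`, using `best_params_`.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small bundled dataset, used here only so the sketch runs in seconds.
X_small, y_small = load_digits(return_X_y=True)

# Hypothetical grid with two candidates.
small_grid = {'C': [1, 10], 'kernel': ['rbf']}

search = GridSearchCV(SVC(), small_grid, cv=3, refit=True)
search.fit(X_small, y_small)

# The re-fitted model uses the winning hyperparameters.
refit_params = search.best_estimator_.get_params()
print(search.best_params_)
print(refit_params['C'] == search.best_params_['C'])
```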

## Review results

Finally, we'll print the results of the cross validation. For each combination of parameters, we can see:

* the validation score for each fold
* the mean validation score
* the standard deviation of the validation score
* the rank, by mean validation score

(In the report, the "test" scores are validation scores.)

In [43]:
pd.DataFrame(clf.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_gamma | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.537126 | 0.098427 | 7.408562 | 0.085891 | 10000000 | 1e-13 | rbf | {'C': 10000000, 'gamma': 1e-13, 'kernel': 'rbf'} | 0.912418 | 0.917792 | 0.909691 | 0.9133 | 0.003366 | 1 |
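
The full `cv_results_` table is wide. A minimal sketch of narrowing it to the per-fold, mean, standard deviation, and rank columns, sorted by rank, follows. It uses the bundled digits data and a hypothetical grid so that it runs on its own; the column names are the ones visible in the output above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_small, y_small = load_digits(return_X_y=True)

# Hypothetical grid, for illustration only.
small_grid = {'C': [0.1, 1, 10], 'gamma': [0.0001, 0.001], 'kernel': ['rbf']}
search = GridSearchCV(SVC(), small_grid, cv=3)
search.fit(X_small, y_small)

results = pd.DataFrame(search.cv_results_)
cols = ['param_C', 'param_gamma',
        'split0_test_score', 'split1_test_score', 'split2_test_score',
        'mean_test_score', 'std_test_score', 'rank_test_score']

# Best candidate (rank 1) first.
summary = results[cols].sort_values('rank_test_score')
print(summary)
```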


## Evaluate performance of the re-fitted model

We can see the "best" parameters, with which the model was re-fitted:

In [44]:
print(clf.best_params_)

{'C': 10000000, 'gamma': 1e-13, 'kernel': 'rbf'}


And we can evaluate the re-fitted model on the test set. (Note that the `GridSearchCV` only used the training set; we have not used the test set at all for model fitting.)

In [45]:
y_pred = clf.predict(X_test)

In [46]:
accuracy_score(y_test, y_pred)

0.9243333333333333
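
As a sketch of the same evaluation step in self-contained form: a fitted `GridSearchCV` with `refit=True` passes `predict` through to the re-fitted estimator, and for a classifier with the default `scoring=None`, its `score` method reports accuracy. The example uses the bundled digits data, a hypothetical grid, and a fixed `random_state`, all of which are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.3, random_state=0)

# Hypothetical grid; only the training split is used for the search.
search = GridSearchCV(SVC(), {'C': [1, 10]}, cv=3, refit=True)
search.fit(X_tr, y_tr)

# Two routes to the same test accuracy.
acc_from_predict = accuracy_score(y_te, search.predict(X_te))
acc_from_score = search.score(X_te, y_te)
print(acc_from_predict, acc_from_score)
```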

## Assignment
 
The results of a `GridSearchCV` are only as good as the combinations of hyperparameters we test in the grid. 
 
* If the range of hyperparameter values is too narrow (it excludes good values), the model accuracy will be lower than it would be with a better choice of hyperparameters.
* If the search space is large with a fine resolution, the grid search will take a very long time.
* If the search space is large with a coarse resolution, we may not find a good combination of hyperparameters.
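
One common way to trade off range against resolution is to space candidate values logarithmically, so that a few values cover several orders of magnitude. The sketch below builds such a grid with `np.logspace` and counts the fits a 3-fold search would need. The ranges are hypothetical and are not a claim about which values work well for this data.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Hypothetical ranges: 5 values of C and 5 values of gamma, one per decade.
C_values = np.logspace(-2, 2, num=5)      # 0.01, 0.1, 1, 10, 100
gamma_values = np.logspace(-4, 0, num=5)  # 0.0001, 0.001, 0.01, 0.1, 1

coarse_grid = {'C': C_values, 'gamma': gamma_values, 'kernel': ['rbf']}

n_candidates = len(ParameterGrid(coarse_grid))
n_folds = 3
# Each candidate is fit once per fold; refit=True adds one final fit.
n_fits = n_candidates * n_folds + 1
print(n_candidates, n_fits)
```

Doubling the resolution in both directions to 10 x 10 values would quadruple the number of candidates, so the cost of a finer grid grows quickly.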

In the demo above, I did not use a good parameter grid. For your assignment, try to improve the parameter grid, and re-run the notebook with your modified parameter grid.

Explain the results. In particular, explain: if *I* were to run your notebook, with exactly the parameter grid you defined, would I be confident that the SVM performance is about as good as it can be? Why?

**Answer:** No. The performance will vary if the same notebook is run again or on a different machine, because the training set and test set are chosen at random. A different training set produces a different fitted model, so the model's performance on the test set will also be different.
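
The run-to-run variation described in this answer comes from the unseeded `train_test_split`. A minimal sketch, on a toy array rather than MNIST, of how passing `random_state` (the value 42 is arbitrary) makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# Same seed twice: the two calls return identical splits.
a_train, a_test, _, _ = train_test_split(
    X_toy, y_toy, train_size=60, test_size=20, random_state=42)
b_train, b_test, _, _ = train_test_split(
    X_toy, y_toy, train_size=60, test_size=20, random_state=42)

same_with_seed = (np.array_equal(a_train, b_train)
                  and np.array_equal(a_test, b_test))
print(same_with_seed)
```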



Also answer the following question: suppose instead of using a `GridSearchCV`, I separately ran one `KFold` cross validation over a range of values of $C$, one over a range of values of $\gamma$, and one for two values of `kernel`. In other words, I would independently select a best value for each hyperparameter. Would this be a good strategy? Why or why not?

**Answer:** I do not think this would be a good strategy, because we cannot independently select a best value for each hyperparameter. The hyperparameters interact: the best value of one depends on the values of the others, so changing just one of them is not guaranteed to improve the performance of the trained model. A better way is to account for this dependence by varying the hyperparameters together and observing the performance of each combination.
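
A minimal sketch of this comparison, under stated assumptions: it uses the bundled digits data (not MNIST), hypothetical candidate values, and an "independent" search that varies one hyperparameter while the other stays at its Scikit-learn default. On the same folds, the joint search can never score lower than the independently chosen pair, because that pair is one of the candidates it evaluates. How large the gap is depends on the data.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Bundled 8x8 digits, scaled to [0, 1]; used only to keep the sketch fast.
X_small, y_small = load_digits(return_X_y=True)
X_small = X_small / 16.0

# Hypothetical candidate values.
C_values = [0.1, 1, 10, 100]
gamma_values = [0.001, 0.01, 0.1, 1]

# Independent search: vary one hyperparameter, leave the other at its default.
best_C = GridSearchCV(SVC(kernel='rbf'), {'C': C_values}, cv=3).fit(
    X_small, y_small).best_params_['C']
best_gamma = GridSearchCV(SVC(kernel='rbf'), {'gamma': gamma_values}, cv=3).fit(
    X_small, y_small).best_params_['gamma']

# Joint search over every (C, gamma) pair.
joint = GridSearchCV(SVC(kernel='rbf'),
                     {'C': C_values, 'gamma': gamma_values}, cv=3)
joint.fit(X_small, y_small)

# Look up the score of the independently chosen pair inside the joint results.
scores = {(p['C'], p['gamma']): s
          for p, s in zip(joint.cv_results_['params'],
                          joint.cv_results_['mean_test_score'])}
independent_score = scores[(best_C, best_gamma)]

print('independent:', (best_C, best_gamma), independent_score)
print('joint:      ', joint.best_params_, joint.best_score_)
```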

Submit the PDF version of the notebook, including your explanation.

