# GridSearchCV

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.

A hyperparameter is a parameter whose value is used to control the learning process.
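
To make the distinction concrete, here is a minimal sketch contrasting hyperparameters (set by us before training) with parameters the model learns from data. The choice of `C=1.0` is only an illustrative assumption, not a recommended value.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()

# Hyperparameters: chosen by us before training starts.
model = svm.SVC(kernel="rbf", C=1.0, gamma="auto")
hyperparams = model.get_params()
print(hyperparams["kernel"], hyperparams["C"], hyperparams["gamma"])

# Learned parameters: estimated from the data during fit().
model.fit(iris.data, iris.target)
print(model.support_vectors_.shape)  # (n_support_vectors, n_features)
```

Changing a hyperparameter means refitting the model, which is why tuning them is a search problem.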

In [1]:
import pandas as pd
import numpy as np
from sklearn import svm, datasets

In [2]:
iris = datasets.load_iris()

In [5]:
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [7]:
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df.head(2)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |


In [8]:
df["flower"] = iris.target
df.head(2)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |


In [9]:
df.shape

(150, 5)

In [10]:
df["flower_name"] = df["flower"].apply(lambda x: iris.target_names[x])
df.head(2)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower | flower_name |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |


In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

### Hyperparameter

Here the hyperparameters `kernel`, `C`, and `gamma` are chosen arbitrarily. We don't know whether these are the best values for this model.

In [13]:
model = svm.SVC(kernel="rbf", C=30, gamma="auto")
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9333333333333333

### random_state & K-Fold CV

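We also didn't set `random_state` in `train_test_split`, so the split, and therefore the score, changes on every run. A single split is an unreliable basis for choosing hyperparameters. K-fold cross-validation addresses this: it splits the data into k folds, evaluates on each fold in turn, and averages the scores.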
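
The sketch below illustrates the point under a few assumptions: the seeds `0` to `4` and the shuffled `StratifiedKFold` splitter are illustrative choices, not part of the original notebook. It collects hold-out scores for several different splits, then computes a 5-fold cross-validated score for the same model.

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

iris = datasets.load_iris()

# Single hold-out split: the score depends on which rows land in the test set.
holdout_scores = []
for seed in range(5):  # illustrative seeds
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed
    )
    model = svm.SVC(kernel="rbf", C=30, gamma="auto").fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

# 5-fold CV: every row is used for testing exactly once, then scores are averaged.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(
    svm.SVC(kernel="rbf", C=30, gamma="auto"), iris.data, iris.target, cv=cv
)

print(holdout_scores)
print(cv_scores.mean())
```

Fixing `random_state` makes a run reproducible, but it does not remove the dependence on one particular split; averaging over folds does.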

In [14]:
from sklearn.model_selection import cross_val_score

In [16]:
kernels = ["linear", "rbf", "poly"]
C = [1, 10, 20]
avg_scores = {}

for k in kernels:
    for cval in C:
        estimator = svm.SVC(kernel=k, C=cval, gamma="auto")
        cv_scores = cross_val_score(estimator, iris.data, iris.target, cv=5)
        avg_scores[f"kernel: {k} - C: {cval}"] = np.average(cv_scores)

avg_scores

{'kernel: linear - C: 1': 0.9800000000000001,
 'kernel: linear - C: 10': 0.9733333333333334,
 'kernel: linear - C: 20': 0.9666666666666666,
 'kernel: rbf - C: 1': 0.9800000000000001,
 'kernel: rbf - C: 10': 0.9800000000000001,
 'kernel: rbf - C: 20': 0.9666666666666668,
 'kernel: poly - C: 1': 0.9666666666666666,
 'kernel: poly - C: 10': 0.9666666666666666,
 'kernel: poly - C: 20': 0.9533333333333334}

#### Observation

We got the best score when:
- kernel=_linear_ & C=_1_
- kernel=_rbf_ & C=_1_
- kernel=_rbf_ & C=_10_

Writing nested `for` loops by hand does not scale: every additional hyperparameter adds another loop level, and the number of fits grows multiplicatively.
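
As an intermediate step, the nested loops can be flattened with `ParameterGrid` from `sklearn.model_selection`, which enumerates every combination as a dictionary. This is only a sketch of the same manual search; keying the scores by a `(kernel, C)` tuple is my own choice here.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import ParameterGrid, cross_val_score

iris = datasets.load_iris()
grid = ParameterGrid({"kernel": ["linear", "rbf", "poly"], "C": [1, 10, 20]})

avg_scores = {}
for params in grid:
    # params is a dict such as {"C": 1, "kernel": "linear"}
    estimator = svm.SVC(gamma="auto", **params)
    cv_scores = cross_val_score(estimator, iris.data, iris.target, cv=5)
    avg_scores[(params["kernel"], params["C"])] = np.mean(cv_scores)

# Pick the combination with the highest mean CV score (first one wins on ties).
best = max(avg_scores, key=avg_scores.get)
print(best, avg_scores[best])
```

This removes the nesting, but the bookkeeping (collecting scores, ranking, refitting) is still manual.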

### GridSearchCV

`sklearn` provides `GridSearchCV` to evaluate every combination in a parameter grid with cross-validation.

In [17]:
from sklearn.model_selection import GridSearchCV

In [19]:
estimator = svm.SVC(gamma="auto")
param_grid = {
    "C": [1, 10, 20],
    "kernel": ["rbf", "linear", "poly", "sigmoid"]
}

clf = GridSearchCV(estimator, param_grid, cv=5)
clf.fit(iris.data, iris.target)

GridSearchCV(cv=5, estimator=SVC(gamma='auto'),
             param_grid={'C': [1, 10, 20],
                         'kernel': ['rbf', 'linear', 'poly', 'sigmoid']})

Let's see what `clf` contains.

In [20]:
dir(clf)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_is_fitted',
 '_check_n_features',
 '_estimator_type',
 '_format_results',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_pairwise',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run_search',
 '_validate_data',
 'best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 'classes_',
 'cv',
 'cv_results_',
 'decision_function',
 'error_score',
 'estimator',
 'fit',
 'get_params',
 'iid',
 'inverse_transform',
 'multimetric_',
 'n_features_in_',
 'n_jobs',
 'n_splits_',
 'param_grid',
 'pre_

Most of these are dunder and private attributes that we don't need here. The attributes of interest are `cv_results_`, `best_params_`, and `best_score_`.

In [21]:
clf.cv_results_

{'mean_fit_time': array([0.0023982 , 0.00140042, 0.00460062, 0.00579367, 0.00120282,
        0.00079565, 0.00679917, 0.00239835, 0.00119948, 0.00099969,
        0.01438723, 0.00220275]),
 'std_fit_time': array([4.89531658e-04, 4.88755724e-04, 4.02848906e-03, 7.10777087e-03,
        3.98220897e-04, 3.97897292e-04, 4.66316855e-03, 4.89706923e-04,
        3.99589709e-04, 2.61174468e-07, 1.27236448e-02, 4.07695882e-04]),
 'mean_score_time': array([0.00119915, 0.00080023, 0.00080037, 0.00120611, 0.00059609,
        0.00059977, 0.00100036, 0.00059972, 0.00079932, 0.00059962,
        0.00119977, 0.00079546]),
 'std_score_time': array([4.00114130e-04, 4.00115153e-04, 4.00188266e-04, 4.04217292e-04,
        4.86754390e-04, 4.89706738e-04, 1.18155591e-06, 4.89667827e-04,
        3.99661217e-04, 4.89589937e-04, 4.00043738e-04, 3.97811687e-04]),
 'param_C': masked_array(data=[1, 1, 1, 1, 10, 10, 10, 10, 20, 20, 20, 20],
              mask=[False, False, False, False, False, False, False, False,
  

`cv_results_` is a dictionary, so we can load it into a DataFrame to make it easier to read.

In [23]:
result_df = pd.DataFrame(clf.cv_results_)
result_df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002398 | 0.0004895317 | 0.001199 | 0.0004 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 1 | 0.0014 | 0.0004887557 | 0.0008 | 0.0004 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 2 | 0.004601 | 0.004028489 | 0.0008 | 0.0004 | 1 | poly | {'C': 1, 'kernel': 'poly'} | 1.0 | 1.0 | 0.9 | 0.933333 | 1.0 | 0.966667 | 0.042164 | 6 |
| 3 | 0.005794 | 0.007107771 | 0.001206 | 0.000404 | 1 | sigmoid | {'C': 1, 'kernel': 'sigmoid'} | 0.333333 | 0.1 | 0.0 | 0.033333 | 0.0 | 0.093333 | 0.125433 | 10 |
| 4 | 0.001203 | 0.0003982209 | 0.000596 | 0.000487 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 5 | 0.000796 | 0.0003978973 | 0.0006 | 0.00049 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 1.0 | 1.0 | 0.9 | 0.966667 | 1.0 | 0.973333 | 0.038873 | 4 |
| 6 | 0.006799 | 0.004663169 | 0.001 | 1e-06 | 10 | poly | {'C': 10, 'kernel': 'poly'} | 1.0 | 1.0 | 0.9 | 0.933333 | 1.0 | 0.966667 | 0.042164 | 6 |
| 7 | 0.002398 | 0.0004897069 | 0.0006 | 0.00049 | 10 | sigmoid | {'C': 10, 'kernel': 'sigmoid'} | 0.333333 | 0.1 | 0.0 | 0.033333 | 0.0 | 0.093333 | 0.125433 | 10 |
| 8 | 0.001199 | 0.0003995897 | 0.000799 | 0.0004 | 20 | rbf | {'C': 20, 'kernel': 'rbf'} | 0.966667 | 1.0 | 0.9 | 0.966667 | 1.0 | 0.966667 | 0.036515 | 5 |
| 9 | 0.001 | 2.611745e-07 | 0.0006 | 0.00049 | 20 | linear | {'C': 20, 'kernel': 'linear'} | 1.0 | 1.0 | 0.9 | 0.933333 | 1.0 | 0.966667 | 0.042164 | 6 |


In [25]:
result_df.columns

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_C', 'param_kernel', 'params', 'split0_test_score',
       'split1_test_score', 'split2_test_score', 'split3_test_score',
       'split4_test_score', 'mean_test_score', 'std_test_score',
       'rank_test_score'],
      dtype='object')

`mean_test_score` is the column we need. We are looking for the highest value in that column.

In [27]:
result_df[["param_kernel", "param_C", "mean_test_score"]]

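| | param_kernel | param_C | mean_test_score |
|---|---|---|---|
| 0 | rbf | 1 | 0.98 |
| 1 | linear | 1 | 0.98 |
| 2 | poly | 1 | 0.966667 |
| 3 | sigmoid | 1 | 0.093333 |
| 4 | rbf | 10 | 0.98 |
| 5 | linear | 10 | 0.973333 |
| 6 | poly | 10 | 0.966667 |
| 7 | sigmoid | 10 | 0.093333 |
| 8 | rbf | 20 | 0.966667 |
| 9 | linear | 20 | 0.966667 |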
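
Rather than scanning the table by eye, the results can be sorted by `rank_test_score`. The following is a self-contained sketch that repeats the grid search from above; the variable name `ranked` is my own.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
param_grid = {"C": [1, 10, 20], "kernel": ["rbf", "linear", "poly", "sigmoid"]}

clf = GridSearchCV(svm.SVC(gamma="auto"), param_grid, cv=5)
clf.fit(iris.data, iris.target)

result_df = pd.DataFrame(clf.cv_results_)
columns = ["param_kernel", "param_C", "mean_test_score", "rank_test_score"]

# Rank 1 is the best mean test score; tied candidates share a rank.
ranked = result_df[columns].sort_values("rank_test_score")
print(ranked.head(3))
```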


In [28]:
clf.best_score_

0.9800000000000001

In [29]:
clf.best_params_

{'C': 1, 'kernel': 'rbf'}
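
With the default `refit=True`, `GridSearchCV` also refits the best combination on all the data it was given and exposes it as `best_estimator_`. The sketch below assumes we hold out a test set first (`random_state=0` and `stratify` are illustrative choices) so that the final score comes from rows the search never saw.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

param_grid = {"C": [1, 10, 20], "kernel": ["rbf", "linear", "poly"]}
search = GridSearchCV(svm.SVC(gamma="auto"), param_grid, cv=5)
search.fit(X_train, y_train)  # tune on the training portion only

# best_estimator_ is an SVC already refit with the winning hyperparameters.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```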

### Computation cost of `GridSearchCV`

So far we have tried only a few values of `C`. If we tried 50 values or more, the grid search would also become very costly, because every combination is fitted once per fold.

```py
param_grid = {
    "C": list(range(1, 51))
}
```
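
To get a rough sense of the cost, the number of model fits can be counted before running anything. This sketch assumes the 50 values of `C` are combined with the four kernels from the earlier grid and 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Assumed grid: 50 values of C combined with the four kernels used earlier.
param_grid = {
    "C": list(range(1, 51)),
    "kernel": ["rbf", "linear", "poly", "sigmoid"],
}

n_candidates = len(ParameterGrid(param_grid))  # 50 * 4 combinations
n_folds = 5
n_fits = n_candidates * n_folds  # excludes the final refit on the full data

print(n_candidates, n_fits)
```

Each extra hyperparameter multiplies `n_candidates` again, which is what makes exhaustive search expensive.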

`sklearn` offers another method to tackle this problem: `RandomizedSearchCV`.

### `RandomizedSearchCV`

In [31]:
from sklearn.model_selection import RandomizedSearchCV

In [33]:
estimator = svm.SVC(gamma="auto")
param_distribution = {
    "C": [1, 10, 20],
    "kernel": ["rbf", "linear"]
}

rs = RandomizedSearchCV(estimator, param_distribution, cv=5, n_iter=2)
rs.fit(iris.data, iris.target)

RandomizedSearchCV(cv=5, estimator=SVC(gamma='auto'), n_iter=2,
                   param_distributions={'C': [1, 10, 20],
                                        'kernel': ['rbf', 'linear']})

The main difference is `n_iter`:
- `GridSearchCV` evaluates every combination in the grid.
- `RandomizedSearchCV` evaluates only `n_iter` randomly sampled combinations.
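
`RandomizedSearchCV` also accepts distributions instead of fixed lists, so it can sample values that a coarse grid would miss. The sketch below is a variation on the cell above, not the original setup: the `loguniform(1e-1, 1e2)` range for `C`, `n_iter=5`, and `random_state=0` are all assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

param_distributions = {
    "C": loguniform(1e-1, 1e2),   # sample C on a log scale between 0.1 and 100
    "kernel": ["rbf", "linear"],  # lists are sampled uniformly
}

rs = RandomizedSearchCV(
    svm.SVC(gamma="auto"),
    param_distributions,
    n_iter=5,        # only 5 sampled combinations are evaluated
    cv=5,
    random_state=0,  # makes the sampled combinations reproducible
)
rs.fit(iris.data, iris.target)

print(rs.best_params_, rs.best_score_)
```

Setting `random_state` matters here for the same reason as in `train_test_split`: without it, each run samples different combinations.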

In [34]:
rs_df = pd.DataFrame(rs.cv_results_)
rs_df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kernel | param_C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002002 | 0.001549 | 0.000995 | 0.000633 | linear | 20 | {'kernel': 'linear', 'C': 20} | 1.0 | 1.0 | 0.9 | 0.933333 | 1.0 | 0.966667 | 0.042164 | 2 |
| 1 | 0.001604 | 0.000486 | 0.000799 | 0.000748 | rbf | 20 | {'kernel': 'rbf', 'C': 20} | 0.966667 | 1.0 | 0.9 | 0.966667 | 1.0 | 0.966667 | 0.036515 | 1 |


In [35]:
rs.best_params_

{'kernel': 'rbf', 'C': 20}

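This method is a good fit when computing resources are limited, because only `n_iter` combinations are fitted. The trade-off is that the best combination may not be sampled: here the best mean score is 0.9667, while the full grid search reached 0.98.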