
## Finding the best model and hyperparameters using GridSearchCV
**For the iris flower dataset in the sklearn library, we are going to find the best model and the best hyperparameters using GridSearchCV.**

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from sklearn import svm, datasets
iris=datasets.load_iris()

In [2]:
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df["flower"]=iris.target
df["flower"]=df["flower"].apply(lambda x: iris.target_names[x])
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


### Approach 1: Use train_test_split and manually tune parameters by trial and error

In [3]:
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

In [4]:
model=svm.SVC(kernel="rbf",C=10, gamma="auto")
model.fit(x_train,y_train)

SVC(C=10, gamma='auto')

In [5]:
model.score(x_test,y_test)

0.9666666666666667

The issue with this approach is that the score depends both on the hyperparameters and on which samples happen to land in the train and test sets, so we cannot tell whether a given run is really the best fit. To overcome this we use the **k-fold cross validation technique**.
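To make the split dependence concrete, the sketch below repeats the same fit with several random splits while keeping the hyperparameters fixed. The `random_state` seeds and the `split_scores` name are assumptions chosen for illustration; they are not part of the original notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Same model and hyperparameters every time; only the random split changes.
split_scores = []
for seed in range(5):  # arbitrary seeds, used only to get different splits
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed
    )
    model = svm.SVC(kernel="rbf", C=10, gamma="auto")
    model.fit(x_train, y_train)
    split_scores.append(model.score(x_test, y_test))

print(split_scores)
```

If the printed scores differ from one seed to the next, that spread comes from the split alone, which is why a single train/test score is a weak basis for comparing hyperparameters.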

### Approach 2: Use K Fold Cross validation
Manually try supplying models with different parameters to the cross_val_score function with 5-fold cross validation. cross_val_score returns the score for each fold.

In [6]:
from sklearn.model_selection import cross_val_score

In [7]:
cross_val_score(svm.SVC(kernel="rbf", C=10, gamma="auto"), iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [8]:
cross_val_score(svm.SVC(kernel="linear", C=10, gamma="auto"), iris.data, iris.target, cv=5)

array([1.        , 1.        , 0.9       , 0.96666667, 1.        ])

In [9]:
cross_val_score(svm.SVC(kernel="rbf", C=5, gamma="auto"), iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

Here we ran cross_val_score with 5 folds (cv=5) and tried different values of **C** and **kernel**.

**The above approach is tiresome and very manual. We can use a for loop as an alternative.**

In [10]:
import numpy as np
kernels=['rbf', 'linear']
c=[1,10,20]
avg_score={}

for kval in kernels:
  for cval in c:
    cv_score=cross_val_score(svm.SVC(kernel=kval, C=cval, gamma="auto"), iris.data, iris.target, cv=5)
    avg_score[kval + "_"+ str(cval)]=np.average(cv_score)

avg_score

{'rbf_1': 0.9800000000000001,
 'rbf_10': 0.9800000000000001,
 'rbf_20': 0.9666666666666668,
 'linear_1': 0.9800000000000001,
 'linear_10': 0.9733333333333334,
 'linear_20': 0.9666666666666666}

This way we can find the best-scoring combination of hyperparameters. But this approach has its own problem: with four parameters we would need four nested loops, the number of iterations grows quickly, and the code becomes inconvenient to maintain. Luckily, sklearn provides a class called **GridSearchCV** which does exactly the same thing.
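As a small aside on the nested-loop problem, sklearn's `ParameterGrid` helper enumerates every combination of a parameter dictionary, so the number of loops no longer depends on the number of parameters. This is only a sketch of the enumeration step; the `param_grid` and `combos` names are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the nested for loop: 2 kernels x 3 values of C.
param_grid = {"kernel": ["rbf", "linear"], "C": [1, 10, 20]}

# ParameterGrid yields one dict per combination, whatever the number of keys.
combos = list(ParameterGrid(param_grid))

print(len(combos))
for params in combos:
    print(params)
```

Each dictionary can be unpacked into the estimator, for example `svm.SVC(gamma="auto", **params)`, so adding a fourth parameter means adding one key rather than one more loop.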

### Approach 3: Use GridSearchCV
GridSearchCV does exactly the same thing as the for loop above, but in a single call.

In [11]:
from sklearn.model_selection import GridSearchCV

clf=GridSearchCV(svm.SVC(gamma="auto"), {      # 1. model (estimator)
    "kernel":["linear","rbf"],                 # 2. parameter grid
    "C":[1,5,10]
}, cv=5, return_train_score=False)

clf.fit(iris.data, iris.target)

GridSearchCV(cv=5, estimator=SVC(gamma='auto'),
             param_grid={'C': [1, 5, 10], 'kernel': ['linear', 'rbf']})

In [12]:
#View the results

clf.cv_results_ 

{'mean_fit_time': array([0.00100803, 0.00088663, 0.00070939, 0.00088878, 0.00066977,
        0.00088038]),
 'std_fit_time': array([2.33137142e-04, 3.07100900e-05, 1.22807427e-04, 7.39228944e-05,
        2.87629660e-05, 5.96013259e-05]),
 'mean_score_time': array([0.00051241, 0.00043683, 0.0003561 , 0.00040326, 0.00036163,
        0.00048008]),
 'std_score_time': array([1.65440877e-04, 1.96861573e-05, 2.43227151e-05, 6.96508775e-06,
        2.72291971e-05, 7.34144349e-05]),
 'param_C': masked_array(data=[1, 1, 5, 5, 10, 10],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['linear', 'rbf', 'linear', 'rbf', 'linear', 'rbf'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'linear'},
  {'C': 1, 'kernel': 'rbf'},
  {'C': 5, 'kernel': 'linear'},
  {'C': 5, 'kernel': 'rbf'},
  {'C': 10, 'kernel'

clf.cv_results_ is a dictionary that is hard to read directly, but luckily it can be loaded straight into a pandas data frame.

In [13]:
df=pd.DataFrame(clf.cv_results_)
df

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001008,0.000233,0.000512,0.000165,1,linear,"{'C': 1, 'kernel': 'linear'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.000887,3.1e-05,0.000437,2e-05,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
2,0.000709,0.000123,0.000356,2.4e-05,5,linear,"{'C': 5, 'kernel': 'linear'}",1.0,1.0,0.933333,0.966667,1.0,0.98,0.026667,1
3,0.000889,7.4e-05,0.000403,7e-06,5,rbf,"{'C': 5, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
4,0.00067,2.9e-05,0.000362,2.7e-05,10,linear,"{'C': 10, 'kernel': 'linear'}",1.0,1.0,0.9,0.966667,1.0,0.973333,0.038873,6
5,0.00088,6e-05,0.00048,7.3e-05,10,rbf,"{'C': 10, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1


In [14]:
# keep only the parameter columns and the mean test score
df=df[["param_C","param_kernel","mean_test_score"]]
df

,param_C,param_kernel,mean_test_score
0,1,linear,0.98
1,1,rbf,0.98
2,5,linear,0.98
3,5,rbf,0.98
4,10,linear,0.973333
5,10,rbf,0.98


From this table, five of the six combinations tie at the best mean score of 0.98; only C=10 with the linear kernel scores lower (0.973333).
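Rather than reading ties off the table by eye, the `rank_test_score` column of `cv_results_` can be used to filter the top-ranked rows. The sketch below rebuilds the same search so that it runs on its own; the `search`, `results`, and `top` names are assumptions for illustration.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Same estimator, grid, and cv as the search above.
search = GridSearchCV(
    svm.SVC(gamma="auto"),
    {"kernel": ["linear", "rbf"], "C": [1, 5, 10]},
    cv=5,
)
search.fit(iris.data, iris.target)

results = pd.DataFrame(search.cv_results_)

# rank_test_score == 1 marks every combination tied for the best mean score.
top = results[results["rank_test_score"] == 1][
    ["param_C", "param_kernel", "mean_test_score"]
]
print(top)
```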

In [15]:
# all the properties of GridSearchCV

dir(clf)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_estimator_type',
 '_format_results',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_pairwise',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run_search',
 '_select_best_index',
 '_validate_data',
 'best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 'classes_',
 'cv',
 'cv_results_',
 'decision_function',
 'error_score',
 'estimator',
 'fit',
 'get_params',
 'inverse_transform',
 'multimetric_',
 'n_features_

In [16]:
clf.best_score_

0.9800000000000001

In [17]:
clf.best_params_

{'C': 1, 'kernel': 'linear'}

In [18]:
clf.best_estimator_

SVC(C=1, gamma='auto', kernel='linear')
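Because `refit` defaults to `True`, GridSearchCV retrains the best combination on the full dataset after the search, and calling `predict` on the search object delegates to that refit estimator. The sketch below shows this with a single sample; the `search` and `sample` names are assumptions for illustration, and the measurements are those of the first row in the data frame shown at the top of the notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

search = GridSearchCV(
    svm.SVC(gamma="auto"),
    {"kernel": ["linear", "rbf"], "C": [1, 5, 10]},
    cv=5,
)
search.fit(iris.data, iris.target)

# Sepal length, sepal width, petal length, petal width (cm) of the first row.
sample = [[5.1, 3.5, 1.4, 0.2]]

# predict() uses best_estimator_, which was refit on all of the data.
predicted = search.predict(sample)
print(iris.target_names[predicted[0]])
```

Note that the sample used here was part of the training data, so this only demonstrates the API; it says nothing about accuracy on unseen flowers.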

**Use RandomizedSearchCV to reduce the number of iterations by trying only a random subset of the parameter combinations. This is useful when you have too many parameters to try and training takes a long time. It helps reduce the cost of computation.**

### RandomizedSearchCV

In [19]:
from sklearn.model_selection import RandomizedSearchCV

rscv=RandomizedSearchCV(svm.SVC(gamma="auto"), {
    "kernel":["linear","rbf"],
    "C":[1,10,20]
}, cv=10, return_train_score=False, n_iter=3)

rscv.fit(iris.data,iris.target)

RandomizedSearchCV(cv=10, estimator=SVC(gamma='auto'), n_iter=3,
                   param_distributions={'C': [1, 10, 20],
                                        'kernel': ['linear', 'rbf']})

In [20]:
pd.DataFrame(rscv.cv_results_)[["param_C","param_kernel","mean_test_score"]]

,param_C,param_kernel,mean_test_score
0,20,linear,0.966667
1,10,rbf,0.973333
2,1,linear,0.973333
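Since the three combinations are drawn at random, rerunning the cell can produce a different table. A minimal sketch of making the draw repeatable with the `random_state` parameter follows; the `sample_candidates` helper and the seed value 0 are assumptions for illustration and do not appear in the original notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()
param_distributions = {"kernel": ["linear", "rbf"], "C": [1, 10, 20]}


def sample_candidates(seed):
    """Run the randomized search with a fixed seed and return the tried parameters."""
    search = RandomizedSearchCV(
        svm.SVC(gamma="auto"),
        param_distributions,
        cv=10,
        n_iter=3,           # try 3 of the 6 possible combinations
        random_state=seed,  # fixes which combinations are drawn
    )
    search.fit(iris.data, iris.target)
    return search.cv_results_["params"]


first = sample_candidates(0)
second = sample_candidates(0)

print(first)
print(first == second)
```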


### Different models with different hyperparameters

In [21]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [22]:
model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}

In [23]:
model_params.items()

dict_items([('svm', {'model': SVC(gamma='auto'), 'params': {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}}), ('random_forest', {'model': RandomForestClassifier(), 'params': {'n_estimators': [1, 5, 10]}}), ('logistic_regression', {'model': LogisticRegression(solver='liblinear'), 'params': {'C': [1, 5, 10]}})])

In [29]:
scores=[]

for model_name, mp in model_params.items():
  clf=GridSearchCV(mp["model"], mp["params"], cv=5, return_train_score=False)
  clf.fit(iris.data,iris.target)
  scores.append({
      "model": model_name,
      "best score": clf.best_score_,
      "best parameter": clf.best_params_
  })

df=pd.DataFrame(scores, columns=["model","best score","best parameter"])
df

,model,best score,best parameter
0,svm,0.98,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.96,{'n_estimators': 10}
2,logistic_regression,0.966667,{'C': 5}
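To pick the winner from this summary programmatically, the rows can be sorted by score. The sketch below rebuilds the table from the values printed above so that it runs on its own; the `summary`, `ranked`, and `best_row` names are assumptions for illustration.

```python
import pandas as pd

# Values copied from the summary table above.
summary = pd.DataFrame([
    {"model": "svm", "best score": 0.98,
     "best parameter": {"C": 1, "kernel": "rbf"}},
    {"model": "random_forest", "best score": 0.96,
     "best parameter": {"n_estimators": 10}},
    {"model": "logistic_regression", "best score": 0.966667,
     "best parameter": {"C": 5}},
])

# Highest cross-validated score first.
ranked = summary.sort_values("best score", ascending=False).reset_index(drop=True)
best_row = ranked.iloc[0]

print(best_row["model"], best_row["best parameter"])
```

With these numbers the SVM comes out on top, but the gaps are small, and the random forest score in particular can change between runs because the model is not seeded.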
