# Hyperparameter Tuning and Model Selection

## Hyperparameter Tuning
* Hyperparameter tuning is the process of selecting the optimal values for a machine learning model’s hyperparameters.
* Hyperparameters are settings that control the learning process of the model and are chosen before training, such as the learning rate, the number of neurons in a neural network, or the kernel type in a support vector machine.
* The goal of hyperparameter tuning is to find the values that lead to the best performance on a given task.
* To automate this process, we use two methods:
    1) **GridSearchCV**
    2) **RandomizedSearchCV**
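
The difference between the two methods comes down to how many candidate combinations they evaluate. As a minimal sketch, assuming an illustrative SVM search space (the values below are hypothetical), `ParameterGrid` from scikit-learn lists every combination that a grid search would try; a randomized search would instead evaluate only `n_iter` sampled candidates.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space for an SVM classifier (values chosen for illustration).
param_grid = {
    "C": [1, 5, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

# ParameterGrid enumerates every combination: 3 * 2 * 2 = 12 candidates.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))
print(candidates[0])
```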

## Model Selection
**Model selection can be done by passing a dictionary (or JSON-like structure) that maps model names to their estimators and parameter grids to any hyperparameter tuning method, then collecting the best average cross-validation score of each model in a DataFrame.**

In [55]:
import pandas as pd 
from sklearn.datasets import load_iris

iris = load_iris()
print(f"Directory of iris dataset = {dir(iris)}")
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x:iris.target_names[x])
df.head()

Directory of iris dataset = ['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


#### Approach 1: Using train_test_split method

In [56]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(iris.data,iris.target,test_size=0.2)

from sklearn.svm import SVC
svm = SVC(C=10,gamma='auto',kernel='rbf')
svm.fit(x_train,y_train)
svm.score(x_test,y_test)

0.9666666666666667
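
A single `train_test_split` score depends on which rows happen to land in the test set, which is why the later approaches average over several folds. The sketch below repeats the split with a few arbitrary seeds (the seeds and the `split_scores` name are assumptions made for illustration) so the spread in accuracy can be seen directly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

split_scores = []
for seed in range(5):  # arbitrary seeds, chosen only for illustration
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed
    )
    # Same model settings as in Approach 1
    model = SVC(C=10, gamma="auto", kernel="rbf")
    model.fit(x_train, y_train)
    split_scores.append(model.score(x_test, y_test))

# One accuracy value per split; they usually differ from split to split
print(split_scores)
```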

#### Approach 2: Using cross_val_score method

In [57]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

score1 = cross_val_score(SVC(C=1,kernel='linear',gamma='auto'), iris.data, iris.target, cv=5)
score2 = cross_val_score(SVC(C=10,kernel='rbf',gamma='auto'), iris.data, iris.target, cv=5)
score3 = cross_val_score(SVC(C=10,kernel='linear',gamma='auto'), iris.data, iris.target, cv=5)

print(score1)
print(score2)
print(score3)

[0.96666667 1.         0.96666667 0.96666667 1.        ]
[0.96666667 1.         0.96666667 0.96666667 1.        ]
[1.         1.         0.9        0.96666667 1.        ]
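
Comparing raw fold scores by eye gets tedious as the number of settings grows. A minimal sketch of the same idea written as a loop, assuming the same `C` and `kernel` values used elsewhere in this notebook (the `avg_scores` dictionary and its key format are illustrative), averages the five fold scores for each combination. This is essentially what GridSearchCV automates.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()

avg_scores = {}
for kernel in ["linear", "rbf"]:
    for C in [1, 5, 10]:
        # Five accuracy values, one per cross-validation fold
        fold_scores = cross_val_score(
            SVC(C=C, kernel=kernel, gamma="auto"), iris.data, iris.target, cv=5
        )
        # Store the mean so that each combination has a single comparable number
        avg_scores[f"{kernel}_C{C}"] = float(np.mean(fold_scores))

for name, score in avg_scores.items():
    print(name, round(score, 4))
```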


#### Approach 3: Using GridSearchCV

In [58]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

gs = GridSearchCV(SVC(),{
    "C":[1,5,10],
    "kernel":['linear','rbf'],
    "gamma":['scale','auto']
},cv=5,return_train_score=False)

gs.fit(iris.data,iris.target)
results1 = pd.DataFrame(gs.cv_results_)
results1[['param_C','param_kernel','param_gamma','mean_test_score']]

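|   | param_C | param_kernel | param_gamma | mean_test_score |
|---|---|---|---|---|
| 0 | 1 | linear | scale | 0.98 |
| 1 | 1 | rbf | scale | 0.966667 |
| 2 | 1 | linear | auto | 0.98 |
| 3 | 1 | rbf | auto | 0.98 |
| 4 | 5 | linear | scale | 0.98 |
| 5 | 5 | rbf | scale | 0.986667 |
| 6 | 5 | linear | auto | 0.98 |
| 7 | 5 | rbf | auto | 0.98 |
| 8 | 10 | linear | scale | 0.973333 |
| 9 | 10 | rbf | scale | 0.98 |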
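
Instead of scanning the results table, the fitted search object exposes the winning combination directly. The sketch below refits the same grid under a separate name (`grid_search`, chosen here only to keep the example self-contained) and reads `best_params_`, `best_score_`, and `best_estimator_`. With the default `refit=True`, `best_estimator_` is retrained on the full dataset using the best parameters.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

grid_search = GridSearchCV(
    SVC(),
    {"C": [1, 5, 10], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]},
    cv=5,
)
grid_search.fit(iris.data, iris.target)

# Best parameter combination and its mean cross-validated accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)

# The refitted model can be used for prediction right away
best_model = grid_search.best_estimator_
print(best_model.predict(iris.data[:3]))
```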


> * **Use RandomizedSearchCV to reduce the number of iterations by trying only a random subset of the parameter combinations.**
> * **This is useful when there are too many parameter combinations to try or when training takes a long time, because it reduces the computational cost.**

#### Approach 4: Using RandomizedSearchCV method

In [59]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

rs = RandomizedSearchCV(SVC(),{
    "C":[1,5,10],
    "kernel":['linear','rbf'],
    "gamma":['scale','auto']
},cv=5,n_iter=3,return_train_score=False)

rs.fit(iris.data,iris.target)
results2 = pd.DataFrame(rs.cv_results_)
results2[['param_C','param_kernel','param_gamma','mean_test_score']]

|   | param_C | param_kernel | param_gamma | mean_test_score |
|---|---|---|---|---|
| 0 | 1 | linear | auto | 0.98 |
| 1 | 1 | rbf | auto | 0.98 |
| 2 | 5 | linear | auto | 0.98 |
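
RandomizedSearchCV also accepts distributions in place of fixed lists, so `C` can be sampled from a continuous range rather than three hand-picked values. The following is a sketch under assumptions: the `loguniform(0.1, 100)` range, `n_iter=5`, and `random_state=0` are illustrative choices, not values from this notebook. Setting `random_state` makes the sampled combinations repeatable between runs.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

iris = load_iris()

# Lists are sampled uniformly; objects with an rvs() method are sampled as distributions
param_distributions = {
    "C": loguniform(1e-1, 1e2),  # assumed range, for illustration only
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

random_search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=5, cv=5, random_state=0
)
random_search.fit(iris.data, iris.target)

# Exactly n_iter candidates were evaluated
print(len(random_search.cv_results_["params"]))
print(random_search.best_params_)
```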


#### Question: How do we determine the best model from hyperparameter tuning?

In [60]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV

models = {
    "SVM_Model":{
        "model": SVC(gamma='auto'),
        "param":{
            "C":[1,5,10],
            "kernel":['linear','rbf']
        }
    },
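    # Note: LinearRegression is a regressor, so its score below is R^2
    # on the numeric class labels, not classification accuracy.
    "Linear_regression":{
        "model": LinearRegression(),
        "param":{}
    },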
    "Logistic_regression":{
        "model": LogisticRegression(max_iter=200),
        "param": {
            'C': [1,5,10]
        }
    },
    "Random_Forest_Model":{
        "model":RandomForestClassifier(),
        "param":{
            "n_estimators":[1,5,10]
        }
    }
}

Scores=[]

for model_name,md in models.items():
    gs1 = GridSearchCV(md['model'],md['param'],cv=5,return_train_score=False)
    gs1.fit(iris.data,iris.target)
    Scores.append({
        'model':model_name,
        'best_param':gs1.best_params_,
        'best_score': gs1.best_score_
    })

scores_data = pd.DataFrame(Scores)
scores_data

|   | model | best_param | best_score |
|---|---|---|---|
| 0 | SVM_Model | {'C': 1, 'kernel': 'linear'} | 0.98 |
| 1 | Linear_regression | {} | 0.322561 |
| 2 | Logistic_regression | {'C': 1} | 0.973333 |
| 3 | Random_Forest_Model | {'n_estimators': 10} | 0.953333 |


**Based on the results above, I can conclude that SVM with C=1 and kernel='linear' is the best model for solving my problem of iris flower classification.**
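
The comparison loop can be wrapped in a small helper that returns the models already ranked, which avoids reading the winner off the table by hand. This is a sketch, not the notebook's own method: `compare_models` is a hypothetical helper, only classifiers are included so that every score is an accuracy, and `random_state=0` and `max_iter=1000` are assumptions added to keep the run repeatable and free of convergence warnings.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def compare_models(candidates, X, y, cv=5):
    """Grid-search each candidate and return one row per model, best first."""
    rows = []
    for name, spec in candidates.items():
        search = GridSearchCV(spec["model"], spec["param"], cv=cv)
        search.fit(X, y)
        rows.append({
            "model": name,
            "best_param": search.best_params_,
            "best_score": search.best_score_,
        })
    ranked = pd.DataFrame(rows).sort_values("best_score", ascending=False)
    return ranked.reset_index(drop=True)


iris = load_iris()
classifiers = {
    "SVM_Model": {
        "model": SVC(gamma="auto"),
        "param": {"C": [1, 5, 10], "kernel": ["linear", "rbf"]},
    },
    "Logistic_regression": {
        "model": LogisticRegression(max_iter=1000),
        "param": {"C": [1, 5, 10]},
    },
    "Random_Forest_Model": {
        "model": RandomForestClassifier(random_state=0),
        "param": {"n_estimators": [1, 5, 10]},
    },
}

ranking = compare_models(classifiers, iris.data, iris.target)
print(ranking)
print("Top model:", ranking.loc[0, "model"])
```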

<h2 style="color:green">Exercise</h2>

## Machine Learning: Finding the Optimal Model and Hyperparameters

**For the digits dataset in sklearn.datasets, try the following classifiers and find the one that gives the best performance. Also find the optimal parameters for that classifier.**

* svm
* RandomForestClassifier
* LogisticRegression
* GaussianNB
* MultinomialNB
* DecisionTreeClassifier

In [61]:
import pandas as pd
from sklearn.datasets import load_digits

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import GridSearchCV

model_details = {
    "SVM_Model":{
        "model": SVC(gamma='auto'),
        "param":{
            "C":[1,5,10],
            "kernel":['linear','rbf']
        }
    },
    "Random_Forest_Model":{
        "model":RandomForestClassifier(),
        "param":{
            "n_estimators":[1,5,10]
        }
    },
    "Logistic_regression":{
        "model": LogisticRegression(max_iter=200),
        "param": {
            'C': [1,5,10]
        }
    },
    "Gaussian_NB":{
        "model":GaussianNB(),
        "param":{}
    },
    "Multinomial_NB":{
        "model":MultinomialNB(),
        "param":{
            "alpha":[1,5,10],
            "force_alpha":[True,False]
        }
    },
    "Decision_tree_Model":{
        "model":DecisionTreeClassifier(),
        "param":{
            "criterion":["gini","entropy","log_loss"],
            "splitter":["best","random"],
        }
    }
}

digits = load_digits()
print(f"Directory of digits dataset = {dir(digits)}")

scores = []
for model_name,md in model_details.items():
    gs2 = GridSearchCV(md['model'],md['param'], cv=5,return_train_score=False)
    gs2.fit(digits.data, digits.target)
    scores.append({
        "model":model_name,
        "best_score":gs2.best_score_,
        "best_params":gs2.best_params_
    })

scores_df = pd.DataFrame(scores)
scores_df

Directory of digits dataset = ['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']


|   | model | best_score | best_params |
|---|---|---|---|
| 0 | SVM_Model | 0.947697 | {'C': 1, 'kernel': 'linear'} |
| 1 | Random_Forest_Model | 0.886506 | {'n_estimators': 10} |
| 2 | Logistic_regression | 0.914321 | {'C': 1} |
| 3 | Gaussian_NB | 0.806928 | {} |
| 4 | Multinomial_NB | 0.874246 | {'alpha': 10, 'force_alpha': True} |
| 5 | Decision_tree_Model | 0.809717 | {'criterion': 'log_loss', 'splitter': 'best'} |


**The SVM model with parameters {'C': 1, 'kernel': 'linear'} is the best model for the digits dataset.**
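
Because `best_score_` is the same number used to pick the winner, it can be slightly optimistic. One common check, sketched below under assumptions (the 80/20 split, `random_state=0`, and the variable names are illustrative and not part of the exercise), is to hold out a test set, run the search on the training portion only, and score the refitted best model on the held-out rows.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = load_digits()

# Hold out 20% of the rows; the search never sees them
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0, stratify=digits.target
)

svm_search = GridSearchCV(
    SVC(gamma="auto"),
    {"C": [1, 5, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
svm_search.fit(X_train, y_train)

# score() evaluates the refitted best estimator on the held-out data
test_accuracy = svm_search.score(X_test, y_test)
print(svm_search.best_params_)
print(f"CV score: {svm_search.best_score_:.4f}, held-out accuracy: {test_accuracy:.4f}")
```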