# Advanced Hyperparameter Tuning 

### 1.Introduction 

This project is based on the popular wine quality dataset from kaggle. 


The aim of the project is to learn and use hyperparameter optimization using advanced libraries and autoML

In [1]:
import pandas as pd 

PATH = r"D:\\Coding_Stuff\\GitHub\\Machine_Learning_and_Optimization\\2. Advanced Hyperparameter Tuning\\diabetes.csv"

data = pd.read_csv(PATH)

data.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
# check for enteries with zero as value in each column 

for column in data.columns.tolist():
    print(f"{column}:", len(data[data[column]==0][column]))


Pregnancies: 111
Glucose: 5
BloodPressure: 35
SkinThickness: 227
Insulin: 374
BMI: 11
DiabetesPedigreeFunction: 0
Age: 0
Outcome: 500


All columns except `Glucose` and `Insulin` can be zero, because there will be some `Glucose` and `Insulin` content in patients. Zero value can be the result of wrong registering of data.

In [3]:
# replace zero value in glucose and Insulin with median value

import numpy as np 

data['Glucose'] = np.where(data['Glucose']==0, data['Glucose'].median(), data['Glucose'])
data['Insulin'] = np.where(data['Insulin']==0, data['Insulin'].median(), data['Insulin'])

data.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72,35,30.5,33.6,0.627,50,1
1,1,85.0,66,29,30.5,26.6,0.351,31,0
2,8,183.0,64,0,30.5,23.3,0.672,32,1
3,1,89.0,66,23,94.0,28.1,0.167,21,0
4,0,137.0,40,35,168.0,43.1,2.288,33,1


In [4]:
# check for enteries with zero as value in each column 

for column in data.columns.tolist():
    print(f"{column}:", len(data[data[column]==0][column]))


Pregnancies: 111
Glucose: 0
BloodPressure: 35
SkinThickness: 227
Insulin: 0
BMI: 11
DiabetesPedigreeFunction: 0
Age: 0
Outcome: 500


In [5]:
# assign data labels

x = data.iloc[:, :-1]
y = data.iloc[:, -1]

#splitting data
from sklearn.model_selection import  train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, stratify = y)

In [6]:
from sklearn.ensemble import RandomForestClassifier
rf_estimator = RandomForestClassifier(n_estimators=10).fit(X_train,y_train)
prediction = rf_estimator.predict(X_test)

In [7]:
from sklearn.metrics import  classification_report, confusion_matrix, accuracy_score

print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, prediction)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, prediction)))
print("<------------------ Accuracy score----------------> \n: {}".format(accuracy_score(y_test, prediction)))

<-------------------Confusion metrics results is ------------->
 : [[85 15]
 [24 30]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.78      0.85      0.81       100
           1       0.67      0.56      0.61        54

    accuracy                           0.75       154
   macro avg       0.72      0.70      0.71       154
weighted avg       0.74      0.75      0.74       154

<------------------ Accuracy score----------------> 
: 0.7467532467532467


### Hyperparameter tuning (manual)

The hyperparametrs associated with RandomForest are:-

### The main parameters used by a Random Forest
- criterion = the function used to evaluate the quality of a split
- max_depth = maximum number of levels allowed in each tree
- max_features = maximum number of features considered when splitting a node.
- min_samples_leaf = minimum number of samples which can be stored in a tree leaf.
- min_samples_split = minimum number of samples necessary in a node to cause node splitting.
- n_estimarors = number of trees in the ensemble.

### 1.Random Search

In [8]:
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_depth = [int(x) for x in np.linspace(start=10,stop=1000, num=10)]
max_depth

[10, 120, 230, 340, 450, 560, 670, 780, 890, 1000]

In [9]:
from sklearn.model_selection import  RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]

max_features = ['sqrt','log2']

max_depth = [int(x) for x in np.linspace(start=10,stop=1000, num=10)]

min_samples_split = [2,3,4,5,7,9]

min_sample_leaf = [2,4,6,8]

search_space = {
    'n_estimators':n_estimators,
    'max_features':max_features,
    'max_depth':max_depth,
    'min_samples_split':min_samples_split,
    'min_samples_leaf':min_sample_leaf,
    'criterion':['entropy','gini']}

search_space

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
 'max_features': ['sqrt', 'log2'],
 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000],
 'min_samples_split': [2, 3, 4, 5, 7, 9],
 'min_samples_leaf': [2, 4, 6, 8],
 'criterion': ['entropy', 'gini']}

In [10]:
rf_estimator = RandomForestClassifier()
random_search = RandomizedSearchCV(estimator = rf_estimator, param_distributions = search_space, n_iter = 100, cv=3, verbose = 2, random_state = 100, n_jobs=1)

random_search.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=1800; total time=   2.8s
[CV] END criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=1800; total time=   2.7s
[CV] END criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=1800; total time=   2.8s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.2s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.2s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.2s
[CV] END criterion=entropy, max_depth=120, max_features=log2, min_samples_leaf=4, min_samples_split=7, n_e

In [11]:
random_search.best_params_

{'n_estimators': 200,
 'min_samples_split': 2,
 'min_samples_leaf': 6,
 'max_features': 'sqrt',
 'max_depth': 1000,
 'criterion': 'gini'}

In [12]:
random_search.best_estimator_


In [13]:
rf_estimator = random_search.best_estimator_

y_pred = rf_estimator.predict(X_test)
print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, y_pred)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, y_pred)))
print("<------------------ Accuracy score---------------->\n : {}".format(accuracy_score(y_test, y_pred)))

<-------------------Confusion metrics results is ------------->
 : [[87 13]
 [21 33]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.81      0.87      0.84       100
           1       0.72      0.61      0.66        54

    accuracy                           0.78       154
   macro avg       0.76      0.74      0.75       154
weighted avg       0.77      0.78      0.77       154

<------------------ Accuracy score---------------->
 : 0.7792207792207793


### 3.Automated Hyperparameter Tuning

- Bayesian Optimization 
- Evolutionary Algorithm (TOPT)
- Gradient Descent 

### 1. Hyperpot

In [17]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.model_selection import  cross_val_score
# hp is used to define search space

space = {
    'criterion': hp.choice('criterion',['entropy','gini']),
    'max_depth': hp.quniform('max_depth', 10,1200,20),
    'max_features':hp.choice('max_features',['log2','sqrt']),
    'min_samples_split':hp.uniform('min_samples_split',0,0.5),
    'min_samples_split': hp.uniform('min_samples_split', 0, 1),
    'n_estimators': hp.choice('n_estimators',[10,50,300])
}

space

{'criterion': <hyperopt.pyll.base.Apply at 0x1efbf501f60>,
 'max_depth': <hyperopt.pyll.base.Apply at 0x1efbf5024d0>,
 'max_features': <hyperopt.pyll.base.Apply at 0x1efbf501600>,
 'min_samples_split': <hyperopt.pyll.base.Apply at 0x1efbf5017b0>,
 'n_estimators': <hyperopt.pyll.base.Apply at 0x1efbf500160>}

In [22]:
def objective(space):
    model = RandomForestClassifier(criterion=space['criterion'],
    max_features = space['max_features'],
    min_samples_leaf = space['min_samples_split'],
    n_estimators = space['n_estimators'])

    accuracy = cross_val_score(model, X_train, y_train, cv=5).mean()

    # we need to maximize accuracy, hence we return it as negative
    return {'loss': -accuracy, 'status': STATUS_OK}



In [23]:
trials = Trials()

best = fmin(fn=objective,
        space = space,
        algo = tpe.suggest,
        max_evals = 80,
        trials = trials)

100%|██████████| 80/80 [02:59<00:00,  2.24s/trial, best loss: -0.7850459816073571]


In [26]:
best

{'criterion': 1,
 'max_depth': 80.0,
 'max_features': 1,
 'min_samples_split': 0.002491760751933584,
 'n_estimators': 1}

In [30]:
best_random_forest = RandomForestClassifier(criterion='gini',
                                            max_depth = int(best['max_depth']),
                                            max_features = best['max_features'],
                                            min_samples_split = best['min_samples_split'],
                                            n_estimators = best['n_estimators'])

best_random_forest.fit(X_train,y_train)

In [31]:
predictionforest = best_random_forest.predict(X_test)
print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, predictionforest)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, predictionforest)))
print("<------------------ Accuracy score----------------> : {}".format(accuracy_score(y_test, predictionforest)))

<-------------------Confusion metrics results is ------------->
 : [[73 27]
 [27 27]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.73      0.73      0.73       100
           1       0.50      0.50      0.50        54

    accuracy                           0.65       154
   macro avg       0.61      0.61      0.61       154
weighted avg       0.65      0.65      0.65       154

<------------------ Accuracy score----------------> : 0.6493506493506493
