
# Grid Search for model tuning

## Nathalia Morales
### Tutorial by Rohan Joseph from "Towards Data Science"


* Link to tutorial: https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e

  
* Objective:
  * The objective of this analysis is to find patterns and explore statistics in the data. Machine learning models are used to predict the data for investigation purposes only. This work is for research and leisure; no real-world application is intended. If you have questions about the data, please ask the owner of the dataset.

* Disclaimer:
  * The author of this report does not own the data used here and does not intend to take profit or recognition from it. This report is a demonstration of technical skills, not a commercial work. All findings and models in this report are meant for teaching purposes only. The data is publicly available at the link above or as specified in the original tutorial.


## Libraries Used

In [0]:
import pandas as pd
import numpy as np

## Step 1: Import the dataset and view the top 10 rows.

In [8]:
#import data
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',header=None)

#set column names
data.columns = ['Sample Code Number','Clump Thickness','Uniformity of Cell Size',
                'Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size',
                'Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
#view top 10 rows
data.head(10)

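| | Sample Code Number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |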


## Step 2: Clean the data and rename the class values as 0/1 for model building (where 1 represents a malignant case). Also, let’s observe the distribution of the class.

In [9]:
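data = data.drop(['Sample Code Number'],axis=1) #Drop 1st column
data = data[data['Bare Nuclei'] != '?'].copy() #Remove rows with missing data ('?'); copy avoids SettingWithCopyWarning
data['Bare Nuclei'] = data['Bare Nuclei'].astype(int) #Column was read as strings because of the '?' entries
data['Class'] = np.where(data['Class'] ==2,0,1) #Change the Class representation: 2 (benign) -> 0, 4 (malignant) -> 1
data['Class'].value_counts() #Class distribution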

0    444
1    239
Name: Class, dtype: int64

There are 444 benign and 239 malignant cases.



## Step 3: Build a Dummy Classifier as a baseline

Before building a classification model, we'll build a Dummy Classifier to determine the ‘baseline’ performance. This answers the question — ‘What would be the success rate of the model, if one were simply guessing?’ The dummy classifier we are using will simply predict the majority class.

In [10]:
#Split data into attributes and class
X = data.drop(['Class'],axis=1)
y = data['Class']

#perform training and test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#Dummy Classifier
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy= 'most_frequent').fit(X_train,y_train)
y_pred = clf.predict(X_test)

#Distribution of y test
print('y actual : \n' +  str(y_test.value_counts()))

#Distribution of y predicted
print('y predicted : \n' + str(pd.Series(y_pred).value_counts()))

y actual : 
0    103
1     68
Name: Class, dtype: int64
y predicted : 
0    171
dtype: int64


* Here we can see that the model predicted class 0 for every sample because it is the majority class. Models may tend to do this if they aren't tuned correctly.
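To make the baseline concrete, here is a minimal pure-Python sketch of what a majority-class predictor amounts to, using the test-set counts printed above (103 benign, 68 malignant). The helper name `majority_baseline` is hypothetical and not part of scikit-learn. As a simplification, it takes the majority label from the same labels it scores, whereas `DummyClassifier` learns it from the training set.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy) for always predicting the most common label.

    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Test-set class counts from the output above: 103 benign (0), 68 malignant (1)
y_test_labels = [0] * 103 + [1] * 68
label, accuracy = majority_baseline(y_test_labels)
print(label, accuracy)
```

The accuracy of such a predictor is simply the share of the majority class, which is why it can look respectable on imbalanced data while never detecting a single malignant case.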

## Step 4: Calculate the evaluation metrics of this model.

In [11]:
# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Dummy Classifier Confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Accuracy Score : 0.6023391812865497
Precision Score : 0.0
Recall Score : 0.0
F1 Score : 0.0
Confusion Matrix : 
[[103   0]
 [ 68   0]]



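* The accuracy of the model is 60.2%, but this is a case where accuracy is not the best metric to evaluate the model: precision, recall and F1 are all 0 because the model never predicts the malignant class. So, let’s keep these other metrics in view when evaluating the next model.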

## Step 5: Now that we have the baseline accuracy, let’s build a Logistic regression model with default parameters and evaluate the model.



In [13]:
#Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score


print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Logistic Regression Classifier Confusion matrix
from sklearn.metrics import confusion_matrix

print('\n Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Accuracy Score : 0.9473684210526315
Precision Score : 0.9836065573770492
Recall Score : 0.8823529411764706
F1 Score : 0.9302325581395349

 Confusion Matrix : 
[[102   1]
 [  8  60]]




* By fitting the Logistic Regression model with the default parameters, we have a much ‘better’ model. The accuracy is 94.7% and, at the same time, the precision is a staggering 98.4%. Looking at the confusion matrix for this model, however, 8 malignant cases were predicted as benign (false negatives), against only 1 false positive.

* Let’s try to minimize the false negatives by using Grid Search to find the optimal parameters. Grid search can be used to improve any specific evaluation metric.
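The four scores can be recomputed by hand from the confusion matrices printed above, which makes it clearer why false negatives drive recall down. The following is a small sketch; the function name `metrics_from_confusion` is hypothetical, and returning 0.0 when a denominator is zero is an assumption chosen to match the zeros printed for the dummy classifier.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts.

    Hypothetical helper; returns 0.0 for a score whose denominator is zero.
    """
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# scikit-learn lays the matrix out as [[tn, fp], [fn, tp]]
# Logistic regression: [[102, 1], [8, 60]]
lr_metrics = metrics_from_confusion(tn=102, fp=1, fn=8, tp=60)
# Dummy classifier: [[103, 0], [68, 0]]
dummy_metrics = metrics_from_confusion(tn=103, fp=0, fn=68, tp=0)

print(lr_metrics)
print(dummy_metrics)
```

Recall is `tp / (tp + fn)`, so each of the 8 missed malignant cases lowers it directly, while the single false positive only affects precision.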

## Step 6: Grid Search to maximize Recall


In [0]:
#Grid Search
from sklearn.model_selection import GridSearchCV

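clf = LogisticRegression(solver='liblinear') #liblinear supports both l1 and l2; the default lbfgs solver (scikit-learn >= 0.22) rejects l1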
grid_values = {'penalty': ['l1', 'l2'],'C':[0.001,.009,0.01,.09,1,5,10,25]}
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
grid_clf_acc.fit(X_train, y_train)

#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

#Logistic Regression (Grid Search) Confusion matrix
confusion_matrix(y_test,y_pred_acc)

The hyperparameters we tuned are:

* Penalty: l1 or l2, which specifies the norm used in the penalization.
* C: Inverse of regularization strength; smaller values of C specify stronger regularization.
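The "grid" is just every combination of the listed values. As a rough sketch of what the search has to cover, the candidates can be enumerated with the standard library. The fold count of 5 is an assumption (it is the default `cv` in recent scikit-learn releases; older releases used 3), and the ordering shown here is not necessarily the order scikit-learn uses internally.

```python
from itertools import product

# Same grid as in the cell above
grid_values = {'penalty': ['l1', 'l2'],
               'C': [0.001, .009, 0.01, .09, 1, 5, 10, 25]}

# Cartesian product of all value lists: one dict per candidate setting
names = list(grid_values)
candidates = [dict(zip(names, combo))
              for combo in product(*grid_values.values())]

n_folds = 5  # assumption: default number of cross-validation folds
total_fits = len(candidates) * n_folds

print(len(candidates))  # 2 penalties x 8 values of C
print(candidates[0])
print(total_fits)       # model fits needed, before the final refit
```

Every extra value added to a list multiplies the number of fits, which is why grids are usually kept small.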

Also, in the grid search function (`GridSearchCV`), the scoring parameter specifies the metric used to evaluate the model (we have chosen recall). From the resulting confusion matrix, we can see that the number of false negatives has been reduced; however, this comes at the cost of more false positives. The recall after grid search has risen from 88.2% to 91.1%, whereas the precision has dropped from 98.4% to 87.3%.
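After fitting, the search object exposes which setting won and how each candidate scored. The sketch below shows this on synthetic stand-in data from `make_classification`, not on the breast cancer dataset, so its numbers say nothing about the results quoted above. The names `X_demo`, `y_demo` and `search` are illustrative, and the grid is reduced to `C` only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced stand-in data (assumption: NOT the tutorial's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     weights=[0.65, 0.35], random_state=0)

# Search over C only, scoring each candidate by cross-validated recall
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.001, .009, 0.01, .09, 1, 5, 10, 25]},
                      scoring='recall', cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)  # setting with the highest mean recall
print(search.best_score_)   # that mean recall across the folds

# Mean cross-validated recall for every candidate
for params, mean_recall in zip(search.cv_results_['params'],
                               search.cv_results_['mean_test_score']):
    print(params, round(mean_recall, 3))
```

Note that `best_score_` is a cross-validation score on the training data, so it will generally differ from the recall measured on the held-out test set.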

# Personal Final Thoughts

* Grid Search is a great way to tune a machine learning model in order to obtain better prediction results.
* I will try Grid Search with another dataset and other models soon to see its results in varying circumstances.
* I need to understand Grid Search more deeply to understand its implications for the model's results.

This tutorial was done on June 26, 2019.


## Updates