# SVMs, Neural Nets, Ensembles

I implement SVMs, neural networks, and ensemble methods to classify patients as having or not having diabetic retinopathy. Additional details about the dataset are available [here](http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set).

In [4]:
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [5]:
%matplotlib inline

In [6]:
# Read the data from the csv file, supplying names for its 20 columns
col_names = (
    ['quality', 'prescreen']
    + ['ma' + str(i) for i in range(2, 8)]         # ma2 ... ma7
    + ['exudate' + str(i) for i in range(8, 16)]   # exudate8 ... exudate15
    + ['euDist', 'diameter', 'amfm_class', 'label']
)

data = pd.read_csv("messidor_features.txt", names = col_names)
print(data.shape)
data.head(10)

(1151, 20)


Unnamed: 0,quality,prescreen,ma2,ma3,ma4,ma5,ma6,ma7,exudate8,exudate9,exudate10,exudate11,exudate12,exudate13,exudate14,exudate15,euDist,diameter,amfm_class,label
0,1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1,0
1,1,1,24,24,22,18,16,13,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0,0
2,1,1,62,60,59,54,47,33,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0,1
3,1,1,55,53,53,50,43,31,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0,0
4,1,1,44,44,44,41,39,27,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0,1
5,1,1,44,43,41,41,37,29,28.3564,6.935636,2.305771,0.323724,0.0,0.0,0.0,0.0,0.502831,0.126741,0,1
6,1,0,29,29,29,27,25,16,15.448398,9.113819,1.633493,0.0,0.0,0.0,0.0,0.0,0.541743,0.139575,0,1
7,1,1,6,6,6,6,2,1,20.679649,9.497786,1.22366,0.150382,0.0,0.0,0.0,0.0,0.576318,0.071071,1,0
8,1,1,22,21,18,15,13,10,66.691933,23.545543,6.151117,0.496372,0.0,0.0,0.0,0.0,0.500073,0.116793,0,1
9,1,1,79,75,73,71,64,47,22.141784,10.054384,0.874633,0.09978,0.023386,0.0,0.0,0.0,0.560959,0.109134,0,1


### 1. Data prep

In [7]:
data_Y = data['label']
data_X = data.drop(['label'],axis=1)
print(data_X.shape)
print(data_Y.shape)
print(data_X.head())

(1151, 19)
(1151,)
   quality  prescreen  ma2  ma3  ma4  ma5  ma6  ma7   exudate8   exudate9  \
0        1          1   22   22   22   19   18   14  49.895756  17.775994   
1        1          1   24   24   22   18   16   13  57.709936  23.799994   
2        1          1   62   60   59   54   47   33  55.831441  27.993933   
3        1          1   55   53   53   50   43   31  40.467228  18.445954   
4        1          1   44   44   44   41   39   27  18.026254   8.570709   

   exudate10  exudate11  exudate12  exudate13  exudate14  exudate15    euDist  \
0   5.270920   0.771761   0.018632   0.006864   0.003923   0.003923  0.486903   
1   3.325423   0.234185   0.003903   0.003903   0.003903   0.003903  0.520908   
2  12.687485   4.852282   1.393889   0.373252   0.041817   0.007744  0.530904   
3   9.118901   3.079428   0.840261   0.272434   0.007653   0.001531  0.483284   
4   0.410381   0.000000   0.000000   0.000000   0.000000   0.000000  0.475935   

   diameter  amfm_class  
0  0.

### 2. Support Vector Machines (SVM) and Pipelines

For some classification algorithms, like KNN, SVMs, and Neural Nets, scaling of the data is critical for the algorithm to operate correctly.  

In each fold of the cross-validation, the data is split into a training set and a test set. The scaling parameters (the mean and standard deviation of each feature, for instance) should be computed from the _training set only_, and the test set is then scaled with those same values.

Wrapping the scaler and the classifier in a `Pipeline` does this automatically: in each fold, the training phase fits the scaler and the model on the training data only, and the testing phase transforms the test data with the fitted scaler before running it through the trained classifier to get that fold's accuracy.
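
As a rough illustration of what happens inside a single fold, the sketch below standardizes a toy array by hand with NumPy. The arrays `X_train` and `X_test` and the helpers `fit_scaler` and `apply_scaler` are made up for this example; they are not part of the notebook or of scikit-learn.

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-column mean and std from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def apply_scaler(X, mean, std):
    """Scale any data (train or test) with statistics found on the training set."""
    return (X - mean) / std

# Toy data (hypothetical values, not taken from the Messidor file)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

mean, std = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mean, std)
X_test_scaled = apply_scaler(X_test, mean, std)

print(X_train_scaled.mean(axis=0))  # close to 0 in each column
print(X_test_scaled)                # not centered: test statistics were never used
```

The test row is transformed with the training mean and standard deviation, so nothing about the test set leaks into the fitted scaler.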

In [8]:
from sklearn.preprocessing import StandardScaler 
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

scaler = StandardScaler()
clf = SVC(kernel='linear')
pipe = Pipeline([('scaler',scaler), ('svc', clf)])
scores = cross_val_score(pipe, data_X, data_Y, cv = 5)

print("Accuracy:", scores.mean()*100)

Accuracy: 72.28646715603239


In [9]:
# for the 'svc' step of the pipeline, tune the 'kernel' hyperparameter
param_grid = {'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']}

from sklearn.model_selection import GridSearchCV 

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search_data = grid_search.fit(data_X, data_Y)

print('Best Kernel:', grid_search_data.best_params_)

Best Kernel: {'svc__kernel': 'linear'}


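The grid search selects the linear kernel, which is the kernel already used above, so the accuracy does not improve here. Passing the grid search itself to `cross_val_score` runs the kernel selection inside each outer fold (nested cross-validation):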

In [10]:
scores2 = cross_val_score(grid_search, data_X, data_Y, cv = 5)
print("Accuracy", scores2.mean()*100)

Accuracy 72.28646715603239
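
One way to check which kernel wins, and by how much, is to look at the mean validation score per candidate in `cv_results_`. The sketch below does this on synthetic data from `make_classification` (an assumption, so that it runs without `messidor_features.txt`); the names `X_demo`, `y_demo`, `demo_pipe`, `demo_grid`, and `demo_search` are illustrative, and the numbers it prints will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for data_X / data_Y (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=19, random_state=0)

demo_pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
demo_grid = {'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring='accuracy')
demo_search.fit(X_demo, y_demo)

# One mean validation accuracy per candidate kernel
for params, mean_score in zip(demo_search.cv_results_['params'],
                              demo_search.cv_results_['mean_test_score']):
    print(params['svc__kernel'], round(mean_score * 100, 2))

print('Best kernel:', demo_search.best_params_['svc__kernel'])
```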


Let's see if we can get the accuracy higher by tuning an additional hyperparameter. SVMs have a parameter `C` that sets the penalty for misclassified training points: a larger `C` tolerates fewer misclassifications at the cost of a narrower margin.
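
For reference, the standard soft-margin SVM objective shows where `C` enters. The symbols $w$, $b$, $\xi_i$, and $\phi$ are the usual textbook notation, not names from this notebook:

$$
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,\bigl(w^\top \phi(x_i) + b\bigr) \ge 1 - \xi_i, \;\; \xi_i \ge 0 .
$$

Here $\xi_i$ measures how far training point $i$ violates the margin, so `C` weights the total violation against the margin-width term $\frac{1}{2}\lVert w \rVert^2$.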

In [11]:
param_grid2 = {'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 'svc__C': range(50,101,10)}

grid_search2 = GridSearchCV(pipe, param_grid2, cv=5, scoring='accuracy')
scores3 = cross_val_score(grid_search2, data_X, data_Y, cv = 5)
print("Accuracy", scores3.mean()*100)

Accuracy 74.54357236965933


### 3. Neural Networks (NN)

In [12]:
from sklearn.neural_network import MLPClassifier
scaler2 = StandardScaler()
clf2 = MLPClassifier()
pipe_nn = Pipeline([('scaler2', scaler2), ('MLP', clf2)])

param_grid_nn = {'MLP__hidden_layer_sizes': [(10,),(20,),(30,),(40,),(50,),(60,)], 'MLP__activation':['logistic', 'tanh', 'relu']}
grid_search_nn = GridSearchCV(pipe_nn, param_grid_nn, cv=5, scoring='accuracy')

scores_nn = cross_val_score(grid_search_nn, data_X, data_Y, cv = 5)
print("Accuracy", scores_nn.mean()*100)

Accuracy 73.06794654620742


### 4. Ensemble Classifiers

Ensemble classifiers combine the predictions of multiple base estimators to improve on the accuracy of any single one. They work best when the base estimators are diverse, meaning their errors are not strongly correlated; methods such as Random Forests encourage this by building each estimator independently on a different random sample of the data and features.
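
As a back-of-the-envelope illustration of why diversity matters, the sketch below computes the accuracy of a majority vote under the idealized assumption that every base estimator is correct with the same probability and that their errors are fully independent. The function `majority_vote_accuracy` and the value 0.7 are made up for this example; real estimators trained on the same data are correlated, so the gain in practice is smaller.

```python
from math import comb

def majority_vote_accuracy(n_estimators, p_correct):
    """Probability that a strict majority of n independent voters is correct.

    Assumes an odd number of voters, each correct with probability p_correct,
    with fully independent errors (an idealization).
    """
    needed = n_estimators // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_estimators, k) * p_correct**k * (1 - p_correct)**(n_estimators - k)
        for k in range(needed, n_estimators + 1)
    )

# Accuracy of the vote as the ensemble grows, for 70%-accurate voters
for n in (1, 5, 11, 25):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```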

**A. Random Forests**

In [13]:
from sklearn.ensemble import RandomForestClassifier

clf3 = RandomForestClassifier()
pipe_rf = Pipeline([('rf', clf3)])

param_grid_rf = {'rf__max_depth': range(35,56), 'rf__min_samples_leaf':[8,10,12], 'rf__max_features':['sqrt','log2']}
grid_search_rf = GridSearchCV(pipe_rf, param_grid_rf, cv=5, scoring='accuracy')

scores_rf = cross_val_score(grid_search_rf, data_X, data_Y, cv = 5)
print("Accuracy", scores_rf.mean()*100)

Accuracy 66.80707698099002


**B. AdaBoost**

Random Forests are a kind of ensemble classifier where many estimators are built independently, so they can be trained in parallel. *Boosting* is a different way of creating an ensemble: the classifiers are trained one at a time in sequence, and each round reweights (or resamples) the training set according to how the previously trained models performed, so that later classifiers focus on the examples earlier ones got wrong.
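
To make the sequential idea concrete, here is a minimal sketch of one round of the classic discrete AdaBoost weight update, with labels in {-1, +1}. The helper `adaboost_reweight` and the toy labels are made up for this example, and this is the textbook update, not necessarily the exact variant that scikit-learn's `AdaBoostClassifier` runs.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the classic (discrete) AdaBoost weight update.

    Labels are expected in {-1, +1}. Returns the estimator weight alpha
    and the renormalized sample weights for the next round.
    """
    # Weighted error of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < err < 0.5:
        raise ValueError("sketch only handles 0 < weighted error < 0.5")
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the weak learner gets the second sample wrong
weights = [0.2] * 5           # uniform starting weights

alpha, weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the update the misclassified sample carries more weight than the others, which is what pushes the next weak learner toward it.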

In [14]:
from sklearn.ensemble import AdaBoostClassifier

clf4 = AdaBoostClassifier()
pipe_ada = Pipeline([('ada', clf4)])
param_grid_ada = {'ada__n_estimators': range(50,251,25)}
grid_search_ada = GridSearchCV(pipe_ada, param_grid_ada, cv=5, scoring='accuracy')

scores_ada = cross_val_score(grid_search_ada, data_X, data_Y, cv = 5)
print("Accuracy:", scores_ada.mean()*100)

Accuracy: 71.32806324110673


### 5. Deploying a final model

In [15]:
import pickle

from sklearn.neural_network import MLPClassifier
scaler2 = StandardScaler()
clf2 = MLPClassifier()
pipe_nn = Pipeline([('scaler2', scaler2), ('MLP', clf2)])

param_grid_nn = {'MLP__hidden_layer_sizes': [(10,),(20,),(30,),(40,),(50,),(60,)], 'MLP__activation':['logistic', 'tanh', 'relu']}
grid_search_nn = GridSearchCV(pipe_nn, param_grid_nn, cv=5, scoring='accuracy')
grid_search_nn_data = grid_search_nn.fit(data_X, data_Y)

print('Best Params:', grid_search_nn_data.best_params_)

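# the fitted grid search (refit on all the data with the best parameters) is the final model
final_model = grid_search_nn_data

# save the model to disk; the with-block closes the file handle
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(final_model, f)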

Best Params: {'MLP__activation': 'tanh', 'MLP__hidden_layer_sizes': (50,)}


In [16]:
# new record to classify (19 feature values, in the same column order as data_X)
record = [ 0.05905386, 0.2982129, 0.68613149, 0.75078865, 0.87119216, 0.88615694,
  0.93600623, 0.98369184, -0.47426472, -0.57642756, -0.53115361, -0.42789774,
 -0.21907738, -0.20090532, -0.21496782, -0.2080998, 0.06692373, -2.81681183,
 -0.7117194 ]
 
# loading the model from disk; the with-block closes the file handle
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# prediction (predict returns an array with one label per input row)
prediction = loaded_model.predict([record])[0]
if prediction == 1:
    print("Positive for disease")
elif prediction == 0:
    print("Negative for disease")

Positive for disease
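
Because the saved object wraps a pipeline, the scaler travels with the model. The sketch below illustrates this with an in-memory pickle round trip on synthetic data; `X_demo`, `y_demo`, `demo_model`, `restored`, and `raw_record` are illustrative names, and a linear `SVC` stands in for the tuned neural network only to keep the example fast.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for data_X / data_Y (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=19, random_state=0)

demo_model = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='linear'))])
demo_model.fit(X_demo, y_demo)

# Serialize to bytes and restore; a file opened in 'wb' / 'rb' mode works the same way
restored = pickle.loads(pickle.dumps(demo_model))

# The restored pipeline carries its own fitted scaler, so it takes raw features
raw_record = X_demo[:1]
print(restored.predict(raw_record))
print(np.array_equal(restored.predict(X_demo), demo_model.predict(X_demo)))
```

One thing to watch for: the values in `record` above look as if they may already be standardized (an observation, not something the notebook states). If so, the loaded pipeline would scale them a second time, so passing raw feature values is usually the safer choice.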
