# Lab 9
## Data analysis 2
### Task 2 Construct an SVM classifier

In this task, we will build an SVM classifier based on the diabetes dataset. The classifier can
be used to predict whether a patient has diabetes (label = ‘1’) or not (label = ‘0’).

1. Download the data file “diabetes_with_head.csv” from GCULearn

2. Import the required libraries

In [2]:
# load libraries

import pandas as pd
import seaborn as sns 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics

3. Import the ‘diabetes_with_head’ dataset 

In [3]:
df2 = pd.read_csv("../lab9/diabetes_with_head.csv")
print(df2.shape)

df2.head()

(768, 9)


   pregnant  glucose  bp  skin  insulin   BMI  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1


4. Data exploring and pre-processing
The imported ‘diabetes_with_head’ dataset is the same one we used in the week 8 lab,
so the data exploration will not be repeated here. As the week 8 lab showed,
this dataset has no duplicated rows and no null values, but zero values in the
"glucose", "bp", "skin", "insulin" and "BMI" columns should be treated as invalid/missing
values. As in week 8, we will impute those invalid values with the column mean:

In [4]:
# count number of missing values in each of the columns
print((df2[["glucose", "bp", "skin", "insulin", "BMI"]] == 0).sum())

# mark zero values as missing (NaN); np.nan is used because the np.NaN alias was removed in NumPy 2.0
df2[["glucose", "bp", "skin", "insulin", "BMI"]] = df2[["glucose", "bp", "skin", "insulin", "BMI"]].replace(0, np.nan)

# check the number of NaN values in each column
print(df2.isnull().sum())

# print the first 10 rows of data
print(df2.head(10))

# fill missing values with mean column values
df2.fillna(df2.mean(), inplace=True)

# check whether any NaN values remain in the dataset
print(df2.isnull().sum())

# print the first 10 rows of data after imputation
print(df2.head(10))



glucose      5
bp          35
skin       227
insulin    374
BMI         11
dtype: int64
pregnant      0
glucose       5
bp           35
skin        227
insulin     374
BMI          11
pedigree      0
age           0
label         0
dtype: int64
   pregnant  glucose    bp  skin  insulin   BMI  pedigree  age  label
0         6    148.0  72.0  35.0      NaN  33.6     0.627   50      1
1         1     85.0  66.0  29.0      NaN  26.6     0.351   31      0
2         8    183.0  64.0   NaN      NaN  23.3     0.672   32      1
3         1     89.0  66.0  23.0     94.0  28.1     0.167   21      0
4         0    137.0  40.0  35.0    168.0  43.1     2.288   33      1
5         5    116.0  74.0   NaN      NaN  25.6     0.201   30      0
6         3     78.0  50.0  32.0     88.0  31.0     0.248   26      1
7        10    115.0   NaN   NaN      NaN  35.3     0.134   29      0
8         2    197.0  70.0  45.0    543.0  30.5     0.158   53      1
9         8    125.0  96.0   NaN      NaN   NaN     0.2

Also, as explored in week 8, the variables in this dataset are not significantly correlated, so
no further pre-processing is needed.
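The effect of the zero-to-mean imputation is easiest to see on a tiny table. The following is a minimal sketch on hypothetical toy data (the `toy` DataFrame and its values are made up for illustration and are not taken from the diabetes dataset). Note that zeros in the `label` column are genuine class labels, so only the measurement columns are touched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a 0 in "glucose" or "BMI" stands for a missing measurement
toy = pd.DataFrame({
    "glucose": [100, 0, 200],
    "BMI": [20.0, 30.0, 0.0],
    "label": [0, 1, 0],  # zeros here are real class labels and must be kept
})

cols = ["glucose", "BMI"]

# mark zeros as missing only in the chosen measurement columns
toy[cols] = toy[cols].replace(0, np.nan)

# mean() skips NaN, so each mean is computed from the valid values only
col_means = toy[cols].mean()

# fill each missing entry with the mean of its own column
toy[cols] = toy[cols].fillna(col_means)

print(toy)
```

Here the missing glucose value becomes 150 (the mean of 100 and 200) and the missing BMI value becomes 25 (the mean of 20 and 30), while the `label` column is unchanged.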

5. Specify the input variables X and the target variable y.

We will use the first eight columns as input variables and the last column, i.e., label, as
the target variable to build an SVM classifier.

In [5]:
# split dataset into input variable x and target variable y
x = df2.drop(['label'], axis=1)
y = df2.label 

# check that the class variable has been removed
print(x.head())

# view target values
print(y[0:5])


   pregnant  glucose    bp      skin     insulin   BMI  pedigree  age
0         6    148.0  72.0  35.00000  155.548223  33.6     0.627   50
1         1     85.0  66.0  29.00000  155.548223  26.6     0.351   31
2         8    183.0  64.0  29.15342  155.548223  23.3     0.672   32
3         1     89.0  66.0  23.00000   94.000000  28.1     0.167   21
4         0    137.0  40.0  35.00000  168.000000  43.1     2.288   33
0    1
1    0
2    1
3    0
4    1
Name: label, dtype: int64


6. Specify the training, validation and test datasets
The dataset will be split into three pieces (70% as the training set, 15% as the validation set
and 15% as the test set) by using the ‘train_test_split’ function from sklearn twice. The
validation data will be used to check the accuracy of the model during the tuning
process. The test data will be used to test the accuracy of the model obtained after
the tuning process.

In [6]:
## Specify the training, validation and test dataset

# Specify the training set: 70% for training

x_train, x_tmp, y_train, y_tmp = train_test_split(x, y, test_size=0.3, random_state=1)

print("size of the training x: ", x_train.shape)

# split the remaining 30% of the dataset into a validation set and a test set

x_validation, x_test, y_validation, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=1)

print("size of the validation x: ", x_validation.shape)

print("size of the test x: ", x_test.shape)

size of the training x:  (537, 8)
size of the validation x:  (115, 8)
size of the test x:  (116, 8)
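Because 768 rows cannot be divided exactly into 70%, 15% and 15%, the split sizes are rounded. The sketch below repeats the two-stage split on placeholder data with the same number of rows (the arrays `x_demo` and `y_demo` are stand-ins introduced for this illustration only) and prints the resulting sizes and their share of the whole dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the diabetes inputs (768 rows, 8 columns)
x_demo = np.zeros((768, 8))
y_demo = np.zeros(768)

# first split: 70% for training, 30% held back
x_tr, x_hold, y_tr, y_hold = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=1)

# second split: the held-back 30% is halved into validation and test sets
x_val, x_te, y_val, y_te = train_test_split(
    x_hold, y_hold, test_size=0.5, random_state=1)

sizes = (len(x_tr), len(x_val), len(x_te))
print(sizes)                                 # (537, 115, 116), as in the output above
print([round(n / 768, 3) for n in sizes])    # share of the full dataset
```

The three pieces add up to 768 rows, and the shares are close to, but not exactly, 70%, 15% and 15%.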


7. Construct the SVM classifier clf1. Use the ‘fit’ function and pass in the specified
training data as parameters to train the SVM classifier.

In [7]:
# create SVM classifier object
clf1 = SVC()

# Train the SVM Classifier
clf1 = clf1.fit(x_train, y_train)

8. Validate the accuracy of the model
Once the model is trained, we can use the ‘predict’ function on our model to make
predictions on the validation data and calculate the accuracy score.

In [8]:
# predict the response for validation dataset
y_pred1 = clf1.predict(x_validation)

# validate the model accuracy: how often is the classifier correct?

print("Accuracy: ", metrics.accuracy_score(y_validation, y_pred1))

Accuracy:  0.7565217391304347


This means that the SVM classifier clf1 has a validation accuracy of approximately 75.65%.
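The accuracy score is simply the fraction of predictions that match the true labels; the value reported above corresponds to 87 correct predictions out of the 115 validation rows. A minimal sketch on hypothetical labels (the arrays `y_true` and `y_hat` are made up for illustration) compares a manual calculation with `metrics.accuracy_score`:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for ten patients
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 0])

# accuracy = number of correct predictions / number of predictions
manual_accuracy = np.mean(y_true == y_hat)
sklearn_accuracy = metrics.accuracy_score(y_true, y_hat)

print(manual_accuracy, sklearn_accuracy)
```

Two of the ten predictions are wrong, so both calculations give 0.8.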

9. Tune the SVM model
As described at https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html, sklearn.svm.SVC has
multiple parameters.
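The parameters and their current values can also be inspected from Python. A minimal sketch, using the `get_params()` method that scikit-learn estimators provide, prints the values that a default `SVC()` uses (the variable name `default_params` is chosen here for illustration):

```python
from sklearn.svm import SVC

# get_params() returns a dict with the current value of every constructor parameter
default_params = SVC().get_params()

for name, value in sorted(default_params.items()):
    print(f"{name} = {value}")
```

Among other entries, the listing shows `kernel = rbf` and `C = 1.0`. The exact set of parameters may differ slightly between scikit-learn versions.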

In step 7, the SVM classifier clf1 was created with all parameters set to default
values. Now, let us change the kernel type from the default one, which is ‘rbf’, to
‘linear’:

In [9]:
## tune the SVM model using a different hyperparameter

# create another svm classifier object with 'linear'
clf2 = SVC(kernel = 'linear')

# train SVM classifier
clf2 = clf2.fit(x_train, y_train)

# predict the response for the validation dataset
y_pred2 = clf2.predict(x_validation)

# validate the model accuracy, how often is the classifier correct?

print("Accuracy: ", metrics.accuracy_score(y_validation, y_pred2))

Accuracy:  0.782608695652174


This shows that the new SVM classifier clf2 has a validation accuracy of
approximately 78.26%.
Between clf1 and clf2, clf2 achieved the better validation result, so clf2 is taken
as the tuned SVM model for this dataset.

10. Test the tuned model with the specified test data

In [10]:
# Test the tuned model with the specified test dataset
y_pred = clf2.predict(x_test)

# test the model accuracy: how often is the classifier correct?
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

Accuracy:  0.7758620689655172


So the test accuracy of the tuned model is approximately 77.59%.

11. Further model tuning
Try to tune the SVM classifier further using different values for the parameters, and
check the accuracy achieved on the validation data.
Select the SVM classifier with the best accuracy on the validation data as the tuned SVM
model. Use this tuned model to predict the labels for the test data and calculate the
accuracy.
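One possible way to organise this tuning is a loop over candidate settings that keeps the classifier with the best validation accuracy. The sketch below is only an illustration: it runs on synthetic data from `make_classification` as a stand-in for the diabetes dataset, and the candidate values in `candidates` are arbitrary examples, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics

# Synthetic stand-in for the diabetes data (768 rows, 8 features); an assumption for this sketch
x_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=1)
x_tr, x_tmp, y_tr, y_tmp = train_test_split(x_demo, y_demo, test_size=0.3, random_state=1)
x_val, x_te, y_val, y_te = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=1)

# illustrative candidate settings, not recommended values
candidates = [
    {"kernel": "rbf", "C": 1.0},
    {"kernel": "rbf", "C": 10.0},
    {"kernel": "linear", "C": 1.0},
    {"kernel": "poly", "degree": 2, "C": 1.0},
]

best_params, best_clf, best_val_acc = None, None, -1.0
for params in candidates:
    # train on the training set, score on the validation set
    clf = SVC(**params).fit(x_tr, y_tr)
    val_acc = metrics.accuracy_score(y_val, clf.predict(x_val))
    print(params, "validation accuracy:", val_acc)
    if val_acc > best_val_acc:
        best_params, best_clf, best_val_acc = params, clf, val_acc

# the test set is used once, only for the selected model
test_acc = metrics.accuracy_score(y_te, best_clf.predict(x_te))
print("selected:", best_params, "test accuracy:", test_acc)
```

Keeping the test data out of the loop means the final test accuracy is not influenced by the choice of parameters.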

### Task 3 Construct an NN classifier

In this task, an NN classifier will be created with the diabetes dataset imported in Task 2. Data
import and data preparation are carried out in the same way as steps 1-6 in Task 2. Then an NN
classifier is created:

1. Construct an NN classifier mlp with one hidden layer of 15 neurons. Use the
‘fit’ function and pass in the specified training data as parameters to train the NN
classifier. Then predict the response for the validation data.

In [12]:
# import the NN classifier

from sklearn.neural_network import MLPClassifier

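# create an NN classifier object with one hidden layer of 15 neurons
# (15,) is a one-element tuple; (15) without the comma is just the integer 15
mlp = MLPClassifier(hidden_layer_sizes=(15,))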

# train the NN classifier
mlp.fit(x_train, y_train)

# predict the response for the validation data
predict_vali = mlp.predict(x_validation)



2. Get the accuracy for the validation dataset

In [13]:
# get the prediction accuracy for the validation dataset
print("Accuracy: ", metrics.accuracy_score(y_validation, predict_vali))

Accuracy:  0.7304347826086957


This means that the NN classifier mlp has a validation accuracy of approximately 73.04%.
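The `hidden_layer_sizes` parameter lists the number of neurons in each hidden layer, so `(15,)` describes one hidden layer of 15 neurons and, for example, `(15, 10)` would describe two hidden layers. A minimal sketch on synthetic data with 8 input features (the data and the names `x_demo`, `y_demo` and `mlp_demo` are assumptions for this illustration) shows how the fitted network reflects that structure:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in with 8 input features (an assumption for this sketch)
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# one hidden layer with 15 neurons; random_state fixes the random weight initialisation
mlp_demo = MLPClassifier(hidden_layer_sizes=(15,), max_iter=500, random_state=0)
mlp_demo.fit(x_demo, y_demo)

# n_layers_ counts the input layer, the hidden layer and the output layer
print(mlp_demo.n_layers_)

# one weight matrix per pair of consecutive layers
print([w.shape for w in mlp_demo.coefs_])
```

The network has three layers in total, with an 8 x 15 weight matrix between the inputs and the hidden layer and a 15 x 1 matrix between the hidden layer and the single output used for binary classification.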

3. Tune the NN model
As described at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html,
sklearn.neural_network.MLPClassifier has multiple parameters.

In step 1, the NN classifier mlp was created with one hidden layer of 15
neurons and all other parameters set to their default values. Try to tune the NN
classifier further using different values for the parameters, and check the accuracy
achieved on the validation data.
Select the NN classifier with the best accuracy on the validation data as the tuned NN model.
Use this tuned model to predict the labels for the test data and calculate the accuracy.
Compare the accuracy of the tuned NN model with the accuracy of the tuned SVM
model obtained in Task 2, and identify which model performs better on this dataset.
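Training an `MLPClassifier` starts from random weights, so two runs with the same settings can give different accuracies. Fixing `random_state` makes a comparison between settings repeatable. The sketch below is one possible way to compare a few hidden-layer layouts; it uses synthetic data as a stand-in for the diabetes dataset, and the layouts in `layer_options` are arbitrary examples rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

# Synthetic stand-in for the diabetes data (768 rows, 8 features); an assumption for this sketch
x_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=1)
x_tr, x_tmp, y_tr, y_tmp = train_test_split(x_demo, y_demo, test_size=0.3, random_state=1)
x_val, x_te, y_val, y_te = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=1)

# illustrative hidden-layer layouts: one small layer, one layer of 15, two layers of 15
layer_options = [(5,), (15,), (15, 15)]

results = {}
for layers in layer_options:
    # random_state fixes the weight initialisation so the runs are repeatable
    nn = MLPClassifier(hidden_layer_sizes=layers, max_iter=1000, random_state=1)
    nn.fit(x_tr, y_tr)
    val_acc = metrics.accuracy_score(y_val, nn.predict(x_val))
    results[layers] = (val_acc, nn)
    print(layers, "validation accuracy:", val_acc)

# pick the layout with the highest validation accuracy
best_layers = max(results, key=lambda k: results[k][0])
best_val_acc, best_nn = results[best_layers]

# evaluate the selected model once on the test set
nn_test_acc = metrics.accuracy_score(y_te, best_nn.predict(x_te))
print("selected:", best_layers, "test accuracy:", nn_test_acc)
```

The resulting test accuracy can then be set side by side with the test accuracy of the tuned SVM model, provided both models were evaluated on the same test rows.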