# Binary classification using KNN and SVM

Description of experiment:

- It is assumed that you already have a dataset whose features have been min-max normalized to the range 0..1.
- The dataset contains two classes (normal=0, abnormal=1).
- There are 2000 rows of normal data and 200 rows of abnormal data.
- The procedure is shown in the graphic below. Both the normal traffic and the anomaly traffic are downsampled for two reasons: to keep some anomaly traffic for the final validation stage, and to obtain a balanced data set for training the two-class problem.
- The remaining data not used in training and testing the models is preserved for the final validation stage.
- Once the data is downsampled, it is normalized and a 10-fold cross-validation is carried out independently for the two-class problem and for the one-class problem. The same random seed is used in both cases so that the partitions are identical and the results can be compared.
- In the two-class problem, every partition may include normal and anomaly instances. In the one-class problem, the partitions are built from normal traffic instances only; the anomaly instances are used to measure the performance of the model obtained for each fold. Note that the normalization for the one-class problem is computed from normal traffic only.
- The final validation stage uses all the data that has not been used in training and testing; this is an unbalanced data set containing instances of both normal traffic and anomalies.
- The aim of this validation stage is to compare the behavior of the modeling techniques under comparison, so that conclusions can be drawn.
- The "normal" traffic (2000 rows) is the "negative class", while the "abnormal" traffic (200 rows) is the "positive class".
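The min-max normalization mentioned in the list above can be sketched as follows. This is only an illustration on hypothetical toy arrays (`X_train_demo` and `X_holdout_demo` are not part of the notebook's dataset), and it assumes scikit-learn's `MinMaxScaler` is an acceptable stand-in for however the original data was scaled. The scaler learns the minimum and maximum from the training rows only and reuses them for unseen rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 4 training rows and 2 held-out rows, 2 features each.
X_train_demo = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0], [40.0, 800.0]])
X_holdout_demo = np.array([[25.0, 500.0], [50.0, 100.0]])

# Learn the per-feature minimum and maximum from the training rows only.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train_demo)

# Reuse the same minimum and maximum for unseen rows. Values outside the
# training range end up outside 0..1, which is expected behavior.
X_holdout_scaled = scaler.transform(X_holdout_demo)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_holdout_scaled)
```

Fitting the scaler on the training rows alone keeps information from the validation rows out of the model, which matches the remark that the one-class normalization is computed from normal traffic only.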



## Design of experiment
<img src=https://raw.githubusercontent.com/nickjeffrey/sklearn/master/images/fig03_experimental_setup.png>

In [1]:
# This Jupyter notebook is based on Stat479: Machine Learning -- L02: kNN in Python
# https://github.com/rasbt/stat479-machine-learning-fs18/blob/master/02_knn/02_knn_demo.ipynb


# 1 - choose the learning algorithm (KNN, SVM)


In [2]:
# adjust the modeltype=KNN|SVM variable to run different algorithms

#modeltype="KNN"   #adjust this variable to run the subsequent steps using the SVM or KNN algorithms
modeltype="SVM"  #adjust this variable to run the subsequent steps using the SVM or KNN algorithms

# 2 - Import required packages



In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math  #get square root function

# KNN and SVM algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV


# 3 - Load Dataset into a Pandas DataFrame


In [4]:
#df_data = pd.read_csv('c:/temp/data4.csv')
df_data = pd.read_csv('c:/temp/data5.csv')  #overwrote 100 lines of abnormal data with normal data to skew results
#df_data = pd.read_csv('https://raw.githubusercontent.com/nickjeffrey/sklearn/master/dataset.csv')

In [5]:
# look at the top few rows of the data (should show the abnormal class in column 35)
df_data.head()

Unnamed: 0,seconds_since_epoch,datestamp,sp1_watts_generated,sp2_watts_generated,sp3_watts_generated,iot_gateway_watt_consumption,water_pump_watt_consumption,valve1_watt_consumption,valve2_watt_consumption,valve3_watt_consumption,...,temps3_soil_temperature_C,latency_iotgateway_ms,latency_logcollector_ms,packetloss_iotgateway,packetloss_logcollector,auth_success_mqtt_to_hmi,auth_failure_mqtt_to_hmi,auth_success_ssh_to_iogateway,auth_failure_ssh_to_iotgateway,class
0,1673334002,0:00,0.0,0.0,0.0,0.928144,0.0,0.323232,0.12,0.1,...,0.436082,0.918065,0.27334,0.541436,0.328748,0.47619,0.35,0.272727,0.272727,abnormal
1,1673334062,0:01,0.0,0.0,0.0,0.622754,0.0,0.030303,0.16,0.1,...,0.241837,0.970968,0.579357,0.353591,0.418157,0.666667,0.6,0.909091,0.363636,abnormal
2,1673334122,0:02,0.0,0.0,0.0,0.11976,0.0,0.151515,0.0,0.31,...,0.152186,0.088387,0.096992,0.266575,0.284732,0.095238,0.3,0.636364,0.454545,abnormal
3,1673334182,0:03,0.0,0.0,0.0,0.646707,0.0,0.262626,0.08,0.16,...,0.037631,0.8,0.391079,0.361878,0.364512,0.095238,0.3,0.909091,0.818182,abnormal
4,1673334242,0:04,0.0,0.0,0.0,0.532934,0.0,0.323232,0.31,0.15,...,0.376314,0.861935,0.200207,0.407459,0.134801,0.380952,0.65,0.181818,0.545455,abnormal


In [6]:
# look at the bottom few rows of the data (should show the normal class in column 35)
df_data.tail()

Unnamed: 0,seconds_since_epoch,datestamp,sp1_watts_generated,sp2_watts_generated,sp3_watts_generated,iot_gateway_watt_consumption,water_pump_watt_consumption,valve1_watt_consumption,valve2_watt_consumption,valve3_watt_consumption,...,temps3_soil_temperature_C,latency_iotgateway_ms,latency_logcollector_ms,packetloss_iotgateway,packetloss_logcollector,auth_success_mqtt_to_hmi,auth_failure_mqtt_to_hmi,auth_success_ssh_to_iogateway,auth_failure_ssh_to_iotgateway,class
2195,1673453753,9:15,0.595125,0.59749,0.586031,0.152174,0.0,0.3,0.366667,0.366667,...,0.184911,0.315044,0.048364,0.0,0.0,0.0,0.0,0.0,0.0,normal
2196,1673453813,9:16,0.497272,0.554383,0.540378,0.021739,0.0,0.3,0.4,0.4,...,0.199704,0.249558,0.266003,0.0,0.0,0.0,0.0,0.166667,1.0,normal
2197,1673453873,9:17,0.508549,0.521099,0.510004,0.021739,0.0,0.533333,0.466667,0.466667,...,0.199704,0.024779,0.201991,0.0,0.0,0.0,0.0,0.166667,0.0,normal
2198,1673453933,9:18,0.600218,0.56275,0.541833,0.141304,0.0,0.366667,0.533333,0.5,...,0.147929,0.097345,0.307255,0.0,0.0,0.0,0.0,0.5,1.0,normal
2199,1673453993,9:19,0.565296,0.595125,0.575664,0.0,0.0,0.333333,0.533333,0.3,...,0.207101,0.113274,0.056899,0.0,0.0,0.0,0.0,0.5,1.0,normal


In [7]:
# show number of rows in dataset
print ( len(df_data) )


2200


In [8]:
#view dimensions of dataset (rows and columns)
df_data.shape  

(2200, 35)

In [9]:
# check to see if there are any missing values from the dataset

# all of the results should be zero, which would indicate there are not any null values in the dataset
# if there are any results greater than zero, it would indicate that some pieces of data are missing and should be cleaned up.
df_data.isnull().sum()

seconds_since_epoch               0
datestamp                         0
sp1_watts_generated               0
sp2_watts_generated               0
sp3_watts_generated               0
iot_gateway_watt_consumption      0
water_pump_watt_consumption       0
valve1_watt_consumption           0
valve2_watt_consumption           0
valve3_watt_consumption           0
ms1_watt_consumption              0
ms2_watt_consumption              0
ms3_watt_consumption              0
temp1_watt_consumption            0
temp2_watt_consumption            0
temp3_watt_consumption            0
battery_watt_hours                0
valve1_litres                     0
valve2_litres                     0
valve3_litres                     0
ms1_soil_moisture_pct             0
ms2_soil_moisture_pct             0
ms3_soil_moisture_pct             0
temps1_soil_temperature_C         0
temps2_soil_temperature_C         0
temps3_soil_temperature_C         0
latency_iotgateway_ms             0
latency_logcollector_ms     

In [10]:
# show frequency distribution of values in variables
# this shows how many different values are in each feature

#for var in df_data.columns:
#    print(df_data[var].value_counts())

    

In [11]:
#show the names of the columns (also called feature names)
df_data.columns

Index(['seconds_since_epoch', 'datestamp', 'sp1_watts_generated',
       'sp2_watts_generated', 'sp3_watts_generated',
       'iot_gateway_watt_consumption', 'water_pump_watt_consumption',
       'valve1_watt_consumption', 'valve2_watt_consumption',
       'valve3_watt_consumption', 'ms1_watt_consumption',
       'ms2_watt_consumption', 'ms3_watt_consumption',
       'temp1_watt_consumption', 'temp2_watt_consumption',
       'temp3_watt_consumption', 'battery_watt_hours', 'valve1_litres',
       'valve2_litres', 'valve3_litres', 'ms1_soil_moisture_pct',
       'ms2_soil_moisture_pct', 'ms3_soil_moisture_pct',
       'temps1_soil_temperature_C', 'temps2_soil_temperature_C',
       'temps3_soil_temperature_C', 'latency_iotgateway_ms',
       'latency_logcollector_ms', 'packetloss_iotgateway',
       'packetloss_logcollector', 'auth_success_mqtt_to_hmi',
       'auth_failure_mqtt_to_hmi', 'auth_success_ssh_to_iogateway',
       'auth_failure_ssh_to_iotgateway', 'class'],
      dtype='obje

In [12]:
#show summary info about dataset
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 35 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   seconds_since_epoch             2200 non-null   int64  
 1   datestamp                       2200 non-null   object 
 2   sp1_watts_generated             2200 non-null   float64
 3   sp2_watts_generated             2200 non-null   float64
 4   sp3_watts_generated             2200 non-null   float64
 5   iot_gateway_watt_consumption    2200 non-null   float64
 6   water_pump_watt_consumption     2200 non-null   float64
 7   valve1_watt_consumption         2200 non-null   float64
 8   valve2_watt_consumption         2200 non-null   float64
 9   valve3_watt_consumption         2200 non-null   float64
 10  ms1_watt_consumption            2200 non-null   float64
 11  ms2_watt_consumption            2200 non-null   float64
 12  ms3_watt_consumption            22

In [13]:
# show data types 
df_data.dtypes

seconds_since_epoch                 int64
datestamp                          object
sp1_watts_generated               float64
sp2_watts_generated               float64
sp3_watts_generated               float64
iot_gateway_watt_consumption      float64
water_pump_watt_consumption       float64
valve1_watt_consumption           float64
valve2_watt_consumption           float64
valve3_watt_consumption           float64
ms1_watt_consumption              float64
ms2_watt_consumption              float64
ms3_watt_consumption              float64
temp1_watt_consumption            float64
temp2_watt_consumption            float64
temp3_watt_consumption            float64
battery_watt_hours                float64
valve1_litres                     float64
valve2_litres                     float64
valve3_litres                     float64
ms1_soil_moisture_pct             float64
ms2_soil_moisture_pct             float64
ms3_soil_moisture_pct             float64
temps1_soil_temperature_C         

In [14]:
# drop any columns from the dataset that do not have any predictive power.

# In this example, seconds_since_epoch and datestamp do not have any predictive value because they are just timestamps
df_data.drop('seconds_since_epoch', axis=1, inplace=True)
df_data.drop('datestamp', axis=1, inplace=True)


# These columns are for ping packet loss.  
# If the value is ever >0, the data will be in the "abnormal" class.
# In other words, this allows the learning model to "cheat" by ignoring all the other features if this value is ever >0
# So, this particular data feature should be tracked not with machine learning, but with a simple threshold-based detection.
df_data.drop('packetloss_iotgateway', axis=1, inplace=True)
df_data.drop('packetloss_logcollector', axis=1, inplace=True)


# Same issue as the previous ping features.
# These columns are for machine-to-machine data transfers. If the authentication failures are ever >0, the data is abnormal.
# In other words, this allows the learning model to "cheat" by ignoring all the other features if this value is ever >0
# So, these features should be tracked not with machine learning, but with a simple threshold-based detection.
df_data.drop('auth_failure_mqtt_to_hmi', axis=1, inplace=True)
df_data.drop('auth_failure_ssh_to_iotgateway', axis=1, inplace=True)
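The comments above suggest handling the packet-loss and authentication-failure features with a simple threshold rule instead of machine learning. A minimal sketch of such a rule is shown below. It assumes a hypothetical mini-frame `df_demo` that reuses the notebook's column names, and the "any value above zero" rule is taken from the comments rather than from a tested detector. It also shows that the same columns can be dropped in a single `drop` call.

```python
import pandas as pd

# Hypothetical mini-frame that reuses some of the notebook's column names.
df_demo = pd.DataFrame({
    "packetloss_iotgateway":          [0.0, 0.2, 0.0, 0.0],
    "packetloss_logcollector":        [0.0, 0.0, 0.0, 0.0],
    "auth_failure_mqtt_to_hmi":       [0.0, 0.0, 0.5, 0.0],
    "auth_failure_ssh_to_iotgateway": [0.0, 0.0, 0.0, 0.0],
    "latency_iotgateway_ms":          [0.1, 0.9, 0.3, 0.2],
})

threshold_cols = [
    "packetloss_iotgateway",
    "packetloss_logcollector",
    "auth_failure_mqtt_to_hmi",
    "auth_failure_ssh_to_iotgateway",
]

# Threshold rule: flag a row when any of these features is above zero.
alert = (df_demo[threshold_cols] > 0).any(axis=1)

# Drop the rule-covered columns in one call before training a model.
df_features = df_demo.drop(columns=threshold_cols)

print(alert.tolist())
print(df_features.columns.tolist())
```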


In [15]:
#Look at the dataset again, you should see several columns have been dropped
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   sp1_watts_generated            2200 non-null   float64
 1   sp2_watts_generated            2200 non-null   float64
 2   sp3_watts_generated            2200 non-null   float64
 3   iot_gateway_watt_consumption   2200 non-null   float64
 4   water_pump_watt_consumption    2200 non-null   float64
 5   valve1_watt_consumption        2200 non-null   float64
 6   valve2_watt_consumption        2200 non-null   float64
 7   valve3_watt_consumption        2200 non-null   float64
 8   ms1_watt_consumption           2200 non-null   float64
 9   ms2_watt_consumption           2200 non-null   float64
 10  ms3_watt_consumption           2200 non-null   float64
 11  temp1_watt_consumption         2200 non-null   float64
 12  temp2_watt_consumption         2200 non-null   f

In [16]:
# look at the dimensions (rows and columns) of the dataset again after removing a few columns
df_data.shape  

(2200, 29)

# 4 - Get Features into a NumPy Array

In [17]:

#X = df_data[['latency_iotgateway_ms', 'latency_logcollector_ms']].values
X = df_data.values  #assign the entire dataframe to X

# Drop the "class" column from this array because we only want the feature columns,
# and the "class" column holds the label we are trying to predict

# Indexing is [rows, columns]: the first ":" keeps every row,
# and ":-1" keeps every column except the last one
X = X[:,:-1]
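Because the `class` column holds strings, `df_data.values` produces an array of generic Python objects rather than floats, which is why the output below prints values such as `0.0` without NumPy's usual float formatting. The sketch below, on a hypothetical two-row frame `df_demo`, contrasts that approach with dropping the label column by name and requesting a float array. It is an alternative to consider, not the notebook's original method.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with the same layout: features first, label last.
df_demo = pd.DataFrame({
    "latency_iotgateway_ms":   [0.91, 0.31],
    "latency_logcollector_ms": [0.27, 0.05],
    "class":                   ["abnormal", "normal"],
})

# Approach used in the notebook: mixed column types give an object array.
X_obj = df_demo.values[:, :-1]

# Alternative: drop the label by name, then ask for a float array.
X_float = df_demo.drop(columns="class").to_numpy(dtype=float)

print(X_obj.dtype, X_float.dtype)
print(X_float.shape)
```

Dropping by name also keeps working if the label column is ever moved away from the last position.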



In [18]:
# sanity check, look at the first 3 rows, all columns
X[:3, :]



array([[0.0, 0.0, 0.0, 0.928143713, 0.0, 0.323232323, 0.12, 0.1,
        0.428571429, 0.142857143, 0.285714286, 0.285714286, 0.714285714,
        0.428571429, 0.380846883, 0.309499832, 0.183244326, 0.167167167,
        0.706149126, 0.740298706, 0.413726406, 0.349450549, 0.151098901,
        0.436081904, 0.918064516, 0.273340249, 0.476190476, 0.272727273],
       [0.0, 0.0, 0.0, 0.622754491, 0.0, 0.03030303, 0.16, 0.1,
        0.428571429, 0.142857143, 0.142857143, 0.571428571, 0.571428571,
        0.571428571, 0.37946425, 0.075528701, 0.213618158, 0.168501835,
        0.619114312, 0.725030004, 0.143400253, 0.156593407, 0.235164835,
        0.241837299, 0.970967742, 0.579356846, 0.666666667, 0.909090909],
       [0.0, 0.0, 0.0, 0.119760479, 0.0, 0.151515152, 0.0, 0.31, 0.0,
        0.142857143, 0.571428571, 0.285714286, 0.857142857, 0.285714286,
        0.378639521, 0.205438066, 0.012016021, 0.302969636, 0.402761104,
        0.5210028, 0.045554592, 0.418681319, 0.170879121, 0.152185944,

# 5 - Get Class Labels into a NumPy array


In [19]:
# This will add a new column called "ClassLabel", which converts the "normal|abnormal" 
# alphabetic content of the "class" column to an integer

label_dict = {'normal': 0, 'abnormal': 1}

df_data['ClassLabel'] = df_data['class'].map(label_dict)
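One caveat with `map` is that any value missing from the dictionary silently becomes `NaN`. The following sketch, using a hypothetical `classes` series with one badly formatted label, shows a check that could be added after the mapping. The clean-up step (strip whitespace, lower-case) is an assumption about what kind of label noise might occur.

```python
import pandas as pd

label_dict = {"normal": 0, "abnormal": 1}

# Hypothetical labels; the third entry has stray capitalization and whitespace.
classes = pd.Series(["abnormal", "normal", "Normal ", "normal"])

# Values that are not keys of label_dict are mapped to NaN.
mapped_raw = classes.map(label_dict)
print("unmapped labels:", int(mapped_raw.isna().sum()))

# Normalize the text first, then map and confirm nothing was left unmapped.
mapped = classes.str.strip().str.lower().map(label_dict)
if mapped.isna().any():
    raise ValueError("found class labels that are not in label_dict")

y_demo = mapped.astype(int).to_numpy()
print(y_demo)
```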


In [20]:
# look at the top few rows of the data (should show the abnormal class in the last column)
df_data.head()

Unnamed: 0,sp1_watts_generated,sp2_watts_generated,sp3_watts_generated,iot_gateway_watt_consumption,water_pump_watt_consumption,valve1_watt_consumption,valve2_watt_consumption,valve3_watt_consumption,ms1_watt_consumption,ms2_watt_consumption,...,ms3_soil_moisture_pct,temps1_soil_temperature_C,temps2_soil_temperature_C,temps3_soil_temperature_C,latency_iotgateway_ms,latency_logcollector_ms,auth_success_mqtt_to_hmi,auth_success_ssh_to_iogateway,class,ClassLabel
0,0.0,0.0,0.0,0.928144,0.0,0.323232,0.12,0.1,0.428571,0.142857,...,0.413726,0.349451,0.151099,0.436082,0.918065,0.27334,0.47619,0.272727,abnormal,1
1,0.0,0.0,0.0,0.622754,0.0,0.030303,0.16,0.1,0.428571,0.142857,...,0.1434,0.156593,0.235165,0.241837,0.970968,0.579357,0.666667,0.909091,abnormal,1
2,0.0,0.0,0.0,0.11976,0.0,0.151515,0.0,0.31,0.0,0.142857,...,0.045555,0.418681,0.170879,0.152186,0.088387,0.096992,0.095238,0.636364,abnormal,1
3,0.0,0.0,0.0,0.646707,0.0,0.262626,0.08,0.16,0.714286,0.571429,...,0.043287,0.141758,0.437912,0.037631,0.8,0.391079,0.095238,0.909091,abnormal,1
4,0.0,0.0,0.0,0.532934,0.0,0.323232,0.31,0.15,0.857143,0.428571,...,0.031882,0.166484,0.339011,0.376314,0.861935,0.200207,0.380952,0.181818,abnormal,1


In [21]:
# look at the bottom few rows of the data (should show the normal class in the last column)
df_data.tail()

Unnamed: 0,sp1_watts_generated,sp2_watts_generated,sp3_watts_generated,iot_gateway_watt_consumption,water_pump_watt_consumption,valve1_watt_consumption,valve2_watt_consumption,valve3_watt_consumption,ms1_watt_consumption,ms2_watt_consumption,...,ms3_soil_moisture_pct,temps1_soil_temperature_C,temps2_soil_temperature_C,temps3_soil_temperature_C,latency_iotgateway_ms,latency_logcollector_ms,auth_success_mqtt_to_hmi,auth_success_ssh_to_iogateway,class,ClassLabel
2195,0.595125,0.59749,0.586031,0.152174,0.0,0.3,0.366667,0.366667,0.0,0.0,...,0.073948,0.192308,0.199704,0.184911,0.315044,0.048364,0.0,0.0,normal,0
2196,0.497272,0.554383,0.540378,0.021739,0.0,0.3,0.4,0.4,0.0,0.0,...,0.002061,0.162722,0.199704,0.199704,0.249558,0.266003,0.0,0.166667,normal,0
2197,0.508549,0.521099,0.510004,0.021739,0.0,0.533333,0.466667,0.466667,0.0,0.0,...,0.148988,0.184911,0.140533,0.199704,0.024779,0.201991,0.0,0.166667,normal,0
2198,0.600218,0.56275,0.541833,0.141304,0.0,0.366667,0.533333,0.5,0.0,0.0,...,0.095769,0.147929,0.192308,0.147929,0.097345,0.307255,0.0,0.5,normal,0
2199,0.565296,0.595125,0.575664,0.0,0.0,0.333333,0.533333,0.3,0.25,0.0,...,0.002182,0.177515,0.170118,0.207101,0.113274,0.056899,0.0,0.5,normal,0


In [22]:

# define the Class Labels (should be 0 normal or 1 for abnormal)
y = df_data['ClassLabel'].values

#show the first 5 rows of the y array (which holds the classifier 0 or 1)
y[:5]  

array([1, 1, 1, 1, 1], dtype=int64)

In [23]:
#show the last 5 rows of the y array (which holds the classifier 0 or 1)
y[-5:]  

array([0, 0, 0, 0, 0], dtype=int64)

# 6 - Shuffle Dataset and Create Training and Test Subsets


In [24]:
# build an array of row indices (0 to number of rows - 1), one index per row of the dataset

indices = np.arange(y.shape[0])
indices

array([   0,    1,    2, ..., 2197, 2198, 2199])

In [25]:
# randomize the dataset before splitting
# this is a seeded deterministic shuffle 
# in this example, a seed value of 123 is given, which makes the output deterministic, so the experiment is reproducible
# any seed value can be chosen, but it should remain consistent so the results can be reproduced

rnd = np.random.RandomState(123)
shuffled_indices = rnd.permutation(indices)
shuffled_indices

array([ 809,  403,  304, ..., 1766, 1122, 1346])
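The effect of the seed can be checked with a small sketch: two generators created with the same seed produce the same permutation. The array `indices_demo` and the variable names below are hypothetical and only serve the illustration.

```python
import numpy as np

indices_demo = np.arange(10)

# Two independent generators with the same seed give the same shuffle.
first = np.random.RandomState(123).permutation(indices_demo)
second = np.random.RandomState(123).permutation(indices_demo)

print(first)
print(second)
print("identical:", np.array_equal(first, second))
```

Note that a single `RandomState` object advances its internal state on every call, so calling `permutation` twice on the same object gives two different shuffles; reproducibility comes from starting again from the same seed.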

In [26]:
# split the data by class first, so that the "abnormal" or "positive" class
# can be divided 160/40 (aka 80%/20%) into training and test rows

# shuffle within the positive and negative classes to avoid bias
# this slicing assumes the first 200 rows of the file are the abnormal class
# and the remaining 2000 rows are the normal class (see the head/tail output above)
X_pos = X[:200]  #implies from beginning up to (but not including) 200
X_neg = X[200:]  #implies from 200 to end
y_pos = y[:200]  #"ClassLabel" values for the positive rows (expected to be all 1)
y_neg = y[200:]  #"ClassLabel" values for the negative rows (expected to be all 0)



# shuffle the positive indices (this is the "anomaly" data)
indices_pos = np.arange(X_pos.shape[0])
pos_shuffled_indices = rnd.permutation(indices_pos)
X_pos_shuffled = X_pos[pos_shuffled_indices]
y_pos_shuffled = y_pos[pos_shuffled_indices]


# shuffle the negative indices (this is the "normal" data)
indices_neg = np.arange(X_neg.shape[0])
neg_shuffled_indices = rnd.permutation(indices_neg)
X_neg_shuffled = X_neg[neg_shuffled_indices]
y_neg_shuffled = y_neg[neg_shuffled_indices]



# grab the first 40 lines (20%) of the abnormal (positive) class for test data
X_test_pos, y_test_pos = X_pos_shuffled[:40],y_pos_shuffled[:40] 

# grab the last 160 lines (80%) of the abnormal (positive) class for training data
X_train_pos, y_train_pos = X_pos_shuffled[40:],y_pos_shuffled[40:]  #only the starting row is shown, last row (200) is implied

# we want the abnormal/normal classes to be balanced, so grab the same amount of data (160 lines) as the previous step
# rows 0 to 160 will be the first 160 lines of the "normal" or negative class
# first value is included, last value is not included :160 means 0 is implied, 160 is *not* included
X_train_neg, y_train_neg = X_neg_shuffled[:160],y_neg_shuffled[:160]

# the remaining lines are for test
X_test_neg, y_test_neg = X_neg_shuffled[160:],y_neg_shuffled[160:] 
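The split above relies on the abnormal rows being the first 200 rows of the file. A variant that selects rows by label instead of by position is sketched below on hypothetical toy data (`X_demo`, `y_demo`, 20 abnormal and 200 normal rows in random order). It follows the same 80%/20% idea but is not the notebook's original code.

```python
import numpy as np

rng = np.random.RandomState(123)

# Hypothetical toy data: 20 abnormal (1) and 200 normal (0) rows in random order.
y_demo = np.array([1] * 20 + [0] * 200)
rng.shuffle(y_demo)
X_demo = rng.rand(y_demo.shape[0], 3)

# Locate each class by its label, then shuffle the row indices within the class.
pos_idx = rng.permutation(np.flatnonzero(y_demo == 1))
neg_idx = rng.permutation(np.flatnonzero(y_demo == 0))

# Hold out 20% of the positive class; train on the remaining 80%.
n_test_pos = len(pos_idx) // 5
n_train = len(pos_idx) - n_test_pos

# Balanced training set: the same number of rows from each class.
train_idx = np.concatenate((pos_idx[n_test_pos:], neg_idx[:n_train]))
# Everything else goes to the (unbalanced) test set.
test_idx = np.concatenate((pos_idx[:n_test_pos], neg_idx[n_train:]))

X_train_demo, y_train_demo = X_demo[train_idx], y_demo[train_idx]
X_test_demo, y_test_demo = X_demo[test_idx], y_demo[test_idx]

print("train class counts:", np.bincount(y_train_demo))
print("test class counts: ", np.bincount(y_test_demo))
```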






In [27]:
# sanity check to visualize the labels, make sure the classLabel boundaries are correct
# this confirms that we split up the data correctly into training data and test data in the previous step

print ("y testing data for positive/abnormal class, should be all ones:")
print (y_test_pos)   # should output all 1
print ("")
print ("y testing data for negative/normal class, should be all zeros:")
print (y_test_neg)   # should output all 0
print ("")
print ("y training data for positive/abnormal class, should be all ones:")
print (y_train_pos)  # should output all 1
print ("")
print ("y training data for negative/normal class, should be all zeros:")
print (y_train_neg)  # should output all 0

y testing data for positive/abnormal class, should be all ones:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1]

y testing data for negative/normal class, should be all zeros:
[0 0 0 ... 0 0 0]

y training data for positive/abnormal class, should be all ones:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1]

y training data for negative/normal class, should be all zeros:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0

In [28]:
# create the training set and test set

#concatenate the two NumPy arrays of positive and negative classes into a single array that will be used to train the model
X_train = np.concatenate((X_train_pos,X_train_neg))
y_train = np.concatenate((y_train_pos,y_train_neg))

X_test = np.concatenate((X_test_pos,X_test_neg))
y_test = np.concatenate((y_test_pos,y_test_neg))

In [29]:
# sanity check, look at the "train" and "test" data
print ("\n X_train data:\n", X_train)
print ("\n y_train data:\n", y_train)
print ("\n X_test data:\n" , X_test)
print ("\n y_test data:\n" , y_test)


 X_train data:
 [[0.0 0.0 0.0 ... 0.153627312 0.0 0.166666667]
 [0.0 0.0 0.0 ... 0.153008299 0.047619048 0.272727273]
 [0.0 0.0 0.0 ... 0.408713693 0.047619048 0.090909091]
 ...
 [0.0 0.0 0.0 ... 0.089615932 0.0 0.0]
 [0.598035649 0.541287741 0.553473991 ... 0.129445235 0.0 0.833333333]
 [0.551655147 0.584758094 0.576755184 ... 0.032716927 0.0 0.333333333]]

 y_train data:
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0

# 7 - perform cross-validation (10-fold)

In [30]:
# sklearn has a cross-validation function that we will use here
# this function will combine your datasets and perform cross-validation
from sklearn.model_selection import cross_val_score

# note: n_neighbors (KNN) and C (SVM) are tuned later, in step 9
if (modeltype == "KNN"):
    # use KNN, start with n_neighbors=2 (the sklearn default is 5), will tune later
    # clf is short for classifier, which refers to the algorithm being used (KNN, SVM, etc.)
    clf = KNeighborsClassifier(n_neighbors=2)
elif (modeltype == "SVM"):
    # the random_state is a seed value that we use for reproducibility
    # use SVM algorithm, start with default C=1, will tune later
    clf = svm.SVC(kernel='linear', C=1, random_state=42)
else:
    # stop here, otherwise clf would be undefined in the next statement
    raise ValueError("Please set modeltype variable at the top of this notebook to KNN or SVM")


# this is where all the training and validation happens
# cv=number of cross-validation folds you want to use
# the data is split into cv folds; each fold is held out once while the model is trained on the others,
# and the accuracy on each held-out fold is returned
scores = cross_val_score(clf, X_train, y_train, cv=10)
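The experiment description asks for the same partitions to be used for every model so that results are comparable. One way to make that explicit is to build a seeded `StratifiedKFold` object and pass it as `cv`. The sketch below uses hypothetical synthetic data (`X_demo`, `y_demo`) and an arbitrary seed of 123; it is an illustration, not the notebook's original setup, which passes `cv=10` directly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(123)

# Hypothetical balanced data: 160 rows per class, 4 features.
X_demo = np.vstack((rng.normal(0.3, 0.1, size=(160, 4)),
                    rng.normal(0.7, 0.1, size=(160, 4))))
y_demo = np.array([0] * 160 + [1] * 160)

# A seeded, shuffled splitter yields the same 10 partitions every time it is used.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

knn_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, cv=cv)
svm_scores = cross_val_score(SVC(kernel="linear", C=1), X_demo, y_demo, cv=cv)

print("KNN mean accuracy:", knn_scores.mean())
print("SVM mean accuracy:", svm_scores.mean())
```

Stratification keeps the class proportions the same in every fold, which matters more once the classes are not balanced.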



# 8 - check initial algorithm accuracy before hyperparameter tuning

In [31]:
# print the scores for every cross-validation fold (cv=10)
# this will output an array containing the score for each fold
print ("Scores for each cross-validation fold for",modeltype,"model before hyperparameter optimization:")
scores


Scores for each cross-validation fold for SVM model before hyperparameter optimization:


array([0.875  , 0.8125 , 0.78125, 0.75   , 0.90625, 0.75   , 0.8125 ,
       0.71875, 0.71875, 0.8125 ])

In [32]:
# average the scores of each cross-validation fold
scores_average = np.mean(scores)
print ("Mean average score using", modeltype, "model before hyperparameter optimization:", scores_average)

Mean average score using SVM model before hyperparameter optimization: 0.79375


In [33]:
# perform the final validation

# train the function with the entire training set (without any splitting)
clf.fit (X_train,y_train)

SVC(C=1, kernel='linear', random_state=42)

In [34]:
# evaluate on the test set using the score function (mean accuracy)
# this particular test set is "unbalanced", because there is far more of the negative class than the positive class,
# so a single accuracy value is dominated by the negative class
# In other words, because this dataset is unbalanced, this accuracy result is misleading on its own
clf.score (X_test,y_test)

0.5952127659574468
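To see why accuracy alone is misleading on an unbalanced test set, consider a classifier that always predicts "normal". The sketch below uses hypothetical labels with 1840 normal and 40 abnormal rows, the proportions implied by the split above if the file holds 2000 normal and 200 abnormal rows. It uses `confusion_matrix`, which the notebook imports, together with `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels with the same class proportions as the test set.
y_true = np.array([0] * 1840 + [1] * 40)

# A useless classifier that predicts "normal" for every row.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Rows are the true class, columns are the predicted class, in the order [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall of the positive class: the share of anomalies that were detected.
recall = recall_score(y_true, y_pred, zero_division=0)

print("accuracy:", accuracy)
print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("recall (abnormal):", recall)
```

Accuracy is close to 98% here although not a single anomaly is detected, so per-class figures from the confusion matrix are a more informative companion to `clf.score`.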

# 9 - find optimal hyperparameters

In [35]:
# use GridSearchCV to find the optimal hyperparameters (n_neighbors for KNN, C and kernel for SVM)

# define the grid we are going to search
if (modeltype == "KNN"):
    parameters = {"n_neighbors":range(1,50)}   #last element in range is not used, so this searches n_neighbors from 1 to 49
    knn =  KNeighborsClassifier()  #instance of base classifier we want to search
    clf = GridSearchCV(knn, parameters)
elif (modeltype == "SVM"):
    parameters = {'kernel':('linear', 'rbf'), 'C':range(1,11)}  #last element in range is not used, so this implies 1-10
    # wrap the base classifier in GridSearchCV to calculate accuracy for every combination in the grid
    svc = svm.SVC()  #instance of base classifier we want to search
    clf = GridSearchCV(svc, parameters)
else:
    # stop here, otherwise the fit below would run on the classifier from the previous step
    raise ValueError("Please set modeltype variable at the top of this notebook to KNN or SVM")
# fit the model (GridSearchCV uses 5-fold cross-validation by default)
print ("Performing GridSearchCV to find optimal hyperparameter within this range:")
clf.fit (X_train,y_train)
    

Performing GridSearchCV to find optimal hyperparameter within this range:


GridSearchCV(estimator=SVC(),
             param_grid={'C': range(1, 11), 'kernel': ('linear', 'rbf')})
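After fitting, the search object exposes the winning combination directly, and `cv_results_` can be loaded into a DataFrame for easier reading. The sketch below repeats the SVM grid from the cell above on hypothetical synthetic data (`X_demo`, `y_demo`), so the values it prints will not match the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(123)

# Hypothetical balanced data: 100 rows per class, 4 features.
X_demo = np.vstack((rng.normal(0.4, 0.15, size=(100, 4)),
                    rng.normal(0.6, 0.15, size=(100, 4))))
y_demo = np.array([0] * 100 + [1] * 100)

# Same grid as the SVM branch above: 2 kernels x 10 values of C = 20 candidates.
search = GridSearchCV(SVC(), {"kernel": ("linear", "rbf"), "C": range(1, 11)}, cv=5)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)

# One row per candidate, sorted so the best-ranked candidates come first.
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[["param_kernel", "param_C", "mean_test_score"]]
print(top.head())
```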

In [36]:
# the optimal hyperparameters were calculated in the previous step.

if (modeltype == "KNN"):
    # for KNN, we are most interested in the value of n_neighbors and metric
    print ("Using KNN algorithm, parameters shown below:\n", clf.best_estimator_.get_params())
elif (modeltype == "SVM"):
    # for SVM, we are most interested in the value of param_C and param_kernel
    # this output shows all the candidate values that were searched for C (the regularization parameter)
    print ("\nUsing SVM algorithm, parameters shown below:\n")
    print ("\nAll keys from clf.cv_results_ \n" , sorted(clf.cv_results_.keys()))
    print ("\nparam_C \n", clf.cv_results_["param_C"])
    print ("\nparam_kernel \n", clf.cv_results_["param_kernel"])
    print ("\nparams \n", clf.cv_results_["params"])
else:
   print ("ERROR: Please set modeltype variable at the top of this notebook to KNN or SVM")





Using SVM algorithm, parameters shown below:


All keys from clf.cv_results_ 
 ['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_C', 'param_kernel', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']

param_C 
 [1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10]

param_kernel 
 ['linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear'
 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf'
 'linear' 'rbf']

params 
 [{'C': 1, 'kernel': 'linear'}, {'C': 1, 'kernel': 'rbf'}, {'C': 2, 'kernel': 'linear'}, {'C': 2, 'kernel': 'rbf'}, {'C': 3, 'kernel': 'linear'}, {'C': 3, 'kernel': 'rbf'}, {'C': 4, 'kernel': 'linear'}, {'C': 4, 'kernel': 'rbf'}, {'C': 5, 'kernel': 'linear'}, {'C': 5, 'kernel': 'rbf'}, {'C': 6, 'kernel': 'linear'}, {'C': 6, 'kernel': 'rbf'}, {'C': 7, 'kernel': 'linear'}, {'C': 7, 'kernel': 'rbf'}, {'C': 8, 'kernel

# 10 - recalculate scores after hyperparameter tuning to see if the scores improve

In [37]:
# recalculate the scores after the hyperparameter tuning to see if the scores improve

# clf.best_estimator_ is the classifier refit with the best hyperparameters found by GridSearchCV,
# so the same call works for both KNN and SVM
if modeltype in ("KNN", "SVM"):
    scores = cross_val_score(clf.best_estimator_, X_train, y_train, cv=10)
else:
    print ("ERROR: Please set modeltype variable at the top of this notebook to KNN or SVM")




In [38]:
# average the scores of each fold
scores_average = np.mean(scores)
print ("Mean average score using", modeltype, "model after hyperparameter tuning:", scores_average)

Mean average score using SVM model after hyperparameter tuning: 0.8625


In [39]:
# perform the final validation

# clf is the GridSearchCV object, so fitting it re-runs the whole grid search on the entire training set
# (it does not train a model with default hyperparameters)
clf.fit (X_train,y_train)
print ("\nFit the grid search:\n", clf)

# train the best estimator on the entire training set (without any splitting) using the optimized hyperparameters
# GridSearchCV already does this by default (refit=True), so this call is redundant but harmless
clf.best_estimator_.fit (X_train,y_train)
print ("\nFit the model with optimized hyperparameters:\n", clf.best_estimator_)




Fit the grid search:
 GridSearchCV(estimator=SVC(),
             param_grid={'C': range(1, 11), 'kernel': ('linear', 'rbf')})

Fit the model with optimized hyperparameters:
 SVC(C=6)


In [40]:
# evaluate on the test set using the score function
# this particular test set is "unbalanced", because there is far more of the negative class than the positive class,
# so this accuracy result is misleading on its own
# note: clf is the fitted GridSearchCV object, and its score() delegates to clf.best_estimator_,
# so both lines below score the same tuned model and print the same value
# the score before hyperparameter optimization is the one calculated in step 8
print (modeltype, "score using clf (GridSearchCV):", clf.score (X_test,y_test))
print (modeltype, "score using clf.best_estimator_:", clf.best_estimator_.score (X_test,y_test))

SVM score using clf (GridSearchCV): 0.7601063829787233
SVM score using clf.best_estimator_: 0.7601063829787233
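The two identical values above are expected: a fitted `GridSearchCV` object forwards `score` to its refit best estimator. To compare "before" and "after" tuning, the untuned classifier has to be kept in a separate variable. The sketch below illustrates this with KNN on hypothetical synthetic data; the names `baseline` and `search` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(123)

# Hypothetical data: 100 rows per class, 3 features, shuffled into train and test.
X_all = np.vstack((rng.normal(0.4, 0.15, size=(100, 3)),
                   rng.normal(0.6, 0.15, size=(100, 3))))
y_all = np.array([0] * 100 + [1] * 100)
order = rng.permutation(len(y_all))
train, test = order[:140], order[140:]

# Keep the untuned model in its own variable so it is not overwritten.
baseline = KNeighborsClassifier(n_neighbors=2).fit(X_all[train], y_all[train])

# Tuned model: grid search over n_neighbors from 1 to 19.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 20)})
search.fit(X_all[train], y_all[train])

score_before = baseline.score(X_all[test], y_all[test])
score_after = search.score(X_all[test], y_all[test])
score_best = search.best_estimator_.score(X_all[test], y_all[test])

print("before tuning:", score_before)
print("after tuning: ", score_after)
print("best_estimator_ gives the same value:", score_after == score_best)
```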


In [41]:
# look in the classification data dictionary to figure out where the optimal hyperparameter is located
# This command outputs a lot of text, look at the next cell for a shortcut
# This shows the entire dictionary (multidimensional array), so it is hard to parse out what we are looking for
clf.cv_results_

{'mean_fit_time': array([0.00159655, 0.00239367, 0.00179458, 0.00218663, 0.00159554,
        0.00239267, 0.00179524, 0.00259199, 0.00179543, 0.00259132,
        0.00160117, 0.0023891 , 0.00179405, 0.00239391, 0.0017951 ,
        0.00238729, 0.00199342, 0.00260005, 0.00159545, 0.00258703]),
 'std_fit_time': array([0.0004886 , 0.00048827, 0.00041623, 0.0003874 , 0.00048943,
        0.00048918, 0.00039885, 0.00048805, 0.00039871, 0.00048684,
        0.0004931 , 0.00048387, 0.00039826, 0.00048875, 0.00039949,
        0.00049401, 0.00062801, 0.0004938 , 0.00048838, 0.0004971 ]),
 'mean_score_time': array([0.00059772, 0.00139627, 0.00019951, 0.0014039 , 0.00039878,
        0.00099106, 0.00019937, 0.00099897, 0.00060034, 0.00119648,
        0.00039897, 0.00140333, 0.00040579, 0.00099726, 0.00039897,
        0.00119834, 0.00019941, 0.00118995, 0.00039897, 0.00099711]),
 'std_score_time': array([4.88032896e-04, 4.88538707e-04, 3.99017334e-04, 4.82726294e-04,
        4.88402437e-04, 1.29761050e-

In [42]:
# shortcut for previous step to show the optimal hyperparameter
# for SVM, this gives us a value of C=6, which is near the middle of the range 1-10 searched by GridSearchCV
# for KNN, this gives us a value of n_neighbors=10, which is within the range of 1-50 that we provided
clf.best_estimator_
print ("The optimized hyperparameter for", modeltype, "model is:", clf.best_estimator_)

The optimized hyperparameter for SVM model is: SVC(C=6)


In [43]:
# now that we know the optimal hyperparameter C=6, also get the optimal kernel for SVM
# for KNN, show the optimal n_neighbors and metric
print ("Optimized hyperparameters for",modeltype,"model are:")
clf.best_estimator_.get_params()

Optimized hyperparameters for SVM model are:


{'C': 6,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
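
`get_params()` lists every parameter of the estimator, including the ones that were never tuned. As a shorter alternative, a fitted `GridSearchCV` also exposes `best_params_` (only the searched parameters) and `best_score_` (the mean cross-validated score of the best candidate). A minimal sketch on synthetic data, where `X_demo`, `y_demo`, and `search` are hypothetical names:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Same grid as in the output above: C in 1..10, linear or rbf kernel
search = GridSearchCV(SVC(), param_grid={"C": range(1, 11), "kernel": ("linear", "rbf")})
search.fit(X_demo, y_demo)

# Only the tuned hyperparameters, as a plain dict
print("best_params_:", search.best_params_)
print("best_score_: ", search.best_score_)

# The same values can be picked out of the full get_params() dictionary
all_params = search.best_estimator_.get_params()
tuned_only = {name: all_params[name] for name in search.best_params_}
print("tuned_only:  ", tuned_only)
```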

# 11 - Confusion Matrix for entire test set

In [44]:

# Confusion Matrix

# A confusion matrix is a table that is often used to describe the performance of a 
# classification model (or "classifier") on a set of test data for which the true values are known.
# Scikit-learn calculates it with the confusion_matrix function.
# Rows are the true classes and columns are the predicted classes, in sorted label order (0, 1).

# Evaluate model
y_pred = clf.best_estimator_.predict(X_test)  #use the optimal hyperparameter calculated earlier
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# class 1 (abnormal) is the positive class, so TP is at [1,1] and TN is at [0,0]
print('\nTrue Positives  (TP) = ', cm[1,1])
print('True Negatives  (TN) = ', cm[0,0])
print('False Positives (FP) = ', cm[0,1])
print('False Negatives (FN) = ', cm[1,0])

Confusion matrix

 [[1395  445]
 [   6   34]]

True Positives  (TP) =  34
True Negatives  (TN) =  1395
False Positives (FP) =  445
False Negatives (FN) =  6
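
As a small illustration of how scikit-learn lays out a binary confusion matrix, the sketch below uses a toy label vector (hypothetical values, not the notebook's data). It assumes label 1 (abnormal) is the positive class, as stated in the experiment description. With labels ordered `[0, 1]`, rows are true classes and columns are predicted classes, so `ravel()` yields the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 0 = normal (negative), 1 = abnormal (positive)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(cm_demo)

# Flatten row by row: [0,0]=TN, [0,1]=FP, [1,0]=FN, [1,1]=TP
tn, fp, fn, tp = cm_demo.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```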


# 12 - Accuracy of model for entire test dataset

In [45]:
# Assign values from confusion matrix to True Positive, True Negative, False Positive, False Negative
# Rows are true classes, columns are predicted classes; class 1 (abnormal) is the positive class

TP = cm[1,1]    #obtain True  Positive value from confusion matrix
TN = cm[0,0]    #obtain True  Negative value from confusion matrix
FP = cm[0,1]    #obtain False Positive value from confusion matrix
FN = cm[1,0]    #obtain False Negative value from confusion matrix

print ("True Positives: ", TP)
print ("True Negatives: ", TN)
print ("False Positives:", FP)
print ("False Negatives:", FN)

Accuracy = (( TP + TN) / ( TP + TN + FP + FN))
Sensitivity = TP / (TP + FN)      #true positive rate
Specificity = TN / (TN + FP)      #true negative rate
GeometricMean = math.sqrt(Sensitivity * Specificity)

# round to 4 decimal places for display
print ("")
print ("Accuracy:       ", round(Accuracy,4))
print ("Sensitivity:    ", round(Sensitivity,4))
print ("Specificity:    ", round(Specificity,4))
print ("Geometric Mean: ", round(GeometricMean,4))

True Positives:  34
True Negatives:  1395
False Positives: 445
False Negatives: 6

Accuracy:        0.7601
Sensitivity:     0.85
Specificity:     0.7582
Geometric Mean:  0.8028


# 13 - Confusion matrix for each cross validation fold

In [46]:
# define a function to perform a stratified cross validation
# in this context, "stratified" means that each fold produced by the cross-validation
# keeps the same proportion of each class as the full data set.
# This makes the folds more consistent with each other, reducing variability in the scores (smaller std dev)

def cv_confusion_matrix(clf, X, y, folds=10):
    skf = StratifiedKFold(n_splits=folds)
    cv_iter = skf.split(X, y)
    cms = []   #instantiate an empty list
    for train, test in cv_iter:
        clf.fit(X[train], y[train])
        cm = confusion_matrix(y[test], clf.predict(X[test]), labels=clf.classes_)
        cms.append(cm)
    #return np.mean(np.array(cms), axis=0)  #just show average of each run
    return (np.array(cms))                  #show the confusion matrix for each fold
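
To illustrate what "stratified" means in practice, the sketch below splits a hypothetical balanced label vector into 10 folds and counts the classes in each test fold. The 160/160 class sizes are an assumption chosen to mirror the 16-per-class folds seen in the output further down; `X_demo` and `y_demo` are stand-in names.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical balanced labels: 160 normal (0) and 160 abnormal (1)
y_demo = np.array([0] * 160 + [1] * 160)
X_demo = np.zeros((320, 1))  # feature values do not influence how the split is made

skf = StratifiedKFold(n_splits=10)

# Count how many rows of each class land in every test fold
fold_counts = [
    np.bincount(y_demo[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X_demo, y_demo)
]
print(fold_counts)
```

Every test fold holds 16 rows of each class, so each fold has the same class balance as the full set.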

In [47]:
# call the above function
#run the function using the optimal hyperparameter with the training data and target with 10 folds
#this returns a 3-dimensional array (one 2x2 confusion matrix per fold)
cv_confusion_matrix(clf.best_estimator_,X_train,y_train,10)  


array([[[15,  1],
        [ 2, 14]],

       [[12,  4],
        [ 1, 15]],

       [[13,  3],
        [ 1, 15]],

       [[14,  2],
        [ 3, 13]],

       [[14,  2],
        [ 0, 16]],

       [[12,  4],
        [ 2, 14]],

       [[12,  4],
        [ 0, 16]],

       [[13,  3],
        [ 3, 13]],

       [[11,  5],
        [ 1, 15]],

       [[14,  2],
        [ 1, 15]]], dtype=int64)

In [48]:
# put the 3-dimensional matrix into a variable so we can parse out each cross-validation fold
cms = cv_confusion_matrix(clf.best_estimator_,X_train,y_train,10)  

# show the confusion matrix for the first fold  (will provide TP,TN,FP,FN for a single fold)
cms[0]

array([[15,  1],
       [ 2, 14]], dtype=int64)
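
Besides inspecting one fold at a time, the per-fold matrices can be pooled by summing over the fold axis. The sketch below copies the ten matrices printed above into a hypothetical array `cms_demo` so that it runs on its own; in the notebook the same call could be applied to `cms` directly.

```python
import numpy as np

# The ten per-fold confusion matrices printed above, copied by hand
cms_demo = np.array([
    [[15, 1], [2, 14]], [[12, 4], [1, 15]], [[13, 3], [1, 15]],
    [[14, 2], [3, 13]], [[14, 2], [0, 16]], [[12, 4], [2, 14]],
    [[12, 4], [0, 16]], [[13, 3], [3, 13]], [[11, 5], [1, 15]],
    [[14, 2], [1, 15]],
])

# Sum over axis 0 (the fold axis) to get one pooled 2x2 matrix
pooled = cms_demo.sum(axis=0)
print(pooled)

# Pooled accuracy: correct predictions (the diagonal) over all predictions
pooled_accuracy = np.trace(pooled) / pooled.sum()
print("Pooled accuracy:", pooled_accuracy)
```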

# 14 - Accuracy of model for each cross-validation fold


To measure the quality of the models, the Accuracy, Sensitivity, Specificity, and Geometric Mean measurements will be used. 

The Accuracy gives an idea of the performance on the balanced data set, while Sensitivity and Specificity help in the final validation stage, where the data is clearly unbalanced.

https://lifenscience.com/sensitivity-specificity-accuracy/
Sensitivity, Specificity, and Accuracy are the terms most commonly associated with a binary classification test, and they statistically measure the performance of the test. Sensitivity indicates how well the test predicts one category, and Specificity measures how well the test predicts the other category, whereas Accuracy measures how well the test predicts both categories.


https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition. Individuals for which the condition is satisfied are considered "positive" and those for which it is not are considered "negative".

Sensitivity (true positive rate) refers to the probability of a positive test, conditioned on truly being positive.

Specificity (true negative rate) refers to the probability of a negative test, conditioned on truly being negative.




Calculate the following for each of the cross-validations: Accuracy, Sensitivity, Specificity, Geometric Mean

TP = True Positive

TN = True Negative

FP = False Positive

FN = False Negative

## Formulas:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Geometric Mean = $$\sqrt{Sensitivity \times Specificity}$$
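
As a worked example of these measurements, the sketch below wraps the standard definitions (Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP)) in a hypothetical helper named `binary_metrics` and applies it to the test-set confusion matrix `[[1395, 445], [6, 34]]`, with abnormal (1) as the positive class. It assumes none of the denominators is zero.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity, geometric mean) from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    g_mean = math.sqrt(sensitivity * specificity)
    return accuracy, sensitivity, specificity, g_mean

# Counts read from the test-set confusion matrix: TN=1395, FP=445, FN=6, TP=34
acc, sens, spec, gmean = binary_metrics(tp=34, tn=1395, fp=445, fn=6)
print(round(acc, 4), round(sens, 4), round(spec, 4), round(gmean, 4))
# prints: 0.7601 0.85 0.7582 0.8028
```

The sensitivity and specificity obtained this way agree with the per-class recall values (0.85 and 0.76) in the classification report for the test set.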



In [49]:
# show the accuracy measurements for each cross-validation fold

for i in range(10):
    #
    # Capture True Positive, True Negative, False Positive, False Negative for each cross-validation fold 
    # Rows are true classes, columns are predicted classes; class 1 (abnormal) is the positive class
    #
    TP = cms[i][1,1]    #obtain True  Positive value from confusion matrix
    TN = cms[i][0,0]    #obtain True  Negative value from confusion matrix
    FP = cms[i][0,1]    #obtain False Positive value from confusion matrix
    FN = cms[i][1,0]    #obtain False Negative value from confusion matrix
    #
    # Calculate Accuracy, Sensitivity, Specificity, Geometric Mean for each cross-validation fold
    #
    Accuracy = (( TP + TN) / ( TP + TN + FP + FN))
    Sensitivity = TP / (TP + FN)
    Specificity = TN / (TN + FP)
    GeometricMean = math.sqrt(Sensitivity * Specificity)
    #
    # round above calculations to 4 decimal places 
    #
    Accuracy      = round(Accuracy,4)
    Sensitivity   = round(Sensitivity,4)
    Specificity   = round(Specificity,4)
    GeometricMean = round(GeometricMean,4)
    #
    # print output
    #
    print ("\n-------- Cross Validation Fold", i ,"--------")
    print ("True Positive:  ", TP)
    print ("True Negative:  ", TN)
    print ("False Positive: ", FP)
    print ("False Negative: ", FN)
    print ("Accuracy:       ", Accuracy)
    print ("Sensitivity:    ", Sensitivity)
    print ("Specificity:    ", Specificity)
    print ("Geometric Mean: ", GeometricMean)


-------- Cross Validation Fold 0 --------
True Positive:   14
True Negative:   15
False Positive:  1
False Negative:  2
Accuracy:        0.9062
Sensitivity:     0.875
Specificity:     0.9375
Geometric Mean:  0.9057

-------- Cross Validation Fold 1 --------
True Positive:   15
True Negative:   12
False Positive:  4
False Negative:  1
Accuracy:        0.8438
Sensitivity:     0.9375
Specificity:     0.75
Geometric Mean:  0.8385

-------- Cross Validation Fold 2 --------
True Positive:   15
True Negative:   13
False Positive:  3
False Negative:  1
Accuracy:        0.875
Sensitivity:     0.9375
Specificity:     0.8125
Geometric Mean:  0.8728

-------- Cross Validation Fold 3 --------
True Positive:   13
True Negative:   14
False Positive:  2
False Negative:  3
Accuracy:        0.8438
Sensitivity:     0.8125
Specificity:     0.875
Geometric Mean:  0.8432

-------- Cross Validation Fold 4 --------
True Positive:   16
True Negative:   14
False Positive:  2
False Negative:  0
Accuracy: 

# 15 - classification report (f1 score) for entire test dataset

In [50]:
# Classification Report

# Another important report is the Classification report. 
# It is a text summary of the precision, recall, and F1 score for each class. 
# Scikit-learn calculates it with the classification_report function.

# F1 is the harmonic mean of precision and recall (rather than the geometric mean)
# The harmonic mean is pulled toward the smaller of the two values, so a low precision or a low recall gives a low F1

#import classification_report
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      0.76      0.86      1840
           1       0.07      0.85      0.13        40

    accuracy                           0.76      1880
   macro avg       0.53      0.80      0.50      1880
weighted avg       0.98      0.76      0.85      1880
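
To make the harmonic-mean behaviour concrete, the sketch below recomputes the class 1 row of the report by hand from the test-set confusion matrix counts (TP=34, FP=445, FN=6, with abnormal as the positive class) and compares F1 with the arithmetic and geometric means of precision and recall. The variable names are hypothetical.

```python
import math

# Counts for the positive class (1 = abnormal) from the test-set confusion matrix
tp, fp, fn = 34, 445, 6

precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall = tp / (tp + fn)      # share of true positives that were found

# Three ways of averaging precision and recall
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
geometric = math.sqrt(precision * recall)
arithmetic = (precision + recall) / 2

print("precision:", round(precision, 2))
print("recall:   ", round(recall, 2))
print("f1:       ", round(f1, 2))
print("geometric:", round(geometric, 2), "arithmetic:", round(arithmetic, 2))
```

The harmonic mean is the smallest of the three averages, which is why the very low precision dominates the F1 score of the positive class even though its recall is high.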

