# CS342 Machine Learning
# Lab 3: _K_-NN classification

## Department of Computer Science, University of Warwick

## Sample solutions

## Please note: 

### 1. The code in this sample solution has not been optimised. It is provided so that you can check the results you have obtained with your own code.
### 2. The code assumes that the required files are located in your working directory.

In this lab, we will explore the use and implementation of a _K_-NN classifier and _k_-fold cross-validation.

### _K_-NN classification

 
We will use the Diabetes dataset from the UCI Machine Learning Repository (file *diabetes.data*). Our goal is to predict if female patients will test positive for diabetes given 8 attributes, including age and blood pressure. For more details on the dataset see: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Import the dataset ( _diabetes.data_ ) into a Pandas data frame and standardise the attributes: for each attribute, or feature, compute its mean and standard deviation (see Lab 1) and replace each feature value by:

(value - mean)/standard_deviation. 

Note that the last column corresponds to the class label: 1 for the positive class and 0 for the negative class. Also note that the _*.data_ file has no header. By default, Pandas reads the first row of a _.data_ file as the column names. This behaviour can be disabled by setting the header argument to None (header=None). See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
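To see the effect of the header argument, here is a minimal sketch on a made-up two-row input. The values and the names `raw`, `with_default` and `without_header` are illustrative assumptions and are not taken from _diabetes.data_.

```python
import io
import pandas as pd

# Hypothetical two-row file contents, used only for illustration
raw = "6,148,1\n1,85,0\n"

# Default behaviour: the first row is consumed as the column names
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data and columns are numbered 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)
print(without_header.shape)
print(list(without_header.columns))
```

With the default setting one data row is lost, which is why the solution below passes header=None.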

In [1]:
from __future__ import division
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier as kNN


#import diabetes dataset
dataFrame = pd.read_csv('diabetes.data', header=None)


#standardize attributes (subtract the mean and divide by the standard deviation)
#DO NOT INCLUDE THE CLASS LABEL
means = dataFrame[range(0,8)].mean() #Compute means
stds = dataFrame[range(0,8)].std() #Compute std. deviation

dataFrame_s = (dataFrame - means)/stds #standardise attributes (column 8 is not in means/stds, so it becomes NaN here)
dataFrame_s[8] = dataFrame[8] #Restore the original, unstandardised class label
# print (dataFrame_s)




Given our two classes, i.e., the negative class and the positive class, write a function that takes as input your predicted targets and the true targets (i.e., the ground truth) and computes the *Accuracy* of the classifier,
defined as:



\begin{equation*}
 Accuracy = \frac{TP + TN}{TP + FN + FP + TN},
\end{equation*}

where $TP$ = No. of True Positives (model predicts positive and true value is positive), $FP$ = No. of False Positives (model predicts positive and true value is negative), $TN$ = No. of True Negatives (model predicts negative and true value is negative), and $FN$ = No. of False Negatives (model predicts negative and true value is positive).
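As a small worked example of this definition, the sketch below counts the four outcomes explicitly and then applies the formula. The function name `confusion_counts` and the toy label lists are hypothetical and chosen only for illustration; the sample solution further down uses a shorter equality check instead.

```python
import numpy as np

def confusion_counts(pred, truth):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    tp = int(np.sum((pred == 1) & (truth == 1)))  # predicted positive, truly positive
    fp = int(np.sum((pred == 1) & (truth == 0)))  # predicted positive, truly negative
    tn = int(np.sum((pred == 0) & (truth == 0)))  # predicted negative, truly negative
    fn = int(np.sum((pred == 0) & (truth == 1)))  # predicted negative, truly positive
    return tp, fp, tn, fn

# Toy predictions and ground truth (made up for this example)
pred_toy = [1, 1, 0, 0, 1, 0]
true_toy = [1, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(pred_toy, true_toy)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(tp, fp, tn, fn, accuracy)
```

Since $TP + TN$ is simply the number of predictions that match the ground truth, counting matches gives the same accuracy without separating the four cases.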

Perform _K_-NN classification using the scikit-learn implementation (*sklearn.neighbors.KNeighborsClassifier*) for
_K_ = 1, 2, 3, 4, 5. Use 10-fold cross-validation ( _sklearn.model_selection_ ) to choose the best value of _K_. Make sure to display the accuracy of each classifier. Which is the most accurate classifier based on 10-fold cross-validation?

**Hint:** You may find the *KFold* class within *sklearn.model_selection* useful for keeping track of
the samples assigned to each fold when performing 10-fold cross-validation.
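The following minimal sketch shows how *KFold* hands out row indices. It uses a made-up array of 10 samples and 5 folds so that the output stays readable; the names `X_toy`, `kf_toy`, `fold_sizes` and `seen` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)

# Without shuffling, each validation fold is a contiguous block of rows
kf_toy = KFold(n_splits=5)

fold_sizes = []
seen = []
for fold, (train_idx, val_idx) in enumerate(kf_toy.split(X_toy)):
    # train_idx and val_idx are integer position arrays, suitable for .iloc
    print(fold, train_idx, val_idx)
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen.extend(val_idx.tolist())
```

Every sample appears in exactly one validation fold, so summing the correct predictions over all folds covers the whole dataset once.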

In [3]:
from sklearn.neighbors import KNeighborsClassifier as kNN
from sklearn.model_selection import KFold

#define function to  calculate accuracy
def calcAcc(pred,t):
    TPTN = 0
    total = 0

    for index,val in enumerate(pred):
        if val == t.iloc[index]:
            TPTN += 1
        total += 1
        
    return TPTN, total #return TPTN = (TP+TN) and the total number of predictions; the accuracy is computed from these two values


#define function to perform K-NN classification using 10-fold cross-validation
def crossValidation(dataFrame,nbrs):
    #kf is the 10-fold splitter; kf.split (below) generates the row indices of each fold
    kf = KFold(10)
    TPTN = 0
    total = 0

    #kf.split yields, for each fold, a pair of index arrays: the training rows and the validation rows
    for train, validation in kf.split(dataFrame):
        trainingDF = dataFrame.iloc[train].copy()
        validationDF = dataFrame.iloc[validation].copy()
        n = kNN(n_neighbors=nbrs)
        n.fit(trainingDF[range(0,8)],trainingDF[8])
        
        t1, t2 = calcAcc(n.predict(validationDF[range(0,8)]),validationDF[8])

        TPTN += t1
        total += t2
    
    return TPTN/total



#cross-validate with K = 1, 2, 3, 4, and 5 
k1 = crossValidation(dataFrame_s,1)
print (k1)

k2 = crossValidation(dataFrame_s,2)
print (k2)


k3 = crossValidation(dataFrame_s,3)
print (k3)


k4 = crossValidation(dataFrame_s,4)
print (k4)

k5 = crossValidation(dataFrame_s,5)
print (k5)


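0.7083333333333334
0.7174479166666666
0.74609375
0.7330729166666666
0.7421875

Based on these 10-fold cross-validation results, the most accurate classifier is the one with _K_ = 3 (accuracy of about 0.746), followed closely by _K_ = 5 (about 0.742).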
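
As a cross-check, the same kind of comparison can be sketched with scikit-learn's *cross_val_score* helper, which is not what the solution above uses. The data here are synthetic (`X_demo` and `y_demo` are generated at random purely for illustration), so the printed accuracies are unrelated to the diabetes results. Note also that this sketch averages the per-fold accuracies, whereas the solution above pools the correct predictions over all folds; the two can differ slightly when the folds are not all the same size.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 100 samples, 8 features, binary label (assumption for this sketch)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

mean_acc = {}
for k in range(1, 6):
    # One accuracy value per fold, using the same unshuffled 10-fold split for every k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_demo, y_demo,
                             cv=KFold(n_splits=10), scoring="accuracy")
    mean_acc[k] = scores.mean()
    print(k, mean_acc[k])

# Value of k with the highest mean cross-validated accuracy
best_k = max(mean_acc, key=mean_acc.get)
print("best k:", best_k)
```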
