# CS342 Machine Learning
# Lab 3: _K_-NN classification

## Department of Computer Science, University of Warwick

In this lab, we will explore the use and implementation of a _K_-NN classifier and _k_-fold cross-validation.

# Data files for the lab

If working on one of the DCS machines, the data may be found here:

```/modules/cs342/2020/lab3/data/diabetes.data ```

You may load the data directly from that directory.

The data are also available on the CS342 webpage.


### _K_-NN classification

 
We will use the Diabetes dataset from the UCI Machine Learning Repository (file *diabetes.data*). Our goal is to predict whether a female patient will test positive for diabetes given 8 attributes (features), including age and blood pressure. For more details on the dataset, see: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Import the dataset ( _diabetes.data_ ) into a Pandas data frame and standardise the features: for each feature, compute its mean and standard deviation (see Lab 1) and replace each feature value by:

(value - mean)/standard_deviation. 

Note that the last column corresponds to the class label: 1 for the positive class and 0 for the negative class. Also note that the _*.data_ file has no header row. By default, Pandas reads the first row of the file as the column names, so the first sample would be lost. This behaviour can be disabled by setting the `header` argument to `None`. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
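The effect of the `header` argument can be checked on a small in-memory example. The sketch below uses made-up values (not the diabetes data) and `io.StringIO` in place of a file; the names `raw`, `toy` and `toy_s` are illustrative only. It then standardises the feature columns as described above, leaving the label column untouched.

```python
import io
import pandas as pd

# Hypothetical three-row file: two feature columns and a class label, no header row.
raw = "1.0,10.0,0\n2.0,20.0,1\n3.0,30.0,0\n"

# Default behaviour: the first row is consumed as column names.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data and numbers the columns 0, 1, 2.
toy = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)  # (2, 3)
print(toy.shape)           # (3, 3)

# Standardise the feature columns only; the label column (2) is left as it is.
features = [0, 1]
toy_s = toy.copy()
toy_s[features] = (toy[features] - toy[features].mean()) / toy[features].std()
print(toy_s)
```

Pandas' `std()` uses the sample standard deviation (dividing by n - 1) by default, which is what this sketch relies on.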

In [56]:
from __future__ import division
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier as kNN


#import diabetes dataset
dataFrame = pd.read_csv('diabetes.data', header=None)


#standardize attributes (subtract the mean and divide by the standard deviation)
#DO NOT INCLUDE THE CLASS LABEL
means = dataFrame[range(0,8)].mean() #Compute means
stds = dataFrame[range(0,8)].std() #Compute std. deviation

dataFrame_s = (dataFrame - means)/stds #standardise attributes
dataFrame_s[8] = dataFrame[8] #Include the label
print (dataFrame_s)





            0         1         2         3         4         5         6  \
0    0.639530  0.847771  0.149543  0.906679 -0.692439  0.203880  0.468187   
1   -0.844335 -1.122665 -0.160441  0.530556 -0.692439 -0.683976 -0.364823   
2    1.233077  1.942458 -0.263769 -1.287373 -0.692439 -1.102537  0.604004   
3   -0.844335 -0.997558 -0.160441  0.154433  0.123221 -0.493721 -0.920163   
4   -1.141108  0.503727 -1.503707  0.906679  0.765337  1.408828  5.481337   
..        ...       ...       ...       ...       ...       ...       ...   
763  1.826623 -0.622237  0.356200  1.721613  0.869464  0.115094 -0.908090   
764 -0.547562  0.034575  0.046215  0.405181 -0.692439  0.609757 -0.398023   
765  0.342757  0.003299  0.149543  0.154433  0.279412 -0.734711 -0.684747   
766 -0.844335  0.159683 -0.470426 -1.287373 -0.692439 -0.240048 -0.370859   
767 -0.844335 -0.872451  0.046215  0.655930 -0.692439 -0.201997 -0.473476   

            7  8  
0    1.425067  1  
1   -0.190548  0  
2   -0.105515  1  

For our two classes (the negative class and the positive class), write a function that takes as input your predicted classes and the true classes (i.e., the ground truth), and computes the *Accuracy* of the classifier, defined as:



\begin{equation*}
 Accuracy = \frac{TP + TN}{TP + FN + FP + TN},
\end{equation*}

where $TP$ = No. of True Positives (model predicts positive and true value is positive), $FP$ = No. of False Positives (model predicts positive and true value is negative), $TN$ = No. of True Negatives (model predicts negative and true value is negative), and $FN$ = No. of False Negatives (model predicts negative and true value is positive).
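As a worked illustration of these definitions, the sketch below counts the four outcomes on a made-up set of six predictions and evaluates the accuracy formula. The function names and the toy labels are hypothetical, and the code assumes that labels are exactly 0 or 1.

```python
def confusion_counts(predicted, actual):
    """Return (TP, FP, TN, FN) for binary labels, with 1 = positive and 0 = negative."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == 1 and a == 1:
            tp += 1  # predicted positive, truly positive
        elif p == 1 and a == 0:
            fp += 1  # predicted positive, truly negative
        elif p == 0 and a == 0:
            tn += 1  # predicted negative, truly negative
        else:
            fn += 1  # predicted negative, truly positive (assumes labels are 0 or 1)
    return tp, fp, tn, fn


def accuracy(predicted, actual):
    tp, fp, tn, fn = confusion_counts(predicted, actual)
    return (tp + tn) / (tp + fn + fp + tn)


predicted = [1, 0, 1, 1, 0, 0]
actual = [1, 0, 0, 1, 1, 0]
print(confusion_counts(predicted, actual))  # (2, 1, 2, 1)
print(accuracy(predicted, actual))          # 4 correct out of 6
```

Since accuracy only depends on TP + TN, it is enough in practice to count the predictions that match the ground truth.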

Perform _K_-NN classification using the scikit-learn implementation (*sklearn.neighbors.KNeighborsClassifier*) for
_K_ ∈ {1, 2, 3, 4, 5}. Use 10-fold cross-validation (see _sklearn.model_selection_) to choose the best value of _K_. Make sure to display the accuracy of each classifier. Which classifier is the most accurate based on 10-fold cross-validation?

**Hint:** You may find the *KFold* class in *sklearn.model_selection* useful for keeping track of
the samples assigned to each fold when performing 10-fold cross-validation.
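To see what *KFold* produces, the following minimal sketch splits ten made-up samples into five folds and prints the row positions used for training and validation in each fold. The array `X` is a placeholder, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples with two features each (placeholder values).
X = np.arange(20).reshape(10, 2)

# Without shuffling (the default), each validation fold is a contiguous block of rows.
kf = KFold(n_splits=5)
folds = list(kf.split(X))

for fold, (train_idx, val_idx) in enumerate(folds):
    print(fold, train_idx, val_idx)
```

Each sample appears in exactly one validation fold, so pooling the validation predictions over all folds covers the whole dataset once.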

In [58]:
from sklearn.neighbors import KNeighborsClassifier as kNN
from sklearn.model_selection import KFold
#define function to calculate accuracy
def calcAcc(pred,t):
    TPTN = 0 #number of correct predictions (TP + TN)
    total = 0 #total number of predictions

    for index,val in enumerate(pred):
        if val == t.iloc[index]: #prediction matches the true label
            TPTN += 1
        total += 1

    #return TPTN = (TP+TN) and the total number of predictions; the accuracy is computed from these two values
    return TPTN, total



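#define function to perform K-NN classification using k-fold cross-validation
def crossValidation(dataFrame, nbrs):
    TPTN = 0 #correct predictions accumulated over all folds
    total = 0 #predictions accumulated over all folds
    kf = KFold(10) #10 folds, no shuffling

    #kf.split yields the row positions of the training and validation instances for each fold
    for training, validation in kf.split(dataFrame):
        trainingDf = dataFrame.iloc[training]
        validationDf = dataFrame.iloc[validation]

        #fit on the 8 feature columns; column 8 holds the class label
        n = kNN(n_neighbors=nbrs)
        n.fit(trainingDf[range(0,8)], trainingDf[8])

        t1, t2 = calcAcc(n.predict(validationDf[range(0,8)]), validationDf[8])

        TPTN+=t1
        total+=t2

    #pooled accuracy over all validation folds
    return TPTN/total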


#cross-validate with K = 1 on the standardised data (repeat for K = 2, 3, 4, and 5)

k1 = crossValidation(dataFrame_s, 1)
print(k1)


0.7083333333333334
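
The comparison over _K_ = 1, ..., 5 can be sketched as follows. This is one possible approach, not the lab's reference solution: it uses scikit-learn's `cross_val_score` helper instead of the hand-written loop, and it runs on synthetic data from `make_classification` (an assumption made so that the example is self-contained), so its numbers say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data: 200 samples, 8 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv = KFold(n_splits=10)  # same unshuffled 10-fold split for every K
mean_accuracy = {}
for k in range(1, 6):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[k] = scores.mean()  # average of the 10 per-fold accuracies
    print(k, mean_accuracy[k])

# Pick the K with the highest mean cross-validated accuracy.
best_k = max(mean_accuracy, key=mean_accuracy.get)
print("best K:", best_k)
```

Note that this averages the per-fold accuracies, whereas `crossValidation` pools the correct predictions over all folds before dividing; the two can differ slightly when the folds are not all the same size.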
