# K-Nearest Neighbors (KNN)
Classifying student success data with the [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) from scikit-learn.

## Import Data
Import the data into a pandas DataFrame. Create dummy variables for each categorical predictor and build the design matrix, then reduce it with `SelectPercentile` at its default settings. Also create a standardized and a normalized version of the design matrix to compare model performance. Convert the response variable to three classes: *0* (fail), *1* (pass), and *2* (incomplete).

In [1]:
import time
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectPercentile
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.extmath import cartesian
from sklearn import preprocessing
from sklearn import metrics

def response_conv(arr):
    new = []
    for i in arr:
        if (i > 0 and i < 10):           # condition where student failed
            new.append(0)                 
                                          
        elif (i >= 10):                   # condition where student passed
            new.append(1)                 
    
        else:                             # condition where student received an incomplete
            new.append(2)
    return new                            # 1-dimensional response variable returned

df = pd.read_csv('student-por2.csv')
df = pd.get_dummies(df)  # drop_first=True is not used
X = df.drop('G3', axis=1)
y = response_conv(list(df.G3))

select = SelectPercentile()
newX = select.fit_transform(X,y)

X_scale = preprocessing.scale(newX)
X_norm = preprocessing.normalize(newX)
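
The two transformations above act along different axes: `preprocessing.scale` standardizes each column (feature), while `preprocessing.normalize` rescales each row (observation) to unit L2 norm by default. A minimal sketch on a made-up toy matrix (`X_toy` and its values are hypothetical, not taken from the student data) illustrates the difference:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy design matrix: 3 observations, 2 features on different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

X_toy_scale = preprocessing.scale(X_toy)     # column-wise: zero mean, unit variance
X_toy_norm = preprocessing.normalize(X_toy)  # row-wise: each row has unit L2 norm

col_means = X_toy_scale.mean(axis=0)
col_stds = X_toy_scale.std(axis=0)
row_norms = np.linalg.norm(X_toy_norm, axis=1)

print("column means after scale:", col_means)
print("column stds after scale: ", col_stds)
print("row norms after normalize:", row_norms)
```

Since KNN relies on distances between observations, the choice between the two can change which neighbors are nearest.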



## Test/Training Sets and Optimal K
To train and later test the model, split each design matrix and response vector into training and test sets. The function *opt* searches over the weight function and the number of neighbors (odd values of k from 1 to 49) using 10-fold cross-validation scored by negative log loss. A KNN model is then created for each design matrix using its respective selected values.

In [2]:
random.seed(1)
X1_train, X1_test, y1_train, y1_test = train_test_split(newX, y, test_size=0.33,random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X_scale, y, test_size=0.33,random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X_norm, y, test_size=0.33,random_state=42)

start_time = time.time()
myList = list(range(1,50))
neighbors = list(filter(lambda x: x % 2 != 0, myList))
combos = cartesian([['uniform','distance'],neighbors])

def opt(X,y):
    log_lo = []

    for w, k in combos:
        knn = KNeighborsClassifier(n_neighbors=int(k),weights=str(w))
        scores = cross_val_score(knn, X, y, cv=10, scoring='neg_log_loss')
        log_lo.append(scores.mean())
    
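    # Caution: 'neg_log_loss' is a higher-is-better score, so min() returns the
    # combination with the largest log loss; max() would return the smallest.
    opt_ = combos[log_lo.index(min(log_lo))]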
    return(opt_)

w1, k1 = opt(X1_train,y1_train)
w2, k2 = opt(X2_train,y2_train)
w3, k3 = opt(X3_train,y3_train)

print ("The optimal weight function and number of neighbors is %s and %r respectively for Non-standardized design matrix." % (w1,k1))
print ("The optimal weight function and number of neighbors is %s and %r respectively for standardized design matrix." % (w2,k2))
print ("The optimal weight function and number of neighbors is %s and %r respectively for normalized design matrix." % (w3, k3))
print("Run time: %r minutes" % (round((int(time.time() - start_time)/60),2)))

The optimal weight function and number of neighbors is uniform and '1' respectively for Non-standardized design matrix.
The optimal weight function and number of neighbors is uniform and '1' respectively for standardized design matrix.
The optimal weight function and number of neighbors is uniform and '1' respectively for normalized design matrix.
Run time: 0.08 minutes
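
Because `neg_log_loss` is the negated log loss, values closer to zero indicate better probabilistic predictions, so the best combination is the one with the largest mean score. The following is a minimal sketch of that selection, assuming synthetic data from `make_classification` in place of the student data and a smaller grid; the names `X_demo`, `y_demo`, `weights_grid`, and `k_grid` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): NOT the student data set.
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)

weights_grid = ['uniform', 'distance']  # hypothetical reduced grid
k_grid = [1, 5, 9, 13]

results = []
for w in weights_grid:
    for k in k_grid:
        knn = KNeighborsClassifier(n_neighbors=k, weights=w)
        # Mean negated log loss over 5 folds; higher (closer to 0) is better.
        score = cross_val_score(knn, X_demo, y_demo, cv=5,
                                scoring='neg_log_loss').mean()
        results.append((score, w, k))

best_score, best_w, best_k = max(results)        # smallest log loss
worst_score, worst_w, worst_k = min(results)     # largest log loss

print("best: ", best_w, best_k, round(best_score, 3))
print("worst:", worst_w, worst_k, round(worst_score, 3))
```

Storing the grid as a list of tuples keeps `k` as an integer, whereas the recorded output above reports it as the string `'1'`.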


## Fit and Predict
Fit the KNN model to each design matrix and create a DataFrame comparing each model's predictions to the actual values of the test set.

In [3]:
knn1 = KNeighborsClassifier(n_neighbors=int(k1),weights=str(w1)).fit(X1_train,y1_train)
knn2 = KNeighborsClassifier(n_neighbors=int(k2),weights=str(w2)).fit(X2_train,y2_train)
knn3 = KNeighborsClassifier(n_neighbors=int(k3),weights=str(w3)).fit(X3_train,y3_train)

knn_pred1 = knn1.predict(X1_test)
knn_pred2 = knn2.predict(X2_test)
knn_pred3 = knn3.predict(X3_test)

pred = pd.DataFrame(list(zip(y1_test, knn_pred1,knn_pred2,knn_pred3)), columns=['y_act','y_knn','y_knn_stand','y_knn_norm'])
pred.index.name = 'Obs'
pred


| Obs | y_act | y_knn | y_knn_stand | y_knn_norm |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 0 | 1 | 0 | 0 |
| 8 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 |
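
In this notebook the standardized matrix is computed on the full data before the train/test split. An alternative, sketched here as an assumption and not as the approach used above, wraps the scaler and the classifier in a scikit-learn `Pipeline` so that the scaling parameters are learned from the training rows only. The data are synthetic (`make_classification`), and `n_neighbors=5` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class stand-in data (assumption): NOT the student data set.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          random_state=42, stratify=y_demo)

# The scaler is fitted inside the pipeline, using training data only.
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=5, weights='uniform'))
model.fit(X_tr, y_tr)

demo_pred = model.predict(X_te)        # test rows are scaled with training statistics
demo_accuracy = model.score(X_te, y_te)
print("test accuracy on synthetic data:", round(demo_accuracy, 3))
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refitted within each fold.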


## Results
Report the model accuracy, confusion matrix, and classification report for each model.

In [4]:
cm_knn1 = pd.DataFrame(metrics.confusion_matrix(y1_test, knn_pred1), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_knn2 = pd.DataFrame(metrics.confusion_matrix(y2_test, knn_pred2), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_knn3 = pd.DataFrame(metrics.confusion_matrix(y3_test, knn_pred3), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])

zero = 0
one = 0
two = 0
for i in y1_train:
    if i == 0:
        zero += 1
    elif i == 1:
        one += 1
    else:
        two += 1
num1 = round(zero/len(y1_train),2)
num2 = round(one/len(y1_train),2)
num3 = round(two/len(y1_train),2)
print("The response vector has the following distribution: \nzeros: %r \nones: %r \ntwos: %r" % (num1,num2,num3))
print("\n")

print ("The accuracy of the Non-standardized KNN model is: ", knn1.score(X1_test,y1_test))
print("\n")
print ("The accuracy of the Standardized KNN model is: ", knn2.score(X2_test,y2_test))
print("\n")
print ("The accuracy of the Normalized KNN model is: ", knn3.score(X3_test,y3_test))
print("\n")

print("Non-standardized KNN Confusion Matrix: \n", cm_knn1)
print("\n")
print("Standardized KNN Confusion Matrix: \n", cm_knn2)
print("\n")
print("Normalized KNN Confusion Matrix: \n", cm_knn3)
print("\n")

print("Classification report for Non-standardized design matrix:\n", metrics.classification_report(y1_test,knn_pred1))
print("\n")
print("Classification report for standardized design matrix:\n", metrics.classification_report(y2_test,knn_pred2))
print("\n")
print("Classification report for Normalized design matrix:\n", metrics.classification_report(y3_test,knn_pred3))

The response vector has the following distribution: 
zeros: 0.14 
ones: 0.83 
twos: 0.02


The accuracy of the Non-standardized KNN model is:  0.883720930233


The accuracy of the Standardized KNN model is:  0.888372093023


The accuracy of the Normalized KNN model is:  0.841860465116


Non-standardized KNN Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       13        7       3
Pass(1)       12      174       1
Inc(2)         1        1       3


Standardized KNN Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       12        9       2
Pass(1)       10      176       1
Inc(2)         1        1       3


Normalized KNN Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)        9       12       2
Pass(1)       17      169       1
Inc(2)         2        0       3


Classification report for Non-standardized design matrix:
              precision    recall  f1-score   support

          0       0.50      0.57      0.53        23
          1       0.96