# k-NN Classification

k-NN: **k** **N**earest **N**eighbors

Neighbors-based classification is a type of *instance-based learning*: it does not attempt to construct a general internal model, but simply stores instances of the training data. 

Classification is computed from a simple majority vote of the *nearest neighbors of each point*: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.


## Example

Consider the following figure:

<img src="assets/knn.png" height="256" width="256">

* The test sample (green circle) should be classified either to the class of blue squares or to the class of red triangles. 

* If k = 3 (solid line circle) it is assigned to the class of red triangles because there are 2 triangles and only 1 square inside the inner circle. 

* If k = 5 (dashed line circle) it is assigned to the class of blue squares (3 squares vs. 2 triangles inside the dashed circle).
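
The flip between k = 3 and k = 5 can be reproduced with a small vote count. This is only a sketch: the label counts come from the figure, but the order of the neighbors within each circle and the helper name `majority_vote` are assumptions made for illustration.

```python
from collections import Counter

# Labels of the five training points closest to the green test sample,
# ordered from nearest to farthest. The counts match the figure
# (2 triangles and 1 square in the inner circle, 2 more squares in the
# outer ring); the exact order within each circle is an assumption.
neighbors_by_distance = ["triangle", "triangle", "square", "square", "square"]


def majority_vote(labels, k):
    """Return the most common label among the first k neighbors."""
    return Counter(labels[:k]).most_common(1)[0][0]


print(majority_vote(neighbors_by_distance, k=3))  # triangle
print(majority_vote(neighbors_by_distance, k=5))  # square
```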


## k-NN Classification

The training examples are vectors in a multidimensional feature space, each with a class label. 

The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
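
The two phases described above fit in a few lines of plain Python. The following is a minimal sketch, not a replacement for a library implementation: the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping train_X and train_y around; all of the
    # work happens at query time.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Sort by distance only, so the labels never take part in the ordering.
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical toy data: two features per sample, two classes.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, query=(2.0, 2.0), k=3))  # a
print(knn_predict(train_X, train_y, query=(6.0, 6.5), k=3))  # b
```

With two classes, an odd k rules out tied votes, which is one reason odd values are a common choice.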

The Euclidean distance is a commonly used metric for continuous variables. For discrete variables, as in text classification, a different metric such as the Hamming distance can be used.
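
To make the two metrics concrete, here is a small sketch of both. The function names are chosen for illustration only; the inputs are assumed to be equal-length sequences.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(hamming_distance("karolin", "kathrin"))      # 3
```

In scikit-learn, the metric is chosen through the `metric` parameter of `KNeighborsClassifier`; its default Minkowski metric with `p=2` is the Euclidean distance.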


## A Code Example

In [1]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# get data
df = pd.read_csv("assets/iris.csv")
X  = df.drop(['id','Species'],axis=1)
y = df['Species']

# set up the model with k=3
model = KNeighborsClassifier(n_neighbors=3)

# do train-test
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.8, test_size=0.2)
model.fit(train_X, train_y)
predict_y = model.predict(test_X)
print("Train-Test Accuracy: {}".format(accuracy_score(test_y, predict_y)))

# do the 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print("Fold Accuracies: {}".format(scores))
print("XV Accuracy: {}".format(scores.mean()))

Train-Test Accuracy: 0.9666666666666667
Fold Accuracies: [ 0.96666667  0.96666667  0.93333333  0.96666667  1.        ]
XV Accuracy: 0.9666666666666668


# Model Comparison

We now have two kinds of models that we can use for classification: decision trees and k-NN.

Let’s work through an example using the ‘wdbc’ dataset and compare the performance of the two models on that data:

* Build optimal tree and KNN models using grid search
* Compute the accuracy for the classifiers
* Print out the confusion matrix for each classifier
* Print out the confidence interval for each classifier
* Decide whether the difference between the classifiers is statistically significant

## Set Up

In [2]:
# basic data routines
import pandas as pd

# models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# model evaluation routines
from bootstrap import bootstrap
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# get data
df = pd.read_csv("assets/wdbc.csv")
df = df.drop(['ID'],axis=1)
print(df.shape)
df.head()

(569, 31)


|   | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | fractal_dimension1 | ... | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | M |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | M |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | M |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | M |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | M |


In [3]:
# format training data for sklearn
X  = df.drop(['Diagnosis'],axis=1)
actual_y = df['Diagnosis']

## Decision Trees

In [4]:
# decision trees
model = DecisionTreeClassifier()

# grid search
param_grid = {'max_depth': list(range(1,21)), 'criterion': ['entropy','gini'] }
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, actual_y)
print("Grid Search: best parameters: {}".format(grid.best_params_))

# evaluate the best model (on the same data it was fit on, so this accuracy is optimistic)
best_model = grid.best_estimator_
predict_y = best_model.predict(X)
print("Accuracy: {}".format(accuracy_score(actual_y, predict_y)))

# build the confusion matrix
labels = ['B', 'M']
cm = confusion_matrix(actual_y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

# bootstrapped confidence interval
print("Confidence interval best decision tree: {}".format(bootstrap(best_model,df,'Diagnosis')))

Grid Search: best parameters: {'criterion': 'entropy', 'max_depth': 4}
Accuracy: 0.984182776801406
Confusion Matrix:
     B    M
B  350    7
M    2  210
Confidence interval best decision tree: (0.92105263157894735, 0.99122807017543857)
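
The `bootstrap` helper used above is imported from a local module whose source is not shown here. As a rough illustration of the general idea only, the sketch below computes a percentile bootstrap interval for accuracy from lists of actual and predicted labels. The function name `bootstrap_accuracy_ci`, its parameters, and the toy labels are all assumptions; the actual helper takes a model and a DataFrame and may resample and refit in a different way.

```python
import random


def bootstrap_accuracy_ci(actual, predicted, n_resamples=200, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = random.Random(seed)
    # 1 where the prediction was right, 0 where it was wrong.
    hits = [int(a == p) for a, p in zip(actual, predicted)]
    n = len(hits)
    accuracies = []
    for _ in range(n_resamples):
        # Resample the n outcomes with replacement and record the accuracy.
        sample = rng.choices(hits, k=n)
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    # Drop the same number of resampled accuracies from each tail.
    cut = int(round((1.0 - level) / 2.0 * n_resamples))
    return accuracies[cut], accuracies[n_resamples - 1 - cut]


# Hypothetical labels: 100 samples, 6 of them misclassified.
actual = ["B"] * 60 + ["M"] * 40
predicted = ["B"] * 57 + ["M"] * 3 + ["M"] * 37 + ["B"] * 3

lower, upper = bootstrap_accuracy_ci(actual, predicted)
print("95% CI for accuracy: ({:.3f}, {:.3f})".format(lower, upper))
```

Fixing the seed keeps the interval reproducible between runs; a larger `n_resamples` gives smoother interval bounds at the cost of more computation.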


## KNN

In [5]:
# KNN
model = KNeighborsClassifier()

# grid search
param_grid = {'n_neighbors': list(range(1,51))}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, actual_y)
print("Grid Search: best parameters: {}".format(grid.best_params_))

# evaluate the best model (on the same data it was fit on, so this accuracy is optimistic)
best_model = grid.best_estimator_
predict_y = best_model.predict(X)
print("Accuracy: {}".format(accuracy_score(actual_y, predict_y)))

# build the confusion matrix
labels = ['B', 'M']
cm = confusion_matrix(actual_y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

# bootstrapped confidence interval
print("Confidence interval best KNN: {}".format(bootstrap(best_model,df,'Diagnosis')))

Grid Search: best parameters: {'n_neighbors': 14}
Accuracy: 0.9402460456942003
Confusion Matrix:
     B    M
B  349    8
M   26  186
Confidence interval best KNN: (0.88574561403508767, 0.97368421052631582)
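
For the last step, one common rule of thumb treats the difference between two classifiers as not statistically significant when their confidence intervals overlap. The criterion is assumed here, since the notebook does not state which test it intends, and the helper name `intervals_overlap` is hypothetical. The interval values are copied from the outputs above.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Confidence intervals copied from the decision tree and KNN outputs.
tree_ci = (0.92105263157894735, 0.99122807017543857)
knn_ci = (0.88574561403508767, 0.97368421052631582)

if intervals_overlap(tree_ci, knn_ci):
    print("Intervals overlap: difference not significant by this criterion")
else:
    print("Intervals do not overlap: difference is significant")
```

Under this criterion the two intervals overlap, so the higher accuracy of the decision tree on this data would not count as a statistically significant difference. Overlap is a conservative check: intervals that do not overlap indicate a significant difference, but overlapping intervals do not rule one out.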
