# K-Nearest Neighbour Classification

A supervised classifier that memorizes observations from a labeled training set in order to predict class labels for new, unlabeled observations.

KNN makes predictions based on how similar training observations are to the new, incoming observations.

The more similar the feature values of two observations, the more likely they are to be given the same label.
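
The idea above can be sketched from scratch in a few lines. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=3):
    """Label each row of x_new by majority vote among its k nearest training rows."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)
    predictions = []
    for point in x_new:
        # Euclidean distance from this point to every training observation
        distances = np.sqrt(((x_train - point) ** 2).sum(axis=1))
        # Indices of the k closest training observations
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those neighbours
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data (made up for illustration): two well-separated groups
toy_x = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(toy_x, toy_y, [[1.1, 1.0], [5.1, 5.0]], k=3)
print(toy_pred)
```

Each new point takes the label held by most of its three closest training points, which is why KNN works best when the groups are clearly separated.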

Use cases:

- Stock Price Prediction
- Credit Risk Analysis 
- Predictive Trip Planning
- Recommendation Systems

Assumptions:

- Dataset has little noise
- Dataset is labeled
- Dataset only contains relevant features
- Dataset has distinguishable subgroups
- Dataset is small to moderate in size (KNN searches the stored training data at prediction time, so it becomes slow on large datasets)

In [4]:
import numpy as np 
import pandas as pd 
import scipy
import urllib
import sklearn

import matplotlib.pyplot as plt 
from pylab import rcParams

from sklearn import neighbors, preprocessing, metrics
from sklearn.model_selection import train_test_split

In [5]:
from sklearn.neighbors import KNeighborsClassifier

In [6]:
np.set_printoptions(precision=4, suppress=True)
%matplotlib inline
rcParams['figure.figsize'] = 7,4 
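# The seaborn styles were renamed in matplotlib 3.6; on older versions use 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')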

## Importing your data

In [8]:
cars = pd.read_csv('../../inputs/mtcars.csv')
cars.head()

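|   | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |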


## Preparing your data

In [9]:
x_prime = cars[['mpg','disp','hp','wt']].values
y = cars.am.values

In [10]:
x = preprocessing.scale(x_prime)
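
Scaling matters for KNN because distances are otherwise dominated by whichever feature has the largest numeric range (here `disp` and `hp` dwarf `wt`). With its default settings, `preprocessing.scale` centres each column on zero and divides by that column's standard deviation. A minimal NumPy sketch of the same arithmetic, using the five rows shown by `cars.head()` and a hypothetical helper named `standardize`:

```python
import numpy as np

def standardize(features):
    """Centre each column on 0 and rescale it to unit standard deviation."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)

# mpg, disp, hp, wt for the five rows displayed by cars.head()
sample = np.array([[21.0, 160.0, 110, 2.620],
                   [21.0, 160.0, 110, 2.875],
                   [22.8, 108.0,  93, 2.320],
                   [21.4, 258.0, 110, 3.215],
                   [18.7, 360.0, 175, 3.440]])

scaled = standardize(sample)
print(scaled.mean(axis=0))  # approximately 0 in every column
print(scaled.std(axis=0))   # approximately 1 in every column
```

Note that the notebook scales the full dataset before splitting it, so the test rows contribute to the means and standard deviations. A stricter workflow would compute them on the training rows only and reuse them for the test rows.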

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.2, random_state=17)

## Building and training your model

In [13]:
clf = KNeighborsClassifier()
clf.fit(x_train, y_train)
print(clf)

KNeighborsClassifier()
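
Calling `KNeighborsClassifier()` with no arguments relies on scikit-learn's defaults, most importantly `n_neighbors=5`. The sketch below prints those defaults and shows how a different `k` could be set explicitly. The one-dimensional toy data is made up for illustration and is unrelated to `mtcars`.

```python
from sklearn.neighbors import KNeighborsClassifier

# Inspect the defaults used when no arguments are passed
default_clf = KNeighborsClassifier()
params = default_clf.get_params()
print(params['n_neighbors'], params['weights'], params['metric'], params['p'])

# Toy data (illustrative only): one feature, two groups
toy_x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_y = [0, 0, 0, 1, 1, 1]

# Choose k explicitly instead of relying on the default
small_k_clf = KNeighborsClassifier(n_neighbors=3).fit(toy_x, toy_y)
toy_pred = small_k_clf.predict([[0.05], [1.15]])
print(toy_pred)
```

On a dataset as small as this notebook's training split, the choice of `k` can change predictions noticeably, so it is worth trying a few values rather than accepting the default unexamined.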


## Evaluating your model predictions

In [15]:
y_pred = clf.predict(x_test)
y_expect = y_test

print(metrics.classification_report(y_expect, y_pred))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89         4
           1       1.00      0.67      0.80         3

    accuracy                           0.86         7
   macro avg       0.90      0.83      0.84         7
weighted avg       0.89      0.86      0.85         7



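These results say that of the 3 test points whose true label is 1, the model correctly identified only 67% (recall of 0.67 for class 1). Every point that the model labeled 1 really was a 1 (precision of 1.00 for class 1).
Across the 7 test observations, 86% of the predictions were correct (accuracy of 0.86). The 0.83 figure is the macro-averaged recall, that is, the unweighted mean of the two per-class recalls.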

High precision combined with low recall generally means that the model returns few positive predictions, but most of the ones it does return are correct.
In other words, the positive predictions are reliable but incomplete.
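
To make the report's numbers concrete, the sketch below recomputes precision, recall, and accuracy by hand. The two label arrays are hypothetical: they were constructed to be consistent with the report (4 true zeros, 3 true ones, one class-1 point misclassified as 0) and are not the actual rows of the test split.

```python
import numpy as np

# Hypothetical labels consistent with the classification report above
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 0])

# Counts for the positive class (label 1)
tp = int(((y_hat == 1) & (y_true == 1)).sum())  # predicted 1, truly 1
fp = int(((y_hat == 1) & (y_true == 0)).sum())  # predicted 1, truly 0
fn = int(((y_hat == 0) & (y_true == 1)).sum())  # predicted 0, truly 1

precision_1 = tp / (tp + fp)           # share of predicted 1s that are correct
recall_1 = tp / (tp + fn)              # share of true 1s that were found
accuracy = (y_hat == y_true).mean()    # share of all predictions that are correct

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
```

Precision asks "of what I flagged, how much was right?", while recall asks "of what I should have flagged, how much did I find?".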