# K-Nearest Neighbors

In [1]:
import numpy as np
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

A higher k gives the algorithm more neighbors on which to base a prediction. However, if those neighbors are diverse (that is, they belong to different classes), a larger k will not necessarily make the prediction more accurate. A low k bases the prediction on fewer neighbors, which makes it more sensitive to noise and also might not give the most accurate prediction.

The n_neighbors argument sets the number of neighboring observations (k) that the classification algorithm uses to make a prediction about an observation.

The weights argument determines how the neighbors are weighted (i.e. how much impact each neighbor has on the prediction). The uniform option gives every neighbor the same weight. The distance option weights each neighbor by the inverse of its distance, so closer neighbors have a larger impact on the prediction than farther ones.
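To make the difference between the two weights options concrete, here is a minimal sketch on a made-up one-dimensional dataset (the points, labels, and variable names are illustrative assumptions, not part of the iris analysis). With k = 3, one nearby neighbor of class "a" competes against two farther neighbors of class "b".

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for illustration: one "a" point close to the query
# and two "b" points farther away.
X_toy = np.array([[0.0], [2.0], [3.0]])
y_toy = np.array(["a", "b", "b"])
query = np.array([[0.5]])  # distances to the three points: 0.5, 1.5, 2.5

# uniform: each of the 3 neighbors gets one vote, so "b" wins 2 to 1
uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")
uniform_pred = uniform_knn.fit(X_toy, y_toy).predict(query)[0]

# distance: votes are weighted by 1 / distance, so "a" gets 1/0.5 = 2.0
# while "b" gets 1/1.5 + 1/2.5, roughly 1.07, and "a" wins
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
distance_pred = distance_knn.fit(X_toy, y_toy).predict(query)[0]

print("uniform :", uniform_pred)
print("distance:", distance_pred)
```

The same query is classified differently only because of how the three votes are weighted; the set of neighbors is identical in both models.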

In [2]:
# load iris dataset
iris = sns.load_dataset("iris")
iris.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
# create feature matrix and target array
X_iris = iris.drop("species", axis = 1)
Y_iris = iris["species"]

In [4]:
# create knn model with k=1
k1 = KNeighborsClassifier(n_neighbors = 1)

In [5]:
# fit on the full dataset, then score predictions for the same observations (training accuracy)
k1.fit(X_iris, Y_iris)
Y_k1_predict = k1.predict(X_iris)
accuracy_score(Y_iris, Y_k1_predict)

1.0

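Varying k changes the accuracy score. Note that these scores are computed on the same data the model was trained on, so k = 1 scores a perfect 1.0 simply because each observation is its own nearest neighbor.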

In [6]:
k5 = KNeighborsClassifier(n_neighbors = 5)
k5.fit(X_iris, Y_iris)
Y_k5_predict = k5.predict(X_iris)
accuracy_score(Y_iris,Y_k5_predict)

0.9666666666666667

In [7]:
k20 = KNeighborsClassifier(n_neighbors = 20)
k20.fit(X_iris, Y_iris)
Y_k20_predict = k20.predict(X_iris)
accuracy_score(Y_iris,Y_k20_predict)

0.98
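One way to see how k behaves on observations the model has not seen is to hold out a test set with train_test_split, which is imported at the top but not used in the cells here. The following is a sketch under stated assumptions: it loads scikit-learn's bundled copy of iris via load_iris so that it runs on its own, and the 30% test size, the random_state, and the stratify option are arbitrary choices, not settings from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for the seaborn copy
X, y = load_iris(return_X_y=True)

# Assumption: 70/30 split, fixed seed, class proportions preserved in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for k in (1, 5, 20):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = {
        # accuracy on the observations used for fitting
        "train": accuracy_score(y_train, model.predict(X_train)),
        # accuracy on the held-out observations
        "test": accuracy_score(y_test, model.predict(X_test)),
    }

for k, scores in results.items():
    print(f"k={k:>2}  train={scores['train']:.3f}  test={scores['test']:.3f}")
```

The exact numbers depend on the split, so the test column is best read as a rough estimate rather than a fixed property of each k.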

The cross_val_score function divides the data into a number of folds (specified by the cv argument). The model is trained on all folds but one and then validated on the excluded fold. This training/validating process repeats, leaving out a different fold each time, and the function returns one accuracy score per fold, which can then be averaged.

This function is especially useful when you have a dataset with a small sample size, because every observation is used for both training and validation.
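The fold-by-fold procedure described above can be written out by hand. This sketch assumes scikit-learn's bundled iris data (load_iris) in place of the seaborn copy, and it relies on the documented behavior that an integer cv with a classifier uses stratified, unshuffled folds; the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for the seaborn copy
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# Split into 5 folds that each preserve the class proportions
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, valid_idx in folds.split(X, y):
    fold_model = clone(model)                   # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on the other 4 folds
    # score() returns mean accuracy on the held-out fold
    manual_scores.append(fold_model.score(X[valid_idx], y[valid_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=5)

print(manual_scores)
print(library_scores)
```

Both arrays should agree, which shows that cross_val_score is a convenience wrapper around this loop rather than a different evaluation method.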

In [8]:
k3 = KNeighborsClassifier(n_neighbors = 3)
score_k3 = cross_val_score(k3, X_iris, Y_iris, cv = 5)
print(score_k3)
np.mean(score_k3)

[0.96666667 0.96666667 0.93333333 0.96666667 1.        ]


0.9666666666666668

The last fold has a perfect accuracy score, while the third fold has the lowest (about 0.93) and the remaining three score about 0.97. Overall, this algorithm predicted the species of iris quite well. It's important not to cherry-pick models based on a single fold's score (similar to the multiple comparisons problem with p-values).
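In line with the caution about single scores, one option is to compare candidate values of k on the mean and spread of their fold scores. This is a sketch only: the list of k values is an arbitrary assumption, and it again uses scikit-learn's bundled iris data (load_iris) so that it is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for the seaborn copy
X, y = load_iris(return_X_y=True)

summary = {}
for k in (1, 3, 5, 20):  # candidate values chosen arbitrarily for illustration
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    # keep the mean and standard deviation across folds, not any single fold
    summary[k] = (scores.mean(), scores.std())

for k, (mean, std) in summary.items():
    print(f"k={k:>2}  mean accuracy={mean:.3f}  std={std:.3f}")
```

When the means differ by less than the spread across folds, the data do not clearly favor one k over another.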