# K-Nearest Neighbors on the Iris dataset

Here, the features and the labels are split into train and test data using the "train_test_split" function from scikit-learn. The train data is used to fit the model, while the test data, which is unseen by the model, is used to evaluate it.

In [29]:
from sklearn.datasets import load_iris
iris = load_iris()

from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(iris.data, iris.target, random_state=2261)

print(Xtrain.shape, Xtest.shape)

(112, 4) (38, 4)
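The 112/38 shapes follow from the default behavior of `train_test_split`, which holds out 25% of the rows when no `test_size` is given. The sketch below writes that fraction out and, as an optional variation that the split above does not use, passes `stratify` so that each class keeps the same proportion in both sets. The names `X_tr`, `X_te`, `y_tr` and `y_te` are introduced here only to avoid overwriting the variables above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# The default 25% test fraction, written out explicitly.
# stratify=iris.target is an optional variation, not part of the original split.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target,
    test_size=0.25,
    random_state=2261,
    stratify=iris.target,
)

print(X_tr.shape, X_te.shape)
# Number of rows per class in each split.
print(np.bincount(y_tr), np.bincount(y_te))
```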


We implement a Euclidean distance function that calculates the distance between two rows. The train rows with the shortest distances to a test row (as many as the required k) are then taken as its nearest neighbors.

Applying the function to contrived data, we can see that the distance between f1 and f2 is shorter than the distance between f1 and f3, which makes f2 the nearest neighbor of f1.

In [30]:
import math

def euclidean_distance(firstrow, secondrow):
    dist = 0.0
    for i in range(len(firstrow)):
        dist += math.pow(firstrow[i]-secondrow[i],2)
    return math.sqrt(dist)

f1 = (5.1, 3.5, 1.4, 0.2)
f2 = (4.9, 3.0, 1.4, 0.2)
f3 = (7.0, 3.2, 4.7, 1.4)

print(euclidean_distance(f1, f2))
print(euclidean_distance(f1, f3))

0.5385164807134502
4.003748243833521
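For comparison, the same distance can be computed with NumPy in one call. This is a sketch of an equivalent, not a replacement for the loop above; `euclidean_distance_np` is a name introduced here for illustration.

```python
import numpy as np

def euclidean_distance_np(firstrow, secondrow):
    # Convert to float arrays so tuples, lists and NumPy rows all work.
    a = np.asarray(firstrow, dtype=float)
    b = np.asarray(secondrow, dtype=float)
    # The L2 norm of the difference is the Euclidean distance.
    return float(np.linalg.norm(a - b))

f1 = (5.1, 3.5, 1.4, 0.2)
f2 = (4.9, 3.0, 1.4, 0.2)
f3 = (7.0, 3.2, 4.7, 1.4)

print(euclidean_distance_np(f1, f2))
print(euclidean_distance_np(f1, f3))
```

Both values should agree with the loop version up to floating-point rounding.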


Next, we implement a function that finds the nearest neighbors of a test row, given the train set, the test row and the number of neighbors (k). We enumerate the train set so that each distance keeps the index of its train row. The function returns the k shortest distances, each paired with that index. An example call follows the definition.


In [31]:
def get_neighbors(train, test, k):
    distances = list()
    for idx, x in enumerate(train):
        dis = euclidean_distance(x, test)
        distances.append((idx, dis))
    distances.sort(key=lambda tup: tup[1])
    
    neighbors = list()
    
    for i in range(k):
        neighbors.append(distances[i])
    return neighbors

In [34]:
nearest = get_neighbors(Xtrain, Xtest[1], 2)

print(nearest)

[(107, 0.24494897427831777), (106, 0.38729833462074154)]
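The same search can be sketched with NumPy by computing all distances at once and sorting their indices. The toy array below is made up for illustration, and `get_neighbors_np` is a hypothetical name, not part of the notebook above.

```python
import numpy as np

def get_neighbors_np(train, test_row, k):
    train = np.asarray(train, dtype=float)
    test_row = np.asarray(test_row, dtype=float)
    # Distance from the test row to every train row (broadcast over rows).
    dists = np.sqrt(((train - test_row) ** 2).sum(axis=1))
    # A stable sort keeps the lower index first when distances tie.
    order = np.argsort(dists, kind="stable")[:k]
    return [(int(i), float(dists[i])) for i in order]

toy_train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
toy_neighbors = get_neighbors_np(toy_train, [0.0, 0.0], 2)
print(toy_neighbors)  # [(0, 0.0), (2, 1.0)]
```

As in the loop version, each result is an (index, distance) pair, ordered from nearest to farthest.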


Now, predict_classification classifies a test row based on its nearest neighbors. The function looks up the class of each neighbor and returns the class that occurs most often among them as the prediction for the test row.

Below, the function is executed and predicts that Xtest[12] belongs to class 2.

In [9]:
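def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    # Each neighbor is an (index, distance) tuple; keep the train-row indices.
    output_values = [row[0] for row in neighbors]
    # Look up the class of each neighbor (uses the global ytrain from the split).
    output = [ytrain[ov] for ov in output_values]
    # Majority vote: the class that occurs most often wins.
    prediction = max(set(output), key=output.count)
    return prediction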

In [36]:
predict_classification(Xtrain, Xtest[12], 3)


2
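The majority vote itself can be sketched in isolation with `collections.Counter`. The labels below are made up, and `majority_vote` is a hypothetical helper. One difference worth noting: when two classes tie, `Counter.most_common` returns the one encountered first, so if the labels are ordered by distance the tie goes to the closer neighbor, whereas `max(set(output), key=output.count)` depends on the iteration order of the set.

```python
from collections import Counter

def majority_vote(labels):
    # most_common(1) returns [(label, count)] for the most frequent label;
    # ties go to the label that appears first in `labels` (Python 3.7+).
    return Counter(labels).most_common(1)[0][0]

print(majority_vote([2, 1, 2]))  # 2
print(majority_vote([1, 2]))     # 1 (tie, first label wins)
```

Choosing an odd k reduces ties but cannot rule them out when there are three classes, as in Iris.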

Now we will make predictions for every row in the test set and then calculate the accuracy.

In [40]:
import numpy as np
pred_class = []

for i in Xtest:
    clas = predict_classification(Xtrain, i, 3)
    pred_class.append(clas)

pred_class = np.asarray(pred_class)

#[print(x) for x in pred_class]

print(pred_class)

[2 2 0 1 0 0 0 1 1 2 2 1 2 0 0 2 2 0 0 2 2 0 2 1 2 1 2 0 2 0 1 1 2 0 2 2 1
 1]


In [26]:
acc = np.mean(pred_class == ytest)
print('accuracy = ', acc*100)

accuracy =  97.36842105263158


With k = 3, this KNN model reaches an accuracy of about 97.4% on the test set (37 of the 38 test rows are classified correctly), which is a good accuracy rate.
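As a sanity check, the result can be compared against scikit-learn's own `KNeighborsClassifier` on the same split. This is a sketch under the assumption that the library's default settings (Minkowski metric with p=2, which is the Euclidean distance, and uniform weights) match the hand-written version. The score is not guaranteed to be identical, since ties may be handled differently.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    iris.data, iris.target, random_state=2261
)

# k = 3 to mirror the hand-written model.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(Xtrain, ytrain)

sk_pred = knn.predict(Xtest)
sk_acc = accuracy_score(ytest, sk_pred)
print('sklearn accuracy = ', sk_acc * 100)
```

If the two scores differ noticeably, the distance function or the voting step of the hand-written version is the first place to look.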