## Implementation of kNN Algorithm using Python

- Handling the data
- Calculate the distance
- Find the k nearest points
- Predict the class
- Check the accuracy

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
import os
import operator
import random

### Handling the data

In [3]:
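# Walk three directory levels up from the current working directory.
path = os.getcwd()
for i in range(3):
    path = os.path.dirname(path)
# Load the Iris dataset and order the rows by sepal width.
data = pd.read_csv( path + '/Datasets/IRIS.csv')
data.sort_values(by = 'sepal_width', inplace = True)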

In [30]:
def train_test_split(dataset):
	training_data = dataset.iloc[:80].reset_index(drop = True)
	testing_data = dataset.iloc[80:].reset_index(drop = True)
	trainingSet = []
	test_classes = []
	test_data = []
	for index, rows in training_data.iterrows():
		my_list = [rows.sepal_length, rows.sepal_width, rows.petal_length, rows.petal_width, rows.species]
		trainingSet.append(my_list)

	for index, rows in testing_data.iterrows():
		my_list = [rows.sepal_length, rows.sepal_width, rows.petal_length, rows.petal_width]
		test_classes.append(rows.species)
		test_data.append(my_list)
	return trainingSet,test_data,test_classes

In [31]:
training_data, testing_data, test_classes =  train_test_split(data)
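
The split above takes the first 80 rows after sorting by `sepal_width`, so the training and test sets cover different ranges of that feature. As one possible alternative, here is a minimal sketch of a shuffled split on plain Python lists. The name `shuffled_split`, its `seed` parameter, and the toy rows are assumptions made for illustration and are not part of the notebook.

```python
import random

def shuffled_split(rows, n_train, seed=0):
    """Shuffle a copy of rows, then split it into training and test parts."""
    shuffled = list(rows)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible order
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical toy rows: one feature value plus a label.
toy_rows = [[float(i), 'label'] for i in range(10)]
train, test = shuffled_split(toy_rows, n_train=8)
print(len(train), len(test))
```

A fixed seed keeps the split reproducible between runs; changing it gives a different but equally valid split.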

In [32]:
training_data

[[5.0, 2.0, 3.5, 1.0, 'Iris-versicolor'],
 [6.0, 2.2, 4.0, 1.0, 'Iris-versicolor'],
 [6.0, 2.2, 5.0, 1.5, 'Iris-virginica'],
 [6.2, 2.2, 4.5, 1.5, 'Iris-versicolor'],
 [4.5, 2.3, 1.3, 0.3, 'Iris-setosa'],
 [5.5, 2.3, 4.0, 1.3, 'Iris-versicolor'],
 [5.0, 2.3, 3.3, 1.0, 'Iris-versicolor'],
 [6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'],
 [5.5, 2.4, 3.7, 1.0, 'Iris-versicolor'],
 [5.5, 2.4, 3.8, 1.1, 'Iris-versicolor'],
 [4.9, 2.4, 3.3, 1.0, 'Iris-versicolor'],
 [6.3, 2.5, 4.9, 1.5, 'Iris-versicolor'],
 [6.3, 2.5, 5.0, 1.9, 'Iris-virginica'],
 [5.1, 2.5, 3.0, 1.1, 'Iris-versicolor'],
 [5.7, 2.5, 5.0, 2.0, 'Iris-virginica'],
 [6.7, 2.5, 5.8, 1.8, 'Iris-virginica'],
 [5.6, 2.5, 3.9, 1.1, 'Iris-versicolor'],
 [5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'],
 [4.9, 2.5, 4.5, 1.7, 'Iris-virginica'],
 [5.8, 2.6, 4.0, 1.2, 'Iris-versicolor'],
 [5.7, 2.6, 3.5, 1.0, 'Iris-versicolor'],
 [5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'],
 [7.7, 2.6, 6.9, 2.3, 'Iris-virginica'],
 [6.1, 2.6, 5.6, 1.4, 'Iris-virginica'],
 [5

In [33]:
testing_data

[[4.9, 3.0, 1.4, 0.2],
 [4.8, 3.0, 1.4, 0.1],
 [4.8, 3.0, 1.4, 0.3],
 [4.9, 3.1, 1.5, 0.1],
 [6.4, 3.1, 5.5, 1.8],
 [6.7, 3.1, 5.6, 2.4],
 [6.9, 3.1, 5.1, 2.3],
 [4.8, 3.1, 1.6, 0.2],
 [4.9, 3.1, 1.5, 0.1],
 [6.7, 3.1, 4.7, 1.5],
 [6.9, 3.1, 5.4, 2.1],
 [6.9, 3.1, 4.9, 1.5],
 [6.7, 3.1, 4.4, 1.4],
 [4.6, 3.1, 1.5, 0.2],
 [4.9, 3.1, 1.5, 0.1],
 [4.7, 3.2, 1.6, 0.2],
 [6.5, 3.2, 5.1, 2.0],
 [7.2, 3.2, 6.0, 1.8],
 [6.4, 3.2, 5.3, 2.3],
 [5.9, 3.2, 4.8, 1.8],
 [6.9, 3.2, 5.7, 2.3],
 [6.4, 3.2, 4.5, 1.5],
 [4.4, 3.2, 1.3, 0.2],
 [6.8, 3.2, 5.9, 2.3],
 [7.0, 3.2, 4.7, 1.4],
 [5.0, 3.2, 1.2, 0.2],
 [4.6, 3.2, 1.4, 0.2],
 [4.7, 3.2, 1.3, 0.2],
 [6.3, 3.3, 4.7, 1.6],
 [5.1, 3.3, 1.7, 0.5],
 [5.0, 3.3, 1.4, 0.2],
 [6.3, 3.3, 6.0, 2.5],
 [6.7, 3.3, 5.7, 2.1],
 [6.7, 3.3, 5.7, 2.5],
 [5.0, 3.4, 1.5, 0.2],
 [4.6, 3.4, 1.4, 0.3],
 [6.3, 3.4, 5.6, 2.4],
 [4.8, 3.4, 1.6, 0.2],
 [4.8, 3.4, 1.9, 0.2],
 [5.2, 3.4, 1.4, 0.2],
 [6.0, 3.4, 4.5, 1.6],
 [5.1, 3.4, 1.5, 0.2],
 [5.4, 3.4, 1.5, 0.4],
 [5.4, 3.4,

### Calculate the distance

To make a prediction, you need the distance between the new point and every existing point, because the prediction is based on the k closest points.

We will use the Euclidean distance here. It is defined as the square root of the sum of the squared differences between two arrays of numbers.
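
Written out for two feature vectors $p$ and $q$ with $n$ components each, the definition reads as follows. The symbols $p$, $q$, and $n$ are chosen here for illustration and do not appear in the code.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

For the Iris data, $n = 4$, one component per measured feature.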

Only the first 4 attributes (features) enter the distance calculation, because the last attribute is the class label. One approach is to limit the Euclidean distance to a fixed length, thereby ignoring the final dimension.

In [34]:
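def euclidian_distance(data1, data2, length):
    # Sum the squared differences over the first `length` features.
    distance = 0
    for i in range(length):
        distance += ((data1[i] - data2[i])**2)
    # Take the square root once, after the loop.
    distance = (distance ** (1/2))
    return distance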
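
As a quick check of the definition, the sketch below computes the distance between the first test point and the first training row listed earlier, and compares the result with `math.dist` from the standard library (Python 3.8+). The helper name `euclidean` is hypothetical and used only in this example.

```python
import math

def euclidean(a, b, length):
    """Euclidean distance over the first `length` components of a and b."""
    squared_sum = 0.0
    for i in range(length):
        squared_sum += (a[i] - b[i]) ** 2
    # The square root is taken once, after all squared differences are summed.
    return math.sqrt(squared_sum)

test_point = [4.9, 3.0, 1.4, 0.2]
train_row = [5.0, 2.0, 3.5, 1.0, 'Iris-versicolor']

# Squared differences: 0.01 + 1.0 + 4.41 + 0.64 = 6.06
d = euclidean(test_point, train_row, 4)
print(round(d, 2))

# Cross-check against the standard library, dropping the class label.
print(math.isclose(d, math.dist(test_point, train_row[:4])))
```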

### Find the k nearest points

Now that you can calculate the distance to each point, you can use it to collect the k most similar points/instances for a given test instance.

This is a straightforward process: calculate the distance to every training instance and select the k instances with the smallest Euclidean distance.

Let’s create a getKNeighbors function that returns the k most similar neighbors from the training set for a given test instance.

In [35]:
def getKNeighbors(trainingSet, testInstance, k):
    distances = []
    # Only the features are compared; the class label ending each training row is ignored.
    length = len(testInstance)
    for i in range(len(trainingSet)):
        dist = euclidian_distance(testInstance, trainingSet[i], length)
        distances.append((trainingSet[i], dist))
    # Sort once, after all distances are known.
    distances.sort(key = operator.itemgetter(1))
    neighbors = []
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors
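
For comparison, the same selection can be sketched with `heapq.nsmallest`, which picks the k smallest items without sorting the whole list. This is an alternative formulation and not the notebook's method. The function name `k_nearest` and the toy data are assumptions for illustration, and `math.dist` requires Python 3.8+.

```python
import heapq
import math

def k_nearest(training_set, test_instance, k):
    """Return the k training rows closest to test_instance.

    Each training row is assumed to end with its class label, so only the
    first len(test_instance) values enter the distance.
    """
    n_features = len(test_instance)
    return heapq.nsmallest(
        k,
        training_set,
        key=lambda row: math.dist(test_instance, row[:n_features]),
    )

# Hypothetical toy data: two features followed by a label.
toy_training = [
    [1.0, 1.0, 'a'],
    [1.2, 0.9, 'a'],
    [5.0, 5.0, 'b'],
    [5.1, 4.8, 'b'],
]
neighbors = k_nearest(toy_training, [1.1, 1.0], k=2)
print(neighbors)
```

The rows come back ordered from nearest to farthest, so the two 'a' rows are returned here.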

### Predict the class

Now that you have the k nearest points/neighbors for the given test instance, the next task is to predict a response based on those neighbors.

You can do this by letting each neighbor vote for its class attribute and taking the majority vote as the prediction.

Let’s create a predict function that returns the majority-voted response from a number of neighbors.

In [36]:
def predict(neighbors):
	classVotes = {}
	for i in range(len(neighbors)):
		response = neighbors[i][-1]
		if response in classVotes:
			classVotes[response] +=1
		else:
			classVotes[response] = 1
	sortedVotes = sorted(classVotes.items(), key = operator.itemgetter(1), reverse = True)
	return sortedVotes[0][0]
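
The voting step can also be sketched with `collections.Counter`, which counts the labels directly. The function name `majority_vote` is hypothetical. The three neighbor rows are taken from the training data listed earlier, grouped here only for illustration.

```python
from collections import Counter

def majority_vote(neighbors):
    """Return the most common class label among the neighbors.

    The label is assumed to be the last element of each row.
    """
    votes = Counter(row[-1] for row in neighbors)
    # most_common(1) returns a list holding one (label, count) pair.
    label, _count = votes.most_common(1)[0]
    return label

toy_neighbors = [
    [6.0, 2.2, 5.0, 1.5, 'Iris-virginica'],
    [6.2, 2.2, 4.5, 1.5, 'Iris-versicolor'],
    [6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'],
]
print(majority_vote(toy_neighbors))  # two of the three votes are Iris-versicolor
```

When two labels tie, `most_common` keeps the label that was encountered first, so the order of the neighbor list decides the outcome.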

### Check the accuracy

Now that all of the pieces of the kNN algorithm are in place, let’s check how accurate our predictions are!

An easy way to evaluate the accuracy of the model is to calculate a ratio of the total correct predictions out of all predictions made.

Let’s create a getAccuracy function which sums the total correct predictions and returns the accuracy as a percentage of correct classifications.

In [37]:
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        # Compare labels by value (==); `is` would test object identity.
        if testSet[i] == predictions[i]:
            correct += 1
    accuracy = correct/float(len(testSet)) * 100.0
    return accuracy
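
To make the ratio concrete, here is a small self-contained sketch on hand-written labels. The function name `accuracy_percent` and the label lists are made up for illustration. Labels are compared by value with `==`.

```python
def accuracy_percent(actual, predicted):
    """Percentage of positions where the predicted label equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

# Hypothetical labels: the third prediction is wrong.
actual = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa']
predicted = ['Iris-setosa', 'Iris-virginica', 'Iris-virginica', 'Iris-setosa']
print(accuracy_percent(actual, predicted))  # 3 of 4 correct -> 75.0
```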

In [38]:
predictions = []
for i in range(len(testing_data)):
	neighbors = getKNeighbors(training_data, testing_data[i], k=3)
	predictions.append(predict(neighbors))
accuracy = getAccuracy(test_classes, predictions)
print("\nAccuracy = ", accuracy, "%")


Accuracy =  98.57142857142858 %


That was all about implementing the kNN algorithm using Python.