# kNN classifier example implemented in Python

**Classification**

Classification is the process of predicting the class of given data points.

* Input: A training set of tuples, each consisting of a list of descriptors and a class label attribute.
* Output: A model (classifier) that assigns a class to each tuple based on the values of its descriptors.

**kNN algorithm**

The kNN algorithm consists of three steps:
1. Calculating the distance between a test point and every point in the training set. The most common choice is the Euclidean distance.
2. Finding the k nearest neighbours according to that distance.
3. Classifying the test point. The predicted class is the most frequent class among its k nearest training points.
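
For reference, the Euclidean distance used in step 1 can be written as follows, where $p$ and $q$ are two points with $n$ coordinates each (the symbols are chosen here for illustration and do not appear in the code):

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$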

Let's start by importing the necessary libraries.

In [3]:
import pandas as pd
from math import sqrt

We define the class `Point`. An object of `class Point` has two attributes: coordinates and label.

The `distance` method calculates the Euclidean distance between this point and another point. The model later applies it between each test point and every point in the training set.

In [4]:
class Point:
    
    def __init__(self, coordinates, label):
        self.coordinates = coordinates
        self.label = label
    
    def get_coordinates(self):
        return self.coordinates
    
    def get_label(self):
        return self.label
    
    def distance(self, point):
        return sqrt(sum([(x - y) ** 2 for x, y in zip(self.coordinates, point.coordinates)]))
    
    def __str__(self):
        return str(self.coordinates) + " " + str(self.label)
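
As a quick sanity check of the distance formula, the sketch below evaluates the same expression on two made-up points that form a 3-4-5 right triangle and compares the result with `math.dist` from the standard library (available from Python 3.8). The helper name `euclidean` and the sample coordinates are assumptions for illustration only.

```python
from math import dist, isclose, sqrt

def euclidean(p, q):
    # Same expression as Point.distance, applied to plain coordinate sequences.
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Hypothetical sample points: legs of length 3 and 4, so the hypotenuse is 5.
a = (0.0, 0.0)
b = (3.0, 4.0)

manual = euclidean(a, b)   # hand-written formula
builtin = dist(a, b)       # standard-library equivalent

print(manual, builtin)
print(isclose(manual, builtin))
```

One difference worth knowing: `zip` silently truncates to the shorter sequence, whereas `math.dist` raises a `ValueError` when the two points have different dimensions.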

In the `__init__` method we set the value of the hyperparameter k (the number of nearest neighbours). Choosing the value of k is not part of the learning process; here the default is `k=7`.

The kNN algorithm does not build an explicit model from the training set. The `train` method only stores the training sample for use in the prediction stage.

The `classify` method builds, for each test point, a list of its k nearest neighbours. It then collects the neighbours' labels and chooses the most frequent class label among them.

In [10]:
class Model:
    
    def __init__(self, k=7):
        self.training_set = None
        self.k = k
    
    def train(self, training_set):
        self.training_set = training_set
    
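    def classify(self, test_set):
        
        res = []
        
        for point in test_set:
            # Sort the training points by distance to the test point and keep the k closest.
            neighbours = sorted(self.training_set, key = lambda x: x.distance(point))[0:self.k]
            labels = [x.label for x in neighbours]
            # Majority vote: the most frequent label among the k neighbours.
            prediction = max(labels, key = labels.count)
            res.append((point, prediction))
        return res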
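
The voting step deserves a closer look when two classes are tied. The sketch below uses a made-up list of labels ordered from nearest to farthest and suggests that `max(labels, key=labels.count)` returns the tied label that appears first in the list, that is, the one belonging to the nearer neighbour. `collections.Counter` is shown as a possible alternative that counts in a single pass; both function names are hypothetical.

```python
from collections import Counter

def vote_with_max(labels):
    # Approach used in Model.classify: list.count is called once per element.
    return max(labels, key=labels.count)

def vote_with_counter(labels):
    # Single-pass count; labels with equal counts keep first-encountered order.
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labels of the 5 nearest neighbours, nearest first.
# "a" and "b" are tied with two votes each.
labels = ["b", "a", "a", "b", "c"]

print(vote_with_max(labels))
print(vote_with_counter(labels))
```

Both functions pick the same label here, so swapping one for the other should not change the predictions, only the amount of counting work per test point.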


Finally, we construct the confusion matrix to check the performance of the algorithm. Each row of the matrix represents the instances of an actual class and each column represents the instances of a predicted class. The label of each row is printed at the end of that row.

In [11]:
class ConfusionMatrix:
    
    def __init__(self, predictions):
        self.predictions = predictions
        
    def __str__(self):
        
        # Rows are the actual labels; columns are the predicted labels.
        rows = sorted(set([x[0].label for x in self.predictions]))
        columns = sorted(set([x[1] for x in self.predictions]))
        column_width = 10
        
        res = " ".join([str(x).center(column_width) for x in columns]) + "\n"
        for row in rows:
            for column in columns:
                res += str(sum([1 for x in self.predictions if x[0].label == row and x[1] == column])).center(column_width)
                res += " "
            res += str(row).center(column_width) + "\n"
        return res
            

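Let's test the model. First, we load two data sets: the training set and the test set. In each CSV file the last column holds the class label and the preceding columns hold the coordinates.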

In [12]:
training_set = [Point(x[0:-1], x[-1]) for x in pd.read_csv("data/training_set.csv").values.tolist()]
test_set     = [Point(x[0:-1], x[-1]) for x in pd.read_csv("data/test_set.csv").values.tolist()]

We create a `Model`, pass it the training set, and call its `classify` method on the test set.

In [13]:
model = Model()
model.train(training_set)
predictions = model.classify(test_set)

Let's print the confusion matrix to check the accuracy of our model.

In [24]:
print(ConfusionMatrix(predictions))

  label1     label2     label3  
    6          0          0        label1  
    0          3          1        label2  
    0          0          4        label3  
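
The confusion matrix can be condensed into a single accuracy figure: the number of correct predictions (the diagonal) divided by the number of all predictions. The sketch below assumes the predictions are given as plain `(actual, predicted)` label pairs and rebuilds them from the counts in the printed matrix; the function name `accuracy` is hypothetical. With the notebook's `predictions` list, the pairs would be `(point.label, predicted)`.

```python
def accuracy(pairs):
    # Fraction of pairs whose predicted label equals the actual label.
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    return correct / len(pairs)

# Label pairs reconstructed from the counts in the printed matrix.
pairs = (
    [("label1", "label1")] * 6
    + [("label2", "label2")] * 3
    + [("label2", "label3")] * 1
    + [("label3", "label3")] * 4
)

print(round(accuracy(pairs), 3))
```

Here 13 of the 14 test points are classified correctly, which gives an accuracy of about 0.929.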

