# KNN Classifier

Given a set of categories {c1, c2, ..., cn}, also called classes, and a learning set LS of pre-labelled instances, KNN (k-nearest neighbours) assigns one of those categories to a new, unseen instance.

It does so by finding the k instances of LS that are closest to the new instance according to a distance function, and returning the most common class among those k neighbours.

## Cost: O(n)

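To classify one instance, KNN has to compute the distance to every example in the training set, so each prediction needs O(n) distance computations, where n is the number of training examples. The implementation below also sorts all n distances, which adds O(n log n) work per prediction.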

## Implementation

In [1]:
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_data = iris.data
iris_labels = iris.target

In [2]:
iris_data[2]

array([4.7, 3.2, 1.3, 0.2])

### Distance function

Here the distance is the Euclidean distance, but any distance suited to the data can be used, for example the Levenshtein distance for strings.

In [3]:
def euclidean_dist(l1, l2):
    return np.linalg.norm(l1 - l2)

In [4]:
euclidean_dist(iris_data[2], iris_data[20])

0.8306623862918076
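Any function that takes two instances and returns a non-negative number can play the same role. As a minimal sketch of the Levenshtein alternative mentioned above, assuming the instances are strings: `levenshtein_dist` is a name chosen here for illustration and is not part of the original notebook.

```python
def levenshtein_dist(s1, s2):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    # previous[j] is the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            substitution = previous[j - 1] + (c1 != c2)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein_dist("kitten", "sitting"))  # 3
```

Since it has the same two-argument shape as `euclidean_dist`, it could be passed wherever a `distance` argument is expected, provided the training set holds strings.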

In [5]:
def get_neighbors(training_set, labels, test_instance, k, distance):
    distances = []
    for index in range(len(training_set)):
        dist = distance(test_instance, training_set[index])
        distances.append((training_set[index], dist, labels[index]))
    distances.sort(key=lambda x: x[1])
    neighbors = distances[:k]
    return neighbors

In [29]:
neighbors = get_neighbors(iris_data, iris_labels, [1, 0, 5, 2], k = 5, distance = euclidean_dist)
neighbors

[(array([4.9, 2.5, 4.5, 1.7]), 4.669047011971501, 2),
 (array([5. , 2. , 3.5, 1. ]), 4.8218253804964775, 1),
 (array([4.9, 2.4, 3.3, 1. ]), 4.985980344927164, 1),
 (array([5. , 2.3, 3.3, 1. ]), 5.017967716117751, 1),
 (array([5.2, 2.7, 3.9, 1.4]), 5.1478150704935, 1)]
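The loop above computes one distance at a time and then sorts all of them. A possible variant, sketched here under the assumption that the data is numeric and the distance is Euclidean, computes all distances in one NumPy call and uses `np.argpartition` to pick the k smallest without a full sort. The function name `get_neighbors_vectorized` and the toy data are illustrative only.

```python
import numpy as np


def get_neighbors_vectorized(training_set, labels, test_instance, k):
    """Return (instance, distance, label) tuples for the k nearest neighbors."""
    training_set = np.asarray(training_set, dtype=float)
    test_instance = np.asarray(test_instance, dtype=float)
    # Euclidean distance from the test instance to every training row at once.
    dists = np.linalg.norm(training_set - test_instance, axis=1)
    # argpartition moves the k smallest distances to the front without a full sort.
    nearest = np.argpartition(dists, k - 1)[:k]
    # Sort only those k candidates so the result is ordered by distance.
    nearest = nearest[np.argsort(dists[nearest])]
    return [(training_set[i], dists[i], labels[i]) for i in nearest]


# Tiny made-up data set, used only to exercise the function.
toy_data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
toy_labels = np.array([0, 0, 1, 1])
for instance, dist, label in get_neighbors_vectorized(toy_data, toy_labels, [0.9, 0.1], k=2):
    print(instance, round(float(dist), 3), label)
```

The trade-off is that the distance is hard-coded, so this version loses the pluggable `distance` argument of `get_neighbors`.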

### Voting mode

Voting decides how the labels of the neighbours are turned into a result. In the first mode, we simply return the majority class.

In [30]:
def vote(neighbors):
    # Count how many neighbors belong to each class (the label is at index 2).
    votes = {}
    for neighbor in neighbors:
        label = neighbor[2]
        votes[label] = votes.get(label, 0) + 1
    # Return the class with the most votes; on a tie, the class seen first wins.
    return max(votes, key=votes.get)

In [31]:
vote(neighbors)

1

In the second voting mode, we return the proportion of neighbours that belong to each class, which can be read as the probability of each class.

In [32]:
def vote_prob(neighbors):
    votes = {}
    for i in neighbors:
        if votes.get(i[2], False):
            votes[i[2]] += 1
        else:
            votes[i[2]] = 1
    
    probs = {}
    for key, value in votes.items():
        probs[key] = round(value / len(neighbors), 2)
        
    return probs

In [35]:
vote_prob(neighbors)

{2: 0.2, 1: 0.8}

### Results

In [34]:
def knn_classifier(training_set, labels, test_instance, k, distance, vote):
    # Use the distance and voting functions passed in by the caller.
    neighbors = get_neighbors(training_set, labels, test_instance, k, distance=distance)
    return vote(neighbors)

In [26]:
knn_classifier(iris_data, iris_labels, [0, 0, 1, 3], k = 5, distance = euclidean_dist, vote = vote_prob)

{0: 1.0}

## Notes:

* The value of k should be odd when we use the majority vote, so that ties are less likely; with two classes, an odd k rules them out entirely.
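
To illustrate the point about ties, here is a small sketch that counts votes with `collections.Counter` on made-up neighbour labels. `count_votes` is a hypothetical helper, not part of the notebook above; it reports the winning label and whether the top vote count is shared.

```python
from collections import Counter


def count_votes(neighbor_labels):
    """Return (winning label, True if the top vote count is shared)."""
    ranked = Counter(neighbor_labels).most_common()
    tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    return ranked[0][0], tied


# Made-up neighbor labels, two classes.
print(count_votes([0, 1, 0, 1]))     # k = 4: (0, True), the winner depends on ordering
print(count_votes([0, 1, 0, 1, 1]))  # k = 5: (1, False)
# With three classes, an odd k can still produce a tie.
print(count_votes([0, 0, 1, 1, 2]))  # k = 5: (0, True)
```

When a tie is reported, one option is to fall back to the label of the single closest neighbour, although how to break ties is a design choice rather than a fixed part of the algorithm.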
