# Iris and KNN

## An attempt to build a KNN classifier for the Iris dataset

This notebook builds a K-nearest neighbours (KNN) classifier from scratch and then applies it to the Iris dataset.

General steps:

* Read in and standardise the data; there are 4 useful feature columns
* Split into a training and a test set, roughly 60/40
* For each datum in the test set, compute the Euclidean distance to every datum in the training set and find the K closest
* Predict the class of each test datum by a poll of its K nearest neighbours, then compare the predictions with the true classes
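
The last two steps can be sketched compactly with NumPy. This is a minimal illustration on made-up points, not the implementation used in the cells below; the function name `knn_predict` and the tie-breaking rule (among tied labels, prefer the one whose member is nearest) are assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(dists)[:k]
    labels = [train_y[i] for i in nearest]
    counts = Counter(labels)
    top = max(counts.values())
    # Assumed tie-break: among labels sharing the top count,
    # return the one that appears first, i.e. has the nearest neighbour
    for label in labels:
        if counts[label] == top:
            return label

# Toy data: two well-separated clusters (hypothetical values)
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, np.array([0.05, 0.05]), k=3))  # a
```

With `k=6` on this toy data the vote is tied 3 to 3, and the assumed rule resolves it in favour of the cluster containing the nearest point.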

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

In [2]:
iris = pd.read_csv('Iris.csv')

# Standardise each feature column in turn (zero mean, unit standard deviation)
for col in iris.columns[1:5]:
    ave = iris[col].mean()
    std = iris[col].std()
    iris[col] = iris[col].map(lambda x: (x - ave)/std)
    
iris.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | -0.897674 | 1.028611 | -1.336794 | -1.308593 | Iris-setosa |
| 1 | 2 | -1.1392 | -0.12454 | -1.336794 | -1.308593 | Iris-setosa |
| 2 | 3 | -1.380727 | 0.33672 | -1.39347 | -1.308593 | Iris-setosa |
| 3 | 4 | -1.50149 | 0.10609 | -1.280118 | -1.308593 | Iris-setosa |
| 4 | 5 | -1.018437 | 1.259242 | -1.336794 | -1.308593 | Iris-setosa |
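
The per-column loop above can also be written as a single vectorised expression. The sketch below uses a small made-up frame rather than `Iris.csv`, so the values are hypothetical; it relies on pandas broadcasting column means and standard deviations across rows.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the measurement columns
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.3, 5.8],
    "PetalLengthCm": [1.4, 1.4, 6.0, 5.1],
})

# Subtract each column's mean and divide by its sample standard deviation
# (ddof=1, the pandas default), for all columns at once
standardised = (df - df.mean()) / df.std()

# Each column should now have mean ~0 and standard deviation 1
print(standardised.mean())
print(standardised.std())
```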


In [3]:
iris.describe()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | -4.513982e-16 | 3.714621e-16 | 2.94903e-16 | 1.37899e-16 |
| std | 43.445368 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | 1.0 | -1.86378 | -2.430844 | -1.563497 | -1.439627 |
| 25% | 38.25 | -0.8976739 | -0.585801 | -1.223442 | -1.177559 |
| 50% | 75.5 | -0.05233076 | -0.1245404 | 0.3351431 | 0.1327811 |
| 75% | 112.75 | 0.672249 | 0.5673506 | 0.7602119 | 0.7879511 |
| max | 150.0 | 2.483699 | 3.104284 | 1.780377 | 1.705189 |


In [4]:
# Takes 2 numpy arrays and returns the Euclidean distance between them
def eucl_dist(first, second):
    return np.sqrt(np.sum((first - second) ** 2))
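
For comparison, the same distance can be obtained from `np.linalg.norm` applied to the difference of the two arrays. A minimal sketch; the name `eucl_dist_norm` is hypothetical and not part of the notebook.

```python
import numpy as np

def eucl_dist_norm(first, second):
    # The default norm of a 1-D array is the L2 (Euclidean) norm
    return np.linalg.norm(first - second)

a = np.array([0.0, 0.0, 0.0, 0.0])
b = np.array([1.0, 2.0, 2.0, 0.0])
print(eucl_dist_norm(a, b))  # 3.0, since sqrt(1 + 4 + 4) = 3
```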

In [5]:
# Takes a list of tuples in the form (value, class) and returns the k tuples with the smallest values
def find_min_k(searchlist, k=5):
    
    # initialise the list with the first k entries
    findk = searchlist[:k]
    
    # Sort so the largest is in last position - easy to find then
    findk = sorted(findk, key = lambda x: x[0])
    
    # Going through the entries after the first k: if a value is smaller than
    # the current largest (in last place), replace that largest entry.
    # Then sort again (fast since the list is small) and repeat
    for case in searchlist[k:]:
        if case[0] < findk[k-1][0]:
            findk[k-1] = case
            findk = sorted(findk, key = lambda x: x[0])
    
    return findk
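
The same selection is available from the standard library through `heapq.nsmallest`, which avoids re-sorting after every replacement. This is an alternative sketch rather than the notebook's method; `find_min_k_heap` is a hypothetical name.

```python
import heapq

def find_min_k_heap(searchlist, k=5):
    # Returns the k tuples with the smallest first element,
    # sorted in ascending order of that element
    return heapq.nsmallest(k, searchlist, key=lambda pair: pair[0])

pairs = [(0.9, "b"), (0.1, "a"), (0.5, "a"), (0.3, "b"), (0.7, "a")]
print(find_min_k_heap(pairs, k=3))  # [(0.1, 'a'), (0.3, 'b'), (0.5, 'a')]
```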
        
        

In [6]:
# Split into a training and a test set
# Done here by drawing a random number per row and splitting based on that
# Drawback: the resulting split only approximates the requested ratio
iris_train = []
iris_test = []
ratio = .6

for num, line in iris.iterrows():
    if np.random.rand() < ratio:
        iris_train.append(line)
    else:
        iris_test.append(line)

# Convert into DataFrames for convenience; not strictly necessary
iris_train = pd.DataFrame(iris_train)
iris_test = pd.DataFrame(iris_test)

# Reset indices
iris_train.index = range(len(iris_train))
iris_test.index = range(len(iris_test))
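
If an exact ratio matters, one option is to shuffle the row positions once and cut at a fixed point. The sketch below is an assumption-laden alternative, not the notebook's approach: `split_exact`, its `seed` parameter, and the stand-in frame `demo` are all hypothetical.

```python
import numpy as np
import pandas as pd

def split_exact(df, ratio=0.6, seed=0):
    # Shuffle row positions, then cut so the training set
    # always has round(ratio * len(df)) rows
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    cut = int(round(ratio * len(df)))
    train = df.iloc[order[:cut]].reset_index(drop=True)
    test = df.iloc[order[cut:]].reset_index(drop=True)
    return train, test

# Stand-in frame with 150 rows, the size of the Iris data
demo = pd.DataFrame({"x": np.arange(150)})
train, test = split_exact(demo)
print(len(train), len(test))  # 90 60
```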

In [7]:
# empty list to hold prediction results at end
testresults = []

# Set K here
k=5

# Go through test cases one by one
for index, flower in iris_test.iterrows():
    
    distance_list = []
    
    # Easy to use numpy arrays for the distance calcs
    tester = np.array(flower.iloc[1:5], dtype=float)
    
    # Go through each training case and append its distance and class to distance_list
    for num, train_flower in iris_train.iterrows():
        trainer = np.array(train_flower.iloc[1:5], dtype=float)
        dist = eucl_dist(tester, trainer)
        distance_list.append((dist, train_flower['Species']))
    
    # Go through distance_list to find k smallest distances
    klist = find_min_k(distance_list, k)
    
    # Count the class labels only (not the (distance, class) tuples) and take the most common
    # WARNING: if multiple labels are tied for most common, simply chooses one. To improve on.
    prediction = Counter(label for _, label in klist).most_common(1)[0][0]
    testresults.append((flower['Species'], prediction))
    
    # Print out results, along with the k nearest neighbours if not correct
    print(flower['Species'] == prediction, flower['Species'], prediction)
    if flower['Species'] != prediction:
        print(klist)

True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-setosa Iris-setosa
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-versicolor Iris-versicolor
True Iris-ver