# Lab #4

## Introduction

In this laboratory, you will build your own version of the K-Nearest Neighbors algorithm (a.k.a. KNN) using the NumPy library.

For this lab, you will need the Iris dataset. You can download it from:
- https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data


## Exercises

**1.**  Load the Iris dataset

In [1]:
import numpy as np

In [2]:
datalist = []
with open("iris.data","r") as iris:
    for row in iris:
        row = row.strip()
        if row:  # skip blank lines (e.g. an empty line at the end of the file)
            datalist.append(row.split(","))
"""
Info: you can use the pandas.read_csv method to easily parse and store the 
dataset into a pandas DataFrame. It applies for both locally stored data 
and remote files:

df = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
header=None,
)
"""
data = np.array(datalist)
data[:3]

array([['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
       ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa'],
       ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']], dtype='<U15')

**2.** Let’s identify a portion of our data for which we will try to guess the species. Randomly select 20% of the records and store the first four columns (i.e. the features representing each flower) in a two-dimensional NumPy array of shape N × C, which you can call X_test. For the same records, store the last column (i.e. the one with the species values) in another array, namely y_test. This is the testing data: it will be used to check that your KNN implementation works correctly and to measure its accuracy.

In [3]:
testlenght = int(len(data)/5)
np.random.shuffle(data)
g_test = data[:testlenght]
X_test = g_test[:,:4].astype('float64')
y_test = g_test[:,4]

**3.** Store the remaining 80% of the records in the same way. In this case, use the names X_train and y_train for the arrays. This is the data that your model will use as ground-truth knowledge (i.e. the training data).

In [4]:
g_train = data[testlenght:]
X_train = g_train[:,:4].astype('float64')
y_train = g_train[:,4]
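
As an alternative to shuffling `data` in place, the split in steps 2 and 3 can be sketched with a permutation of the row indices. This is only a minimal sketch: the function name `train_test_split_sketch`, its `test_fraction` and `seed` parameters, and the toy array are assumptions made for illustration and are not part of the lab. A seeded generator makes the split reproducible across runs.

```python
import numpy as np

def train_test_split_sketch(data, test_fraction=0.2, seed=0):
    """Split a 2-D string array (4 feature columns + 1 label column) into train and test parts."""
    rng = np.random.default_rng(seed)          # seeded generator, so the split is reproducible
    order = rng.permutation(len(data))         # random ordering of the row indices
    n_test = int(len(data) * test_fraction)    # number of rows reserved for testing
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_test = data[test_idx, :4].astype("float64")
    y_test = data[test_idx, 4]
    X_train = data[train_idx, :4].astype("float64")
    y_train = data[train_idx, 4]
    return X_train, y_train, X_test, y_test

# Tiny made-up array with the same column layout as iris.data (hypothetical values)
toy = np.array([[str(i), "1.0", "2.0", "3.0", f"label-{i % 3}"] for i in range(10)])
X_tr, y_tr, X_te, y_te = train_test_split_sketch(toy)
print(X_tr.shape, X_te.shape)  # (8, 4) (2, 4)
```

Because the original array is left untouched, the same indices could be reused later, for instance to inspect which flowers ended up in the test set.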

**4.** Focus now on the KNN technique. Implement the KNN algorithm and expose it as a Python class. To identify the K closest points, or neighbors, a notion of distance is required. Your implementation must support three different distance definitions.

Given two n-dimensional points p = (p1, p2, . . . , pn) and q = (q1, q2, . . . , qn), you should calculate _euclidean_, _cosine_, and _manhattan_ distances. Write three functions that use NumPy to compute the respective distances.
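
For a single pair of points, the three distances can be sketched as below. The function names are hypothetical, and the cosine distance is taken here as one minus the cosine similarity, which is the convention the solution class in this notebook also follows.

```python
import numpy as np

def euclidean_distance(p, q):
    # square root of the sum of squared coordinate differences
    return np.sqrt(((p - q) ** 2).sum())

def cosine_distance(p, q):
    # 1 - cosine similarity; assumes neither vector is all zeros
    return 1 - (p * q).sum() / (np.sqrt((p ** 2).sum()) * np.sqrt((q ** 2).sum()))

def manhattan_distance(p, q):
    # sum of absolute coordinate differences
    return np.abs(p - q).sum()

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(euclidean_distance(p, q))  # sqrt(2), about 1.414
print(cosine_distance(p, q))     # 1.0, since the vectors are orthogonal
print(manhattan_distance(p, q))  # 2.0
```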

Implement the predict method. The function receives as input a NumPy array with N rows and C columns, corresponding to N flowers. The method assigns one of the three Iris species to each row using the KNN algorithm, and returns them as a NumPy array. For the actual implementation, identify the K neighbors using the parameters k and distance passed to the constructor.
Then, assign the label using a majority voting scheme. If K is even and the vote is tied, assign the label arbitrarily.
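
The majority vote on its own can be sketched with `np.unique`. The helper name `majority_vote` is an assumption for illustration. In this sketch a tie is resolved in favour of the label that sorts first, which is one arbitrary but deterministic choice among many.

```python
import numpy as np

def majority_vote(labels):
    # np.unique returns the distinct labels (sorted) and how often each occurs
    values, counts = np.unique(labels, return_counts=True)
    # argmax picks the first maximum, so ties go to the label that sorts first
    return values[counts.argmax()]

neighbors = np.array(["Iris-setosa", "Iris-virginica", "Iris-virginica"])
print(majority_vote(neighbors))  # Iris-virginica
```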

In [5]:
class KNearestNeighbors:
    def __init__(self, k, distance_metric="euclidean"):
        self.k = k
        self.distance_metric = distance_metric
        
    def fit(self,X,y):
        self.train_x = X
        self.len2 = len(self.train_x) 
        self.train_y = y
    
    def euclidean(self):
        eudist = np.zeros((self.len1,self.len2))
        c = 0
        for x in self.test_x:
            eudist[c] = ( ( (self.train_x - x)**2).sum(axis=1))**0.5
            c +=1
        return eudist
    
    def cosine(self):
        cosinedist = np.zeros((self.len1,self.len2))
        c = 0
        for x in self.test_x:
            cs_n = (self.train_x*x).sum(axis=1)
            cs_d = (((self.train_x**2).sum(axis=1))**.5) * (((x**2).sum())**.5)
            cosinedist[c] = 1 - cs_n/cs_d
            c += 1
        return cosinedist
    
    def manhattan(self):
        manhdist = np.zeros((self.len1,self.len2))
        c = 0
        for x in self.test_x:
            manhdist[c] = abs(self.train_x-x).sum(axis=1)
            c+=1
        return manhdist
            
    def predict(self,X):
        self.test_x = X
        self.len1 = len(self.test_x)
        if self.distance_metric == "euclidean":
            self.dist = self.euclidean()
        elif self.distance_metric == "cosine":
            self.dist = self.cosine()
        elif self.distance_metric == "manhattan":
            self.dist = self.manhattan()
        else:
            raise ValueError(f"Unknown distance metric: {self.distance_metric}")
        self.prediction = []
        for i in range(self.len1):
            # indices of the k training points closest to test point i
            indices = self.dist[i].argsort()[:self.k]
            neighbor_labels = self.train_y[indices].tolist()
            # count the votes per label; on a tie, the label seen first wins
            votes = {label: neighbor_labels.count(label) for label in neighbor_labels}
            self.prediction.append(max(votes, key=votes.get))
        return np.array(self.prediction)

**5.** Use your KNN model to predict the species for each record in X_test and store the predictions in a NumPy array called y_pred.

Check how many Iris species in the array y_pred have been guessed correctly with respect to the ones in y_test. A prediction is correct if y_pred[i] == y_test[i]. The ratio between the number of correct guesses and the total number of guesses is known as accuracy. If all labels are assigned correctly ((y_pred == y_test).all() == True), the accuracy of the model is 100%. Instead, if none of the guessed species corresponds to the real one ((y_pred == y_test).any() == False), the accuracy is 0%.
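
The accuracy defined above is the mean of the element-wise comparison, since `True` counts as 1 and `False` as 0. A minimal sketch on made-up labels follows; the helper name `accuracy` and the example arrays are hypothetical.

```python
import numpy as np

def accuracy(y_pred, y_true):
    # fraction of positions where the predicted label equals the true label
    return (y_pred == y_true).mean()

y_true = np.array(["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"])
y_hat = np.array(["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-setosa"])
print(accuracy(y_hat, y_true))  # 0.75
```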


In [6]:
#euclidean
knn1 = KNearestNeighbors(2,distance_metric="euclidean")
knn1.fit(X_train,y_train)
r1 = knn1.predict(X_test)
#cosine
knn2 = KNearestNeighbors(2,distance_metric="cosine")
knn2.fit(X_train,y_train)
r2 = knn2.predict(X_test)
#manhattan
knn3 = KNearestNeighbors(2,distance_metric="manhattan")
knn3.fit(X_train,y_train)
r3 = knn3.predict(X_test)

In [7]:
# 0% accuracy test
print(((r1 == y_test).any() == False)) #model accuracy is not 0%
print(((r2 == y_test).any() == False)) #model accuracy is not 0%
print(((r3 == y_test).any() == False)) #model accuracy is not 0%
# 100% accuracy test
print(((r1 == y_test).all() == True)) #model accuracy is not 100%
print(((r2 == y_test).all() == True)) #model accuracy is not 100%
print(((r3 == y_test).all() == True)) #model accuracy is not 100%
# accuracy calculation
print("------------------------------")
print(f"Euclidean accuracy: {round((r1==y_test).sum()/len(y_test),3)}")
print(f"Cosine accuracy: {round((r2==y_test).sum()/len(y_test),3)}")
print(f"Manhattan accuracy: {round((r3==y_test).sum()/len(y_test),3)}")

False
False
False
False
False
False
------------------------------
Euclidean accuracy: 0.9
Cosine accuracy: 0.967
Manhattan accuracy: 0.9
