## Problem 1: Classification

In this problem we implement one of the simplest and most popular classification algorithms and explore several metrics that help in choosing the best classification model.

## K-Nearest Neighbours

KNN is one of the simplest machine learning algorithms. Given a database of examples (a training set), it makes predictions on new examples by finding the points nearest to them in the training set.

Nearness can be measured in various ways. One of the most popular measures is the Euclidean distance between the data points.
Other distance measures include the Manhattan distance and the Chebyshev distance. More on this [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html).
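
As a quick illustration of how these measures differ, the sketch below computes all three for the same pair of points using plain NumPy. The function names are illustrative only and are not part of the assignment.

```python
import numpy as np

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.max(np.abs(a - b)))

p, q = [1.0, 2.0], [4.0, 6.0]   # coordinate differences are 3 and 4
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(chebyshev(p, q))  # 4.0
```

The same two points are 5, 7, or 4 units apart depending on the measure, so the choice of distance can change which neighbours count as "nearest".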


KNN is a non-parametric method, meaning that the number of parameters of the model is not fixed but depends on the training data.

**The algorithm** is quite simple. 

We are given a set of data points $\{X_1, X_2, \dots, X_m\}$ with class labels $\{Y_1, Y_2, \dots, Y_m\}$.

Each $Y_i$ belongs to a predefined set of classes. Let's assume that any $Y_i$ can be one of the classes $\{C_1, C_2, \dots, C_c\}$, where the number of classes $c$ need not equal the number of data points $m$.


To classify a new point $X_{l'}$:
1. Compute the distance from $X_{l'}$ to every data point in $\{X_1, X_2, \dots, X_m\}$.
2. Find the K data points closest to $X_{l'}$.
    1. Let $\{X_{a1}, X_{a2}, \dots, X_{ak}\}$ be the K closest points to $X_{l'}$.
3. Out of these K closest points, find the majority class.
    1. That is, find $C_{j}$ such that the frequency of $C_{j}$ in $\{Y_{a1}, Y_{a2}, \dots, Y_{ak}\}$ is maximum.
    2. If multiple $C_{j}$ have equal frequency, choose the class $C_{j}$ with the smallest $j$.
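
The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: it uses the Euclidean distance, integer class labels (so "smallest $j$" means the smallest label), and a hypothetical function name `knn_predict` that is not part of the assignment.

```python
from collections import Counter
import numpy as np

def knn_predict(x_new, X, y, k=3):
    """Classify x_new by majority vote among its k nearest training points.

    Assumes Euclidean distance and integer labels; ties between classes
    are broken in favour of the smallest label.
    """
    X = np.asarray(X, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: distance from x_new to every training point.
    dists = np.sqrt(np.sum((X - x_new) ** 2, axis=1))
    # Step 2: indices of the k closest training points.
    nearest = np.argsort(dists, kind="stable")[:k]
    # Step 3: majority class among their labels, smallest label on ties.
    counts = Counter(y[i] for i in nearest)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)

X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.0]]
y = [0, 0, 1, 1, 1]
print(knn_predict([0.2, 0.1], X, y, k=3))  # 0
print(knn_predict([1.0, 0.9], X, y, k=3))  # 1
```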



## Example 

<img src="knn_example.png" width="600" height="500" style="float:left">

In this case K=5 and $\{C_1,C_2\} = \{red,blue\}$.
To classify the first test sample:
1. Find the K closest points.
    1. The labels $\{Y_{a1}, Y_{a2}, \dots, Y_{a5}\}$ of these points are $\{red,red,red,red,blue\}$.
2. The class $C_{j}$ with the maximum frequency in $\{Y_{a1}, Y_{a2}, \dots, Y_{a5}\}$ is $red$.
3. Hence the new test sample is classified as belonging to $red$.
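
The vote in this example can be reproduced with `collections.Counter`. The label list below is simply the one read off the figure.

```python
from collections import Counter

# Labels of the K=5 nearest neighbours from the example above.
neighbour_labels = ["red", "red", "red", "red", "blue"]

votes = Counter(neighbour_labels)
majority_class, majority_count = votes.most_common(1)[0]
print(majority_class, majority_count)  # red 4
```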

Let's start implementing the KNN algorithm. We will run the implemented algorithm on the popular Iris data set.

**Note:** For all parts, assume K is 3 unless explicitly stated otherwise. All functions also assume K=3; K won't take any other value, and your implementation can assume this as well.

In [1]:
import collections
import numpy as np
import pandas as pd

**NOTE**: All the test cases are hidden. However, you can submit your solution and check in the grading report whether it has passed or whether there is an error.

## Part 0: Distance

In [2]:
data=pd.read_csv('data.csv')

In [3]:
train=data.head(100)
test=data.tail(50)
trainx=train[['sepallength', 'sepalwidth', 'petallength', 'petalwidth'] ]
trainy=train[['type']]
trainx=trainx.values.tolist()
trainy=trainy.values.tolist()
testx=test[['sepallength', 'sepalwidth', 'petallength', 'petalwidth'] ]
testy=test[['type']]
testx=testx.values.tolist()
testy=testy.values.tolist()

In [4]:
data.head()

Unnamed: 0.1,Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,type
0,72,6.3,2.5,4.9,1.5,1
1,112,6.8,3.0,5.5,2.1,2
2,132,6.4,2.8,5.6,2.2,2
3,88,5.6,3.0,4.1,1.3,1
4,37,4.9,3.1,1.5,0.1,0


**NOTE**: **DO NOT MODIFY THESE DATA VARIABLES OR TEST CASES. USE THEM AS PROVIDED.**

Complete the function **distance**, which computes and returns the distance between two data points $x$ and $y$ (passed as the arguments v1 and v2).

The distance must be calculated using the formula:

$distance(x,y)=\sqrt{(x_1-y_1)^{4}+(x_2-y_2)^{4}+\dots+(x_n-y_n)^{4}}$

In our case $n$ is 4, as there are 4 features.

The return value of the function **distance** is the distance as a floating-point number, not an int.


A correct implementation will behave like the example below:

**Input:** v1, v2 (represented as lists)

**call:** distance([1,3.4],[4.4,3.1])

**output:** 11.560350340712

**NOTE**: This distance function is not a standard one, and it is not the Euclidean distance either.
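
To see where the example output comes from, the sketch below evaluates the formula term by term for the example call and, for comparison only, the ordinary Euclidean distance between the same two points. The variable names are illustrative.

```python
import math

v1, v2 = [1, 3.4], [4.4, 3.1]

# Fourth powers of the coordinate differences: (-3.4)^4 and (0.3)^4.
quartic_terms = [(a - b) ** 4 for a, b in zip(v1, v2)]
assignment_distance = math.sqrt(sum(quartic_terms))

# Standard Euclidean distance, shown only for comparison.
euclidean_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(round(assignment_distance, 6))  # 11.56035
print(round(euclidean_distance, 4))   # 3.4132
```

Because the differences are raised to the fourth power, the largest coordinate difference dominates the result far more strongly than it does in the Euclidean distance.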

In [5]:
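def distance(v1, v2):
    ###
    diff = np.subtract(v1, v2)       # element-wise differences
    quad_diff = diff ** 4            # fourth power of each difference
    sum_diff = np.sum(quad_diff)
    # Convert to a plain Python float, since the spec asks for a floating-point value
    return float(np.sqrt(sum_diff))
    ###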


In [6]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Part 1: Total Distance

Complete the function distanceFromAll, which returns the distance from the given point $x_i$ to every point in the training set, along with the corresponding label of each training point.

The return format is a list of tuples: the first element of each tuple is the distance, and the second element is the label of the data point in trainx to which the distance is computed.

ex: [ ($dist_{ij}, label_{j}$) ]

where

$dist_{ij}$ is the distance of $x_i$ from point $x_j$ in trainx and $label_j$ is the label corresponding to $x_j$.

A correct implementation will behave like the example below:

**Input:** xi: list, trainx: list of lists, trainy: list of lists

**call:** distanceFromAll(xi,trainx[1:5],trainy[1:5])

**output:** list of tuples

    [(2.962549820288542, [0]),
     (6.072898981323004, [2]),
     (1.2612295588036295, [2]),
     (22.488714527622644, [1])]


In [7]:
trainx[:5]

[[6.3, 2.5, 4.9, 1.5],
 [6.8, 3.0, 5.5, 2.1],
 [6.4, 2.8, 5.6, 2.2],
 [5.6, 3.0, 4.1, 1.3],
 [4.9, 3.1, 1.5, 0.1]]

In [8]:
def distanceFromAll(xi:list, trainx:list, trainy:list) -> list:
    ###
    dist_from_all = []
    for point, label in zip(trainx, trainy):
        dist_from_all.append((distance(xi, point), label))
        
    return dist_from_all
    ###


In [9]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Part 2: KNN

Now that we have all the core functions ready, we are ready to implement the main KNN algorithm. 
Complete the function **predictKNN**, which computes and returns the KNN prediction for a given input $x_i$, K, and the training data.

A correct implementation will behave like the example below:

**Input:** xi: list, trainx: list of lists, trainy: list of lists, k: integer

**call:** predictKNN(trainx[0],trainx,trainy)

**output:** 2: integer
        


In [10]:
from collections import Counter

In [11]:
test_dist_from_all = distanceFromAll(xi=[5, 3.5, 6.8, 4.7], trainx=trainx, trainy=trainy)
nearest = sorted(test_dist_from_all, key=lambda x: x[0])
labels = [label[0] for _, label in nearest][:51] 
raw_counter = Counter(labels)
raw_counter

Counter({2: 34, 1: 17})

In [12]:
winner, winner_count = raw_counter.most_common()[0]
print(winner)
print(winner_count)

2
34


In [13]:
number_of_winners = len([count for count in raw_counter.values() if count == winner_count])
number_of_winners

1

In [14]:
nearest[0][1][0]

2

In [15]:
set([y[0] for y in trainy])

{0, 1, 2}

In [19]:
def predictKNN(xi:list, trainx:list, trainy:list, k=3) -> int:
    ###
    dist_from_all = distanceFromAll(xi, trainx, trainy)
    
    # Sort the (distance, label) tuples by distance, nearest first
    nearest_neighbors = sorted(dist_from_all, key=lambda x: x[0])
    # Each label is a one-element list such as [2], so unwrap it
    k_nearest_labels = [label[0] for _, label in nearest_neighbors[:k]]
    
    # Count how often each label appears among the k nearest neighbours
    label_counts = Counter(k_nearest_labels)
    winner_count = max(label_counts.values())
    winners = [label for label, count in label_counts.items() if count == winner_count]
    
    # A single winner is returned as-is; in a tie, the class with the
    # smallest index wins, as specified in the algorithm description
    return min(winners)
    ###


In [20]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Evaluating KNN

Congrats on implementing one of your first classification algorithms from scratch. Now that we have the algorithm coded up, you should ask yourself how good the model is.

There are many ways to measure the quality of a classification algorithm. One obvious metric is accuracy, which tells you what fraction of the predictions are correct.
However, accuracy is not always the preferred measure. Measures like precision and recall are more relevant than accuracy in some contexts.

More on this [here](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c)
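
A small made-up example shows why accuracy alone can mislead. The sketch below assumes a hypothetical, heavily imbalanced binary problem and a "classifier" that always predicts the majority class; none of these numbers come from the Iris data.

```python
# Hypothetical, heavily imbalanced labels: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5
# A useless "classifier" that always predicts the majority class.
predicted = [0] * 100

# Accuracy: fraction of predictions that match the actual label.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)

# Recall for the positive class: positives found / all actual positives.
found_positives = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
recall = found_positives / sum(a == 1 for a in actual)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The accuracy looks excellent even though not a single positive example is found, which is exactly the situation where recall is the more informative number.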

**True positives** : Number of items correctly labeled as belonging to the positive class.

**False positives**: Number of items incorrectly labeled as belonging to the positive class. 

**True Negatives**: Number of items correctly labeled as belonging to the negative class.

**False Negatives**: Number of items incorrectly labeled as belonging to negative class. 
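
With more than two classes, these four counts are usually computed per class by treating one class as positive and all others as negative. The sketch below does this for made-up label lists; the helper name `binary_counts` is hypothetical.

```python
def binary_counts(predicted, actual, positive):
    """Count TP, FP, TN, FN, treating `positive` as the positive class
    and every other label as negative (one-vs-rest)."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1          # correctly labeled positive
        elif p == positive:
            fp += 1          # labeled positive, actually negative
        elif a == positive:
            fn += 1          # labeled negative, actually positive
        else:
            tn += 1          # correctly labeled negative
    return tp, fp, tn, fn

predicted = [1, 1, 0, 2, 1, 0]
actual    = [1, 0, 0, 1, 1, 2]
tp, fp, tn, fn = binary_counts(predicted, actual, positive=1)
print(tp, fp, tn, fn)   # 2 1 2 1
print(tp / (tp + fp))   # precision for class 1
print(tp / (tp + fn))   # recall for class 1
```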

## Part 3: Confusion Matrix

Most of the metrics in a classification setting can be inferred from the confusion matrix. So let's implement the function **confusion** to compute the confusion matrix given the predicted labels and the actual labels.


In a confusion matrix each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class. 

An example of a confusion matrix :

<img src="./confusion.jpg" width="300" height="300" style="float:left">


Each entry $(i,j)$ of the confusion matrix is the number of points of actual class $j$ that are predicted as class $i$.
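
The indexing convention can be sketched with a plain loop. This is only an illustration on made-up labels, assuming the classes are the integers 0, 1, 2; the name `confusion_by_hand` is hypothetical.

```python
import numpy as np

def confusion_by_hand(predicted, actual, n_classes=3):
    """Build a confusion matrix with rows = predicted class and
    columns = actual class, assuming labels are 0..n_classes-1."""
    matrix = np.zeros((n_classes, n_classes))
    for p, a in zip(predicted, actual):
        matrix[p, a] += 1   # row index: predicted, column index: actual
    return matrix

predicted = [0, 1, 1, 2, 2, 2]
actual    = [0, 1, 2, 2, 2, 0]
cm_example = confusion_by_hand(predicted, actual)
print(cm_example)
```

Note that scikit-learn's `confusion_matrix(y_true, y_pred)` uses the opposite orientation (rows are actual classes, columns are predicted classes), so its result is the transpose of the matrix described here.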

Complete the function **confusion** to compute the confusion matrix for number of classes equal to 3. The function should return the confusion matrix as an array. 

A correct implementation will behave like the example below:

**Input:** predicted: list of labels (integers), actual: list of labels (integers)

**call:** confusion(pred,actual)

**output:** 3x3 numpy array

    [[ 1.  3.  1.]
     [ 3. 21.  4.]
     [ 5. 23.  5.]]
        


In [21]:
from sklearn.metrics import confusion_matrix

In [22]:
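def confusion(predicted,actual):
    ###
    # sklearn puts actual classes on the rows and predicted classes on the columns,
    # so transpose to get rows = predicted and columns = actual as specified above.
    # labels=[0, 1, 2] guarantees a 3x3 result even if a class is absent;
    # astype(float) matches the float entries shown in the example output.
    return confusion_matrix(actual, predicted, labels=[0, 1, 2]).T.astype(float)
    ###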


In [23]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Now that you have the confusion matrix, you can use it to compute various metrics to evaluate the classification model in any context.
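
As a rough sketch of how such metrics fall out of the matrix, the code below computes accuracy and per-class precision and recall. It assumes the orientation used in this problem (rows = predicted, columns = actual), and the matrix values are made up for illustration.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix, rows = predicted, columns = actual.
cm = np.array([[5.0, 1.0, 0.0],
               [0.0, 4.0, 2.0],
               [0.0, 1.0, 7.0]])

correct = np.diag(cm)                  # correctly classified counts per class
accuracy = correct.sum() / cm.sum()
precision = correct / cm.sum(axis=1)   # row sums: everything predicted as the class
recall = correct / cm.sum(axis=0)      # column sums: everything actually in the class

print(accuracy)   # 0.8
print(precision)  # one value per class
print(recall)     # one value per class
```

If the matrix were oriented the other way (rows = actual), the roles of the row sums and column sums would swap.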

**Fin!** You've reached the end of this problem. Don't forget to restart the
kernel and run the entire notebook from top-to-bottom to make sure you did
everything correctly. If that is working, try submitting this problem. (Recall
that you *must* submit and pass the autograder to get credit for your work!)