# K-nearest neighbours classification
## Instructions:
* Go through the notebook and complete the tasks. 
* Make sure you understand the examples given. If you need help, refer to the Essential readings or the documentation link provided, or go to the discussion forum. 
* When a question allows a free-form answer (e.g. what do you observe?), create a new markdown cell below and answer the question in the notebook. 
* Save your notebooks when you are done.
 
**Task 1:**
Run the cell below to load our data. Notice the last line, where we add some random Gaussian noise to our data to make the task more challenging (data in real life usually contains some form of noise).


In [1]:
%matplotlib inline

from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()

#view a description of the dataset 
print(iris.DESCR)

#Set X to the samples-by-features matrix and y to the target labels
X=iris.data 
y=iris.target 


#we add some random noise to our data to make the task more challenging
X=X+np.random.normal(0,0.4,X.shape)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
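
Because the noise is drawn at random, every run of the cell above produces a slightly different `X`, and so slightly different results later on. The following is a minimal sketch of how the same kind of noise could be made repeatable with a seeded generator. The array `X_clean` is a hypothetical stand-in for `iris.data`, and the seed value `0` is an arbitrary choice, not something the notebook prescribes.

```python
import numpy as np

# Hypothetical stand-in for iris.data: 150 samples, 4 features
X_clean = np.zeros((150, 4))

# A seeded generator returns the same noise on every run (seed 0 is arbitrary)
rng = np.random.default_rng(0)
noise = rng.normal(loc=0.0, scale=0.4, size=X_clean.shape)
X_noisy = X_clean + noise

print(X_noisy.shape)               # same shape as the input
print(noise.mean(), noise.std())   # roughly 0 and 0.4
```

The mean and standard deviation of the sampled noise should land close to the requested 0 and 0.4, which is a quick sanity check on the noise level.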

**Task 2:**
1.	How many data samples do we have?
2.	Print the value below using shape on ```X``` appropriately. 

In [2]:
#Enter code here
# Print the number of samples using the shape of X
print(f"Number of samples: {X.shape[0]}")



Number of samples: 150


**Task 3:**
1.	How many features do we have?
2.	Print the value below using shape on ```X``` appropriately. 


In [3]:
#Enter code here
# Print the number of features using the shape of X
print(f"Number of features: {X.shape[1]}")



Number of features: 4


**Task 4:**
1.	How many classes do we have?
2.	Print the value below using ```np.unique``` appropriately. 


In [5]:
#Enter code here
# Print the number of unique classes using np.unique
print(f"Number of classes: {len(np.unique(y))}")




Number of classes: 3


**Task 5:**
1.	How many samples do we have that belong to class 1?
2.	Print this in the cell below using the ```np.where``` function appropriately. 


In [6]:
#Enter code here
# Print the number of samples belonging to class 1 using np.where
class_1_samples = np.where(y == 1)[0].shape[0]
print(f"Number of samples in class 1: {class_1_samples}")



Number of samples in class 1: 50
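
As a possible cross-check on Tasks 4 and 5, `np.unique` can also return the number of samples per class in a single call. The sketch below uses a hypothetical label vector `y_demo` that mimics the layout of `iris.target` (50 samples for each of 3 classes) so that it runs on its own.

```python
import numpy as np

# Hypothetical labels laid out like iris.target: 50 samples per class
y_demo = np.repeat([0, 1, 2], 50)

# return_counts=True gives the class labels and how often each occurs
classes, counts = np.unique(y_demo, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples")

# The count for class 1 alone, using a boolean mask instead of np.where
n_class_1 = int(np.sum(y_demo == 1))
print(n_class_1)
```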


**Task 6:** 

Assume we want to generate a list of shuffled indices of our data. Use the function ```numpy.random.permutation``` to do that. In the cell below, you can already see how to create a list of indices that is not shuffled.


In [7]:
L=list(range(X.shape[0]))
print(L)
#Enter code here
# Create a list of indices and shuffle them using np.random.permutation
L = np.random.permutation(X.shape[0])
print(f"Shuffled indices: {L}")




[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]
Shuffled indices: [  9 111  57  65  47 118 141  71 143 126 106  29  86  39 110  22 115  94
  56  54  26  44   7 129  85  28  66  19  37 122  43  30  75 132 112 104
  64  77 138  69  91 139  46  12  23  14   5  87  68 127  93   4 121  20
 100  83  51 130 116  95  99  92  62 136  50  63  81  74   6  25 103 147
  61 124  35 144 142 113 125  17  78  49 119 105 
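
One reason to shuffle indices is to split the data into training and test sets by hand. The sketch below shows one way this could look, assuming a 20% test fraction. `X_demo` and `y_demo` are hypothetical placeholders with the same shapes as `X` and `y`, and the seeded generator is used only to make the example repeatable.

```python
import numpy as np

# Hypothetical data with the same shapes as the iris matrix and labels
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

rng = np.random.default_rng(42)            # arbitrary seed
idx = rng.permutation(X_demo.shape[0])     # shuffled row indices

n_test = int(0.2 * len(idx))               # number of test samples
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Index X and y with the same indices so samples and labels stay aligned
X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]
print(X_tr.shape, X_te.shape)
```

Indexing both arrays with the same shuffled indices is what keeps each sample paired with its label.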

**Task 7:**
Here is an example of using the k-NN classifier. We split our data into training and test sets (holding out 20% of the samples for testing), fit the classifier on the training data, and evaluate it on the test data. 
Go through the code and make sure you understand it.
Now do the same for the next cell, which prints the confusion matrix and the total accuracy. 
You can find some documentation to help you here: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html. 
Note that for this lab, we use the Euclidean distance along with 10 neighbours.


In [8]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#define knn classifier with 10 neighbors, using the Euclidean distance
knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
#define training and testing data, fit the classifier
knn.fit(X_train,y_train)
#predict values for test data based on training data
y_pred=knn.predict(X_test)
#print values
print(y_test) # true values
print(y_pred) # predicted values


[1 1 1 1 0 1 2 2 2 0 1 0 2 0 0 1 0 2 2 2 2 1 0 0 1 2 0 1 2 2]
[1 1 1 1 0 1 2 1 2 0 1 0 1 0 0 1 0 2 2 2 1 1 0 0 2 2 0 2 1 2]


In [10]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))


[[9 0 0]
 [0 8 2]
 [0 4 7]]
0.8
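
To make the prediction step less of a black box, here is a minimal from-scratch sketch of k-NN with the Euclidean distance and a majority vote. It is not how scikit-learn implements `KNeighborsClassifier`. The function name `knn_predict` and the toy data are hypothetical, and ties in the vote are broken in favour of the smallest label, which is an assumption of this sketch.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=10):
    """Predict a label for each query point by majority vote of its k nearest neighbours."""
    preds = []
    for x in X_query:
        # Euclidean distance from x to every training sample
        dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        # Indices of the k closest training samples
        nearest = np.argsort(dists)[:k]
        # Count the labels among the neighbours; argmax picks the most common
        votes = np.bincount(y_train[nearest])
        preds.append(np.argmax(votes))
    return np.array(preds)

# Toy data: two well-separated clusters of 20 points each
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                   rng.normal(5.0, 0.5, size=(20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)

toy_pred = knn_predict(X_toy, y_toy, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(toy_pred)
```

There is no training in the usual sense here: the "model" is simply the stored training set, and all the work happens at prediction time.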


**Task 8:**
Write your <b>own</b> functions that return the confusion matrix given the true and predicted labels, as well as the accuracy. To do so, fill in the code in the next two cells. 


In [12]:
def myConfMat(y_test, y_pred, classno):
    # Create an empty confusion matrix
    C = np.zeros((classno, classno), dtype=int)

    # Populate the confusion matrix
    for i in range(len(y_test)):
        C[y_test[i], y_pred[i]] += 1
    
    return C

# Calculate and print the custom confusion matrix
print(myConfMat(y_test, y_pred, len(np.unique(y))))




[[9 0 0]
 [0 8 2]
 [0 4 7]]


In [13]:
def myAccuracy(y_test, y_pred):
    # Calculate accuracy as the ratio of correct predictions to total predictions
    correct = np.sum(y_test == y_pred)
    total = len(y_test)
    return correct / total

# Calculate and print the custom accuracy
print(f"Custom Accuracy: {myAccuracy(y_test, y_pred)}")




Custom Accuracy: 0.8
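
The loop in `myConfMat` can also be written without an explicit Python loop, and the accuracy can be read directly off the confusion matrix as the sum of its diagonal divided by the sum of all entries. The sketch below copies the true and predicted labels printed in Task 7 into hypothetical arrays `y_true_demo` and `y_pred_demo` so that it runs on its own.

```python
import numpy as np

# Labels copied from the printed output of the Task 7 cell
y_true_demo = np.array([1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 1, 0, 2, 0, 0,
                        1, 0, 2, 2, 2, 2, 1, 0, 0, 1, 2, 0, 1, 2, 2])
y_pred_demo = np.array([1, 1, 1, 1, 0, 1, 2, 1, 2, 0, 1, 0, 1, 0, 0,
                        1, 0, 2, 2, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 2])

# Rows are true classes, columns are predicted classes
C = np.zeros((3, 3), dtype=int)
# np.add.at accumulates correctly even when the same (row, column) pair repeats
np.add.at(C, (y_true_demo, y_pred_demo), 1)
print(C)

# Correct predictions sit on the diagonal
acc = np.trace(C) / C.sum()
print(acc)
```

Plain fancy-index assignment such as `C[y_true_demo, y_pred_demo] += 1` would count each repeated pair only once, which is why `np.add.at` is used.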


**Optional task:** Write your own functions to calculate class-relative precision and recall. Compare these to the sklearn functions ``precision_score`` and ``recall_score`` on your y_test and y_pred values.

In [14]:
def myPrecision(y_true, y_pred):
    classes = np.unique(y_true)
    precision = np.zeros(classes.shape)

    # Index by position, not by label value, so non-contiguous labels also work
    for i, class_label in enumerate(classes):
        true_positives = np.sum((y_true == class_label) & (y_pred == class_label))
        predicted_positives = np.sum(y_pred == class_label)
        # Precision stays at 0.0 when the class was never predicted
        if predicted_positives > 0:
            precision[i] = true_positives / predicted_positives

    return precision

# Calculate and print custom precision
print(f"Custom Precision: {myPrecision(y_test, y_pred)}")



Custom Precision: [1.         0.66666667 0.77777778]


In [15]:
def myRecall(y_true, y_pred):
    classes = np.unique(y_true)
    recall = np.zeros(classes.shape)

    # Index by position, not by label value, so non-contiguous labels also work
    for i, class_label in enumerate(classes):
        true_positives = np.sum((y_true == class_label) & (y_pred == class_label))
        actual_positives = np.sum(y_true == class_label)
        # Recall stays at 0.0 when the class has no true samples
        if actual_positives > 0:
            recall[i] = true_positives / actual_positives

    return recall

# Calculate and print custom recall
print(f"Custom Recall: {myRecall(y_test, y_pred)}")


Custom Recall: [1.         0.8        0.63636364]
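
Per-class precision and recall can also be derived from the confusion matrix of Task 8: the diagonal holds the true positives, column sums give the number of predictions per class, and row sums give the number of true samples per class. The sketch below hard-codes the confusion matrix printed earlier into a hypothetical array `C_demo`, and it assumes every class is predicted at least once, since otherwise the division would hit a zero.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
C_demo = np.array([[9, 0, 0],
                   [0, 8, 2],
                   [0, 4, 7]])

true_positives = np.diag(C_demo)

# Precision: of everything predicted as class c, how much really is class c
precision = true_positives / C_demo.sum(axis=0)
# Recall: of everything that really is class c, how much was predicted as c
recall = true_positives / C_demo.sum(axis=1)

print(precision)
print(recall)
```

These values should agree with `myPrecision` and `myRecall`, and with `precision_score` and `recall_score` when those are called with `average=None` to get one value per class.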
