# Lab 2: k-NN

In this lab, we will explore classification with the k-Nearest Neighbors (k-NN) classifier. We will first use k-NN to perform classification on a synthetic dataset and then to recognize images of hand-written digits.


Your first objective is to create a toy dataset in order to understand how k-NN works. The dataset will consist of $200$ points in $2$-dimensional space ($N = 200$, $d = 2$). Each point belongs to one of $4$ classes (classes $1, 2, 3$ and $4$), and each class contains exactly $50$ points. The points of each class are drawn from a class-specific Gaussian distribution with standard deviation $0.5$:

- class $1$: mean $[1, 1]$
- class $2$: mean $[1, -1]$
- class $3$: mean $[-1, 1]$
- class $4$: mean $[-1, -1]$

To generate these values, use the `randn` function (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randn.html), which returns samples from the standard normal distribution. Scale and shift the samples as follows:

```python
sigma * np.random.randn(...) + mu
```
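As a rough sketch of how this formula could be applied to all four classes at once (the seed and the variable names `means`, `sigma` and `n_per_class` are illustrative assumptions, not part of the lab statement):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used here only to make the sketch reproducible

# Class means as given in the text, one row per class (classes 1..4)
means = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
sigma = 0.5
n_per_class = 50

# For every class: scale standard normal samples by sigma and shift them by the mean
X = np.vstack([sigma * np.random.randn(n_per_class, 2) + mu for mu in means])

# Labels 1..4, each repeated 50 times, in the same order as the rows of X
y = np.repeat(np.arange(1, 5), n_per_class)

print(X.shape, y.shape)
```

Stacking the classes in order keeps row `i` of `X` aligned with entry `i` of `y`, which is what the plotting code further down expects.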

In [None]:
#your code here

After generating the $200$ points, plot them in a $2$-dimensional plane using `scatter` (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter). Set the colors of points belonging to classes $1,2,3$ and $4$ to blue, red, green and yellow respectively.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap        
cmap_bold = ListedColormap(['blue', 'red', 'green', 'yellow'])
        
plt.scatter(X[:,0], X[:,1], c=y, cmap=cmap_bold)

Perform k-NN on the generated dataset to classify the $200$ points into the $4$ classes using $k=3$. Use the `KNeighborsClassifier` class provided by `scikit-learn` (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). Measure the performance of the k-NN classifier by computing its accuracy. To do this, you can use the `accuracy_score` function of `scikit-learn` (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).
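One possible way to wire these two pieces together is sketched below. It regenerates the toy data so that it runs on its own; the seed is an arbitrary assumption. The names `clf` and `k` are chosen because the decision-boundary code later in this lab refers to them.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data as described above (arbitrary seed, for reproducibility only)
np.random.seed(0)
means = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
X = np.vstack([0.5 * np.random.randn(50, 2) + mu for mu in means])
y = np.repeat(np.arange(1, 5), 50)

k = 3
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X, y)            # k-NN simply stores the training points
y_pred = clf.predict(X)  # predict the labels of the same 200 points

acc = accuracy_score(y, y_pred)
print("Training accuracy (k = %i): %.3f" % (k, acc))
```

Note that the classifier is evaluated on the same points it was fitted on, so this number describes the fit to the training set rather than performance on unseen data.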

In [None]:
#your code here

What do you observe? Did k-NN manage to correctly classify all points?

Run the following code to plot the decision boundaries for each class.

In [None]:
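# Assumes that X, y, the fitted classifier clf and the value k
# are already defined by the previous cells.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap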

h = .02  # step size in the mesh

# Create a light color map whose order matches the class colors
# used for the points (blue, red, green, yellow)
cmap_custom = ListedColormap(['#AAAAFF', '#FFAAAA', '#AAFFAA', '#FFFFAA'])

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_custom)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("4-Class classification (k = %i)" % k)

Experiment with different values of k and observe how the decision boundaries of the $4$ classes are affected.

---

Your next task is to write your own k-NN classifier. First, write a function that computes the Euclidean distance between a pair of points. Given two datapoints $\mathbf{x}$ and $\mathbf{z}$, their Euclidean distance is defined as:

$$ d(\mathbf{x},\mathbf{z}) = \sqrt {\sum_{i=1}^d (x_i - z_i)^2} $$
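To make the formula concrete, here is a small worked example in plain Python. The function name `euclidean_distance_sketch` is hypothetical and only illustrates one way the computation could be written.

```python
from math import sqrt

def euclidean_distance_sketch(a, b):
    # Sum the squared coordinate-wise differences, then take the square root
    if len(a) != len(b):
        raise ValueError("the two vectors must have the same size")
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# (1 - 4)^2 + (2 - 6)^2 = 9 + 16 = 25, and sqrt(25) = 5
print(euclidean_distance_sketch([1, 2], [4, 6]))  # 5.0
```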

In [None]:
from math import sqrt
def euclideanDistance(vectorA, vectorB):
    # Computes the euclidean distance between two vectors. The
    # two vectors must have the same size.
    
    #your code here

Write another function that takes as input the training data $\mathbf{X}$, their class labels $\mathbf{y}$, the parameter $k$ and a datapoint $\mathbf{z}$, and returns the predicted class for $\mathbf{z}$ using k-NN. In case of a tie in the vote, assign $\mathbf{z}$ randomly to one of the tied classes.
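The voting step with random tie-breaking could look roughly like the sketch below. The function name `knn_predict_sketch` and the tiny demo dataset are assumptions made for illustration; the distance is computed inline so that the block runs on its own.

```python
import random
from collections import Counter
from math import sqrt

def knn_predict_sketch(k, X, y, z):
    # Distance from z to every training point, paired with that point's label
    dists = [
        (sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z))), label)
        for x, label in zip(X, y)
    ]
    # Keep the labels of the k closest training points
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    # Collect every class that reaches the highest vote count
    top = max(votes.values())
    tied = [label for label, count in votes.items() if count == top]
    # Pick uniformly at random among the tied classes (trivial if there is one)
    return random.choice(tied)

X_demo = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_demo = [1, 1, 2, 2]
print(knn_predict_sketch(3, X_demo, y_demo, [0.2, 0.1]))  # 1
```

In the demo call, the three nearest neighbors carry the labels 1, 1 and 2, so class 1 wins the vote without a tie.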

In [None]:
from collections import Counter

def kNN(k, X, y, z):
    # Assigns to the test instance the label of the majority of the labels of the k closest 
    # training examples using the k-NN with euclidean distance.
    
    #your code here
    
    return label

Use the two functions that you created to perform classification on the training set with $k=3$. Compute the accuracy achieved by your classifier.

In [None]:
#your code here

Is the accuracy equal to the one achieved by the k-NN classifier of `scikit-learn`?

---

We will next use the k-NN classifier to recognize a set of hand-written digits. We will employ a dataset consisting of $1797$ images that is available from `scikit-learn` (http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html). Each image shows one hand-written digit and has $8 \times 8$ pixels. All the images have already been flattened into vectors of dimensionality $64$ and are stored in the usual tabular form, with one row per image.

Run the following code to load the dataset and to plot the first ten digits.

In [None]:
from sklearn import datasets

digits = datasets.load_digits()

X = digits.data
y = digits.target

# Show the first ten digits
fig = plt.figure('First 10 Digits') 
for i in range(10):
    a = fig.add_subplot(2,5,i+1) 
    plt.imshow(X[i,:].reshape(8,8), cmap=plt.cm.gray)
    plt.axis('off')

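Your next task is to split the dataset into a training and a test set. To do this, use the function `train_test_split` of `scikit-learn` (http://scikit-learn.org/0.16/modules/generated/sklearn.cross_validation.train_test_split.html). Note that the linked page describes an old release: in recent versions of `scikit-learn` the function is imported from `sklearn.model_selection`, since the `sklearn.cross_validation` module has been removed. Set the test size equal to $\frac{4}{10}$ of the whole dataset.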
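A minimal sketch of such a split, assuming a recent `scikit-learn` release where `train_test_split` lives in `sklearn.model_selection` (the `random_state` value is an arbitrary choice that only makes the split repeatable):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hold out 40% of the images for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

print(X_train.shape, X_test.shape)
```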

In [None]:
#your code here

Use k-NN to recognize the hand-written digits of the test set. Compute the classification accuracy, and generate a text summary of the precision, recall and $F_1$-score for each class. Use the `classification_report` function of `scikit-learn` (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).
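The evaluation could be put together along the following lines. This is only a sketch: the choice of $k=3$ and the `random_state` are assumptions, since the text does not fix them at this point.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=0
)

# Fit on the training images only, then predict the held-out test images
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Test accuracy: %.3f" % acc)
print(report)  # one row per digit with precision, recall and F1-score
```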

In [None]:
#your code here

Compute the classification accuracy for different values of $k$ and plot the accuracy as a function of $k$ using `plot` (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot).
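A possible shape for this loop is sketched below, under the assumption that $k$ ranges over $1, \dots, 29$ as suggested by the starter code that follows. The names `max_k` and `ks`, and the `random_state` of the split, are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=0
)

max_k = 30
ks = np.arange(1, max_k)          # candidate values k = 1, ..., 29
accuracies = np.zeros(len(ks))

for i, k in enumerate(ks):
    # Refit the classifier for each k and score it on the test set
    clf = KNeighborsClassifier(n_neighbors=int(k))
    clf.fit(X_train, y_train)
    accuracies[i] = accuracy_score(y_test, clf.predict(X_test))

print(accuracies)
```

The two arrays can then be passed to `plt.plot(ks, accuracies)` to obtain the requested curve.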

In [None]:
k = 30
accuracies = np.zeros(len(range(1,k)))

#your code here

What do you observe?