# Lab 06: Nearest Neighbors ###
In this lab, we'll play around with nearest-neighbor metrics and work on the problem of using nearest neighbors for classification.

In [None]:
# Run the following cell, don't modify it:
import matplotlib
#matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import math
import scipy.stats as stats
plt.style.use('fivethirtyeight')

import pandas as pd

### Distance in Multiple Dimensions ###
We know how to compute Euclidean distance in 2-dimensional space. If we have a point at coordinates $(x_0,y_0)$ and another at $(x_1,y_1)$, the distance between them is

$$D = \sqrt{(x_0-x_1)^2 + (y_0-y_1)^2}.$$

In 3-dimensional space, the points are $(x_0, y_0, z_0)$ and $(x_1, y_1, z_1)$, and the formula for the distance between them is

$$
D = \sqrt{(x_0-x_1)^2 + (y_0-y_1)^2 + (z_0-z_1)^2}
$$

In $n$-dimensional space, things are a bit harder to visualize, but the formula generalizes in the same way: we sum the squares of the differences between corresponding coordinates, and then take the square root of that sum.
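
Written out as a formula, with the two points denoted $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ (this notation is introduced here only for illustration and is not used elsewhere in the lab), the description above becomes

$$
D = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

Setting $n = 2$ or $n = 3$ recovers the two formulas above.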

**Question 1**: Complete the following distance function that computes the Euclidean distance between two $n$-dimensional points `point1` and `point2`.

In [None]:
def distance(point1, point2):
    """Returns the distance between point1 and point2
    where each argument is an array 
    consisting of the coordinates of the point"""
    return ...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
def distance(point1, point2):
    """Returns the distance between point1 and point2
    where each argument is an array 
    consisting of the coordinates of the point"""
    return np.sqrt(np.sum((point1 - point2)**2))
</pre>
</details>
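
As a quick sanity check, a function like this can be tried on points whose distance is known in advance. The sketch below is illustrative only: it uses a separately named helper, `euclidean_distance` (a hypothetical name, chosen so that it does not overwrite your `distance`), and compares its result with NumPy's `np.linalg.norm`.

```python
import numpy as np

def euclidean_distance(point1, point2):
    """Euclidean distance between two equal-length coordinate arrays."""
    return np.sqrt(np.sum((point1 - point2) ** 2))

# 2-D: the classic 3-4-5 right triangle
d2 = euclidean_distance(np.array([0, 0]), np.array([3, 4]))

# 4-D: each of the four coordinates differs by 1, so the distance is sqrt(4)
a = np.array([1.0, 1.0, 1.0, 1.0])
b = np.zeros(4)
d4 = euclidean_distance(a, b)

# The norm of the difference vector is the same quantity
d4_norm = np.linalg.norm(a - b)

print(d2, d4, d4_norm)
```

If your own `distance` gives different values on these inputs, check that you are squaring the differences before summing them.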

### Wine Classification ###

Let's use this on a [wine dataset](https://archive.ics.uci.edu/ml/datasets/Wine). The table `wine` contains the chemical composition of 178 different Italian wines. The classes are the grape species, called cultivars. 

In [None]:
wine = pd.read_csv('wine.csv')
wine

There are three classes, but let's just see whether we can tell Class 1 apart from the other two. Let's overwrite the `Class` column so that its value is `1` if the class is `1` and `0` otherwise.

In [None]:
# For converting Class to binary
def is_one(x):
    if x == 1:
        return 1
    else:
        return 0
    
wine['Class'] = wine['Class'].apply(lambda x: is_one(x))
wine
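
The same conversion can also be written without a helper function by comparing the whole column at once. The sketch below uses a small made-up table (`toy`, not the real `wine` data) to show that the two approaches agree.

```python
import pandas as pd

# A tiny made-up table standing in for the wine data
toy = pd.DataFrame({'Class': [1, 2, 3, 1, 2]})

def is_one(x):
    return 1 if x == 1 else 0

# Element-by-element conversion, as in the cell above
via_apply = toy['Class'].apply(is_one)

# Vectorized conversion: compare the column to 1, then turn True/False into 1/0
via_compare = (toy['Class'] == 1).astype(int)

print(via_compare.tolist())
```

Either form works for a table of this size; the vectorized comparison is simply shorter.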

The first two wines are both in Class 1. Let's find the distance between them.

**Question 2**: Find the distance between Wine 0 and Wine 1 in the `wine` table using the `distance` function defined in Question 1.

*Hint*: To access row $i$ of a dataframe `df`, use `df.iloc[i]`.

In [None]:
# Enter your answer here.
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
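attributes = wine.drop(columns=['Class'])  # compare chemical attributes only, not the label
distance(attributes.iloc[0], attributes.iloc[1])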
</pre>
</details>

**Question 2 (continued)**: The last wine in the table is of Class 0. Find its distance from the first wine.

*Hint*: You could use the row number, or a negative index, to access the last row of the dataframe with the `iloc` indexer.

In [None]:
# Enter your answer here.
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
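attributes = wine.drop(columns=['Class'])  # compare chemical attributes only, not the label
distance(attributes.iloc[0], attributes.iloc[-1])
</pre>
OR
<pre>
attributes = wine.drop(columns=['Class'])  # compare chemical attributes only, not the label
distance(attributes.iloc[0], attributes.iloc[177])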
</pre>
</details>
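
To see why both forms of the answer agree, here is a small sketch on a made-up three-row table (`toy` is a hypothetical stand-in, not the wine data): `iloc[-1]` and `iloc[len(toy) - 1]` select the same row.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({'Alcohol': [14.2, 13.2, 12.9], 'Ash': [2.4, 2.1, 2.7]})

last_by_negative = toy.iloc[-1]            # count from the end
last_by_position = toy.iloc[len(toy) - 1]  # count from the start

print(last_by_negative.equals(last_by_position))
```

The negative index has the advantage that it keeps working if the number of rows changes.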

That's quite a bit bigger! Let's do some visualization to see if Class 1 really looks different from Class 0. 

**Question 3**: Create a scatter plot of `Flavanoids` vs. `Alcohol` and color the points based on `Class`. 

*Hint*: Use the `colormap='viridis'` parameter to get a better color differentiator than black and white.

In [None]:
# Create the scatterplot here
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
wine.plot.scatter('Flavanoids', 'Alcohol', c='Class', colormap='viridis')
</pre>
</details>

The yellow points (Class 1) are almost entirely separate from the purple ones. That is one indication of why the distance between two Class 1 wines would be smaller than the distance between wines of two different classes. Let's compare this with a plot that uses `Alcalinity of Ash` and `Ash` as the attributes.

**Question 3 (continued)**: Create a scatter plot of `Alcalinity of Ash` vs. `Ash` and color the points based on `Class`.

*Hint*: Use the `colormap='viridis'` parameter to get a better color differentiator than black and white.

In [None]:
# Create the scatterplot here
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
wine.plot.scatter('Alcalinity of Ash', 'Ash', c='Class', colormap='viridis')
</pre>
</details>

But for some pairs of attributes the picture is murkier.

**Question 4**: Create a scatter plot of `Magnesium` vs. `Total Phenols` and color the points based on `Class`. 

*Hint*: Use the `colormap='viridis'` parameter to get a better color differentiator than black and white.

In [None]:
# Create the scatterplot here
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
wine.plot.scatter('Magnesium', 'Total Phenols', c='Class', colormap='viridis')
</pre>
</details>

Let's see if we can implement a classifier based on all of the attributes. After that, we'll see how accurate it is.

### A Plan for the Implementation ###
It's time to write some code to implement the classifier.  The input is a `point` that we want to classify.  The classifier works by finding the $k$ nearest neighbors of `point` from the training set.  So, our approach will go like this:

1. Find the closest $k$ neighbors of `point`, i.e., the $k$ wines from the training set that are most similar to `point`.

2. Look at the classes of those $k$ neighbors, and take the majority vote to find the most-common class of wine.  Use that as our predicted class for `point`.

So that will guide the structure of our Python code.
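
The two steps above can be sketched in a few lines of NumPy. This is only an illustration on made-up 2-D points, with a hypothetical function name `knn_predict`; the lab builds the same idea up step by step with pandas in the questions that follow.

```python
import numpy as np

# Made-up training set: four 2-D points with binary labels
train_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_labels = np.array([0, 0, 1, 1])

def knn_predict(points, labels, point, k):
    # Step 1: distance from `point` to every training point, then the k closest
    dists = np.sqrt(np.sum((points - point) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # Step 2: majority vote among the labels of those k neighbors
    votes = np.bincount(labels[nearest], minlength=2)
    return int(np.argmax(votes))

pred_near_origin = knn_predict(train_points, train_labels, np.array([0.5, 0.5]), 3)
pred_far_corner = knn_predict(train_points, train_labels, np.array([5.5, 5.0]), 3)
print(pred_near_origin, pred_far_corner)
```

With `k = 3`, the point near the origin has two Class 0 neighbors and one Class 1 neighbor, so the vote goes to 0; the point near the far corner gets the opposite result.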

### Implementation Step 1 ###
To implement the first step, we need to compute the distance from each wine in the training set to `point`, sort the wines by distance, and take the $k$ closest wines in the training set.

That generalizes what we did above when we computed the distance between individual pairs of wines. We'll reuse the `distance` function from Question 1.

**Question 5**: Complete the `all_distances` function below. The function should return the distances between every point in the `training` table and `new_point`.

*Hint*: [df.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) can be used to apply a function to every row, if you pass `axis=1`


In [None]:
# Complete the function below
def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    attributes = training.drop(['Class'], axis=1) # Remove the class attribute from the table
    return ...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    attributes = training.drop(['Class'], axis=1) # Remove the class attribute
    return attributes.apply(lambda x: distance(x, new_point), axis=1)</pre>
</details>
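
For comparison, the same distances can be computed without `apply` by letting pandas subtract the point from every row at once. The sketch below is an optional alternative, not the lab's intended answer, and it runs on a made-up table (`toy`) with made-up values.

```python
import numpy as np
import pandas as pd

# Made-up table with two attributes and a class label
toy = pd.DataFrame({
    'Alcohol': [0.0, 3.0, 6.0],
    'Ash':     [0.0, 4.0, 8.0],
    'Class':   [1, 1, 0],
})
attributes = toy.drop(columns=['Class'])
new_point = attributes.iloc[0]

# Row-by-row version, using apply with axis=1
row_wise = attributes.apply(
    lambda row: np.sqrt(np.sum((row - new_point) ** 2)), axis=1)

# Vectorized version: subtract the point from every row, square, sum across columns
vectorized = np.sqrt(((attributes - new_point) ** 2).sum(axis=1))

print(vectorized.tolist())
```

Both versions return a series indexed like the original table, which is what the next step relies on.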

In [None]:
# Use this cell to test your function
all_distances(wine, wine.loc[0])

**Question 6**: Complete the `closest` function below. It calls `all_distances` to get a series of distances indexed by row, and it should return the `k` entries of that series with the smallest distances, i.e., the `k` training points closest to `new_point`.

*Hint*: [series.sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_values.html) will sort the values in increasing distance by default, and you can slice the resulting series by `k`

In [None]:
# Complete the function below
def closest(training, new_point, k):
    """Returns the top k points closest to new_point from training"""
    distances = all_distances(training, new_point)
    topk = ...
    return topk

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
def closest(training, new_point, k):
    """Returns the top k points closest to new_point from training"""
    distances = all_distances(training, new_point)
    topk = distances.sort_values()[:k]
    return topk
</pre>
</details>
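
pandas also offers `Series.nsmallest`, which returns the `k` smallest values directly. The sketch below, on a made-up series of distances, shows that it selects the same entries as sorting and slicing; it is an alternative, not the answer the lab expects.

```python
import pandas as pd

# Made-up distances, indexed by row label
distances = pd.Series([2.5, 0.0, 7.1, 1.2], index=[0, 1, 2, 3])

k = 2
by_sorting = distances.sort_values().iloc[:k]  # sort, then keep the first k
by_nsmallest = distances.nsmallest(k)          # ask for the k smallest directly

print(by_nsmallest.index.tolist())
```

In both results the index still holds the original row labels, so the neighbors can be looked up in the table afterwards.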

Let's see how this works on our `wine` data. We'll just take the first wine and find its five nearest neighbors among all the wines. Remember that since this wine is part of the dataset, it is its own nearest neighbor. So we should expect to see it at the top of the list, followed by four others.

First let's extract its attributes:

In [None]:
special_wine = wine.iloc[0]

And now let's find its 5 nearest neighbors.

In [None]:
closest(wine, special_wine, 5)

Bingo! The first row is the nearest neighbor, which is itself – there's a 0 in the series value as expected. 

The following function lets us look up these points by index in the original table.

In [None]:
# Run this cell to define the function
def show_closest_rows(table, closest_series):
    # The series is indexed by row label, so look the rows up by label with .loc
    return table.loc[closest_series.index]

In [None]:
# Run the cell below to see the closest wines
show_closest_rows(wine, closest(wine,special_wine,5))

All five nearest neighbors are of Class 1, which is consistent with our earlier observation that Class 1 wines appear to be clumped together in some dimensions.

### Implementation Step 2 ###
Next we need to take a "majority vote" of the nearest neighbors and assign our point the same class as the majority.

**Question 7**: Complete the `majority` function below. Given the rows returned by `show_closest_rows`, it should take a vote on the `Class` column and return the majority class (`0` or `1`).

*Hint*: You can use `loc[]` and the Python built-in function `len()` to find the number of rows that match a certain condition.

In [None]:
# Complete the function below
def majority(topkrows):
    ones = ...
    zeros = ...
    if ones > zeros:
        return 1
    else:
        return 0

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
def majority(topkrows):
    ones = len(topkrows.loc[topkrows['Class'] == 1])
    zeros = len(topkrows.loc[topkrows['Class'] == 0])
    if ones > zeros:
        return 1
    else:
        return 0
</pre>
</details>
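
Another way to count the votes is `value_counts`, which tallies every distinct value in a column in one call. The sketch below uses a hypothetical name, `majority_by_counts`, and made-up tables; like `majority`, it resolves a tie in favor of Class 0.

```python
import pandas as pd

def majority_by_counts(topkrows):
    """Majority vote on a binary 'Class' column; a tie goes to 0."""
    counts = topkrows['Class'].value_counts()
    ones = counts.get(1, 0)    # 0 if no row has Class 1
    zeros = counts.get(0, 0)   # 0 if no row has Class 0
    return 1 if ones > zeros else 0

# Made-up vote tables
mostly_ones = pd.DataFrame({'Class': [1, 1, 0, 1, 0]})
all_zeros = pd.DataFrame({'Class': [0, 0, 0]})
tied = pd.DataFrame({'Class': [1, 0]})

print(majority_by_counts(mostly_ones),
      majority_by_counts(all_zeros),
      majority_by_counts(tied))
```

Choosing an odd `k` avoids ties altogether when there are only two classes.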

In [None]:
# Test your function here
special_wine = wine.iloc[0]
top5rows = show_closest_rows(wine, closest(wine,special_wine,5))
majority(top5rows)

We're now ready to stitch everything together. 

**Question 8**: Using the functions defined above, write a `classify` function that, given a `training` set, a `new_point`, and a classifier parameter `k`, returns the majority class among the `k` nearest neighbors of `new_point`.

In [None]:
def classify(training, new_point, k):
    ...
    ...
    return ...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
def classify(training, new_point, k):
    # Use the function's own arguments rather than the global table and a fixed k
    closest_points = closest(training, new_point, k)
    topkrows = show_closest_rows(training, closest_points)
    return majority(topkrows)
</pre>
</details>

In [None]:
# Test your classifier below
special_wine = wine.iloc[0]
classify(wine, special_wine, 5)

If we change `special_wine` to be the last one in the dataset, is our classifier able to tell that it's in Class 0?

In [None]:
special_wine = wine.iloc[177]
classify(wine, special_wine, 5)

Yes! The classifier gets this one right too.

But we don't yet know how it does with all the other wines, and in any case we know that testing on wines that are already part of the training set might be over-optimistic. We'll learn about classifier training and errors next week!
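
To make the "over-optimistic" point concrete, here is a rough sketch on made-up points (not the wine data) with hypothetical names throughout. With $k = 1$, every training point is its own nearest neighbor, so testing on the training set always looks perfect; removing each point from the training set before classifying it gives a very different picture.

```python
import numpy as np

# Made-up, distinct training points whose labels alternate along a line
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [4.5, 0.0], [7.0, 0.0]])
labels = np.array([0, 1, 0, 1, 0])

def predict_1nn(train_points, train_labels, point):
    """Label of the single nearest training point."""
    dists = np.sqrt(np.sum((train_points - point) ** 2, axis=1))
    return train_labels[np.argmin(dists)]

# Test on the training points themselves: each one finds itself at distance 0
train_accuracy = np.mean([predict_1nn(points, labels, p) == y
                          for p, y in zip(points, labels)])

# Remove each point from the training set before classifying it
held_out_hits = []
for i in range(len(points)):
    keep = np.arange(len(points)) != i
    prediction = predict_1nn(points[keep], labels[keep], points[i])
    held_out_hits.append(prediction == labels[i])
held_out_accuracy = np.mean(held_out_hits)

print(train_accuracy, held_out_accuracy)
```

The example is deliberately extreme, but it shows why accuracy should be measured on points the classifier has not already seen.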

That's it for Lab 06. This lab was built from the [Data 8 textbook](https://www.inferentialthinking.com/chapters/17/4/Implementing_the_Classifier).