In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier 

# Build and test a Nearest Neighbors classifier.

- We will try to build the best possible nearest neighbors classifier for a specific data set. By 'best', we mean _highest accuracy_.
- Use a train/dev/test split
- Experiment with as many hyper-parameters as possible



### Load the Iris data

Load the Iris data to use for experiments. The data include 50 observations of each of 3 types of irises (150 total). Each observation includes 4 measurements: sepal length, sepal width, petal length, and petal width. The goal is to predict the iris type from these measurements.

<http://en.wikipedia.org/wiki/Iris_flower_data_set>

In [None]:
# Load the data, which is included in sklearn.
iris = load_iris()
print('Iris target names:', iris.target_names)
print('Iris feature names:', iris.feature_names)
X, y = iris.data, iris.target

### Take a quick look at the data

This data set is fairly well behaved. For now, we are going to skip EDA and feature engineering, which would normally be essential steps.

### Break off a test set

In [None]:
np.random.seed(0)



At this point, we set aside the test set. We should only touch the test set once!
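
One way to break off the test set is to shuffle the row indices and hold out a fixed fraction. This is only a sketch: the 20% hold-out ratio and the names `X_train`, `y_train`, `X_test`, and `y_test` are assumptions, since the notebook does not fix them.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Seed the global generator as the notebook cell does, so the shuffle is repeatable.
np.random.seed(0)
shuffle = np.random.permutation(X.shape[0])

# Assumed hold-out size: one fifth of the rows (30 of 150).
n_test = X.shape[0] // 5
test_idx, train_idx = shuffle[:n_test], shuffle[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```

Shuffling matters here because the Iris rows are stored grouped by species, so an unshuffled slice would hold out a single class.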

### Split the training set into development train/test

In [None]:
np.random.seed(1)



## Implement K Nearest Neighbors

### Implement a distance function

Create a distance function that returns the distance between 2 observations.
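
A minimal sketch of such a function, assuming Euclidean (L2) distance is the metric wanted; the name `euclidean_distance` is illustrative, not something the notebook prescribes.

```python
import numpy as np

def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two 1-D observations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the per-feature differences, sum them, and take the square root.
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0 (the 3-4-5 right triangle)
```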

Just for fun, let's compute all the pairwise distances in the training data and plot a histogram.
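
The pairwise computation could look like the sketch below. It uses a small synthetic array as a stand-in for the training data (an assumption made so the example is self-contained) and bins the distances with `np.histogram`; in the notebook, `plt.hist(dists, bins=10)` would draw the same bins.

```python
from itertools import combinations

import numpy as np

# Synthetic stand-in for the training data: 20 observations, 4 features.
rng = np.random.RandomState(0)
train_data = rng.rand(20, 4)

# One distance per unordered pair of distinct rows: n * (n - 1) / 2 values.
dists = [np.sqrt(np.sum((a - b) ** 2)) for a, b in combinations(train_data, 2)]

# Bin counts and bin edges, i.e. the numbers a histogram plot would display.
counts, edges = np.histogram(dists, bins=10)
print(len(dists), counts.sum())
```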

##### Nicer Versions of the above two cells

Python is great for data science because libraries such as NumPy provide a ton of high-level, commonly used, and often vectorized operations.
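
As an illustration of the vectorized style, the sketch below computes every pairwise distance at once with NumPy broadcasting instead of Python loops. The synthetic `train_data` array is again an assumed stand-in for the real training data.

```python
import numpy as np

# Synthetic stand-in for the training data: 20 observations, 4 features.
rng = np.random.RandomState(0)
train_data = rng.rand(20, 4)

# Broadcasting: (n, 1, d) - (1, n, d) gives an (n, n, d) array of differences.
diffs = train_data[:, np.newaxis, :] - train_data[np.newaxis, :, :]
dist_matrix = np.sqrt((diffs ** 2).sum(axis=2))

# The matrix is symmetric with a zero diagonal, so keep each pair once
# by taking the strict upper triangle.
upper = np.triu_indices(train_data.shape[0], k=1)
dists = dist_matrix[upper]

print(dist_matrix.shape, dists.shape)
```

The intermediate `(n, n, d)` array trades memory for speed, which is fine for a data set as small as Iris but may not be for much larger ones.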

### 1-NN Classifier

OK, now let's create a class that implements a Nearest Neighbors classifier. We'll model it after the sklearn classifier implementations, with `fit()` and `predict()` methods.

<http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier>
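
One possible shape for such a class is sketched below, assuming Euclidean distance and the class name `NearestNeighbors` that the notebook refers to later. The toy points at the end are made up purely to show the interface.

```python
import numpy as np

class NearestNeighbors:
    """Minimal 1-NN classifier with an sklearn-style fit/predict interface."""

    def fit(self, train_data, train_labels):
        # "Training" only memorizes the data; the work happens at predict time.
        self.train_data = np.asarray(train_data, dtype=float)
        self.train_labels = np.asarray(train_labels)
        return self

    def predict(self, test_data):
        test_data = np.asarray(test_data, dtype=float)
        preds = []
        for item in test_data:
            # Euclidean distance from this query point to every training point.
            dists = np.sqrt(((self.train_data - item) ** 2).sum(axis=1))
            # The prediction is the label of the single closest training point.
            preds.append(self.train_labels[np.argmin(dists)])
        return np.array(preds)

# Toy usage: two points near the origin labeled 0, one far away labeled 1.
clf = NearestNeighbors().fit([[0, 0], [1, 1], [5, 5]], [0, 0, 1])
print(clf.predict([[0.2, 0.1], [4, 4.5]]))  # [0 1]
```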

#### Test

See how well the classifier performs.
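
Performance here means accuracy: the fraction of predictions that match the true labels. The sketch below is self-contained, so it makes two simplifying assumptions: sklearn's `KNeighborsClassifier` with `n_neighbors=1` stands in for the hand-written class, and the development split (25%, with hypothetical names `train_data`, `dev_data`, `train_labels`, `dev_labels`) is taken from the full data rather than from a training set with the test set already removed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_data, dev_data, train_labels, dev_labels = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Stand-in for the hand-written 1-NN classifier.
clf = KNeighborsClassifier(n_neighbors=1).fit(train_data, train_labels)
preds = clf.predict(dev_data)

# Accuracy is the mean of the element-wise "prediction equals label" mask.
correct = (preds == dev_labels).sum()
accuracy = np.mean(preds == dev_labels)
print('correct: %d / %d, accuracy: %.3f' % (correct, len(dev_labels), accuracy))
```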

### k nearest neighbors

The implementation above only allows for a single nearest neighbor; that is, the classifier predicts the label of the closest available point. What about using more than one nearest neighbor? Typically, to make a prediction we:

1. Find the k closest points (according to our distance metric) to the query point.
2. Find the majority label of those k points found in (1).
3. Return the label in (2) as the prediction.

Try implementing this strategy below.
Hint: Check out the `most_common` method of `collections.Counter`.
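
The three steps above can be sketched as follows, assuming Euclidean distance and the class name `OurKNearestNeighbors` used later in the notebook. The toy data is invented for illustration.

```python
from collections import Counter

import numpy as np

class OurKNearestNeighbors:
    """Minimal k-NN classifier: majority vote among the k closest points."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, train_data, train_labels):
        self.train_data = np.asarray(train_data, dtype=float)
        self.train_labels = np.asarray(train_labels)
        return self

    def predict(self, test_data):
        test_data = np.asarray(test_data, dtype=float)
        preds = []
        for item in test_data:
            dists = np.sqrt(((self.train_data - item) ** 2).sum(axis=1))
            # Step 1: indices of the k closest training points, nearest first.
            nearest = np.argsort(dists)[:self.k]
            # Step 2: count the labels of those k points.
            votes = Counter(self.train_labels[nearest].tolist())
            # Step 3: the most common label is the prediction.
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

# Toy usage: three points near the origin (label 0), two near (5, 5) (label 1).
clf = OurKNearestNeighbors(k=3).fit(
    [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]], [0, 0, 0, 1, 1])
print(clf.predict([[0.5, 0.5], [5.5, 5]]))  # [0 1]
```

With an even `k`, or with three classes, votes can tie. In this sketch a tie goes to whichever tied label appears first among the neighbors sorted by distance, which is one reasonable choice among several.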

#### Test

`NearestNeighbors` and `OurKNearestNeighbors` with `k=1` should give the same predictions.

### Picking k: the number of neighbors to use in classification

Implement a way to pick the number of neighbors to use in the classifier. We already have a development test set, so simply extend the procedure in the previous code cell to run over different numbers of neighbors. Plot the development test set performance versus the number of neighbors.
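
A sketch of that sweep is below. To stay self-contained it uses sklearn's `KNeighborsClassifier` in place of the hand-written class and splits the full data; the range of k from 1 to 15 and the 25% development split are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_data, dev_data, train_labels, dev_labels = train_test_split(
    X, y, test_size=0.25, random_state=1)

ks = list(range(1, 16))
accuracies = []
for k in ks:
    clf = KNeighborsClassifier(n_neighbors=k).fit(train_data, train_labels)
    # score() returns the mean accuracy on the given data and labels.
    accuracies.append(clf.score(dev_data, dev_labels))

for k, acc in zip(ks, accuracies):
    print('k=%2d  accuracy=%.3f' % (k, acc))
```

In the notebook, `plt.plot(ks, accuracies)` would turn these numbers into the requested plot.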

### This isn't really enough information. Let's try using more train/dev splits

Step 1 - Write a function that splits a training dataset randomly, builds a kNN classifier, and reports the accuracy.

Step 2 - Run the above function 500 times for each value of k between 1 and 15 to get the mean test accuracy for that value of k.

Step 3 (optional) - If you would like, now would be the time to experiment with different distance metrics. Repeat the above step and see what the results look like.
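
The steps above can be sketched as follows. The function names `split_and_score` and `mean_accuracy`, the 25% split, and the use of sklearn's `KNeighborsClassifier` are all assumptions. The example call uses only 20 trials to keep it quick; the exercise asks for 500. In the notebook, `X` and `y` would be the training data left after the test set was set aside.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def split_and_score(X, y, k, metric='euclidean', seed=None):
    """Step 1: random split, fit a kNN classifier, return dev accuracy."""
    tr_X, dev_X, tr_y, dev_y = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k, metric=metric).fit(tr_X, tr_y)
    return clf.score(dev_X, dev_y)

def mean_accuracy(X, y, k, n_trials=500, metric='euclidean'):
    """Step 2: average the accuracy over many random splits."""
    # Seeding each trial with its index reuses the same splits for every k,
    # so differences between values of k are not due to split luck.
    scores = [split_and_score(X, y, k, metric, seed=t) for t in range(n_trials)]
    return np.mean(scores)

X, y = load_iris(return_X_y=True)
n_trials = 20  # reduced for speed; the exercise specifies 500
for k in range(1, 16):
    euclid = mean_accuracy(X, y, k, n_trials)
    # Step 3: the same experiment with a different distance metric.
    manhattan = mean_accuracy(X, y, k, n_trials, metric='manhattan')
    print('k=%2d  euclidean=%.3f  manhattan=%.3f' % (k, euclid, manhattan))
```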

### Deployment

Use Occam's razor to select the best model: among models with similar accuracy, prefer the simplest one.

Once you have chosen the hyper-parameters for the model (k and distance metric), it is time to deploy the model. Now is the time to test our model on the test set, so that we know what to expect after deployment.

Note: Before deploying the model, we could actually incorporate the test data into our training set. The only thing that matters now is how well the model generalizes in the real world.
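
A sketch of this final step, under stated assumptions: `best_k = 5` is a hypothetical choice standing in for whatever the experiments selected, the 20% test split is assumed, and sklearn's `KNeighborsClassifier` stands in for the hand-written class.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_k = 5  # hypothetical value; use the k chosen on the development splits

# Touch the test set exactly once, with the hyper-parameters already fixed.
final_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_clf.score(X_test, y_test)
print('test accuracy: %.3f' % test_accuracy)

# Optional: refit on all available data before deployment. The accuracy above
# remains the estimate of real-world performance; it is not recomputed.
deployed_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
```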

### Visualizing the results

We've been a little haphazard so far; we should have plotted the data and some results to get an idea of how the algorithm is performing. Plot the data with the true labels as colors, and plot it with some fitted labels, for differing values of k, to see how our KNN algorithm is performing.

In [None]:
k = 

# Predictions must be made on X_test so that preds lines up with the plots below.
# clfk = OurKNearestNeighbors(k=k)
# clfk.fit(X_train, y_train)
# preds = clfk.predict(X_test)

cm_bright = ListedColormap(['#FF0000', '#0000FF', '#00FF00'])

plt.figure(figsize=(8,8)) 
p = plt.subplot(2, 2, 1)
p.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright)
plt.title("Iris Test Data: X1 vs X2")

p = plt.subplot(2, 2, 2)
p.scatter(X_test[:, 2], X_test[:, 3], c=y_test, cmap=cm_bright)
plt.title("Iris Test Data: X3 vs X4")

p = plt.subplot(2, 2, 3)
p.scatter(X_test[:, 0], X_test[:, 1], c=preds, cmap=cm_bright)
plt.title("Iris Test Data: X1 vs X2 [pred colors]")

p = plt.subplot(2, 2, 4)
p.scatter(X_test[:, 2], X_test[:, 3], c=preds, cmap=cm_bright)
plt.title("Iris Test Data: X3 vs X4 [pred colors]")

## Bonus: Implement variable scaling. 

That is, scale each feature to its z-score. Make sure not to contaminate the test set: compute the scaling statistics from the training data only.
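
One way to keep the scaling honest is to put the scaler and the classifier in a single sklearn `Pipeline`, so the means and standard deviations are learned from the training data only and then reused on held-out data. The sketch below assumes `n_neighbors=5`, a 25% development split, and the step names `'scale'` and `'knn'`, none of which come from the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
train_data, dev_data, train_labels, dev_labels = train_test_split(
    X, y, test_size=0.25, random_state=1)

pipe = Pipeline([
    ('scale', StandardScaler()),                 # z-score each feature
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

# fit() learns the scaling statistics from train_data only.
pipe.fit(train_data, train_labels)

# score() applies the training statistics to dev_data before predicting.
print('dev accuracy with scaling: %.3f' % pipe.score(dev_data, dev_labels))
print('feature means used:', np.round(pipe.named_steps['scale'].mean_, 2))
```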

Repeat Steps 1 and 2 from the train/dev splits section above, this time with scaling applied.

Step 1

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler



Step 2

Question: Did variable scaling improve anything? Why or why not?