In [None]:
%matplotlib inline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from helper_functions import plot_supervised_model

For the final example of a supervised machine learning model, we'll look at the K-Nearest Neighbors classifier and how changing the number of neighbors and the weighting scheme influences model performance.

First, load the same dataset as in the previous example.

In [None]:
iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target

#### More about K-Nearest Neighbors

K-nearest neighbors is a pretty cool algorithm that differs from the ones we've learned so far because it doesn't learn any coefficients or parameters during training. Instead, it keeps track of the training data and queries that data at classification time.

It's a fairly simple algorithm. You keep track of all of the data points, plotted in their multidimensional space. You already know which classes these data points belong to because you have their labels.

When a new data point comes in and you want to classify it, you plot that in the multidimensional space along with all of the other data points.

To find the best class for the point, you look at the labels for the k-nearest neighbors (the k data points closest to the new data point). You return the most common class amongst the k-nearest neighbors as the new point's label.

* You can play around with how many neighbors you look at (k).

* You can play around with how you weight a neighbor's distance when considering its class. For example, if one data point is really close to your new point, you'd probably expect its class to be more likely than the class of a bunch of data points that are much further away.

* You can also play around with the type of function you use to measure distance (look up Euclidean distance or Manhattan distance as a couple of options if you're interested), but we won't do that in this workshop.
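The procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how scikit-learn implements it: the function name `knn_predict` and the toy data are made up for this example, Euclidean distance is assumed, and ties are broken by whichever label is encountered first.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k training points with the smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Most common label among those neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class 0 sits near the origin, class 1 near (5, 5).
toy_points = [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0), (5.0, 5.0), (5.5, 4.5), (6.0, 5.0)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, (0.8, 0.3), k=3))  # prints 0
print(knn_predict(toy_points, toy_labels, (5.2, 4.9), k=3))  # prints 1
```

Notice that there is no training step at all: the "model" is just the stored points and labels, and all of the work happens when a new point is classified.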

### K-Nearest Neighbors

We'll initialize three models to see how the number of neighbors changes the model.

Quickly note that we use the `uniform` weighting scheme here. That means that all of the neighbors contribute equally to the class calculation, regardless of their distance.

In [None]:
models = [('2 Neighbors Uniform Weights', KNeighborsClassifier(n_neighbors=2, weights='uniform')),
          ('5 Neighbors Uniform Weights', KNeighborsClassifier(n_neighbors=5, weights='uniform')),
          ('15 Neighbors Uniform Weights', KNeighborsClassifier(n_neighbors=15, weights='uniform'))]

Use the first three features (sepal length, sepal width, and petal length).

In [None]:
X_sub = X[:, :3]
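
If you want to double-check which columns a slice like `[:, :3]` keeps, one option is to look at the dataset's `feature_names`. A small sketch (the variable names `iris` and `first_three` are just for this example):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the first three feature columns, in the same order as the data matrix.
first_three = list(iris.feature_names[:3])
print(first_three)

# Slicing the data the same way keeps all 150 rows and 3 columns.
print(iris.data[:, :3].shape)
```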

Separate the data into training and test sets using an 80/20 split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_sub, y, test_size=0.2, random_state=42)

Let's train, predict, and see the accuracy score for each of our models.

Which model has the highest accuracy?<br />
Why do you think this may be the case?

In [None]:
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name + ' Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Next we'll plot the models as usual.

Do you notice any difference between the three models?

In [None]:
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    plot_supervised_model(name, model, X_test, y_test, y_pred)

Now it's time to explore the effect of the weighting scheme.

We use `weights='distance'` here. The `distance` weighting scheme takes distance into account when computing the majority class for the new data point: each neighbor's vote is weighted by the inverse of its distance, so points that are closer to the new point count more than points that are further away.

More info is in the [scikit-learn `KNeighborsClassifier` documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
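
To see why the weighting scheme can change a prediction, here is a simplified sketch of uniform versus inverse-distance voting in plain Python. It is not scikit-learn's implementation: the function name `knn_predict_weighted`, the toy points, and the shortcut of returning the label of an exact match (distance zero) are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def knn_predict_weighted(train_points, train_labels, query, k=3, weights="uniform"):
    """Vote among the k nearest points, optionally weighting each vote by 1/distance."""
    distances = [math.dist(point, query) for point in train_points]
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]

    scores = defaultdict(float)
    for i in nearest:
        if weights == "distance":
            if distances[i] == 0.0:
                # Assumption for this sketch: an exact match decides the label outright.
                return train_labels[i]
            scores[train_labels[i]] += 1.0 / distances[i]
        else:
            # Uniform weighting: every neighbor contributes one vote.
            scores[train_labels[i]] += 1.0
    # Label with the highest total score.
    return max(scores, key=scores.get)


# Hypothetical toy data: one class-0 point close to the query, two class-1 points further away.
toy_points = [(1.0, 0.0), (3.0, 0.0), (3.5, 0.0)]
toy_labels = [0, 1, 1]
query = (1.5, 0.0)

print(knn_predict_weighted(toy_points, toy_labels, query, k=3, weights="uniform"))   # prints 1
print(knn_predict_weighted(toy_points, toy_labels, query, k=3, weights="distance"))  # prints 0
```

With uniform weights, the two class-1 points outvote the single class-0 point. With inverse-distance weights, the class-0 point at distance 0.5 scores 2.0, while the class-1 points at distances 1.5 and 2.0 score about 1.17 combined, so the closer point wins.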

In [None]:
models = [('2 Neighbors Inverse Distance Weights', KNeighborsClassifier(n_neighbors=2, weights='distance')),
          ('5 Neighbors Inverse Distance Weights', KNeighborsClassifier(n_neighbors=5, weights='distance')),
          ('15 Neighbors Inverse Distance Weights', KNeighborsClassifier(n_neighbors=15, weights='distance'))]

We'll train, predict, compute the accuracy, and plot each of these models.

How do the accuracies compare with the `uniform` weight K-Nearest Neighbors models from before?<br />
Why do you think that's the case?

In [None]:
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name + ' Accuracy: ' + str(accuracy_score(y_test, y_pred)))
    plot_supervised_model(name, model, X_test, y_test, y_pred)