In [None]:
# plotting
%matplotlib inline

# data analysis
import pandas as pd

# data visualization
from helper_functions import plot_setup
plot_setup()

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

For the final example of a supervised machine learning model, we'll look at the **K-Nearest Neighbor Classifier** and how changing the number of neighbors and the weighting scheme influences model performance.

Again, load the same dataset as in the previous examples.

In [None]:
titanic = pd.read_csv('titanic_processed.csv')

In [None]:
X = titanic.drop('survived', axis=1)
y = titanic['survived']

### K-Nearest Neighbor

K-nearest neighbor is a pretty cool algorithm. It differs from the ones we've learned so far because it doesn't learn any coefficients or parameters during training. Instead, it keeps track of the training data and queries that data at classification time.

It's a fairly simple algorithm. You keep track of all of the data points, plotted in their multidimensional space. You already know which classes these data points belong to because you have their labels.

When a new data point comes in and you want to classify it, you plot that in the multidimensional space along with all of the other data points.

To find the best class for the point, you look at the labels for the k-nearest neighbors (the k data points closest to the new data point). You return the most common class amongst the k-nearest neighbors as the new point's label.
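
The steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how `scikit-learn` implements it: the function name `knn_predict` and the toy data are made up for this example, and it assumes Euclidean distance with every neighbor counting equally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Distance from the new point to every stored training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Return the most common label among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters with labels 0 and 1.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))  # 0
print(knn_predict(train_points, train_labels, (6.0, 7.0), k=3))  # 1
```

Notice that there is no training step at all: the "model" is just the stored data, and all of the work happens when a prediction is requested.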

You can play around with several model input parameters to change your classifier:

* How many neighbors you look at (k).

* How you weight the distance of a neighbor when considering its class. For example, if one data point is really close to your new point, you probably expect that point's class to be more likely than the class of a bunch of data points much further away.

* The type of function you use to measure distance (look up [Euclidean distance](https://www.cut-the-knot.org/pythagoras/DistanceFormula.shtml) or [Manhattan distance](https://en.wiktionary.org/wiki/Manhattan_distance) as a few options if you're interested).
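
To make the last option concrete, here is a small sketch of the two distance functions mentioned above. The helper names `euclidean` and `manhattan` are made up for this illustration; in practice the classifier computes distances for you.

```python
def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

a = (0.0, 0.0)
b = (3.0, 4.0)

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

The same pair of points is 5 units apart under one metric and 7 under the other, so which points count as "nearest" can change depending on the metric you choose.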

We'll initialize three different types of models to see how the number of neighbors changes the model.

Quickly note that we use the `uniform` weighting scheme here. That means that all of the neighbors contribute equally to the class calculation, regardless of their distance.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

models = [('2 Neighbors Uniform Weights', KNeighborsClassifier(n_neighbors=2, weights='uniform')),
          ('5 Neighbors Uniform Weights', KNeighborsClassifier(n_neighbors=5, weights='uniform')),
          ('15 Neighbors Uniform Weights', KNeighborsClassifier(n_neighbors=15, weights='uniform'))]

We'll start with four features that we have found to be predictive (pclass, gender, age, family_status).

In [None]:
X_sub = X[['pclass', 'gender', 'age', 'family_status']]

Separate the data into training and test sets using an 80/20 split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_sub, y, test_size=0.2, random_state=42)

Let's train, predict, and see the accuracy score for each of our models.

Which model has the highest accuracy?<br />
Why do you think this may be the case?

In [None]:
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name + ' Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Now it's time to explore the effect of the weighting scheme.

We use `weights='distance'` here. The `distance` weighting scheme takes distance into account when computing the majority class for the new data point: each neighbor is weighted by the inverse of its distance, so points closer to the new point count more than points further away. More info [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
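
The difference between the two schemes can be sketched with a hand-written vote. This is a simplified illustration rather than `scikit-learn`'s actual code: `weighted_vote` and the neighbor list are hypothetical, and the sketch assumes no neighbor sits at distance zero.

```python
from collections import defaultdict

def weighted_vote(neighbors, weights="uniform"):
    """Pick a class from (distance, label) pairs for the k nearest points."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            totals[label] += 1.0             # every neighbor counts the same
        else:
            totals[label] += 1.0 / distance  # closer neighbors count more (assumes distance > 0)
    return max(totals, key=totals.get)

# Hypothetical neighbors: one very close point of class 1, two distant points of class 0.
neighbors = [(0.5, 1), (2.0, 0), (2.5, 0)]

print(weighted_vote(neighbors, "uniform"))   # 0
print(weighted_vote(neighbors, "distance"))  # 1
```

With uniform weights the two distant points outvote the close one. With distance weights the close point contributes 1/0.5 = 2.0, which beats 1/2.0 + 1/2.5 = 0.9 from the other two.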

In [None]:
models = [('2 Neighbors Distance Weights', KNeighborsClassifier(n_neighbors=2, weights='distance')),
          ('5 Neighbors Distance Weights', KNeighborsClassifier(n_neighbors=5, weights='distance')),
          ('15 Neighbors Distance Weights', KNeighborsClassifier(n_neighbors=15, weights='distance'))]

We'll train, predict, and compute the accuracy for each of these models.

How do the accuracies compare with the `uniform` weight K Nearest Neighbors models from before?
Why do you think that's the case?

In [None]:
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name + ' Accuracy: ' + str(accuracy_score(y_test, y_pred)))

#### How to pick the values for the model parameters?

Often the model you want to use has a number of parameters you can tweak. Finding the set of parameters that works best in your particular case can be time-consuming, and tuning by hand against the same data carries the risk of overfitting. `scikit-learn` provides a way to search for the best parameters automatically, scoring each combination with cross-validation.
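
To get a feel for how much work such a search involves, the sketch below enumerates a parameter grid by hand. It assumes a grid of eight neighbor counts and two weighting schemes with five-fold cross-validation; the variable names are made up for this illustration, and the real search handles this bookkeeping for you.

```python
from itertools import product

# Assumed grid: eight values of k and two weighting schemes.
param_grid = {
    "n_neighbors": [2, 3, 5, 7, 9, 11, 13, 15],
    "weights": ["uniform", "distance"],
}
n_folds = 5  # assumed number of cross-validation folds

# Build every combination of parameter values.
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

print(len(combinations))            # 16
print(len(combinations) * n_folds)  # 80
print(combinations[0])              # {'n_neighbors': 2, 'weights': 'uniform'}
```

Each of the 16 combinations is trained and scored once per fold, so the search fits 80 models before it can report which combination did best on average.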

In [None]:
parameters = {'n_neighbors': [2, 3, 5, 7, 9, 11, 13, 15], 'weights': ['uniform', 'distance']}

In [None]:
clf = GridSearchCV(KNeighborsClassifier(), parameters, cv=5)
clf.fit(X_train, y_train)

This is the best mean cross-validated accuracy found by the grid search.

In [None]:
clf.best_score_

And these are the parameters that the search found to have the highest cross-validated accuracy.

In [None]:
clf.best_params_

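`clf.best_estimator_` now becomes our model. By default, `GridSearchCV` has already refit it on the full training set with the best parameters, so the `fit` call below is only there to mirror the previous examples; everything else looks exactly like before.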

In [None]:
model = clf.best_estimator_
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

This accuracy score is lower than our best score above even though we've used the best parameters. Why do you think that is?

**On Your Own**: Play around with the `p` input parameter. It is called the "Power parameter for the Minkowski metric", but just think about it as the metric used to measure distance between two points.

`p=1`: Manhattan distance  
`p=2`: Euclidean distance  

How does changing the value affect accuracy?