# K-Nearest Neighbors Classification

We've seen how to use $k$-nearest neighbors for regression problems, where the response variable is quantitative. We can also use $k$-nearest neighbors for classification problems, where the response variable is categorical.

The idea is the same. To predict the response for a new set of inputs, we find the $k$ training observations nearest to it and look at their class labels. The most common label among those $k$ neighbors is the prediction for the new input.
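
The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration on made-up points: the helper `knn_predict` and the toy data are hypothetical, Euclidean distance is assumed, and it is not how scikit-learn implements the search internally.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Count the class labels among those neighbors and return the most common one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): three points of class "A", two of class "B"
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B"]

print(knn_predict(points, labels, (1.2, 1.4), k=3))  # prints: A
```

Choosing an odd $k$ avoids tied votes when there are only two classes.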

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

data = pd.read_csv("/data/titanic.csv")

### Exercise 1

Fit a 5-nearest neighbors model to predict `survived` from `age`, `sex`, and `class`. Predict whether a 20-year-old female in first class would survive. What about a 20-year-old female in third class?

In [2]:
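# Keep only the columns we need and drop rows with missing values
data_train = data[['age', 'sex', 'class', 'survived']].dropna()
# One-hot encode the categorical predictors
train = pd.get_dummies(data_train[['sex', 'class']])
# Drop one level per variable, so female and First class become the baseline
train = train.drop(['sex_female', 'class_First'], axis=1)
train['age'] = data_train['age']
train.head()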

| | sex_male | class_Second | class_Third | age |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 22.0 |
| 1 | 0 | 0 | 0 | 38.0 |
| 2 | 0 | 0 | 1 | 26.0 |
| 3 | 0 | 0 | 0 | 35.0 |
| 4 | 1 | 0 | 1 | 35.0 |


In [3]:
from sklearn.neighbors import KNeighborsClassifier

# Fit a 5-nearest neighbors classifier on the encoded features
model = KNeighborsClassifier(n_neighbors=5)
model.fit(train, data_train['survived'])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [4]:
# Feature order must match the training columns: sex_male, class_Second, class_Third, age
print('20-year old female in First class:', model.predict_proba([[0, 0, 0, 20.0]]))
print('20-year old female in Third class:', model.predict_proba([[0, 0, 1, 20.0]]))

20-year old female in First class: [[ 0.4  0.6]]
20-year old female in Third class: [[ 0.6  0.4]]
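
The probabilities above are multiples of 0.2 because, with `n_neighbors=5` and `weights='uniform'`, `predict_proba` reports the fraction of the five neighbors that fall in each class, with columns ordered as in `model.classes_`. The sketch below shows this on a tiny made-up dataset (the values of `X_toy` and `y_toy` are hypothetical and not taken from the Titanic data).

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy one-feature dataset (hypothetical values)
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]]
y_toy = [0, 1, 1, 0, 1, 0]

toy_model = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)

query = [[2.0]]
# Indices of the 5 training points closest to the query
_, neighbor_idx = toy_model.kneighbors(query)
neighbor_labels = [y_toy[i] for i in neighbor_idx[0]]

# Fraction of neighbors in each class, computed by hand
manual_proba = [neighbor_labels.count(c) / 5 for c in toy_model.classes_]

proba = toy_model.predict_proba(query)[0]
print("neighbor labels:", sorted(neighbor_labels))
print("manual fractions:", manual_proba)
print("predict_proba:", list(proba))
print("predicted class:", toy_model.predict(query)[0])
```

Here three of the five neighbors have label 1, so the class-1 probability is 0.6 and `predict` returns 1. Read the same way, the first-class passenger in the exercise is predicted to survive and the third-class passenger is not.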


### Exercise 2

Use cross-validation to determine the optimal number of neighbors $k$, as measured by F1 score. Plot the training and test error curves as a function of $k$.

In [9]:
from sklearn.model_selection import cross_val_score
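
One possible way to approach this exercise is sketched below. It is not the notebook's solution: it uses `cross_validate` (rather than `cross_val_score`) so that training scores are returned alongside the validation scores, and it runs on synthetic data from `make_classification` as a stand-in for the Titanic features. The grid of $k$ values and the choice of 5 folds are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded Titanic features (an assumption, not the real data)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

ks = list(range(1, 31, 2))  # odd values of k from 1 to 29 (an arbitrary grid)
train_f1, test_f1 = [], []
for k in ks:
    scores = cross_validate(
        KNeighborsClassifier(n_neighbors=k),
        X,
        y,
        cv=5,                     # 5-fold cross-validation
        scoring="f1",             # F1 score for the positive class (label 1)
        return_train_score=True,  # also score each model on its own training folds
    )
    train_f1.append(scores["train_score"].mean())
    test_f1.append(scores["test_score"].mean())

# The k with the highest mean validation F1
best_k = ks[int(np.argmax(test_f1))]
print("best k by mean cross-validated F1:", best_k)
```

Plotting `train_f1` and `test_f1` against `ks` (for example with `plt.plot`) gives the two curves the exercise asks for. Since F1 is a score rather than an error, higher is better; plotting `1 - F1` instead would turn them into error curves. With the real data, `X` and `y` would be replaced by `train` and `data_train['survived']`.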
