# KNN for Titanic: Planning

Plan:
- use sklearn
    - use built in sklearn functionality for everything possible, including preprocessing
- use few features:
    - sex
    - age
    - pclass?
    - fare?
- start by exploring all data
    - df.head()
    - choose relevant features and explain
- preprocess data
    - fill in missing values
    - one-hot encode categorical
    - normalize numerical
        - don't just use sklearn's normalize -- write own
- use cross validation to find best number of neighbors?
    - outside scope of workshop?
    - won't work?
        - try it on own first

Start by importing the necessary packages and loading the dataset.

In [465]:
# basic imports
import numpy as np
import pandas as pd

In [466]:
# load dataset and visualize first few entries
data = pd.read_csv('titanic.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Here's the data! Let's select relevant features. 

Which features do you think will be good predictors of survival? Here are the columns; every one except 'Survived' is a candidate:

In [467]:
list(data)

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

Consider the phrase 'women and children first', which suggests that sex and age matter. Let's also consider location in the ship, determined by passenger class. We'll choose:
- Pclass
- Sex
- Age

We also need to save the 'Survived' column of the data. This is what we are trying to predict! These are our labels.

In [468]:
# choose relevant features
pclass = data.Pclass.values
sex = data.Sex.values
age = data.Age.values

# save labels
y = data.Survived.values

# save number of data points
m = y.size

print('\nFeatures:\n', np.array([pclass[:10], sex[:10], age[:10]]).T)
print('\nLabels:\n', y[:10].reshape(-1, 1))


Features:
 [[3 'male' 22.0]
 [1 'female' 38.0]
 [3 'female' 26.0]
 [1 'female' 35.0]
 [3 'male' 35.0]
 [3 'male' nan]
 [1 'male' 54.0]
 [3 'male' 2.0]
 [3 'female' 27.0]
 [2 'female' 14.0]]

Labels:
 [[0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]]


We have some missing values in age. Let's fill them in with the mean value.

In [469]:
# fill in missing values in age
# (SimpleImputer replaces the Imputer class from older scikit-learn releases)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
age = imputer.fit_transform(age.reshape(-1, 1))
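
To make the imputation step concrete, here is a minimal sketch of what filling with the mean does, written by hand with numpy on a made-up array (`toy_age` is an illustrative name, not a column from the dataset). The mean is computed from the observed values only, then written into each missing slot.

```python
import numpy as np

# toy ages with one missing value (made-up numbers, not from the dataset)
toy_age = np.array([22.0, 38.0, np.nan, 30.0])

# mean of the observed (non-NaN) values only
observed_mean = np.nanmean(toy_age)

# replace each NaN with that mean, leave everything else untouched
filled = np.where(np.isnan(toy_age), observed_mean, toy_age)

print(observed_mean)
print(filled)
```

The same idea explains why the missing age in the real data shows up later as a non-integer value: it is the mean of all known ages.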

Before using k-nearest neighbors, we need to make sure the data is in the right form. Is there a problem with our current setup?

There is a problem! Well, multiple problems, actually. 

First off, Sex is not a number. K-nearest neighbors can only learn from numerical features. Let's fix that by encoding it as a single 0/1 column (one-hot encoding with the redundant second column dropped).

In [470]:
# encode Sex as a single 0/1 column: male -> 1, female -> 0
sex = (sex == 'male').astype(int).reshape(-1, 1)

What about the class feature? It takes the values 1, 2, and 3, corresponding to the first, second, and third classes in the ship. We need to one-hot encode this feature as well, because we are treating class as a categorical feature, not a continuous value. Left as a single number, the distance computation would treat first and third class as twice as far apart as first and second.

We will use sklearn's OneHotEncoder to do this.

In [471]:
# one-hot encode class feature
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
pclass = onehotencoder.fit_transform(pclass.reshape(-1, 1)).toarray()
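
As a rough illustration of what one-hot encoding produces, the sketch below builds the same kind of array by hand with numpy on a made-up column (`toy_pclass` is an illustrative name). Each distinct value gets its own column, ordered by sorted value, and each row has a single 1 in the column for its class.

```python
import numpy as np

# toy passenger classes (made-up values, not from the dataset)
toy_pclass = np.array([3, 1, 2, 3])

# sorted distinct categories: one output column per category
categories = np.unique(toy_pclass)

# compare every row against every category, giving one 1.0 per row
one_hot = (toy_pclass.reshape(-1, 1) == categories).astype(float)

print(categories)
print(one_hot)
```

With this encoding, any two different classes are the same distance apart, which is the point of treating class as categorical.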

Ok, now our data is all numerical, and k-nearest neighbors can learn from it.

However, there's still a problem...

Look at the range of values for each feature. Are they all on the same scale?

In [472]:
print('\nFeatures:\n', np.hstack((pclass[:10, :], sex[:10, :], age[:10, :])))


Features:
 [[  0.           0.           1.           1.          22.        ]
 [  1.           0.           0.           0.          38.        ]
 [  0.           0.           1.           0.          26.        ]
 [  1.           0.           0.           0.          35.        ]
 [  0.           0.           1.           1.          35.        ]
 [  0.           0.           1.           1.          29.69911765]
 [  1.           0.           0.           1.          54.        ]
 [  0.           0.           1.           1.           2.        ]
 [  0.           0.           1.           0.          27.        ]
 [  0.           1.           0.           0.          14.        ]]


They're not! Age has a much larger range of values than either sex or class.

This is a problem! K-nearest neighbors compares passengers by distance, so age differences will be considered much more significant than differences in sex or class. This does not make sense for our problem.
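
A small worked example makes the imbalance visible. The three passengers below are made up for illustration, each described only by `[is_male, age]`, and the distance is plain Euclidean distance.

```python
import numpy as np

# made-up passengers, each as [is_male, age in years]
man_30 = np.array([1.0, 30.0])
woman_30 = np.array([0.0, 30.0])
man_35 = np.array([1.0, 35.0])

# same age, different sex
dist_sex = np.linalg.norm(man_30 - woman_30)

# same sex, five years apart
dist_age = np.linalg.norm(man_30 - man_35)

print(dist_sex, dist_age)
```

On the raw scale, a five-year age gap puts two passengers five times further apart than a difference in sex does.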

Let's fix this by scaling and shifting our feature values.

In [473]:
# shift/scale age to be in range 0 -> 1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
age = scaler.fit_transform(age)

#np.min(age), np.max(age), np.mean(age)
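
The planning notes suggest writing the scaling by hand rather than only relying on sklearn. A minimal sketch of min-max scaling is shown below; `min_max_scale` and `toy_age` are illustrative names, and the function assumes the column contains at least two distinct values (otherwise it would divide by zero).

```python
import numpy as np

def min_max_scale(column):
    """Shift and scale values so the minimum maps to 0 and the maximum to 1."""
    col_min = column.min()
    col_max = column.max()
    # assumes col_max > col_min
    return (column - col_min) / (col_max - col_min)

# toy ages as a column vector (made-up numbers, not from the dataset)
toy_age = np.array([[2.0], [22.0], [42.0], [82.0]])
scaled = min_max_scale(toy_age)

print(scaled.ravel())
```

After scaling, age lives in the same 0 to 1 range as the encoded sex and class columns.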

Now, we are good to go! Let's combine our features into a single array in order to train our classifier.

In [474]:
# create single array of features
X = np.hstack([pclass, sex, age])

Now, let's make a k-nearest neighbors classifier.

In [475]:
# create classifier, fit on data
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
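
To show roughly what the classifier does at prediction time, here is a from-scratch sketch of k-nearest neighbors with Euclidean distance and an unweighted majority vote, matching the `p=2` and `weights='uniform'` settings printed above. It is a simplified illustration, not sklearn's implementation: `knn_predict` and the toy arrays are made-up names, and tie-breaking may differ from the library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict a label for x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count the labels of those neighbors and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# two made-up clusters of points with labels 0 and 1
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```

Note that "fitting" here amounts to storing the training data; all of the work happens when a prediction is requested.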

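Let's evaluate our classifier's performance on the training data. Keep in mind that this is an optimistic estimate, since the classifier has already seen every one of these passengers.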

In [476]:
# make predictions for all X
y_pred = classifier.predict(X)

# compare predictions to actual
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y, y_pred)
print('Accuracy on training set: %.2f' % accuracy)

Accuracy on training set: 0.85
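
The planning notes raise the idea of using cross-validation to choose the number of neighbors. One possible way to do that is sketched below with sklearn's `cross_val_score`. To keep the example self-contained it runs on synthetic data (`X_demo`, `y_demo` are made up), and the list of candidate values is an arbitrary assumption; in the notebook you would pass `X` and `y` instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data: 200 points, 5 features in [0, 1], binary labels
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# candidate neighbor counts to compare (an arbitrary choice for illustration)
candidate_ks = [1, 3, 5, 7, 9, 11]

mean_scores = {}
for k in candidate_ks:
    clf = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: each fold is held out once for scoring
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()

# pick the k with the highest mean held-out accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

Because every score comes from passengers the classifier did not train on, the result is a fairer estimate than accuracy on the training set.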
