# KNN for Titanic: Planning

Plan:
- use sklearn
    - use built-in sklearn functionality wherever possible, including preprocessing
- use few features:
    - sex
    - age
    - pclass?
    - fare?
- start by exploring all data
    - df.head()
    - choose relevant features and explain
- preprocess data
    - fill in missing values
    - one-hot encode categorical
    - normalize numerical
        - don't just use sklearn's normalize -- write own
- use cross validation to find best number of neighbors?
    - outside scope of workshop?
    - won't work?
        - try it on own first
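
The cross-validation idea at the end of the plan could be sketched roughly as follows. This is only a sketch under stated assumptions: the features are synthetic (generated with `make_classification`) rather than the Titanic data, and both the candidate values of k and the 5-fold split are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the Titanic features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

# candidate numbers of neighbors (arbitrary choice for illustration)
candidate_ks = [1, 3, 5, 7, 9, 11]

# mean 5-fold cross-validated accuracy for each k
cv_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

# pick the k with the highest mean accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```

Odd values of k are used here so that a two-class vote cannot tie.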

In [33]:
# imports
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

In [34]:
# load dataset and visualize first few entries
data = pd.read_csv('titanic.csv')
data.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Here's the data! Let's select relevant features. 

Which features do you think will be good predictors of survival?

Let's consider the phrase 'women and children first'. Let's also consider location in the ship, determined by passenger class. We'll choose:
- Pclass
- Sex
- Age

We also need to save the 'Survived' column of the data. This is what we are trying to predict! These are our labels.

In [35]:
# choose relevant features
pclass = data.Pclass.values
sex = data.Sex.values
age = data.Age.values

# save labels
y = data.Survived.values

# save number of data points
m = y.size

print('\nFeatures:\n', np.array([pclass[:5], sex[:5], age[:5]]).T)
print('\nLabels:\n', y[:5].reshape(-1, 1))


Features:
 [[3 'male' 22.0]
 [1 'female' 38.0]
 [3 'female' 26.0]
 [1 'female' 35.0]
 [3 'male' 35.0]]

Labels:
 [[0]
 [1]
 [1]
 [1]
 [0]]


Before using k-nearest neighbors, we need to make sure the data is in the right form. Is there a problem with our current setup?

There is a problem! Well, multiple problems, actually. 

First off, Sex is not a number. K-nearest neighbors can only learn from numerical features. Let's fix that by encoding Sex as a single 0/1 value: 1 for male, 0 for female.

In [36]:
# encode Sex as a binary feature: 1 for 'male', 0 otherwise
sex = np.asarray([1 if s == 'male' else 0 for s in sex])
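
The cell above maps Sex to a single 0/1 column. For comparison, a one-hot encoding in the strict sense gives each category its own column. A minimal sketch, assuming a small hand-made `Sex` column rather than the real dataset (the names `demo`, `one_hot`, and `is_male` are made up for this example):

```python
import pandas as pd

# small hand-made example column (assumption, not read from titanic.csv)
demo = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# one column per category, with a 1 marking the category of each row
one_hot = pd.get_dummies(demo['Sex'], dtype=int)
print(one_hot)

# for a two-category feature, a single 0/1 column carries the same information
is_male = (demo['Sex'] == 'male').astype(int).to_numpy()
print(is_male)
```

With only two categories the second one-hot column is redundant, which is why a single 0/1 column is enough here.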

Ok, now our data is all numerical, and k-nearest neighbors can learn from it.

However, there's still a problem...

Look at the range of values for each feature. Are they all on the same scale?

They're not! Age has a much larger range of values than either sex or class.

This is a problem! K-nearest neighbors compares passengers by the distance between their feature values, so age differences will count for much more than differences in sex or class. This does not make sense for our problem.
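
To see the effect concretely, here is a small sketch with three made-up passengers (the values are hypothetical, not rows from the dataset). It uses Euclidean distance, which is the default metric of `KNeighborsClassifier`.

```python
import numpy as np

# three hypothetical passengers as [pclass, sex, age] (made-up values)
a = np.array([1, 0, 30.0])   # first class, female, age 30
b = np.array([3, 1, 30.0])   # third class, male, same age
c = np.array([1, 0, 40.0])   # first class, female, ten years older

# Euclidean distance between passenger a and each of the others
dist_ab = np.linalg.norm(a - b)   # differs in both class and sex
dist_ac = np.linalg.norm(a - c)   # differs only in age

print(dist_ab, dist_ac)
```

Passenger `a` ends up closer to `b` than to `c`, even though `a` and `c` share the same class and sex. A ten-year age gap outweighs every other difference.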

Let's fix this by scaling and shifting our feature values.

In [37]:
np.mean(age)

nan

Oh no! Looks like something is strange...

In [38]:
age

array([ 22.  ,  38.  ,  26.  ,  35.  ,  35.  ,    nan,  54.  ,   2.  ,
        27.  ,  14.  ,   4.  ,  58.  ,  20.  ,  39.  ,  14.  ,  55.  ,
         2.  ,    nan,  31.  ,    nan,  35.  ,  34.  ,  15.  ,  28.  ,
         8.  ,  38.  ,    nan,  19.  ,    nan,    nan,  40.  ,    nan,
          nan,  66.  ,  28.  ,  42.  ,    nan,  21.  ,  18.  ,  14.  ,
        40.  ,  27.  ,    nan,   3.  ,  19.  ,    nan,    nan,    nan,
          nan,  18.  ,   7.  ,  21.  ,  49.  ,  29.  ,  65.  ,    nan,
        21.  ,  28.5 ,   5.  ,  11.  ,  22.  ,  38.  ,  45.  ,   4.  ,
          nan,    nan,  29.  ,  19.  ,  17.  ,  26.  ,  32.  ,  16.  ,
        21.  ,  26.  ,  32.  ,  25.  ,    nan,    nan,   0.83,  30.  ,
        22.  ,  29.  ,    nan,  28.  ,  17.  ,  33.  ,  16.  ,    nan,
        23.  ,  24.  ,  29.  ,  20.  ,  46.  ,  26.  ,  59.  ,    nan,
        71.  ,  23.  ,  34.  ,  34.  ,  28.  ,    nan,  21.  ,  33.  ,
        37.  ,  28.  ,  21.  ,    nan,  38.  ,    nan,  47.  ,  14.5 ,
      

Looks like we have some missing values in age. Let's fill them in with the mean value.
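
Two NumPy behaviours are at play here, sketched below on a small made-up array (an assumption, not the real Age column): `np.mean` returns NaN as soon as one entry is NaN, while `np.nanmean` skips the missing entries.

```python
import numpy as np

# small made-up array with missing entries (not the real Age column)
ages = np.array([22.0, np.nan, 26.0, 35.0, np.nan])

# np.mean propagates NaN, so one missing entry makes the whole result NaN
plain_mean = np.mean(ages)

# np.nanmean computes the mean over the non-missing entries only
valid_mean = np.nanmean(ages)

# replace each missing entry with the mean of the valid ones
filled = np.where(np.isnan(ages), valid_mean, ages)
print(plain_mean, valid_mean, filled)
```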

In [39]:
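# fill in missing values in age
num_valid = 0
sum_age = 0

for i in range(m):
    # if element in age is not missing, count it and add it to the running sum
    if not np.isnan(age[i]):
        num_valid += 1
        sum_age += age[i]

# mean over the valid (non-missing) ages only
mean_age = sum_age / num_valid
print(mean_age)

# replace each missing age with the mean of the valid ages
age = np.where(np.isnan(age), mean_age, age)

# shift and scale age to range 0 -> 1 (min-max scaling)
age = (age - np.min(age)) / (np.max(age) - np.min(age))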


29.69911764705882
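
As a rough sketch of where this is heading, the scaling step can be wrapped in a small helper and the features stacked into one matrix for `KNeighborsClassifier`. The helper name `min_max_scale`, the `*_demo` arrays, and `n_neighbors=3` are all assumptions made for this example. The values are made up and stand in for the cleaned Titanic features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def min_max_scale(x):
    """Shift and scale a 1-D numeric array to the range 0 -> 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# made-up values standing in for the cleaned features and labels
pclass_demo = np.array([3, 1, 3, 1, 3, 2])
sex_demo = np.array([1, 0, 0, 0, 1, 1])
age_demo = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0])
y_demo = np.array([0, 1, 1, 1, 0, 0])

# one row per passenger, every column on the range 0 -> 1
X_demo = np.column_stack([min_max_scale(pclass_demo), sex_demo, min_max_scale(age_demo)])

# n_neighbors=3 is an arbitrary choice for this sketch
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_demo, y_demo)
print(knn.predict(X_demo[:2]))
```

Writing the scaler by hand follows the plan's note about not relying on sklearn's `normalize`. The helper assumes the column is not constant, since a constant column would divide by zero.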