# KNN

## Motivation

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of its training data.

~[scikit-learn](http://scikit-learn.org/stable/modules/neighbors.html)

It's a beautiful day in this neighborhood,
A beautiful day for a neighbor.
Would you be mine?
Could you be mine?

~ Mr. Rogers

**Readings**: 
* openCV: http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_ml/py_knn/py_knn_understanding/py_knn_understanding.html
* dataquest: https://www.dataquest.io/blog/k-nearest-neighbors/  
* k-d tree: https://ashokharnal.wordpress.com/2015/01/20/a-working-example-of-k-d-tree-formation-and-k-nearest-neighbor-algorithms/
* euclidean: http://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
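
The nearest neighbor principle from the Motivation can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the helper names `euclidean_distance` and `knn_predict` and the toy points are made up for this example.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with that point's label.
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.2, 1.4)))  # query near the first cluster
print(knn_predict(points, labels, (6.4, 6.2)))  # query near the second cluster
```

There is no training step beyond storing the points, which is what "remembering" the training data means here.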


## Data

In [120]:
import pandas
import numpy
import csv
#from scipy.stats import mode
from sklearn import neighbors
from sklearn.neighbors import DistanceMetric 
from pprint import pprint

MY_TITANIC_TRAIN = 'train.csv'
MY_TITANIC_TEST = 'test.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)
print('length: {0} '.format(len(titanic_dataframe)))
titanic_dataframe.head(5)

length: 891 


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


* Remove Columns

In [121]:
titanic_dataframe.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
print('dropped')

dropped


In [122]:
titanic_dataframe.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


* Which are the factors?

In [123]:
titanic_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.7+ KB
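
The `info()` output answers the question above by eye: the `object` columns (`Sex`, `Embarked`) are the categorical factors, and `Age` and `Embarked` have missing values. The same check can be done programmatically. A small sketch, using a made-up four-row frame (`sample`) rather than the real data:

```python
import pandas

# Hypothetical four-row frame that mimics a few Titanic columns.
sample = pandas.DataFrame({
    'Pclass': [3, 1, 3, 1],
    'Sex': ['male', 'female', 'female', 'female'],
    'Age': [22.0, 38.0, None, 35.0],
    'Embarked': ['S', 'C', 'S', None],
})

# Non-numeric columns are the candidates for categorical factors.
factor_columns = sample.select_dtypes(exclude='number').columns.tolist()

# Count missing values per column and keep only the columns that have any.
missing_counts = sample.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

print(factor_columns)
print(missing_counts.to_dict())
```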


## Pre-Processing

In [124]:
# Fill missing ages with the mean age of the training set.
# Assigning the result back avoids relying on chained inplace modification.
titanic_dataframe['Age'] = titanic_dataframe['Age'].fillna(titanic_dataframe['Age'].mean())
titanic_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.7+ KB
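
The notebook fills missing ages with the column mean using pandas. An alternative sketch, assuming scikit-learn's `SimpleImputer` is available, learns the mean from the training column once and reuses it for the test column, so both sets are filled with the same value. The ages below are made up for illustration.

```python
import numpy
from sklearn.impute import SimpleImputer

# Made-up ages; numpy.nan marks the missing entries.
train_age = numpy.array([[22.0], [38.0], [numpy.nan], [35.0]])
test_age = numpy.array([[numpy.nan], [40.0]])

# Learn the mean from the training column only.
imputer = SimpleImputer(strategy='mean')
train_filled = imputer.fit_transform(train_age)

# Reuse that same training mean for the test column.
test_filled = imputer.transform(test_age)

print(imputer.statistics_)
print(test_filled.ravel())
```

This differs from the notebook, which imputes the training and test sets with their own separate means.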


In [126]:
# Fill the two missing embarkation ports with the most common port.
titanic_dataframe['Embarked'] = titanic_dataframe['Embarked'].fillna(titanic_dataframe['Embarked'].mode()[0])

# Encode the categorical columns as integers.
titanic_dataframe['Port'] = titanic_dataframe['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).astype(int)
titanic_dataframe['Gender'] = titanic_dataframe['Sex'].map({'female': 0, 'male': 1}).astype(int)
titanic_dataframe = titanic_dataframe.drop(['Sex', 'Embarked', 'PassengerId'], axis=1)
titanic_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Port        891 non-null int64
Gender      891 non-null int64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
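
Mapping ports to the integers 1, 2, 3 gives them an order, so a distance-based model treats `C` as closer to `S` than to `Q`. One common alternative, sketched here on a made-up frame and not what the notebook does, is one indicator column per port via `pandas.get_dummies`:

```python
import pandas

# Made-up sample of embarkation ports.
sample = pandas.DataFrame({'Embarked': ['C', 'S', 'Q', 'S']})

# Integer codes, as in the notebook: C=1, S=2, Q=3.
sample['Port'] = sample['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).astype(int)

# One 0/1 indicator column per port instead of a single ordered code.
port_dummies = pandas.get_dummies(sample['Embarked'], prefix='Port', dtype=int)

print(sample['Port'].tolist())
print(port_dummies.columns.tolist())
print(port_dummies.values.tolist())
```

With indicator columns, every pair of distinct ports is the same distance apart.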


In [127]:
# Split the columns into features and target.
cols = titanic_dataframe.columns.tolist()

train_cols = [x for x in cols if x != 'Survived']
# Name the target explicitly instead of relying on it being the first column.
target_cols = ['Survived']

print(train_cols, target_cols)
train_data = titanic_dataframe[train_cols]
target_data = titanic_dataframe[target_cols]

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Port', 'Gender'] ['Survived']


In [128]:
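# Default settings: 5 neighbors, uniform weights, Euclidean distance (minkowski with p=2).
algorithm_data_model = neighbors.KNeighborsClassifier()
# ravel() flattens the single-column target into the 1-D array that fit() expects.
algorithm_data_model.fit(train_data.values, target_data.values.ravel())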

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
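
The classifier above uses the defaults (`n_neighbors=5`) on unscaled features. A hedged sketch of how one might compare a few values of k with scaling and cross-validation follows. It runs on synthetic data from `make_classification` as a stand-in for the Titanic features, and the candidate values 1, 5 and 15 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (7 columns, binary target).
features, target = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

scores_by_k = {}
for k in (1, 5, 15):
    # Scaling sits inside the pipeline, so each fold is scaled using only its training part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores_by_k[k] = cross_val_score(model, features, target, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k, best_k)
```

Scaling matters for KNN because a feature with a large range, such as a fare in the hundreds, would otherwise dominate the Euclidean distance.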

In [129]:
df_test = pandas.read_csv(MY_TITANIC_TEST)
# Keep the passenger ids for the submission file before dropping the column.
ids = df_test.PassengerId.values
df_test.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)

In [130]:
print(len(df_test))
df_test.info()

418
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        417 non-null float64
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB


In [131]:
# Fill missing ages and fares with the test-set means.
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].mean())
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].mean())

# Apply the same integer encoding as for the training data.
df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test['Port'] = df_test['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).astype(int)
df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass    418 non-null int64
Age       418 non-null float64
SibSp     418 non-null int64
Parch     418 non-null int64
Fare      418 non-null float64
Gender    418 non-null int64
Port      418 non-null int64
dtypes: float64(2), int64(5)
memory usage: 22.9 KB


In [132]:
titanic_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Port        891 non-null int64
Gender      891 non-null int64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB


In [133]:
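# Caution: the model was fitted on positional arrays ordered (..., 'Port', 'Gender'),
# but df_test ends with ('Gender', 'Port'), so these two features are swapped here.
output = algorithm_data_model.predict(df_test).astype(int)
print(output[:10])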

[0 0 0 0 0 0 0 1 0 1]
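
The model was fitted on a plain array whose columns follow the printed `train_cols` order (ending in `Port`, `Gender`), while `df_test` ends in `Gender`, `Port`. Because the fit used positions rather than names, the two features are read in swapped positions at prediction time. A minimal sketch of aligning a test frame to the training order by name, using one made-up row:

```python
import pandas

# Column order used when the model was fitted (the notebook's printed train_cols).
train_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Port', 'Gender']

# Made-up test row whose last two columns are in the other order.
test_frame = pandas.DataFrame(
    [[3, 34.5, 0, 0, 7.8292, 1, 3]],
    columns=['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Gender', 'Port'],
)

# Selecting by name reorders the columns; values stay attached to their labels.
aligned = test_frame[train_cols]

print(aligned.columns.tolist())
print(aligned.values.tolist())
```

Passing `aligned.values` to `predict` would then match the layout used in `fit`. The predictions shown in this notebook come from the unaligned frame.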


In [134]:
result = numpy.c_[ids.astype(int), output]

In [135]:
print(result)

[[ 892    0]
 [ 893    0]
 [ 894    0]
 [ 895    0]
 [ 896    0]
 [ 897    0]
 [ 898    0]
 [ 899    1]
 [ 900    0]
 [ 901    1]
 [ 902    0]
 [ 903    0]
 [ 904    1]
 [ 905    0]
 [ 906    1]
 [ 907    1]
 [ 908    0]
 [ 909    0]
 [ 910    0]
 [ 911    0]
 [ 912    0]
 [ 913    1]
 [ 914    1]
 [ 915    1]
 [ 916    1]
 [ 917    0]
 [ 918    1]
 [ 919    0]
 [ 920    0]
 [ 921    0]
 [ 922    1]
 [ 923    1]
 [ 924    0]
 [ 925    0]
 [ 926    1]
 [ 927    0]
 [ 928    0]
 [ 929    0]
 [ 930    0]
 [ 931    1]
 [ 932    0]
 [ 933    0]
 [ 934    0]
 [ 935    0]
 [ 936    1]
 [ 937    0]
 [ 938    0]
 [ 939    0]
 [ 940    1]
 [ 941    0]
 [ 942    1]
 [ 943    0]
 [ 944    1]
 [ 945    1]
 [ 946    0]
 [ 947    0]
 [ 948    0]
 [ 949    0]
 [ 950    0]
 [ 951    1]
 [ 952    0]
 [ 953    1]
 [ 954    0]
 [ 955    0]
 [ 956    1]
 [ 957    0]
 [ 958    0]
 [ 959    0]
 [ 960    0]
 [ 961    1]
 [ 962    0]
 [ 963    0]
 [ 964    0]
 [ 965    0]
 [ 966    1]
 [ 967    0]
 [ 968    0]

In [138]:
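# The with-block closes the file even if writing fails;
# newline='' stops the csv module from emitting blank rows on Windows.
with open('ourpredictions.csv', 'w', newline='') as prediction_file:
    writer = csv.writer(prediction_file)
    writer.writerow(['PassengerId', 'Survived'])
    writer.writerows(zip(ids, output))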
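
As an alternative to the `csv` module, the same two-column file can be produced with pandas. This sketch uses made-up ids and predictions in place of the notebook's `ids` and `output`, and writes to an in-memory buffer so it has no side effects.

```python
import io

import pandas

# Made-up ids and predictions standing in for the notebook's `ids` and `output`.
ids = [892, 893, 894]
output = [0, 1, 0]

submission = pandas.DataFrame({'PassengerId': ids, 'Survived': output})

# Write to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

print(buffer.getvalue())
```

`index=False` keeps the row index out of the file, so the file contains only the `PassengerId` and `Survived` columns.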

In [143]:

%timeit algorithm_data_model.predict(df_test).astype(int)


1000 loops, best of 3: 1.64 ms per loop
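
Outside IPython, a similar measurement can be taken with `time.perf_counter`. The sketch below also compares the three neighbor-search strategies that `KNeighborsClassifier` accepts through its `algorithm` parameter. It uses random stand-in data with the same shapes as the Titanic sets (891 training rows, 418 test rows, 7 features), so the timings are only indicative.

```python
import time

import numpy
from sklearn.neighbors import KNeighborsClassifier

rng = numpy.random.default_rng(0)
# Random stand-in data: 891 training rows and 418 query rows with 7 features.
train_x = rng.normal(size=(891, 7))
train_y = (train_x[:, 0] + train_x[:, 1] > 0).astype(int)
query_x = rng.normal(size=(418, 7))

timings = {}
predictions = {}
for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    model.fit(train_x, train_y)
    # Time only the prediction step, as %timeit does in the cell above.
    start = time.perf_counter()
    predictions[algorithm] = model.predict(query_x)
    timings[algorithm] = time.perf_counter() - start

print(timings)
```

A single run like this is noisier than `%timeit`, which repeats the call many times and reports the best result.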


Prediction is quick: `%timeit` reports about 1.64 ms per loop to classify all 418 test passengers.

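Kaggle placement: 66.029% accuracy, 3727th place.
That seems about right for how we are calculating this: the classifier runs with its default settings (5 neighbors, Euclidean distance), the features are not scaled, so `Fare` (range 0 to 512) likely dominates the distance, and the test frame has `Gender` and `Port` in the opposite order from the training columns.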