# Classification with _K-nearest neighbor_

We are trying to predict who survived the sinking of the Titanic.

In [2]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

In [3]:
df = pd.read_csv('titanic.csv')
df = df[['Survived','Pclass', 'Age', 'SibSp', 'Sex', 'Parch']] #make this selection before you dropna
df = df.dropna() #first get rid of rows with empty cells 
df.head()
df['Survived'].value_counts()

0    424
1    290
Name: Survived, dtype: int64

Unfortunately, most passengers in this sample did not survive: 424 of 714, or about 59%.

Now we need to create dummy variables for our categorical variable, `Sex`.

In [4]:
dummies = pd.get_dummies(df['Sex'])
dummies
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns (axis=0 is rows)
df.head()

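| | Survived | Pclass | Age | SibSp | Sex | Parch | female | male |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | male | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | female | 0 | 1 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | female | 0 | 1 | 0 |
| 3 | 1 | 1 | 35.0 | 1 | female | 0 | 1 | 0 |
| 4 | 0 | 3 | 35.0 | 0 | male | 0 | 0 | 1 |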


In [5]:
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

X = df[['Age', 'Pclass', 'SibSp', 'Parch', 'female']] #create the X matrix
#X = normalize(X) #optional, left disabled here; note that normalize rescales each row (passenger) to unit length, not each column

y = df['Survived'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
#the order of the above variables is very important
X_train.head()

| | Age | Pclass | SibSp | Parch | female |
|---|---|---|---|---|---|
| 641 | 24.0 | 1 | 0 | 0 | 1 |
| 433 | 17.0 | 3 | 0 | 0 | 0 |
| 202 | 34.0 | 3 | 0 | 0 | 0 |
| 585 | 18.0 | 1 | 0 | 2 | 1 |
| 544 | 50.0 | 1 | 1 | 0 | 0 |
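
KNN classifies by distance, so a column with a large range such as `Age` can outweigh the 0/1 and small-count columns. The sketch below is only an illustration and not part of the analysis above: it takes three rows copied from the `X_train.head()` output and compares `normalize`, which rescales each row, with `StandardScaler`, which rescales each column. The names `X_toy`, `X_rows`, and `X_cols` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Three rows from the X_train.head() output: Age, Pclass, SibSp, Parch, female
X_toy = np.array([
    [24.0, 1, 0, 0, 1],
    [17.0, 3, 0, 0, 0],
    [50.0, 1, 1, 0, 0],
])

# normalize() rescales each ROW to unit length, so Age still dominates every row
X_rows = normalize(X_toy)
row_norms = np.linalg.norm(X_rows, axis=1)

# StandardScaler rescales each COLUMN to mean 0 (and unit variance where the column varies)
X_cols = StandardScaler().fit_transform(X_toy)
col_means = X_cols.mean(axis=0)

print(row_norms)
print(col_means)
```

If scaling is used with a train/test split, the scaler would normally be fitted on `X_train` only and then applied to `X_test`, so that no information from the test set reaches the model.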


In [6]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=3) #create a KNN classifier with 3 neighbors (the default is 5)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
knn.score(X_test, y_test) #calculate the accuracy on the test data

0.7953488372093023

About 80% of the passengers in the test set are classified correctly (this is the accuracy), which is a good result. But accuracy alone does not show how well the model identifies each class, so let's look at the confusion matrix. A confusion matrix counts the predictions for each combination of actual class and predicted class.

In [7]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[111,  23],
       [ 21,  60]])

In [8]:
conf_matrix = pd.DataFrame(cm, index=['Not survived (actual)', 'Survived (actual)'], columns = ['Not survived (predicted)', 'Survived (predicted)']) 
conf_matrix

| | Not survived (predicted) | Survived (predicted) |
|---|---|---|
| Not survived (actual) | 111 | 23 |
| Survived (actual) | 21 | 60 |


The way to read this is that the rows are the actual outcomes and the columns are the predictions. Of the 81 passengers who actually survived, 60 were correctly predicted as 'survived' and 21 were wrongly predicted as not surviving. Of the 134 passengers who actually died, 111 were correctly predicted as 'not survived' and 23 were wrongly predicted as survivors. The _recall_ and _precision_ for the survival predictions:

$\text{recall} = \frac{60}{21 + 60} \approx 0.74$

$\text{precision} = \frac{60}{23 + 60} \approx 0.72$
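
The same arithmetic can be checked in code. This is a minimal sketch that types in the confusion matrix from the output above rather than recomputing it from the model; the variable names `tn`, `fp`, `fn`, and `tp` are chosen here for illustration.

```python
import numpy as np

# Confusion matrix from the output above: rows are actual classes, columns are predicted classes
cm = np.array([[111, 23],
               [21, 60]])

# Flattening row by row gives: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of actual survivors the model found
precision = tp / (tp + fp)       # share of predicted survivors who actually survived
accuracy = (tp + tn) / cm.sum()  # share of all test passengers classified correctly

print(f"recall = {recall:.2f}, precision = {precision:.2f}, accuracy = {accuracy:.3f}")
```

The accuracy computed this way, 171 correct out of 215, matches the value returned by `knn.score(X_test, y_test)`.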

We might improve our scores by trying out different values of _k_.

The precision is 72%

The recall is 74%
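
Trying out different values of _k_ could look roughly like the sketch below. So that it runs without `titanic.csv`, it uses synthetic stand-in data from `make_classification`, which is an assumption of this example; in the notebook, the existing `X_train`, `X_test`, `y_train`, and `y_test` would be used instead. The names `X_demo`, `y_demo`, `scores`, and `best_k` are made up here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the Titanic data);
# in the notebook, reuse X_train, X_test, y_train, y_test from above
X_demo, y_demo = make_classification(
    n_samples=700, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

scores = {}
for k in range(1, 16, 2):  # odd k avoids tied votes between the two classes
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)  # accuracy on the held-out data

best_k = max(scores, key=scores.get)  # k with the highest accuracy
print(scores)
print("best k:", best_k)
```

Picking _k_ by its score on the test set makes that score somewhat optimistic. A separate validation split or cross-validation is a safer way to choose _k_.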