# Execute the code below

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
titanic = sns.load_dataset('titanic')


Have a look at the titanic dataset.

In [0]:
titanic.shape

(891, 15)

In [0]:
titanic.head()

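|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |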


**We are going to learn machine learning concepts with the titanic dataset, which describes the passengers of one of the most infamous shipwrecks in history.**

# Data preprocessing


Machine learning is nothing without careful data preprocessing.  
Execute the code below, which modifies the titanic dataset by:

* Selecting a few useful features (i.e. columns)
* Removing rows with missing (NA) values
* Recoding the feature 'sex' (gender) into numerical data, because ML needs (and loves) numerical data

In [0]:
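# Keep the target and three features; .copy() avoids chained-assignment warnings
titanic = titanic[['survived', 'pclass', 'sex', 'age']].copy()
# Drop every row that contains a missing value
titanic = titanic.dropna(axis=0)
# Recode sex as numbers: male -> 0, female -> 1
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})
titanic.head()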

|   | survived | pclass | sex | age |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 |
| 1 | 1 | 1 | 1 | 38.0 |
| 2 | 1 | 3 | 1 | 26.0 |
| 3 | 1 | 1 | 1 | 35.0 |
| 4 | 0 | 3 | 0 | 35.0 |


# KNN classification with Scikit-Learn

## Train Test Split Data


First, split the titanic dataframe into two separate objects:
  - `y` with the feature to be predicted (i.e. survived)
  - `X` with the other features that will be used by the model

Then split `X` and `y` into training and testing sets:
* Use 75% of the data for training and the rest for testing
* Don't forget to shuffle the data randomly when splitting (set `random_state` to make the split reproducible)

[See the previous quest on train-test split if needed](https://odyssey.wildcodeschool.com/quests/581)

In [0]:
# Your code here
from sklearn.model_selection import train_test_split
y = titanic['survived']
X = titanic[['pclass', 'sex', 'age']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size = 0.75)
print("The length of the initial dataset is :", len(X))
print("The length of the train dataset is   :", len(X_train))
print("The length of the test dataset is    :", len(X_test))

The length of the initial dataset is : 714
The length of the train dataset is   : 535
The length of the test dataset is    : 179
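
To make the idea of a shuffled 75/25 split concrete, here is a minimal pure-Python sketch. The helper `shuffled_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates the principle of shuffling the row indices and cutting them in two. The 714 stand-in rows mirror the dataset size printed above, and the rows that end up in each part will differ from scikit-learn's.

```python
import random

def shuffled_split(rows, train_size=0.75, seed=42):
    """Shuffle the row indices, then cut them into a train part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)    # seeded shuffle, so the split is reproducible
    n_train = int(len(rows) * train_size)   # 75% of the rows, rounded down
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

rows = list(range(714))  # stand-in for the 714 rows left after removing NA values
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))  # 535 179
```

Every row lands in exactly one of the two parts, and running the cell again with the same seed gives the same split.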


## Model initialisation

CONGRATS !!! You are going to develop your first ML model for KNN classification.  
To do so, create a `model` object that instantiates the KNN classifier.


[More info here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [0]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
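
If you wonder what a KNN classifier does under the hood, the sketch below shows the basic idea in pure Python: measure the Euclidean distance from a new point to every training point, keep the `k` closest ones, and let them vote. The function `knn_predict` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the five training rows are simply the ones displayed by `titanic.head()` after preprocessing.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(row, query), label)   # Euclidean distance to each training row
        for row, label in zip(train_X, train_y)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Five rows from titanic.head(): (pclass, sex, age) -> survived
train_X = [(3, 0, 22.0), (1, 1, 38.0), (3, 1, 26.0), (1, 1, 35.0), (3, 0, 35.0)]
train_y = [0, 1, 1, 1, 0]

# A 28-year-old man travelling in third class, judged by his 3 nearest neighbours
print(knn_predict(train_X, train_y, query=(3, 0, 28.0), k=3))  # prints 0
```

Note that age dominates the distance here because its values are much larger than those of `pclass` and `sex`.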

## Model fitting


Now you have to fit your model on the training data.

[More info here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [0]:
# Your code here
model = model.fit(X_train, y_train)
print(model)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')


## Make predictions

Your model is ready for prediction!

In [0]:
# Your code here
predictions = model.predict(X_test)
print(predictions)

[1 1 1 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0
 0 0 0 0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0
 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0
 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 0]


Make a prediction for yourself!  
Fill in the data below and evaluate your chance of survival ...

In [0]:
# Your code here
my_class = 3
my_sex = 0
my_age = 28
# Build a one-row DataFrame with the same column names as the training data
my_data = pd.DataFrame([[my_class, my_sex, my_age]], columns=X.columns)
model.predict(my_data)

array([0])
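
`predict` only returns the winning class. To get something closer to a "chance of survival", `KNeighborsClassifier` also offers `predict_proba`, which with the default uniform weights returns the share of neighbours found in each class. The sketch below is a toy illustration: `toy_X`, `toy_y` and `toy_model` are names made up for this example, and the training set is just the five rows shown by `titanic.head()`, so the numbers say nothing about the real model.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: the five rows shown by titanic.head(), as (pclass, sex, age)
toy_X = [[3, 0, 22.0], [1, 1, 38.0], [3, 1, 26.0], [1, 1, 35.0], [3, 0, 35.0]]
toy_y = [0, 1, 1, 1, 0]

toy_model = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

# One row per passenger, one column per class listed in toy_model.classes_
proba = toy_model.predict_proba([[3, 0, 28]])
print(toy_model.classes_, proba)
```

With the notebook's own model, the equivalent call would be `model.predict_proba(my_data)`.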

## Model evaluation

Last but not least, you should evaluate the accuracy of your model.  
The metric `accuracy_score` is imported directly from `sklearn.metrics`.  
Remember that other metrics are available to evaluate classification models, such as precision, recall and F1 score; all of them can be computed from the `confusion matrix`.

In [0]:
# Your code here
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(y_test, predictions)
conf_mtrx = confusion_matrix(y_test, predictions)

print("accuracy = %.3f" % accuracy)
print("Confusion matrix:\n", conf_mtrx)

accuracy = 0.726
Confusion matrix:
 [[84 23]
 [26 46]]
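
As a worked example of how the other metrics follow from a confusion matrix, the sketch below starts from the matrix that this notebook reports for `n_neighbors = 5` (the default value used here). It assumes scikit-learn's layout, where rows are the true classes and columns the predicted classes, and it treats class 1 (survived) as the positive class; that choice is an assumption of this example.

```python
# Confusion matrix reported in this notebook for n_neighbors = 5
# Rows = true class, columns = predicted class (scikit-learn convention)
conf = [[84, 23],
        [26, 46]]

tn, fp = conf[0]  # true class 0: correctly predicted 0, wrongly predicted 1
fn, tp = conf[1]  # true class 1: wrongly predicted 0, correctly predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # among predicted survivors, how many really survived
recall = tp / (tp + fn)      # among real survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.726 precision=0.667 recall=0.639 f1=0.652
```

The accuracy matches the value printed by `accuracy_score`, while precision and recall show that the model finds fewer than two thirds of the actual survivors.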


The performance of our model is fairly poor (about 73% accuracy), which may be partly explained by the original dataset.  
Another way to improve it is to tune a hyperparameter such as the number of neighbors ...

## Hyperparameter


Let's play with the `n_neighbors` hyperparameter of the model.  
* Evaluate the score of your model for each value of `n_neighbors` from 2 to 10.
* Which value of `n_neighbors` leads to the best score?

In [0]:
# Your code here
for i in range(2, 11):
  model = KNeighborsClassifier(n_neighbors = i).fit(X_train, y_train)
  predictions = model.predict(X_test)
  accuracy = accuracy_score(y_test, predictions) 
  conf_mtrx = confusion_matrix(y_test, predictions)
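  # Row 0 / column 0 is class 0 ("did not survive"), so these two values
  # are the precision and recall for class 0, not for the survivors
  precision = conf_mtrx[0, 0] / conf_mtrx[:, 0].sum()
  recall = conf_mtrx[0, 0] / conf_mtrx[0, :].sum()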
  print("Model with n_neighbors =", i)
  print("Accuracy =", accuracy)
  print("Precision =", precision)
  print("Recall =", recall)
  print("Confusion matrix:\n", conf_mtrx, "\n")

Model with n_neighbors = 2
Accuracy = 0.7206703910614525
Precision = 0.7111111111111111
Recall = 0.897196261682243
Confusion matrix:
 [[96 11]
 [39 33]] 

Model with n_neighbors = 3
Accuracy = 0.7262569832402235
Precision = 0.7416666666666667
Recall = 0.8317757009345794
Confusion matrix:
 [[89 18]
 [31 41]] 

Model with n_neighbors = 4
Accuracy = 0.7150837988826816
Precision = 0.71875
Recall = 0.8598130841121495
Confusion matrix:
 [[92 15]
 [36 36]] 

Model with n_neighbors = 5
Accuracy = 0.7262569832402235
Precision = 0.7636363636363637
Recall = 0.7850467289719626
Confusion matrix:
 [[84 23]
 [26 46]] 

Model with n_neighbors = 6
Accuracy = 0.6927374301675978
Precision = 0.6940298507462687
Recall = 0.8691588785046729
Confusion matrix:
 [[93 14]
 [41 31]] 

Model with n_neighbors = 7
Accuracy = 0.7318435754189944
Precision = 0.7521367521367521
Recall = 0.822429906542056
Confusion matrix:
 [[88 19]
 [29 43]] 

Model with n_neighbors = 8
Accuracy = 0.7039106145251397
Precision = 0.701492
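
Rather than reading the best value off the printed output, you can let Python pick it. The sketch below is only an illustration: the dictionary `accuracy_by_k` is a hypothetical name, filled by hand with the accuracies printed above (rounded to four decimals) for `n_neighbors` from 2 to 8, the values visible in this output.

```python
# Accuracies printed above, rounded to four decimals (n_neighbors from 2 to 8)
accuracy_by_k = {2: 0.7207, 3: 0.7263, 4: 0.7151, 5: 0.7263,
                 6: 0.6927, 7: 0.7318, 8: 0.7039}

# Pick the key (number of neighbours) whose accuracy is the highest
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(best_k, accuracy_by_k[best_k])  # 7 0.7318
```

In the loop itself, the same dictionary could be filled with `accuracy_by_k[i] = accuracy` at each iteration. Keep in mind that choosing a hyperparameter on the test set gives an optimistic score; a separate validation set or cross-validation is the usual remedy.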

# Conclusions
## The model with **7 neighbours** seems to be the **best** one.



* Congrats !!! You just landed on the MACHINE LEARNING planet
* The KNN classifier is an algorithm from the supervised learning part of ML
* Scikit-learn is the to-know-and-to-love toolbox for ML
* Our KNN classifier could be improved with hyperparameter tuning
* Other algorithms should be tested to select the best one, but that is another story ... to be continued, ML Data Wilders :)