# 🚢 Titanic classifier

🎯 In this challenge, the goal is to use SVM classifiers to predict whether a passenger survived (measured with the accuracy score), and to compare your performance with your buddy of the day on an unseen test set that you will both share. Be aware that you only get one attempt on the test set!

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(8)

🚢 Import the `Titanic dataset`:

In [3]:
# YOUR CODE HERE
url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv"
data = pd.read_csv(url)
data

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.2500,0,0,1,0,0,0,1
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,1,3,26.0,0,0,7.9250,1,0,1,0,0,0,1
3,1,1,35.0,1,0,53.1000,1,1,0,0,0,0,1
4,0,3,35.0,0,0,8.0500,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,0,3,39.0,0,5,29.1250,1,0,1,0,0,1,0
710,0,2,27.0,0,0,13.0000,0,0,0,0,0,0,1
711,1,1,19.0,0,0,30.0000,1,1,0,0,0,0,1
712,1,1,26.0,0,0,30.0000,0,1,0,0,1,0,0


❓ **Question (Game Plan)** ❓

Write down below, in plain English, the different steps you are going to perform to answer the question.

In [None]:
# GOAL: Use an SVM classifier to predict whether a passenger survived or not.
# Clean the data
# Train/Test split
# Feature Encoding
# Feature Scaling
# Model
# Model Tuning
# Performance analysis on test set

<details><summary>👨🏻‍🏫 <i>Read our suggested answer here</i></summary>
    
    
0. 🧹 Data Cleaning
1. ✂️ Train/Test Split
2. 🔡 Feature Encoding
3. ⚖️ Feature Scaling
4. 🐣 A first model
5. 🤖 Model Tuning: Cross-Validated RandomSearch (Coarse Grain approach first, Fine Grain afterwards)
6. 🕵🏻 True performance analysis on the test set
</details>

## (0) 🧹 Data Cleaning

❓ **Question (Duplicated rows)** ❓

Are there any duplicated rows? If so, drop them.

In [4]:
# YOUR CODE HERE
len(data)

714

In [5]:
data.duplicated().sum()

38

In [6]:
data.drop_duplicates(inplace=True)
len(data)

676

❓ **Question (Missing values)** ❓

Which columns have missing values?

Drop a column if it has too many missing values; otherwise, impute them.

In [7]:
# YOUR CODE HERE
data.isnull().sum().sort_values(ascending=False)

survived                   0
pclass                     0
age                        0
sibsp                      0
parch                      0
fare                       0
sex_female                 0
class_First                0
class_Third                0
who_child                  0
embark_town_Cherbourg      0
embark_town_Queenstown     0
embark_town_Southampton    0
dtype: int64

## (1) ✂️ Holdout

❓ **Question (Train-Test-Split)** ❓ 

* Hold out 30% of your dataset as the test set for a final evaluation
    * Use `random_state=0` so that you can compare your final results with your buddy's results

In [15]:
# YOUR CODE HERE
X = data[['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex_female',
       'class_First', 'class_Third', 'who_child', 'embark_town_Cherbourg',
       'embark_town_Queenstown', 'embark_town_Southampton']]
y = data['survived']


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## (2) 🔡 Encoding (the categorical variables)

✅ **Encoding the target**

👇 Your target is whether a passenger `survived` or `died`. It is already encoded for you (`1` = survived, `0` = died), as shown below.

In [None]:
data.survived.value_counts()

❓ **Question (Encoding the categorical features)** ❓

In [None]:
# YOUR CODE HERE

In [None]:
##############################################
# SOLUTION 1 - Scikit Learn - OneHot Encoder #
##############################################

from sklearn.preprocessing import OneHotEncoder

# Assumption: list the raw categorical columns here. This dataset is already
# encoded (sex_female, class_*, embark_town_*), so the list stays empty.
categorical_features = []

ohe = OneHotEncoder(drop="if_binary",  # Doesn't create an extra column for binary features
                    sparse_output=False,  # Returns dense arrays (this argument was named `sparse` before scikit-learn 1.2)
                    handle_unknown="ignore")  # Encodes unseen test-set categories as all zeros

if categorical_features:  # Nothing to fit when there are no categorical columns
    ohe.fit(X_train[categorical_features])
    print(ohe.categories_)

## (3) ⚖️ Feature Scaling

❓ **Question (Scaling)** ❓

Scale *both* your training set and your test set using the scaler of your choice

In [None]:
# YOUR CODE HERE
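
One possible way to approach the scaling step is sketched below. It assumes `StandardScaler` as the scaler of choice and uses synthetic arrays (`X_train_demo`, `X_test_demo`, both hypothetical names) as stand-ins for your real `X_train` and `X_test`. The point it illustrates is that the scaler is fitted on the training set only and then reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train / X_test (assumption: not the Titanic data)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=30.0, scale=12.0, size=(100, 3))
X_test_demo = rng.normal(loc=30.0, scale=12.0, size=(40, 3))

scaler = StandardScaler()

# Learn the mean and standard deviation from the training set only
X_train_scaled = scaler.fit_transform(X_train_demo)

# Reuse the training statistics on the test set (no refit, to avoid leakage)
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.mean(axis=0).round(6))  # per-feature mean of the scaled training set
print(X_train_scaled.std(axis=0).round(6))   # per-feature standard deviation
```

Other scalers such as `MinMaxScaler` or `RobustScaler` follow the same `fit` on train, `transform` on both pattern.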

## (4) 🐣 Baseline Model

❓ **Question (Starting with a simple model...)** ❓

Cross-validate a Linear SVC model as your baseline model, using the accuracy score. 

In [None]:
# YOUR CODE HERE
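
A minimal sketch of a cross-validated baseline could look like the following. It runs on synthetic data from `make_classification` (an assumption, standing in for the scaled training set), and the names `X_demo`, `y_demo` and `baseline` are hypothetical. Wrapping the scaler and the model in a pipeline is one option among others; it refits the scaler inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification data (assumption: stand-in for the training set)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

# Scaling happens inside the pipeline, so each fold is scaled on its own training part
baseline = make_pipeline(StandardScaler(), LinearSVC(max_iter=10_000))

# 5-fold cross-validation scored with accuracy
cv_results = cross_validate(baseline, X_demo, y_demo, cv=5, scoring="accuracy")
baseline_accuracy = cv_results["test_score"].mean()

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

The mean of the five fold scores gives a reference value that the tuned model should beat.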

## (5) 🧨 Random Search

❓ **Question (Optimizing a Support Vector Classifier)** ❓

*  Use a **RandomizedSearchCV** to optimize both the parameters `kernel` and `C` of an SVM classifier (SVC)
    - Start with a total of `n_iter=100` combinations, cross-validated `cv=5` times each
    - Use `verbose=1` to check progress
    - Use `n_jobs=-1` to use all your CPU cores
    - (Optional) You can also optimize other parameters of your choice if you want to.

☣️ If the `RandomizedSearchCV` seems stuck after more than 10 seconds, run one search per SVM kernel instead. Scikit-Learn sometimes struggles when searching over multiple kernels at once.
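
The instructions above can be sketched roughly as follows. This is an illustration on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for the scaled training set), and the search space is an assumption: three kernels and a log-uniform distribution for `C` between 0.01 and 100. `n_iter` is reduced to 20 to keep the sketch quick; the challenge asks for 100.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data (assumption: stand-in for the scaled training set)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=5, random_state=0
)

# Model to optimize
model = SVC()

# Search space (assumed ranges): C is sampled on a log scale
search_space = {
    "kernel": ["linear", "rbf", "sigmoid"],
    "C": stats.loguniform(1e-2, 1e2),
}

search = RandomizedSearchCV(
    model,
    param_distributions=search_space,
    n_iter=20,            # the challenge suggests 100; lowered here for speed
    cv=5,                 # 5-fold cross-validation per sampled combination
    scoring="accuracy",
    verbose=1,            # prints progress
    n_jobs=-1,            # uses all available CPU cores
    random_state=0,       # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After a coarse search like this one, a finer search can narrow the range of `C` around the best value found.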

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

######################
# Instantiate model  #
######################



#################################
# Hyperparameters' search space #
#################################



################################
# Instantiate Random Search    #
################################


❓ **Question (Optimized Model and its performance)** ❓

* What are the best parameters?
* What is the best score?

In [None]:
# YOUR CODE HERE

## (6) 🕵️‍♀️ Final test score and Confusion Matrix

❓ **Question (Evaluating on the test set)** ❓

* Select the best model you want to test. You will compare your result with your buddy of the day!

* Compute its `accuracy`, `classification_report` and show the `confusion_matrix` on the test set.

☣️ You can only test one model. Once you have seen the test set, any further optimization would result in data leakage.

In [None]:
# YOUR CODE HERE
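
As a rough sketch of the final evaluation, the snippet below computes the three requested metrics. It uses synthetic data and a placeholder `SVC(kernel="rbf", C=1.0)` in place of your tuned model; these choices, along with the names `X_tr`, `X_te`, `y_tr`, `y_te` and `best_model`, are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data and a 70/30 holdout (assumption: stand-in for the Titanic split)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Placeholder for the tuned model (hypothetical hyperparameters)
best_model = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

# Predict once on the test set, then derive every metric from these predictions
y_pred = best_model.predict(X_te)

test_accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true labels, columns: predicted labels

print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_te, y_pred))
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total number of test samples equals the accuracy.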

❓ **Question (Confusion Matrix)** ❓

In [None]:
# YOUR CODE HERE

🏁 Congratulations! You were able to tackle a classification task from A to Z, cleaning your dataset, encoding and scaling your features, optimizing your model... !

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!