<h1 class="text-center">Introduction to Machine Learning: Titanic learning from the disaster</h1>
<h2 class="text-center">February, 2022</h2>


The purpose of this tutorial is to predict who will survive and who will die on the Titanic using passenger data (age, ticket price, class, etc.). The data come from a [Kaggle data science competition](https://www.kaggle.com/c/titanic). You will use Pandas to load and pre-process the data and Sklearn for the classification part. 


![](img/dicap_titanic.png)

- In Section I, exploratory data analysis, visualization and a basic prediction based on gender
- In Section II, data pre-processing: scaling, missing values and categorical data encoding
- In Section III, a first pipeline using Logistic Regression 
- In Section IV, a second pipeline using RandomForest
- 📜 The last section (V) is the evaluation. We will ask you to improve and explore other pipelines (XGBoost, Ensemble Learning, ...). You will have 2 weeks to do so and to share with us the code and a 1-page explanation of your method (more details in that section).

The code must be completed after each ❓ **Question** ❓. A blank cell with "HERE" as a comment marks where to write your code. Parameters that do not change the course of the story are accompanied by an "EDIT ME!" comment: you can change them right away or at the end of the section to see the effect.

You can also find some 🔴 HINTS 🔴 with associated links to documentation and useful functions.

In [None]:
import numpy as np # library for numerical analysis
import pandas as pd # library for data manipulation: data frame
import matplotlib.pyplot as plt # library for plotting
import seaborn as sns # advanced library for plotting

## Section I

Load the train CSV file using pandas and display the first 10 rows.

In [None]:
train = pd.read_csv("input/train.csv") 
train.head(10)

We got a column `Survived` that corresponds to the label we will try to predict.  
The `NaN` means that the value is missing. It is something we would need to investigate and correct. 

#### ❓ **Question** ❓ Now do the same with test data.
🔴 HINTS 🔴  
`input/test.csv`

In [None]:
# HERE

There is no `Survived` column in the test set, of course !

### Exploratory Data Analysis (EDA)

First we will explore the data and do some plotting to know better what we have at hand.

In [None]:
train.columns

In [None]:
train.describe(include="all")

* We have 891 training examples (passengers), that is quite limited but still OK to do Machine Learning.

* It seems that some values are missing (`NaN`). We need to know how many values are missing for each feature.

* Some features are categorical (e.g. `Sex`, `Pclass`, `Embarked`), some other numerical (`Age`, `Fare`, `SibSp`, `Parch`) and finally some alphanumeric (`Ticket`, `Cabin`). We would need to transform the categorical data so they can be processed by a classifier (only numerical data).

### Missing values
🔴 HINTS 🔴  
We will use [`pd.isnull`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) function to detect and count missing values.

In [None]:
print(pd.isnull(train).sum()/len(train)*100)

The `Cabin` feature is missing for 77.1% of the passengers, so we will probably drop it as too many values are missing. 

`Age` is probably an important feature and missing for ~20% of the passengers. We will try to fill the gaps.
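
As a minimal sketch of the missing-value count described above, here is the same idea on a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the Titanic file). Taking the mean of the boolean mask gives the same result as `sum()/len()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() returns a boolean table; the column-wise mean of booleans
# is the fraction of missing values in each column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

A column with a high percentage (like `Cabin` here) is a candidate for dropping, while a moderately incomplete one (like `Age`) is a candidate for imputation.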

### Intuitions on the data
* Based on the movie and on the custom "women and children first", women and kids are probably more likely to survive.
* People in first class are more likely to survive as their cabin is closer to the deck (top of the boat). 
* People traveling alone are more likely to survive as they did not have to wait for relatives that may be slower. 

Let's do some plotting to check our intuitions.

#### Male/Female
🔴 HINTS 🔴
- With Pandas to select a column you can simply use: `train["Sex"]`
- [`df.value_counts(normalize = True)`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) counts unique values and returns a normalized count
- [`df.plot(kind='bar')`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) will draw a bar plot of the values

Draw a first bar plot of the number of passengers by sex and compute the percentage of male passengers.

In [None]:
train["Sex"].value_counts(normalize = False).plot(kind='bar', ylabel='Number of passengers')
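# Select by label: positional access such as [0] on a label-indexed Series is deprecated in recent pandas
print("Percentage of male: {0:.2f}%".format(train["Sex"].value_counts(normalize = True)["male"]*100))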

#### Plot survival rate for women
Here we will use the [Matplotlib](https://matplotlib.org/) library directly and not through Pandas.

In [None]:
count = train["Survived"][train["Sex"] == 'female'].value_counts(normalize = False)
_ = plt.bar(x=count.index, height=count)
_ = plt.ylabel('Number of passengers')
_ = plt.xticks(ticks=[0,1], labels=['Dead', 'Survived'])

#### ❓ **Question** ❓ Do the same with `male`

In [None]:
# HERE

#### ❓ **Question** ❓ Print the proportions in percentage

In [None]:
# HERE

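So if we predict that all males will die and all females will survive, we would reach an accuracy of $0.6476\times(1-0.1889) + (1-0.6476)\times 0.7420 \approx 0.787$, i.e. 78.7% (64.76% of the passengers are male, 18.89% of the men and 74.20% of the women survived). 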

Not bad ! Will be hard to beat !
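
The arithmetic above can be checked in a few lines. This is only a sketch that re-uses the rounded figures quoted in the text; the variable names are ours.

```python
# Rounded figures quoted in the text above
p_male = 0.6476       # share of male passengers
surv_male = 0.1889    # survival rate among men
surv_female = 0.7420  # survival rate among women

# Rule: predict "dead" for every man and "survived" for every woman.
# Men are classified correctly when they died, women when they survived.
acc = p_male * (1 - surv_male) + (1 - p_male) * surv_female
print("Accuracy of the gender rule: {0:.1f}%".format(acc * 100))
```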

#### Passenger class feature: `Pclass`
We will do the same analysis with the passenger class, this time using an advanced library for plotting: [Seaborn](https://seaborn.pydata.org/).

In [None]:
sns.barplot(x="Pclass", y="Survived", data=train)

print("Percentage of Pclass = 1 who survived: {0:.2f}%".format(train["Survived"][train["Pclass"] == 1].value_counts(normalize = True)[1]*100))

print("Percentage of Pclass = 2 who survived: {0:.2f}%".format(train["Survived"][train["Pclass"] == 2].value_counts(normalize = True)[1]*100))

print("Percentage of Pclass = 3 who survived: {0:.2f}%".format(train["Survived"][train["Pclass"] == 3].value_counts(normalize = True)[1]*100))


This could be intersected with the `Sex` feature to improve our first naive classifier.


🔴 HINTS 🔴  
It uses the [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby) function from Pandas.

In [None]:
train.groupby(['Pclass','Sex'])['Survived'].mean()

96.8% of women from first class survived and only 13.5% of men from third class.

#### ❓ **Question** ❓ Create other `groupby` like that to see if we can do better ! 
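
To make the mechanics of `groupby` concrete, here is a small sketch on made-up rows (the `toy` table is hypothetical). Since `Survived` is 0 or 1, the mean within each group is directly the survival rate of that group.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# One row per (Pclass, Sex) pair; the mean of a 0/1 column is a rate
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()
print(rates)

# A single group can be read with a (Pclass, Sex) tuple
print(rates.loc[(3, "female")])
```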

In [None]:
# HERE

## Section II: missing values and data pre-processing
In the `Cabin` feature many values are missing. It is very unlikely that the `Ticket` number contains any useful information.

#### ❓ **Question** ❓ Drop the `Cabin` and `Ticket` features.

🔴 HINTS 🔴   
Use the function [`df.drop('col_name', axis='columns')`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop) to drop a column

In [None]:
# HERE

### Encode non-numerical labels
In the `Name` feature each passenger has a title that we will use to infer age when it is missing. For instance usually *Miss* and *Master* refer to people of younger age than *Mrs.* or *Mr.*

We will encode the non-numerical labels to a numerical value: *Master* $\rightarrow$ 0, *Miss* $\rightarrow$ 1, ... 

In [None]:
train.Name.head()

🔴 HINTS 🔴   
We want the letters that follow a *space* and end just before a `.`  
We will use a **regular expression (regex)** on the strings with the [`extract`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html?highlight=extract#pandas.Series.str.extract) function. 

In [None]:
# Put train and test in a list to do it on both
combine = [train, test]

# For train and test do:
for dataset in combine:
    str_names = dataset.Name.str # String accessor of the Name column
    
    # Perform the regex on it: extract the letters that follow a space and stop before the .
    # expand=False returns a Series and not a DataFrame
    
    # Raw string (r'...') so that \. is passed to the regex engine unchanged
    titles = str_names.extract(r' ([A-Za-z]+)\.', expand=False) 
    # Put that in a new column
    dataset['Title'] = titles
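
To see what the regular expression above captures, here is a small sketch on three sample strings written in the dataset's `Surname, Title. First names` format (the `names` Series is built by hand for illustration).

```python
import pandas as pd

# Sample strings in the "Surname, Title. First names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# ' ([A-Za-z]+)\.' : a space, then a captured run of letters, then a literal dot
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

Only the part inside the parentheses of the pattern is returned, so the leading space and the trailing dot are not part of the extracted title.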

#### ❓ **Question** ❓  
Use the [`pd.crosstab(index, column)`](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html?highlight=cross%20tab#pandas.crosstab) function to create a cross tabulation between `Title` and `Sex`.

In [None]:
# HERE

Replace various titles with more common names

In [None]:
for dataset in combine:
    # 'Lady' is grouped with 'Royal' below, so it must not be replaced by 'Rare' first
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col',
    'Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
    
    dataset['Title'] = dataset['Title'].replace(['Countess', 'Lady', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

#### ❓ **Question** ❓  
Use the [preprocessing.LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) class from Sklearn to encode the title as a numerical value.

🔴 HINTS 🔴   
You can use `train["Title"].values` to extract the Title list in the form of an array (no longer a Pandas structure).  

You first need to define an encoder: `le = preprocessing.LabelEncoder()`, then `fit` it to some values and finally `transform` the titles and replace the values in the `Title` column (or create a new column and drop the `Title` column).
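
As a hedged illustration of the `fit` / `transform` steps described above, here is a sketch on a hand-written list of titles (the list is hypothetical, not read from the data). `LabelEncoder` sorts the distinct labels and numbers them from 0.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical list of titles, NOT read from the Titanic file
titles = ["Mr", "Miss", "Mrs", "Master", "Mr"]

le = LabelEncoder()
le.fit(titles)                 # learn the sorted set of distinct labels
codes = le.transform(titles)   # map each label to its integer code

print(list(le.classes_))
print(codes.tolist())
```

Fitting the encoder once and re-using the same fitted object for both train and test keeps the integer codes consistent between the two sets.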

In [None]:
# HERE

#### ❓ **Question** ❓  
Use `value_counts` again to count the occurrences of each title.

In [None]:
# HERE

In [None]:
train.head()

#### Filling Age missing values

First we need to discretize the ages. It does not matter if a passenger is 31 or 32, what matters is that the passenger is young.  

We have defined a first discretization with boundaries at `[0, 5, 12, 18, 24, 35, 60]` years (plus an `Unknown` bin for missing ages), and you can modify it later.

🔴 HINTS 🔴   

- `df["Column"].fillna(value)` replaces all the `NaN` values in `Column` by `value`.
- [`pd.cut`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html?highlight=cut#pandas.cut) can be used to cut our continuous Age data into segments.

In [None]:
train["Age"] = train["Age"].fillna(-0.5) 
test["Age"] = test["Age"].fillna(-0.5)

# The bins for the age group and corresponding labels
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf] # EDIT ME
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']

# Assign each passenger to an age group (missing ages, filled with -0.5, fall in 'Unknown')
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = labels)
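
The following sketch applies the same bins and labels to a handful of made-up ages (the `ages` Series is hypothetical) to show how `fillna` and `pd.cut` work together. By default the intervals are closed on the right, so the bins are `(-1, 0]`, `(0, 5]`, `(5, 12]`, and so on.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, NOT read from the Titanic file; NaN stands for a missing age
ages = pd.Series([0.8, 4.0, 17.0, 30.0, 71.0, np.nan]).fillna(-0.5)

bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']

# Each age is replaced by the label of the interval it falls in
groups = pd.cut(ages, bins, labels=labels)
print(groups.tolist())
```

The placeholder value -0.5 lands in the first interval `(-1, 0]`, which is why missing ages end up in the `Unknown` group.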

#### ❓ **Question** ❓  
Use Seaborn to draw a bar plot of AgeGroup vs Survived.

In [None]:
# HERE

We will now cross the age group with the title and use the most frequent age group for each title.

Fill the missing age values using a correspondence between `Title` and the mode of `AgeGroup` for that title.

1. Find the [mode](https://en.wikipedia.org/wiki/Mode_(statistics)) of `AgeGroup` for each `Title`.
2. Build a mapping from `Title` to `AgeGroup`.
3. Fill the missing values.

In [None]:
master_age = train[train["Title"] == 0]["AgeGroup"].mode() #Baby
miss_age = train[train["Title"] == 1]["AgeGroup"].mode() #Student
mr_age = train[train["Title"] == 2]["AgeGroup"].mode() #Young Adult
mrs_age = train[train["Title"] == 3]["AgeGroup"].mode() #Adult
rare_age = train[train["Title"] == 4]["AgeGroup"].mode() #Adult
royal_age = train[train["Title"] == 5]["AgeGroup"].mode() #Adult

In [None]:
age_title_mapping = {0: "Baby", 1: "Student", 2: "Young Adult", 3: "Adult", 4: "Adult", 5: "Adult"}

In [None]:
# For train
# Use .loc[row, column]: chained indexing (df[col][row] = ...) may write to a copy
for x in range(len(train["AgeGroup"])):
    if train.loc[x, "AgeGroup"] == "Unknown":
        train.loc[x, "AgeGroup"] = age_title_mapping[train.loc[x, "Title"]]

# For test
for x in range(len(test["AgeGroup"])):
    if test.loc[x, "AgeGroup"] == "Unknown":
        test.loc[x, "AgeGroup"] = age_title_mapping[test.loc[x, "Title"]]

### Embarked Feature: fill missing values

#### ❓ **Question** ❓  
How many people have embarked from Southampton (S), Cherbourg (C) and Queenstown (Q) ?

🔴 HINTS 🔴  
`value_counts`

In [None]:
# HERE

It's clear that the majority of people embarked in Southampton (S). We will fill in the missing values with S.

#### ❓ **Question** ❓  
Replace the missing values in the Embarked feature with S

🔴 HINTS 🔴   
`fillna`

In [None]:
# HERE

#### One Hot Encoding
We could encode S: 0, C: 1 and Q: 2, but that would imply an order and distances between ports (for instance that S is closer to C than to Q), which has no reason to hold in practice. So instead we will create 3 columns, one each for S, C and Q.
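
As a sketch of what one-hot encoding produces, here it is on a four-row made-up column (the `toy` table is hypothetical). Each row gets exactly one active indicator, so no artificial order between ports is introduced.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per distinct value, named Embarked_<value>
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Replace the original column by its indicator columns
toy = pd.concat([toy.drop("Embarked", axis="columns"), dummies], axis="columns")
print(toy)
```

Depending on the pandas version the indicator columns are booleans or 0/1 integers; both are accepted by scikit-learn estimators.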

#### ❓ **Question** ❓  
Create 3 columns that encode for S, C and Q.

🔴 HINTS 🔴   
[`pd.get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

In [None]:
# HERE

In [None]:
train.head()

#### ❓ **Question** ❓ Drop name feature
🔴 HINTS 🔴   
`df.drop(['Column_name'], axis='columns')` 

In [None]:
# HERE

#### ❓ **Question** ❓ Encode sex feature
🔴 HINTS 🔴   
`le = preprocessing.LabelEncoder()` 

In [None]:
# HERE

#### ❓ **Question** ❓ Drop fare values as it is redundant with class information
🔴 HINTS 🔴   
`df.drop(['Column_name'], axis='columns')` 

In [None]:
# HERE

#### ❓ **Question** ❓ Encode age group and drop age column
🔴 HINTS 🔴   
- `le = preprocessing.LabelEncoder()` 
- `df.drop(['Column_name'], axis='columns')` 

In [None]:
# HERE

In [None]:
train.head()

## Section III: Classification using [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression)

### [Train/validation split](https://scikit-learn.org/stable/modules/cross_validation.html) 
We will divide the training data in two sets:
- The train set to train the model on
- The validation set to estimate performance and track it

⚠️ The validation set is different from the test set defined above. The validation set is used to estimate the classification performance, while the test set is used in the competition (we don't have the corresponding labels).

In [None]:
from sklearn.model_selection import train_test_split

# Get labels
targets = train["Survived"] 
predictors = train.drop(['Survived', 'PassengerId'], axis=1) # PassengerId is only needed for the Kaggle submission
X_test = test.drop(['PassengerId'], axis=1)

# Use 20% of data as validation
x_train, x_val, y_train, y_val = train_test_split(predictors, targets, test_size = 0.20, random_state = 1)

### Train the model

Here we will use a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) from Sklearn. It lets you chain pre-processing operators (here standardization: zero mean and unit variance) with a classifier. 
You have to: 
1. Instantiate a classifier and set the hyper-parameters (here they are left to default)
2. Put the pre-processing and the classifier in a Pipeline
3. Train the pipeline. In scikit-learn, all classifiers have a `.fit(X_train, y_train)` method to train them.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = LogisticRegression()  # EDIT ME

model = make_pipeline(StandardScaler(), clf)

model.fit(x_train, y_train)

### Estimate performance
After training the model, you can call `model.predict(X)` to compute a prediction (call it on the pipeline rather than on `clf` alone, so that the scaling is applied first).  
With the validation set we get an estimate of the model's performance. Estimating the performance on the same set that was used for training would reward overfitting and give an over-optimistic score. It is the same when you take an exam: the questions are of the same kind as those seen during the lectures but not exactly the same, so that learning by heart is not enough.   

In [None]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(x_val)
acc_logreg = round(accuracy_score(y_val, y_pred) * 100, 2)  # signature is accuracy_score(y_true, y_pred)
print("Accuracy on the validation set: {0:.2f}%".format(acc_logreg))

Less than the guess on gender !!

### Test with another train/validation split

In [None]:
# Use 20% of data as validation
x_train, x_val, y_train, y_val = train_test_split(predictors, targets, test_size = 0.20, random_state = 55)

In [None]:
model = make_pipeline(StandardScaler(), clf)

model.fit(x_train, y_train)
acc_logreg = model.score(x_val, y_val) * 100
print("Accuracy on the validation set: {0:.2f}%".format(acc_logreg))

Now it's 1% better !! Depending on the train/validation split we can observe large differences in the estimated accuracy (here about 3%). We should therefore measure the variance of this estimate and take it into account.

[**Cross-validation**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) was designed to estimate this variance ! The training data are divided into **K** folds: **K-1** folds are used to train the model and the remaining fold to estimate performance. The folds used to train and validate the model are then rotated, so we obtain **K** performance estimates from **K** distinct training sets.

![](img/grid_search_cross_validation.png)
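
The rotation of the folds can be made visible with a short sketch. It uses scikit-learn's `KFold` splitter on ten dummy samples (the array `X` is a placeholder, not the Titanic features) and only prints which indices are used for training and validation in each round.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples, NOT the Titanic features
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # K = 5 folds, no shuffling
val_folds = []
for i, (train_idx, val_idx) in enumerate(kf.split(X)):
    # 4 folds (8 samples) for training, 1 fold (2 samples) for validation
    print("fold {0}: train on {1}, validate on {2}".format(i, train_idx.tolist(), val_idx.tolist()))
    val_folds.append(val_idx.tolist())
```

Every sample is used exactly once for validation and K-1 times for training, which is what makes the K scores comparable.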

#### ❓ **Question** ❓ Use [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) to perform 5-fold cross-validation for the accuracy estimation. 
🔴 HINTS 🔴   
- `n_jobs = -1` option allows to run the training for each fold in parallel.
- Another scoring method can be provided to the `scoring` argument

In [None]:
# HERE

In [None]:
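# With the default scoring, cross_val_score returns accuracies as fractions in [0, 1]: convert to percent
print("Mean accuracy on the 5-fold cross-validation : {0:.2f}%".format(scores.mean() * 100))
print("Standard deviation of the accuracy on the 5-fold cross-validation : {0:.2f}%".format(scores.std() * 100))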

#### Take part in the Kaggle competition 

In [None]:
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)

You can use this `submission.csv` file to take part in the Kaggle competition ! 

### Feature importance with Logistic Regression
Logistic Regression is a linear model followed by a non-linearity, the sigmoid function (more details in the slides of the lecture).

You can access the weights of the linear regression to estimate feature importance. It would provide some interpretability to the model.  


⚠️ This feature importance is associated with the model, it is not something that should be extrapolated further. 
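
To connect the weights to the predictions, here is a minimal sketch on synthetic data (generated with `make_classification`; the names `clf_demo`, `z` and `proba_manual` are ours). It recomputes the predicted probability by hand as the sigmoid of the linear score and compares it with `predict_proba`, assuming a binary problem and the default `LogisticRegression` settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, NOT the Titanic features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf_demo = LogisticRegression().fit(X, y)

# Linear part: weighted sum of the features plus the intercept
z = X @ clf_demo.coef_.ravel() + clf_demo.intercept_[0]

# Non-linearity: the sigmoid maps the score to a probability in (0, 1)
proba_manual = 1.0 / (1.0 + np.exp(-z))
proba_sklearn = clf_demo.predict_proba(X)[:, 1]

print(np.allclose(proba_manual, proba_sklearn))
```

A positive weight pushes the probability of the positive class up when the feature increases, a negative weight pushes it down. Weights are only comparable across features when the features are on the same scale, which is one reason for the `StandardScaler` step in the pipeline.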

In [None]:
print(model[1].coef_)

In [None]:
model[1].coef_.shape

In [None]:
x_train.columns

In [None]:
coefs = pd.DataFrame(
   model[1].coef_.T,
   columns=['Coefficients'], index=x_train.columns
)

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Logistic Regression model')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)

## Section IV: Classification using RandomForest

![](https://i.imgur.com/AC9Bq63.png)

#### ❓ **Question** ❓ Do the same but this time with a Random Forest classifier and using only 4 features: **"Pclass"**, **"Sex"**, **"SibSp"**, and **"Parch"**. 
🔴 HINTS 🔴   
- `from sklearn.ensemble import RandomForestClassifier`
- `rf = RandomForestClassifier(n_estimators=20, max_depth=2, max_features=2, random_state=1)` 


With random forest you can also plot feature importance: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

In [None]:
# HERE

## Section V: harder, better, faster, stronger

### 📜 Evaluation

You will be evaluated on two aspects: 
1. On the testing score: do your best and try to avoid overfitting !
2. A small report (1 page max) explaining your method and the choices you made: try to justify here the choice you made and why it improved the performance.

We expect your code (a standalone Python script or notebook) and the 1 page report in a *.zip* file by email to <ludovic.darmet@isae-supaero.fr> before the 15th of February. 

#### ❓ **Question** ❓ Try to do better !!
🔴 HINTS 🔴
* Set up a [cross-validation procedure](https://scikit-learn.org/stable/modules/cross_validation.html) for better performance estimation
* Use more features and [select them](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector)
* Optimize the hyper-parameters using [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)
* Try more advanced classification algorithms such as the [`Gradient Boosting classifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

In [None]:
# HERE