# Exploration of the Titanic data set

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Exploration

### Question 1

Load the Titanic dataset using `pandas`. It is located in `data/titanic.csv`. Using the `head()` and `info()` methods, which issues can you identify that need to be solved before training a machine learning model?

### Question 2

- By checking the variable `Survived`, is the dataset balanced? What is the chance-level accuracy?
- Which variables contain the most missing values?
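
As a hint for the chance-level question, the sketch below uses a small hypothetical label vector (not the Titanic data) to show one way of computing the class balance and the accuracy obtained by always predicting the majority class. The names `toy_labels`, `class_balance`, and `chance_level` are made up for illustration, and "chance level" is taken here to mean the majority-class baseline.

```python
import pandas as pd

# Hypothetical labels for illustration only; replace with data['Survived'].
toy_labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name='Survived')

# Proportion of each class: a balanced dataset would be close to 0.5 / 0.5.
class_balance = toy_labels.value_counts(normalize=True)

# Always predicting the most frequent class yields this accuracy.
chance_level = class_balance.max()

print(class_balance)
print('Chance level accuracy: {:.2f}%'.format(chance_level * 100))
```

Any model trained later should beat this baseline to be considered useful.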

### Question 3

Using the `pairplot` function of `seaborn` on the `Age`, `Pclass`, `Fare`, `Sex`, and `Survived` columns, form some intuitions about the relationship between survival and the features. Make some plots to confirm your intuitions.

## 2. Predicting survival

The Titanic dataset is a heterogeneous dataset, which gives the opportunity to show the pipelining features of scikit-learn. In this notebook, we will show how to build a simple classification pipeline. The aim is to predict whether or not a passenger survived the Titanic trip.

In [None]:
data = pd.read_csv('data/titanic.csv', index_col='PassengerId')

In [None]:
data.head()

First, we need to split the dataset into two arrays: the data array (the features) and the label array (the `Survived` target).

In [None]:
label = data['Survived']
data = data.drop(columns='Survived')

In [None]:
data.head()

In [None]:
label.head()

Because the Titanic dataset mixes data types, the continuous and categorical columns need different preprocessing. The `ColumnTransformer` of scikit-learn dispatches different preprocessing steps to different columns. Usually, categorical variables need to be encoded, while continuous variables can be standardized.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

We create three preprocessing steps:

* an ordinal encoding for the sex;
* a one hot encoding for the remaining categorical features;
* and a standardization for the continuous features.

In addition, missing values will be filled with either the median (for continuous variables) or a constant value (for categorical variables).

In [None]:
# Recent scikit-learn releases expect (transformer, columns) tuples.
preprocessor = make_column_transformer(
    (OrdinalEncoder(), ['Sex']),
    (make_pipeline(
        SimpleImputer(strategy='constant', fill_value='missing'),
        OneHotEncoder(handle_unknown='ignore')),
     ['Pclass', 'SibSp', 'Parch', 'Embarked']),
    # StandardScaler ignores NaN when fitting, so imputation can follow it.
    (make_pipeline(StandardScaler(), SimpleImputer(strategy='median')),
     ['Age', 'Fare'])
)
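
To see what such a column transformer produces, the following sketch applies an equivalent preprocessor to a hypothetical four-row frame that only borrows the Titanic column names. It assumes a recent scikit-learn release, in which each tuple is written as (transformer, columns); the names `toy` and `toy_preprocessor` and all values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical miniature dataset reusing the Titanic column names.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 3, 2],
    'SibSp': [1, 1, 0, 0],
    'Parch': [0, 0, 0, 1],
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

toy_preprocessor = make_column_transformer(
    (OrdinalEncoder(), ['Sex']),
    (make_pipeline(
        SimpleImputer(strategy='constant', fill_value='missing'),
        OneHotEncoder(handle_unknown='ignore')),
     ['Pclass', 'SibSp', 'Parch', 'Embarked']),
    (make_pipeline(StandardScaler(), SimpleImputer(strategy='median')),
     ['Age', 'Fare']),
)

transformed = toy_preprocessor.fit_transform(toy)

# 1 ordinal column + 10 one-hot columns (3 + 2 + 2 + 3) + 2 scaled columns.
print(transformed.shape)
```

The missing `Embarked` entry becomes its own `'missing'` category, which is why that column contributes three one-hot columns instead of two.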

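A logistic regression classifier will be used, and its `C` parameter will be optimized by a grid search. We will apply a 5-fold cross-validation scheme to estimate the accuracy of the model. Because the grid search itself runs an internal 5-fold cross-validation, the overall procedure is a nested cross-validation.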

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [None]:
pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs', max_iter=100000))

In [None]:
gridsearch = GridSearchCV(pipe, param_grid={'logisticregression__C': [0.1, 1, 10, 100]}, cv=5)

In [None]:
test_score = cross_val_score(gridsearch, data, label, n_jobs=1, cv=5)

In [None]:
print('Test score: {:.2f}% +- {:.2f}%'.format(test_score.mean() * 100,
                                              test_score.std() * 100))
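
Passing a `GridSearchCV` object to a cross-validation routine means that the grid search is refit inside every outer fold. The sketch below illustrates this on synthetic data from `make_classification`, used here only as a stand-in for the preprocessed Titanic features, and relies on `cross_validate` with `return_estimator=True` to inspect which `C` each outer fold selected. The names `X_toy`, `y_toy`, `inner_search`, and `best_C_per_fold` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate

# Synthetic stand-in for the preprocessed features (an assumption).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Inner loop: 5-fold grid search over C.
inner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1, 10, 100]},
    cv=5,
)

# Outer loop: each fold refits the whole grid search on its training part.
cv_results = cross_validate(
    inner_search, X_toy, y_toy, cv=5, return_estimator=True)

best_C_per_fold = [est.best_params_['C'] for est in cv_results['estimator']]
print('Selected C per outer fold:', best_C_per_fold)
print('Outer test scores:', cv_results['test_score'])
```

If the selected `C` varies widely between folds, the hyperparameter choice is probably not very stable on this amount of data.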