## Predicting Survival on the Titanic

### History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 people on board. By analysing the probability of survival based on a few attributes such as gender, age, and social status, we can make reasonably accurate predictions about which passengers would survive. Some groups, such as women, children, and the upper class, were more likely to survive than others, which tells us something about society's priorities and privileges at the time.

### Assignment:

Build a **Machine Learning Pipeline** to **engineer the features** in the data set and **predict** who is more likely to survive the catastrophe. Follow the Jupyter notebook below and complete the missing bits of code to achieve each of the pipeline steps.

In [37]:
import re
import pandas as pd  # to handle datasets
import numpy as np
import matplotlib.pyplot as plt  # for visualization
from sklearn.model_selection import train_test_split  # to split into train and test sets
from sklearn.preprocessing import StandardScaler  # feature scaling
from sklearn.linear_model import LogisticRegression  # to build the models
from sklearn.metrics import accuracy_score, roc_auc_score  # to evaluate the models
import joblib  # to persist the model and the scaler
from sklearn.pipeline import Pipeline  # pipeline
from sklearn.base import BaseEstimator, TransformerMixin  # for the preprocessors
from feature_engine.imputation import (  # for imputation
    CategoricalImputer,
    AddMissingIndicator,
    MeanMedianImputer)
from feature_engine.encoding import (   # for encoding categorical variables
    RareLabelEncoder,
    OneHotEncoder)

## Prepare the data set

In [56]:
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


**Replace question marks with NaN values!**

In [57]:
data = data.replace('?', np.nan)

**Retain only the first cabin if more than 1 are available per passenger!**

In [58]:
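def get_first_cabin(row):
    # keep only the first cabin when several are listed, e.g. 'C22 C26' -> 'C22'
    try:
        return row.split()[0]
    except (AttributeError, IndexError):  # missing value (NaN) or empty string
        return np.nan
data['cabin'] = data['cabin'].apply(get_first_cabin)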
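
As a quick check of the cabin step, the sketch below applies the same idea to a small made-up Series (the values are illustrative, not read from the data set) and compares it with a vectorised variant built on pandas string methods. The vectorised form is an assumption about style, not part of the assignment.

```python
import numpy as np
import pandas as pd

# toy cabin values (made up for illustration)
cabins = pd.Series(['C22 C26', 'B5', np.nan])

def first_cabin(value):
    # strings are split on whitespace; NaN has no .split and falls through
    try:
        return value.split()[0]
    except AttributeError:
        return np.nan

with_apply = cabins.apply(first_cabin)

# vectorised alternative: .str methods propagate missing values
vectorised = cabins.str.split().str[0]

print(with_apply.tolist())
print(vectorised.tolist())
```

Both versions keep 'C22' for the passenger with two cabins and leave the missing entry missing.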

**Extract the title (Mr, Mrs, etc.) from the name variable!**

In [59]:
def get_title(passenger):
    # test 'Mrs' before 'Mr', because the pattern 'Mr' also matches 'Mrs'
    if re.search('Mrs', passenger):
        return 'Mrs'
    elif re.search('Mr', passenger):
        return 'Mr'
    elif re.search('Miss', passenger):
        return 'Miss'
    elif re.search('Master', passenger):
        return 'Master'
    else:
        return 'Other'
data['title'] = data['name'].apply(get_title)
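
The order of the checks matters in the title step: the pattern 'Mr' also matches 'Mrs', so 'Mrs' has to be tested first. The sketch below reproduces that logic on the four names shown in the table output above. It also shows a hypothetical alternative, `get_title_by_position`, which is not part of the notebook and assumes every name follows the "Surname, Title. Given names" layout.

```python
import re

def get_title(name):
    # 'Mrs' must be tested first because the pattern 'Mr' also matches 'Mrs'
    for title in ('Mrs', 'Mr', 'Miss', 'Master'):
        if re.search(title, name):
            return title
    return 'Other'

def get_title_by_position(name):
    # hypothetical alternative: take the token between the comma and the next period
    match = re.search(r',\s*([^.,]+)\.', name)
    return match.group(1).strip() if match else 'Other'

names = [
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Allison, Mr. Hudson Joshua Creighton',
    'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
]
titles = [get_title(n) for n in names]
titles_by_position = [get_title_by_position(n) for n in names]
print(titles)
print(titles_by_position)
```

Note that the positional variant would return less common titles verbatim instead of mapping them to 'Other', so the two functions are not interchangeable on the full data set.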

**Cast numerical variables as floats!**

In [60]:
data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')
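
The cast is needed because a column that contained '?' placeholders holds strings rather than numbers, even after the placeholders are replaced. A minimal sketch on a made-up column (the values are illustrative) shows the two steps together:

```python
import numpy as np
import pandas as pd

# toy column as it might look when numbers are read alongside '?' placeholders
toy = pd.DataFrame({'age': ['29.0', '?', '0.9167']})

toy = toy.replace('?', np.nan)           # placeholders become missing values
toy['age'] = toy['age'].astype('float')  # remaining strings become floats

print(toy['age'].dtype)
print(toy['age'].tolist())
```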

**Drop unnecessary variables!**

In [61]:
data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)
data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,1,female,29.0,0,0,211.3375,B5,S,Miss
1,1,1,male,0.9167,1,2,151.55,C22,S,Master
2,1,0,female,2.0,1,2,151.55,C22,S,Miss
3,1,0,male,30.0,1,2,151.55,C22,S,Mr
4,1,0,female,25.0,1,2,151.55,C22,S,Mrs


**Save the data set!**

In [62]:
# data.to_csv('titanic.csv', index=False)

## Configuration

In [63]:
target = 'survived'

In [64]:
vars_num = [c for c in data.columns if data[c].dtypes!='O' and c!=target]
vars_cat = [c for c in data.columns if data[c].dtypes=='O']
print('Number of numerical variables: {}'.format(len(vars_num)))
print('Number of categorical variables: {}'.format(len(vars_cat)))

Number of numerical variables: 5
Number of categorical variables: 4


In [65]:
vars_num, vars_cat

(['pclass', 'age', 'sibsp', 'parch', 'fare'],
 ['sex', 'cabin', 'embarked', 'title'])

**List of variables to be used in the pipeline's transformers!**

**From the numerical variables we use only 'age' and 'fare'!**

In [66]:
NUMERICAL_VARIABLES = ['age', 'fare']
CATEGORICAL_VARIABLES = ['sex', 'cabin', 'embarked', 'title']
CABIN = ['cabin']

## Separate data into train and test

In [67]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((1047, 9), (262, 9))
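
The shapes printed above follow from the 80/20 split: with 1047 + 262 = 1309 rows, the test set gets 20% of the rows rounded up and the train set gets the rest. The sketch below checks this arithmetic on a placeholder array of zeros with the same shape. It does not use the Titanic data.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1047 + 262  # total rows implied by the shapes printed above
X = np.zeros((n_rows, 9))  # placeholder predictors
y = np.zeros(n_rows)       # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

expected_test = math.ceil(0.2 * n_rows)  # the test fraction is rounded up
print(X_train.shape, X_test.shape, expected_test)
```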

## Preprocessors

### Class to extract the letter from the variable Cabin

**Extract the first letter of the variable!**

In [68]:
class ExtractLetterTransformer(BaseEstimator, TransformerMixin): 
    def __init__(self, variables):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')
        self.variables = variables
    def fit(self, X, y=None):
        return self   # we need this step to fit the sklearn pipeline
    def transform(self, X):
        X = X.copy()  # so that we do not over-write the original dataframe
        for feature in self.variables:
            X[feature] = X[feature].str[0]
        return X
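
To see the transformer in isolation, the sketch below repeats the class and applies it to a small made-up frame. The value 'Missing' stands in for the label that the categorical imputer is assumed to write for absent cabins, so those rows end up with the letter 'M'.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ExtractLetterTransformer(BaseEstimator, TransformerMixin):
    """Keep only the first character of each listed string column."""
    def __init__(self, variables):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')
        self.variables = variables

    def fit(self, X, y=None):
        return self  # nothing to learn, but the pipeline requires fit

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        for feature in self.variables:
            X[feature] = X[feature].str[0]
        return X

# toy input; 'Missing' is the assumed imputation label
toy = pd.DataFrame({'cabin': ['C22', 'B5', 'Missing']})
out = ExtractLetterTransformer(variables=['cabin']).fit_transform(toy)
print(out['cabin'].tolist())
print(toy['cabin'].tolist())
```

Because `transform` works on a copy, the original frame still holds the full cabin strings afterwards.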

## Pipeline

- Impute categorical variables with the string 'Missing'
- Add a binary missing indicator to numerical variables with missing data
- Fill NA in the original numerical variables with the median
- Extract the first letter from cabin
- Group rare categories
- Perform one-hot encoding
- Scale features with a standard scaler
- Fit a logistic regression
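
The steps above can be illustrated in plain pandas on made-up data. This is only a sketch of the ideas behind the missing indicator, median imputation, rare-label grouping, and k-1 one-hot encoding. It is not how the feature_engine transformers are implemented, and the column name `age_na` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# toy data: 25 passengers (values made up)
toy = pd.DataFrame({
    'age': [30.0] * 10 + [np.nan] * 5 + [20.0] * 5 + [40.0] * 5,
    'title': ['Mr'] * 15 + ['Mrs'] * 9 + ['Dr'] * 1,
})

# 1. binary missing indicator, created before the gaps are filled
toy['age_na'] = toy['age'].isna().astype(int)

# 2. median imputation
age_median = toy['age'].median()
toy['age'] = toy['age'].fillna(age_median)

# 3. group categories seen in fewer than 5% of rows under 'Rare'
freq = toy['title'].value_counts(normalize=True)
rare = freq[freq < 0.05].index
toy['title'] = toy['title'].where(~toy['title'].isin(rare), 'Rare')

# 4. one-hot encode into k-1 columns (here the first category is dropped)
encoded = pd.get_dummies(toy['title'], prefix='title', drop_first=True)

print(age_median, toy['age_na'].sum())
print(sorted(toy['title'].unique()))
print(list(encoded.columns))
```

In a fitted pipeline the median and the list of rare categories are learned on the train set only and then reused on the test set, which is what protects against information leaking from the test data.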

**Set up the pipeline!**

In [69]:
titanic_pipe = Pipeline([
    # impute categorical variables with string missing
    ('categorical_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARIABLES)),
    # add missing indicator to numerical variables
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARIABLES)),
    ('median_imputation', MeanMedianImputer(  # impute numerical variables with the median
        imputation_method='median', variables=NUMERICAL_VARIABLES)),
    ('extract_letter', ExtractLetterTransformer(variables=CABIN)),  # Extract letter from cabin
    ('rare_label_encoder', RareLabelEncoder(  # group categories seen in less than 5% of rows under 'Rare'
        tol=0.05, n_categories=1, variables=CATEGORICAL_VARIABLES)),
    ('categorical_encoder', OneHotEncoder(  # encode categorical variables into k-1 variables
        drop_last=True, variables=CATEGORICAL_VARIABLES)),
    ('scaler', StandardScaler()),   # scale
    ('Logit', LogisticRegression(C=0.0005, random_state=0)),])

**Train the pipeline!**

In [70]:
titanic_pipe.fit(X_train, y_train)

Pipeline(steps=[('categorical_imputation',
                 CategoricalImputer(variables=['sex', 'cabin', 'embarked',
                                               'title'])),
                ('missing_indicator',
                 AddMissingIndicator(variables=['age', 'fare'])),
                ('median_imputation',
                 MeanMedianImputer(variables=['age', 'fare'])),
                ('extract_letter',
                 ExtractLetterTransformer(variables=['cabin'])),
                ('rare_label_encoder',
                 RareLabelEncoder(n_categories=1,
                                  variables=['sex', 'cabin', 'embarked',
                                             'title'])),
                ('categorical_encoder',
                 OneHotEncoder(drop_last=True,
                               variables=['sex', 'cabin', 'embarked',
                                          'title'])),
                ('scaler', StandardScaler()),
                ('Logit', LogisticRegress

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important, remember that to determine the accuracy, you need the outcome 0, 1, referring to survived or not. But to determine the roc-auc you need the probability of survival.**
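
The difference between the two inputs can be seen on a synthetic problem. The sketch below uses `make_classification` data and a default `LogisticRegression`, both chosen only for illustration: `predict` returns hard 0/1 labels, which feed the accuracy, while `predict_proba` returns probabilities, which feed the roc-auc.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# synthetic binary classification data (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

labels = model.predict(X)             # hard 0/1 decisions
proba = model.predict_proba(X)[:, 1]  # probability of class 1

accuracy = accuracy_score(y, labels)  # needs the 0/1 labels
auc = roc_auc_score(y, proba)         # needs the probabilities

print(np.unique(labels))
print(accuracy, auc)
```

For a binary logistic regression the hard labels correspond to thresholding the class-1 probability at 0.5, so accuracy describes one operating point while roc-auc summarises the ranking across all thresholds.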

**Make predictions for the train set and determine the roc-auc and accuracy! Then make predictions for the test set and determine the same metrics!**

In [71]:
class_ = titanic_pipe.predict(X_train)
pred = titanic_pipe.predict_proba(X_train)[:,1]
print('train roc-auc: {}'.format(roc_auc_score(y_train, pred)))
print('train accuracy: {}'.format(accuracy_score(y_train, class_)))
print()
class_ = titanic_pipe.predict(X_test)
pred = titanic_pipe.predict_proba(X_test)[:,1]
print('test roc-auc: {}'.format(roc_auc_score(y_test, pred)))
print('test accuracy: {}'.format(accuracy_score(y_test, class_)))
print()

train roc-auc: 0.8450386398763523
train accuracy: 0.7220630372492837

test roc-auc: 0.8354629629629629
test accuracy: 0.7137404580152672



That's it! Well done!

**Keep this code safe, as we will use this notebook later on to build production code in our next assignment!**
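
One way to keep a fitted pipeline for later use is to persist it with joblib, which the notebook already imports. The sketch below is a minimal illustration on synthetic data with a two-step stand-in pipeline. The file name `pipeline.joblib` and the temporary directory are assumptions made for the example, not part of the assignment.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data and pipeline (not the Titanic pipeline)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('Logit', LogisticRegression(random_state=0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.joblib')  # hypothetical file name
    joblib.dump(pipe, path)       # write the fitted pipeline to disk
    restored = joblib.load(path)  # read it back as a new object

same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
print(same_predictions)
```

Saving the whole pipeline, rather than the model and scaler separately, keeps the preprocessing and the classifier in step with each other when they are reloaded.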