## Predicting Survival on the Titanic

### History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes such as gender, age, and social status, we can make fairly accurate predictions about which passengers survived. Some groups, such as women, children, and the upper class, were more likely to survive than others. We can therefore learn about society's priorities and privileges at the time.

### Assignment:

Build a machine learning pipeline to engineer the features in the data set and predict who is more likely to survive the catastrophe.

Follow the Jupyter notebook below and complete the missing bits of code to achieve each of the pipeline steps.

In [111]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

## Prepare the data set

In [112]:
# load the data - it is available open source and online

data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

# display data
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [113]:
# replace question marks by NaN values

data = data.replace('?', np.nan)

In [114]:
# retain only the first cabin if more than
# 1 are available per passenger

def get_first_cabin(row):
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        # missing values (NaN floats) or empty strings have no first cabin
        return np.nan

data['cabin'] = data['cabin'].apply(get_first_cabin)

In [115]:
# extracts the title (Mr, Ms, etc) from the name variable

def get_title(passenger):
    line = passenger
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
    
data['title'] = data['name'].apply(get_title)

In [116]:
# cast numerical variables as floats

data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')

In [117]:
# drop unnecessary variables
data.drop(labels=['name','ticket', 'body','home.dest','boat'], axis=1, inplace=True)

# display data
data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,1,female,29.0,0,0,211.3375,B5,S,Miss
1,1,1,male,0.9167,1,2,151.55,C22,S,Master
2,1,0,female,2.0,1,2,151.55,C22,S,Miss
3,1,0,male,30.0,1,2,151.55,C22,S,Mr
4,1,0,female,25.0,1,2,151.55,C22,S,Mrs


In [118]:
# save the data set

data.to_csv('titanic.csv', index=False)

## Data Exploration

### Find numerical and categorical variables

In [119]:
numerical = data.select_dtypes(include=['int64','float64']).columns

In [120]:
categorical = data.select_dtypes(include=['object']).columns

In [121]:
numerical

Index(['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare'], dtype='object')

In [122]:
categorical

Index(['sex', 'cabin', 'embarked', 'title'], dtype='object')

In [123]:
target = 'survived'

### Find missing values in variables

In [124]:
# fill missing 'embarked' with the most frequent port within each survival group
# (note: this uses the target and runs before the train/test split)
survived_value = data[data['survived']==1]['embarked'].value_counts().idxmax()
not_survived_value = data[data['survived']==0]['embarked'].value_counts().idxmax()
data['embarked'] = data.apply(
    lambda x: (survived_value if x.survived == 1 else not_survived_value)
    if pd.isnull(x.embarked) else x.embarked,
    axis=1)

In [125]:
# fill missing fares with the mean fare
mean_fare = data['fare'].mean()
data['fare'] = data['fare'].fillna(mean_fare)

In [126]:
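# fill missing ages with random integers within one standard deviation of the mean
# (no seed is set here, so the imputed values change between runs)
mean = data['age'].mean()
std = data['age'].std()
data['age'] = data.apply(lambda x: np.random.randint(int(mean - std), int(mean + std)) if np.isnan(x.age) else x.age, axis=1)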

In [127]:
data.isnull().sum()

pclass         0
survived       0
sex            0
age            0
sibsp          0
parch          0
fare           0
cabin       1014
embarked       0
title          0
dtype: int64

In [128]:
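# keep only the deck letter of each cabin;
# fill missing cabins with the deck letter of the most frequent cabin
cabin_value = str(data['cabin'].value_counts().idxmax())[0]
data['cabin'] = data.apply(lambda x: cabin_value if pd.isnull(x.cabin) else (str(x.cabin)[0]), axis=1)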

In [129]:
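def fill_missing(feature):
    # most frequent value of `feature` among survivors and non-survivors
    survived_value = data[data['survived']==1][feature].value_counts().idxmax()
    not_survived_value = data[data['survived']==0][feature].value_counts().idxmax()
    # fill only the missing entries, based on each passenger's outcome
    return data.apply(
        lambda x: (survived_value if x.survived == 1 else not_survived_value)
        if pd.isnull(x[feature]) else x[feature], axis=1)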

In [130]:
data[categorical].isnull().sum()/data[categorical].isnull().count()

sex         0.0
cabin       0.0
embarked    0.0
title       0.0
dtype: float64

In [131]:
data[numerical].isnull().sum()/data[numerical].isnull().count()

pclass      0.0
survived    0.0
age         0.0
sibsp       0.0
parch       0.0
fare        0.0
dtype: float64

### Determine cardinality of categorical variables

In [132]:
for c in categorical:
    print("Category {}: {}".format(c, data[c].nunique()))

Category sex: 2
Category cabin: 8
Category embarked: 3
Category title: 5


### Determine the distribution of numerical variables

In [133]:
for n in numerical:
    mean = data[n].mean()
    std = data[n].std()
    print("Feature {}: mean {}, std {}".format(n, mean, std))

Feature pclass: mean 2.294881588999236, std 0.8378360189701274
Feature survived: mean 0.3819709702062643, std 0.4860551708664827
Feature age: mean 29.753756073338426, std 13.349486117900629
Feature sibsp: mean 0.4988540870893812, std 1.041658390596102
Feature parch: mean 0.3850267379679144, std 0.8655602753495147
Feature fare: mean 33.295479281345564, std 51.73887903247135


## Separate data into train and test

Use the code below for reproducibility. Don't change it.

In [134]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1047, 9), (262, 9))

## Feature Engineering

### Fill in Missing data in numerical variables:

- Add a binary missing indicator
- Fill NA in original variable with the median

### Remove rare labels in categorical variables

- remove labels present in fewer than 5% of the passengers
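
One common way to handle rare labels is to group them under a single placeholder category rather than dropping rows. The sketch below takes that reading, which is an assumption about what "remove" means here. The toy frames, the helper `find_frequent_labels`, and the placeholder `"Rare"` are illustrative names. Frequencies are computed on the training frame only, so a label unseen in training (such as `"F"` below) also ends up as `"Rare"`.

```python
import pandas as pd

# Toy data (hypothetical): 100 training passengers, 3 test passengers
train = pd.DataFrame({"cabin": ["C"] * 50 + ["B"] * 45 + ["T"] * 3 + ["G"] * 2})
test = pd.DataFrame({"cabin": ["C", "T", "F"]})


def find_frequent_labels(series, threshold=0.05):
    """Return the labels whose relative frequency is at least `threshold`."""
    freq = series.value_counts(normalize=True)
    return freq[freq >= threshold].index


# Learn the frequent labels from the train set only
frequent = find_frequent_labels(train["cabin"])

# Keep frequent labels, replace everything else with "Rare"
train["cabin"] = train["cabin"].where(train["cabin"].isin(frequent), "Rare")
test["cabin"] = test["cabin"].where(test["cabin"].isin(frequent), "Rare")

print(train["cabin"].value_counts())
print(test["cabin"].tolist())
```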

### Perform one hot encoding of categorical variables into k-1 binary variables

- k-1 means that if the variable contains 9 different categories, we create 8 binary variables
- Remember to drop the original categorical variable (the one with the strings) after the encoding
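
A minimal sketch of k-1 encoding with `pd.get_dummies`, on toy data with assumed column values. Fixing the category list from the training frame is a design choice made for this illustration: it keeps the train and test columns identical even when the test set does not contain every category. Passing `columns=` also drops the original string column automatically.

```python
import pandas as pd

# Toy frames (hypothetical values); the test set only contains port "S"
train = pd.DataFrame({"embarked": ["S", "C", "Q", "S"],
                      "age": [22.0, 38.0, 26.0, 35.0]})
test = pd.DataFrame({"embarked": ["S", "S"],
                     "age": [30.0, 41.0]})

# Fix the set of categories using the train set only
categories = sorted(train["embarked"].unique())
for df in (train, test):
    df["embarked"] = pd.Categorical(df["embarked"], categories=categories)

# drop_first=True yields k-1 binary columns (the first category, "C", is dropped)
train_enc = pd.get_dummies(train, columns=["embarked"], drop_first=True, dtype=int)
test_enc = pd.get_dummies(test, columns=["embarked"], drop_first=True, dtype=int)

print(train_enc)
print(test_enc)
```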

### Scale the variables

- Use the standard scaler from Scikit-learn
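
A short sketch of the scaling step on toy arrays (the values and the `_toy` names are made up for illustration). The scaler is fitted on the training data only, and the same learned mean and standard deviation are then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values): columns are age and fare
X_train_toy = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test_toy = np.array([[30.0, 8.05]])

scaler = StandardScaler()
# Learn mean and standard deviation from the train set, then transform it
X_train_scaled = scaler.fit_transform(X_train_toy)
# Reuse the train statistics on the test set (no refitting)
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```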

## Train the Logistic Regression model

- Set the regularization parameter to 0.0005
- Set the seed to 0
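
As a rough sketch of the settings listed above, the snippet below fits a logistic regression on synthetic data. It assumes that "regularization parameter" refers to scikit-learn's `C` argument (the inverse of the regularization strength) and that "seed" refers to `random_state`. The synthetic data from `make_classification` only stands in for the engineered Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# C is the inverse regularization strength: a small C means strong regularization
logit = LogisticRegression(C=0.0005, random_state=0)
logit.fit(X_tr, y_tr)

# Class labels (0/1) and the probability of the positive class
pred_class = logit.predict(X_te)
pred_proba = logit.predict_proba(X_te)[:, 1]
print(pred_class[:5], pred_proba[:5])
```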

In [135]:
list(categorical)

['sex', 'cabin', 'embarked', 'title']

In [136]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=100, cat_features = list(categorical),
                           learning_rate=0.01,
                           depth=5, custom_metric=['Logloss',
                                          'AUC:hints=skip_train~false'])
# Fit model
model.fit(X_train,
          y_train,
          verbose=False)

print(model.get_best_score())

{'learn': {'Logloss': 0.5016567256408881, 'AUC': 0.8506665378670788}}


In [137]:
# Get predicted classes
preds_class = model.predict(X_test)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(X_test)
# Get predicted RawFormulaVal
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

In [138]:
from sklearn.metrics import accuracy_score, roc_auc_score
accuracy_score(y_test, preds_class)

0.7862595419847328

In [139]:
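# note: hard class labels are passed here, so the value below is not a
# probability-based ROC-AUC; for that, pass preds_proba[:, 1] instead
roc_auc_score(y_test, preds_class)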

0.7487037037037036

In [82]:
categorical

Index(['sex', 'cabin', 'embarked', 'title'], dtype='object')

In [140]:
for var in categorical:

    # to create the binary variables, we use get_dummies from pandas
    data = pd.concat(
        [data, pd.get_dummies(data[var], prefix=var, drop_first=True)],
        axis=1)

# drop the original categorical variables
data.drop(labels=categorical, axis=1, inplace=True)

data.shape

(1309, 20)

In [141]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train2.shape, X_test2.shape

((1047, 19), (262, 19))

In [143]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
# use the targets from the second split (same rows as y_train, since the seed is the same)
random_forest.fit(X_train2, y_train2)
random_forest.score(X_train2, y_train2)

0.9799426934097422

In [147]:
y_pred2 = random_forest.predict(X_test2)
accuracy_score(y_test2, y_pred2)

0.8015267175572519

In [148]:
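# note: hard class labels are passed here; for a probability-based ROC-AUC,
# pass random_forest.predict_proba(X_test2)[:, 1] instead
roc_auc_score(y_test2, y_pred2)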

0.7801851851851851

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important: to determine the accuracy, you need the predicted outcome (0 or 1, i.e. did not survive or survived). To determine the roc-auc, you need the predicted probability of survival.**
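
The difference can be seen on a tiny made-up example. The labels and probabilities below are hypothetical and do not come from any model in this notebook. Thresholding the probabilities at 0.5 discards the ranking information that roc-auc relies on, so the two roc-auc values differ.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical true outcomes and predicted probabilities of survival
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.2, 0.3, 0.4, 0.9])

# Hard class labels obtained with a 0.5 threshold: [0, 0, 0, 1]
labels = (proba >= 0.5).astype(int)

acc = accuracy_score(y_true, labels)             # accuracy needs class labels
auc_from_proba = roc_auc_score(y_true, proba)    # roc-auc from probabilities
auc_from_labels = roc_auc_score(y_true, labels)  # roc-auc from labels loses ranking

print(acc, auc_from_proba, auc_from_labels)
```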

That's it! Well done

**Keep this code safe, as we will use this notebook later on to build production code in our next assignment!**