# Titanic: Machine Learning from Disaster

**The sinking of the Titanic is one of the most infamous shipwrecks in history.**

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this notebook, I build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data such as name, age, gender, and socio-economic class.

## Overview
* Import Libraries
* Perform a coarse grid search for the parameter space of several different models
* Perform a finer grid search of the best models to finalise model tuning
* Combine the best models into an ensemble predictor

## Import libraries
---

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # visualisation and plotting tools
import scipy
import matplotlib.pyplot as plt
sns.set(style="darkgrid")

In [185]:
import data_clean # the cleaning function that was built using the exploration notebook

train_df = pd.read_csv('./data/train.csv')  # on Kaggle: '/kaggle/input/titanic/train.csv'
test_df = pd.read_csv('./data/test.csv')  # on Kaggle: '/kaggle/input/titanic/test.csv'

train_df.loc[:,'train'] = 1
test_df.loc[:,'train'] = 0
test_df.loc[:,'Survived'] = np.nan
df = pd.concat((train_df,test_df), ignore_index = True)

all_cat = True
df = data_clean.clean(df, all_cat = all_cat)
df.drop(columns = ['PassengerId'], inplace = True)
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked', 'train', 'Ticket_unique', 'Title', 'Num_cabins',
       'Cabin_letter'],
      dtype='object')

In [186]:

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

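# Create the encoder.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions.
encoder = OneHotEncoder(handle_unknown="ignore", sparse = False)
# Fit the encoder on the categorical columns and the scaler on the numeric ones (training rows only).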
encoder.fit(df[df.train == 1].drop(['Survived','train','Age','Fare','SibSp','Parch','Num_cabins'], axis = 1))
scaler = StandardScaler()
scaler.fit(df[df.train == 1][['Age','Fare','SibSp','Parch','Num_cabins']])

# Apply the encoder and scaler, then join the two blocks column-wise.
X_train_cat = encoder.transform(df[df.train == 1].drop(['Survived','train','Age','Fare','SibSp','Parch','Num_cabins'], axis = 1))
X_train_num = scaler.transform(df[df.train == 1][['Age','Fare','SibSp','Parch','Num_cabins']])
X_train = np.concatenate([X_train_cat,X_train_num], axis = 1)
X_test_cat = encoder.transform(df[df.train == 0].drop(['Survived','train','Age','Fare','SibSp','Parch','Num_cabins'], axis = 1))
X_test_num = scaler.transform(df[df.train == 0][['Age','Fare','SibSp','Parch','Num_cabins']])
X_test = np.concatenate([X_test_cat,X_test_num], axis = 1)

y_train = df[df.train==1].Survived

X_train.shape

(888, 29)
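
The manual fit/transform/concatenate steps above can also be expressed with scikit-learn's `ColumnTransformer`. The following is a minimal sketch on a made-up four-row frame, not the notebook's actual preprocessing: the `toy` data and the column split are assumptions chosen only to show the shape of the approach.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; the values are made up for this sketch.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

cat_cols = ["Sex", "Embarked"]   # one-hot encoded
num_cols = ["Age", "Fare"]       # standardised

preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the (toy) training rows and transform them in one call.
X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # 2 + 3 one-hot columns plus 2 scaled columns
```

Fitting the transformer once on the training rows and calling `preprocess.transform` on the test rows would mirror what the encoder and scaler do separately above.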

### Coarse grid search
Run a coarse search over the parameter space of several different models. The top-performing models are then tuned with a finer grid search.
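
As a rough illustration of the coarse-then-fine idea, the sketch below tunes a single k-nearest-neighbours model on synthetic data. The data set, the parameter values, and the rule for building the fine grid (two neighbours either side of the coarse winner) are assumptions for this example; the notebook picks its fine grids by hand.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; the notebook uses the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The step name "clf" is why grid keys carry the "clf__" prefix.
pipe = Pipeline([("clf", KNeighborsClassifier())])

# Coarse pass: a few widely spaced values.
coarse = GridSearchCV(pipe, {"clf__n_neighbors": [3, 10, 20]}, cv=5)
coarse.fit(X_demo, y_demo)
k_coarse = coarse.best_params_["clf__n_neighbors"]

# Fine pass: a denser grid centred on the coarse winner.
fine_grid = {"clf__n_neighbors": [k for k in range(k_coarse - 2, k_coarse + 3) if k >= 1]}
fine = GridSearchCV(pipe, fine_grid, cv=5)
fine.fit(X_demo, y_demo)

print(k_coarse, fine.best_params_, round(fine.best_score_, 3))
```

Because the fine grid contains the coarse winner and both searches use the same folds, the fine search can only match or improve on the coarse cross-validation score.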

In [187]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

cv = 5
params = {
            'lr_liblin' :
            {
                'clf': [LogisticRegression(max_iter=1000)],
                'clf__solver' : ['liblinear'],
                'clf__penalty' : ['l1','l2'],
            },
            'lr_lbfgs' :
            {
                'clf': [LogisticRegression(max_iter=1000)],
                'clf__solver': ['lbfgs'],
                'clf__penalty': ['l2'],
            },
            'svc' :
            {
                'clf': [SVC(probability=True)],
                'clf__kernel': ['poly', 'rbf', 'sigmoid','linear'],
            },
            'gbc' :
            {
                'clf': [GradientBoostingClassifier()],
            },
            'rfc' :
            {
                'clf': [RandomForestClassifier()],
                'clf__criterion': ['gini','entropy'],
            },
            'knn' :
            {
                'clf': [KNeighborsClassifier()],
                'clf__weights': ['uniform','distance'],
                'clf__n_neighbors': [3, 5, 10, 20],
                'clf__p': [1,2]
            },
            'mlp' :
            {
                'clf': [MLPClassifier(solver = 'lbfgs')],
                'clf__activation': ['identity', 'logistic', 'tanh', 'relu'],
            },
            'abc' :
            {
                'clf': [AdaBoostClassifier()]
            }
        }

result=[]
best_clfs_coarse = {}
for clf_, params_ in params.items():
    #classifier
    clf = params_['clf'][0]
    print(clf.__class__)

    #getting arguments by
    #popping out classifier
    prm_ = dict(params_)
    prm_.pop('clf')

    #pipeline
    steps = [('clf',clf)]

    #cross validation using
    #Grid Search
    grid = GridSearchCV(Pipeline(steps), param_grid=prm_, cv=cv, return_train_score=True)
    grid.fit(X_train, y_train)

    #storing result
    result.append({
        'classifier': grid.estimator.steps[0][1],
        'cv_results': grid.cv_results_
    })

    best_clfs_coarse[clf_] = grid.best_estimator_.steps[0][-1]

<class 'sklearn.linear_model._logistic.LogisticRegression'>
<class 'sklearn.linear_model._logistic.LogisticRegression'>
<class 'sklearn.svm._classes.SVC'>
<class 'sklearn.ensemble._gb.GradientBoostingClassifier'>
<class 'sklearn.ensemble._forest.RandomForestClassifier'>
<class 'sklearn.neighbors._classification.KNeighborsClassifier'>
<class 'sklearn.neural_network._multilayer_perceptron.MLPClassifier'>
<class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>


In [188]:
results = ['mean_test_score',
           'mean_train_score',
           'std_test_score', 
           'std_train_score',
           'params']
frames = []
for r in result:
    df_ = pd.DataFrame(r['cv_results'])[results]
    df_['classifier'] = r['classifier']
    frames.append(df_)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_tuning = pd.concat(frames, ignore_index = True)
df_tuning = df_tuning.sort_values(by = 'mean_test_score', ascending = False)
df_tuning[['mean_test_score','std_test_score','params','classifier']]

| | mean_test_score | std_test_score | params | classifier |
|---|---|---|---|---|
| 14 | 0.836749 | 0.022512 | {'clf__n_neighbors': 5, 'clf__p': 1, 'clf__wei... | KNeighborsClassifier() |
| 7 | 0.833352 | 0.021404 | {} | GradientBoostingClassifier() |
| 16 | 0.827772 | 0.031389 | {'clf__n_neighbors': 5, 'clf__p': 2, 'clf__wei... | KNeighborsClassifier() |
| 4 | 0.826585 | 0.0196 | {'clf__kernel': 'rbf'} | SVC(probability=True) |
| 24 | 0.825475 | 0.030167 | {'clf__n_neighbors': 20, 'clf__p': 2, 'clf__we... | KNeighborsClassifier() |
| 2 | 0.820974 | 0.014509 | {'clf__penalty': 'l2', 'clf__solver': 'lbfgs'} | LogisticRegression(max_iter=1000) |
| 1 | 0.820974 | 0.014509 | {'clf__penalty': 'l2', 'clf__solver': 'libline... | LogisticRegression(max_iter=1000) |
| 6 | 0.818676 | 0.02446 | {'clf__kernel': 'linear'} | SVC(probability=True) |
| 12 | 0.817609 | 0.024757 | {'clf__n_neighbors': 3, 'clf__p': 2, 'clf__wei... | KNeighborsClassifier() |
| 22 | 0.817552 | 0.023971 | {'clf__n_neighbors': 20, 'clf__p': 1, 'clf__we... | KNeighborsClassifier() |


In [189]:
best_clfs_coarse.pop('lr_lbfgs')
best_clfs_coarse

{'lr_liblin': LogisticRegression(max_iter=1000, solver='liblinear'),
 'svc': SVC(probability=True),
 'gbc': GradientBoostingClassifier(),
 'rfc': RandomForestClassifier(criterion='entropy'),
 'knn': KNeighborsClassifier(p=1),
 'mlp': MLPClassifier(activation='identity', solver='lbfgs'),
 'abc': AdaBoostClassifier()}

### Fine grid search
Tune each of the best coarse models in turn with a denser parameter grid.

In [190]:
def param_search(steps, params):
    #cross validation using
    #Grid Search
    grid = GridSearchCV(Pipeline(steps), 
            param_grid=params, 
            cv=5, 
            return_train_score=False, 
            verbose = True, 
            n_jobs = 4)
    grid.fit(X_train, y_train)

    print(grid.best_score_)
    print(grid.best_params_)

    return grid.best_estimator_.steps[0][-1]

In [191]:
# Tune the KNN

params = {
    'clf__n_neighbors' : [2,3,4,5,6,10],
    'clf__weights': ['uniform','distance'],
    'clf__p': [1,2]
}

best_knn = best_clfs_coarse['knn']
#pipeline
steps = [('clf', best_knn)]

best_knn = param_search(steps, params)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
0.8367485558306355
{'clf__n_neighbors': 5, 'clf__p': 1, 'clf__weights': 'uniform'}


In [192]:
# Tune the SVC

params = {  
        'clf__C' : [1.e-4, 1.e-3, 1.e-2, 1.e-1, 1, 1.e1, 1.e2, 1.e3, 1.e4],
        'clf__degree' : [2, 3, 4, 5],
        'clf__gamma': ['scale', 'auto'],
        }

best_svc = best_clfs_coarse['svc']
#pipeline
steps = [('clf', best_svc)]

best_svc = param_search(steps, params)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
0.8299561988192725
{'clf__C': 1, 'clf__degree': 2, 'clf__gamma': 'auto'}


In [193]:
# Tune the RFC

params = {  
        'clf__max_features' : ['auto', 'sqrt', 'log2'],
        'clf__n_estimators' : [10, 50, 100, 1000, 5000],
        'clf__max_depth' : [5, 10, 20, 30],
        'clf__min_samples_leaf' : [1, 2, 5, 10] 
        }

best_rfc = best_clfs_coarse['rfc']
#pipeline
steps = [('clf', best_rfc)]

best_rfc = param_search(steps, params)

Fitting 5 folds for each of 300 candidates, totalling 1500 fits
0.8412492858503142
{'clf__max_depth': 20, 'clf__max_features': 'log2', 'clf__min_samples_leaf': 2, 'clf__n_estimators': 50}


In [198]:
# Tune the LR

params = {  
        'clf__C' : [1.e-4, 1.e-3, 1.e-2, 1.e-1, 1, 1.e1, 1.e2, 1.e3, 1.e4],
        'clf__fit_intercept' : [True, False] 
        }

best_lr = best_clfs_coarse['lr_liblin']
#pipeline
steps = [('clf', best_lr)]

best_lr = param_search(steps, params)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
0.8209737827715357
{'clf__C': 1, 'clf__fit_intercept': True}


In [197]:
# Tune the MLP

params = {  
        'clf__alpha' : [1.e-4, 1.e-3, 1.e-2, 1.e-1, 1],
        'clf__learning_rate'  : ['constant','adaptive'],
        'clf__max_iter' : [200,1000],
        }

best_mlp = best_clfs_coarse['mlp']
#pipeline
steps = [('clf', best_mlp)]

best_mlp = param_search(steps, params)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
0.8153494572462388
{'clf__alpha': 0.0001, 'clf__learning_rate': 'adaptive', 'clf__max_iter': 200}


In [199]:
# Tune the GBC

params = {  
        'clf__loss' : ['deviance', 'exponential'],
        'clf__learning_rate'  : [0.1, 0.01],
        'clf__n_estimators' : [100,500,1000],
        'clf__max_depth' : [2,3,5],
        }

best_gbc = best_clfs_coarse['gbc']
#pipeline
steps = [('clf', best_gbc)]

best_gbc = param_search(steps, params)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
0.833352377324954
{'clf__learning_rate': 0.1, 'clf__loss': 'deviance', 'clf__max_depth': 3, 'clf__n_estimators': 100}

### Ensemble
Combine the tuned models into hard-voting and soft-voting ensembles, then search over the voting weights.


In [202]:
from sklearn.ensemble import VotingClassifier

# The same tuned models feed both ensembles; only the voting rule differs.
estimators = [('knn',best_knn),('svc',best_svc),('rfc',best_rfc),('lr',best_lr),('mlp',best_mlp),('gbc',best_gbc)]
voting_clf_hard = VotingClassifier(estimators = estimators, voting = 'hard')
voting_clf_soft = VotingClassifier(estimators = estimators, voting = 'soft')

In [203]:
scores = cross_val_score(voting_clf_hard,X_train,y_train,cv=5)
print('voting_clf_hard :', scores)
print('voting_clf_hard mean :',np.mean(scores))

voting_clf_hard : [0.84831461 0.80898876 0.85393258 0.81355932 0.84745763]
voting_clf_hard mean : 0.8344505808417445


In [204]:
scores = cross_val_score(voting_clf_soft,X_train,y_train,cv=5)
print('voting_clf_soft :', scores)
print('voting_clf_soft mean :', np.mean(scores))

voting_clf_soft : [0.84269663 0.80898876 0.8258427  0.81355932 0.85875706]
voting_clf_soft mean : 0.8299688948136863


In [205]:
params = {'weights' : [[i,j,k,l,m,n] for i in range(1,3) for j in range(1,3) for k in range(1,3) for l in range(1,3) for m in range(1,3) for n in range(1,3)]}

vote_weight = GridSearchCV(voting_clf_hard, param_grid = params, cv = 5, verbose = True, n_jobs = 2)
best_clf_weight = vote_weight.fit(X_train,y_train)
print(vote_weight.best_score_)
print(vote_weight.best_params_)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
0.8445819843839268
{'weights': [2, 2, 1, 1, 1, 1]}


In [206]:
params = {'weights' : [[i,j,k,l,m,n] for i in range(1,3) for j in range(1,3) for k in range(1,3) for l in range(1,3) for m in range(1,3) for n in range(1,3)]}

vote_weight = GridSearchCV(voting_clf_soft, param_grid = params, cv = 5, verbose = True, n_jobs = 2)
best_clf_weight_soft = vote_weight.fit(X_train,y_train)
print(vote_weight.best_score_)
print(vote_weight.best_params_)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
0.836716815844601
{'weights': [1, 2, 1, 1, 1, 1]}
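
The six nested loops used to build the weight grid can be written more compactly with `itertools.product`. This is a sketch of an equivalent construction, not a change to the search itself; the name `weight_grid` is introduced here for illustration.

```python
from itertools import product

# Every combination of weight 1 or 2 for six estimators,
# in the same order as the nested comprehension.
weight_grid = [list(w) for w in product(range(1, 3), repeat=6)]
params = {"weights": weight_grid}

print(len(weight_grid))   # 2**6 candidates
print(weight_grid[:2])
```

Each weight list follows the order of the `estimators` argument (knn, svc, rfc, lr, mlp, gbc). Note that `[2, 2, 2, 2, 2, 2]` weights every model equally, so it is the same ensemble as `[1, 1, 1, 1, 1, 1]`.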


In [207]:
voting_clf_hard_sub = best_clf_weight.best_estimator_.predict(X_test).astype(int)
voting_clf_soft_sub = best_clf_weight_soft.best_estimator_.predict(X_test).astype(int)
rfc_sub = best_rfc.predict(X_test).astype(int)

In [26]:
import data_clean
# Reload and re-clean the raw test set to recover PassengerId, which was dropped from df above
test_df = pd.read_csv('./data/test.csv')
print(len(test_df))
test_df = data_clean.clean(test_df, all_cat = all_cat)
print(len(test_df))

418
418


In [212]:
final_data_hard_vote = {'PassengerId': test_df.PassengerId, 'Survived': voting_clf_hard_sub}
submission_hard_vote = pd.DataFrame(data=final_data_hard_vote)
submission_hard_vote.to_csv('submission_voting_hard_clf.csv', index =False)

In [213]:
final_data_soft_vote = {'PassengerId': test_df.PassengerId, 'Survived': voting_clf_soft_sub}
submission_soft_vote = pd.DataFrame(data=final_data_soft_vote)
submission_soft_vote.to_csv('submission_voting_soft_clf.csv', index =False)

In [214]:
final_data_rfc = {'PassengerId': test_df.PassengerId, 'Survived': rfc_sub}
submission_rfc = pd.DataFrame(data=final_data_rfc)
submission_rfc.to_csv('submission_rfc_clf.csv', index =False)
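
The three submission cells repeat the same pattern, so they could be folded into a small helper. The function below is a hypothetical sketch (`make_submission` is not part of the notebook), with a length check added because the ids and the predictions come from two separately cleaned copies of the test set.

```python
import pandas as pd

def make_submission(passenger_ids, predictions):
    """Return a two-column frame in the PassengerId/Survived layout used above."""
    preds = pd.Series(predictions).astype(int).to_numpy()
    if len(passenger_ids) != len(preds):
        raise ValueError("ids and predictions differ in length")
    return pd.DataFrame({"PassengerId": list(passenger_ids), "Survived": preds})

# Made-up ids and predictions, for illustration only.
demo = make_submission([892, 893, 894], [0.0, 1.0, 0.0])
print(demo)
```

With this helper, each submission would reduce to something like `make_submission(test_df.PassengerId, rfc_sub).to_csv('submission_rfc_clf.csv', index=False)`.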