# Titanic Survival Prediction

This notebook predicts the survival of Titanic passengers using Python and scikit-learn. It covers several models (k-nearest neighbors, naive Bayes, random forest, and SVM), with hyperparameters tuned by random search where applicable.

## Import and preprocessing

In [29]:
import numpy as np
from lib.data_utils import load_Titanic, create_submission
from sklearn.preprocessing import scale, LabelEncoder
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [2]:
# read the data
x_train, y_train = load_Titanic()

# preprocessing: standardize numeric, encode categorical
x_train[:,[0,2,3,4,5]] = scale(x_train[:,[0,2,3,4,5]])
for i in [1,6]:
    x_train[:,i] = LabelEncoder().fit_transform(x_train[:,i])

# see if it works well
print(x_train[:5])

[['0.827377' '1' '-0.59759' '0.432793' '-0.47367' '-0.50244' '2']
 ['-1.56610' '0' '0.632604' '0.432793' '-0.47367' '0.786845' '0']
 ['0.827377' '0' '-0.29004' '-0.47454' '-0.47367' '-0.48885' '2']
 ['-1.56610' '0' '0.401941' '0.432793' '-0.47367' '0.420730' '2']
 ['0.827377' '1' '0.401941' '-0.47454' '-0.47367' '-0.48633' '2']]
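
To make the two preprocessing steps concrete, here is a minimal sketch on toy data (the values and column meanings below are made up for illustration, not taken from the Titanic set). `scale` centers each column to zero mean and unit variance, and `LabelEncoder` maps each distinct category to an integer in sorted order.

```python
import numpy as np
from sklearn.preprocessing import scale, LabelEncoder

# toy numeric columns (hypothetical values)
numeric = np.array([[22.0, 7.25],
                    [38.0, 71.28],
                    [26.0, 7.92],
                    [35.0, 53.10]])
scaled = scale(numeric)
col_means = scaled.mean(axis=0)   # approximately 0 for every column
col_stds = scaled.std(axis=0)     # approximately 1 for every column

# toy categorical column (hypothetical values)
sex = np.array(['male', 'female', 'female', 'male'])
encoder = LabelEncoder()
sex_codes = encoder.fit_transform(sex)
print(encoder.classes_)  # categories in sorted order: ['female' 'male']
print(sex_codes)         # [1 0 0 1]
```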




## k-Nearest Neighbors

We use random search to fine-tune the hyperparameters of kNN and get the best model.

In [16]:
# set the range of hyperparameters
# note: the np.int alias was removed in NumPy 1.24; the builtin int is equivalent
param_distributions = {'n_neighbors': np.array(np.linspace(1,15,15), dtype=int),
                       'p': np.array(np.linspace(1,5,5), dtype=int),
                       'leaf_size': np.array(np.linspace(10,50,40), dtype=int)}
# initialize the random search
random_search = RandomizedSearchCV(estimator=KNeighborsClassifier(),
                                   param_distributions=param_distributions,
                                   n_iter=100,
                                   cv=10,
                                   verbose=1)
# start searching
random_search.fit(x_train, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:  2.6min finished


RandomizedSearchCV(cv=10, error_score='raise',
          estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
          fit_params=None, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'leaf_size': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45, 46, 47, 48, 50]), 'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]), 'p': array([1, 2, 3, 4, 5])},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=1)
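
As a rough illustration of what the search does internally: rather than trying every combination, `RandomizedSearchCV` evaluates `n_iter` settings drawn at random from `param_distributions`. scikit-learn exposes that sampling step separately as `ParameterSampler`. The sketch below uses plain lists covering the same ranges as above. The small `n_iter` and the fixed `random_state` are assumptions made only to keep the example short and repeatable.

```python
from sklearn.model_selection import ParameterSampler

# discrete ranges similar to the ones above, written as plain lists
param_distributions = {'n_neighbors': list(range(1, 16)),
                       'p': list(range(1, 6)),
                       'leaf_size': list(range(10, 51))}

# draw 5 candidate settings; random_state is fixed only for repeatability
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

Each printed dictionary is one candidate that the search would fit and score with cross-validation.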

After the search finishes, we print the best hyperparameters and the corresponding cross-validated accuracy, and keep the best model.

In [17]:
print(random_search.best_params_)
print('Best accuracy: ', random_search.best_score_)
print(random_search.best_estimator_)
nn = random_search.best_estimator_

{'leaf_size': 24, 'n_neighbors': 6, 'p': 1}
Best accuracy:  0.824915824916
KNeighborsClassifier(algorithm='auto', leaf_size=24, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=1,
           weights='uniform')


## Naive Bayes

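Gaussian Naive Bayes has no hyperparameters that we tune here, so no random search is needed. We simply fit the model and report its training accuracy. Note that this is accuracy on the training data itself, so it is not directly comparable to the cross-validated scores of the other models.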

In [20]:
nb = GaussianNB()
# note: the np.float alias was removed in NumPy 1.24; the builtin float is equivalent
nb.fit(x_train.astype(float), y_train.astype(float))
print('Training accuracy: ', nb.score(x_train.astype(float), y_train.astype(float)))

Training accuracy:  0.792368125701
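
For a number that can be set side by side with the `best_score_` values of the tuned models, one option is to score Gaussian Naive Bayes with the same 10-fold cross-validation. This is a minimal sketch, not part of the original run: it uses a synthetic dataset from `make_classification` as a stand-in for `x_train` and `y_train`, so the printed accuracy says nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in for (x_train, y_train); not the Titanic data
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# 10-fold cross-validated accuracy, matching cv=10 used in the random searches
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print('Mean CV accuracy: ', scores.mean())
```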


## Random Forest

Although a random forest has already been run in R with 0.78 test accuracy, we train one again here with fine-tuned hyperparameters.

In [24]:
# set the range of hyperparameters
# note: the np.int alias was removed in NumPy 1.24; the builtin int is equivalent
param_distributions = {'n_estimators': np.array(np.arange(10,501), dtype=int),
                       'max_features': np.array(np.linspace(1,7,7), dtype=int),
                       'min_samples_split': np.array(np.linspace(2,10,9), dtype=int),
                       'min_samples_leaf': np.array(np.linspace(1,10,10), dtype=int),
                       'min_impurity_decrease': np.linspace(0,0.1,20)}
# initialize the random search
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                   param_distributions=param_distributions,
                                   n_iter=30,
                                   cv=10,
                                   verbose=1)
# start searching
random_search.fit(x_train, y_train)

Fitting 10 folds for each of 30 candidates, totalling 300 fits


[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:  4.8min finished


RandomizedSearchCV(cv=10, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=30, n_jobs=1,
          param_distributions={'min_impurity_decrease': array([ 0.     ,  0.00526,  0.01053,  0.01579,  0.02105,  0.02632,
        0.03158,  0.03684,  0.04211,  0.04737,  0.05263,  0.05789,
        0.06316,  0.06842,  0.07368,  0.07895,  0.08421,  0.08947,
        0.09474,  0.1    ]), 'n_estimators': array([ ...es': array([1, 2, 3, 4, 5, 6, 7]), 'min_samples_split': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10])},
          pre_dispatch='2*n_jobs', ran

After the search finishes, we print the best hyperparameters and the corresponding cross-validated accuracy, and keep the best model.

In [26]:
print(random_search.best_params_)
print('Best accuracy: ', random_search.best_score_)
print(random_search.best_estimator_)
rf = random_search.best_estimator_

{'min_impurity_decrease': 0.0, 'n_estimators': 16, 'min_samples_leaf': 4, 'max_features': 7, 'min_samples_split': 4}
Best accuracy:  0.849607182941
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=7, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=4,
            min_weight_fraction_leaf=0.0, n_estimators=16, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
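
Instead of explicit arrays, `param_distributions` also accepts objects with an `rvs` method, such as the distributions in `scipy.stats`. The sketch below is one possible variant, not the search that produced the results above: it runs on a synthetic dataset, and the reduced `n_estimators` range, `n_iter=5`, `cv=3`, and fixed `random_state` are assumptions chosen only to keep it quick and repeatable.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for (x_train, y_train); not the Titanic data
X, y = make_classification(n_samples=150, n_features=7, random_state=0)

# randint(low, high) draws integers from [low, high)
# uniform(loc, scale) draws floats from [loc, loc + scale]
param_distributions = {'n_estimators': randint(10, 51),
                       'max_features': randint(1, 8),
                       'min_samples_split': randint(2, 11),
                       'min_samples_leaf': randint(1, 11),
                       'min_impurity_decrease': uniform(0, 0.1)}

search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                            param_distributions=param_distributions,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

With a continuous distribution for `min_impurity_decrease`, the search is no longer limited to the 20 grid points of `np.linspace(0,0.1,20)`.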


## SVM

We use random search to fine-tune the hyperparameters of SVM and get the best model.

In [35]:
# set the range of hyperparameters
param_distributions = {'C': np.linspace(0.1,5,20),
                       'gamma': np.linspace(0.1,1,10),
                       'tol': np.linspace(1e-4,1e-2,10)}
# initialize the random search
random_search = RandomizedSearchCV(estimator=SVC(),
                                   param_distributions=param_distributions,
                                   n_iter=100,
                                   cv=10,
                                   verbose=1)
# start searching
random_search.fit(x_train, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:  2.3min finished


RandomizedSearchCV(cv=10, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'gamma': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'C': array([ 0.1    ,  0.35789,  0.61579,  0.87368,  1.13158,  1.38947,
        1.64737,  1.90526,  2.16316,  2.42105,  2.67895,  2.93684,
        3.19474,  3.45263,  3.71053,  3.96842,  4.22632,  4.48421,
        4.74211,  5.     ]), 'tol': array([ 0.0001,  0.0012,  0.0023,  0.0034,  0.0045,  0.0056,  0.0067,
        0.0078,  0.0089,  0.01  ])},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=1)

After the search finishes, we print the best hyperparameters and the corresponding cross-validated accuracy, and keep the best model.

In [36]:
print(random_search.best_params_)
print('Best accuracy: ', random_search.best_score_)
print(random_search.best_estimator_)
svm = random_search.best_estimator_

{'gamma': 0.10000000000000001, 'tol': 0.01, 'C': 2.9368421052631581}
Best accuracy:  0.829405162738
SVC(C=2.9368421052631581, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.10000000000000001,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.01, verbose=False)
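
The model above uses the default `kernel='rbf'`, for which `gamma` controls how quickly similarity between two points decays with distance: K(a, b) = exp(-gamma * ||a - b||^2). The helper below, `rbf_kernel_value`, is a hypothetical name written for this illustration, not a scikit-learn function. It shows that a larger `gamma` makes the same pair of points look less similar.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel between two vectors: exp(-gamma * ||a - b||^2)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

a = [0.0, 0.0]
b = [1.0, 1.0]  # squared distance to a is 2

k_small = rbf_kernel_value(a, b, gamma=0.1)  # exp(-0.2)
k_large = rbf_kernel_value(a, b, gamma=1.0)  # exp(-2.0)
print(k_small, k_large)
```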


## Neural Network using Keras

For more flexibility, we use Keras instead of the multilayer perceptron in scikit-learn.

## Create submission

This part generates a submission file for the Kaggle competition from a trained model.

In [34]:
# read the test data and preprocess
x_test = load_Titanic(filename='../data/test - processed.csv', test=True)
x_test[:,[0,2,3,4,5]] = scale(x_test[:,[0,2,3,4,5]])
for i in [1,6]:
    x_test[:,i] = LabelEncoder().fit_transform(x_test[:,i])
    
# create submission file
create_submission(svm, x_test, '../submission/submission_svm_ft.csv')
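
The cell above standardizes the test set with its own mean and variance and fits fresh label encoders on it, so the test features are not guaranteed to share the scale and integer coding the models were trained on. A common alternative is to learn the transformations on the training data and reuse them on the test data. The sketch below shows the idea on toy columns. The arrays and their values are made up, and it is not a drop-in replacement for the cell above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# toy stand-ins for one numeric and one categorical column (hypothetical values)
train_num = np.array([[20.0], [30.0], [40.0]])
test_num = np.array([[30.0], [50.0]])
train_cat = np.array(['C', 'Q', 'S'])
test_cat = np.array(['S', 'C'])

# learn mean and standard deviation on the training column only
scaler = StandardScaler().fit(train_num)
test_num_scaled = scaler.transform(test_num)

# learn the category-to-integer mapping on the training column only
encoder = LabelEncoder().fit(train_cat)
test_cat_codes = encoder.transform(test_cat)

print(test_num_scaled.ravel())  # first value is 0 because 30 equals the training mean
print(test_cat_codes)           # [2 0]
```

Note that `LabelEncoder.transform` raises an error for a category that did not appear in the training column, which at least makes such a mismatch visible.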


