# Titanic: Machine Learning from Disaster

***By Joe Corliss***

## Table of Contents

1. [Getting Started](#1)
    1. [Imports](#1.1)
    2. [Read In the Data](#1.2)
2. [Pre-processing](#2)
    1. [Feature Extraction and Selection](#2.1)
    2. [Dummy Variables](#2.2)
    3. [Variable Selection](#2.3)
    4. [Imputation with Mean Substitution](#2.4)
    5. [Standardization](#2.5)
3. [Predictive Modeling](#3)
    1. [Random Forest](#3.1)
    2. [Gradient Boosting](#3.2)
    3. [Logistic Regression](#3.3)
    4. [Gaussian Naive Bayes](#3.4)
    5. [Support Vector Classifier](#3.5)
    6. [k-Nearest Neighbors](#3.6)
4. [Conclusion](#4)
    1. [Results Summary](#4.1)
    2. [Test Set Predictions](#4.2)

# Introduction

[Kaggle Competition](https://www.kaggle.com/c/titanic)

[Notebook on Kaggle](https://www.kaggle.com/pileatedperch/titanic-predicting-survival)

[GitHub Repository](https://github.com/jgcorliss/titanic-competition)

# Getting Started
<a id='1'></a>

## Imports
<a id='1.1'></a>

In [None]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas options
pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)

# Plotting options
%matplotlib inline
mpl.style.use('ggplot')
sns.set(style='whitegrid')

## Read In the Data
<a id='1.2'></a>

In [None]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
TestPassengerId = df_test.loc[:,'PassengerId'] # We'll use this later for the submission file

Concatenate the train and test sets together.

In [None]:
df = pd.concat([df_train, df_test], ignore_index=True)

Basic metadata:

In [None]:
df.shape

In [None]:
df.info()

Locate incomplete columns:

In [None]:
def incomplete_cols(df):
    """
    Returns a list of incomplete columns in df and their fraction of non-null values.
    
    Input: pandas DataFrame
    Returns: pandas Series
    """
    cmp = df.notnull().mean().sort_values()
    return cmp.loc[cmp<1]

In [None]:
incomplete_cols(df)
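
To make the return value concrete, here is a minimal sketch that runs the same logic on a made-up three-column frame. The toy data and the name `toy` are illustrative assumptions, not part of the Titanic data.

```python
import pandas as pd

def incomplete_cols(df):
    """Return the fraction of non-null values for each incomplete column."""
    cmp = df.notnull().mean().sort_values()
    return cmp.loc[cmp < 1]

# Hypothetical toy frame: 'a' is complete, 'b' is half missing, 'c' has one gap
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1.0, None, None, 4.0],
    'c': [None, 'x', 'y', 'z'],
})

result = incomplete_cols(toy)
print(result)  # 'b' (0.5) is listed before 'c' (0.75); 'a' is omitted
```

Complete columns are filtered out, and the remaining ones are sorted so the sparsest column comes first.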

In [None]:
df.sample(5) # Display some random rows

# Pre-processing
<a id='2'></a>

## Feature Extraction and Selection
<a id='2.1'></a>

### Age

In [None]:
df['Age'].notnull().mean()

In [None]:
df['Age_NA'] = np.uint8(df['Age'].isnull())

In [None]:
plt.figure(figsize=(12,4), dpi=90)
sns.distplot(df.loc[df['Age'].notnull(), 'Age'], bins=range(0,90, 2), kde=False)
plt.ylabel('Count')
plt.title('Histogram of Passenger Age')

### Cabin

In [None]:
df['Cabin'].notnull().mean()

Extract the cabin letter (A, B, C, etc.). Missing cabins are labeled `'NA'`.

In [None]:
def find_cabin(s):
    try:
        return s[0]
    except TypeError:  # Missing cabins are NaN (a float), which can't be indexed
        return 'NA'

In [None]:
df.loc[:,'Cabin'] = df['Cabin'].apply(find_cabin)

In [None]:
df['Cabin'].value_counts()

### Embarked

In [None]:
df['Embarked'].notnull().mean()

In [None]:
df['Embarked'].value_counts(dropna=False)

In [None]:
sns.countplot(x='Embarked', data=df)
plt.title('Passenger Ports of Embarkation')

### Fare

In [None]:
df['Fare'].notnull().mean()

In [None]:
plt.figure(figsize=(12,4), dpi=90)
sns.distplot(df.loc[df['Fare'].notnull(), 'Fare'], kde=False)
plt.ylabel('Count')
plt.title('Histogram of Passenger Fares')

In [None]:
df['Fare'].skew()

### Name

In [None]:
df['Name'].notnull().mean()

In [None]:
df['Name'].sample(5)

Let's extract everyone's titles.

In [None]:
df['Title'] = df['Name'].apply(lambda s: s.split(', ')[1].split(' ')[0])

In [None]:
df['Title'].nunique()

Did it work?

In [None]:
df[['Name', 'Title']].sample(10)

Seems good. Title counts:

In [None]:
df['Title'].value_counts()

The only odd-looking value is "the". Let's investigate.

In [None]:
df.loc[df['Title']=='the']

Her title is actually "Countess."
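
The split-based extraction keeps only the first word after the comma, which is why this passenger ends up with "the". As a hedged alternative (not what this notebook uses), a regular expression can capture everything between the comma and the first period. The helper name `extract_title` is hypothetical, and the two sample names follow the dataset's `Surname, Title. Given names` layout.

```python
import re

def extract_title(name):
    """Return the text between the first comma and the next period, or None."""
    match = re.search(r',\s*([^.]+)\.', name)
    return match.group(1).strip() if match else None

names = [
    'Braund, Mr. Owen Harris',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
]

# The notebook's approach: first space-delimited word after ', '
split_titles = [n.split(', ')[1].split(' ')[0] for n in names]

# The regex sketch: everything up to the first period
regex_titles = [extract_title(n) for n in names]

print(split_titles)
print(regex_titles)
```

The regex version keeps multi-word titles together and drops the trailing period, at the cost of returning `None` for any name that lacks a comma or a period.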

There doesn't seem to be any new information in `Title`. We already know the passenger's age, sex, number of siblings/spouses aboard (`SibSp`), and economic class (`Pclass`). Furthermore, many of the titles have too few data points to be useful. So we're going to drop the `Name` and `Title` columns.

In [None]:
df.drop(labels=['Name','Title'], axis=1, inplace=True)

### Parch

In [None]:
df['Parch'].notnull().mean()

In [None]:
df['Parch'].value_counts()

In [None]:
plt.figure(dpi=80)
sns.countplot(x='Parch', data=df)
plt.title('Number of Parents/Children')

### PassengerId

In [None]:
df.shape[0]

In [None]:
df['PassengerId'].nunique()

The passenger ID will (probably) be removed in automatic variable selection.

### Pclass

In [None]:
df['Pclass'].notnull().mean()

In [None]:
df['Pclass'].value_counts()

A majority of the passengers are in third class (`Pclass` = 3), the lowest ticket class.

In [None]:
plt.figure(figsize=(4,4), dpi=90)
sns.countplot(x='Pclass', data=df)
plt.title('Passenger Ticket Class')

### Sex

In [None]:
df['Sex'].notnull().mean()

In [None]:
df['Sex'].value_counts()

In [None]:
plt.figure(figsize=(4,4), dpi=90)
sns.countplot(x='Sex', data=df)
plt.title('Passenger Sex')

### SibSp

In [None]:
df['SibSp'].notnull().mean()

In [None]:
df['SibSp'].value_counts()

In [None]:
plt.figure(dpi=90)
sns.countplot(x='SibSp', data=df)
plt.title('Number of Siblings/Spouses')

### Survived

This is the target variable.

In [None]:
df.loc[df['Survived'].notnull(), 'Survived'].value_counts(normalize=True)

So there's a 38.4% survival rate in the training set.

In [None]:
plt.figure(figsize=(4,4), dpi=90)
sns.countplot(x='Survived', data=df)
plt.title('Passenger Survival')

### Ticket

In [None]:
df['Ticket'].notnull().mean()

In [None]:
df['Ticket'].nunique()

In [None]:
df.shape[0]

Apparently there are duplicate tickets...? That's weird.

In [None]:
df['Ticket'].sample(10)

Maybe the prefix on some of the tickets is important? Let's grab it.

In [None]:
def ticket_prefix(s):
    'Find the content of the ticket before the ticket number'
    temp = s.split(' ')
    if len(temp) > 1:
        return ' '.join(temp[:-1])
    else:
        return 'NONE'

In [None]:
df.loc[:,'Ticket Prefix'] = df['Ticket'].apply(ticket_prefix)

In [None]:
df['Ticket Prefix'].nunique()

In [None]:
df['Ticket Prefix'].value_counts()

Some of the prefixes are very similar. Let's assume that the characters `.`, `/`, and whitespace are not significant, and that `SC/PARIS` is the same as `SC/Paris`.

In [None]:
df.loc[:,'Ticket Prefix'] = df['Ticket Prefix'].apply(lambda s: s.replace('.','').replace('/','').replace(' ','').upper())
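
As a rough illustration of what the prefix extraction and normalization do together, the sketch below applies both steps to a handful of ticket strings. The ticket values are illustrative examples in the dataset's style, and the helper name `normalize` is an assumption introduced here.

```python
def ticket_prefix(s):
    """Return everything before the last space-separated token, or 'NONE'."""
    parts = s.split(' ')
    return ' '.join(parts[:-1]) if len(parts) > 1 else 'NONE'

def normalize(prefix):
    """Drop '.', '/', and spaces, then upper-case the prefix."""
    return prefix.replace('.', '').replace('/', '').replace(' ', '').upper()

# Illustrative ticket strings (assumed, not read from the CSV)
tickets = ['SC/PARIS 2133', 'SC/Paris 2123', 'A/5 21171', 'A./5. 2152', '113803']

normalized = [normalize(ticket_prefix(t)) for t in tickets]
print(normalized)
```

Spelling variants such as `A/5` and `A./5.` collapse to a single category, and tickets with no prefix keep the `NONE` placeholder.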

In [None]:
df['Ticket Prefix'].nunique()

In [None]:
df['Ticket Prefix'].value_counts()

If there are fewer than 10 occurrences of a particular prefix, we'll reclassify it as "other."

In [None]:
vals = df['Ticket Prefix'].value_counts()

def other_prefix(prefix):
    if vals[prefix] < 10:
        return "OTHER"
    else:
        return prefix

In [None]:
df.loc[:,'Ticket Prefix'] = df['Ticket Prefix'].apply(other_prefix)
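
The same rare-category grouping can also be written without a Python-level function. This is a sketch of one vectorized alternative, assuming a made-up Series of prefixes; the names `prefixes`, `counts`, and `grouped` are hypothetical.

```python
import pandas as pd

# Hypothetical prefixes: two common values and two rare ones
prefixes = pd.Series(['PC'] * 12 + ['CA'] * 10 + ['WEP'] * 3 + ['SOC'] * 1)

# Look up each row's category frequency, then keep only the frequent categories
counts = prefixes.map(prefixes.value_counts())
grouped = prefixes.where(counts >= 10, 'OTHER')

print(grouped.value_counts())
```

`Series.where` keeps values where the condition holds and substitutes `'OTHER'` elsewhere, which matches the threshold of 10 used above.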

In [None]:
df['Ticket Prefix'].value_counts()

Now let's extract the ticket number. As it turns out, a very small number of tickets have no ticket number, so we'll assign NaN for those tickets.

In [None]:
def ticket_number(s):
    'Find the ticket number on a ticket'
    try:
        return np.int64(s.split(' ')[-1])
    except ValueError:  # The last token isn't a number
        return np.nan

In [None]:
df['Ticket Number'] = df['Ticket'].apply(ticket_number)

In [None]:
df['Ticket Number'].isnull().sum()

In [None]:
plt.figure(figsize=(12,4), dpi=90)
sns.distplot(df.loc[df['Ticket Number'].notnull(), 'Ticket Number'], bins=400, kde=False)
plt.ylabel('Count')
plt.title('Histogram of Ticket Number')

In [None]:
plt.figure(figsize=(12,4), dpi=90)
sns.distplot(df.loc[df['Ticket Number'].notnull(), 'Ticket Number'], bins=500, kde=False)
plt.xlim([0, 500000])
plt.ylabel('Count')
plt.title('Histogram of Ticket Number')

The histograms suggest five classes of ticket numbers: four 100,000-wide bands below 400,000, and everything from 400,000 up. Let's convert to those classes.

In [None]:
def tick_num_class(tick_num):
    if tick_num < 100000:
        return 'A'
    elif tick_num < 200000:
        return 'B'
    elif tick_num < 300000:
        return 'C'
    elif tick_num < 400000:
        return 'D'
    elif tick_num >= 400000:
        return 'E'
    else:
        return 'NA'

In [None]:
df['Ticket Number'] = df['Ticket Number'].apply(tick_num_class)  # Replace the float column with string labels

In [None]:
df['Ticket Number'].value_counts()
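
For comparison, the same banding could be expressed with `pd.cut`. This is a sketch under assumed inputs rather than the notebook's method: the sample numbers are made up, and `right=False` is chosen so that each band includes its lower edge, as the `if`/`elif` chain does.

```python
import numpy as np
import pandas as pd

# Hypothetical ticket numbers, including one missing value
numbers = pd.Series([21171, 113803, 250644, 349909, 3101282, np.nan])

# Band edges matching the thresholds used in tick_num_class
bins = [-np.inf, 100000, 200000, 300000, 400000, np.inf]
classes = pd.cut(numbers, bins=bins, labels=list('ABCDE'), right=False)

# Missing ticket numbers get their own 'NA' label
classes = classes.cat.add_categories('NA').fillna('NA')
print(classes.tolist())
```

The result is a categorical Series, so `pd.get_dummies` would still expand it into one indicator column per class.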

In [None]:
df.drop(labels='Ticket', axis=1, inplace=True)

## Dummy Variables
<a id='2.2'></a>

Create dummy variables for categorical features.

In [None]:
df.sample(5)

In [None]:
df.info()

In [None]:
incomplete_cols(df)

In [None]:
df.shape

In [None]:
df = pd.get_dummies(df, drop_first=True)

In [None]:
df.shape

In [None]:
df.sample(5)
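
To show what `drop_first=True` does, here is a small sketch on a made-up frame. The values are illustrative only; the point is that each categorical column with k levels becomes k-1 indicator columns, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 8.05],
})

dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
```

The first level of each category (`female` and `C` here, in sorted order) is dropped and acts as the baseline, which avoids perfectly collinear indicator columns.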

## Variable Selection
<a id='2.3'></a>

## Imputation with Mean Substitution
<a id='2.4'></a>

In [None]:
X = df.drop(labels='Survived', axis=1)
y = df.loc[:,'Survived']

In [None]:
incomplete_cols(X)

In [None]:
from sklearn.impute import SimpleImputer  # Replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22

In [None]:
imputer = SimpleImputer(strategy='mean').fit(X)

In [None]:
X = pd.DataFrame(imputer.transform(X), columns=X.columns)

In [None]:
incomplete_cols(X)
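
As a minimal sketch of what mean substitution does, the example below fills gaps in a made-up two-column frame. It assumes `SimpleImputer` from `sklearn.impute`, the mean-imputation class in current scikit-learn releases; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0],
                    'Fare': [7.25, 8.05, np.nan]})

imputer = SimpleImputer(strategy='mean').fit(toy)
filled = pd.DataFrame(imputer.transform(toy), columns=toy.columns)
print(filled)
```

Each missing entry is replaced by the mean of the observed values in its own column, so the missing `Age` becomes 30.0 and the missing `Fare` becomes 7.65.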

## Standardization
<a id='2.5'></a>

Transform the data to have columnwise zero mean and unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler().fit(X)

In [None]:
X = pd.DataFrame(scaler.transform(X), columns=X.columns)
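
The transformation can be checked by hand on a small array. This sketch, using made-up numbers, compares `StandardScaler` with the explicit formula `(x - mean) / std`, where the standard deviation is the population version (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three rows, two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual standardization, column by column
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 (up to floating-point error), so distance-based models such as k-nearest neighbors and the support vector classifier treat them on an equal footing.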

# Predictive Modeling
<a id='3'></a>

Train/test split:

In [None]:
X_train = X.loc[y.notnull()]
X_test = X.loc[y.isnull()]
y_train = y[y.notnull()]

How much training/test data is there?

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer

## Random Forest
<a id='3.1'></a>

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
param_grid = {}

In [None]:
param_grid = {'n_estimators': [10, 50, 250],
              'max_depth': [5, 8, 15, 25, 30, None],
              'min_samples_split': [2, 5, 10, 15, 100],
              'min_samples_leaf': [1, 2, 5, 10],
              'max_features': ['log2', 'sqrt', None],
              'n_jobs': [-1]
             }

In [None]:
model_rfc = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, scoring=make_scorer(accuracy_score), n_jobs=-1, verbose=1)

In [None]:
model_rfc.fit(X_train, y_train)

In [None]:
model_rfc.best_params_

In [None]:
model_rfc.best_score_

In [None]:
plt.figure(figsize=(4,12), dpi=90)
sns.barplot(y=X_train.columns, x=model_rfc.best_estimator_.feature_importances_, color='darkblue', orient='h')
plt.xlabel('RF Importance')  # Bars are horizontal, so importances run along the x-axis
plt.ylabel('Feature')
plt.title('Random Forest Feature Importances')

## Gradient Boosting
<a id='3.2'></a>

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
param_grid = {'max_depth': [3, 12, 25],
              'subsample': [0.6, 0.8, 1.0],
              'max_features': [None, 'sqrt', 'log2']
             }

In [None]:
model_gb = GridSearchCV(estimator=GradientBoostingClassifier(), param_grid=param_grid, scoring=make_scorer(accuracy_score), 
                        n_jobs=-1, verbose=1)

In [None]:
model_gb.fit(X_train, y_train)

In [None]:
model_gb.best_params_

In [None]:
model_gb.best_score_

## Logistic Regression
<a id='3.3'></a>

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
param_grid = {'penalty': ['l1', 'l2'],
              'C': [10**k for k in range(-3,3)],
              'class_weight': [None, 'balanced'],
              'warm_start': [True]
             }

In [None]:
# The liblinear solver supports both the 'l1' and 'l2' penalties in the grid
model_logreg = GridSearchCV(estimator=LogisticRegression(solver='liblinear'), param_grid=param_grid, scoring=make_scorer(accuracy_score),
                            n_jobs=-1, verbose=1)

In [None]:
model_logreg.fit(X_train, y_train)

In [None]:
model_logreg.best_params_

In [None]:
model_logreg.best_score_

## Gaussian Naive Bayes
<a id='3.4'></a>

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
model_gnb = GaussianNB()

In [None]:
model_gnb.fit(X_train, y_train)

In [None]:
accuracy_score(y_train, model_gnb.predict(X_train))
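
Note that this accuracy is measured on the same rows the model was fit on, whereas `GridSearchCV.best_score_` is a mean cross-validated score. For a like-for-like number, one option is `cross_val_score`. The sketch below uses synthetic data from `make_classification` so that it runs on its own; `X_demo` and `y_demo` are stand-ins for `X_train` and `y_train`, and `cv=5` is an assumed fold count.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the training data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# One accuracy value per fold; each fold is scored on rows the model did not see
cv_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring='accuracy')
mean_cv_accuracy = cv_scores.mean()

print(cv_scores)
print(mean_cv_accuracy)
```

Substituting `X_train` and `y_train` would give a Gaussian Naive Bayes score that is directly comparable with the grid-searched models.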

## Support Vector Classifier
<a id='3.5'></a>

In [None]:
from sklearn.svm import SVC

In [None]:
param_grid = {'C': [10**k for k in range(-3,4)],
              'class_weight': [None, 'balanced'],
              'shrinking': [True, False]
             }

In [None]:
model_svc = GridSearchCV(estimator=SVC(), param_grid=param_grid, scoring=make_scorer(accuracy_score), 
                         n_jobs=-1, verbose=1)

In [None]:
model_svc.fit(X_train, y_train)

In [None]:
model_svc.best_params_

In [None]:
model_svc.best_score_

## k-Nearest Neighbors
<a id='3.6'></a>

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
param_grid = {'n_neighbors': [1, 2, 4, 8, 16, 32, 64, 128, 256],
              'weights': ['uniform', 'distance'],
              'p': [1, 2, 3, 4, 5]
             }

In [None]:
model_knn = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid, scoring=make_scorer(accuracy_score), 
                    n_jobs=-1, verbose=1, return_train_score=True)

In [None]:
model_knn.fit(X_train, y_train)

In [None]:
model_knn.best_params_

In [None]:
model_knn.best_score_

# Conclusion
<a id='4'></a>

## Results Summary
<a id='4.1'></a>

In [None]:
print('Accuracy Scores (mean cross-validated accuracy unless noted)')
print('Random Forest: ', model_rfc.best_score_)
print('Gradient Boosting: ', model_gb.best_score_)
print('Logistic Regression: ', model_logreg.best_score_)
print('Gaussian Naive Bayes (training-set accuracy): ', accuracy_score(y_train, model_gnb.predict(X_train)))
print('Support Vector Classifier: ', model_svc.best_score_)
print('k-Nearest Neighbors: ', model_knn.best_score_)

The random forest is our best-performing model, so we use it for the test set predictions.

## Test Set Predictions
<a id='4.2'></a>

In [None]:
y_preds = model_rfc.predict(X_test)

In [None]:
submission = pd.DataFrame({'PassengerId':TestPassengerId, 'Survived':np.uint8(y_preds)})

In [None]:
submission.shape

In [None]:
submission.sample(5)

In [None]:
submission.to_csv('submission.csv', index=False)