# Titanic Dataset ML Addendum
Kasey Cox / March 2018

### Question: Did a passenger survive the sinking of the Titanic or not?
My previous exploration of the Titanic dataset -- finding which passenger characteristics correlate with survival -- will serve as a basis for feature selection in this addendum.

In this part of the project, a classifier is trained on the provided training set and used to predict which passengers in the test set survived the sinking of the Titanic.

### Final output
_From Kaggle.com_:  
> You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.  
> 
> The file should have exactly 2 columns:  
> - PassengerId (sorted in any order)  
> - Survived (contains your binary predictions: 1 for survived, 0 for deceased)

In [1]:
# General imports and settings
%autosave 0
import numpy as np
import pandas as pd

Autosave disabled


***
# Strategy

1. Import and investigate (provided) train and test sets.
2. Feature selection
    - Use previous exploration to inform choices
3. Feature engineering
    - As appropriate
4. Select a classifier
    - Try several classifiers and compare them on accuracy, precision, and recall
5. Dump predictions as csv
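
The three metrics named in step 4 can all be derived from the four confusion counts (true/false positives and negatives). The block below is a minimal sketch of that arithmetic, not the notebook's eventual evaluation code: `classification_metrics` is a hypothetical helper, and the labels are toy values chosen for illustration, not Titanic results.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels where 1 = survived."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion counts
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    accuracy = (tp + tn) / float(len(y_true))
    # Guard against division by zero when nothing is predicted (or is) positive
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy labels for illustration only (assumption, not model output)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, prec, rec = classification_metrics(y_true, y_pred)
print("accuracy={:.3f} precision={:.3f} recall={:.3f}".format(acc, prec, rec))
```

Because the classes are not balanced, accuracy alone can flatter a model that mostly predicts the majority class, which is one reason to look at precision and recall alongside it.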

### 1. Import and investigate (provided) train and test sets.

In [2]:
# Import train.csv and test.csv as Pandas DataFrames
train_df = pd.read_csv('train.csv', header=0)
print "train:", train_df.shape, "\n", train_df.columns, "\n", len(train_df.columns), "\n"

test_df = pd.read_csv('test.csv', header=0)
print "test:", test_df.shape, "\n", test_df.columns, "\n", len(test_df.columns)

train: (891, 12) 
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object') 
12 

test: (418, 11) 
Index([u'PassengerId', u'Pclass', u'Name', u'Sex', u'Age', u'SibSp', u'Parch',
       u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object') 
11


In [3]:
# Rename some features to make meaning more clear
new_train_cols = ['Passenger_ID', 'Survived', 'Class', 'Name', 'Sex', 'Age',
       'Siblings_spouses_aboard', 'Parents_children_aboard', 'Ticket', 'Fare', 'Cabin_num', 'Port_of_Embarkation']
train_df.columns = new_train_cols
print train_df.columns, "\n", len(train_df.columns), "\n"

new_test_cols = ['Passenger_ID', 'Class', 'Name', 'Sex', 'Age',
       'Siblings_spouses_aboard', 'Parents_children_aboard', 'Ticket', 'Fare', 'Cabin_num', 'Port_of_Embarkation']
test_df.columns = new_test_cols
print test_df.columns, "\n", len(test_df.columns)

Index([u'Passenger_ID', u'Survived', u'Class', u'Name', u'Sex', u'Age',
       u'Siblings_spouses_aboard', u'Parents_children_aboard', u'Ticket',
       u'Fare', u'Cabin_num', u'Port_of_Embarkation'],
      dtype='object') 
12 

Index([u'Passenger_ID', u'Class', u'Name', u'Sex', u'Age',
       u'Siblings_spouses_aboard', u'Parents_children_aboard', u'Ticket',
       u'Fare', u'Cabin_num', u'Port_of_Embarkation'],
      dtype='object') 
11
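
The positional assignment above relies on the columns arriving in exactly the expected order. As a sketch of an alternative, a single name mapping could be applied to both frames with `DataFrame.rename`, so that a changed column order cannot mislabel a feature. `COLUMN_NAMES` and `rename_features` are hypothetical names introduced here, and the empty frames only stand in for `train_df` and `test_df`.

```python
import pandas as pd

# Kaggle's original column names -> the descriptive names used in this notebook
COLUMN_NAMES = {
    'PassengerId': 'Passenger_ID',
    'Pclass': 'Class',
    'SibSp': 'Siblings_spouses_aboard',
    'Parch': 'Parents_children_aboard',
    'Cabin': 'Cabin_num',
    'Embarked': 'Port_of_Embarkation',
}

def rename_features(df):
    """Return a copy of df with descriptive names; unmapped columns are unchanged."""
    return df.rename(columns=COLUMN_NAMES)

# Empty stand-in frames with the same headers as train.csv and test.csv
original_cols = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
                 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
demo_train = rename_features(pd.DataFrame(columns=original_cols))
demo_test = rename_features(
    pd.DataFrame(columns=[c for c in original_cols if c != 'Survived']))
print(list(demo_train.columns))
print(list(demo_test.columns))
```

The same dictionary works for the test set even though it lacks `Survived`, since `rename` only touches columns that are present.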


In [4]:
# Check distribution of Survived in training set
print train_df['Survived'].unique(), "\n"

print "Distribution:\n", train_df['Survived'].value_counts()

[0 1] 

Distribution:
0    549
1    342
Name: Survived, dtype: int64
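
The counts above imply a majority-class baseline: a model that predicts "deceased" for every passenger would be right for 549 of the 891 training rows. The sketch below works through that arithmetic as a reference point for comparing classifiers later. The counts are copied by hand from the output above, and `survived_counts` is an illustrative name rather than a variable from the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output (0 = deceased, 1 = survived)
survived_counts = pd.Series({0: 549, 1: 342})

majority_class = survived_counts.idxmax()
baseline_accuracy = survived_counts.max() / float(survived_counts.sum())
print("Majority class: {}, baseline accuracy: {:.3f}".format(
    majority_class, baseline_accuracy))
```

Any classifier chosen in step 4 should beat this figure (roughly 61.6%) on training data before it is worth submitting.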


In [5]:
# Check for NaNs in training
print "NaNs in training set features:"
for col in train_df.columns:
    print str(col) + ":", train_df[col].isnull().sum()

NaNs in training set features:
Passenger_ID: 0
Survived: 0
Class: 0
Name: 0
Sex: 0
Age: 177
Siblings_spouses_aboard: 0
Parents_children_aboard: 0
Ticket: 0
Fare: 0
Cabin_num: 687
Port_of_Embarkation: 2


In the previous exploration, Age did not correlate with survival, so it will not be selected as a feature and its 177 missing values in the training set need no handling.

In [6]:
# Check for NaNs in test
print "NaNs in test set features:"
for col in test_df.columns:
    print str(col) + ":", test_df[col].isnull().sum()

NaNs in test set features:
Passenger_ID: 0
Class: 0
Name: 0
Sex: 0
Age: 86
Siblings_spouses_aboard: 0
Parents_children_aboard: 0
Ticket: 0
Fare: 1
Cabin_num: 327
Port_of_Embarkation: 0
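
Apart from Age and Cabin_num, the only gaps are 2 missing Port_of_Embarkation values in the training set and 1 missing Fare in the test set. If either column ends up among the selected features, those gaps would need filling first. The block below sketches one common option, assumed here rather than taken from the notebook: fill Fare with the training median and Port_of_Embarkation with the training mode. `fill_missing` is a hypothetical helper and the frames are toy data, not Titanic rows.

```python
import numpy as np
import pandas as pd

def fill_missing(train, test):
    """Fill Fare and Port_of_Embarkation gaps using statistics from the training set."""
    train, test = train.copy(), test.copy()
    fare_median = train['Fare'].median()
    port_mode = train['Port_of_Embarkation'].mode()[0]  # mode() ignores NaN
    for df in (train, test):
        df['Fare'] = df['Fare'].fillna(fare_median)
        df['Port_of_Embarkation'] = df['Port_of_Embarkation'].fillna(port_mode)
    return train, test

# Toy frames for illustration only (assumption, not real passengers)
toy_train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05],
                          'Port_of_Embarkation': ['S', np.nan, 'S']})
toy_test = pd.DataFrame({'Fare': [np.nan, 26.0],
                         'Port_of_Embarkation': ['Q', 'C']})
filled_train, filled_test = fill_missing(toy_train, toy_test)
print(filled_train)
print(filled_test)
```

Taking both statistics from the training set keeps information from the test set out of the preprocessing step.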


### 2. Feature selection

### 3. Feature engineering

### 4. Select a classifier

### 5. Dump predictions as csv
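
Kaggle expects the original `PassengerId` header, while the frames in this notebook use the renamed `Passenger_ID` column, so the submission has to map the name back. The block below is a minimal sketch of that step under stated assumptions: `make_submission` is a hypothetical helper, the `path` argument is optional, and the predictions shown are placeholders rather than output from any classifier.

```python
import pandas as pd

def make_submission(test_df, predictions, path=None):
    """Build the two-column Kaggle submission; write it to `path` if one is given."""
    submission = pd.DataFrame({
        'PassengerId': test_df['Passenger_ID'].values,
        'Survived': [int(p) for p in predictions],
    }, columns=['PassengerId', 'Survived'])
    if path is not None:
        # index=False keeps the file to exactly two columns plus the header row
        submission.to_csv(path, index=False)
    return submission

# Toy stand-in for test_df and placeholder predictions (assumptions, not results)
toy_test = pd.DataFrame({'Passenger_ID': [892, 893, 894]})
placeholder_predictions = [0, 1, 0]
submission = make_submission(toy_test, placeholder_predictions)
print(submission)
```

With the real test set, the returned frame should have 418 rows, which is worth checking before the file is uploaded.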