# Titanic Dataset ML Addendum
Kasey Cox / March 2018

### Question: Did a passenger survive the sinking of the Titanic or not?
My previous exploration of the Titanic dataset -- finding which passenger characteristics correlate with survival -- will serve as a basis for feature selection in this addendum.

For this part of the project, I will train a machine learning classifier and use it to predict which passengers in the test set survived the sinking of the Titanic.

### Final output
_From Kaggle.com_:  
> You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.  
> 
> The file should have exactly 2 columns:  
> - PassengerId (sorted in any order)  
> - Survived (contains your binary predictions: 1 for survived, 0 for deceased)

In [1]:
# General imports and settings
%autosave 0
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

Autosave disabled


***
# Strategy

1. Import and investigate (provided) train and test sets.
2. Feature selection
    - Use previous exploration to inform choices
3. Feature engineering
    - As appropriate
4. Select a classifier
    - Try and test (accuracy, precision, recall) classifiers
5. Dump predictions as csv

### 1. Import and investigate (provided) train and test sets.

In [2]:
# Import train.csv and test.csv as Pandas DataFrames
train_df = pd.read_csv('train.csv', header=0)
print "train:", train_df.shape, "\n", train_df.columns, "\n", len(train_df.columns), "\n"

test_df = pd.read_csv('test.csv', header=0)
print "test:", test_df.shape, "\n", test_df.columns, "\n", len(test_df.columns)

train: (891, 12) 
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object') 
12 

test: (418, 11) 
Index([u'PassengerId', u'Pclass', u'Name', u'Sex', u'Age', u'SibSp', u'Parch',
       u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object') 
11


In [3]:
# Rename some features to make meaning more clear
new_train_cols = ['PassengerId', 'Survived', 'Class', 'Name', 'Sex', 'Age',
       'Siblings_spouses_aboard', 'Parents_children_aboard', 'Ticket', 'Fare', 'Cabin_num', 'Port_of_Embarkation']
train_df.columns = new_train_cols
print train_df.columns, "\n", len(train_df.columns), "\n"

new_test_cols = ['PassengerId', 'Class', 'Name', 'Sex', 'Age',
       'Siblings_spouses_aboard', 'Parents_children_aboard', 'Ticket', 'Fare', 'Cabin_num', 'Port_of_Embarkation']
test_df.columns = new_test_cols
print test_df.columns, "\n", len(test_df.columns)

Index([u'PassengerId', u'Survived', u'Class', u'Name', u'Sex', u'Age',
       u'Siblings_spouses_aboard', u'Parents_children_aboard', u'Ticket',
       u'Fare', u'Cabin_num', u'Port_of_Embarkation'],
      dtype='object') 
12 

Index([u'PassengerId', u'Class', u'Name', u'Sex', u'Age',
       u'Siblings_spouses_aboard', u'Parents_children_aboard', u'Ticket',
       u'Fare', u'Cabin_num', u'Port_of_Embarkation'],
      dtype='object') 
11


In [4]:
# Check distribution of Survived in training set
print train_df['Survived'].unique(), "\n"

print "Distribution:\n", train_df['Survived'].value_counts()

[0 1] 

Distribution:
0    549
1    342
Name: Survived, dtype: int64
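
The counts above also give a reference point for later accuracy scores: a model that always predicts the majority label would already be right for every non-survivor. A minimal sketch of that arithmetic, using only the two counts printed above (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above
survived_counts = {0: 549, 1: 342}

total = sum(survived_counts.values())  # 891 passengers in the training set
majority_label = max(survived_counts, key=survived_counts.get)

# float() keeps the division exact on both Python 2 and Python 3
baseline_accuracy = float(survived_counts[majority_label]) / total

print("Majority label: {}".format(majority_label))
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Any classifier worth keeping should score clearly above this baseline of roughly 0.616 on accuracy.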


In [5]:
# Check for NaNs in training
print "NaNs in training set features:"
print train_df.isnull().sum()

NaNs in training set features:
PassengerId                  0
Survived                     0
Class                        0
Name                         0
Sex                          0
Age                        177
Siblings_spouses_aboard      0
Parents_children_aboard      0
Ticket                       0
Fare                         0
Cabin_num                  687
Port_of_Embarkation          2
dtype: int64


Age did not correlate with survival in the previous exploration, so it will not be selected as a feature and its many missing values do not matter here.

In [6]:
# Check for NaNs in test
print "NaNs in test set features:"
print test_df.isnull().sum()

NaNs in test set features:
PassengerId                  0
Class                        0
Name                         0
Sex                          0
Age                         86
Siblings_spouses_aboard      0
Parents_children_aboard      0
Ticket                       0
Fare                         1
Cabin_num                  327
Port_of_Embarkation          0
dtype: int64


### 2. Feature selection
Based on the previous investigation and on how few values are missing for these columns in the training and test data sets, I am selecting the following features:
1. `Sex`
2. `Class`
3. `Fare`

**Note 1:** Because `Sex` is a string, it must be converted to a numerical value before scikit-learn will accept it. 0 will correspond to `male` and 1 will correspond to `female`. Both the training and test sets will undergo this change.

In [7]:
# Transforming 'Sex' column values
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

# Checking that it worked...
print train_df['Sex'].head()
print test_df['Sex'].head()

0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: int64
0    0
1    1
2    0
3    0
4    1
Name: Sex, dtype: int64
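
One thing worth checking after a dictionary `map` is that no value was left unmapped: pandas returns NaN for any key missing from the dictionary, which would also turn the column into floats. A small sketch on a made-up Series (`sex`, `sex_codes`, and the misspelled `'Female'` entry are hypothetical, not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical toy column; 'Female' (capitalised) is deliberately not in the mapping
sex = pd.Series(['male', 'female', 'Female', 'male'])
sex_codes = {'male': 0, 'female': 1}

# Values without a matching key become NaN
mapped = sex.map(sex_codes)

unmapped_count = int(mapped.isnull().sum())
print("Unmapped values: {}".format(unmapped_count))
```

In the notebook the `head()` outputs show `dtype: int64`, which indicates that every value in both sets was mapped.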


**Note 2:** Because `Fare` contains a missing value (NaN) in the test set, scikit-learn's `predict` methods will not accept it as is. I am replacing the NaN with the mean `Fare` of the whole test set. The mean `Fare` of passengers in the same `Class` would likely be closer to the true value, since `Fare` and `Class` are related, but because only one data point is affected, the choice should have little effect on the results.

In [8]:
# The row with the NaN
test_df[test_df['PassengerId'] == 1044]

| | PassengerId | Class | Name | Sex | Age | Siblings_spouses_aboard | Parents_children_aboard | Ticket | Fare | Cabin_num | Port_of_Embarkation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | 1044 | 3 | Storey, Mr. Thomas | 0 | 60.5 | 0 | 0 | 3701 | NaN | NaN | S |


In [9]:
# Replace the NaN (row index 152) with the mean Fare of the whole test set
test_df.loc[152, 'Fare'] = test_df['Fare'].mean()

# Correct Sex and Class back to integer type
test_df['Sex'] = test_df['Sex'].apply(int)
test_df['Class'] = test_df['Class'].apply(int)

In [10]:
# Checking that it worked...
test_df[test_df['PassengerId'] == 1044]

| | PassengerId | Class | Name | Sex | Age | Siblings_spouses_aboard | Parents_children_aboard | Ticket | Fare | Cabin_num | Port_of_Embarkation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | 1044 | 3 | Storey, Mr. Thomas | 0 | 60.5 | 0 | 0 | 3701 | 35.627188 | NaN | S |
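
Note 2 mentions the average `Fare` of passengers in the same `Class`, while the cell above fills the gap with `test_df['Fare'].mean()`, the mean of the whole column. For comparison, here is a minimal sketch of a class-wise fill using `groupby` and `transform`. The frame `toy_df` and its values are hypothetical stand-ins, not the real test data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for test_df
toy_df = pd.DataFrame({
    'Class': [1, 1, 3, 3, 3],
    'Fare': [80.0, 100.0, 8.0, 12.0, np.nan],
})

# Mean Fare of each row's own Class, aligned to the original rows
# (NaNs are skipped when the group means are computed)
class_mean_fare = toy_df.groupby('Class')['Fare'].transform('mean')

# Fill only the missing entries; existing fares are left untouched
toy_df['Fare'] = toy_df['Fare'].fillna(class_mean_fare)
print(toy_df)
```

In this toy frame the missing third-class fare becomes 10.0 (the third-class mean) rather than 50.0 (the overall mean), which illustrates how far apart the two choices can be when `Fare` depends strongly on `Class`.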


### 3. Feature Engineering
No feature engineering seems appropriate at this stage.

### 4. Select a classifier
In this step, I try several classifiers and evaluate each one on a held-out validation set.

For splitting the original training data into training and validation sets, I am simply using scikit-learn's `train_test_split` function. Something more involved like `StratifiedShuffleSplit` is not necessary because the distribution of survival status in the original training data is not overly skewed.

_However_, because `train_test_split` shuffles the data randomly, a particular split could still end up heavily skewed. For that reason, precision and recall are useful metrics alongside accuracy here.
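
If a skewed split is a concern, one option that stays within `train_test_split` is its `stratify` argument, which keeps the class ratio the same in both parts. This is an alternative to the plain split used in this notebook, not what the cells below do. A sketch on made-up data with the same 549:342 ratio (`toy_features` and `toy_labels` are hypothetical):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same 549:342 class ratio as the training set
toy_labels = [0] * 549 + [1] * 342
toy_features = [[i] for i in range(len(toy_labels))]

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_features, toy_labels,
    test_size=0.25, random_state=22,
    stratify=toy_labels,  # preserve the class ratio in both parts
)

overall_rate = float(sum(toy_labels)) / len(toy_labels)
val_rate = float(sum(y_va)) / len(y_va)
print("Overall survival rate:    {:.3f}".format(overall_rate))
print("Validation survival rate: {:.3f}".format(val_rate))
```

With stratification the two rates differ only by rounding, whatever `random_state` is used.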

In [11]:
# Features to use
features = ['Sex', 'Fare', 'Class']

# Build both a features data list and a labels list (to be given to sklearn classes)
def extract_features_labels(df, features, include_survived=False):
    """
    Takes data as a Pandas DataFrame (df) and a list of feature names as strings (features).

    Extracts the desired features from each record and, if include_survived is True,
    the 'Survived' label as well.

    Returns two values: a list of per-record feature lists, and a list of labels
    (None when include_survived is False).
    """
    # One inner list per record, with values in the order given by features
    feat_data_list = df[features].values.tolist()

    # Labels are only available in the training data
    labels_list = df['Survived'].tolist() if include_survived else None

    return feat_data_list, labels_list
        

# Extract the features and labels from the training set
features_data, labels = extract_features_labels(train_df, features, include_survived=True)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(features_data, labels, test_size = 0.25, random_state=22)

**Naive Bayes**

In [12]:
# Fit classifier with training set
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

# Evaluate with validation set
print "Score:", nb_clf.score(X_val, y_val)
print "Recall:", recall_score(y_val, nb_clf.predict(X_val))
print "Precision:", precision_score(y_val, nb_clf.predict(X_val))

Score: 0.757847533632
Recall: 0.688888888889
Precision: 0.704545454545
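
To make the two metrics concrete: precision is the share of predicted survivors who really survived, TP / (TP + FP), and recall is the share of real survivors the model found, TP / (TP + FN). The sketch below computes both by hand on made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's predictions) and compares them with scikit-learn's functions:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = survived, 0 = deceased
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors predicted as survivors
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # deceased predicted as survivors
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors predicted as deceased

precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)

print("Precision: {:.2f} (sklearn: {:.2f})".format(precision, precision_score(y_true, y_pred)))
print("Recall:    {:.2f} (sklearn: {:.2f})".format(recall, recall_score(y_true, y_pred)))
```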


**Decision Tree**

In [13]:
# GridSearchCV (parameter tuning)
from sklearn.model_selection import GridSearchCV
clf_to_tune = DecisionTreeClassifier()
parameters = {'min_samples_split':(2, 3, 4, 5, 6), 'max_depth':(None, 5, 10, 15, 20)}
cv_clf = GridSearchCV(clf_to_tune, parameters)
cv_clf.fit(X_train, y_train)
print cv_clf.best_params_

{'min_samples_split': 5, 'max_depth': 10}
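
Rather than copying the best parameters by hand into a new classifier, the fitted search object can also be used directly: with the default `refit=True`, `GridSearchCV` retrains the best configuration on all the data it was given and exposes it as `best_estimator_`. A sketch on synthetic data (`X_toy`, `y_toy`, and the fixed `random_state` values are assumptions for illustration; the notebook's own tree does not set a seed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three-feature training data
X_toy, y_toy = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Same grid as in the cell above
param_grid = {
    'min_samples_split': (2, 3, 4, 5, 6),
    'max_depth': (None, 5, 10, 15, 20),
}

# A fixed random_state makes the tree, and so the search result, repeatable
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_toy, y_toy)

# Best configuration, already refit on all of X_toy
tuned_clf = search.best_estimator_
print(search.best_params_)
print("Mean cross-validated accuracy: {:.3f}".format(search.best_score_))
```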


In [14]:
# Fit classifier with training set
dt_clf = DecisionTreeClassifier(min_samples_split=5, max_depth=10)
dt_clf.fit(X_train, y_train)

# Evaluate with validation set
print "Score:", dt_clf.score(X_val, y_val)
print "Recall:", recall_score(y_val, dt_clf.predict(X_val))
print "Precision:", precision_score(y_val, dt_clf.predict(X_val))

Score: 0.820627802691
Recall: 0.7
Precision: 0.828947368421
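
The scores above come from a single train/validation split, so they depend on which passengers happened to land in the validation set. A k-fold estimate averages over several splits and is usually steadier. A minimal sketch with `cross_val_score`, again on synthetic data rather than the Titanic features (`X_toy`, `y_toy`, and the `candidates` dictionary are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three-feature training data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

candidates = {
    'GaussianNB': GaussianNB(),
    'DecisionTree': DecisionTreeClassifier(min_samples_split=5, max_depth=10, random_state=0),
}

mean_scores = {}
for name, clf in candidates.items():
    # Five accuracy scores, one per held-out fold
    fold_scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring='accuracy')
    mean_scores[name] = fold_scores.mean()
    print("{}: {:.3f} (+/- {:.3f})".format(name, fold_scores.mean(), fold_scores.std()))
```

Passing `scoring='precision'` or `scoring='recall'` instead gives the same kind of averaged estimate for the other two metrics.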


In [15]:
# Correct PassengerId back to integer type
train_df['PassengerId'] = train_df['PassengerId'].apply(int)
test_df['PassengerId'] = test_df['PassengerId'].apply(int)

### 5. Dump predictions as csv

In [16]:
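# Extract the features from the test set (it has no 'Survived' labels)
test_features_data = extract_features_labels(test_df, features, include_survived=False)[0]

# Predict with the tuned decision tree, which scored higher than Naive Bayes on the validation set
preds = dt_clf.predict(test_features_data)

# Build the two-column submission required by Kaggle
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': preds})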

submission.to_csv("submission.csv", index=False)
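
Before uploading, it can help to check the frame against the requirements quoted under "Final output": exactly 418 rows, exactly the two columns `PassengerId` and `Survived`, and binary predictions. The helper below is a hypothetical sketch, not part of the original notebook; the uniqueness check on `PassengerId` is an extra assumption that Kaggle's text does not state explicitly, and `toy_submission` is a made-up stand-in for the real `submission` frame.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Return a list of problems found; an empty list means the frame looks valid."""
    problems = []
    # Exactly two columns, PassengerId and Survived
    if len(df.columns) != 2 or set(df.columns) != {'PassengerId', 'Survived'}:
        problems.append("columns must be exactly PassengerId and Survived")
        return problems
    if len(df) != expected_rows:
        problems.append("expected {} rows, found {}".format(expected_rows, len(df)))
    if not df['Survived'].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    # Assumption: each passenger should appear once
    if df['PassengerId'].duplicated().any():
        problems.append("PassengerId values must be unique")
    return problems

# Hypothetical stand-in for the real submission frame
toy_submission = pd.DataFrame({
    'PassengerId': list(range(892, 892 + 418)),
    'Survived': [0, 1] * 209,
})
print(check_submission(toy_submission))           # no problems expected
print(check_submission(toy_submission.head(10)))  # wrong row count
```

In the notebook, the same call would be `check_submission(submission)`.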