# Logistic Regression on Titanic Survivorship

This notebook performs a simple logistic regression on the data from the Kaggle project on predicting survivorship of the passengers of the Titanic.  We also grade the results using the full survivorship data, which was downloaded from [Vanderbilt Biostats](http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets) as the file titanic3.csv.

It is important that the column names in these CSV files match, and that the passenger names match as well, so that we can perform table merges correctly.  The Kaggle data has capitalized column names while the Vanderbilt data does not, so I modified titanic3.csv so that, where appropriate, the column names match.
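
The renaming can also be done in pandas rather than by hand. The sketch below is only an illustration: it assumes lower-case Vanderbilt names such as `survived` and `pclass` and Kaggle spellings such as `Survived` and `Pclass`. The small in-memory frame and the `rename_map` dictionary are hypothetical stand-ins for the real file and for whichever columns actually need changing.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of titanic3.csv (lower-case names).
vdf = pd.DataFrame({
    "pclass": [1, 3],
    "survived": [1, 0],
    "name": ["Passenger A", "Passenger B"],
    "sibsp": [0, 1],
})

# Assumed mapping from the lower-case names to the Kaggle-style names.
rename_map = {
    "pclass": "Pclass",
    "survived": "Survived",
    "name": "Name",
    "sibsp": "SibSp",
}

# rename() leaves any column that is not in the mapping untouched.
vdf = vdf.rename(columns=rename_map)
print(list(vdf.columns))

# Writing the result out would then be, for example:
# vdf.to_csv("titanic3_mod.csv", index=False)
```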

Also, these two data sources treat embedded double quotes differently.  The Vanderbilt data is consistent, but the Kaggle data is really borked on this one.  So I removed all embedded double quotes from the files.  This can be done with the following shell commands.

    sed -e 's/\"\"\"\"\"/\"/g' -e 's/\"\"\"/\"/g' -e 's/\"\"//g' train.csv >train_mod.csv
    sed -e 's/\"\"\"\"\"/\"/g' -e 's/\"\"\"/\"/g' -e 's/\"\"//g' test.csv >test_mod.csv

Now for the code.  First, the imports and the definition of a custom exception.

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.linear_model import LogisticRegression

class PredictionError(Exception):
    pass

We will read in the data (eventually) using the pandas read_csv function.  I chose to read all of it in and clean it in a separate step.  That is done by this function, which is largely what the Kaggle Python code does (only a little less ugly).

In [2]:
def clean_data(tdf):
    tdf['gender'] = tdf['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

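    # Missing values will be filled with medians.  numeric_only=True keeps
    # recent pandas versions from raising on the string columns.
    tdf = tdf.fillna(tdf.median(numeric_only=True))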

    # Drop all the features that will not be used in logistic regression
    tdf = tdf.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'PassengerId',
                    'Embarked'], axis=1)
    return tdf
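
To see what the two main steps of `clean_data` do, here is a small sketch on a made-up three-row frame. The values are invented for illustration and are not taken from the Kaggle files; only the column names follow the Kaggle layout. It encodes `Sex` as an integer and fills a missing `Age` with the column median.

```python
import numpy as np
import pandas as pd

# Invented toy rows; only the column names follow the Kaggle layout.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [20.0, np.nan, 40.0],
    "Fare": [7.25, 71.0, 8.05],
})

# Encode the categorical Sex column as 0 (female) / 1 (male).
toy["gender"] = toy["Sex"].map({"female": 0, "male": 1}).astype(int)

# Fill missing numeric values with each column's median;
# numeric_only=True skips the string column when computing medians.
toy = toy.fillna(toy.median(numeric_only=True))

print(toy[["Age", "gender"]])
```

The missing age becomes 30.0, the median of the two known ages. Median imputation is a simple default; it keeps every row but hides the fact that the value was missing.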

We would like to find out how accurate our predictions are, and since we have the full survivorship data in titanic3.csv (or titanic3_mod.csv) we can do that.  Here we assume that we have a DataFrame that contains both the ground truth column and the prediction column.  We pass in the DataFrame, the name of the ground truth column, and the name of the prediction column.

In [3]:
def score_results(df, gt, pred):
    """ Score the results in a DataFrame that has both the ground
        truth and the prediction. gt is the name of the ground truth
        column etc.  Return the raw accuracy, the precision and the recall.
        Note that precision and recall are most useful for evaluating
        predictions on rare events. """
    # .loc replaces the .ix indexer, which was removed in pandas 1.0
    correct = df.loc[df[gt] == df[pred]]
    incorrect = df.loc[df[gt] != df[pred]]
    true_pos = len(correct.loc[correct[gt] == 1])
    true_neg = len(correct.loc[correct[gt] == 0])
    false_neg = len(incorrect.loc[incorrect[pred] == 0])
    false_pos = len(incorrect.loc[incorrect[pred] == 1])

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    num_correct = len(correct)
    tot_pred = len(df)
    accuracy = num_correct / tot_pred
    return accuracy, precision, recall
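
As a cross-check on the counting logic, the same four cells of the confusion matrix can be read off a `pd.crosstab`. The sketch below uses six invented rows; the column names `Survived` and `pred_survival` match the ones used later in this notebook, but the values are made up.

```python
import pandas as pd

# Invented labels and predictions, for illustration only.
df = pd.DataFrame({
    "Survived":      [1, 1, 1, 0, 0, 0],
    "pred_survival": [1, 1, 0, 1, 0, 0],
})

# Rows are the ground truth, columns are the predictions.
cm = pd.crosstab(df["Survived"], df["pred_survival"])

true_pos = cm.loc[1, 1]
false_pos = cm.loc[0, 1]
false_neg = cm.loc[1, 0]
true_neg = cm.loc[0, 0]

accuracy = (true_pos + true_neg) / len(df)
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
print(accuracy, precision, recall)
```

Note that the lookups assume both classes appear in both columns; if the model never predicts one class, the crosstab will lack that column and the lookup will fail.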

Now finally we can read in the data, clean it, and feed it to the sklearn logistic regression model.

In [4]:
if __name__ == '__main__':
    trdf = pd.read_csv("train_mod.csv", header=0)
    trdf = clean_data(trdf)

    train = trdf.values
    logit = LogisticRegression()
    logit.fit(train[:, 1:], train[:, 0])

    testdf = pd.read_csv("test_mod.csv", header=0)
    pass_survival = testdf.copy()
    testdf = clean_data(testdf)

    c = logit.predict(testdf.values)

    # add the survival prediction column
    pass_survival['pred_survival'] = c

    # Get the ground truth dataset
    gtdf = pd.read_csv("titanic3_mod.csv", header=0)

    # Now merge the predictions with the ground truth.  This
    # is like a table join in the database world.
    ansdf = pd.merge(gtdf, pass_survival)

    # check to see if we got all the answers
    if len(pass_survival) != len(ansdf):
        raise PredictionError("Did not match all predictions with ground truth")

    accuracy, precision, recall = score_results(ansdf, 'Survived', 
                                                'pred_survival')

    print("accuracy: {}   precision: {}   recall: {}".format(accuracy,
           precision, recall))

accuracy: 0.7655502392344498   precision: 0.6973684210526315   recall: 0.6708860759493671
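
Called with no `on` argument, `pd.merge` joins on every column name the two frames share and keeps only rows that match in both (an inner join), which is why the length check in the cell above is needed. Below is a minimal sketch of one way to find out which rows failed to match. It assumes `Name` alone identifies a passenger, which may not hold for the real data, and the frames and names are invented. A left join with `indicator=True` marks each prediction that found no ground-truth row.

```python
import pandas as pd

# Invented stand-ins for the ground truth and the prediction frames.
gtdf = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "Survived": [1, 0],
})
pass_survival = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "pred_survival": [1, 0, 0],
})

# Keep every prediction; the _merge column records where each row was found.
merged = pd.merge(pass_survival, gtdf, on="Name", how="left", indicator=True)

# Rows marked left_only exist in the predictions but not in the ground truth.
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

Listing the unmatched names this way makes it easier to spot quoting or spelling differences between the two sources than a bare length comparison does.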
