Exploring Passenger Titles
==========================

The "Name" column from the dataset was parsed by a separate script into "Title" and "LastName" features. Initial analysis indicates that the "Title" feature could carry gender and age information. This notebook explores how these new features affect how well survival can be predicted.

In [257]:
# set up code and dataframe with training data
import pandas as pd
import numpy as np

from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.read_csv("../data/kaggle/train_expanded.csv")
print(df.shape)

(891, 14)


In [258]:
x_col_name, y_col_name = "Title", "Survived"

# transform the dataframe by title, so that we can determine which titles are most survivable
df_group_count = df.groupby([x_col_name, y_col_name])[x_col_name].count()
df_grouped = df_group_count.unstack(y_col_name)

df_grouped.fillna(0, inplace=True)

# Survival Rate by Title

Once we have passengers grouped by title, we can analyze the survival rate for each title. This shows us which titles have a high survival rate (e.g. "Mrs") and which have a low survival rate (e.g. "Mr").

In [259]:
# functions to explore survival rate
def get_total( row ):
    return row[0] + row[1]

def get_survival_rate( row ):
    return row[1] / ( row[0] + row[1] )

In [260]:
# analyze and print survival rate
df_grouped['Total']        = df_grouped.apply(get_total, axis=1)
df_grouped['SurvivalRate'] = round(df_grouped.apply(get_survival_rate, axis=1), 4)

df_grouped.sort_values(by='SurvivalRate', ascending=False, inplace=True)

print(df_grouped)

Survived          0      1  Total  SurvivalRate
Title                                          
the Countess    0.0    1.0    1.0        1.0000
Mlle            0.0    2.0    2.0        1.0000
Sir             0.0    1.0    1.0        1.0000
Ms              0.0    1.0    1.0        1.0000
Lady            0.0    1.0    1.0        1.0000
Mme             0.0    1.0    1.0        1.0000
Mrs            26.0   99.0  125.0        0.7920
Miss           55.0  127.0  182.0        0.6978
Master         17.0   23.0   40.0        0.5750
Col             1.0    1.0    2.0        0.5000
Major           1.0    1.0    2.0        0.5000
Dr              4.0    3.0    7.0        0.4286
Mr            436.0   81.0  517.0        0.1567
Jonkheer        1.0    0.0    1.0        0.0000
Rev             6.0    0.0    6.0        0.0000
Don             1.0    0.0    1.0        0.0000
Capt            1.0    0.0    1.0        0.0000
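
As a point of comparison, the same kind of summary can be sketched in a single pass with a grouped aggregation, since the mean of a 0/1 "Survived" column is the survival rate. This is a minimal sketch, not the notebook's method: the `toy` frame below is hypothetical data standing in for `train_expanded.csv`, and only the "Title" and "Survived" column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook.
toy = pd.DataFrame({
    "Title":    ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss"],
    "Survived": [0,    0,    1,    1,     1,     0],
})

# Count passengers and average the 0/1 survival flag per title.
summary = (
    toy.groupby("Title")["Survived"]
       .agg(Total="count", SurvivalRate="mean")
       .round({"SurvivalRate": 4})
       .sort_values("SurvivalRate", ascending=False)
)
print(summary)
```

This version skips the unstack and the row-wise helper functions, at the cost of not showing the separate died and survived counts.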


# Non-Surviving Title

From the above analysis, we can see that the following titles have a survival rate under 50%: "Capt", "Don", "Rev", "Jonkheer", "Mr", "Dr". We can create a function that predicts survival from the title alone. By splitting the training data into train and test subsets, we can estimate the mean accuracy of this title-based model.

In [261]:
def non_surviving_title( row ):
    nonsurviving_titles = ['Capt', 'Don', 'Rev', 'Jonkheer', 'Mr', 'Dr']

    if row['Title'] in nonsurviving_titles:
        return 0
    else:
        return 1

In [262]:
df['NonSurvivalTitle'] = df.apply(non_surviving_title, axis=1)
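
The row-wise `apply` above could also be written as a vectorized membership test. The following is a sketch under the assumption that the flag should stay 0 for low-survival titles and 1 otherwise, as in `non_surviving_title`; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook uses the full training data.
toy = pd.DataFrame({"Title": ["Mr", "Mrs", "Rev", "Miss", "Dr"]})

nonsurviving_titles = ["Capt", "Don", "Rev", "Jonkheer", "Mr", "Dr"]

# isin() marks low-survival titles; negating gives 1 for every other title.
toy["NonSurvivalTitle"] = (~toy["Title"].isin(nonsurviving_titles)).astype(int)
print(toy)
```

On a frame of 891 rows the difference in speed is negligible, so the choice is mostly one of readability.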

In [263]:
# calculate the mean accuracy score for a sample of the training data
x_col_name, y_col_name = "NonSurvivalTitle", "Survived"

X = df[[ x_col_name ]]
y = df[ y_col_name ].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

clf = svm.SVC( kernel='linear', C=1 ).fit( X_train, y_train )
clf.score(X_test, y_test)

0.8044692737430168
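
To put the held-out score in context, the title rule can also be scored directly from the counts in the survival-rate table, alongside a baseline that always predicts non-survival. This is a sketch rather than part of the original analysis: the counts are copied from the printed table, and the variable names are illustrative.

```python
# (died, survived) counts per title, copied from the survival-rate table.
counts = {
    "the Countess": (0, 1), "Mlle": (0, 2), "Sir": (0, 1), "Ms": (0, 1),
    "Lady": (0, 1), "Mme": (0, 1), "Mrs": (26, 99), "Miss": (55, 127),
    "Master": (17, 23), "Col": (1, 1), "Major": (1, 1), "Dr": (4, 3),
    "Mr": (436, 81), "Jonkheer": (1, 0), "Rev": (6, 0), "Don": (1, 0),
    "Capt": (1, 0),
}
nonsurviving_titles = {"Capt", "Don", "Rev", "Jonkheer", "Mr", "Dr"}

total = sum(d + s for d, s in counts.values())
died = sum(d for d, _ in counts.values())

# The rule is right for deaths under a low-survival title
# and for survivals under every other title.
correct = sum(
    d if title in nonsurviving_titles else s
    for title, (d, s) in counts.items()
)

rule_accuracy = correct / total
baseline_accuracy = max(died, total - died) / total
print(round(rule_accuracy, 4), round(baseline_accuracy, 4))
```

On these counts the rule is correct for 707 of 891 passengers (about 0.79), against about 0.62 for the always-died baseline, which is in line with the held-out score above.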

# Training Group by Title and Class

Some titles appear (anecdotally) to belong to the upper class (e.g. "the Countess", "Sir"). Others, like "Mr" and "Mrs", are less clear. This section tabulates the number of passengers for each combination of title and Pclass. From there, we can build a matrix of survivability for each combination.

In [264]:
x_col_name, y_col_name = "Title", "Pclass"

# count passengers by title and class, so that we can see how each title is distributed across classes
df_group_count = df.groupby([x_col_name, y_col_name])[x_col_name].count()
df_pclass_group = df_group_count.unstack(y_col_name)

df_pclass_group.fillna(0, inplace=True)

print(df_pclass_group)

Pclass            1     2      3
Title                           
Capt            1.0   0.0    0.0
Col             2.0   0.0    0.0
Don             1.0   0.0    0.0
Dr              5.0   2.0    0.0
Jonkheer        1.0   0.0    0.0
Lady            1.0   0.0    0.0
Major           2.0   0.0    0.0
Master          3.0   9.0   28.0
Miss           46.0  34.0  102.0
Mlle            2.0   0.0    0.0
Mme             1.0   0.0    0.0
Mr            107.0  91.0  319.0
Mrs            42.0  41.0   42.0
Ms              0.0   1.0    0.0
Rev             0.0   6.0    0.0
Sir             1.0   0.0    0.0
the Countess    1.0   0.0    0.0
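
The survivability matrix mentioned above is not built in this notebook. One possible sketch uses `pd.crosstab` for the counts and `pivot_table` for the mean of "Survived" per title and class. The `toy` frame is hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook.
toy = pd.DataFrame({
    "Title":    ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss"],
    "Pclass":   [1,    3,    3,    1,     3,     3],
    "Survived": [1,    0,    0,    1,     0,     1],
})

# Passenger counts per (Title, Pclass); missing combinations become 0.
counts = pd.crosstab(toy["Title"], toy["Pclass"])

# Survival rate per (Title, Pclass); missing combinations stay NaN.
rates = toy.pivot_table(
    index="Title", columns="Pclass", values="Survived", aggfunc="mean"
)
print(counts)
print(rates)
```

Leaving empty cells as NaN in `rates` keeps "no passengers" distinct from "no survivors", which a `fillna(0)` would blur.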


# Test Group by Title and Class

Compare the breakdown of title and class in the expanded test dataset with that of the training dataset.

In [265]:
df_test = pd.read_csv("../data/kaggle/test_expanded.csv")
x_col_name, y_col_name = "Title", "Pclass"

# count test-set passengers by title and class, for comparison with the training set
df_group_count = df_test.groupby([x_col_name, y_col_name])[x_col_name].count()
df_grouped = df_group_count.unstack(y_col_name)

df_grouped.fillna(0, inplace=True)

print(df_grouped)

Pclass     1     2      3
Title                    
Col      2.0   0.0    0.0
Dona     1.0   0.0    0.0
Dr       1.0   0.0    0.0
Master   2.0   2.0   17.0
Miss    14.0  16.0   48.0
Mr      52.0  59.0  129.0
Mrs     35.0  14.0   23.0
Ms       0.0   0.0    1.0
Rev      0.0   2.0    0.0
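
One way to make the comparison concrete is to diff the sets of titles in the two tables. The sketch below copies the title labels from the printed outputs above; the variable names are illustrative.

```python
# Titles copied from the training and test tables above.
train_titles = {
    "Capt", "Col", "Don", "Dr", "Jonkheer", "Lady", "Major", "Master",
    "Miss", "Mlle", "Mme", "Mr", "Mrs", "Ms", "Rev", "Sir", "the Countess",
}
test_titles = {
    "Col", "Dona", "Dr", "Master", "Miss", "Mr", "Mrs", "Ms", "Rev",
}

# Titles the model never saw during training.
unseen_in_train = sorted(test_titles - train_titles)

# Training titles with no counterpart in the test set.
missing_from_test = sorted(train_titles - test_titles)

print(unseen_in_train)
print(missing_from_test)
```

"Dona" is the only test title absent from the training data. Because it is not in `nonsurviving_titles`, `non_surviving_title` would assign it 1 by default rather than from any observed survival rate.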
