## Investigating deaths on the Titanic
This notebook cleans the data from the Titanic Kaggle competition and then uses three algorithms (Logistic Regression, Random Forest, Gradient Boosting) to try to predict who survived. Much of the code was copied from or inspired by these two sources: https://www.dataquest.io/mission/75/improving-your-submission and https://github.com/elenacuoco/kaggle-competitions/blob/master/Titanic-For_Blog.ipynb.

The imports and setup

In [146]:
import pandas as pd
import numpy as np
import string
import operator
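from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides cross_val_score, so it is aliased to keep the calls below unchanged.
from sklearn import model_selection as cross_validation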
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

le = preprocessing.LabelEncoder()
enc=preprocessing.OneHotEncoder()

Read in the Data

In [147]:
rawdf=pd.read_csv("train.csv")
rawdf_test=pd.read_csv("test.csv")

## Cleaning the Data

The cleaning functions (source: https://github.com/elenacuoco/kaggle-competitions/blob/master/Titanic-For_Blog.ipynb)

In [148]:
# Utility to clean and munge the data
def substrings_in_string(big_string, substrings):
    # Return the first candidate that occurs in big_string, or NaN if none does
    for substring in substrings:
        if substring in big_string:
            return substring
    print(big_string)
    return np.nan

# A dictionary mapping family name to id
family_id_mapping = {}

def clean_and_munge_data(df):
    #treat a fare of 0 as missing (nan)
    df.Fare = df.Fare.map(lambda x: np.nan if x==0 else x)
    
    #Creating new family_size column
    df['Family_Size']=df['SibSp']+df['Parch']
    df['Family']=df['SibSp']*df['Parch']
    
    #creating a title column from name
    title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                'Don', 'Jonkheer']
    df['Title']=df['Name'].map(lambda x: substrings_in_string(x, title_list))

    #replacing all titles with mr, mrs, miss, master
    def replace_titles(x):
        title=x['Title']
        if title in ['Mr','Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
            return 'Mr'
        elif title in ['Master']:
            return 'Master'
        elif title in ['Countess', 'Mme','Mrs']:
            return 'Mrs'
        elif title in ['Mlle', 'Ms','Miss']:
            return 'Miss'
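        elif title =='Dr':
            # The Sex column holds lowercase 'male'/'female'
            if x['Sex']=='male':
                return 'Mr'
            else:
                return 'Mrs'
        elif pd.isnull(title) or title =='':
            # substrings_in_string returns NaN when no title matches
            if x['Sex']=='male':
                return 'Master'
            else:
                return 'Miss'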
        else:
            return title
        
    ##Family Grouping code taken from: https://www.dataquest.io/mission/75/improving-your-submission    

    # A function to get the id given a row
    def get_family_id(row):
        # Find the last name by splitting on a comma
        last_name = row["Name"].split(",")[0]
        # Create the family id
        family_id = "{0}{1}".format(last_name, row["Family_Size"])
        # Look up the id in the mapping
        if family_id not in family_id_mapping:
            if len(family_id_mapping) == 0:
                current_id = 1
            else:
                # Get the maximum id from the mapping and add one to it if we don't have an id
                current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
            family_id_mapping[family_id] = current_id
        return family_id_mapping[family_id]
    
    # Get the family ids with the apply method
    family_ids = df.apply(get_family_id, axis=1)

    # There are a lot of family ids, so we'll compress all of the families under 3 members into one code.
    family_ids[df["Family_Size"] < 3] = -1

    # Print the count of each unique id.
    print(family_ids.value_counts())

    df["FamilyId"] = family_ids

    df['Title']=df.apply(replace_titles, axis=1)




    #imputing nan fares with the median fare of the passenger's class
    for pclass in (1, 2, 3):
        class_median = df.loc[df.Pclass == pclass, 'Fare'].median()
        df.loc[df.Fare.isnull() & (df.Pclass == pclass), 'Fare'] = class_median

    #filling missing ages with the mean age of passengers who share the same title
    df['AgeFill']=df['Age']
    for title in ['Miss', 'Mrs', 'Mr', 'Master']:
        mean_age = df.loc[df.Title == title, 'Age'].mean()
        df.loc[df.Age.isnull() & (df.Title == title), 'AgeFill'] = mean_age

    df['AgeCat']=df['AgeFill']
    df.loc[ (df.AgeFill<=10) ,'AgeCat'] = 'child'
    df.loc[ (df.AgeFill>60),'AgeCat'] = 'aged'
    df.loc[ (df.AgeFill>10) & (df.AgeFill <=30) ,'AgeCat'] = 'adult'
    df.loc[ (df.AgeFill>30) & (df.AgeFill <=60) ,'AgeCat'] = 'senior'

    df.Embarked = df.Embarked.fillna('S')


    #Special case for cabins as nan may be signal
    #Compute the mask once: after the first assignment no Cabin value is null any more
    cabin_missing = df.Cabin.isnull()
    df.loc[cabin_missing,'Cabin'] = 0.5
    df.loc[~cabin_missing,'Cabin'] = 1.5
    #Fare per person

    df['Fare_Per_Person']=df['Fare']/(df['Family_Size']+1)

    #Age times class

    df['AgeClass']=df['AgeFill']*df['Pclass']
    df['ClassFare']=df['Pclass']*df['Fare_Per_Person']


    df['HighLow']=df['Pclass']
    df.loc[ (df.Fare_Per_Person<8) ,'HighLow'] = 'Low'
    df.loc[ (df.Fare_Per_Person>=8) ,'HighLow'] = 'High'
    
    #df['Title']=df['Name'].map(lambda x: substrings_in_string(x, title_list))

    #Encode each categorical column as floats with the shared LabelEncoder
    #(np.float was removed from recent NumPy releases, so the builtin float is used)
    for column in ['Sex', 'Ticket', 'Title', 'HighLow', 'AgeCat', 'Embarked']:
        df[column] = le.fit_transform(df[column]).astype(float)

    df = df.drop(['Name','Age','Cabin'], axis=1) #remove Name, Age and Cabin


    return df
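
The title column above is built by substring search, which can also match letters inside a surname or given name. As a hedged alternative, the sketch below pulls the title from its position between the comma and the first period of the name. The helper `extract_title` and the pattern are illustrative assumptions, not part of the original notebook or its sources.

```python
import re

# Hypothetical helper (not from the original notebook): in the Kaggle data a
# name looks like "Surname, Title. Given names", so capture the text that
# sits between the first comma and the first period after it.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

print(extract_title("Braund, Mr. Owen Harris"))
print(extract_title("Cumings, Mrs. John Bradley (Florence Briggs Thayer)"))
```

Titles extracted this way would still need to be grouped into Mr, Mrs, Miss and Master, as `replace_titles` does.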

Use the cleaning function to clean the data

In [149]:
df=clean_and_munge_data(rawdf)
df_test=clean_and_munge_data(rawdf_test)

-1      800
 14       8
 149      7
 63       6
 50       6
 59       6
 17       5
 384      4
 27       4
 25       4
 162      4
 8        4
 84       4
 340      4
 43       3
 269      3
 58       3
 633      2
 167      2
 280      2
 510      2
 90       2
 83       1
 625      1
 376      1
 449      1
 498      1
 588      1
dtype: int64
-1      384
 149      4
 25       3
 280      3
 27       2
 59       2
 633      2
 510      2
 167      2
 90       2
 162      1
 759      1
 449      1
 84       1
 269      1
 58       1
 43       1
 794      1
 918      1
 17       1
 14       1
 8        1
dtype: int64
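
Because `clean_and_munge_data` is called once per frame, the label encoder is refit on each frame separately, so the same label (a ticket string, for example) may receive different integer codes in the training and test data. One possible way around this, sketched here with made-up toy values and a hypothetical `encode` helper, is to build a single shared category list first.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up).
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["Q", "S", "S"]})

# One shared, sorted category list so a label maps to the same code everywhere.
categories = sorted(set(train["Embarked"]) | set(test["Embarked"]))

def encode(series, categories):
    # Hypothetical helper: position of each value in the shared category list.
    return pd.Categorical(series, categories=categories).codes.astype(float)

train["Embarked_code"] = encode(train["Embarked"], categories)
test["Embarked_code"] = encode(test["Embarked"], categories)
print(train)
print(test)
```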


In [150]:
df.describe()

| | PassengerId | Survived | Pclass | Sex | SibSp | Parch | Ticket | Fare | Embarked | Family_Size | Family | Title | FamilyId | AgeFill | AgeCat | Fare_Per_Person | AgeClass | ClassFare | HighLow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 0.647587 | 0.523008 | 0.381594 | 338.52862 | 32.689318 | 1.536476 | 0.904602 | 0.567901 | 1.860831 | 14.232323 | 29.819131 | 1.59147 | 20.401486 | 65.062477 | 32.118852 | 0.417508 |
| std | 257.353842 | 0.486592 | 0.836071 | 0.47799 | 1.102743 | 0.806057 | 200.850657 | 49.611639 | 0.791503 | 1.613459 | 1.979287 | 0.721066 | 69.886368 | 13.285423 | 1.428952 | 35.894413 | 33.676295 | 35.84521 | 0.493425 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0125 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.42 | 0.0 | 1.132143 | 0.92 | 3.396429 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 158.5 | 7.925 | 1.0 | 0.0 | 0.0 | 2.0 | -1.0 | 21.835616 | 0.0 | 7.5896 | 39.5 | 21.545883 | 0.0 |
| 50% | 446.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 337.0 | 14.5 | 2.0 | 0.0 | 0.0 | 2.0 | -1.0 | 30.0 | 2.0 | 8.6625 | 63.0 | 24.15 | 0.0 |
| 75% | 668.5 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 519.5 | 31.275 | 2.0 | 1.0 | 0.0 | 2.0 | -1.0 | 35.841667 | 3.0 | 24.5 | 91.75 | 28.5 | 1.0 |
| max | 891.0 | 1.0 | 3.0 | 1.0 | 8.0 | 6.0 | 680.0 | 512.3292 | 2.0 | 10.0 | 16.0 | 3.0 | 633.0 | 80.0 | 3.0 | 512.3292 | 222.0 | 512.3292 | 1.0 |
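
The AgeFill column summarized above has no missing values because the cleaning function fills each missing age with the mean age of passengers who share the same title. The same idea can be sketched in a more compact form with a grouped transform; the toy frame below uses made-up ages and is only an illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data with made-up ages; NaN marks a missing age.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Mean age per title, broadcast back to every row of that title
# (missing ages are skipped when the mean is computed).
title_mean_age = toy.groupby("Title")["Age"].transform("mean")

# Keep known ages and fill only the gaps.
toy["AgeFill"] = toy["Age"].fillna(title_mean_age)
print(toy)
```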


Logistic Regression model. Kaggle score: unknown (I did not have enough submissions left to check), probably around 0.76077.

In [151]:
# The columns we'll use to predict the target
predictorsLog = ["Pclass", "Sex", "AgeCat", "AgeFill", "Family_Size","Fare_Per_Person", "Fare", "Embarked"]

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for each of the 10 cross-validation folds.
scores = cross_validation.cross_val_score(alg, df[predictorsLog], df["Survived"], cv=10)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

# Train the algorithm using all the training data
alg.fit(df[predictorsLog], df["Survived"])

# Make predictions using the test set.
predictionsLog = alg.predict_proba(df_test[predictorsLog])[:,1]

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": df_test["PassengerId"],
        "Survived": np.round(predictionsLog).astype(int)
    })
# Number of passengers predicted to survive
print(sum(submission.Survived))
submission.to_csv("kaggleLogReg.csv", index=False)

0.792404097151
148
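
The submission above turns predicted probabilities into 0/1 labels with `np.round`. For values between 0 and 1 this behaves like a strict "greater than 0.5" threshold, because NumPy rounds exact halves to the nearest even number. The short sketch below, using made-up probabilities, illustrates the equivalence.

```python
import numpy as np

# Made-up survival probabilities, including the exact 0.5 boundary.
probabilities = np.array([0.10, 0.49, 0.50, 0.51, 0.90])

# What the notebook does: round to the nearest integer (0.5 rounds down to 0).
rounded = np.round(probabilities).astype(int)

# The same decision written as an explicit threshold.
thresholded = (probabilities > 0.5).astype(int)

print(rounded)
print(thresholded)
```

Writing the threshold explicitly makes it easier to experiment with a cut-off other than 0.5.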


This is an attempt to use a Random Forest classifier. Kaggle score: 0.79904. Without the parameters passed below (that is, with the defaults), the Kaggle score was only 0.72727. Using 5000 estimators and max_depth 10, the Kaggle score dropped to 0.77990.

In [152]:
# The columns we'll use to predict the target
predictorsRan = ["Pclass", "Sex", "AgeCat", "AgeFill", "Family_Size", "Fare_Per_Person", "Fare", "Embarked", "Title", "ClassFare", "FamilyId"]

# Initialize our algorithm (The parameters are taken from: https://github.com/elenacuoco/kaggle-competitions/blob/master/Titanic-For_Blog.ipynb)
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in newer scikit-learn
alg = RandomForestClassifier(n_estimators=350, criterion='entropy', max_depth=5, min_samples_split=2,
                             min_samples_leaf=2, max_features='sqrt', bootstrap=False, oob_score=False,
                             n_jobs=1, random_state=2, verbose=0)
# Compute the accuracy score for each of the 10 cross-validation folds.
scores = cross_validation.cross_val_score(alg, df[predictorsRan], df["Survived"], cv=10)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
# Train the algorithm using all the training data
alg.fit(df[predictorsRan], df["Survived"])

# Make predictions using the test set.
predictionsRan = alg.predict_proba(df_test[predictorsRan])[:,1]

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": df_test["PassengerId"],
        "Survived": np.round(predictionsRan).astype(int)
    })
# Number of passengers predicted to survive
print(sum(submission.Survived))
submission.to_csv("kaggleRanFor.csv", index=False)

0.837261945296
148
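
The cross-validation score printed above (about 0.837) is higher than the Kaggle score reported for the same model (0.79904), so it may be worth comparing parameter settings locally before spending a submission. The sketch below shows one way to do that with `cross_val_score` from `sklearn.model_selection`. It runs on synthetic data from `make_classification`, and the two candidate settings are illustrative assumptions, not the settings that were actually submitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned Titanic features (not the real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=2)

# Two illustrative settings: library defaults versus a shallower forest.
candidates = {
    "defaults": RandomForestClassifier(n_estimators=50, random_state=2),
    "shallow": RandomForestClassifier(n_estimators=50, criterion="entropy",
                                      max_depth=5, min_samples_leaf=2,
                                      random_state=2),
}

# Mean 5-fold accuracy for each candidate.
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in candidates.items()}

for name, score in sorted(results.items()):
    print(name, round(score, 3))
```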


Gradient Boosting model. Kaggle score: 0.78947.

In [153]:
# The columns we'll use to predict the target
predictorsGrad = ["Pclass", "Sex", "AgeFill", "Family_Size", "Fare", "Embarked", "Title", "FamilyId"]

# Initialize our algorithm (The parameters are taken from: https://github.com/elenacuoco/kaggle-competitions/blob/master/Titanic-For_Blog.ipynb)
alg = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)
# Compute the accuracy score for each of the 10 cross-validation folds.
scores = cross_validation.cross_val_score(alg, df[predictorsGrad], df["Survived"], cv=10)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
# Train the algorithm using all the training data
alg.fit(df[predictorsGrad], df["Survived"])

# Make predictions using the test set.
predictionsGrad = alg.predict_proba(df_test[predictorsGrad])[:,1]

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": df_test["PassengerId"],
        "Survived": np.round(predictionsGrad).astype(int)
    })
# Number of passengers predicted to survive
print(sum(submission.Survived))
submission.to_csv("kaggleGrad.csv", index=False)

0.827212007718
143


Create an ensemble prediction (Logistic Regression, Random Forest, Gradient Boosting). The predicted probabilities are averaged with weights 3:1:1, with the Random Forest counting three times. Overall the Random Forest is the best model, and combining it with the others brought the Kaggle score down to 0.79426, compared with 0.79904 for the Random Forest alone. I had hoped the ensemble would beat the Random Forest alone, but it makes sense that the weaker models would pull the score down.

In [154]:
# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": df_test["PassengerId"],
        "Survived": np.round((3*predictionsRan + predictionsLog + predictionsGrad)/5).astype(int)
    })
submission.to_csv("kaggleEnsem.csv", index=False)
# Number of passengers predicted to survive
print(sum(submission.Survived))

149
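
To make the weighting in the cell above concrete, here is a small worked example of the same formula, (3 * forest + logistic + boosting) / 5, on three made-up passengers. The probabilities are invented for illustration only.

```python
import numpy as np

# Made-up survival probabilities for three passengers from each model.
p_forest = np.array([0.9, 0.4, 0.2])
p_logistic = np.array([0.2, 0.7, 0.1])
p_boost = np.array([0.3, 0.8, 0.3])

# Weighted average with the Random Forest counted three times.
weights = np.array([3.0, 1.0, 1.0])
stacked = np.vstack([p_forest, p_logistic, p_boost])
blended = np.average(stacked, axis=0, weights=weights)

# Convert the blended probabilities into 0/1 predictions.
labels = (blended > 0.5).astype(int)

print(blended)  # approximately [0.64 0.54 0.2]
print(labels)   # [1 1 0]
```

For the second passenger the forest alone would predict 0 (0.4), but the other two models pull the blended probability just above 0.5.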


In the future I would like to try an ensemble in which each algorithm votes instead of averaging the predicted probabilities. It would also be interesting to analyse the ethnicity suggested by the names and see whether it correlates with survival.
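
A minimal sketch of the voting idea, assuming each model's probability is first turned into its own 0/1 vote and the majority wins. The probabilities are made up, and this is only one possible way to implement voting.

```python
import numpy as np

# Made-up survival probabilities for three passengers from each model.
p_forest = np.array([0.9, 0.4, 0.2])
p_logistic = np.array([0.2, 0.7, 0.1])
p_boost = np.array([0.3, 0.8, 0.3])

# Each model casts one vote per passenger: True means "survived".
votes = np.vstack([p_forest, p_logistic, p_boost]) > 0.5

# A passenger is predicted to survive when at least two of three models agree.
majority = (votes.sum(axis=0) >= 2).astype(int)

print(majority)  # [0 1 0]
```

With voting, one very confident model cannot outweigh two models that disagree with it, which is the main difference from averaging probabilities. scikit-learn also provides `sklearn.ensemble.VotingClassifier` with `voting='hard'` for this purpose.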