# Titanic

This is a quick response to some thoughts I had on [Chris Deotte's](https://www.kaggle.com/cdeotte) great kernel [Titanic using Name only](https://www.kaggle.com/cdeotte/titanic-using-name-only-0-81818/notebook), which I definitely suggest you look at if you haven't already.

To summarise quickly, his approach makes predictions based on the following rules:

* All males die except boys in families where all females and boys live.

* All females live except those in families where all females and boys die.

Boys are defined as passengers with the title "Master", and families are defined as groups of people with the same surname.

I was wondering what would happen if you grouped passengers sharing the same ticket number, instead of passengers with the same surname. My reasoning was:

* Passengers with the same surname may not belong to the same family (one surname can cover several unrelated families).

* There may be important groups that do not share a surname, for example friends travelling together or unmarried couples.

So this notebook uses Chris' approach with ticket grouping and compares the results.
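To make the difference between the two groupings concrete, here is a minimal sketch on a few made-up rows. It assumes the Kaggle `Name` format ("Surname, Title. Given names"); the passengers, ticket numbers and the `Surname` column are invented for illustration and are not taken from the data.

```python
import pandas as pd

# Made-up passengers (hypothetical, not real rows from train.csv)
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Smith, Mrs. Jane',
             'Smith, Miss. Ann', 'Jones, Miss. Mary'],
    'Ticket': ['1001', '1001', '2002', '2002'],
})

# Surname is the part of Name before the first comma
toy['Surname'] = toy.Name.str.split(',').str[0].str.strip()

# Group sizes under each definition of "group"
by_surname = toy.groupby('Surname').size()
by_ticket = toy.groupby('Ticket').size()

print(by_surname)
print(by_ticket)
```

In this toy example Ann Smith shares a surname with the other Smiths but a ticket with Mary Jones, so the two groupings place her in different groups.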

UPDATE: Now grouping by the engineered TicketId feature that Chris suggested, rather than the raw ticket number.

In [1]:
# load the data
import pandas as pd
df = pd.read_csv('train.csv',index_col='PassengerId')

In [2]:
# select females and boys (title "Master", or any male under 13)
boy = (df.Name.str.contains('Master')) | ((df.Sex=='male') & (df.Age<13))
female = df.Sex=='female'
boy_or_female = boy | female

# TicketId feature, again suggested by Chris Deotte:
# ticket number without its last digit, with the fare appended
df['TicketId'] = df.Ticket.str[:-1] + '-' + df.Fare.astype(str)
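As a rough illustration of what TicketId does, the sketch below builds it for three made-up tickets. One plausible reading (an assumption, not stated in the original) is that dropping the last digit lets consecutively numbered tickets fall into one group, while appending the fare keeps apart tickets that only happen to share a prefix. The ticket numbers and fares are invented.

```python
import pandas as pd

# Made-up tickets (hypothetical values, not real rows from train.csv)
toy = pd.DataFrame({
    'Ticket': ['113781', '113783', '113789'],
    'Fare': [151.55, 151.55, 26.0],
})

# Same construction as in the cell above
toy['TicketId'] = toy.Ticket.str[:-1] + '-' + toy.Fare.astype(str)

print(toy)
print(toy.TicketId.nunique(), 'distinct groups')
```

The first two tickets differ only in their last digit and have the same fare, so they end up with the same TicketId; the third shares the prefix but has a different fare, so it forms its own group.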

In [3]:
# function to calculate boy+female survival rate
# amongst passengers grouped by "group"
def group_survival(group):
    # restrict the training data to women+boys and group by "group"
    grouped = df[boy_or_female].groupby(group).Survived

    # no. women+boys in each group
    n_group = grouped.count()

    # survival rate amongst women+boys in each group
    surv_group = grouped.mean()

    return n_group, surv_group
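The following sketch shows the same calculation on a small made-up frame, to make clear that adult men are excluded before the group statistics are computed. The rows, the `T1`/`T2` group labels and the variable names are hypothetical.

```python
import pandas as pd

# Made-up training rows (hypothetical, not real data)
toy_train = pd.DataFrame({
    'Sex': ['female', 'male', 'male', 'female', 'female'],
    'Name': ['A, Mrs. x', 'A, Master. y', 'A, Mr. z',
             'B, Miss. u', 'B, Mrs. v'],
    'Age': [30, 4, 35, 20, 45],
    'TicketId': ['T1', 'T1', 'T1', 'T2', 'T2'],
    'Survived': [1, 1, 0, 0, 0],
})

# Women and boys only, using the same definition as the notebook
is_boy = toy_train.Name.str.contains('Master') | (
    (toy_train.Sex == 'male') & (toy_train.Age < 13))
mask = (toy_train.Sex == 'female') | is_boy

grouped = toy_train[mask].groupby('TicketId').Survived
n_group = grouped.count()     # no. women+boys per group
surv_group = grouped.mean()   # their survival rate per group

print(n_group)
print(surv_group)
```

Group `T1` contains three passengers, but only the woman and the boy are counted, so the adult man who died does not lower the group's survival rate.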

In [4]:
# function to create relevant features for test data
def create_features(frame, group):
    
    # new features to engineer from test data columns
    frame['Boy'] = (frame.Name.str.contains('Master')) | ((frame.Sex=='male') & (frame.Age<13))
    frame['Female'] = (frame.Sex=='female').astype(int)
   
    frame['TicketId'] = frame.Ticket.str[:-1] + '-' + frame.Fare.astype(str)

    # female+boy survival in training data grouped by 'group'
    n_group, surv_group = group_survival(group)
    
    # no. women+boys in the passenger's group in the training data;
    # 0 if the group does not appear in the training data
    frame['NGroup'] = frame[group].map(n_group).fillna(0)

    # women+boys survival rate for the group in the training data;
    # 0 if the group does not appear in the training data
    frame['GroupSurv'] = frame[group].map(surv_group).fillna(0)

    # return data frame only including features needed for prediction
    return frame[['Female','Boy','NGroup','GroupSurv']]
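The lookup of training-set statistics for each test passenger can be sketched in isolation. The example below uses `Series.map` followed by `fillna(0)` as one way to express "use the training value if the group was seen, otherwise 0"; the group labels and counts are made up.

```python
import pandas as pd

# Hypothetical women+boys counts per group from the training data
n_group = pd.Series({'T1': 2, 'T2': 3})

# Hypothetical test passengers; 'T9' never appears in training
test_groups = pd.Series(['T1', 'T9', 'T2'], name='TicketId')

# Known groups get their training count, unknown groups get 0
n_lookup = test_groups.map(n_group).fillna(0).astype(int)

print(n_lookup.tolist())
```

A value of 0 for NGroup is what later tells the prediction rule that nothing is known about the passenger's group.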


In [5]:
# predict survival for a passenger
def did_survive(row):
    if row.Female:
        # predict died if all women+boys in group died
        if (row.NGroup>0) and (row.GroupSurv==0):
            return 0
        # predict survived for all other women
        else:
            return 1
        
    elif row.Boy:
        # predict survived if all women+boys in group survived
        if (row.NGroup>0) and (row.GroupSurv==1):
            return 1
        # predict died for all other boys
        else:
            return 0
        
    else:
        # predict all men die
        return 0
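As a cross-check on the rule, here is a vectorised sketch of the same logic applied to a handful of made-up feature rows. It is an alternative formulation using `numpy.where`, not the function used in this notebook, and the rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up feature rows (hypothetical)
toy = pd.DataFrame({
    'Female':    [1, 1, 0, 0, 0],
    'Boy':       [False, False, True, True, False],
    'NGroup':    [2, 0, 3, 3, 2],
    'GroupSurv': [0.0, 0.0, 1.0, 0.5, 1.0],
})

female = toy.Female.astype(bool)
known = toy.NGroup > 0  # group was seen in the training data

# Women live unless every woman/boy in a known group died;
# boys live only if every woman/boy in a known group lived;
# adult men (neither Female nor Boy) are always predicted to die.
pred = np.where(
    female,
    ~(known & (toy.GroupSurv == 0)),
    toy.Boy & known & (toy.GroupSurv == 1),
).astype(int)

print(pred.tolist())
```

The five rows cover each branch: a woman in a group where everyone died, a woman with no known group, a boy in an all-survived group, a boy in a mixed group, and an adult man.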

In [6]:
# load test data
df_test = pd.read_csv('test.csv',index_col='PassengerId')

# extract the features to use
X = create_features(df_test,'TicketId')

# predict test data
pred = X.apply(did_survive,axis=1)

# create submission file
pred = pred.rename('Survived').to_frame()
pred.to_csv('submission.csv')

print(pred.Survived.value_counts())

0    270
1    148
Name: Survived, dtype: int64
