# Titanic

This is a quick response to some thoughts I had on [Chris Deotte's](https://www.kaggle.com/cdeotte) great kernel [Titanic using Name only](https://www.kaggle.com/cdeotte/titanic-using-name-only-0-81818/notebook), which I definitely suggest you look at if you haven't already.

To summarise quickly, his approach makes predictions based on the following rules:

* All males die except boys in families where all females and boys live.

* All females live except those in families where all females and boys die.

Boys are defined as passengers with the title "Master", and families are defined as groups of people with the same surname.

I was wondering what would happen if you grouped passengers sharing the same ticket number instead of passengers with the same surname. My main reasons for trying this were:

* Passengers with the same surname may not come from the same family (could come from multiple families).

* There may be important groups that are not surname-based, for example friends travelling together, unmarried couples, etc.

So this notebook uses Chris' approach with ticket grouping and compares the results.

UPDATE: Now using Chris' suggestion of the engineered TicketId feature.

In [1]:
# load the data
import pandas as pd
df = pd.read_csv('train.csv',index_col='PassengerId')

In [2]:
# select females and boys (title "Master", or male and under 13)
boy = (df.Name.str.contains('Master')) | ((df.Sex=='male') & (df.Age<13))
female = df.Sex=='female'
boy_or_female = boy | female

# TicketId improvement suggested by Chris Deotte:
# ticket number without its last digit, with the fare appended.
# Earlier version, which kept any ticket prefix:
#df['TicketId'] = df.Ticket.str[:-1] + '-' + df.Fare.astype(str)

# str.split(): split into words
# .str[-1]: select the last word
# .str[:-1]: remove the last digit from the last word
df['TicketId'] = df.Ticket.str.split().str[-1].str[:-1] + '-' + df.Fare.astype(str)
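
As a quick check of the string steps above, here is a minimal sketch that applies the same expression to three tickets and fares copied from the test data displayed later in this notebook. The `sample` frame is a hypothetical stand-in for `df`; only the `Ticket` and `Fare` columns are needed.

```python
import pandas as pd

# hypothetical three-row stand-in for df (values copied from rows shown later)
sample = pd.DataFrame({
    'Ticket': ['STON/O2. 3101270', '315087', '2622'],
    'Fare': [7.925, 8.6625, 7.2292],
})

# last word of the ticket, minus its final digit, with the fare appended
sample['TicketId'] = (
    sample.Ticket.str.split().str[-1].str[:-1] + '-' + sample.Fare.astype(str)
)
print(sample.TicketId.tolist())
```

Tickets that differ only in their final digit and share a fare end up with the same TicketId, which is what allows consecutive, separately bought tickets to fall into one group.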

In [3]:
# function to calculate boy+female survival rate
# amongst passengers grouped by "group"
def group_survival(group):
    # no. passengers in group
    n_group = df[boy_or_female].groupby(group).Survived.count()
    
    # survival rate in group
    surv_group = df[boy_or_female].groupby(group).Survived.mean()
    
    return n_group, surv_group
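
To illustrate what the two returned Series look like, here is a small sketch on made-up data. The `toy` frame and its group labels are hypothetical; it also computes both statistics in a single `agg` call, which is an alternative to the two separate groupbys above rather than what the notebook does.

```python
import pandas as pd

# hypothetical toy data: four women/boys spread over three groups
toy = pd.DataFrame({
    'Ticket': ['A', 'A', 'B', 'C'],
    'Survived': [1, 0, 1, 0],
})

# count and mean of Survived per group in one pass
stats = toy.groupby('Ticket').Survived.agg(['count', 'mean'])
n_group, surv_group = stats['count'], stats['mean']
print(stats)
```

Group A has two members and a survival rate of 0.5, so neither rule would fire for it; groups B and C have a rate of exactly 1 and 0 respectively, which are the only cases the prediction rules act on.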

In [4]:
# function to create relevant features for test data
def create_features(frame, group):
    
    # new features to engineer from test data columns
    frame['Boy'] = (frame.Name.str.contains('Master')) | ((frame.Sex=='male') & (frame.Age<13))
    frame['Female'] = (frame.Sex=='female').astype(int)
    #frame['TicketId'] = frame.Ticket.str[:-1] + '-' + frame.Fare.astype(str)
    frame['TicketId'] = frame.Ticket.str.split().str[-1].str[:-1] + '-' + frame.Fare.astype(str)

    # female+boy survival in training data grouped by 'group'
    n_group, surv_group = group_survival(group)
    
    # if group exists in training data, fill NGroup with no. women+boys
    # in that group in the training data.
    frame['NGroup'] = frame[group].replace(n_group)
    # otherwise NGroup=0
    frame.loc[~frame[group].isin(n_group.index),'NGroup']=0

    # if group exists in training data, fill GroupSurv with
    # women+boys survival rate in training data  
    frame['GroupSurv'] = frame[group].replace(surv_group)
    # otherwise GroupSurv=0
    frame.loc[~frame[group].isin(surv_group.index),'GroupSurv']=0

    # return data frame only including features needed for prediction
    return frame[['Female','Boy','NGroup','GroupSurv']]
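
The lookup-then-zero pattern used for `NGroup` and `GroupSurv` can also be written with `Series.map` followed by `fillna`. This is only a sketch of an alternative, not the code used for the results here, and the `n_group` and `test_groups` values below are made up for illustration.

```python
import pandas as pd

# hypothetical training counts, indexed by group label
n_group = pd.Series({'A': 2, 'B': 1})

# hypothetical group labels for three test passengers ('C' is unseen in training)
test_groups = pd.Series(['A', 'C', 'B'], name='Ticket')

# map looks each label up in n_group's index; unseen labels become NaN, then 0
ngroup = test_groups.map(n_group).fillna(0).astype(int)
print(ngroup.tolist())
```

One difference worth noting is that `map` yields a numeric column directly, whereas `replace` leaves unmatched labels in place until they are overwritten.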


In [5]:
# predict survival for a passenger
def did_survive(row):
    if row.Female:
        # predict died if all women+boys in group died
        if (row.NGroup>0) and (row.GroupSurv==0):
            return 0
        # predict survived for all other women
        else:
            return 1
        
    elif row.Boy:
        # predict survived if all women+boys in group survived
        if (row.NGroup>0) and (row.GroupSurv==1):
            return 1
        # predict died for all other boys
        else:
            return 0
        
    else:
        # predict all men die
        return 0
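
For reference, the same rules can be expressed with boolean masks instead of a row-wise `apply`. The function name `predict_vectorised` and the `toy` frame are hypothetical; this is a sketch intended to match `did_survive`, not a replacement used for the submitted results.

```python
import pandas as pd

def predict_vectorised(X):
    # hypothetical vectorised counterpart of did_survive
    has_group = X.NGroup > 0
    female = X.Female.astype(bool)
    # mirrors the elif: the boy rule is only checked for non-females
    boy = X.Boy.astype(bool) & ~female
    # females die only if their group is known and all of it died
    female_dies = female & has_group & (X.GroupSurv == 0)
    # boys live only if their group is known and all of it survived
    boy_lives = boy & has_group & (X.GroupSurv == 1)
    return ((female & ~female_dies) | boy_lives).astype(int)

# hypothetical feature rows: two women, two boys, one man
toy = pd.DataFrame({
    'Female': [1, 1, 0, 0, 0],
    'Boy': [False, False, True, True, False],
    'NGroup': [2, 0, 1, 1, 0],
    'GroupSurv': [0.0, 0.0, 1.0, 0.5, 0.0],
})
toy_pred = predict_vectorised(toy)
print(toy_pred.tolist())
```

The second woman has `GroupSurv` of 0 but `NGroup` of 0, so she is treated as having no known group and is predicted to survive; this is why both conditions are checked together.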

In [6]:
# load test data
df_test = pd.read_csv('test.csv',index_col='PassengerId')

# extract the features to use
X = create_features(df_test,'Ticket')

# predict test data
pred = X.apply(did_survive,axis=1)

# create submission file
pred = pd.DataFrame(pred) 
pred.rename(columns={0:'Survived'},inplace=True)
#pred.to_csv('submission.csv')

print(pred.Survived.value_counts())

0    265
1    153
Name: Survived, dtype: int64


# Results

This approach gives me a leaderboard score of 0.813, compared to Chris Deotte's score of 0.823, so it is marginally worse than the surname-based approach.

By comparing with his output file, I see that I predict three passengers in the test data survive, whilst Chris predicts they die. These three passengers are displayed below:

In [7]:
# passengers where I predict differently
# (based on comparison with Chris Deotte's output file)
id_nomatch = [910,929,1172]

# display these passengers
display(df_test.loc[id_nomatch])

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boy,Female,TicketId,NGroup,GroupSurv
910,3,"Ilmakangas, Miss. Ida Livija",female,27.0,1,0,STON/O2. 3101270,7.925,,S,False,1,310127-7.925,0,0
929,3,"Cacic, Miss. Manda",female,21.0,0,0,315087,8.6625,,S,False,1,31508-8.6625,0,0
1172,3,"Oreskovic, Miss. Jelka",female,23.0,0,0,315085,8.6625,,S,False,1,31508-8.6625,0,0


And the passengers in the training data with surnames matching the three test data passengers above are:

In [8]:
# families in training data with member in test data that I predict differently
display(df.loc[df.Name.str.contains('Ilmakangas')])
display(df.loc[df.Name.str.contains('Cacic')])
display(df.loc[df.Name.str.contains('Oreskovic')])

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketId
730,0,3,"Ilmakangas, Miss. Pieta Sofia",female,25.0,1,0,STON/O2. 3101271,7.925,,S,310127-7.925


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketId
472,0,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.6625,,S,31508-8.6625
535,0,3,"Cacic, Miss. Marija",female,30.0,0,0,315084,8.6625,,S,31508-8.6625


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketId
405,0,3,"Oreskovic, Miss. Marija",female,20.0,0,0,315096,8.6625,,S,31509-8.6625
726,0,3,"Oreskovic, Mr. Luka",male,20.0,0,0,315094,8.6625,,S,31509-8.6625


The Ilmakangas passengers appear to be adult sisters travelling together (indicated by SibSp=1) who bought their tickets separately. The Cacics and Oreskovics, however, have no siblings, spouses, children or parents on board according to the data. Even so, they all travel in the same class, at the same fare and with similar ticket numbers. They may be unrelated, but they are more likely to be cousins or similar relatives travelling together who bought tickets separately.

In summary, I was hoping that grouping by ticket would catch a greater variety of groups, but I had missed the point that groups of friends etc. are very likely to have bought tickets separately. So grouping only by ticket appears to miss a few relationships rather than catch more.

Also, the additional rules (vs. a gender-only prediction) only apply to groups of women and boys who either all survive or all die. There are not many such groups, so cases of unrelated families sharing a surname within them are unlikely.
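
To make the grouping comparison concrete, the sketch below counts group sizes under three keys for the six Cacic and Oreskovic passengers listed above. The rows are copied from the displayed tables; the `Surname` column (text before the comma in `Name`) is my own assumption about how a surname key could be derived, and may not match Chris' exact extraction.

```python
import pandas as pd

# rows copied from the tables above (train and test passengers together)
rows = pd.DataFrame({
    'Name': ['Cacic, Mr. Luka', 'Cacic, Miss. Marija', 'Cacic, Miss. Manda',
             'Oreskovic, Miss. Marija', 'Oreskovic, Mr. Luka', 'Oreskovic, Miss. Jelka'],
    'Ticket': ['315089', '315084', '315087', '315096', '315094', '315085'],
    'Fare': [8.6625] * 6,
})

# assumed surname key: everything before the comma
rows['Surname'] = rows.Name.str.split(',').str[0]
# TicketId built the same way as earlier in the notebook
rows['TicketId'] = rows.Ticket.str.split().str[-1].str[:-1] + '-' + rows.Fare.astype(str)

# size of each group under each candidate key
group_sizes = {key: rows.groupby(key).size().to_dict()
               for key in ['Surname', 'Ticket', 'TicketId']}
print(group_sizes)
```

Grouping by raw `Ticket` leaves every one of these passengers alone, grouping by surname gives two groups of three, and grouping by TicketId links them by ticket range instead, which puts Jelka Oreskovic with the Cacics rather than with the other Oreskovics.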

In [9]:
# predict using TicketId
X = create_features(df_test,'TicketId')
pred_ticketId = X.apply(did_survive,axis=1)
print(pred_ticketId.value_counts())

# compare Ticket and TicketId results
pred.rename(columns={'Survived':'Ticket'},inplace=True)
pred['TicketId'] = pred_ticketId
pred['Match'] = pred['Ticket']==pred['TicketId']

# test passengers where the two groupings give different predictions
id_diff = pred.index[~pred.Match]
display(pred.loc[id_diff])
display(df_test.loc[id_diff].sort_values(by='TicketId'))
# training passengers sharing a TicketId with those test passengers
display(df[df.TicketId.isin(df_test.loc[id_diff].TicketId)].sort_values(by='TicketId'))

# create submission file using TicketId results
pred_ticketId = pred_ticketId.to_frame('Survived')
pred_ticketId.to_csv('submission.csv')

0    272
1    146
dtype: int64


PassengerId,Ticket,TicketId,Match
910,1,0,False
929,1,0,False
964,1,0,False
980,1,0,False
1030,1,0,False
1141,1,0,False
1172,1,0,False
1174,1,0,False
1231,0,1,False


PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boy,Female,TicketId,NGroup,GroupSurv
1231,3,"Betros, Master. Seman",male,,0,0,2622,7.2292,,C,True,0,262-7.2292,1,1
1141,3,"Khalil, Mrs. Betros (Zahie Maria"" Elias)""",female,,1,0,2660,14.4542,,C,False,1,266-14.4542,2,0
910,3,"Ilmakangas, Miss. Ida Livija",female,27.0,1,0,STON/O2. 3101270,7.925,,S,False,1,310127-7.925,1,0
964,3,"Nieminen, Miss. Manta Josefina",female,29.0,0,0,3101297,7.925,,S,False,1,310129-7.925,1,0
929,3,"Cacic, Miss. Manda",female,21.0,0,0,315087,8.6625,,S,False,1,31508-8.6625,1,0
1172,3,"Oreskovic, Miss. Jelka",female,23.0,0,0,315085,8.6625,,S,False,1,31508-8.6625,1,0
980,3,"O'Donoghue, Ms. Bridget",female,,0,0,364856,7.75,,Q,False,1,36485-7.75,1,0
1174,3,"Fleming, Miss. Honora",female,,0,0,364859,7.75,,Q,False,1,36485-7.75,1,0
1030,3,"Drapkin, Miss. Jennie",female,23.0,0,0,SOTON/OQ 392083,8.05,,S,False,1,39208-8.05,1,0


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketId
860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C,262-7.2292
368,1,3,"Moussa, Mrs. (Mantoura Boulos)",female,,0,0,2626,7.2292,,C,262-7.2292
112,0,3,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,266-14.4542
241,0,3,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,266-14.4542
105,0,3,"Gustafsson, Mr. Anders Vilhelm",male,37.0,2,0,3101276,7.925,,S,310127-7.925
393,0,3,"Gustafsson, Mr. Johan Birger",male,28.0,2,0,3101277,7.925,,S,310127-7.925
730,0,3,"Ilmakangas, Miss. Pieta Sofia",female,25.0,1,0,STON/O2. 3101271,7.925,,S,310127-7.925
529,0,3,"Salonen, Mr. Johan Werner",male,39.0,0,0,3101296,7.925,,S,310129-7.925
817,0,3,"Heininen, Miss. Wendla Maria",female,23.0,0,0,STON/O2. 3101290,7.925,,S,310129-7.925
383,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S,310129-7.925
