## **Titanic Kaggle competition, XGBoost, public score 0.79425 (top 6%)**
<br>

**The main idea that worked for me was to group passengers by ticket number, split each group into men and women with kids, and calculate a survival rate for each subgroup. I also recalculated the fare by dividing it by the number of people in the group. That is it. The details follow below.**
<br>
<br>

First I tried grouping by ticket number alone and got poor results. The overall survival rate for men is much lower than that of women and kids, which is why every group needs to be split by this characteristic. If someone from the same subgroup survived, that should increase the survival chances of the others in it. Interestingly, last names within the same group sometimes differ; I guess these people were travelling with friends.
<br>

Male kids can be easily identified by the title 'Master' when age is missing. The Sex3 variable splits the population into men versus women with kids.
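
A minimal sketch of the two ideas above on made-up rows (the toy values are assumptions for illustration, not rows from the Titanic data): derive Sex3 from sex, age and title, then compute a survival rate per ticket group separately for each Sex3 value.

```python
import numpy as np
import pandas as pd

# Toy passengers (made-up values, not rows from the Titanic data).
toy = pd.DataFrame({
    "Ticket":   ["A1", "A1", "A1", "B2", "B2"],
    "Sex":      ["male", "female", "male", "male", "female"],
    "Age":      [40.0, 35.0, np.nan, 30.0, 28.0],
    "Title":    ["Mr", "Mrs", "Master", "Mr", "Mrs"],
    "Survived": [0, 1, 1, 0, 0],
})

# Boys are males under 11 or, when age is missing, males titled 'Master'.
is_boy = (toy["Sex"] == "male") & ((toy["Age"] < 11) | (toy["Title"] == "Master"))
toy["Sex3"] = np.where((toy["Sex"] == "female") | is_boy, "female/kids", "male")

# Survival rate per ticket group, computed separately for each Sex3 value.
rates = toy.groupby(["Ticket", "Sex3"])["Survived"].mean()
print(rates)
```

In this toy frame the boy with the missing age lands in the 'female/kids' subgroup of ticket A1, so the rate for the man on the same ticket is computed without him.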
<br>

Then the fare should be recalculated based on the size of each group. After doing that, the fare distribution by pclass starts making sense.
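
The recalculation can be sketched on toy data as follows. It assumes, as the approach above does, that the recorded fare covers everyone sharing a ticket; the column names `group_size` and `Fare_per_person` are hypothetical and used only in this example.

```python
import pandas as pd

# Toy tickets (made-up values): three passengers share A1, two share C3.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "Pclass": [1, 1, 1, 3, 2, 2],
    "Fare":   [90.0, 90.0, 90.0, 8.0, 26.0, 26.0],
})

# Number of passengers sharing each ticket, broadcast back to every row.
toy["group_size"] = toy.groupby("Ticket")["Ticket"].transform("size")

# Fare per person: the ticket fare divided by the group size.
toy["Fare_per_person"] = toy["Fare"] / toy["group_size"]
print(toy)
```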
<br>

Some passengers travel on their own, and some travel in groups where no member is present in the train dataset. In those situations the survival rate is assigned based on pclass and the new sex variable (men vs women/kids).
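
A minimal sketch of this fallback on made-up rows: a general rate per Sex3 and Pclass is computed from the training rows and used only where no group rate exists. The column name `survival_rate_group` is a hypothetical stand-in for the group-based rate.

```python
import numpy as np
import pandas as pd

# Toy training rows (made-up values).
train = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex3":     ["female/kids", "male", "male", "male", "female/kids", "female/kids"],
    "Survived": [1, 0, 0, 1, 1, 0],
})

# General survival rate for every Sex3 / Pclass combination.
general = (train.groupby(["Sex3", "Pclass"], as_index=False)["Survived"].mean()
           .rename(columns={"Survived": "survival_rate_general"}))

# Toy test rows: NaN marks passengers without a usable group rate.
test = pd.DataFrame({
    "Pclass": [3, 3, 1],
    "Sex3":   ["male", "female/kids", "male"],
    "survival_rate_group": [np.nan, 1.0, np.nan],
})
test = test.merge(general, how="left", on=["Sex3", "Pclass"])

# Keep the group rate where it exists, otherwise fall back to the general rate.
test["survival_rate_final"] = test["survival_rate_group"].fillna(test["survival_rate_general"])
print(test)
```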
<br>
<br>

Age information didn't work for me:
<br>

I also tried to populate missing ages using other variables such as pclass, title and whether the passenger is travelling with someone else. Kids can't travel on their own, so the median age for the title 'Miss' differs between passengers in groups and passengers travelling alone. Unfortunately the age variable didn't improve the results.
<br>
<br>

I tried several ML techniques, but XGBoost came out the winner. The variables that gave the best result are:
<br>


**Pclass, Sex3_t (men vs women with kids), survival_rate_final, Fare_new**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('classic')
from matplotlib.ticker import PercentFormatter
import re
pd.set_option('display.max_columns', None)

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

<br>
<br>

#### Extracting last names and titles

In [None]:
train['train'] = 1
test['train'] = 0
# ignore_index avoids duplicate row labels in the combined frame
train_test = pd.concat((train, test), ignore_index=True)
train_test['Last_name'] = train_test.Name.apply(lambda x: x.split(',')[0])
train_test['Title'] = train_test.Name.apply(lambda x: x.split(',')[1].split('.')[0][1:])

<br>
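
As an alternative to chained `split` calls, the same two fields can be pulled out with regular expressions. This is only a sketch under the assumption that every name follows the "Last name, Title. Given names" pattern; the sample strings are in the format of the Name column.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Everything before the first comma is the last name.
last_name = names.str.extract(r"^([^,]+),", expand=False).str.strip()

# The title sits between the first comma and the first period after it.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(pd.DataFrame({"Last_name": last_name, "Title": title}))
```

Stripping whitespace here matters for later comparisons: a title stored as 'Master' will not match a literal written with a leading space.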
<br>

#### The new variable Sex2 separates boys from the male population using age and title. Sex3 then combines 'young male' with females.

In [None]:
train_test['Sex2'] = train_test.Sex
# Title was extracted without the leading space, so compare against 'Master'
train_test.loc[(train_test.Sex == 'male')&((train_test.Age < 11)|(train_test.Title == 'Master')), 'Sex2'] = 'young male'
train_test['Sex3'] = train_test.Sex2
train_test.loc[(train_test.Sex2 == 'female')|(train_test.Sex2 == 'young male'), 'Sex3'] = 'female/kids'

<br>
<br>

#### This code organizes females into two groups, 'Miss' and 'Mrs', so that missing ages can later be populated more accurately.

In [None]:
def title2(x):
    if x['Sex2'] == 'female' and x['Title'] in ['Miss', 'Lady', 'Mlle', 'Ms']:
        return 'Miss'
    elif x['Sex2'] == 'female':
        return 'Mrs'
    elif x['Sex2'] == 'male':
        return 'Mr'
    else:
        return x['Title']
train_test['Title2'] = train_test.apply(title2, axis=1)

<br>
<br>

#### The part below calculates survival rates per group. Groups are defined by identical ticket values and the Sex3 variable, so men are separated from women and kids.

#### If someone from a group survived, the other people in that group should have higher survival chances. Again, men and women/kids are evaluated separately even if they travel in the same group.

In [None]:
train_test['male_train'] = np.where((train_test.Sex3 == 'male')&(train_test.train == 1), 1, 0)
train_test['fk_train'] = np.where((train_test.Sex3 == 'female/kids')&(train_test.train == 1), 1, 0)

train_test['Survived_m'] = train_test.Survived
train_test.loc[(train_test.Sex3 != 'male')&(train_test.Survived == 1), 'Survived_m'] = 0
train_test['Survived_fk'] = train_test.Survived
train_test.loc[(train_test.Sex3 != 'female/kids')&(train_test.Survived == 1), 'Survived_fk'] = 0

In [None]:
# Ingroup means the ticket number is not unique: the passenger is travelling with someone else
# Some groups appear only in the test dataset, so no group survival rate can be computed for them
# Those passengers, and passengers travelling alone, get a missing (NaN) rate here
# and are assigned a general survival rate by Sex3 and Pclass further below
group = train_test.groupby('Ticket').agg(
    total = ('train', "count"),
    train_total = ('train', "sum"),
    survived = ('Survived', "sum"),
    male_train = ('male_train', "sum"),
    fk_train = ('fk_train', "sum"),
    Survived_m = ('Survived_m', "sum"),
    Survived_fk = ('Survived_fk', "sum"))

group = group[group.total > 1].copy()
group['train_p'] = group.train_total/group.total
group['survival_rate'] = group.survived/group.train_total
group['survival_rate_m'] = group.Survived_m/group.male_train
group['survival_rate_fk'] = group.Survived_fk/group.fk_train

group.drop(columns = ['train_total','survived','male_train','fk_train','Survived_m','Survived_fk'], inplace=True)
train_test = train_test.merge(group, how='left', left_on='Ticket', right_index=True)
train_test['Ingroup'] = 1
train_test.loc[train_test.total.isnull(), 'Ingroup'] = 0

In [None]:
def survival_rate_final(x):
    if x['Sex3'] == 'male':
        return x['survival_rate_m']
    else:
        return x['survival_rate_fk']

train_test['survival_rate_final'] = train_test.apply(survival_rate_final, axis=1)

<br>
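
The row-wise `apply` above can also be written as a single vectorized selection. A minimal sketch on made-up rates, assuming the same three column names as in the cells above:

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); NaN means no rate could be computed.
toy = pd.DataFrame({
    "Sex3": ["male", "female/kids", "male", "female/kids"],
    "survival_rate_m":  [0.0, 0.0, np.nan, 0.5],
    "survival_rate_fk": [1.0, 1.0, 0.5, np.nan],
})

# Men take the male subgroup rate, everyone else the female/kids rate.
toy["survival_rate_final"] = np.where(
    toy["Sex3"] == "male", toy["survival_rate_m"], toy["survival_rate_fk"])
print(toy)
```

Missing rates stay missing after the selection, so the later fallback step still sees them.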
<br>

#### Some passengers have a unique ticket number, so they are travelling on their own. Others travel with a group whose members are all in the test dataset, so the group survival rate cannot be calculated. I am going to assign a 'general' survival rate to those passengers based on Sex3 (male, female/kids) and pclass, as survival seems to be driven by these variables.

In [None]:
train_test[train_test.survival_rate_final.isna()].groupby(['Ingroup','Sex3']).PassengerId.count()

In [None]:
group = train_test[train_test.train == 1].groupby(['Sex3','Pclass'], as_index=False).Survived.mean()
group = group.rename(columns={'Survived':'survival_rate_general'})
train_test = train_test.merge(group, how='left', on=['Sex3','Pclass'])

In [None]:
group

<br>
<br>

#### Survival_rate_final2 is the field that will be used for predicting.

In [None]:
# Use the group rate where available, otherwise the general rate
train_test['survival_rate_final2'] = train_test.survival_rate_final.fillna(train_test.survival_rate_general)

In [None]:
train_test[train_test.train == 0].groupby(['Sex3','Ingroup','survival_rate_final2']).PassengerId.count()

<br>
<br>

#### Changing variable name

In [None]:
train_test.survival_rate_final = train_test.survival_rate_final2
train_test.drop(columns='survival_rate_final2', inplace=True)

<br>
<br>
**The age variable didn't improve the results, so this part can be skipped.**

In [None]:
# Age_median - assign median age by pclass for male where age is missing
train_test = train_test.merge(
    train_test[train_test.Sex2 == 'male'].groupby(['Sex2','Pclass']).agg(Age_median=('Age', 'median')), 
    how='left', left_on=['Sex2','Pclass'], right_index=True)

train_test.loc[(train_test.Sex2 == 'male')&(~train_test.Age.isnull()), 'Age_median'] = \
    train_test.loc[(train_test.Sex2 == 'male')&(~train_test.Age.isnull()), 'Age']

<br>
<br>

#### To assign age for women I derive some insights from the title and from whether the passenger is travelling in a group or alone (Ingroup variable). I have noticed that passengers titled 'Miss' travelling in a group are a mix of kids and young females, so their median age is lower. If a passenger with this title is travelling alone, the median age is higher, as kids would not be allowed to travel alone.

In [None]:
# Age_median2 - assign median age by pclass for female title Mrs where age is missing
train_test = train_test.merge(
    train_test[(train_test.Sex2 == 'female')&(train_test.Title2 == 'Mrs')].groupby(['Sex2','Title2','Pclass']).agg(Age_median2=('Age', 'median')), 
    how='left', left_on=['Sex2','Title2','Pclass'], right_index=True)

train_test.loc[(train_test.Sex2 == 'female')&(train_test.Title2 == 'Mrs')&(~train_test.Age.isnull()), 'Age_median2'] = \
    train_test.loc[(train_test.Sex2 == 'female')&(train_test.Title2 == 'Mrs')&(~train_test.Age.isnull()), 'Age']

In [None]:
# Age_median3 - assign median age by pclass and Ingroup for female title Miss where age is missing
train_test = train_test.merge(
    train_test[(train_test.Sex2 == 'female')&(train_test.Title2 == 'Miss')].groupby(['Sex2','Title2','Pclass','Ingroup']).agg(Age_median3=('Age', 'median')), 
    how='left', left_on=['Sex2','Title2','Pclass','Ingroup'], right_index=True)

train_test.loc[(train_test.Sex2 == 'female')&(train_test.Title2 == 'Miss')&(~train_test.Age.isnull()), 'Age_median3'] = \
    train_test.loc[(train_test.Sex2 == 'female')&(train_test.Title2 == 'Miss')&(~train_test.Age.isnull()), 'Age']

In [None]:
# Combine the imputed columns into Age_median; rows not covered above fall back to the original Age
train_test.loc[train_test.Age_median.isnull(), 'Age_median'] = train_test.loc[train_test.Age_median.isnull(), 'Age_median2']
train_test.loc[train_test.Age_median.isnull(), 'Age_median'] = train_test.loc[train_test.Age_median.isnull(), 'Age_median3']
train_test.loc[train_test.Age_median.isnull(), 'Age_median'] = train_test.loc[train_test.Age_median.isnull(), 'Age'] 
train_test.drop(columns=['Age_median2','Age_median3'], inplace=True)

<br>
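
The merge-then-overwrite pattern used for the age columns can be condensed with a group-wise `transform`. A minimal sketch on made-up rows for the 'Miss' case, where `Age_filled` is a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Toy 'Miss' passengers (made-up ages): those in groups are younger.
toy = pd.DataFrame({
    "Title2":  ["Miss", "Miss", "Miss", "Miss", "Miss", "Miss"],
    "Pclass":  [3, 3, 3, 3, 3, 3],
    "Ingroup": [1, 1, 1, 0, 0, 0],
    "Age":     [4.0, 9.0, np.nan, 22.0, 30.0, np.nan],
})

# Median age of each Title2 / Pclass / Ingroup cell, aligned to every row.
group_median = toy.groupby(["Title2", "Pclass", "Ingroup"])["Age"].transform("median")

# Keep known ages and fill only the missing ones with the cell median.
toy["Age_filled"] = toy["Age"].fillna(group_median)
print(toy)
```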
<br>

#### Fare_new is the fare divided by the group size. There is just one missing value in the Fare variable; it is filled using the median Pclass 3 fare.

In [None]:
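# Passengers outside any group count as a group of one (assign back instead of chained inplace)
train_test['total'] = train_test['total'].fillna(1)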
train_test['Fare_new'] =  train_test.Fare/train_test.total
train_test.loc[train_test.Fare.isnull(), 'Fare_new'] = \
    train_test[train_test.Pclass == 3].Fare.median()/train_test.loc[train_test.Fare.isnull(), 'total']

<br>
<br>

**Multiple models were tested but xgboost performed the best.**

In [None]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV

In [None]:
le = preprocessing.LabelEncoder()
le.fit(train_test['Sex3'])
train_test['Sex3_t'] = le.transform(train_test['Sex3'])
print("Sex3:", train_test.Sex3_t.unique(), le.inverse_transform(train_test.Sex3_t.unique()))

In [None]:
columns = ['Pclass','Sex3_t','survival_rate_final','Fare_new']
x = train_test[train_test.train == 1][columns]
y = train_test[train_test.train == 1].Survived

**A simple GridSearchCV was used to tune the model and identify the best hyperparameters, which are used below for training.**
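
The search itself is not part of this notebook. A minimal sketch of what such a search could look like is given here on synthetic data. Everything in it is an assumption for illustration: the grid values are hypothetical, and scikit-learn's `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` so the example runs without xgboost installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the four training features (not Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hypothetical grid; the grid actually searched is not given in the notebook.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.2],
}

# Stand-in estimator; swap in xgb.XGBClassifier to tune the real model.
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, subsample=0.6, random_state=0),
    param_grid, scoring="accuracy", cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The best combination is read from `search.best_params_` and can then be passed to the final classifier.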

In [None]:
clf = xgb.XGBClassifier(
     gamma=0.1,
     learning_rate=0.2,
     max_depth=3,
     min_child_weight=6,
     n_estimators=100,
     reg_alpha=0,
     reg_lambda=3.5,
     subsample=0.6)
clf.fit(x, y)

In [None]:
pred = clf.predict(train_test[train_test.train == 0][columns]).astype(int)
submission = pd.DataFrame({'PassengerId':train_test[train_test.train == 0].PassengerId, 'Survived':pred})

In [None]:
submission.to_csv('/kaggle/working/titanic_submission_gb.csv', index=False)