# Titanic survival prediction from Name and Sex
This notebook is inspired by [Simple Titanic model using only Name](https://www.kaggle.com/cdeotte/titanic-using-name-only-0-81818). I take a similar approach, calculating the survival rate of the women and children in each family, with two primary differences.

First, I adjust `WCSurvivedPct` (called `FamilySurvivalRate` in the code below) for each individual so that it represents the survival rate of the rest of the family, excluding that individual.
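
As a minimal sketch of this leave-one-out adjustment, the toy example below computes, for each row, the survival rate of the other members of the same family. The data and the column name `RestOfFamilyRate` are made up for illustration, and unlike the notebook's code it treats every row as a woman or child.

```python
import numpy as np
import pandas as pd

# Toy data: names and outcomes are invented for illustration only.
toy = pd.DataFrame({
    'LastName': ['Smith', 'Smith', 'Smith', 'Jones'],
    'Survived': [1.0, 0.0, 1.0, 1.0],
})

group = toy.groupby('LastName').Survived
# Family-wide totals, minus the current row's own contribution.
others_count = group.transform('count') - 1
others_survived = group.transform('sum') - toy.Survived
# A passenger with no other family members has no defined rate (NaN).
toy['RestOfFamilyRate'] = others_survived / others_count.replace(0, np.nan)
print(toy)
```

Excluding the current row keeps a passenger's own label from leaking into the feature used to predict it.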

Second, I use a `DecisionTreeClassifier` to learn slightly more complex classification rules instead of the two rules from Chris's notebook: *1) all males die except boys in families where all women and children survive; 2) all females live except those in families where all women and children die*.
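
For comparison, the two baseline rules quoted above can be written as a small function. This is only a sketch of the rules as stated here, not code from Chris's notebook; the function name `two_rule_predict` and its parameters are hypothetical, and `family_rate` is assumed to be NaN when no family information is available.

```python
def two_rule_predict(sex, is_boy, family_rate):
    """Return 1 (survives) or 0 (dies) under the two baseline rules.

    Hypothetical helper: `family_rate` is the survival rate of the women
    and children in the passenger's family, or NaN if unknown.
    """
    if sex == 'male':
        # Rule 1: males die, except boys whose family's women and children all survived.
        return int(is_boy and family_rate == 1.0)
    # Rule 2: females live, except when the family's women and children all died.
    return int(family_rate != 0.0)


print(two_rule_predict('male', True, 1.0))
print(two_rule_predict('female', False, 0.0))
```

Because comparisons with NaN are false for `==` and true for `!=`, passengers without family information fall back to the plain sex-based prediction.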

These changes raise the score from 0.82296 to 0.83253.

In [None]:
import graphviz 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import cross_val_score, GridSearchCV

In [None]:
train = pd.read_csv('../input/train.csv').set_index('PassengerId')
test = pd.read_csv('../input/test.csv').set_index('PassengerId')
df = pd.concat([train, test], axis=0, sort=False)
df['Title'] = df.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
df['IsWomanOrChild'] = ((df.Title == 'Master') | (df.Sex == 'female'))
df['LastName'] = df.Name.str.split(',').str[0]

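family = df.groupby(df.LastName).Survived
# Count the women and children in each family (test rows included via fillna).
df['FamilyTotalCount'] = family.transform(lambda s: s[df.IsWomanOrChild].fillna(0).count())
# Exclude the current passenger from their own family's count.
df['FamilyTotalCount'] = df.FamilyTotalCount.mask(df.IsWomanOrChild, df.FamilyTotalCount - 1)
# Count surviving women and children; unknown (test) outcomes count as 0.
df['FamilySurvivedCount'] = family.transform(lambda s: s[df.IsWomanOrChild].fillna(0).sum())
# Exclude the current passenger's own outcome from the survivor count.
df['FamilySurvivedCount'] = df.FamilySurvivedCount.mask(df.IsWomanOrChild, df.FamilySurvivedCount - df.Survived.fillna(0))
# Survival rate of the rest of the family; NaN when there is nobody else.
df['FamilySurvivalRate'] = (df.FamilySurvivedCount / df.FamilyTotalCount.replace(0, np.nan))
df['IsSingleTraveler'] = df.FamilyTotalCount == 0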

In [None]:
x = pd.concat([
    df.FamilySurvivalRate.fillna(0),
    df.IsSingleTraveler,
    df.Sex.map({'male': 0, 'female': 1}),
], axis=1)
train_x, test_x = x.loc[train.index], x.loc[test.index]
train_y = df.Survived.loc[train.index]

In [None]:
clf = tree.DecisionTreeClassifier()
grid = GridSearchCV(clf, cv=5, param_grid={
    'criterion': ['gini', 'entropy'], 
    'max_depth': [2, 3, 4, 5]})
grid.fit(train_x, train_y)
grid.best_params_

In [None]:
model = grid.best_estimator_

In [None]:
graphviz.Source(tree.export_graphviz(model, feature_names=x.columns)) 

In [None]:
test_y = model.predict(test_x).astype(int)
pd.DataFrame({'Survived': test_y}, index=test.index) \
    .reset_index() \
    .to_csv('survived.csv', index=False)