In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../data/train.csv')

In [None]:
df.info()

### Data overview thoughts
Of the 891 passengers...
* All have non-null info on Survived, Pclass, Sex, SibSp, Parch, Ticket, Fare.
* Embarked is only missing 2 values, so if Embarked turns out to be useful it might be worth just dropping those two rows. Otherwise, drop the Embarked feature.
* Cabin is mostly nulls, so probably not worth using if only available 22% of the time.
* Age is probably useful, but 20% are missing that field.
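
The non-null counts from `df.info()` can be turned into missing-value percentages directly. A minimal sketch, using a small made-up frame (`toy`) rather than the real training data, so the numbers it prints are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training frame; the values are made up.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'Q'],
})

# isna() marks missing cells; the mean of those booleans per column
# is the fraction of rows that are missing.
missing_pct = (100.0 * toy.isna().mean()).sort_values(ascending=False)
print(missing_pct)
```

On the real data the same expression would be applied to `df` instead of `toy`.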


In [None]:
df.head()

#### Contents thoughts
* **Pclass** (Ticket Class). Preferential treatment for higher class? Nice simple field to work with.
* **Name**. Unique. Probably not useful unless there's some bias in race or class that can be derived from the names. That would involve some deep analysis and probably an external dataset to find name embeddings. Although just the length of the name may indicate something.
* **SibSp** (Siblings + Spouses onboard). Safety in numbers? Would be interesting to correlate this with Name and see if there are any likely holes here.
* **Parch** (Parents + Children onboard). Very similar to SibSp, although going by the movies, "women and children first" would suggest this carries more weight for adult females than for adult males. Not that useful for identifying children, since we already know who they are from their Age. It's also difficult to tell whether the person is the parent or the child, as even an adult could be the child. More broadly useful for safety in numbers?
* **Ticket**. Unique? Probably not useful if unique, unless it carries more information on where, when, or how the ticket was purchased that would complement Class, or information that relates to the Cabin.
* **Fare**. Relates to the Class. Variability might indicate more granular levels within a Class or some preferential treatment at the time of purchase.
* **Embarked** (Port of Embarkation). Again, this may add colour to the background of the person that may have played a part. For instance, would someone with a certain accent be treated differently? Would someone from a different region/climate be more apt to survive?
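
Two of the questions above (is Ticket unique, and does name length carry any signal) are cheap to check. A rough sketch on made-up rows; `toy` and the `NameLength` column are illustrative names, not part of the dataset:

```python
import pandas as pd

# Hypothetical rows shaped like the training data; the values are made up.
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Smith, Mrs. John (Mary Jones)', 'Doe, Miss. Jane'],
    'Ticket': ['A/5 21171', 'A/5 21171', 'PC 17599'],
    'Survived': [0, 1, 1],
})

# Is every ticket number distinct, and how many passengers share each one?
ticket_is_unique = toy['Ticket'].is_unique
passengers_per_ticket = toy['Ticket'].value_counts()

# Name length as a crude feature, compared across outcomes.
toy['NameLength'] = toy['Name'].str.len()
mean_name_length_by_outcome = toy.groupby('Survived')['NameLength'].mean()
print(ticket_is_unique)
print(mean_name_length_by_outcome)
```

Shared ticket numbers, if they show up in the real data, might also be a way to spot groups travelling together.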

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Age has missing values, so drop them before building the histogram
plt.hist(df.Age.dropna())

In [None]:
died = (df.Survived == 0.).sum()
survived = (df.Survived == 1.).sum()
percent_died = 100. * died / (died + survived)
print(f'{died} died ({percent_died:0.1f}%). {survived} survived ({100. - percent_died:0.1f}%)')

In [None]:
df.Pclass.value_counts()

### Age Groups
Let's just use Fibonacci numbers as the bin edges for now; the resulting groups look about right. Life expectancy of white men and women around 1912 was around 50-55 years.

In [None]:
age_bins = [0., 1., 2., 3., 5., 8., 13., 21., 34., 55., 89.]
age_group_labels = [f'Under {b}' for b in age_bins[1:]]
df['AgeGroup'] = pd.cut(df.Age, bins=age_bins, labels=age_group_labels)
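
One detail worth checking: by default `pd.cut` builds right-closed intervals, so the bin labelled "Under 1.0" actually covers ages in (0, 1], and a passenger aged exactly 1 lands in it. A small sketch of the difference, assuming half-open bins [a, b) would match the "Under N" labels more literally; `ages` is made-up data and the bin list is shortened:

```python
import pandas as pd

ages = pd.Series([0.5, 1.0, 2.0, 8.0])
bins = [0., 1., 2., 3., 5., 8., 13.]
labels = [f'Under {b}' for b in bins[1:]]

# Default: intervals are (a, b], so an age equal to an edge goes in the lower bin.
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [a, b), so an age equal to an edge goes in the upper bin.
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'Age': ages, 'default': right_closed, 'right=False': left_closed}))
```

Most ages in the data are whole numbers, so this choice moves passengers sitting on an edge (for example, age 8 or 13) from one group to the next.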

In [None]:
df['AgeGroup'].value_counts().plot.bar()

In [None]:
age_group_survived = df['AgeGroup'].where(df['Survived'] == 1.).value_counts()
age_group_died = df['AgeGroup'].where(df['Survived'] == 0.).value_counts()
age_group_survived_df = pd.DataFrame({
    'Survived': age_group_survived,
    'Died': age_group_died,
}, index=df['AgeGroup'].cat.categories)

In [None]:
age_group_survived_df

In [None]:
age_group_survived_df.plot.bar(stacked=True)

In [None]:
age_group_survived_df['DeathRate'] = 100. * age_group_survived_df['Died']\
    / (age_group_survived_df['Died'] + age_group_survived_df['Survived'])
age_group_survived_df['DeathRate']

#### Death rate by age thoughts
* Under 8 years old is generally better for survival. In practice it probably depends on the survival of the parent, and the parent probably gets better treatment if they have children.
* Older than 55 has the highest rate of mortality. This is in line with life expectancy at the time.

_Note: I played with the age groups a little and didn't find any particular benefit to using non-Fibonacci buckets._
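
The same death-rate table can be had in one step with `groupby`, since the mean of `Survived` within a group is that group's survival rate. A sketch on made-up rows (`toy` is illustrative; on the real frame the grouping column would be the existing `AgeGroup`):

```python
import pandas as pd

# Hypothetical passengers; the values are made up.
toy = pd.DataFrame({
    'Age': [0.8, 4.0, 4.5, 30.0, 30.0, 60.0],
    'Survived': [1, 1, 0, 0, 1, 0],
})
age_bins = [0., 1., 2., 3., 5., 8., 13., 21., 34., 55., 89.]
labels = [f'Under {b}' for b in age_bins[1:]]
toy['AgeGroup'] = pd.cut(toy['Age'], bins=age_bins, labels=labels)

# Mean of the 0/1 Survived flag is the survival rate; the rest died.
# observed=True keeps only age groups that actually occur in the data.
survival_rate = toy.groupby('AgeGroup', observed=True)['Survived'].mean()
death_rate = 100.0 * (1.0 - survival_rate)
print(death_rate)
```

With groups this small the rates are very noisy, which also applies to the sparsely populated youngest and oldest buckets in the real data.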

### Pclass (Ticket Class) Analysis

In [None]:
import seaborn as sns

# some plots derived from https://seaborn.pydata.org/generated/seaborn.countplot.html

In [None]:
ax = sns.countplot(x='Pclass', data=df)

In [None]:
sns.countplot(x="Pclass", hue="Survived", data=df)

#### Huge bias for dying in 3rd Class


In [None]:
predictability = 0.
for c in [1, 2, 3]:
    in_class = df['Survived'].where(df['Pclass'] == c)
    died = in_class.where(df['Survived'] == 0.).count()
    survived = in_class.where(df['Survived'] == 1.).count()
    class_death_rate = died / (died + survived)
    prob_of_class = in_class.count() / len(df)
    if c == 1:
        # Let's always guess that 1st Class survives.
        # We'll be right most of the time.
        class_predictability = 1. - class_death_rate
    elif c == 2:
        # Let's always guess that 2nd Class dies.
        # We'll be right slightly more than we're wrong.
        class_predictability = class_death_rate
    elif c == 3:
        # Let's always guess that 3rd Class dies.
        # We'll be right most of the time.
        class_predictability = class_death_rate
        
    predictability += class_predictability * prob_of_class
    print((c, died, survived, class_death_rate, prob_of_class, class_predictability))

print(f'Predictability of death based on class data alone is {100. * predictability:0.1f}%')

Even the most basic model that always predicted you'd die in 3rd Class would be 75.76% accurate for just that class. Overall, guessing the more common outcome for each class would get us 67.9% accuracy from this one field.
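
The loop above amounts to "guess the majority outcome within each class". A compact sketch of that arithmetic, with the per-class counts typed in by hand; they are assumed to be the counts in the standard Kaggle `train.csv`, and they do reproduce the 75.76% and 67.9% figures quoted here:

```python
import pandas as pd

# Died/Survived counts per ticket class (assumed from the Kaggle training set).
counts = pd.DataFrame(
    {'Died': [80, 97, 372], 'Survived': [136, 87, 119]},
    index=pd.Index([1, 2, 3], name='Pclass'),
)

# Guessing the majority outcome gets the larger of the two counts right.
majority_correct = counts.max(axis=1)
per_class_accuracy = majority_correct / counts.sum(axis=1)
overall_accuracy = majority_correct.sum() / counts.to_numpy().sum()

print((100.0 * per_class_accuracy).round(2))
print(f'Overall: {100.0 * overall_accuracy:0.1f}%')
```

On the real frame, `pd.crosstab(df['Pclass'], df['Survived'])` should give the same table without typing the counts in.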

### Initial model - DecisionTreeClassifier
It would be good to start cleaning the data and get a basic model going.
Let's train on the following fields for now...
* Pclass. One-hot encode.
* Sex. Convert to Male (0. or 1.)
* Age. Convert NaNs to average age. Bucket as Fibonacci (see above) and one-hot encode.
* SibSp. Normalize to range 0. to 1. The test data may contain higher values, which could skew things.
* Parch. Normalize to range 0. to 1. The test data may contain higher values, which could skew things.
* Fare. Normalize to range 0. to 1. Better to bucket this?

The output is Survived, so this is a binary classification problem.

The first pass in the next cell keeps things simpler than this plan: it passes Pclass, SibSp and Parch through unchanged and adds a Male flag.
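
For reference, the fuller plan in the list above could look roughly like this. It is only a sketch: `prep_features_sketch` is a hypothetical helper, not the notebook's function, the Fibonacci bucketing of Age is left out, and `toy` is made-up data.

```python
import numpy as np
import pandas as pd


def prep_features_sketch(titanic_df):
    """Illustrative version of the feature plan (not the notebook's own code)."""
    # One-hot encode ticket class: one 0./1. column per class value present.
    features = pd.get_dummies(titanic_df['Pclass'], prefix='Pclass').astype(float)

    # Sex as a single Male flag.
    features['Male'] = (titanic_df['Sex'] == 'male').astype(float)

    # Replace missing ages with the mean age.
    features['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

    # Min-max scale the count and price columns to the range 0..1.
    for col in ['SibSp', 'Parch', 'Fare']:
        low, high = titanic_df[col].min(), titanic_df[col].max()
        span = high - low
        features[col] = (titanic_df[col] - low) / span if span > 0 else 0.0
    return features


# Hypothetical rows; the values are made up.
toy = pd.DataFrame({
    'Pclass': [1, 3, 2],
    'Sex': ['male', 'female', 'male'],
    'Age': [20.0, np.nan, 40.0],
    'SibSp': [0, 1, 2],
    'Parch': [0, 0, 0],
    'Fare': [10.0, 20.0, 30.0],
})
features = prep_features_sketch(toy)
print(features)
```

Because the scaling here uses each frame's own minimum and maximum, train and test data would be scaled differently; reusing the training minimum and maximum for the test set is one way around the skew mentioned above.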

In [None]:
import numpy as np


def prep_input_data(titanic_df):
    unchanged_x_cols = ['Pclass', 'SibSp', 'Parch']

    df_clean = titanic_df[unchanged_x_cols].copy()
    df_clean['Male'] = 0.
    df_clean.loc[titanic_df['Sex'] == 'male', 'Male'] = 1.
    
    x_cols = unchanged_x_cols + ['Male']

    return df_clean, x_cols


def prep_training_data(titanic_df):
    df_clean, x_cols = prep_input_data(titanic_df)

    y_col = ['Survived']
    df_clean[y_col] = titanic_df[y_col]
    
    # Shuffle the rows. No random_state is set, so the split differs between runs.
    df_clean = df_clean.sample(frac=1).reset_index(drop=True)
    
    train_len = int(0.8 * len(df_clean))
    
    df_train = df_clean.head(train_len)
    df_val = df_clean.tail(len(df_clean) - train_len)
    
    return df_train, df_val, x_cols, y_col
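
The manual shuffle-and-slice above gives a different split on every run. One alternative, sketched here on made-up rows, is scikit-learn's `train_test_split` with a fixed `random_state` and stratification on the label; the seed value 0 and the `toy` frame are arbitrary choices for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned rows; the values are made up.
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 3, 1, 2, 3, 3, 1, 3],
    'Male': [0., 1., 1., 0., 1., 0., 1., 1., 0., 1.],
    'Survived': [1, 0, 0, 1, 1, 1, 0, 0, 1, 0],
})

# 80/20 split. random_state makes it repeatable; stratify keeps the
# survived/died ratio the same in both parts.
df_train_toy, df_val_toy = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy['Survived'])

print(len(df_train_toy), len(df_val_toy))
print(df_val_toy['Survived'].value_counts())
```

A repeatable split makes it easier to tell whether a change in validation accuracy comes from the model or just from a different shuffle.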

In [None]:
from sklearn import tree
from sklearn import metrics

df_train, df_val, x_cols, y_col = prep_training_data(df)

df_train.info()

clf = tree.DecisionTreeClassifier(max_depth=3)
clf.fit(df_train[x_cols], df_train[y_col])

In [None]:
train_predictions = clf.predict(df_train[x_cols])

train_accuracy = metrics.accuracy_score(df_train[y_col], train_predictions)
train_accuracy

In [None]:
val_predictions = clf.predict(df_val[x_cols])

val_accuracy = metrics.accuracy_score(df_val[y_col], val_predictions)
val_accuracy

In [None]:
df_test = pd.read_csv('../data/test.csv')
df_test.info()
df_test.head()

In [None]:
df_test_input, _ = prep_input_data(df_test)

df_test_input.info()
df_test_input.head()

In [None]:
test_predictions = clf.predict(df_test_input)

df_submission = pd.DataFrame({
    'PassengerId': df_test['PassengerId'],
    'Survived': test_predictions,
})

df_submission.to_csv('../output/submission.csv', index=False)

#### Submission 1

This submission scored **0.77272**, which isn't too bad considering how basic it is.