This notebook builds a model that predicts whether a passenger on the Titanic survived. It includes exploratory data analysis (EDA), and the model I ended up using is RandomForestClassifier.

I learned a lot of the EDA techniques from other notebooks submitted to this challenge. 

In [None]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

In [None]:
os.chdir('../input')
df_train = pd.read_csv('train.csv', header=0, index_col=0, sep=',')

# create copy just for EDA
df_eda = df_train.copy()
df_test = pd.read_csv('test.csv', header=0, index_col=0, sep=',')

columns = df_eda.columns

Find missing data

In [None]:
def null_percentage(column):
    df_name = column.name
    nans = np.count_nonzero(column.isnull().values)
    total = column.size
    frac = nans / total
    perc = int(frac * 100)
    print('%d%% of values or %d missing from %s column.' % (perc, nans, df_name))

for col in columns:
    null_percentage(df_eda[col])
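
A more compact way to get the same overview is to take the column-wise mean of the boolean `isnull()` mask. This is a minimal sketch on a small made-up frame (the `toy` data and the `missing` table are illustrative assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 8.05, 8.46],
})

# isnull() gives a boolean mask; its column-wise mean is the missing fraction.
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': (toy.isnull().mean() * 100).round(1),
})
print(missing)
```

The same two lines should work on `df_eda` directly, and the result can be sorted to put the worst columns first.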

Let's take a look at what the values look like. 

In [None]:
df_eda.head(10)

Age and Cabin have a significant amount of missing data. We'll need to deal with that later. For now, let's convert the Sex column into numerical data and take a look at the overall heatmap. I'm also going to add columns that count known and unknown values in Age and Cabin columns. 

In [None]:
map_sex = {'male': 1, 'female': 0}
df_eda.Sex = df_eda.Sex.map(map_sex)

In [None]:
# notnull() marks every recorded age as known; casting Age to int first would
# truncate infants under 1 year old to 0 and count them as unknown
df_eda['age_known'] = df_eda.Age.notnull().astype(int)
df_eda.age_known.value_counts()

In [None]:
# 1 if a cabin was recorded, 0 otherwise (kept numeric so it shows up in the heatmap)
df_eda['cabin_known'] = df_eda.Cabin.notnull().astype(int)
df_eda.cabin_known.value_counts()

In [None]:
# generate a correlation matrix and build a heatmap
plt.figure('heatmap')
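# restrict to numeric columns; recent pandas versions raise an error on text columns in corr()
_ = sns.heatmap(df_eda.select_dtypes(include='number').corr(), vmax=0.6, square=True, annot=True)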
plt.show()

Some initial observations: 
1. Sex and Pclass are highly correlated with survival. 
2. Fare is also somewhat correlated with survival, but it's also correlated with Pclass. This data might be useful or it might be overlapping. 
3. SibSp and Parch are correlated. Having a spouse or sibling on board also means you might have a parent or child. 
4. Age and SibSp are correlated. Older passengers are more likely to travel with a spouse, while younger children are often accompanied by siblings. 
5. Age is surprisingly not very correlated with survival. Women and children were rescued first, but maybe the number of children is too small to impact the data. 

Let's look at a few statistics, starting with gender, because that's where we find the strongest correlation. 

In [None]:
tab = pd.crosstab(df_train['Sex'], df_train['Survived'])
tab.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
print(tab)
plt.show()
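
Raw counts are hard to compare when the groups differ in size. A minimal sketch of turning a crosstab into survival rates, assuming a small made-up sample (`toy` is illustrative, not the Titanic data):

```python
import pandas as pd

# Made-up sample for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})

# normalize='index' divides each row by its total, so every row sums to 1.
rates = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')
print(rates)
```

The same `normalize='index'` argument should work for the other crosstabs in this notebook.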

Women had a much better chance at survival than men! Let's look into the age correlation and see if the "women and children first" rule really applied.

In [None]:
plt.figure('age distribution', figsize=(18,12))
plt.subplot(411)
ax = sns.distplot(df_eda.Age.dropna().values, bins=range(0,81,1), kde=False, axlabel='Age')
plt.subplot(412)
sns.distplot(df_eda[df_eda['Survived'] == 1].Age.dropna().values, bins = range(0, 81, 1), 
             color='blue', kde=False)
sns.distplot(df_eda[df_eda['Survived'] == 0].Age.dropna().values, bins = range(0, 81, 1), 
             color='red', kde=False, axlabel='All survivors by age')
plt.subplot(413)
sns.distplot(df_eda[(df_eda['Sex']==0) & (df_eda['Survived'] == 1)].Age.dropna().values, bins = range(0, 81, 1), 
             color='blue', kde=False)
sns.distplot(df_eda[(df_eda['Sex']==0) & (df_eda['Survived'] == 0)].Age.dropna().values, bins = range(0, 81, 1), 
             color='red', kde=False, axlabel='Female survivors by age')
plt.subplot(414)
sns.distplot(df_eda[(df_eda['Sex']==1) & (df_eda['Survived'] == 1)].Age.dropna().values, bins = range(0, 81, 1), 
             color='blue', kde=False)
sns.distplot(df_eda[(df_eda['Sex']==1) & (df_eda['Survived'] == 0)].Age.dropna().values, bins = range(0, 81, 1), 
             color='red', kde=False, axlabel='Male survivors by age')
plt.show()



While age overall has minimal correlation with the survival rate, children under 9 of both sexes appear to have a higher chance of surviving. This gets lost in the overall figure because the group makes up a small percentage of the total, but it could be useful for the predictive model. Being older than 65 also seems to be correlated with survival, but no women over 65 were on board and the sample is so small that I don't think we can rely on it. 

And like children, women had much better odds of survival than men. This confirms that the women and children were likely given priority boarding the very limited lifeboats.
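
One way to check an observation like the under-9 effect numerically is to cut Age into bands and compare survival rates per band. This is only a sketch on made-up rows; the band edges (9 and 65) come from the discussion above, while the `toy` data and the band labels are assumptions:

```python
import numpy as np
import pandas as pd

# Made-up sample for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Age': [0.8, 4.0, 8.0, 9.0, 30.0, 45.0, 70.0, np.nan],
    'Survived': [1, 1, 0, 0, 1, 0, 0, 1],
})

# right=False makes the bands [0, 9), [9, 65), [65, 120):
# age 8 counts as a child, age 9 does not.
bands = pd.cut(toy['Age'], bins=[0, 9, 65, 120], right=False,
               labels=['child', 'adult', 'senior'])

# Rows with unknown age get no band and are left out of the groupby.
summary = toy.groupby(bands, observed=False)['Survived'].agg(['mean', 'count'])
print(summary)
```

The `count` column matters as much as the rate: a band with only a handful of passengers says very little.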

But it's also important to note that these charts exclude the 19% of passengers whose ages were unknown. That's a large margin of error, and I'm not sure the missing values are evenly distributed across ages or between survivors and non-survivors. Let's actually look into that:

In [None]:
tab = pd.crosstab(df_eda['age_known'], df_eda['Survived'])
print(tab)
tab.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
plt.show()

Huh, there's about a 39% chance you survived if the history books know your age, and a 32% chance you survived if they don't. Maybe we can break Age down into two categorical variables to make the model more accurate: child/adult and age known/unknown. 
(A quick model with the naive bayes algorithm just imputing the mean or median age performs fairly poorly.)
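
Whether a gap like this is more than noise can be checked with a chi-squared test of independence. The sketch below uses `scipy.stats.chi2_contingency` on made-up counts (the `table` values are an assumption for illustration, not the Titanic numbers):

```python
import numpy as np
from scipy import stats

# Made-up counts for illustration only; not the Titanic data.
# Rows: age unknown / age known. Columns: died / survived.
table = np.array([[68, 32],
                  [122, 78]])

# chi2_contingency tests whether the row and column variables are independent.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print('chi2 = %.3f, p = %.3f, dof = %d' % (chi2, p_value, dof))
```

On the real data, `pd.crosstab(df_eda['age_known'], df_eda['Survived'])` could be passed in place of `table`. A small p-value would suggest the difference is unlikely to be chance alone.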

We also saw from the heatmap that there is a significant correlation between Pclass and Survived. Let's see if holding a first class ticket can tip the scale in your favor. 

In [None]:
tab = pd.crosstab(df_eda['Survived'], df_eda['Pclass'])
print(tab)
tab.plot(kind='bar', stacked=True, color=['darkgreen', 'green', 'lightgreen'], grid=False)
plt.show()
#tab.div(tab.sum(1).astype(float), axis=0).plot(kind="bar", color=['darkgreen', 'green', 'lightgreen'], stacked=True)
#plt.show()


Interesting correlation. First class passengers had pretty good odds of surviving. Second class is a crapshoot, and you really don't want to be in third class. 

This provides some interesting info. We don't have enough cabin data to say where people were staying, but maybe we don't need it. The cheap rooms are generally near the interior of the ship, so it would seem that those with balconies or better locations had a better chance of surviving. Or maybe important people just had priority for the lifeboats. Either way, this is good information for our model. 

But let's check in on cabin number anyway, as there are a lot of missing values. 

In [None]:
tab = pd.crosstab(df_eda['cabin_known'], df_eda['Survived'])
print(tab)
tab.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
plt.show()

Okay, that's pretty dramatic. You have about a 2/3 chance of survival if your cabin number was recorded and a 1/3 chance of survival if it wasn't. Maybe the record book went down with the ship and this list was built from the memories of the survivors or receipts or something. 

(By the way, thanks to a number of Kaggle contributors for showing me that missing data is actually data in itself! On my first try I just dropped the cabin column and filled in missing Age values with the median.)

Fare is highly correlated with Pclass. Let's see if it has a similar correlation with surviving. 

In [None]:
sns.violinplot(x='Pclass', y='Fare', hue='Survived', data=df_eda, split=True)
plt.hlines([0,512], xmin=-1, xmax=3)
plt.show()

Looks like within each class of ticket there's a higher chance of survival the more you paid for your ticket, with your chances looking pretty good above a fare of 300. Note that the violin shapes are smoothed density estimates, so they show trends rather than exact counts. The horizontal lines mark the minimum and maximum ticket prices (0 and 512). Is this valuable data, though? We have to keep outliers in mind, as there were a number of free tickets as well as super expensive tickets. 
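
If the outliers get in the way, one common option (not used in this notebook) is to compress the scale with a log transform. A minimal sketch on made-up fares; the `fares` values and the `log_fare` name are assumptions:

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration only; not the Titanic data.
fares = pd.Series([0.0, 7.25, 13.0, 26.0, 80.0, 512.0], name='Fare')

# log1p computes log(1 + x), so free tickets (fare 0) map to 0 instead of -inf.
log_fares = np.log1p(fares)
print(pd.DataFrame({'Fare': fares, 'log_fare': log_fares.round(2)}))
```

Tree-based models such as a random forest split on the ordering of values, so a monotonic transform like this mainly helps plots and scale-sensitive models.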

And finally, let's look at Embarked because we couldn't visualize that in the heatmap. 

In [None]:
tab = pd.crosstab(df_eda['Embarked'], df_eda['Survived'])
print(tab)
tab.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
plt.show()

Cherbourg has an excellent survival rate. Maybe that's where the more affluent people got on...

In [None]:
tab = pd.crosstab(df_eda['Embarked'], df_eda['Pclass'])
print(tab)

Yep, lots of first class passengers got on here vs other ports. Queenstown must have been running a discount. 

I'm skipping over Parch and SibSp because they're not really correlated with surviving. Though I'm sure it's possible to drill down and build some categories based on them to improve the model, like "kids with more than one sibling" or something like that. I plan to revisit it later once my visualization skills have improved. 
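
As a starting point for that kind of drill-down, SibSp and Parch can be combined into a single party size. This is a hypothetical sketch, not something the model below uses; `family_size` and `is_alone` are made-up feature names and `toy` is made-up data:

```python
import pandas as pd

# Made-up rows for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'SibSp': [0, 1, 3, 0],
    'Parch': [0, 0, 2, 2],
})

# Hypothetical features: total party size and a flag for travelling alone.
toy['family_size'] = toy['SibSp'] + toy['Parch'] + 1
toy['is_alone'] = (toy['family_size'] == 1).astype(int)
print(toy)
```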

## Editing the features

For this part I'm going to clean the data and create the features in one block of code, applied to the training and test sets together, so both end up with exactly the same columns for the model. I'm not sure if this is good practice; I'm just a beginner. 

Here's what I'm going for: 

1. Pclass (1,2,3)
2. Cabin Known (0,1)
3. Age Known (0,1)
4. Is a child age 8 or younger (0,1)
5. Sex (0,1)
6. Fare (continuous)
7. Embarked (Q and S dummies; C is the dropped baseline)
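
The feature list above can also be sketched as a single reusable function. `build_features` is a hypothetical helper, not the code this notebook runs, and it assumes missing Fare and Embarked values were already filled in:

```python
import numpy as np
import pandas as pd

def build_features(raw):
    """Return model features for a raw Titanic-style frame (hypothetical helper)."""
    out = pd.DataFrame(index=raw.index)
    out['Pclass'] = raw['Pclass']
    out['Sex'] = raw['Sex'].map({'male': 1, 'female': 0})
    out['age_known'] = raw['Age'].notnull().astype(int)
    out['cabin_known'] = raw['Cabin'].notnull().astype(int)
    # NaN < 9 evaluates to False, so unknown ages are never flagged as young.
    out['Young'] = (raw['Age'] < 9).astype(int)
    out['Fare'] = raw['Fare']
    # Explicit comparisons give the same two columns for any input frame.
    out['Embarked_Q'] = (raw['Embarked'] == 'Q').astype(int)
    out['Embarked_S'] = (raw['Embarked'] == 'S').astype(int)
    return out

# Made-up rows for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'female'],
    'Age': [0.75, np.nan, 40.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Fare': [7.25, 71.28, 13.0],
    'Embarked': ['S', 'C', 'Q'],
})
features = build_features(toy)
print(features)
```

Building the Embarked columns by comparison rather than with `get_dummies` means train and test do not need to be concatenated first, at the cost of hard-coding the port codes.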

Let's just make sure there's no extra missing data in the test data set. 

In [None]:
for col in columns[1:]:
    null_percentage(df_test[col])

Fare has a missing value in df_test and, from before, two Embarked values were missing from df_train. Since both are strongly correlated with class, let's replace the missing fare with the median third-class fare and the missing ports with the most common value, 'S'. 

Note that np.nan == np.nan returns False, so comparing a column with np.nan never finds the missing rows; isnull() does. Let's print the row with the missing fare. 
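
The NaN comparison behaviour can be seen on a tiny made-up Series (the values in `s` are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series(['S', np.nan, 'C', np.nan])  # made-up values

print((s == np.nan).sum())   # 0: equality never matches NaN
print(s.isnull().sum())      # 2: isnull() finds both missing entries

# fillna() replaces the missing entries without needing a mask at all.
filled = s.fillna('S')
print(filled.tolist())
```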

In [None]:
print(df_test[df_test['Fare'].isnull()])

In [None]:
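# == np.nan never matches (NaN is not equal to itself), so fill the gaps directly
df_train['Embarked'] = df_train['Embarked'].fillna('S')
# fill the single missing fare with the median third-class fare
df_test.loc[df_test['Fare'].isnull(), 'Fare'] = df_test.loc[df_test['Pclass'] == 3, 'Fare'].median()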

# select target values
y_targets = df_train['Survived'].values

# combine for transformation
df_train['train'] = 1
df_test['train'] = 0
df = pd.concat([df_train, df_test], ignore_index=False, axis=0)

# select the columns to persist after transforming
train_cols = ['Pclass', 'Sex', 'age_known', 'cabin_known', 'Young', 'Fare', 
             'Embarked_Q', 'Embarked_S', 'train']

map_sex = {'male': 1, 'female': 0}
df.Sex = df.Sex.map(map_sex)

# flag known values with notnull(); casting Age to int first would truncate
# infants under 1 year old to 0 and mark their age as unknown
df['age_known'] = df.Age.notnull().astype(int)
df['cabin_known'] = df.Cabin.notnull().astype(int)

# NaN < 9 is False, so passengers with unknown age are never flagged as young
young_bool = (df.age_known == 1) & (df.Age < 9)
df['Young'] = young_bool.astype(int)

df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

df = df[train_cols]

# split back into training and test set after transforming
df_train = df[df['train'] == 1].drop(['train'], axis = 1)
df_test = df[df['train'] == 0].drop(['train'], axis = 1)

Split the labeled training data into a training set and a held-out test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train, y_targets, test_size=0.3, random_state=42)

## Building a Model: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=42)  # fixed seed so results are reproducible
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

Let's look at the confusion matrix and cross_val_score to see how accurate it is:

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

from sklearn.model_selection import cross_val_score
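# cross-validate on all labeled data; 10 folds of the 30% hold-out alone give noisy scores
cvs = cross_val_score(classifier, df_train, y_targets, cv=10)
print(cvs)
print(cvs.mean())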

Not bad! Final step: 

In [None]:
y_pred_test = classifier.predict(df_test)
df_test['Survived'] = y_pred_test
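
For the submission itself, Kaggle's Titanic competition expects a CSV with a `PassengerId` column and a `Survived` column. A minimal sketch, assuming the predictions sit in a frame indexed by PassengerId as `df_test` is here (the `toy_test` frame and its values are made up, and the file name in the comment is an assumption):

```python
import pandas as pd

# Stand-in for df_test after prediction; made-up ids and predictions.
toy_test = pd.DataFrame(
    {'Survived': [0, 1, 0]},
    index=pd.Index([892, 893, 894], name='PassengerId'),
)

# Keep only the prediction and turn the PassengerId index back into a column.
submission = toy_test[['Survived']].astype(int).reset_index()

# With no path, to_csv returns the text; pass a file name such as
# 'submission.csv' (hypothetical) to write it to disk instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```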

Position 2,316 out of 6,986 with a score of 0.78947. Not terrible! 