GitHub: https://github.com/janderss0n/titanic_ml_example

Data from: https://www.kaggle.com/c/titanic/overview

# Predict survival on Titanic
This is a supervised learning problem: we have a labeled dataset that we can train a model on.
Our target, whether a passenger survived (1) or not (0), is categorical, so we can use a classification model.

In [None]:
import pandas as pd
import numpy as np

Load the training data

In [None]:
original_data = pd.read_csv('titanic_data/train.csv')
original_data.head()

Survived is our target.
Let's use Pclass, Sex, Age, Cabin, Embarked.

In [None]:
target = 'Survived'
interesting_columns = ['Pclass', 'Sex', 'Age', 'Cabin', 'Embarked']
data = original_data.loc[:, interesting_columns + [target]]

# Exploratory data analysis (EDA)

In [None]:
import seaborn as sns
sns.set(context='notebook', style='whitegrid', 
        palette='pastel', font='sans-serif', 
        font_scale=2, color_codes=True, rc=None)

In [None]:
sns.countplot(x='Pclass', data=data)

In [None]:
sns.countplot(x='Sex', data=data)

In [None]:
# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement.
sns.histplot(data['Age'].dropna(), kde=True)

In [None]:
sns.countplot(x='Embarked', data=data)

We can also break one column down by another, for example survival counts per sex.

In [None]:
sns.countplot(x='Sex', hue='Survived', data=data)

It is important to look at the balance between the classes in the target.
If the classes are imbalanced, you need to account for that, for example by resampling.

If class 0 made up 95% of the data and class 1 only 5%, a model that always guesses 0 would still reach 95% accuracy.
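
The 95% figure can be checked with a few lines of code. This is a minimal sketch on made-up labels (the arrays below are hypothetical, not the Titanic data); it also shows that such a model never finds a single positive example.

```python
import numpy as np

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class.
y_always_zero = np.zeros_like(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = (y_always_zero == y_true).mean()
# Recall for class 1: share of the true positives that were predicted as 1.
recall_positive = (y_always_zero[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")
print(f"recall for class 1: {recall_positive:.2f}")
```

The accuracy looks good while the recall for class 1 is zero, which is why accuracy alone is a poor metric on imbalanced data.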

In [None]:
sns.countplot(x=target, data=data)

## Data Preprocessing

Choose an equal number of examples from each target class, then shuffle the dataset.

In [None]:
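# Undersample the majority class (non-survivors) to the size of the minority class.
min_target_class_samples = data[target].value_counts().min()
non_survival_ex = data.loc[data[target]==0,:]
proc_data = non_survival_ex.sample(n=min_target_class_samples,
                                   replace=False, random_state=42)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
proc_data = pd.concat([proc_data, data.loc[data[target]==1,:]])
# Shuffle the rows; random_state keeps the shuffle reproducible.
proc_data = proc_data.sample(frac=1, random_state=42).reset_index(drop=True)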
print(proc_data[target].value_counts())

Some algorithms can't handle string values or categorical columns, so these need to be converted to numbers first.
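
As a small illustration of what such a conversion looks like, the sketch below one-hot encodes a toy column with `pd.get_dummies`. The `toy` DataFrame is made up for this example; `dtype=int` is passed so the result contains 0/1 integers rather than booleans.

```python
import pandas as pd

# Hypothetical three-row column of string categories.
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# One new column per category; each row has a 1 in the column of its category.
encoded = pd.get_dummies(toy["Sex"], dtype=int)
print(encoded)
```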

In [None]:
# Replace any cabin value with 1 and NaN with 0 in the Cabin column.
# Casting to int keeps the column numeric, so select_dtypes keeps it later on.
proc_data['Cabin'] = proc_data['Cabin'].notnull().astype(int)

In [None]:
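# Split Sex and Embarked categories into separate 0/1 columns.
# dtype=int avoids the boolean columns that newer pandas versions return by default.
new_embarked = pd.get_dummies(proc_data['Embarked'], prefix='Embarked', dtype=int)
new_sex = pd.get_dummies(proc_data['Sex'], dtype=int)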
proc_data = pd.concat([proc_data, new_embarked, new_sex], axis=1)
proc_data.head()

Check if we have any NaN values

In [None]:
proc_data.isnull().sum()

Remove rows with NaN from the training data, or replace the missing values with a sensible value such as the median.
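
The two options can be compared on a toy example. This is a minimal sketch with a made-up DataFrame (`toy` is hypothetical): dropping rows loses data, while filling keeps every row at the cost of inventing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages.
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 58.0, np.nan],
                    "Pclass": [3, 1, 2, 1, 3]})

# Option 1: drop every row that has a missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill missing ages with the median age.
filled = toy.fillna({"Age": toy["Age"].median()})

print(len(toy), len(dropped), len(filled))
print(filled["Age"].tolist())
```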

In [None]:
proc_data.describe()

In [None]:
proc_data.Age.median()

In [None]:
# Replace nan with median age
median_age = proc_data.Age.median()
proc_data.loc[proc_data.Age.isnull(), 'Age'] = median_age

How does it look now?

In [None]:
proc_data.isnull().sum()

In [None]:
proc_data.describe()

## Train a Random Forest Classifier model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

Split the data into training and validation sets. Note: you should always set aside a third dataset, called the test or holdout set, for one final evaluation, to reduce bias in your score.
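
One way to get three sets is to call `train_test_split` twice. The sketch below uses made-up arrays (`X_toy`, `y_toy`, and the 60/20/20 proportions are assumptions for illustration, not what this notebook does).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and target (100 rows, balanced classes).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# First carve off 20% as the final holdout set.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy)

# Then split the remaining 80 rows: 25% of them (20 rows) become the validation set.
X_fit, X_valid, y_fit, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(len(X_fit), len(X_valid), len(X_holdout))
```

The holdout set is only touched once, after all model choices have been made on the validation set.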

In [None]:
X = proc_data.drop(columns=[target]).select_dtypes(include='number')
y = proc_data[target]
print(X.head())
print(y.head())

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
# random_state makes the trained forest reproducible between runs.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

Make predictions on the validation set

In [None]:
y_pred = model.predict(X_val)
print(X_val[0:5])
y_pred[0:5]
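
`classification_report` is imported above but not used. A minimal sketch of how it could summarize validation predictions follows; it runs on synthetic data from `make_classification` (an assumption, so the block is self-contained), and all names ending in `_demo` or starting with `Xd_`/`yd_` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(Xd_train, yd_train)
yd_pred = demo_model.predict(Xd_val)

# Precision, recall and F1 score per class, plus overall accuracy.
report = classification_report(yd_val, yd_pred)
print(report)
```

In this notebook the equivalent call would be `classification_report(y_val, y_pred)`.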

# Plot ROC curve for evaluation

In [None]:
# Predict probabilities
probs = model.predict_proba(X_val)
probs = probs[:, 1] # keep probabilities for the positive outcome only

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_val, probs)
plt.plot([0, 1], [0, 1], linestyle='--') # plot no skill
plt.plot(fpr, tpr, marker='.') # plot the roc curve for the model
plt.title('ROC curve')
plt.show()
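
The curve can be condensed into a single number, the area under the ROC curve (AUC), where 0.5 corresponds to the no-skill diagonal and 1.0 to a perfect ranking. A small sketch with made-up labels and scores (both arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# AUC equals the share of (positive, negative) pairs that are ranked correctly:
# here 3 of the 4 pairs are, because the positive scored 0.35 ranks below the negative scored 0.4.
auc_demo = roc_auc_score(y_true_demo, scores_demo)
print(auc_demo)  # 0.75
```

For the model above, the corresponding call would be `roc_auc_score(y_val, probs)`.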

# Get all of the trees in the model

In [None]:
import pydot
from sklearn.tree import export_graphviz

# Export every tree in the forest to tree0.png, tree1.png, ...
# Writing PNG files requires the Graphviz `dot` executable to be installed.
for i_tree, tree_in_forest in enumerate(model.estimators_):
    export_graphviz(tree_in_forest, out_file='tree.dot',
                    feature_names=list(X_train.columns),
                    filled=True,
                    rounded=True)
    (graph,) = pydot.graph_from_dot_file('tree.dot')
    graph.write_png('tree' + str(i_tree) + '.png')


<img src='tree0.png'>

# Save the trained model to file
For later use in an API.

In [None]:
import pickle

# Use a context manager so the file is closed after writing.
with open('model_titanic_survival.pkl', 'wb') as f:
    pickle.dump(model, f)
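
Loading the model back, for example inside the API, is the mirror image of saving it. The sketch below is a self-contained round trip on a small synthetic model written to a temporary directory (the data, `demo_model`, and the file name `demo_model.pkl` are assumptions for illustration).

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small stand-in model on synthetic data.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    # Save the model to disk.
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    # Load it back into a new object.
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

# The loaded model should give the same predictions as the original.
same_predictions = bool((loaded_model.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that was used to save it where possible.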