# Random Forest Titanic Disaster Survival Prediction 

This notebook tackles the Titanic dataset from the Kaggle competition https://www.kaggle.com/c/titanic using a Random Forest.


In [1]:
# Only NumPy and pandas are needed up front; the scikit-learn pieces
# are imported in the cells that use them
import numpy as np
import pandas as pd


# Step 1. Data Preprocessing

The data comes as two CSV files: a training set and a test set. The test set is unlabelled and is used only to produce predictions for the competition, so we have to split the training set in order to evaluate our model before making the competition predictions.

The four preprocessing steps are:
    1. Handle missing data
    2. One-hot encode categorical data
    3. Split data into training set and test set
    4. Reshape data to fit model
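
As a rough illustration of the first step, the sketch below fills missing values on two tiny made-up frames (`train` and `test` are hypothetical stand-ins, not the Kaggle files). It assumes numeric gaps are filled with the training mean and categorical gaps with a placeholder string `"Unknown"`; both choices are assumptions for illustration rather than a prescribed method.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames standing in for train.csv and test.csv
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})
test = pd.DataFrame({
    "Age": [np.nan, 47.0],
    "Fare": [np.nan, 7.0],
    "Embarked": ["Q", "S"],
})

numeric_cols = ["Age", "Fare"]
categorical_cols = ["Embarked"]

# Learn the fill values from the training data only, then reuse them for both sets
numeric_fill = train[numeric_cols].mean()

train_filled = train.copy()
test_filled = test.copy()
train_filled[numeric_cols] = train[numeric_cols].fillna(numeric_fill)
test_filled[numeric_cols] = test[numeric_cols].fillna(numeric_fill)

# Missing categories become an explicit placeholder value
train_filled[categorical_cols] = train[categorical_cols].fillna("Unknown")
test_filled[categorical_cols] = test[categorical_cols].fillna("Unknown")

print(train_filled)
print(test_filled)
```

Filling numeric columns with numbers keeps them numeric, so a later one-hot encoding step only touches the text columns.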


In [2]:
# Read CSV Data
trainData = pd.read_csv('A:/programming/titanic/train.csv')
testData = pd.read_csv('A:/programming/titanic/test.csv')

# Split dependent and independent variables; .copy() gives independent frames
# so the column assignment below does not raise SettingWithCopyWarning
X = trainData[['Pclass','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']].copy()
Y = np.array(trainData[['Survived']])
compX = testData[['Pclass','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']].copy()

# Save the column names of the raw training data
feature_names = list(trainData.columns)
print(feature_names)

# Fill missing ages with the mean training age
mean_value = X['Age'].mean()
X['Age'] = X['Age'].fillna(mean_value)
# Fill the remaining gaps (Cabin, Embarked) with an empty string
X = X.fillna('')
# Caution: any missing numeric value in compX also becomes '', which turns
# that column into strings and changes how get_dummies encodes it
compX = compX.fillna('')

X.head(10)

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,female,35.0,1,0,113803,53.1,C123,S
4,3,male,35.0,0,0,373450,8.05,,S
5,3,male,29.699118,0,0,330877,8.4583,,Q
6,1,male,54.0,0,0,17463,51.8625,E46,S
7,3,male,2.0,3,1,349909,21.075,,S
8,3,female,27.0,0,2,347742,11.1333,,S
9,2,female,14.0,1,0,237736,30.0708,,C


In [3]:
# One-hot encode the data using pandas get_dummies
X = pd.get_dummies(X)
compX = pd.get_dummies(compX)
# Display the first 5 rows, skipping the 5 numeric columns at the front
X.iloc[:,5:].head(5)

,Sex_female,Sex_male,Ticket_110152,Ticket_110413,Ticket_110465,Ticket_110564,Ticket_110813,Ticket_111240,Ticket_111320,Ticket_111361,...,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_,Embarked_C,Embarked_Q,Embarked_S
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [4]:
# Convert to NumPy arrays
X = np.array(X)
compX = np.array(compX)
# Display the first 5 rows
print(X[:5])

[[ 3. 22.  1. ...  0.  0.  1.]
 [ 1. 38.  1. ...  1.  0.  0.]
 [ 3. 26.  0. ...  0.  0.  1.]
 [ 1. 35.  1. ...  0.  0.  1.]
 [ 3. 35.  0. ...  0.  0.  1.]]


In [5]:
print('Training Features Shape:', X.shape)
print('Training Labels Shape:', Y.shape)

Training Features Shape: (891, 840)
Training Labels Shape: (891, 1)
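
The width of the encoded matrix follows from how `get_dummies` works: each text column contributes one new column per distinct value, so near-unique columns such as `Ticket` and `Cabin` account for most of the 840 features. A minimal sketch on a made-up frame (`toy` is hypothetical) shows the relationship:

```python
import pandas as pd

# Hypothetical frame with one low-cardinality and one near-unique text column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "373450"],
    "Embarked": ["S", "C", "S", "S"],
})

# Distinct values per column = number of dummy columns that column produces
cardinality = toy.nunique()
encoded = pd.get_dummies(toy)

print(cardinality)
print("Encoded shape:", encoded.shape)
```

Checking `nunique()` before encoding is one way to spot columns that may be better dropped or grouped than one-hot encoded.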


In [6]:
# Use scikit-learn to split the data into training and testing sets
from sklearn.model_selection import train_test_split

# Hold out 25% of the labelled data for evaluation
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size = 0.25, random_state = 42)
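
An optional variation, not used in this notebook, is to pass `stratify` so that both splits keep the same class balance. The sketch below uses made-up arrays (`X_toy`, `y_toy` are hypothetical) purely to show the call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 12 of class 0 and 8 of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# stratify=y_toy keeps the 60/40 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts:", np.bincount(y_te))
```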

# Step 2. Build and Train Model

Using sklearn, we fit the model to the training set.

In [7]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 10000 decision trees
rf = RandomForestRegressor(n_estimators = 10000, random_state = 42)

# Train the model on training data
rf.fit(Xtrain, Ytrain.ravel());
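
Since survival is a 0/1 label, an alternative to the regressor used here is scikit-learn's `RandomForestClassifier`, which returns class labels directly and class probabilities through `predict_proba`. The following is only a sketch on synthetic data (`X_toy`, `y_toy` and the 100-tree setting are assumptions for illustration), not a replacement for the notebook's model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem standing in for the Titanic features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_toy[:150], y_toy[:150])

labels = clf.predict(X_toy[150:])              # hard 0/1 labels
proba = clf.predict_proba(X_toy[150:])[:, 1]   # estimated probability of class 1
accuracy = clf.score(X_toy[150:], y_toy[150:]) # fraction of correct labels

print("Accuracy:", accuracy)
```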

# Step 3. Quality Evaluation

Next, we make predictions on the held-out test set to analyse the quality of the model.

Assumption: a prediction greater than 0.5 indicates 1 (survived), and a prediction less than 0.5 indicates 0 (did not survive).

In [8]:
# Use the forest's predict method on the test data
predictions = rf.predict(Xtest)

# Calculate results (correct is 1, wrong is 0)
correct = 0
results = []
for i in range(len(predictions)):
    if predictions[i] > 0.5 and Ytest[i] == 1:
        results.append(1)
        correct += 1
    elif predictions[i] < 0.5 and Ytest[i] == 0:
        results.append(1)
        correct += 1
    else:
        results.append(0)

# Calculate the percentage of correct predictions
acc = 100 * correct / len(results)

# Display results
print('Correct: ', correct)
print('Total: ', len(results))
print('Accuracy: ', round(acc, 2), '%')

Correct:  181
Total:  223
Accuracy:  81.17 %
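
The same thresholding can be written without an explicit loop. The sketch below uses made-up values (`predictions_toy` and `labels_toy` are hypothetical) and assumes a prediction of exactly 0.5 is assigned to class 0, which differs slightly from the loop above, where an exact 0.5 is always counted as wrong:

```python
import numpy as np

predictions_toy = np.array([0.9, 0.2, 0.6, 0.4, 0.5])
labels_toy = np.array([[1], [0], [0], [0], [1]])  # column vector, like Ytest

# Threshold the continuous predictions into 0/1 classes
predicted_class = (predictions_toy > 0.5).astype(int)

# Flatten the labels so the comparison is element by element
correct_mask = predicted_class == labels_toy.ravel()
accuracy_pct = 100 * correct_mask.mean()

print("Correct:", int(correct_mask.sum()))
print("Accuracy:", round(accuracy_pct, 2), "%")
```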


In [9]:
# Flatten the labels so they have the same shape as the predictions
y_true = Ytest.ravel()

# Calculate the absolute errors, one per test sample
errors = abs(predictions - y_true)

# Calculate mean absolute percentage error (MAPE)
mape = []
for i in range(len(errors)):
    if y_true[i] == 0:
        # A zero label cannot be used as a divisor, so keep the raw error
        m = errors[i]
    else:
        m = 100 * (errors[i] / y_true[i])
    mape.append(m)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Mean Absolute Accuracy:', round(accuracy, 2), '%.')
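
Array shapes matter when computing element-wise errors: `rf.predict` returns a 1-D array, while `Y` was built as a column vector of shape (n, 1). Subtracting one from the other makes NumPy broadcast to an n-by-n matrix of all pairwise differences instead of n per-sample errors. A small sketch with made-up numbers (`pred` and `y_col` are hypothetical):

```python
import numpy as np

pred = np.array([0.8, 0.1, 0.6])     # shape (3,), like the output of rf.predict
y_col = np.array([[1], [0], [0]])    # shape (3, 1), like Ytest

# Mixed shapes broadcast to every pairing of prediction and label
pairwise = np.abs(pred - y_col)

# Flattening the labels gives one error per sample
per_sample = np.abs(pred - y_col.ravel())
mae = per_sample.mean()

print("Pairwise shape:", pairwise.shape)
print("Per-sample shape:", per_sample.shape)
print("MAE:", round(mae, 2))
```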


# Step 4. Competition Predictions

In [10]:
predictions = rf.predict(compX)

ValueError: Number of features of the model must match the input. Model n_features is 840 and input n_features is 698 
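
The mismatch arises because `get_dummies` was run separately on the training and competition data, so each frame only gets columns for the values it happens to contain. One possible fix, sketched here on made-up frames (`train_raw` and `comp_raw` are hypothetical), is to reindex the competition columns to the training columns; in this notebook that would need to happen before the frames are converted to NumPy arrays.

```python
import pandas as pd

# Hypothetical frames whose categories only partly overlap
train_raw = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"]})
comp_raw = pd.DataFrame({"Sex": ["female", "female"], "Embarked": ["Q", "S"]})

train_enc = pd.get_dummies(train_raw)
comp_enc = pd.get_dummies(comp_raw)

# Keep exactly the training columns, in the same order:
# columns missing from comp_enc are added as 0, unseen ones are dropped
comp_aligned = comp_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(comp_aligned.columns))
```

Categories that appear only in the competition data (such as `Embarked_Q` in this toy example) are dropped by this approach, since the model never saw them during training.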

# References

https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

https://stackoverflow.com/questions/44601533/how-to-use-onehotencoder-for-multiple-columns-and-automatically-drop-first-dummy