## Titanic Analysis - Using Random Forest Classifier

In [492]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

In [493]:
# Read the training data and split it into train and validation sets (80/20).
X = pd.read_csv("train.csv")
X_train, X_val = train_test_split(X, test_size=0.2)

# Separate the target column from the features in each frame.
y = X.pop("Survived")
y_train = X_train.pop('Survived')
y_val = X_val.pop('Survived')
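
The split above draws a different random sample on every run, so the scores further down will shift a little each time the notebook is executed. A minimal sketch of a reproducible, stratified split follows. The toy frame stands in for `train.csv`, and the seed of 42 and the helper name `split_frame` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv (assumption: made-up values, illustration only).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 3, 2, 3, 1, 3],
    "Fare": [71.3, 7.9, 13.0, 8.1, 53.1, 7.2, 26.0, 7.8, 83.5, 8.7],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 0, 1, 1],
})


def split_frame(frame, seed=42):
    """Split into train/validation parts with a fixed seed, stratified on the target."""
    return train_test_split(
        frame, test_size=0.2, random_state=seed, stratify=frame["Survived"]
    )


toy_train, toy_val = split_frame(toy)
toy_train_again, toy_val_again = split_frame(toy)

# With the same seed, both calls select the same validation rows.
same_rows = list(toy_val.index) == list(toy_val_again.index)
print("{} train rows, {} validation rows, repeatable: {}".format(
    len(toy_train), len(toy_val), same_rows))
```

Stratifying on `Survived` keeps the share of survivors roughly equal in both parts, which matters more as the validation set gets smaller.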

In [494]:
# First, I'd like to try a quick-and-dirty model using only the numerical columns.
# Let's look at a quick summary of those columns.

X.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [495]:
# The summary shows that Age has missing values (only 714 of 891 rows are populated).
# To run the Random Forest Classifier, these nulls need to be filled.
# For now, I'm simply imputing them with the mean age.

X["Age"] = X["Age"].fillna(X_train['Age'].mean())
X_train["Age"] = X_train["Age"].fillna(X_train['Age'].mean())
X_val["Age"] = X_val["Age"].fillna(X_val['Age'].mean())
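
An alternative to filling each split with its own mean is to compute the fill value once on the training split and reuse it everywhere, so that validation rows are treated exactly like unseen data. A small sketch with made-up ages is below. The toy frames and the helper name `fill_age` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages (assumption: illustration only).
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 30.0]})
toy_val = pd.DataFrame({"Age": [np.nan, 54.0]})


def fill_age(frame, fill_value):
    """Return a copy of frame with missing Age values replaced by fill_value."""
    out = frame.copy()
    out["Age"] = out["Age"].fillna(fill_value)
    return out


# The statistic comes from the training split only.
train_mean_age = toy_train["Age"].mean()

toy_train_filled = fill_age(toy_train, train_mean_age)
toy_val_filled = fill_age(toy_val, train_mean_age)

print(toy_val_filled["Age"].tolist())
```

Returning a copy rather than filling in place also leaves the original frames untouched, which makes it easier to try a different strategy (a median, for instance) later.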

In [496]:
# Now, let's build this quick-and-dirty model using the RandomForestClassifier.
# I'm going with 1000 estimators; n_jobs=-1 runs them in parallel on all available cores.

clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)

# I'm dropping the PassengerId column, as it seems to be irrelevant.
# It might be worth revisiting in future models; for now, I'm skipping it.

features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

clf.fit(X_train[features], y_train)

y_pred_train = clf.predict(X_train[features])
y_pred_val = clf.predict(X_val[features])

# Let's measure the performance of this model on training and validation data sets, 
# which can be later used for model comparisons without making submissions on Kaggle.

print "Confusion Matrix - Training Set \n", confusion_matrix(y_train, y_pred_train)
print "Confusion Matrix - Validation Set \n", confusion_matrix(y_val, y_pred_val)

print "Accuracy Score - Training Set", accuracy_score(y_train, y_pred_train)
print "Accuracy Score - Validation Set", accuracy_score(y_val, y_pred_val)

Confusion Matrix - Training Set 
[[428  12]
 [ 17 255]]
Confusion Matrix - Validation Set 
[[90 19]
 [33 37]]
Accuracy Score - Training Set 0.959269662921
Accuracy Score - Validation Set 0.709497206704
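
The accuracy scores follow directly from the confusion matrices: scikit-learn puts true labels on the rows and predicted labels on the columns, so the diagonal holds the correct predictions. The sketch below recomputes the validation accuracy from the matrix printed above and also derives precision and recall for the survivors. The helper name `summarize` is hypothetical.

```python
def summarize(matrix):
    """Compute accuracy, precision and recall for class 1 from a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels, matching the
    layout returned by sklearn.metrics.confusion_matrix.
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / float(total),
        "precision": tp / float(tp + fp),
        "recall": tp / float(tp + fn),
    }


validation = [[90, 19], [33, 37]]  # copied from the validation output above
scores = summarize(validation)
print(scores)
```

The recall works out to 37 / 70, or about 0.53, meaning the model misses nearly half of the actual survivors in the validation set even though its overall accuracy is about 0.71.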


In [497]:
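# Next, we'll make predictions for the test data set.
# Before that, I'm refitting the model on the complete training set, on the guess that more data will make it more accurate.
# Note that the scores below are measured on the same rows the model was fit on, so they are not comparable to the validation scores.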

clf.fit(X[features], y)
y_pred = clf.predict(X[features])

print confusion_matrix(y, y_pred)
print accuracy_score(y, y_pred)

[[535  14]
 [ 24 318]]
0.957351290685
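
Scoring the refit model on the rows it was trained on mostly measures how well it memorized them. One way to use every training row and still get an out-of-sample estimate is k-fold cross-validation. The sketch below runs on synthetic data; the generated data set, the 5 folds, and the 50 trees are arbitrary choices for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features (assumption: illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Each of the 5 folds is scored by a model that never saw it during fitting.
fold_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="accuracy")

print("per-fold accuracy: {}".format(fold_scores))
print("mean accuracy: {:.3f}".format(fold_scores.mean()))
```

The spread across folds also gives a rough sense of how much a single 80/20 validation score can move just from the choice of rows.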


In [498]:
# Let's do the predictions for the test data set now
X_test = pd.read_csv("test.csv")

dummy_sex = pd.get_dummies(X_test['Sex'], prefix='Sex')
dummy_embarked = pd.get_dummies(X_test['Embarked'], prefix='Embarked')

X_test = pd.concat([X_test, dummy_sex], axis = 1)
X_test = pd.concat([X_test, dummy_embarked], axis = 1)
X_test["Age"] = X_test["Age"].fillna(X_test['Age'].mean())
X_test["Fare"] = X_test["Fare"].fillna(0)

y_pred_test = clf.predict(X_test[features])

In [499]:
# submission = pd.DataFrame({'PassengerId': X_test['PassengerId'], 'Survived': y_pred_test})
# submission.to_csv("submission_1.csv", index=False)

#### I submitted the above predictions and got a score of 0.60766 on Kaggle.

In [501]:
# Now, let's try to improve this model by adding the Sex and Embarked features.
# To do this, we need to encode these categorical columns as dummy variables.

X = pd.concat([X, pd.get_dummies(X['Sex'], prefix='Sex')], axis = 1)
X_train = pd.concat([X_train, pd.get_dummies(X_train['Sex'], prefix='Sex')], axis = 1)
X_val = pd.concat([X_val, pd.get_dummies(X_val['Sex'], prefix='Sex')], axis = 1)

X = pd.concat([X, pd.get_dummies(X['Embarked'], prefix='Embarked')], axis = 1)
X_train = pd.concat([X_train, pd.get_dummies(X_train['Embarked'], prefix='Embarked')], axis = 1)
X_val = pd.concat([X_val, pd.get_dummies(X_val['Embarked'], prefix='Embarked')], axis = 1)
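
Encoding each split separately only works as long as every split contains every category. A split with no passengers from one port would come out missing that dummy column, and selecting the feature list would then fail. A minimal sketch of one way to guard against this is to reindex each encoded frame to the training columns. The toy frames below are made up for illustration.

```python
import pandas as pd

# Made-up splits (assumption: illustration only).
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_val = pd.DataFrame({"Embarked": ["S", "C"]})  # no "Q" in this split

train_dummies = pd.get_dummies(toy_train["Embarked"], prefix="Embarked")
val_dummies = pd.get_dummies(toy_val["Embarked"], prefix="Embarked")

# Give the validation frame exactly the training columns, in the same order.
# Columns that are absent from the split are created and filled with 0.
val_aligned = val_dummies.reindex(
    columns=train_dummies.columns, fill_value=0
).astype(int)

print(list(val_dummies.columns))
print(list(val_aligned.columns))
```

The same `reindex` call can be applied to the encoded test set, so that all frames passed to the classifier share one column layout.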

In [502]:
# Now, let's use the new feature set to build and run the model
features = ['Pclass', 'Age', 'Sex_male', 'Sex_female', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

clf.fit(X_train[features], y_train)

y_pred_train = clf.predict(X_train[features])
y_pred_val = clf.predict(X_val[features])

# Measure the performance of this model on the training and validation sets.
print "Confusion Matrix - Training Set \n", confusion_matrix(y_train, y_pred_train)
print "Confusion Matrix - Validation Set \n", confusion_matrix(y_val, y_pred_val)

print "Accuracy Score - Training Set", accuracy_score(y_train, y_pred_train)
print "Accuracy Score - Validation Set", accuracy_score(y_val, y_pred_val)

Confusion Matrix - Training Set 
[[437   3]
 [  9 263]]
Confusion Matrix - Validation Set 
[[97 12]
 [21 49]]
Accuracy Score - Training Set 0.983146067416
Accuracy Score - Validation Set 0.815642458101


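Validation accuracy rose from about 0.71 to about 0.82, so this model is clearly better than our first one. The gap between training accuracy (about 0.98) and validation accuracy suggests it still overfits, though. Let's go ahead and make the predictions.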

In [503]:
# Again, let's rebuild this model with the complete training set before running the predictions

clf.fit(X[features], y)
y_pred = clf.predict(X[features])

print confusion_matrix(y, y_pred)
print accuracy_score(y, y_pred)

[[544   5]
 [ 11 331]]
0.982042648709


In [504]:
# Let's do the predictions now
X_test = pd.read_csv("test.csv")

dummy_sex = pd.get_dummies(X_test['Sex'], prefix='Sex')
dummy_embarked = pd.get_dummies(X_test['Embarked'], prefix='Embarked')

X_test = pd.concat([X_test, dummy_sex], axis = 1)
X_test = pd.concat([X_test, dummy_embarked], axis = 1)
X_test["Age"] = X_test["Age"].fillna(X_test['Age'].mean())
X_test["Fare"] = X_test["Fare"].fillna(0)

y_pred_test = clf.predict(X_test[features])

In [505]:
# submission = pd.DataFrame({'PassengerId': X_test['PassengerId'], 'Survived': y_pred_test})
# submission.to_csv("submission_2.csv", index=False)

#### I submitted the above predictions and got a score of 0.75598 on Kaggle.