# Exercise: Random Forest Classifier on Titanic Dataset

### Pedro Bueso-Inchausti García

## 1. Objective

Implement a Random Forest Classifier and/or Gradient Boosted Tree ensembles to solve Kaggle's Titanic competition.

*The sinking of the Titanic is one of the most infamous shipwrecks in history. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (name, age, gender, socio-economic class).*

## 2. Pre-requisites

We import all the libraries and functions needed.

In [1]:
import numpy
import pandas
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, Imputer, VectorAssembler, VectorIndexer
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

We initialise the Spark Context and the SQL Context.

In [2]:
sc = SparkContext(master = "local[4]")
sqlc = SQLContext(sc)

We load both the training and testing datasets.

In [3]:
train_data = sqlc.read.format('csv').option('header', 'true').option('inferSchema', 'true').load('train.csv')
test_data = sqlc.read.format('csv').option('header', 'true').option('inferSchema', 'true').load('test.csv')

In [4]:
train_data.toPandas().head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [5]:
test_data.toPandas().head(5)

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |


## 3. Dataset preprocessing

The features we are going to study are the following:

    Survived: a boolean label identifying whether a passenger survived (1) or not (0).
    Sex: a boolean feature identifying the sex of the passenger: male (0) or female (1).
    Age: a numeric feature identifying the age of the passenger in years (fractional values such as 34.5 occur).
    SibSp: a discrete feature identifying the number of siblings/spouses aboard.
    Parch: a discrete feature identifying the number of parents/children aboard.
    Pclass: a discrete feature identifying the class to which a passenger belongs: upper (1), middle (2) and lower (3).
    Embarked: a categorical feature identifying the port of embarkation: Cherbourg (C), Queenstown (Q) and Southampton (S).
    Fare: a continuous feature identifying the cost of the ticket.

We drop the "Name", "Ticket" and "Cabin" features: they are free-text or sparsely populated fields that this model does not exploit.

In [6]:
columns_to_drop = ['Name','Ticket','Cabin']
train_data = train_data.drop(*columns_to_drop)
test_data = test_data.drop(*columns_to_drop)

We convert "Sex" into an indexed feature and "Embarked" into a one-hot vector feature. We replace missing values in "Age" and "Fare" with the median of the column.

In [7]:
ind1 = StringIndexer(inputCol='Sex', outputCol='newSex', handleInvalid='skip')
ind2 = StringIndexer(inputCol='Embarked', outputCol='newEmbarked', handleInvalid='skip')
ohe = OneHotEncoderEstimator(inputCols=['newEmbarked'], outputCols=['onehotEmbarked'], handleInvalid='keep')
imp = Imputer(strategy='median', inputCols=['Age','Fare'], outputCols=['newAge','newFare'])
pl1 = Pipeline(stages=[ind1, ind2, ohe, imp])

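# Fit the preprocessing pipeline on the training data only, then apply the same fitted model to both datasets
pl1_model = pl1.fit(train_data)
train_data = pl1_model.transform(train_data)
test_data = pl1_model.transform(test_data)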
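
To make the three transformations concrete, here is a minimal pure-Python sketch of what frequency-based indexing, one-hot encoding and median imputation do. The helper names (`fit_string_index`, `one_hot`, `impute_median`) and the toy values are hypothetical and only illustrate the idea. Spark's own stages differ in details: for instance, its encoder returns sparse vectors and by default omits the last category, and its Imputer may compute an approximate median.

```python
from collections import Counter
from statistics import median

def fit_string_index(values):
    """Map each distinct non-null label to an index, most frequent label first."""
    counts = Counter(v for v in values if v is not None)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense one-hot list with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def impute_median(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

# Toy columns (hypothetical values, not the real dataset).
embarked = ['S', 'C', 'S', 'S', 'Q', 'C']
ages = [22.0, None, 26.0, 35.0, None]

mapping = fit_string_index(embarked)
print(mapping)                              # {'S': 0, 'C': 1, 'Q': 2}
print(one_hot(mapping['C'], len(mapping)))  # [0.0, 1.0, 0.0]
print(impute_median(ages))                  # [22.0, 26.0, 26.0, 35.0, 26.0]
```

Because the index of each label depends on the label frequencies in the data used for fitting, the same fitted mapping has to be reused on any other dataset for the encoded values to mean the same thing.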

We drop the original columns (and the intermediate "newEmbarked" index), keeping only their transformed counterparts.

In [8]:
columns_to_drop = ['Sex','Embarked','Age','Fare','newEmbarked']
train_data = train_data.drop(*columns_to_drop)
test_data = test_data.drop(*columns_to_drop)

## 4. Modelling

We prepare the data for modelling, which requires creating an indexedLabel column and an indexedFeatures column.

In [9]:
assem = VectorAssembler(inputCols=['Pclass','SibSp','Parch','newSex','onehotEmbarked','newAge','newFare'], outputCol='features')
ind3 = StringIndexer(inputCol='Survived', outputCol='indexedLabel', handleInvalid='skip')
# handleInvalid='keep' puts category values unseen during fitting into an extra bucket instead of raising an error
ind4 = VectorIndexer(inputCol='features', outputCol='indexedFeatures', maxCategories=10, handleInvalid='keep')
pl2 = Pipeline(stages=[assem, ind3, ind4])
pl3 = Pipeline(stages=[assem, ind4])

# Fit both pipelines on the training data so that the test features share the training category indices
test_data = pl3.fit(train_data).transform(test_data)
train_data = pl2.fit(train_data).transform(train_data)
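
The role of `maxCategories` can be illustrated with a small pure-Python sketch: a column with at most that many distinct values is treated as categorical and its values are replaced by indices, while any other column is left as continuous. The functions `fit_vector_index` and `transform` and the toy rows are hypothetical, and the sketch ignores details of Spark's actual index assignment.

```python
def fit_vector_index(rows, max_categories):
    """For each column, return a value->index map if the column has at most
    `max_categories` distinct values, else None (treated as continuous)."""
    n_cols = len(rows[0])
    maps = []
    for j in range(n_cols):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps.append({value: i for i, value in enumerate(distinct)})
        else:
            maps.append(None)
    return maps

def transform(rows, maps):
    """Replace categorical values by their index; leave continuous values as-is."""
    return [[float(m[v]) if m is not None else v for v, m in zip(row, maps)]
            for row in rows]

# Toy rows of (Pclass, Fare); hypothetical values, not the real dataset.
rows = [[3.0, 7.25], [1.0, 71.28], [3.0, 7.93], [1.0, 53.10], [2.0, 8.05]]
maps = fit_vector_index(rows, max_categories=3)
print(maps)                      # [{1.0: 0, 2.0: 1, 3.0: 2}, None]
print(transform(rows, maps)[0])  # [2.0, 7.25]
```

In this sketch a categorical value that was absent when the map was fitted has no index, so `transform` would fail with a `KeyError` on it; this is the situation that an explicit policy for unseen values is meant to handle.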

### Random forest classifier

In [10]:
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
pg_rf = ParamGridBuilder().addGrid(rf.numTrees, [5,10,15,20]).build()
eva_rf = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
cv_rf = CrossValidator(estimator=rf, estimatorParamMaps=pg_rf, evaluator=eva_rf, numFolds=5)

model_rf = cv_rf.fit(train_data)
evaluate_rf = model_rf.transform(train_data)
predict_rf = model_rf.transform(test_data)

print("Accuracy in the train data")
print(eva_rf.evaluate(evaluate_rf))

Accuracy in the train data
0.8458942632170978
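
As a rough illustration of what the cross-validated grid search does, the following pure-Python sketch evaluates each candidate parameter on k held-out folds and keeps the one with the best mean accuracy. Everything here is hypothetical: the toy data, the threshold "model" and the contiguous fold assignment (Spark's CrossValidator assigns rows to folds randomly).

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_validate(xs, ys, grid, fit, k):
    """Return (best_param, {param: mean held-out accuracy})."""
    folds = k_fold_indices(len(xs), k)
    scores = {}
    for param in grid:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(len(xs)) if i not in held]
            model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx], param)
            preds = [model(xs[i]) for i in held_out]
            fold_scores.append(accuracy([ys[i] for i in held_out], preds))
        scores[param] = sum(fold_scores) / k
    best = max(scores, key=scores.get)
    return best, scores

def fit_threshold(train_x, train_y, threshold):
    """Toy 'model': predict 1 when the value exceeds the threshold (ignores the training data)."""
    return lambda x: 1 if x > threshold else 0

# Toy data (hypothetical): a single numeric feature and a binary label.
xs = [5, 8, 12, 30, 50, 70, 7, 9, 40, 80]
ys = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]
best, scores = cross_validate(xs, ys, grid=[6, 20, 60], fit=fit_threshold, k=5)
print(best)  # 20
```

The score attached to each parameter is computed only on rows that were held out from fitting, which is why it is a more honest estimate than the accuracy measured on the full training data.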


### Gradient boosted tree classifier

In [11]:
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
pg_gbt = ParamGridBuilder().addGrid(gbt.maxIter, [5,10,15,20]).build()
eva_gbt = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
cv_gbt = CrossValidator(estimator=gbt, estimatorParamMaps=pg_gbt, evaluator=eva_gbt, numFolds=5)

model_gbt = cv_gbt.fit(train_data)
evaluate_gbt = model_gbt.transform(train_data)
predict_gbt = model_gbt.transform(test_data)

print("Accuracy in the train data")
print(eva_gbt.evaluate(evaluate_gbt))

Accuracy in the train data
0.8582677165354331


## 5. Results

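Both models reach an accuracy of about 85% on the training data: 84.6% for the random forest and 85.8% for the gradient boosted trees. The gradient boosted trees are therefore slightly more accurate, although they require more computational power. Because these figures are measured on the same data used to fit the models, they probably overestimate the accuracy on unseen passengers. Below, I include the predictions of each algorithm and whether or not they agree.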

In [12]:
results_rf = predict_rf.toPandas()[['PassengerId', 'prediction']]
results_gbt = predict_gbt.toPandas()[['PassengerId', 'prediction']]
results = pandas.merge(results_rf, results_gbt, on='PassengerId')
results.columns = ['PassengerId','Survived_RandomForest', 'Survived_GradientBoostedTree']
# Cast every column to integer (DataFrame.applymap is deprecated in recent pandas versions)
results = results.astype(numpy.int64)
results.head()

| | PassengerId | Survived_RandomForest | Survived_GradientBoostedTree |
|---|---|---|---|
| 0 | 892 | 0 | 0 |
| 1 | 893 | 0 | 0 |
| 2 | 894 | 0 | 0 |
| 3 | 895 | 0 | 0 |
| 4 | 896 | 1 | 0 |


In [13]:
def agreement(row):
    # Label a row according to whether both models predict the same outcome
    if row.Survived_RandomForest == row.Survived_GradientBoostedTree:
        return "Agree"
    return "Disagree"

results['Agreement'] = results.apply(agreement, axis=1)
results.head()

| | PassengerId | Survived_RandomForest | Survived_GradientBoostedTree | Agreement |
|---|---|---|---|---|
| 0 | 892 | 0 | 0 | Agree |
| 1 | 893 | 0 | 0 | Agree |
| 2 | 894 | 0 | 0 | Agree |
| 3 | 895 | 0 | 0 | Agree |
| 4 | 896 | 1 | 0 | Disagree |


In [14]:
results['Agreement'].value_counts()

Agree       371
Disagree     47
Name: Agreement, dtype: int64
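
The counts above can be turned into an agreement rate with a line of arithmetic. This is only a small sketch; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
agree, disagree = 371, 47

total = agree + disagree          # number of test passengers compared
agreement_rate = agree / total    # share of passengers with identical predictions

print(total)                      # 418
print(f"{agreement_rate:.1%}")    # 88.8%
```

The two models therefore give the same prediction for roughly 89% of the test passengers and differ on the remaining 47.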