# Titanic - Machine Learning from Disaster

In this lab you will tackle the legendary Titanic competition from Kaggle:
* [The legendary Titanic ML competition](https://www.kaggle.com/c/titanic)

The purpose of the competition is simple: create a model that can predict which passengers survived the Titanic shipwreck. 

## Variables

For this we have at our disposal a dataset with the following variables:


| Variable | Definition | Key |
|----------|------------|-----|
|survival|Survival| 0 = No, 1 = Yes|
|pclass|Ticket class| 1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex||
|age|Age in years||
|sibsp|# of siblings / spouses aboard the Titanic||
|parch|# of parents / children aboard the Titanic||
|ticket|Ticket number ||
|fare|Passenger fare ||
|cabin|Cabin number||
|embarked|Port of Embarkation |C = Cherbourg, Q = Queenstown, S = Southampton|

- pclass: A proxy for socio-economic status (SES)
 - 1st = Upper
 - 2nd = Middle
 - 3rd = Lower
- age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
- sibsp: The dataset defines family relations in this way:
 - Sibling = brother, sister, stepbrother, stepsister
 - Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: The dataset defines family relations in this way:
 - Parent = mother, father
 - Child = daughter, son, stepdaughter, stepson
 - Some children travelled only with a nanny, therefore parch=0 for them.
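
The age conventions in the data dictionary can be checked mechanically. The following is a minimal sketch in plain Python; `describe_age` is a hypothetical helper (not part of the dataset or of Spark), and it assumes that any age of at least 1 ending in .5 is an estimate.

```python
def describe_age(age):
    """Classify an age value using the conventions of the data dictionary."""
    if age is None:
        return "missing"
    if age < 1:
        return "infant (fractional age)"
    # Estimated ages are recorded in the form xx.5
    if age % 1 == 0.5:
        return "estimated"
    return "reported"


for value in (0.42, 22.0, 28.5, None):
    print(value, "->", describe_age(value))
```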

## Load Data
We start by downloading the data. It is available under the Data tab of the competition page:

* [The legendary Titanic ML competition](https://www.kaggle.com/c/titanic)

We will need the following files that are in CSV format:
- train.csv
- test.csv

Then we have to upload the files to HDFS, to the `datasets/titanic` directory.

As you can see, the data has already been split into training and test sets.

We can now load the data and explore it:

In [1]:
passengers = spark.read.csv('datasets/titanic/train.csv', header=True,
                       inferSchema=True)
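
To make the two options concrete: `header=True` treats the first line as column names, and `inferSchema=True` makes Spark take an extra pass over the data to guess column types instead of reading every column as a string. The plain-Python sketch below mimics that idea on three rows taken from the training file. `infer_type` and `sample` are hypothetical names, and this illustrates the concept only, not Spark's implementation.

```python
import csv
import io

# Three rows from train.csv, restricted to a few columns; passenger 6 has no Age
sample = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
6,0,3,male,
"""


def infer_type(values):
    """Guess the narrowest type that fits every non-empty value."""
    present = [v for v in values if v != ""]
    for name, cast in (("integer", int), ("double", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


rows = list(csv.DictReader(io.StringIO(sample)))  # first line becomes the header
schema = {column: infer_type([r[column] for r in rows]) for column in rows[0]}
print(schema)
```

The inferred types for these columns agree with what `printSchema()` reports for the full file.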

Let's see what the data looks like:

In [2]:
passengers.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [3]:
passengers.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [4]:
passengers.count()

891

## Feature engineering

To keep feature engineering simple we will select the following features:
- Age: children were probably given priority
- Pclass: the ticket class is a proxy for socio-economic status, so we would expect it to have an impact
- Sex: gender could have an impact, because women and children are usually given priority

To use the Sex variable we will need a StringIndexer Estimator to convert gender to a numeric index so we can use it as a feature:

In [5]:
from pyspark.ml.feature import StringIndexer

In [6]:
sex_indexer = StringIndexer(inputCol='Sex', outputCol='Sex_feature')
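
By default, StringIndexer orders labels by descending frequency, so the most frequent label gets index 0.0. The sketch below reproduces that ordering in plain Python on a made-up list of values; `fit_string_index` is a hypothetical helper, and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (assumption of this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up sample in which 'male' is the more frequent label
sexes = ["male", "female", "female", "female", "male", "male", "male"]
mapping = fit_string_index(sexes)
print(mapping)  # {'male': 0.0, 'female': 1.0}
```

This is consistent with the `Sex_feature` value of 1.0 that the fitted model assigns to female passengers in the predictions shown later in this lab.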

Now we have to assemble the selected features into a single vector column that the model can use:

In [7]:
from pyspark.ml.feature import VectorAssembler

In [8]:
assembler = VectorAssembler(inputCols=['Sex_feature', 'Pclass', 'Age'], outputCol='features')
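
Conceptually, the assembler concatenates the listed columns, in the given order, into one numeric vector per row. A minimal plain-Python sketch of that idea follows; `assemble` is a hypothetical helper, not the Spark API.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of one row into a single feature list."""
    return [float(row[c]) for c in input_cols]


# A first-class female passenger aged 2, with Sex already indexed to 1.0
row = {"Sex_feature": 1.0, "Pclass": 1, "Age": 2.0}
features = assemble(row, ["Sex_feature", "Pclass", "Age"])
print(features)  # [1.0, 1.0, 2.0]
```

Note that VectorAssembler raises an error by default when an input column contains a null (`handleInvalid='error'`), which is why rows with missing values must be handled before fitting.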

From our previous exploration of the first 20 rows we see that there are some null values in the Age column, so we have to filter them out:

In [9]:
from pyspark.sql.functions import col

In [10]:
passengers.where(col('Age').isNull()).count()

177

In [11]:
passengers.count()

891

In [12]:
passengers_not_null = passengers.where(col('Age').isNotNull())

In [13]:
passengers_not_null.count()

714
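
The three counts are consistent: 891 - 177 = 714. The same check and filter can be sketched in plain Python on three rows from the training data, with `None` standing in for null. `null_counts` is a hypothetical helper used only for illustration.

```python
# Passengers 1, 2 and 6 from the training data, restricted to the selected columns
rows = [
    {"Pclass": 3, "Sex": "male", "Age": 22.0},
    {"Pclass": 1, "Sex": "female", "Age": 38.0},
    {"Pclass": 3, "Sex": "male", "Age": None},
]


def null_counts(rows):
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


print(null_counts(rows))

# Keep only the rows where Age is present, like where(col('Age').isNotNull())
complete = [r for r in rows if r["Age"] is not None]
print(len(rows), "->", len(complete))
```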

## Training

To do the classification we will use a simple LogisticRegression as the Estimator:

In [14]:
from pyspark.ml.classification import LogisticRegression

In [15]:
lr = LogisticRegression(maxIter=10, regParam=0.01)

We can now create the pipeline:

In [16]:
from pyspark.ml import Pipeline

In [17]:
pipeline = Pipeline(stages=[sex_indexer, assembler, lr])
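
A pipeline fits its stages in order, passing each stage the output of the previous ones, and then replays the fitted stages on new data. The sketch below shows that mechanism with two toy stages in plain Python. All class and function names are hypothetical, the final logistic regression stage is omitted, and `fit` returns the stage itself rather than a separate model object, which is a simplification compared with Spark.

```python
from collections import Counter


class ToySexIndexer:
    """Toy estimator: learns a label-to-index mapping during fit."""

    def fit(self, rows):
        counts = Counter(r["Sex"] for r in rows)
        self.mapping = {label: float(i)
                        for i, (label, _) in enumerate(counts.most_common())}
        return self

    def transform(self, rows):
        return [{**r, "Sex_feature": self.mapping[r["Sex"]]} for r in rows]


class ToyAssembler:
    """Toy transformer: has nothing to learn, so fit is a no-op."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [{**r, "features": [r["Sex_feature"], float(r["Pclass"]), r["Age"]]}
                for r in rows]


def fit_pipeline(stages, rows):
    """Fit each stage on the data as transformed by the stages before it."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows)
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply the already fitted stages, in order, to new rows."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


train_rows = [
    {"Sex": "male", "Pclass": 3, "Age": 22.0},
    {"Sex": "female", "Pclass": 1, "Age": 38.0},
    {"Sex": "male", "Pclass": 3, "Age": 35.0},
]
fitted = fit_pipeline([ToySexIndexer(), ToyAssembler()], train_rows)
new_rows = transform_pipeline(fitted, [{"Sex": "female", "Pclass": 1, "Age": 2.0}])
print(new_rows[0]["features"])  # [1.0, 1.0, 2.0]
```

The key point is that the mapping learned from the training rows is reused unchanged when transforming new rows.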

We can use the 'Survived' column as the 'label' column, so we rename it. Alternatively, we could pass the `labelCol='Survived'` argument to the LogisticRegression Estimator. For efficiency, it is also good to select only the columns that we need:

In [18]:
data = passengers_not_null.select('Sex', 'Age', 'Pclass', 'Survived').withColumnRenamed('Survived', 'label')

In [19]:
data.show(5)

+------+----+------+-----+
|   Sex| Age|Pclass|label|
+------+----+------+-----+
|  male|22.0|     3|    0|
|female|38.0|     1|    1|
|female|26.0|     3|    1|
|female|35.0|     1|    1|
|  male|35.0|     3|    0|
+------+----+------+-----+
only showing top 5 rows



In [20]:
training, test = data.randomSplit([0.8, 0.2])
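
`randomSplit` assigns rows to the splits at random, so the 80/20 proportions are only approximate and the split changes between runs unless a seed is passed as the second argument, for example `data.randomSplit([0.8, 0.2], seed=42)` (the value 42 is arbitrary). The plain-Python sketch below illustrates both effects; `random_split` is a hypothetical helper and not how Spark implements the split internally.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row independently to the training or the test split."""
    rng = random.Random(seed)
    training, test = [], []
    for row in rows:
        (training if rng.random() < train_weight else test).append(row)
    return training, test


rows = list(range(714))  # stand-ins for the 714 passengers with a known age
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

print(len(train_a), len(test_a))  # roughly, but not exactly, 80% / 20%
print(train_a == train_b)         # same seed, same split
```

Because the split in this lab is unseeded, rerunning the notebook will give slightly different test rows and metric values from the ones shown here.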

In [21]:
%%time
model = pipeline.fit(training)

CPU times: user 23 ms, sys: 5.3 ms, total: 28.3 ms
Wall time: 2.85 s


## Evaluation

In [22]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()

In [23]:
test.show(5)

+------+---+------+-----+
|   Sex|Age|Pclass|label|
+------+---+------+-----+
|female|2.0|     1|    0|
|female|2.0|     3|    1|
|female|3.0|     2|    1|
|female|7.0|     2|    1|
|female|8.0|     2|    1|
+------+---+------+-----+
only showing top 5 rows



In [24]:
predictions = model.transform(test)

In [25]:
predictions.show(5)

+------+---+------+-----+-----------+-------------+--------------------+--------------------+----------+
|   Sex|Age|Pclass|label|Sex_feature|     features|       rawPrediction|         probability|prediction|
+------+---+------+-----+-----------+-------------+--------------------+--------------------+----------+
|female|2.0|     1|    0|        1.0|[1.0,1.0,2.0]|[-3.4134077135509...|[0.03187905802545...|       1.0|
|female|2.0|     3|    1|        1.0|[1.0,3.0,2.0]|[-0.9876899990232...|[0.27136858747461...|       1.0|
|female|3.0|     2|    1|        1.0|[1.0,2.0,3.0]|[-2.1706054986822...|[0.10242135580875...|       1.0|
|female|7.0|     2|    1|        1.0|[1.0,2.0,7.0]|[-2.0508320682626...|[0.11396833240767...|       1.0|
|female|8.0|     2|    1|        1.0|[1.0,2.0,8.0]|[-2.0208887106577...|[0.11702712756269...|       1.0|
+------+---+------+-----+-----------+-------------+--------------------+--------------------+----------+
only showing top 5 rows
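
The last three columns are related. For binary logistic regression, `probability` is the logistic (sigmoid) function applied to `rawPrediction`, and `prediction` is 1.0 when the probability of class 1 exceeds the default threshold of 0.5. The sketch below recomputes the first two rows from the raw values printed above, using only the digits that are visible in the truncated output.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# First component of rawPrediction for the first two test rows (digits as printed)
raw_class0 = [-3.4134077135509, -0.9876899990232]

for z in raw_class0:
    p0 = sigmoid(z)   # probability of class 0 (did not survive)
    p1 = 1.0 - p0     # probability of class 1 (survived)
    prediction = 1.0 if p1 > 0.5 else 0.0
    print(round(p0, 6), round(p1, 6), prediction)
```

The recomputed class 0 probabilities match the first entries of the `probability` column (about 0.0319 and 0.2714), and both rows are predicted as survivors.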



In [26]:
aur = evaluator.evaluate(predictions)

Area under ROC:

In [27]:
aur

0.8142923219241442
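
BinaryClassificationEvaluator uses `areaUnderROC` as its default metric. One way to read this number: it is the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor, so 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The sketch below computes the metric from that definition on made-up labels and scores. `area_under_roc` is a hypothetical helper, and Spark computes the value differently (from the ROC curve), so this is an equivalent definition rather than its implementation.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up example: 3 positives, 2 negatives, one pair ranked the wrong way round
labels = [0, 1, 1, 0, 1]
scores = [0.2, 0.9, 0.4, 0.5, 0.7]
auc = area_under_roc(labels, scores)
print(auc)  # 5 of 6 pairs are ordered correctly
```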

## Hyperparameter Tuning

In [28]:
%%time
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
param_grid = ParamGridBuilder() \
            .addGrid(lr.regParam, [0.0, 0.01, 0.1, 1.0]) \
            .addGrid(lr.maxIter, [10, 20, 50]) \
            .build()
            
cv = (CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(param_grid)
      .setNumFolds(3))

cv_model = cv.fit(training)

CPU times: user 1.65 s, sys: 458 ms, total: 2.11 s
Wall time: 34.1 s
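
The longer wall time follows from the number of models being trained: every combination in the grid is fitted once per fold, and the best combination is then refitted on the whole of `training`. A short sketch of the arithmetic, in plain Python:

```python
from itertools import product

reg_params = [0.0, 0.01, 0.1, 1.0]
max_iters = [10, 20, 50]
num_folds = 3

# Every (regParam, maxIter) combination is one candidate model
grid = [{"regParam": r, "maxIter": m} for r, m in product(reg_params, max_iters)]

cv_fits = len(grid) * num_folds  # each candidate is trained once per fold
total_fits = cv_fits + 1         # plus one final refit of the best candidate

print(len(grid), cv_fits, total_fits)  # 12 36 37
```

With 37 pipeline fits instead of one, a jump from a few seconds to roughly half a minute is to be expected.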


In [29]:
new_predictions = cv_model.transform(test)
new_aur = evaluator.evaluate(new_predictions)

Area under ROC:

In [30]:
new_aur

0.815217391304348

## Final note

This was just a quick and simple solution, and there is plenty of room to improve it!

What are your results?