# Assignment: Predict who survived the Titanic disaster

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

![Sinking Titanic](./assets/titanic.jpeg)

In this assignment you're going to build an app that can predict which Titanic passengers survived the disaster. You will use a decision tree classifier to make your predictions.

The first thing you will need for your app is the passenger manifest of the Titanic's last voyage. You will use the famous [Kaggle Titanic Dataset](https://github.com/sbaidachni/MLNETTitanic/tree/master/MLNetTitanic) which has data for a subset of 891 passengers.

The training and testing data files have already been downloaded and are available to your code as **test_data.csv** and **train_data.csv**.

The training data file looks like this:

![Training data](./assets/data.jpg)

It’s a CSV file with 12 columns of information:

* The passenger identifier
* The label column containing ‘1’ if the passenger survived and ‘0’ if the passenger perished
* The class of travel (1–3)
* The name of the passenger
* The gender of the passenger (‘male’ or ‘female’)
* The age of the passenger, or an empty value if the age is unknown
* The number of siblings and/or spouses aboard
* The number of parents and/or children aboard
* The ticket number
* The fare paid
* The cabin number
* The port in which the passenger embarked

The second column is the label: 0 means the passenger perished, and 1 means the passenger survived. All other columns are input features from the passenger manifest.

You're going to build a binary classification model that reads in all columns and then predicts for each passenger whether he or she survived.

## Get started

Let’s get started. You will need to install the correct NuGet packages:

In [1]:
#r nuget:Microsoft.ML
#r nuget:Microsoft.ML.FastTree

This will install the Microsoft ML.NET library and an additional library for fast decision tree learners. 

Now you are ready to add code. Run the following code block:

In [34]:
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;
using XPlot.Plotly;

You’ll also need a class to hold passenger data, and another class to hold your model predictions.

In [9]:
public class Passenger
{
    public bool Label;
    public float Pclass;
    public string Name;
    public string Sex;
    public string RawAge;
    public float SibSp;
    public float Parch;
    public string Ticket;
    public float Fare;
    public string Cabin;
    public string Embarked;
}

public class PassengerPrediction
{
    [ColumnName("PredictedLabel")] public bool Prediction;
    public float Probability;
    public float Score;
}

The **Passenger** class holds one single passenger record. There's also a **PassengerPrediction** class which will hold a single passenger prediction. There's a boolean **Prediction**, a **Probability** value, and the **Score** the model will assign to the prediction.

## Loading the data

Now you're going to load the training data in memory:

In [5]:
// filenames for training and test data
private static string trainingDataPath = Path.Combine(Environment.CurrentDirectory, "train_data.csv");
private static string testDataPath = Path.Combine(Environment.CurrentDirectory, "test_data.csv");

// set up a machine learning context
var mlContext = new MLContext();

// set up a text loader
var textLoader = mlContext.Data.CreateTextLoader(
    new TextLoader.Options() 
    {
        Separators = new[] { ',' },
        HasHeader = true,
        AllowQuoting = true,
        Columns = new[] 
        {
            new TextLoader.Column("Label", DataKind.Boolean, 1),
            new TextLoader.Column("Pclass", DataKind.Single, 2),
            new TextLoader.Column("Name", DataKind.String, 3),
            new TextLoader.Column("Sex", DataKind.String, 4),
            new TextLoader.Column("RawAge", DataKind.String, 5),  // <-- not a float!
            new TextLoader.Column("SibSp", DataKind.Single, 6),
            new TextLoader.Column("Parch", DataKind.Single, 7),
            new TextLoader.Column("Ticket", DataKind.String, 8),
            new TextLoader.Column("Fare", DataKind.Single, 9),
            new TextLoader.Column("Cabin", DataKind.String, 10),
            new TextLoader.Column("Embarked", DataKind.String, 11)
        }
    }
);

// load training and test data
Console.Write("Loading data...");
var trainingDataView = textLoader.Load(trainingDataPath);
var testDataView = textLoader.Load(testDataPath);
Console.WriteLine("done");

Loading data...done


This code uses the **CreateTextLoader** method to create a CSV data loader. The **TextLoader.Options** class describes how to load each field. Then we call the text loader’s **Load** method twice to load the training and test data in memory.

Let's see if that worked.  We're going to deserialize the training data into an enumeration of **Passenger** instances and do a quick visual check of the data:

In [8]:
// get an array of passenger instances
var data = mlContext.Data.CreateEnumerable<Passenger>(trainingDataView, reuseRowObject: false).ToArray();

// display the result
display(data.Take(10));

index,Label,Pclass,Name,Sex,RawAge,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,True,1,"Thorne, Mrs. Gertrude Maybelle",female,,0,0,PC 17585,79.2,,C
1,True,1,"Silverthorne, Mr. Spencer Victor",male,35.0,0,0,PC 17475,26.2875,E24,S
2,False,3,"Asim, Mr. Adola",male,35.0,0,0,SOTON/O.Q. 3101310,7.05,,S
3,False,3,"Ali, Mr. William",male,25.0,0,0,SOTON/O.Q. 3101312,7.05,,S
4,False,3,"Allum, Mr. Owen George",male,18.0,0,0,2223,8.3,,S
5,False,3,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",female,40.0,1,0,7546,9.475,,S
6,False,1,"Smart, Mr. John Montgomery",male,56.0,0,0,113792,26.55,,S
7,True,3,"Dean, Master. Bertram Vere",male,1.0,1,2,C.A. 2315,20.575,,S
8,True,1,"Hoyt, Mrs. Frederick Maxfield (Jane Anne Forby)",female,35.0,1,0,19943,90.0,C93,S
9,False,3,"Jussila, Miss. Mari Aina",female,21.0,1,0,4137,9.825,,S


Notice the **Label** field that indicates if the passenger survived the Titanic disaster or perished. We also have the class of travel, passenger name, gender, and age, the number of accompanying siblings, spouses, parents, and children, the ticket number, fare paid, cabin number, and port of embarkation.

But now look at the **RawAge** column. Did you notice that our first passenger, Mrs. Gertrude Thorne, has an empty age? This means we don't know her exact age at the time of the disaster because it wasn't recorded anywhere.

How can we train a machine learning model on the passenger age if some values are missing?

## Cleaning the data

ML.NET can actually deal with missing data in CSV files, but it needs any missing data to appear as a ‘?’. Unfortunately, in the Titanic dataset the missing ages appear as empty strings. So the first thing you need to do is replace all empty age strings with ‘?’.

We'll need two extra classes to help us with that:

In [15]:
/// <summary>
/// The FromAge class is a helper class for a column transformation. It holds the raw age string.
/// </summary>
public class FromAge
{
    public string RawAge;
}

/// <summary>
/// The ToAge class is a helper class for a column transformation. It holds the processed age string.
/// </summary>
public class ToAge
{
    public string Age;
}

The **FromAge** class will hold the original age as a string and the **ToAge** class will hold the processed age where all empty strings have been replaced with the question mark character.

Now let's set up a machine learning pipeline that implements this specific age value replacement.

We're also going to add some extra code to remove the Name, Cabin, and Ticket columns because we don't need them to make predictions.

Add the following code:

In [17]:
// set up a training pipeline
// step 1: drop the name, cabin, and ticket columns
var pipeline = mlContext.Transforms.DropColumns("Name", "Cabin", "Ticket")

    // step 2: replace missing ages with '?'
    .Append(mlContext.Transforms.CustomMapping<FromAge, ToAge>(
        (inp, outp) => { outp.Age = string.IsNullOrEmpty(inp.RawAge) ? "?" : inp.RawAge; },
        "AgeMapping"
    ));

Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.

The first **DropColumns** component drops the Name, Cabin, and Ticket columns from the dataset. The next **CustomMapping** component converts empty age strings to ‘?’ values.

Let's see if that worked. We're going to fit the pipeline on the training data, run the test data through it, and display the results for the first 10 passengers:

In [20]:
// fit the pipeline on the training data
var model = pipeline.Fit(trainingDataView);

// run every passenger in the test set through the fitted pipeline
var predictions = model.Transform(testDataView);

This code calls **Fit** to fit the pipeline on the training data, and **Transform** to run every passenger in the test set through it. The pipeline doesn't contain a learning component yet, so the output is the transformed data rather than actual predictions. Unfortunately we cannot display this output directly because the Jupyter server doesn't have built-in support to render it correctly.

However we can easily fix that by adding a helper method:

In [21]:
using Microsoft.AspNetCore.Html;
Formatter<DataDebuggerPreview>.Register((preview, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(preview.ColumnView.Select(c => (IHtmlContent) th(c.Column.Name)));
    var rows = new List<List<IHtmlContent>>();
    var count = 0;
    foreach (var row in preview.RowView)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(count));
        foreach (var obj in row.Values)
        {
            cells.Add(td(obj.Value));
        }
        rows.Add(cells);
        count++;
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

With the helper method fully set up, we can now render the output of the machine learning pipeline:

In [23]:
// show a 10-record preview of the output of the machine learning pipeline
var preview = predictions.Preview(maxRows: 10);
display(preview);

index,Label,Pclass,Sex,RawAge,SibSp,Parch,Fare,Embarked,Age
0,True,1,female,38.0,1,0,71.2833,C,38
1,True,3,female,26.0,0,0,7.925,S,26
2,True,3,female,27.0,0,2,11.1333,S,27
3,True,3,female,4.0,1,1,16.7,S,4
4,True,2,male,,0,0,13.0,S,?
5,False,3,female,8.0,3,1,21.075,S,8
6,False,3,male,,0,0,7.8958,S,?
7,False,1,male,40.0,0,0,27.7208,C,40
8,False,2,male,66.0,0,0,10.5,S,66
9,False,3,male,21.0,0,0,8.05,S,21


Compare the **RawAge** and **Age** columns. For every passenger with an empty value in the RawAge column, we now have a corresponding question mark in the Age column. The conversion works perfectly.

ML.NET is now happy with the age values. You will now convert the string ages to numeric values and instruct ML.NET to replace any missing values with the mean age over the training data:

In [25]:
// step 3: convert string ages to floats
var pipeline2 = pipeline
    .Append(mlContext.Transforms.Conversion.ConvertType(
        "Age",
        outputKind: DataKind.Single
    ))

    // step 4: replace missing age values with the mean age
    .Append(mlContext.Transforms.ReplaceMissingValues(
        "Age",
        replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean));


The **ConvertType** component converts the Age column to a single-precision floating point value. And the **ReplaceMissingValues** component replaces any missing values with the mean of all known ages in the data the pipeline is fitted on, which is the training data here.

Let's see if that worked. We'll call **Fit** and **Transform** again to generate a new set of predictions using this modified pipeline:

In [27]:
// get predictions using the new pipeline
var model2 = pipeline2.Fit(trainingDataView);
predictions = model2.Transform(testDataView);
display(predictions.Preview(maxRows: 10));

index,Label,Pclass,Sex,RawAge,SibSp,Parch,Fare,Embarked,Age,Age.1,Age.2
0,True,1,female,38.0,1,0,71.2833,C,38,38.0,38.0
1,True,3,female,26.0,0,0,7.925,S,26,26.0,26.0
2,True,3,female,27.0,0,2,11.1333,S,27,27.0,27.0
3,True,3,female,4.0,1,1,16.7,S,4,4.0,4.0
4,True,2,male,,0,0,13.0,S,?,,30.149336
5,False,3,female,8.0,3,1,21.075,S,8,8.0,8.0
6,False,3,male,,0,0,7.8958,S,?,,30.149336
7,False,1,male,40.0,0,0,27.7208,C,40,40.0,40.0
8,False,2,male,66.0,0,0,10.5,S,66,66.0,66.0
9,False,3,male,21.0,0,0,8.05,S,21,21.0,21.0


ML.NET has added two extra **Age** columns to perform the data conversion. The final column is all the way to the right and contains a valid numerical age for every passenger. All missing ages have been replaced with 30.15 years, which is the mean of the known ages in the training data.
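To make the logic of steps 2 to 4 concrete, here is a minimal plain C# sketch of the same idea, without ML.NET. It is only an illustration and not what the library runs internally: the `rawAges` array is made-up sample data, and representing the ‘?’ marker as `float.NaN` after conversion is an assumption of this sketch.

```csharp
using System;
using System.Globalization;
using System.Linq;

// made-up sample data: empty strings mark unknown ages
var rawAges = new[] { "38.0", "26.0", "", "4.0", "" };

// step 2: replace empty strings with the '?' marker
var marked = rawAges
    .Select(a => string.IsNullOrEmpty(a) ? "?" : a)
    .ToArray();

// step 3: convert to floats (assumption: '?' becomes NaN)
var parsed = marked
    .Select(a => a == "?" ? float.NaN : float.Parse(a, CultureInfo.InvariantCulture))
    .ToArray();

// step 4: compute the mean of the known ages and use it for the missing ones
var mean = parsed.Where(a => !float.IsNaN(a)).Average();
var imputed = parsed
    .Select(a => float.IsNaN(a) ? mean : a)
    .ToArray();

Console.WriteLine(string.Join(" | ", imputed));
```

The two unknown ages end up with the mean of the three known values, just like the missing ages in the pipeline output all share the same replacement value.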

## Training the model

Now let's process the rest of the data columns. The Sex and Embarked columns are enumerations of string values. As you've learned in the Processing Data section, you'll need to one-hot encode them first:

In [30]:
// step 5: replace sex and embarked columns with one-hot encoded vectors
var pipeline3 = pipeline2
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("Sex"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("Embarked"));

The **OneHotEncoding** component takes an input column, one-hot encodes all values, and produces a new column with the same name holding the one-hot vectors. 
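As a reminder of what one-hot encoding does, here is a minimal plain C# sketch. It is a hand-rolled illustration rather than the ML.NET implementation: the `ports` array is made-up sample data, and ordering the categories by first appearance is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// made-up sample of embarkation ports
var ports = new[] { "S", "C", "S", "Q" };

// build the list of categories in order of first appearance: S, C, Q
var vocabulary = new List<string>();
foreach (var p in ports)
{
    if (!vocabulary.Contains(p)) vocabulary.Add(p);
}

// one vector per passenger, with a single 1 at the position of its category
var oneHot = ports
    .Select(p => vocabulary.Select(v => v == p ? 1f : 0f).ToArray())
    .ToArray();

for (var i = 0; i < ports.Length; i++)
{
    Console.WriteLine($"{ports[i]} -> [{string.Join(", ", oneHot[i])}]");
}
```

Each category gets its own slot in the vector, so the model never sees an artificial ordering between the ports.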

Now let's wrap up the pipeline:

In [31]:
// step 6: concatenate everything into a single feature column 
var pipeline4 = pipeline3
    .Append(mlContext.Transforms.Concatenate(
        "Features", 
        "Age",
        "Pclass", 
        "SibSp",
        "Parch",
        "Sex",
        "Embarked"))

    // step 7: use a fasttree trainer
    .Append(mlContext.BinaryClassification.Trainers.FastTree(
        labelColumnName: "Label", 
        featureColumnName: "Features"));

The **Concatenate** component concatenates the listed feature columns (Age, Pclass, SibSp, Parch, Sex, and Embarked) into a single column for training. This is required because ML.NET can only train on a single input column.

And the **FastTreeBinaryTrainer** is the algorithm that's going to train the model. You're going to build a decision tree classifier that uses the Fast Tree algorithm to train on the data and configure the tree.

All you need to do now is train the model on the training data:

In [32]:
// train the model
Console.Write("Training model...");
var trainedModel = pipeline4.Fit(trainingDataView);
Console.WriteLine("done");

Training model...done


## Evaluating the model

Now let's see how accurate these predictions are. We're going to generate predictions for every passenger and compare these predictions to the dataset labels, and then calculate a couple of metrics to evaluate the quality of the model:

In [33]:
// make predictions for the test data set
Console.WriteLine("Evaluating model...");
var predictions = trainedModel.Transform(testDataView);

// compare the predictions with the ground truth
var metrics = mlContext.BinaryClassification.Evaluate(
    data: predictions, 
    labelColumnName: "Label", 
    scoreColumnName: "Score");

// report the results
Console.WriteLine($"  Accuracy:          {metrics.Accuracy:P2}");
Console.WriteLine($"  Auc:               {metrics.AreaUnderRocCurve:P2}");
Console.WriteLine($"  Auprc:             {metrics.AreaUnderPrecisionRecallCurve:P2}");
Console.WriteLine($"  F1Score:           {metrics.F1Score:P2}");
Console.WriteLine($"  LogLoss:           {metrics.LogLoss:0.##}");
Console.WriteLine($"  LogLossReduction:  {metrics.LogLossReduction:0.##}");
Console.WriteLine($"  PositivePrecision: {metrics.PositivePrecision:0.##}");
Console.WriteLine($"  PositiveRecall:    {metrics.PositiveRecall:0.##}");
Console.WriteLine($"  NegativePrecision: {metrics.NegativePrecision:0.##}");
Console.WriteLine($"  NegativeRecall:    {metrics.NegativeRecall:0.##}");

Evaluating model...
  Accuracy:          83.80%
  Auc:               88.29%
  Auprc:             86.75%
  F1Score:           77.52%
  LogLoss:           0.63
  LogLossReduction:  0.32
  PositivePrecision: 0.76
  PositiveRecall:    0.79
  NegativePrecision: 0.88
  NegativeRecall:    0.86


This code calls **Transform** to set up a prediction for each passenger, and **Evaluate** to compare these predictions to the label and automatically calculate all evaluation metrics:

* **Accuracy**: this is the number of correct predictions divided by the total number of predictions.
* **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
* **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive.
* **F1Score**: this is the harmonic mean of Precision and Recall, so it strikes a balance between the two. It’s useful for imbalanced datasets with many more negative results than positive.
* **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes.
* **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses how much the model improves on a model that makes random predictions. A value closer to 1 indicates a better model.
* **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high.
* **PositiveRecall**: also called ‘Recall’, this is the fraction of actual positive cases that the model correctly predicts as positive. This is a good metric to use when the cost of a false negative is high.
* **NegativePrecision**: this is the fraction of negative predictions that are correct.
* **NegativeRecall**: this is the fraction of actual negative cases that the model correctly predicts as negative.
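The count-based metrics in this list can all be derived from the four cells of a confusion matrix. The following plain C# sketch shows the arithmetic. The counts are hypothetical and are not taken from the Titanic model; they only serve to illustrate the formulas.

```csharp
using System;

// hypothetical confusion matrix counts, NOT the results of the Titanic model
double tp = 50;   // true positives:  predicted survived, actually survived
double fp = 10;   // false positives: predicted survived, actually perished
double fn = 20;   // false negatives: predicted perished, actually survived
double tn = 120;  // true negatives:  predicted perished, actually perished

var accuracy          = (tp + tn) / (tp + fp + fn + tn);
var positivePrecision = tp / (tp + fp);
var positiveRecall    = tp / (tp + fn);
var negativePrecision = tn / (tn + fn);
var negativeRecall    = tn / (tn + fp);

// F1 is the harmonic mean of positive precision and positive recall
var f1Score = 2 * positivePrecision * positiveRecall / (positivePrecision + positiveRecall);

Console.WriteLine($"Accuracy:          {accuracy:0.###}");
Console.WriteLine($"PositivePrecision: {positivePrecision:0.###}");
Console.WriteLine($"PositiveRecall:    {positiveRecall:0.###}");
Console.WriteLine($"NegativePrecision: {negativePrecision:0.###}");
Console.WriteLine($"NegativeRecall:    {negativeRecall:0.###}");
Console.WriteLine($"F1Score:           {f1Score:0.###}");
```

The AUC and log loss metrics cannot be computed this way because they depend on the scores and probabilities of the individual predictions, not just on the four counts.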

The dataset is reasonably balanced. Out of 891 passengers, 342 survived. That's 38% of the dataset, so we are looking at a 62/38 balance. It's not perfect, but not too bad either. Given the equal cost of false positives and negatives, and the level of balance in the dataset, we can safely use the Accuracy metric to evaluate the model.

We're getting an accuracy of 83.8%. It means that for every 100 Titanic passengers the model is able to predict about 84 of them correctly. That’s not bad at all.
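To put that accuracy in context, it helps to compare it with a trivial baseline: a model that always predicts ‘perished’. The sketch below computes the accuracy of such a baseline from the passenger counts quoted above. It assumes that the test set has roughly the same class balance as the full dataset of 891 passengers, which the text does not confirm.

```csharp
using System;

// passenger counts quoted in the text
const int total = 891;
const int survivors = 342;

// a model that always predicts 'perished' is right for every non-survivor
var baselineAccuracy = (double)(total - survivors) / total;

Console.WriteLine($"Majority-class baseline accuracy: {baselineAccuracy:P1}");
```

Under that assumption the baseline lands at a bit under 62%, so the trained model is a clear improvement over always guessing the majority class.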

We're also getting an AUC value of 0.8829. This is great: it means the model has good (almost excellent) predictive ability.

Let's wrap the evaluation up by plotting the confusion matrix:

In [42]:
// plot the confusion matrix
var n = metrics.ConfusionMatrix.NumberOfClasses;
var chart = Chart.Plot(
    new Graph.Scattergl()
    {
        x = (from i in Enumerable.Range(0,n) from j in Enumerable.Range(0,n) select j),
        y = (from i in Enumerable.Range(0,n) from j in Enumerable.Range(0,n) select i),
        mode = "markers",
        marker = new Graph.Marker()
        {
            symbol = "square",
            size = 164,
            color = from i in Enumerable.Range(0,n) from j in Enumerable.Range(0,n) select n-metrics.ConfusionMatrix.Counts[j][i],
            colorscale = "Greys"
        }
    }
);
chart.WithXTitle("Predicted value");
chart.WithYTitle("Actual value");
chart.WithTitle("Confusion matrix");
chart.Width = 600;
chart.Height = 600;
display(chart);

The confusion matrix clearly shows that the model is very good at predicting survivors but struggles a bit when predicting the victims: we have a solid black true positives cell but a dark grey true negatives cell. There are almost no false negatives but we do have some false positives where the model predicts that a passenger survived but he or she actually perished. 

## Making a prediction

To wrap up, let's have some fun and pretend that I’m going to take a trip on the Titanic too. I will embark in Southampton and pay $70 for a first-class cabin. I travel on my own without parents, children, or my spouse. 

What are my odds of surviving?

Add the following code:

In [41]:
// set up a prediction engine
Console.WriteLine("Making a prediction...");
var predictionEngine = mlContext.Model.CreatePredictionEngine<Passenger, PassengerPrediction>(trainedModel);

// create a sample record
var passenger = new Passenger()
{ 
    Pclass = 1,
    Name = "Mark Farragher",
    Sex = "male",
    RawAge = "48",
    SibSp = 0,
    Parch = 0,
    Fare = 70,
    Embarked = "S"
};

// make the prediction
var prediction = predictionEngine.Predict(passenger);

// report the results
Console.WriteLine($"Passenger:   {passenger.Name} ");
Console.WriteLine($"Prediction:  {(prediction.Prediction ? "survived" : "perished" )} ");
Console.WriteLine($"Probability: {prediction.Probability} ");

Making a prediction...
Passenger:   Mark Farragher 
Prediction:  survived 
Probability: 0.8422213 


This code uses the **CreatePredictionEngine** method to set up a prediction engine. The two type arguments are the input data class and the class to hold the prediction. And once the prediction engine is set up, you can simply call **Predict** to make a single prediction.

So would I have survived the Titanic disaster?

I’m happy to learn that I survived the Titanic disaster. My model predicts that I had an 84.22% chance of making it off the ship alive. It’s probably because I booked a first-class cabin and traveled alone.

## Further improvements

How do you think this model can be improved even more?