# Titanic App in an interactive notebook

Let's take another look at our Titanic app, this time using an interactive notebook. We start off by importing and declaring some of the dependencies we will require. Fetching the NuGet packages may take a few moments the first time this cell is run.

In [None]:
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" 
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-tools/nuget/v3/index.json" 

#r "nuget:Microsoft.ML, 1.6.0"
#r "nuget:Microsoft.ML.AutoML, 0.18.0"
#r "nuget:Microsoft.Data.Analysis, 0.18.0"
#r "nuget: XPlot.Plotly.Interactive, 4.0.3"

We will include the ML.NET imports later when we come to training our model. 

In [None]:
using System.IO;
using System.Net.Http;
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.DotNet.Interactive.Formatting;
using Microsoft.Data.Analysis;
using XPlot.Plotly;
using Microsoft.AspNetCore.Html;

Now define a formatter for the training data we are going to load.

In [None]:
Formatter.Register<DataFrame>((dataFrame, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(dataFrame.Columns.Select(col => (IHtmlContent) th(col.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = Math.Min(20, dataFrame.Rows.Count);
    for (var i = 0; i < take; i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var item in dataFrame.Rows[i])
        {
            cells.Add(td(item));
        }
        rows.Add(cells);
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

Fetch and save the data file if we do not already have it locally (you will need a Kaggle account to download this data).

In [None]:
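string dataFilePath = "titanic.zip";

if (!File.Exists(dataFilePath))
{
    // The download is a zip archive, so fetch it as raw bytes;
    // reading it as a string and writing it back as text would corrupt it.
    var contents = await new HttpClient()
        .GetByteArrayAsync("https://www.kaggle.com/hesh97/titanicdataset-traincsv/download");

    File.WriteAllBytes(dataFilePath, contents);
}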
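
The cell above saves a zip archive, while the next code cell reads `train2.csv`. A minimal sketch of an extraction step that could sit between the two is shown here. It assumes the archive contains a file named `train2.csv` at its root, which this notebook does not confirm, and `EnsureExtracted` is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper: extracts the archive into targetDirectory unless the
// expected file is already present. Returns the path where the CSV is expected.
static string EnsureExtracted(string zipPath, string csvFileName, string targetDirectory)
{
    string expectedPath = Path.Combine(targetDirectory, csvFileName);
    if (!File.Exists(expectedPath) && File.Exists(zipPath))
    {
        ZipFile.ExtractToDirectory(zipPath, targetDirectory, overwriteFiles: true);
    }
    return expectedPath;
}

// "train2.csv" is the name the next cell loads; that the archive holds it is an assumption.
string csvPath = EnsureExtracted("titanic.zip", "train2.csv", Directory.GetCurrentDirectory());
Console.WriteLine(File.Exists(csvPath)
    ? $"Found {csvPath}"
    : $"{csvPath} not found; check the archive contents.");
```

If the file inside the archive has a different name, pass that name instead and adjust the path given to `DataFrame.LoadCsv` accordingly.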

Let's look at the data using some of the DataFrame's built-in functionality. From this we can see the column labels and some statistics on the data. We can see the first lines of the file using ```display(titanicData.Display());```. We can also use ```display(titanicData.Info());``` to get the column data types. If you are running this as an interactive notebook, you can try these for yourself.

In [None]:
var titanicData = DataFrame.LoadCsv("train2.csv");
display(titanicData.Info());

We have imported charting capabilities, so let's take a look at one of the columns visually.

In [None]:
Chart.Plot(
    new Histogram()
    {
        x = titanicData.Columns["Age"],
        nbinsx = 20
    }
)

So that gave us the age demographics across the passengers. Next, we plot age against sex and color each point by whether the passenger survived.

In [None]:
var chart = Chart.Plot(
    new Scattergl()
    {
        x = titanicData.Columns["Age"],
        y = titanicData.Columns["Sex"],
        mode = "markers",
        marker = new Marker()
        {
            color = titanicData.Columns["Survived"],
            colorscale = "Jet"
        }
    }
);

chart.Width = 800;
chart.Height = 600;
display(chart)

Now that we have an idea of what the basic data looks like, let's prepare our training and test data. Remember, we want two separate datasets so that we can validate our model more thoroughly. We are going to randomize the data, split off 15% to use as test data, and use the remainder as our training data. You can experiment with using smaller or larger training datasets.

In [None]:
static T[] Shuffle<T>(T[] array)
{
    Random rand = new Random();
    for (int i = 0; i < array.Length; i++)
    {
        int r = i + rand.Next(array.Length - i);
        T temp = array[r];
        array[r] = array[i];
        array[i] = temp;
    }
    return array;
}

int[] randomIndices = Shuffle(Enumerable.Range(0, (int)titanicData.Rows.Count).ToArray());
int testSize = (int)(titanicData.Rows.Count * .15);
int[] trainRows = randomIndices[testSize..];
int[] testRows = randomIndices[..testSize];

//Split the data into training and test data frames
DataFrame trainingData = titanicData[trainRows];
DataFrame testData = titanicData[testRows];


// Output our row counts
display($"Training row count {trainingData.Rows.Count}");
display($"Testing row count {testData.Rows.Count}");
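
Because `Shuffle` creates an unseeded `Random`, each run produces a different split, which makes results harder to compare between sessions. A minimal sketch of a reproducible variant follows. The helper name `ShuffledIndices` and the seed value 42 are arbitrary choices for illustration, and the row count of 891 is simply the sum of the 758 and 133 rows this notebook reports.

```csharp
using System;
using System.Linq;

// Fisher-Yates shuffle of the indices 0..count-1, driven by a seeded Random
// so that the same seed gives the same order on the same runtime.
static int[] ShuffledIndices(int count, int seed)
{
    var rand = new Random(seed);
    int[] indices = Enumerable.Range(0, count).ToArray();
    for (int i = 0; i < indices.Length - 1; i++)
    {
        int r = i + rand.Next(indices.Length - i);
        (indices[i], indices[r]) = (indices[r], indices[i]);
    }
    return indices;
}

int rowCount = 891;                        // 758 training rows + 133 test rows
int[] shuffled = ShuffledIndices(rowCount, seed: 42);
int testCount = (int)(rowCount * .15);     // 133.65 truncates to 133
int[] testIdx = shuffled[..testCount];
int[] trainIdx = shuffled[testCount..];

Console.WriteLine($"train={trainIdx.Length}, test={testIdx.Length}");  // prints "train=758, test=133"
```

The resulting index arrays can be used with the DataFrame indexer in the same way as `trainRows` and `testRows` above.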

So we have 758 rows of training data and 133 rows of test data. We can now train our model to predict ```Survived```. We set an experiment time of 15 seconds for this iteration but, as with dataset size, you can experiment with longer or shorter periods to explore the effect on model accuracy. This is a binary classification problem, so we use a binary classifier to train the model.

In [None]:
#!time

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(maxExperimentTimeInSeconds: 15);
// Train the model to predict the value in the 'Survived' column based on the other values
var result = experiment.Execute(trainingData, labelColumnName:"Survived");

What happened there? The reported wall time was much longer than the 15 seconds we specified. That's because ML.NET automatically evaluates multiple algorithms, so the process takes a little longer. Now that we have a model, let's take a look at some of the algorithms that were evaluated and their accuracy scores. From these scores we can see that SdcaLogisticRegressionBinary was the most accurate algorithm for this training session.

In [None]:
var scatters = result.RunDetails.Where(d => d.ValidationMetrics != null).GroupBy(
    r => r.TrainerName,
    (name, details) => new Scattergl()
    {
        name = name,
        x = details.Select(r => r.RuntimeInSeconds),
        y = details.Select(r => r.ValidationMetrics.Accuracy),
        mode = "markers",
        marker = new Marker() { size = 12 }
    });

var chart = Chart.Plot(scatters);
chart.WithXTitle("Training Time");
// The y values are validation accuracy (higher is better), so label the axis accordingly
chart.WithYTitle("Accuracy");
display(chart);

Console.WriteLine($"Best Trainer:{result.BestRun.TrainerName}");
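
Accuracy and error rate are two views of the same number: the error rate is one minus the accuracy. A small sketch of ranking runs either way is shown here. The accuracy values are made up for illustration and are not results from this notebook, and the trainer names other than SdcaLogisticRegressionBinary are only examples.

```csharp
using System;
using System.Linq;

// Made-up (trainer, accuracy) pairs for illustration only.
var runs = new[]
{
    (Trainer: "SdcaLogisticRegressionBinary", Accuracy: 0.82),
    (Trainer: "FastTreeBinary", Accuracy: 0.79),
    (Trainer: "LightGbmBinary", Accuracy: 0.80),
};

// Error rate is the complement of accuracy, so sorting by ascending error
// puts the most accurate run first.
var ranked = runs
    .Select(run => (run.Trainer, run.Accuracy, Error: 1.0 - run.Accuracy))
    .OrderBy(run => run.Error)
    .ToArray();

foreach (var run in ranked)
{
    Console.WriteLine($"{run.Trainer}: accuracy {run.Accuracy:F2}, error {run.Error:F2}");
}
```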

Now let's see how we perform against our test data set (data which the model has not previously seen) so we can make a more realistic assessment of our model's accuracy.

In [None]:
// Score the held-out test rows, not the rows the model was trained on
var testResults = result.BestRun.Model.Transform(testData);

var actualValues = testResults.GetColumn<bool>("Survived");
var predictedValues = testResults.GetColumn<float>("Score");

var predictedVsActual = new Scattergl()
{
    x = actualValues,
    y = predictedValues,
    mode = "markers",
};

var maximumValue = 1;

// Use the same trace type as predictedVsActual so both fit in one typed array
var perfectLine = new Scattergl()
{
    x = new[] {0, maximumValue},
    y = new[] {0, maximumValue},
    mode = "lines",
};

var chart = Chart.Plot(new[] {predictedVsActual, perfectLine });
chart.WithXTitle("Actual Values");
chart.WithYTitle("Predicted Values");
chart.WithLegend(true);
chart.Width = 800;
chart.Height = 600;
display(chart);
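
A scatter of scores against labels is hard to summarize at a glance. A minimal sketch of how an accuracy figure could be computed from predicted and actual labels follows. The arrays are made-up values rather than this notebook's output, and `Accuracy` is a hypothetical helper, not an ML.NET API.

```csharp
using System;
using System.Linq;

// Fraction of positions where the predicted label matches the actual label.
static double Accuracy(bool[] actual, bool[] predicted)
{
    if (actual.Length != predicted.Length || actual.Length == 0)
    {
        throw new ArgumentException("Arrays must be non-empty and of equal length.");
    }
    int correct = actual.Zip(predicted, (a, p) => a == p).Count(match => match);
    return (double)correct / actual.Length;
}

// Made-up labels for illustration: four of the five predictions match.
bool[] actualLabels    = { true, false, false, true, false };
bool[] predictedLabels = { true, false, true,  true, false };

Console.WriteLine($"Accuracy: {Accuracy(actualLabels, predictedLabels)}");
```

Comparing this figure on the test data with the validation accuracy reported during training gives a rough sense of how well the model generalizes.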