In [None]:
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" 
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-tools/nuget/v3/index.json" 

#r "nuget:Microsoft.ML, 1.5.1"
#r "nuget:Microsoft.ML.AutoML, 0.17.1"
#r "nuget:Microsoft.Data.Analysis, 0.4.0"
#r "nuget: XPlot.Plotly.Interactive, 4.0.2"

Let's put most of our imports here up front. For clarity, we leave the ML.NET imports until later, when we come to training our model.

In [None]:
using System.IO;
using System.Net.Http;

using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.DotNet.Interactive.Formatting;
using Microsoft.Data.Analysis;
using XPlot.Plotly;
using Microsoft.AspNetCore.Html;

Now define a formatter for the training data we are going to load.

In [None]:
Formatter.Register<DataFrame>((df, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    for (var i = 0; i < Math.Min(take, df.Rows.Count); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.Rows[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

Fetch and save the data file if we do not already have it locally.

In [None]:
string housingPath = "housing.csv";

if (!File.Exists(housingPath))
{
    var contents = await new HttpClient()
        .GetStringAsync("https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv");
        
    File.WriteAllText(housingPath, contents);
}

Let's take a look at the data using some of the DataFrame's built-in functionality. From this we can see the column labels and some statistics on the data. We can see the first lines of the file using ```display(housingData.Display());```. We can also use ```display(housingData.Info());``` to get column data types. As the cell is interactive, you can try these for yourself.

In [None]:
var housingData = DataFrame.LoadCsv(housingPath);
display(housingData.Info());

We have imported charting capabilities, so let's take a look at one of the columns visually.

In [None]:
Chart.Plot(
    new Histogram()
    {
        x = housingData.Columns["median_house_value"],
        nbinsx = 20
    }
)

That gave us a histogram of the ```median_house_value``` column. What if we want to see how these values are distributed geographically?

In [None]:
var chart = Chart.Plot(
    new Scattergl()
    {
        x = housingData.Columns["longitude"],
        y = housingData.Columns["latitude"],
        mode = "markers",
        marker = new Marker()
        {
            color = housingData.Columns["median_house_value"],
            colorscale = "Jet"
        }
    }
);

chart.Width = 600;
chart.Height = 600;
display(chart)

Now that we have an idea of what the basic data looks like, let's prepare our training and test data. Remember, we want two separate datasets so that we can validate our model more thoroughly. We are going to randomize the data, split off 20% to use as test data, and use the remainder as our training data. You can experiment with using smaller or larger training datasets.

In [None]:
static T[] Shuffle<T>(T[] array)
{
    Random rand = new Random();
    for (int i = 0; i < array.Length; i++)
    {
        int r = i + rand.Next(array.Length - i);
        T temp = array[r];
        array[r] = array[i];
        array[i] = temp;
    }
    return array;
}

int[] randomIndices = Shuffle(Enumerable.Range(0, (int)housingData.Rows.Count).ToArray());
int testSize = (int)(housingData.Rows.Count * .2);
int[] trainRows = randomIndices[testSize..];
int[] testRows = randomIndices[..testSize];

DataFrame trainingData = housingData[trainRows];
DataFrame testData = housingData[testRows];

display($"Training row count {trainingData.Rows.Count}");
display($"Testing row count {testData.Rows.Count}");
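
The shuffle above creates a new ```Random``` on every run, so the training/test split changes each time the cell is executed. If you want repeatable results while experimenting, one option is to pass in a seeded generator. The following is a minimal sketch on a toy index range rather than the housing data; the ```ShuffleSeeded``` name, the seed value 42, and the row count of 10 are assumptions chosen for illustration.

```csharp
using System;
using System.Linq;

// Fisher-Yates shuffle driven by a caller-supplied Random, so a fixed seed
// reproduces the same permutation on every run.
static T[] ShuffleSeeded<T>(T[] array, Random rand)
{
    for (int i = 0; i < array.Length - 1; i++)
    {
        // Pick a random position from the not-yet-fixed tail and swap it into place.
        int r = i + rand.Next(array.Length - i);
        (array[i], array[r]) = (array[r], array[i]);
    }
    return array;
}

// Toy stand-in for housingData.Rows.Count (assumption: 10 rows).
int rowCount = 10;
int[] indices = ShuffleSeeded(Enumerable.Range(0, rowCount).ToArray(), new Random(42));

// Same 20% / 80% split as in the cell above.
int testCount = (int)(rowCount * .2);
int[] testIdx = indices[..testCount];
int[] trainIdx = indices[testCount..];

Console.WriteLine($"train: {trainIdx.Length}, test: {testIdx.Length}"); // train: 8, test: 2
```

With a fixed seed, differences between runs are more likely to come from the experiment settings you change than from a different split.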





Now we can train our model to predict ```median_house_value```. We set an experiment time of 15 seconds for this iteration but, as with dataset size, you can experiment with longer or shorter periods to explore the effect on model accuracy.

In [None]:
#!time

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

var experiment = mlContext.Auto().CreateRegressionExperiment(maxExperimentTimeInSeconds: 15);
var result = experiment.Execute(trainingData, labelColumnName:"median_house_value");



What happened there? The output wall time was much longer than the 15 seconds we specified. That's because ML.NET automatically evaluates multiple algorithms, so the process takes a little longer. Now that we have a model, let's take a look at some of the algorithms that were evaluated and their error scores. From these scores we can see that LightGbmRegression was the most accurate algorithm.
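
The error score plotted in the next cell is the mean absolute error: the average distance between each predicted value and the true value, in the same units as the label. As a rough illustration of what that number means, here is a hand-rolled sketch on three made-up house values. The ```MeanAbsoluteError``` helper and the sample numbers are hypothetical and are not part of ML.NET or the housing data.

```csharp
using System;
using System.Linq;

// Mean absolute error: the average of |actual - predicted| over all rows.
static double MeanAbsoluteError(float[] actual, float[] predicted) =>
    actual.Zip(predicted, (a, p) => Math.Abs((double)a - p)).Average();

// Hypothetical house values, for illustration only.
float[] actual = { 100_000f, 200_000f, 300_000f };
float[] predicted = { 110_000f, 190_000f, 330_000f };

// Absolute errors are 10,000, 10,000 and 30,000, so the mean is roughly 16,667.
Console.WriteLine(MeanAbsoluteError(actual, predicted));
```

A lower value means predictions sit closer to the true prices on average, which is why the lowest points on the chart correspond to the better trainers.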

In [None]:
var scatters = result.RunDetails.Where(d => d.ValidationMetrics != null).GroupBy(
    r => r.TrainerName,
    (name, details) => new Scattergl()
    {
        name = name,
        x = details.Select(r => r.RuntimeInSeconds),
        y = details.Select(r => r.ValidationMetrics.MeanAbsoluteError),
        mode = "markers",
        marker = new Marker() { size = 12 }
    });

var chart = Chart.Plot(scatters);
chart.WithXTitle("Training Time");
chart.WithYTitle("Error");
display(chart);

Console.WriteLine($"Best Trainer: {result.BestRun.TrainerName}");



Now let's see how we perform against our test data set (data which the model has not previously seen) so we can make a more realistic assessment of our model's accuracy.

In [None]:
// Score the held-out test rows created in the split above.
var testResults = result.BestRun.Model.Transform(testData);

var actualValues = testResults.GetColumn<float>("median_house_value");
var predictedValues = testResults.GetColumn<float>("Score");

var predictedVsActual = new Scattergl()
{
    x = actualValues,
    y = predictedValues,
    mode = "markers",
};

var maximumValue = Math.Max(actualValues.Max(), predictedValues.Max());

var perfectLine = new Scattergl()
{
    x = new[] {0, maximumValue},
    y = new[] {0, maximumValue},
    mode = "lines",
};

var chart = Chart.Plot(new[] {predictedVsActual, perfectLine });
chart.WithXTitle("Actual Values");
chart.WithYTitle("Predicted Values");
chart.WithLegend(false);
chart.Width = 800;
chart.Height = 600;
display(chart);