# Iris Classification

Let's take a look at the well-known Iris classification dataset as our entry point to interactive notebooks.

We begin by getting our NuGet packages and importing some dependencies we will use.

In [None]:
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" 
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-tools/nuget/v3/index.json" 

#r "nuget:Microsoft.ML, 1.6.0"
#r "nuget:Microsoft.ML.AutoML, 0.18.0"
#r "nuget:Microsoft.Data.Analysis, 0.18.0"
#r "nuget: XPlot.Plotly.Interactive, 4.0.2"

Loading extensions from `Microsoft.Data.Analysis.Interactive.dll`

Loading extensions from `XPlot.Plotly.Interactive.dll`

Configuring PowerShell Kernel for XPlot.Plotly integration.

Installed support for XPlot.Plotly.

Let's put most of our imports here up front. For clarity, we leave the ML.NET imports until later, when we come to training our model.

In [None]:
using System.IO;
using System.Net.Http;
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.DotNet.Interactive.Formatting;
using Microsoft.Data.Analysis;
using XPlot.Plotly;
using Microsoft.AspNetCore.Html;

Now define a formatter for the DataFrame we are going to load, so the data is rendered as an HTML table showing up to the first 20 rows.

In [None]:
Formatter.Register<DataFrame>((df, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    for (var i = 0; i < Math.Min(take, df.Rows.Count); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.Rows[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

Fetch and save the data file if we do not already have it locally.

In [None]:
string irisPath = "iris.csv";

if (!File.Exists(irisPath))
{
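    // Download the CSV once; later runs reuse the local copy.
    using var client = new HttpClient();
    var contents = await client
        .GetStringAsync("https://datahub.io/machine-learning/iris/r/iris.csv");

    // Write to the same path we check above, rather than repeating the literal.
    File.WriteAllText(irisPath, contents);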
}

Let's take a look at what the data looks like using some of the DataFrame's built-in functionality. We can see the column labels and the first five rows of the file using ```display(irisData.Head(5));```. We can also use ```display(irisData.Info());``` to get the column data types. If you are running this as an interactive notebook, you can try these for yourself.

In [None]:
var irisData = DataFrame.LoadCsv(irisPath);
display(irisData.Head(5));


So we can see the table has 6 columns. The index column is a convenience added by our formatter, so we are interested in sepallength, sepalwidth, petallength and petalwidth as our features, and class is the value we will be trying to predict.

We have imported charting capabilities, so let's take a look at one of the columns visually.

In [None]:
Chart.Plot(
    new Histogram()
    {
        x = irisData.Columns["sepallength"],
        nbinsx = 20
    }
)

So that gave us a histogram of the sepallength column. Let's see what the data types are for each column.

In [None]:
display(irisData.Info());

So our feature data type is 'Single' and we have 150 values in each column.

Now that we have an idea of what the basic data looks like, let's prepare our training and test data. Remember, we want two separate datasets so that we can validate our model more thoroughly. We are going to randomize the data, split off 15% to use as test data, and use the remainder as our training data. You can experiment with smaller or larger training datasets.

In [None]:
static T[] Shuffle<T>(T[] array)
{
    Random rand = new Random();
    for (int i = 0; i < array.Length; i++)
    {
        int r = i + rand.Next(array.Length - i);
        T temp = array[r];
        array[r] = array[i];
        array[i] = temp;
    }
    return array;
}

int[] randomIndices = Shuffle(Enumerable.Range(0, (int)irisData.Rows.Count).ToArray());
int testSize = (int)(irisData.Rows.Count * .15);
int[] trainRows = randomIndices[testSize..];
int[] testRows = randomIndices[..testSize];

DataFrame trainingData = irisData[trainRows];
DataFrame testData = irisData[testRows];

display($"Training row count {trainingData.Rows.Count}");
display($"Testing row count {testData.Rows.Count}");
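
The split above uses an unseeded `Random`, so every run selects different rows and the accuracy figures later on will vary slightly between runs. The following is a minimal sketch of the same shuffle-and-split idea with a fixed seed, so the split is repeatable. The helper name `ShuffledIndices` and the seed value are hypothetical, and the row count of 150 is taken from the `Info()` output described above.

```csharp
using System;
using System.Linq;

// Fisher-Yates shuffle driven by a caller-supplied Random, so a fixed seed gives a fixed order.
static int[] ShuffledIndices(int count, Random rng)
{
    int[] indices = Enumerable.Range(0, count).ToArray();
    for (int i = 0; i < indices.Length - 1; i++)
    {
        // Pick a position from the part of the array that has not been fixed yet.
        int r = i + rng.Next(indices.Length - i);
        (indices[i], indices[r]) = (indices[r], indices[i]);
    }
    return indices;
}

int rowCount = 150;          // rows in iris.csv, as reported earlier
double testFraction = 0.15;  // same 15% hold-out as above
int seed = 42;               // hypothetical seed; any fixed value works

int[] shuffled = ShuffledIndices(rowCount, new Random(seed));
int testCount = (int)(rowCount * testFraction);
int[] testIdx = shuffled[..testCount];
int[] trainIdx = shuffled[testCount..];

Console.WriteLine($"Train: {trainIdx.Length}, Test: {testIdx.Length}");
// prints "Train: 128, Test: 22"
```

With index arrays like these, `irisData[trainIdx]` and `irisData[testIdx]` would select the rows in the same way as the cell above.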

Now we can train our model to predict ```class```. We set an experiment time of 15 seconds for this iteration but, as with dataset size, you can experiment with longer or shorter periods to explore the effect on model accuracy. The labels in the data file give us three possible values to predict, so this is not a Binary Classification problem. We have multiple possible answers, so we use ```MulticlassClassification``` as our experiment.

In [None]:
#!time

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

var experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(maxExperimentTimeInSeconds: 15);
var result = experiment.Execute(trainingData, labelColumnName:"class");

OK, so training takes longer than the 15 seconds we specified. That's because ML.NET automatically evaluates multiple algorithms, which adds to the overall time. Now that we have a model, let's take a look at the algorithms that were evaluated and their accuracy scores. From these scores we can see that, in this run, SdcaMaximumEntropyMulti was the most accurate algorithm.

In [None]:
var scatters = result.RunDetails.Where(d => d.ValidationMetrics != null).GroupBy(
    r => r.TrainerName,
    (name, details) => new Scattergl()
    {
        name = name,
        x = details.Select(r => r.RuntimeInSeconds),
        y = details.Select(r => r.ValidationMetrics.MacroAccuracy),
        mode = "markers",
        marker = new Marker() { size = 12 }
    });

var chart = Chart.Plot(scatters);
chart.WithXTitle("Training Time");
chart.WithYTitle("Macro Accuracy");
display(chart);

Console.WriteLine($"Best Trainer: {result.BestRun.TrainerName} Macro Accuracy: {result.BestRun.ValidationMetrics.MacroAccuracy * 100} %");

Now let's see how we perform against our test data set (data which the model has not previously seen) so we can make a more realistic assessment of our model's accuracy.

In [None]:
var evaluationData = result.BestRun.Model.Transform(testData);


var evaluationMetrics = mlContext.MulticlassClassification.Evaluate(evaluationData, labelColumnName:"class");

Console.WriteLine($"Model predicted on test data with Macro Accuracy of {evaluationMetrics.MacroAccuracy * 100} %");


So we can see our results on the test data are not quite as good as on the training data. This is to be expected: the model gains 'experience' with the training data over each iteration, while the test rows are completely new to it.
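
To make the metric itself more concrete, here is a minimal sketch of how a macro accuracy figure can be computed by hand: calculate the accuracy for each class separately, then average those per-class values so that every class counts equally. The labels below are invented purely for illustration and are not output from the model above. Micro accuracy is included only for contrast.

```csharp
using System;
using System.Linq;

// Hypothetical true and predicted labels, invented for illustration only.
string[] actual =
{
    "setosa", "versicolor", "versicolor",
    "virginica", "virginica", "virginica", "virginica", "virginica"
};
string[] predicted =
{
    "setosa", "versicolor", "virginica",
    "virginica", "virginica", "virginica", "virginica", "virginica"
};

// Per-class accuracy: of the rows that truly belong to a class,
// the fraction that were predicted as that class.
var perClass = actual
    .Zip(predicted, (a, p) => (Actual: a, Correct: a == p))
    .GroupBy(x => x.Actual)
    .Select(g => (Label: g.Key, Accuracy: g.Average(x => x.Correct ? 1.0 : 0.0)))
    .ToList();

// Macro accuracy: plain average of the per-class values, so each class has equal weight.
double macroAccuracy = perClass.Average(c => c.Accuracy);

// Micro accuracy: fraction of all rows predicted correctly, so larger classes weigh more.
double microAccuracy = actual.Zip(predicted, (a, p) => a == p ? 1.0 : 0.0).Average();

foreach (var c in perClass)
{
    Console.WriteLine($"{c.Label}: {c.Accuracy:F3}");
}
Console.WriteLine($"Macro accuracy: {macroAccuracy:F3}, micro accuracy: {microAccuracy:F3}");
```

In this made-up example one of the two versicolor rows is misclassified, which pulls the macro figure down further than the micro figure because versicolor is a small class.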


Let's try another approach. Instead of using a MulticlassClassification experiment, let's specify the trainer we want to use. We are going to use KMeans to train a model that assigns each iris to one of three clusters.

In [None]:
var features = "Features";

var pipeline = mlContext.Transforms
    .Concatenate(features, "sepallength", "sepalwidth", "petallength", "petalwidth")
    .Append(mlContext.Clustering.Trainers.KMeans(features, numberOfClusters: 3));


var model = pipeline.Fit(trainingData);

public class IrisData
{
    public float sepallength;
    public float sepalwidth;
    public float petallength;
    public float petalwidth;
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedCluster;

    [ColumnName("Score")]
    public float[] Distances;
}

var predictor = mlContext.Model.CreatePredictionEngine<IrisData, ClusterPrediction>(model);

var prediction = predictor.Predict(new IrisData() {sepallength = 6.7F, sepalwidth = 3.0F, petallength = 5.2F, petalwidth = 2.3F});
Console.WriteLine($"Predicted cluster is: {prediction.PredictedCluster}");


So the model has predicted the correct cluster for our test case. Let's try again with a different iris from cluster 1.

In [None]:
prediction = predictor.Predict(new IrisData() {sepallength = 4.3F, sepalwidth = 3.0F, petallength = 1.1F, petalwidth = 0.1F});
Console.WriteLine($"Predicted cluster is: {prediction.PredictedCluster}");
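
The `ClusterPrediction` class above exposes both a predicted cluster and an array of distances. As a rough illustration of how the two relate, the sketch below assigns a point to whichever centroid is nearest. The centroid values are hypothetical and are not taken from the trained model, the helper names `SquaredDistance` and `Assign` are invented for this example, and the use of squared Euclidean distance and 1-based cluster numbers are assumptions made to keep the sketch simple.

```csharp
using System;
using System.Linq;

// Hypothetical centroids in (sepallength, sepalwidth, petallength, petalwidth) order.
// These are illustrative values, not the centroids learned by the model above.
float[][] centroids =
{
    new[] { 5.0f, 3.4f, 1.5f, 0.2f },
    new[] { 5.9f, 2.8f, 4.4f, 1.4f },
    new[] { 6.8f, 3.1f, 5.7f, 2.1f },
};

// Squared Euclidean distance between two feature vectors of equal length.
static float SquaredDistance(float[] a, float[] b) =>
    a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

// Returns the 1-based number of the nearest centroid plus the distance to every centroid.
static (int Cluster, float[] Distances) Assign(float[] point, float[][] centers)
{
    float[] distances = centers.Select(c => SquaredDistance(point, c)).ToArray();
    int nearest = Array.IndexOf(distances, distances.Min());
    return (nearest + 1, distances);
}

// The same two measurements used in the prediction cells above.
var first = Assign(new[] { 6.7f, 3.0f, 5.2f, 2.3f }, centroids);
var second = Assign(new[] { 4.3f, 3.0f, 1.1f, 0.1f }, centroids);

Console.WriteLine($"First iris: cluster {first.Cluster}, second iris: cluster {second.Cluster}");
// prints "First iris: cluster 3, second iris: cluster 1"
```

Because clustering is unsupervised, the cluster numbers are just identifiers. Which number ends up corresponding to which iris class depends on the trained model, so it is worth checking the mapping against a few labelled rows.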

