# Assignment: Recognize handwritten digits

In this article, you are going to build an app that recognizes handwritten digits from the famous MNIST machine learning dataset:

![MNIST digits](./assets/mnist.png)

Your app must read these images of handwritten digits and correctly predict which digit is visible in each image.

This may seem like an easy challenge, but look at this:

![Difficult MNIST digits](./assets/mnist_hard.png)

These are a couple of digits from the dataset. Are you able to identify each one? It probably won’t surprise you to hear that the human error rate on this exercise is around 2.5%.

The first thing you will need for your app is a data file with images of handwritten digits. We will not use the original MNIST data because it's stored in a nonstandard binary format.

Instead, we'll use these excellent [CSV files](https://www.kaggle.com/oddrationale/mnist-in-csv/) prepared by Dariel Dato-on on Kaggle.

The training and testing files **mnist_train.csv** and **mnist_test.csv** have already been downloaded and are available to your code. There are 60,000 images in the training file and 10,000 in the test file. Each image is monochrome and resized to 28x28 pixels.

The training file looks like this:

![Data file](./assets/datafile.png)

It’s a CSV file with 785 columns:

* The first column contains the label. It tells us which one of the 10 possible digits is visible in the image.
* The next 784 columns are the pixel intensity values (0..255) for each pixel in the image, counting from left to right and top to bottom.
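The left-to-right, top-to-bottom layout described above implies a simple index calculation. As a minimal sketch (the helper names `ToIndex` and `ToRowCol` are illustrative and not part of ML.NET), a pixel at a given row and column of the 28x28 image maps to a position in the 784-element pixel vector like this:

```csharp
using System;

const int Width = 28; // MNIST images are 28x28 pixels

// Hypothetical helpers: convert between (row, col) and the flat pixel index.
// Index 0 is the top-left pixel. The matching CSV column is index + 1,
// because CSV column 0 holds the label.
int ToIndex(int row, int col) => row * Width + col;
(int Row, int Col) ToRowCol(int index) => (index / Width, index % Width);

Console.WriteLine(ToIndex(0, 0));    // 0   (top-left pixel)
Console.WriteLine(ToIndex(27, 27));  // 783 (bottom-right pixel)
Console.WriteLine(ToRowCol(100));    // (3, 16)
```

The same arithmetic is useful whenever you want to turn the flat pixel vector back into a two-dimensional image, for example to plot it.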

You are going to build a multiclass classification machine learning model that reads in all 785 columns, and then makes a prediction for each digit in the dataset.

## Get started

Let’s get started. You're going to install the following ML.NET package:

In [6]:
#r "nuget:Microsoft.ML"

This will install the Microsoft ML.NET machine learning library. 

Now let's add some code:

In [13]:
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;
using XPlot.Plotly;

You will also need two classes: one to hold a digit, and one to hold your model prediction:

In [8]:
class Digit
{
    [ColumnName("PixelValues")]
    [VectorType(784)]
    public float[] PixelValues;

    [LoadColumn(0)]
    public float Number;
}

class DigitPrediction
{
    [ColumnName("Score")]
    public float[] Score;
}

The **Digit** class holds one single MNIST digit image. Note how the **PixelValues** field is tagged with a **VectorType** attribute. This tells ML.NET to combine the 784 individual pixel columns into a single vector value.

There's also a **DigitPrediction** class which will hold a single prediction. And notice how the prediction score is actually an array? The model will generate 10 scores, one for every possible digit value. 
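To turn a score array into a single answer, you pick the entry with the highest score. The sketch below uses made-up scores rather than real model output, and it assumes that the position in the array corresponds to the digit value (position 0 for the digit 0, position 1 for the digit 1, and so on):

```csharp
using System;
using System.Linq;

// Hypothetical scores for one image, one entry per digit 0..9 (made-up numbers)
float[] score = { 0.01f, 0.02f, 0.05f, 0.80f, 0.01f, 0.04f, 0.01f, 0.02f, 0.03f, 0.01f };

// The predicted digit is the index of the largest score
int predicted = Array.IndexOf(score, score.Max());

Console.WriteLine($"Predicted digit: {predicted}");
```

In this example the largest score sits at position 3, so the predicted digit is 3.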

## Loading the data

Next you'll need to load the data in memory:

In [9]:
// filenames for data set
private static string trainDataPath = Path.Combine(Environment.CurrentDirectory, "mnist_train.csv");
private static string testDataPath = Path.Combine(Environment.CurrentDirectory, "mnist_test.csv");

// create a machine learning context
var context = new MLContext();

// load data
Console.Write("Loading data....");
var columnDef = new TextLoader.Column[]
{
    new TextLoader.Column(nameof(Digit.PixelValues), DataKind.Single, 1, 784),
    new TextLoader.Column("Number", DataKind.Single, 0)
};
var trainDataView = context.Data.LoadFromTextFile(
    path: trainDataPath,
    columns : columnDef,
    hasHeader : true,
    separatorChar : ',');
var testDataView = context.Data.LoadFromTextFile(
    path: testDataPath,
    columns : columnDef,
    hasHeader : true,
    separatorChar : ',');
Console.WriteLine("done");

Loading data....done


This code uses the **LoadFromTextFile** method to load the CSV data directly into memory. Note the **columnDef** variable that instructs ML.NET to load CSV columns 1..784 into the PixelValues column, and CSV column 0 into the Number column.
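To get a feel for what this column mapping means, here is a rough hand-written equivalent for a single CSV row. It is only a conceptual sketch and not how ML.NET reads the file internally. The row is made up and shortened to 4 pixel values instead of 784:

```csharp
using System;
using System.Globalization;
using System.Linq;

// A shortened, made-up CSV row: the label followed by 4 pixel values
string line = "5,0,128,255,0";
string[] fields = line.Split(',');

// CSV column 0 -> Number
float number = float.Parse(fields[0], CultureInfo.InvariantCulture);

// CSV columns 1..N -> PixelValues
float[] pixelValues = fields
    .Skip(1)
    .Select(f => float.Parse(f, CultureInfo.InvariantCulture))
    .ToArray();

Console.WriteLine($"Number = {number}, pixel count = {pixelValues.Length}");
```

The column definitions passed to **LoadFromTextFile** express the same split declaratively, so you never have to write this parsing code yourself.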

Let's see if the data loaded correctly. We're going to deserialize the training data into an enumeration of **Digit** instances and do a quick visual check of the data:

In [10]:
// get an array of digit instances
var data = context.Data.CreateEnumerable<Digit>(trainDataView, reuseRowObject: false).ToArray();

// display the result
display(data.Take(10));

index,PixelValues,Number
0,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",5
1,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",0
2,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",4
3,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",1
4,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",9
5,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",2
6,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",1
7,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",3
8,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",1
9,"[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... (774 more) ]",4


That looks good. We have a **PixelValues** column with a 1-dimensional array of 784 elements (28 x 28); these are the pixels of the individual images. And the final column is **Number**, which indicates what number is visible in the image.

Let's try to plot the first number in the sequence to get a feel for what the images look like:

In [18]:
// plot the first digit
var chart = Chart.Plot(
    new Graph.Scattergl()
    {
        x = (from i in Enumerable.Range(0,28) from j in Enumerable.Range(0,28) select j),
        y = (from i in Enumerable.Range(0,28) from j in Enumerable.Range(0,28) select 27-i),
        mode = "markers",
        marker = new Graph.Marker()
        {
            symbol = "square",
            size = 12,
            color = from v in data[0].PixelValues select 255-v,
            colorscale = "Greys"
        }
    }
);
chart.WithXTitle("X");
chart.WithYTitle("Y");
chart.WithTitle("The first digit");
chart.Width = 600;
chart.Height = 600;
display(chart);

You can see what our app is up against. These digits are not like neatly printed characters at all but instead resemble quite sloppy handwriting. Will our machine learning model be able to make sense out of them? 

Let's find out!

## Training the model

We're going to build the machine learning pipeline now:

In [19]:
// build a training pipeline
// step 1: map the number column to a key value and store in the label column
var pipeline = context.Transforms.Conversion.MapValueToKey(
    outputColumnName: "Label", 
    inputColumnName: "Number", 
    keyOrdinality: ValueToKeyMappingEstimator.KeyOrdinality.ByValue)

    // step 2: concatenate all feature columns
    .Append(context.Transforms.Concatenate(
        "Features", 
        nameof(Digit.PixelValues)))
        
    // step 3: cache data to speed up training                
    .AppendCacheCheckpoint(context)

    // step 4: train the model with SDCA
    .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy(
        labelColumnName: "Label", 
        featureColumnName: "Features"))

    // step 5: map the label key value back to a number
    .Append(context.Transforms.Conversion.MapKeyToValue(
        outputColumnName: "Number",
        inputColumnName: "Label"));

// train the model
Console.Write("Training the model, this can take a few seconds. Please wait until the word 'done' appears....");
var model = pipeline.Fit(trainDataView);
Console.WriteLine("done");

Training the model, this can take a few seconds. Please wait until the word 'done' appears....done


Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.

This pipeline has the following components:

* **MapValueToKey** which reads the **Number** column and builds a dictionary of unique values. It then produces an output column called **Label** which contains the dictionary key for each number value. We need this step because we can only train a multiclass classifier on keys. 
* **Concatenate** which converts the PixelValues vector into a single column called Features. This is a required step because ML.NET can only train on a single input column.
* **AppendCacheCheckpoint** which caches all training data at this point. This is an optimization step that speeds up the learning algorithm which comes next.
* An **SdcaMaximumEntropy** classification learner which will train the model to make accurate predictions.
* A final **MapKeyToValue** step which converts the keys in the **Label** column back to the original number values. We need this step to show the numbers when making predictions. 

With the pipeline fully assembled, you can train the model with a call to **Fit**.
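The value-to-key and key-to-value steps can feel abstract, so here is a conceptual sketch of the idea in plain C#. This is not ML.NET's implementation, and its internal key representation differs; the label values are made up, and the variable names are only illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up label values as they might appear in the Number column
float[] numbers = { 5f, 0f, 4f, 1f, 9f, 2f, 1f, 3f, 1f, 4f };

// Build the dictionary: sort the unique values and give each one an index
float[] uniqueValues = numbers.Distinct().OrderBy(v => v).ToArray();
Dictionary<float, int> valueToKey = uniqueValues
    .Select((value, index) => (value, index))
    .ToDictionary(p => p.value, p => p.index);

// Value -> key (conceptually what the first pipeline step does)
int[] keys = numbers.Select(v => valueToKey[v]).ToArray();

// Key -> value (conceptually what the final pipeline step does)
float[] restored = keys.Select(k => uniqueValues[k]).ToArray();

Console.WriteLine(string.Join(", ", keys));
Console.WriteLine(string.Join(", ", restored));
```

Note that in this small sample the digits 6, 7, and 8 never occur, so the value 9 ends up with key 6. In the full MNIST training file all ten digits are present, and sorting by value means each digit lines up with its own position.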

## Evaluating the model

You now have a fully trained model. So now it's time to take the test set, predict the number for each digit image, and calculate the accuracy metrics of the model:

In [20]:
// use the model to make predictions on the test data
Console.WriteLine("Evaluating model....");
var predictions = model.Transform(testDataView);

// evaluate the predictions
var metrics = context.MulticlassClassification.Evaluate(
    data: predictions, 
    labelColumnName: "Number", 
    scoreColumnName: "Score");

// show evaluation metrics
Console.WriteLine($"Evaluation metrics");
Console.WriteLine($"    MicroAccuracy:    {metrics.MicroAccuracy:0.###}");
Console.WriteLine($"    MacroAccuracy:    {metrics.MacroAccuracy:0.###}");
Console.WriteLine($"    LogLoss:          {metrics.LogLoss:#.###}");
Console.WriteLine($"    LogLossReduction: {metrics.LogLossReduction:#.###}");
Console.WriteLine();

Evaluating model....
Evaluation metrics
    MicroAccuracy:    0.874
    MacroAccuracy:    0.872
    LogLoss:          .414
    LogLossReduction: .82



This code calls **Transform** to set up predictions for every single image in the test set. And the **Evaluate** method compares these predictions to the actual labels and automatically calculates four metrics:

* **MicroAccuracy**: this is the overall accuracy (=the number of correct predictions divided by the total number of predictions), calculated over all digits in the test set.
* **MacroAccuracy**: this is calculated by first calculating the accuracy for each digit class separately, and then taking the average of those per-class accuracies.
* **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means the model assigns a probability of 100% to the correct digit for every image, and the loss value rises as the model becomes less confident in the correct answer or makes more mistakes.
* **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses how much better the model’s log loss is than that of a baseline that always predicts the prior class frequencies. A value of 1 is a perfect model, and a value of 0 means the model is no better than the baseline.
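For readers who prefer formulas, the four metrics can be written down as follows. The notation is introduced here for illustration: $N$ is the number of test images, $K = 10$ is the number of classes, $N_k$ is the number of test images whose actual digit is $k$, $y_i$ is the actual digit of image $i$, $\hat{y}_i$ is the predicted digit, and $p_i(y_i)$ is the probability the model assigns to the actual digit. $\text{LogLoss}_{\text{prior}}$ is the log loss of a baseline that always predicts the prior class frequencies.

$$
\begin{aligned}
\text{MicroAccuracy} &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i] \\
\text{MacroAccuracy} &= \frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k}\sum_{i:\,y_i = k} \mathbf{1}[\hat{y}_i = y_i] \\
\text{LogLoss} &= -\frac{1}{N}\sum_{i=1}^{N} \log p_i(y_i) \\
\text{LogLossReduction} &= \frac{\text{LogLoss}_{\text{prior}} - \text{LogLoss}}{\text{LogLoss}_{\text{prior}}}
\end{aligned}
$$

As a quick sanity check, a MicroAccuracy of 0.874 on the 10,000 test images corresponds to roughly 8,740 correct predictions.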

We get a MicroAccuracy value of 0.874 and a MacroAccuracy value of 0.872. These two values are very close together, which suggests that the dataset is balanced: each digit occurs roughly the same number of times in the file, and no single digit dominates the overall score.

A micro-accuracy of 87% is not a bad result. It means that out of 100 digits the model only makes 13 mistakes. But keep in mind that a human would only make 2.5 mistakes on the same task! This model is nowhere near human performance. 

Let's take a look at the confusion matrix:

In [34]:
// plot the confusion matrix
var n = metrics.ConfusionMatrix.NumberOfClasses;
var chart = Chart.Plot(
    new Graph.Scattergl()
    {
        x = (from i in Enumerable.Range(0,n) from j in Enumerable.Range(0,n) select j),
        y = (from i in Enumerable.Range(0,n) from j in Enumerable.Range(0,n) select i),
        mode = "markers",
        marker = new Graph.Marker()
        {
            symbol = "square",
            size = 32,
            color = from i in Enumerable.Range(0,n) from j in Enumerable.Range(0,n) select n-metrics.ConfusionMatrix.Counts[i][j], // Counts[actual][predicted]
            colorscale = "Greys"
        }
    }
);
chart.WithXTitle("Predicted digit");
chart.WithYTitle("Actual digit");
chart.WithTitle("Confusion matrix");
chart.Width = 600;
chart.Height = 600;
display(chart);

This looks great! We have a nice dark sequence along the main diagonal with all the correct predictions the model is making. There are a couple of grey cells visible that correspond to understandable mistakes: a 9 misidentified as a 4, a 3 that's taken for a 5, and a 5 that's taken for an 8, among others.

The model has clearly learned to identify circles and half-circles in digits, but it sometimes struggles to match them to the correct number. 
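You can also read a confusion matrix numerically instead of visually. The following sketch uses a made-up 3-class matrix (not the MNIST results) with rows for the actual class and columns for the predicted class. It computes the accuracy per class and finds the most frequent mistake:

```csharp
using System;
using System.Linq;

// Made-up 3-class confusion matrix: rows = actual class, columns = predicted class
double[][] counts =
{
    new double[] { 8, 1, 1 },
    new double[] { 3, 6, 1 },
    new double[] { 0, 0, 5 },
};
int classes = counts.Length;

// Accuracy per actual class: diagonal cell divided by the row total
double[] perClass = Enumerable.Range(0, classes)
    .Select(k => counts[k][k] / counts[k].Sum())
    .ToArray();

// Most frequent mistake: the largest cell that is not on the main diagonal
var worst = (from a in Enumerable.Range(0, classes)
             from p in Enumerable.Range(0, classes)
             where a != p
             orderby counts[a][p] descending
             select (Actual: a, Predicted: p, Count: counts[a][p])).First();

for (int k = 0; k < classes; k++)
    Console.WriteLine($"Class {k}: {perClass[k]:0.00}");
Console.WriteLine($"Most common mistake: actual {worst.Actual} predicted as {worst.Predicted} ({worst.Count} times)");
```

In this toy matrix the most frequent mistake is an actual 1 predicted as a 0, which happens 3 times. Applied to the real matrix, the same approach would list the digit pairs the model confuses most often.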

## Making a prediction

To wrap up, let’s use the model to make a prediction.

You will pick five arbitrary digits from the test set, run them through the model, and make a prediction for each one.

Here’s how to do it:

In [33]:
// grab five digits from the test data
var digits = context.Data.CreateEnumerable<Digit>(testDataView, reuseRowObject: false).ToArray();
var testDigits = new Digit[] { digits[5], digits[16], digits[28], digits[63], digits[129] };

// create a prediction engine
var engine = context.Model.CreatePredictionEngine<Digit, DigitPrediction>(model);

// predict each test digit
var results = from d in testDigits select new { Digit = d.Number, Predictions = from p in engine.Predict(d).Score select p.ToString("P") };
display(results);

index,Digit,Predictions
0,1,"[ 0.000%, 98.640%, 0.188%, 0.403%, 0.003%, 0.017%, 0.005%, 0.134%, 0.565%, 0.046% ]"
1,9,"[ 0.044%, 0.000%, 0.176%, 0.007%, 7.662%, 0.010%, 0.006%, 8.773%, 0.593%, 82.730% ]"
2,0,"[ 99.560%, 0.000%, 0.019%, 0.175%, 0.000%, 0.125%, 0.000%, 0.000%, 0.121%, 0.000% ]"
3,3,"[ 0.016%, 0.066%, 39.764%, 48.136%, 1.798%, 0.980%, 0.046%, 0.064%, 4.464%, 4.666% ]"
4,5,"[ 0.170%, 0.000%, 0.042%, 1.620%, 0.044%, 90.163%, 0.008%, 0.004%, 4.052%, 3.895% ]"


This code calls the **CreateEnumerable** method to convert the test dataview to an array of Digit instances. Then it picks five arbitrary digits (at fixed positions in the array) for testing.

The **CreatePredictionEngine** method sets up a prediction engine. The two type arguments are the input data class and the class to hold the prediction.

And finally, the code uses a LINQ expression that repeatedly calls **Predict** to generate a table with digit labels and the 10 prediction scores for each possible result. 

And here are the results for the five test digits:

* The first prediction scores 98% on ‘1’ which is correct.
* The second prediction scores 82% on ‘9’ (correct), almost 9% on ‘7’, and almost 8% on ‘4’. And this makes sense if you think about it: a sloppily written 9 can easily look like a 4 or a 7. But the model correctly picks ‘9’ as the most likely solution.
* The third prediction scores 99% on ‘0’ which is correct.
* The fourth prediction scores only 48% on ‘3’. The model thinks the number could also be a ‘2’ (almost 40%) or, with much lower scores, a ‘9’, an ‘8’, or a ‘4’. This also makes sense: a very sloppily drawn 3 could indeed look like a 2, a 9, or an 8. 
* And the fifth prediction scores 90% on '5' which is correct.

All five test predictions are correct. The highest prediction score always corresponds to the correct digit.

## Further improvements

How do you think we could improve this model even further? 