# TensorFlow MNIST

## Introducing the dataset

The MNIST database is a set of images of handwritten digits that has been collected and made publicly available.
Each image is labeled with the digit it shows, which makes the set an excellent source for testing out a neural network.

You can read more about the dataset here:
https://en.wikipedia.org/wiki/MNIST_database

Below are some samples of the images:
<img src="img/MnistExamples.png">

Training on the MNIST dataset is often talked about as the __hello world__ of machine learning.

## The plan

We will build a neural network using TensorFlow. The network will look something like this:

<img src="img/mnist_2layers.png">

We'll have a 784-neuron input layer (one neuron per pixel of a 28x28 image) and a 10-neuron output layer (one neuron for each digit). In addition, we'll use one hidden layer.

We will use TensorFlow to construct the neural net.
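
To make the layer sizes concrete, the sketch below counts the trainable parameters of a fully connected network of this shape. It assumes a hidden layer of 128 neurons, which is the size used later in this notebook; `dense_params` is a helper name made up for this illustration.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output neuron of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

input_size = 28 * 28   # 784 pixels per image
hidden_size = 128      # assumption: matches the model defined later
output_size = 10       # one neuron per digit

hidden_params = dense_params(input_size, hidden_size)
output_params = dense_params(hidden_size, output_size)
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 100480 1290 101770
```

Almost all of the parameters sit between the input and the hidden layer, simply because the input is so much wider than the output.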


## Install TensorFlow (if not already present)

First, let's make sure that we have TensorFlow installed.

In [None]:
%%!
pip install tensorflow
pip install tensorflow_datasets

## Imports

We have a few libraries that we'll use.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
%matplotlib inline


## Loading the data

The good news for this exercise is that we don't have to download anything by hand: `tensorflow_datasets` (imported above as `tfds`) can load MNIST for us.

The data is split into two sets.

* `ds_train` is the set that we'll use to train our model
* `ds_test` we will use to measure how well our model is doing

The arguments:

* `shuffle_files=True`: The MNIST data is only stored in a single file, but for larger datasets with multiple files on disk, it's good practice to shuffle them when training.
* `as_supervised=True`: Returns a tuple (img, label) instead of a dictionary {'image': img, 'label': label}.
* `with_info=True`: Also returns the dataset metadata, which we store in `ds_info`.


In [None]:
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)


## Checking the data set

Let's take a look at the size of each dataset:

In [None]:
print("Size of training set:", ds_train.cardinality().numpy())
print("Size of test set    :", ds_test.cardinality().numpy())


## Define a normalization function

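The images store each pixel as an integer (`uint8`, 0 to 255), but the model expects floats. The function below casts the pixels to `float32` and scales them to the range 0 to 1.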

In [None]:
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label
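
As a rough illustration of what the cast and the division do, here is the same arithmetic in plain NumPy on a tiny made-up 2x2 "image". The pixel values are arbitrary and not taken from MNIST.

```python
import numpy as np

# A made-up 2x2 grayscale patch with uint8 pixel values.
pixels = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Cast to float32 first, then scale into the range [0, 1].
scaled = pixels.astype(np.float32) / 255.0

print(scaled.dtype)                # float32
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Casting before dividing matters: the result stays `float32`, which is the type the rest of the pipeline works with.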

## Setup training pipeline

Next, we have to build a training pipeline:

1. TFDS provides images where each pixel is expressed as an integer (a `tf.uint8`). For that reason, we defined the `normalize_img` function. We now have to apply this function.
2. `tf.data.Dataset.cache`: As the dataset fits in memory, cache it before shuffling for better performance.
3. `tf.data.Dataset.shuffle`: For true randomness, set the shuffle buffer to the full dataset size.
4. `tf.data.Dataset.batch`: Batch elements of the dataset after shuffling to get unique batches at each epoch.
5. `tf.data.Dataset.prefetch`: It is good practice to end the pipeline by prefetching for performance.
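
The reason the buffer size matters in step 3 can be seen in a small pure-Python model of a shuffle buffer. This is a simplified sketch of the general idea, not TensorFlow's actual implementation, and `buffered_shuffle` is a name made up for this illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order randomised through a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer

print(list(buffered_shuffle(range(10), buffer_size=1)))   # unchanged order
print(list(buffered_shuffle(range(10), buffer_size=3)))   # only locally mixed
print(list(buffered_shuffle(range(10), buffer_size=10)))  # full shuffle
```

With a buffer of 1 nothing is shuffled, and with a small buffer an element can only be emitted once it has entered the buffer, so early outputs always come from the start of the data. Only a buffer as large as the dataset lets every element land in any position.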


In [None]:
ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
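
For a sense of scale, the short calculation below works out how many batches one pass over the training data produces. It assumes the standard MNIST training split of 60,000 images and relies on `batch` keeping the final, smaller batch, which is its default behaviour.

```python
import math

train_examples = 60_000  # size of the MNIST training split
batch_size = 128         # same value as passed to batch() above

batches_per_epoch = math.ceil(train_examples / batch_size)
last_batch_size = train_examples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch_size)  # 469 96
```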


## Build an evaluation pipeline

Your testing pipeline is similar to the training pipeline, with two small differences:

* You don't need to call shuffle (it doesn't matter in what order we test).
* Caching is done after batching because batches can be the same between epochs.


In [None]:
ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

## Train the model

Plug the TFDS input pipeline into a simple Keras model, compile the model, and train it.

Notice a few things:

1. We are specifying the input shape to be 28 by 28 (size of the image)
2. We define one intermediate layer using 128 nodes (or neurons) with the activation function `relu`
3. The final output layer is of size 10 (one output per digit, as we're trying to see which single digit we believe the image matches). It has no activation function, so it outputs raw scores (logits)
4. We use the optimizer (or optimization function) called Adam, with a learning rate of 0.001
5. We use a loss function called `SparseCategoricalCrossentropy` (with `from_logits=True`, to match the output layer) and track the metric `SparseCategoricalAccuracy`
6. We use 6 epochs (an epoch is one complete pass over the entire training dataset)
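
To show what these layers, the loss, and the metric compute, here is a minimal NumPy sketch of a single forward pass. It is an illustration only, not how Keras implements it: the weights are random rather than trained, and the "images" and labels are random stand-ins, not MNIST data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 4 random "images" and 4 arbitrary labels (not real MNIST).
images = rng.random((4, 28, 28), dtype=np.float32)
labels = np.array([3, 1, 4, 1])

# Randomly initialised weights; a trained model would have learned these.
w1 = rng.normal(0.0, 0.05, size=(784, 128))
b1 = np.zeros(128)
w2 = rng.normal(0.0, 0.05, size=(128, 10))
b2 = np.zeros(10)

flat = images.reshape(len(images), -1)    # Flatten: (4, 28, 28) -> (4, 784)
hidden = np.maximum(flat @ w1 + b1, 0.0)  # Dense(128) followed by ReLU
logits = hidden @ w2 + b2                 # Dense(10): raw scores (logits)

# Sparse categorical cross-entropy from logits:
# log-softmax, then pick the log-probability of the true class.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(len(labels)), labels].mean()

# Sparse categorical accuracy: share of rows whose argmax equals the label.
accuracy = (logits.argmax(axis=1) == labels).mean()

print(flat.shape, hidden.shape, logits.shape)  # (4, 784) (4, 128) (4, 10)
print(loss, accuracy)
```

"Sparse" here means the labels are plain integers such as `3` rather than one-hot vectors, which is the form the dataset provides them in.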



In [None]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
)


You should now (in theory at least) see the accuracy increasing with each epoch. I would expect ~98% accuracy.

You could increase the number of epochs to improve the accuracy. When I ran the experiment, I got to 1.0 accuracy after 35 epochs, but my suspicion is that the model is overfitting at that point.
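
One way to probe that suspicion is to compare training and validation accuracy epoch by epoch: a gap that keeps growing suggests overfitting. The sketch below assumes the return value of `model.fit` was stored and that its `history` attribute is a dictionary with the key names Keras derives from `SparseCategoricalAccuracy`. The numbers are made-up placeholders, not results from this notebook, and `accuracy_gaps` is a hypothetical helper.

```python
# Placeholder values shaped like `history.history`; NOT measured results.
history = {
    "sparse_categorical_accuracy": [0.92, 0.96, 0.98, 0.99],
    "val_sparse_categorical_accuracy": [0.94, 0.96, 0.97, 0.97],
}

def accuracy_gaps(hist):
    """Training accuracy minus validation accuracy, one value per epoch."""
    train = hist["sparse_categorical_accuracy"]
    val = hist["val_sparse_categorical_accuracy"]
    return [round(t - v, 4) for t, v in zip(train, val)]

gaps = accuracy_gaps(history)
print(gaps)
```

In this made-up example the gap widens in the last epochs while validation accuracy stalls, which is the pattern to look out for.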