LeNet-5 Convolutional Neural Network
====================

In this tutorial I will show you how to implement the LeNet-5 CNN with TensorFlow. LeNet-5, a pioneering 7-level convolutional network that classifies digits, was applied by several banks to recognise hand-written numbers on checks (cheques) digitized in 32x32 pixel images. This tutorial is based on the article *"Gradient-based learning applied to document recognition"* by LeCun et al. (1998), which I suggest you read.

This tutorial requires the **MNIST dataset**. The dataset can be downloaded and prepared following [this notebook](./../mnist/mnist.ipynb). Here we need the train and test files (in TFRecord format) that have been produced at the end of the notebook.

Overview of LeNet-5 architecture
----------------------------

In this section we will take a look at the structure of LeNet-5. This will give you a wider perspective on the TensorFlow implementation. LeNet-5 comprises 7 layers (not counting the input), identified by the letters C (convolutional layer), S (sub-sampling layer) and F (fully-connected layer). It is important to notice that the original LeNet-5 was trained on a full set of characters that included letters, digits and symbols. Here we are going to use only digits.

<p align="center">
<img src="./etc/lenet5_architecture.png" width="750">
</p>

**Input**: LeNet-5 takes a 32x32 input matrix, whereas the MNIST images are 28x28, so they have to be padded. In the paper the larger input is explained as a way to let distinctive features (e.g. stroke end-points, corners) appear in the center of the receptive field of the highest-level feature detectors.
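
As a rough illustration of the padding step, here is a minimal pure-Python sketch that embeds a 28x28 image in a 32x32 frame by adding a 2-pixel border. The helper name `pad_image` and the use of zero as background value are assumptions made for this example; the TensorFlow pipeline later in this tutorial relies on `tf.pad` for the same purpose.

```python
def pad_image(image, border=2, fill=0):
    """Return a copy of a 2-D list with `border` rows/columns of `fill` on every side."""
    width = len(image[0]) + 2 * border
    top = [[fill] * width for _ in range(border)]
    middle = [[fill] * border + list(row) + [fill] * border for row in image]
    bottom = [[fill] * width for _ in range(border)]
    return top + middle + bottom

# A dummy 28x28 "image" filled with ones stands in for an MNIST digit.
mnist_like = [[1] * 28 for _ in range(28)]
padded = pad_image(mnist_like)
print(len(padded), len(padded[0]))  # 32 32
```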

**Layer C1**: is a convolutional layer with 6 feature maps of size 28x28, and each unit in a feature map is connected to a 5x5 neighborhood in the input. There are 156 trainable parameters.

**Layer S2**: is a sub-sampling layer with 6 feature maps of size 14x14, and each unit of a feature map is connected to a 2x2 neighborhood in C1. The result of the sub-sampling is passed to a sigmoidal function. This layer has 12 trainable parameters.

**Layer C3**: is another convolutional layer with 16 feature maps of size 10x10. Each unit is connected to a 5x5 neighborhood in S2.

**Layer S4**: is a sub-sampling layer with 16 feature maps of size 5x5. This layer has 32 trainable parameters.

**Layer C5**: is a convolutional layer with 120 feature maps, each connected to a 5x5 neighborhood on all 16 feature maps of S4. Since S4 is also 5x5, the feature maps of C5 are 1x1, which leads to a total of 48120 trainable parameters. This layer could be classified as fully connected; the original paper calls it convolutional because, if the input were made bigger (with everything else kept constant), the feature maps would be larger than 1x1.
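
The feature-map sizes quoted for the layers up to C5 follow from two simple rules: a convolution without padding and with a k x k kernel shrinks each side by k - 1, and a non-overlapping 2x2 sub-sampling halves it. The sketch below traces the side length from the input to C5; the helper names are chosen only for illustration.

```python
def conv_out(size, kernel=5):
    """Side length after a convolution without padding and with stride 1."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Side length after non-overlapping sub-sampling."""
    return size // window

sizes = {"input": 32}
sizes["C1"] = conv_out(sizes["input"])
sizes["S2"] = pool_out(sizes["C1"])
sizes["C3"] = conv_out(sizes["S2"])
sizes["S4"] = pool_out(sizes["C3"])
sizes["C5"] = conv_out(sizes["S4"])
print(sizes)  # {'input': 32, 'C1': 28, 'S2': 14, 'C3': 10, 'S4': 5, 'C5': 1}
```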

**Layer F6**: is a fully-connected layer with 84 units that computes a dot product between the input vector coming from C5 and the internal weights. The number of trainable parameters is 10164. The activation function is a hyperbolic tangent.
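
The parameter counts given for the layers can be checked by hand: every unit has one weight per input connection plus one bias, and the weights are shared across all positions of a feature map. The following sketch reproduces the numbers quoted for C1, S2, S4, C5 and F6. C3 is left out because in the original paper it uses a sparse connection scheme between S2 and C3 that is not covered here. The helper names are illustrative.

```python
def conv_params(in_maps, out_maps, kernel=5):
    # Each output map has kernel*kernel weights per input map, plus one bias.
    return out_maps * (in_maps * kernel * kernel + 1)

def subsample_params(maps):
    # One trainable coefficient and one bias per feature map.
    return maps * 2

def dense_params(inputs, units):
    # One weight per input plus one bias for every unit.
    return units * (inputs + 1)

counts = {
    "C1": conv_params(1, 6),
    "S2": subsample_params(6),
    "S4": subsample_params(16),
    "C5": conv_params(16, 120),
    "F6": dense_params(120, 84),
}
print(counts)  # {'C1': 156, 'S2': 12, 'S4': 32, 'C5': 48120, 'F6': 10164}
```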

**Output**: is composed of Euclidean Radial Basis Function units (RBF) one for each class, with 84 inputs each. The output of each unit is computed as follows:

$$y_{i} = \sum_j{(x_{j} - w_{ij})^{2}}$$

The equation represents the squared Euclidean distance between the input vector (the output of F6) and the weight vector. The further away the input vector is from the weight vector, the larger the RBF output. In the original paper the components of the parameter vectors were manually chosen and set to -1 or +1. An alternative is to choose those parameters at random with equal probabilities for -1 and +1. The representation of the output is distributed, meaning that characters that are similar (e.g. zero and uppercase O, one and uppercase I) will have similar output codes. It is important to remember that in the original implementation the network was trained on a full set of characters (letters, symbols, digits), whereas here we are going to train the network only on digits.
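
To make the formula concrete, here is a small pure-Python sketch of the RBF output. It is a toy example under stated assumptions: it uses 4 inputs instead of 84 and 3 classes instead of 10, the weight codes are made up for the illustration, and the function name `rbf_outputs` is not taken from the paper.

```python
def rbf_outputs(x, weights):
    """Squared Euclidean distance between x and each row of `weights`."""
    return [sum((xj - wij) ** 2 for xj, wij in zip(x, row)) for row in weights]

# One row of +1/-1 codes per class (hypothetical values).
weights = [[+1, +1, -1, -1],
           [-1, +1, -1, +1],
           [+1, -1, +1, -1]]

x = [+1, +1, -1, -1]           # identical to the code of class 0
y = rbf_outputs(x, weights)
print(y)                        # [0, 8, 8]

# The unit with the smallest output is the closest match.
best_class = y.index(min(y))
print(best_class)               # 0
```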

**Output (alternative version)**: in order to avoid the hand-tuning of the RBFs in the output we are going to use a common fully-connected layer. Moreover, since we only have 10 possible classes, we use a place-code representation (aka grandmother cell code) where each digit is coded as a one-hot vector. The raw output of the network is passed through a softmax function and the resulting vector is compared with the one-hot target through a cross-entropy error function.
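
The softmax and cross-entropy steps can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic with made-up logits, not the code used for training, where TensorFlow takes care of it.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """Cross entropy between a one-hot target and a probability vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

# Hypothetical raw outputs for the 10 digits: the network favours digit 3.
logits = [0.0] * 10
logits[3] = 2.0

target = [0] * 10
target[3] = 1  # one-hot code for digit 3

probs = softmax(logits)
loss = cross_entropy(target, probs)
print(probs.index(max(probs)), loss)
```

The loss shrinks towards zero as the probability assigned to the correct digit approaches one.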


Implementation in Tensorflow
--------------------------------

Here I will use the recently introduced **Estimators**, a high-level TensorFlow API that greatly simplifies machine learning programming. Estimators encapsulate the following actions: training, evaluation, prediction, export for serving. Using Estimators you can develop a state-of-the-art model with high-level intuitive code. In short, it is generally much easier to create models with Estimators than with the low-level TensorFlow APIs.

In [1]:
import tensorflow as tf


The model logic can be embedded into a single function, here called `model_fn`. The idea is to write our own function containing the model and pass it to the Estimator. In the same function we also have to define the operations specific to each of the three modes: predict, train, eval.

- When `model_fn` is called with `mode == ModeKeys.PREDICT`, the model function must return a `tf.estimator.EstimatorSpec` containing the following information: the mode, and the prediction. The model must have been trained prior to making a prediction. The trained model is stored on disk in the directory established when you instantiated the Estimator. 

- When `model_fn` is called with `mode == ModeKeys.TRAIN`, the model function must train the model.

- When `model_fn` is called with `mode == ModeKeys.EVAL`, the model function must evaluate the model, returning loss and possibly one or more metrics.


In [2]:
def my_model_fn(features, labels, mode):
    # Define the CNN model
    input_layer = tf.reshape(features, [-1, 32, 32, 1])
    c1 = tf.layers.conv2d(inputs=input_layer, filters=6, kernel_size=[5, 5],
                          padding="valid", activation=tf.nn.relu)
    s2 = tf.layers.max_pooling2d(inputs=c1, pool_size=[2, 2], strides=2, padding="valid")
    c3 = tf.layers.conv2d(inputs=s2, filters=16, kernel_size=[5, 5],
                          padding="valid", activation=tf.nn.relu)
    s4 = tf.layers.max_pooling2d(inputs=c3, pool_size=[2, 2], strides=2)
    s4_flat = tf.reshape(s4, [-1, 5 * 5 * 16])
    c5 = tf.layers.dense(inputs=s4_flat, units=120, activation=tf.nn.relu)
    f6 = tf.layers.dense(inputs=c5, units=84)
    logits = tf.layers.dense(inputs=f6, units=10)
    # PREDICT mode: labels are None here, so return before building the loss
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {"classes": tf.argmax(input=logits, axis=1),
                       "probabilities": tf.nn.softmax(logits, name="softmax_tensor")}
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    # Loss and accuracy are shared by the TRAIN and EVAL modes
    #loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
    accuracy = tf.metrics.accuracy(labels=tf.argmax(labels, axis=1), predictions=tf.argmax(logits, axis=1))
    # TRAIN mode
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        # Statistics/plot/summaries
        #accuracy = tf.equal(tf.argmax(labels, 1),tf.argmax(logits, 1))
        #accuracy = tf.reduce_mean(tf.cast(accuracy, tf.float32))
        # accuracy is a (value, update_op) pair; accuracy[1] updates and returns the running accuracy
        tf.summary.scalar('accuracy', accuracy[1])
        tf.summary.image("input_features", tf.reshape(features, [-1, 32, 32, 1]), max_outputs=3)
        tf.summary.image("c1_k1_feature_maps", tf.reshape(c1[:, :, :, 0], [-1, 28, 28, 1]), max_outputs=3) # c1, 1st kernel
        tf.summary.image("c3_k1_feature_maps", tf.reshape(c3[:, :, :, 0], [-1, 10, 10, 1]), max_outputs=3) # c3, 1st kernel
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    # EVAL mode
    elif mode == tf.estimator.ModeKeys.EVAL:
        eval_metric = {"accuracy": accuracy}
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric)

Now we can create the Estimator object and link our model function to it. We also have to specify a folder where the model checkpoints and the logs will be stored.

In [3]:
lenet5 = tf.estimator.Estimator(model_fn=my_model_fn, model_dir="/tmp/tf_model")

Preparing the MNIST dataset
--------------------------------

As mentioned at the beginning, we need the train and test files (in TFRecord format) produced at the end of [this notebook](./../mnist/mnist.ipynb). We are going to use the TensorFlow `Dataset` class to manage the features and labels.

In [4]:
def my_input_fn():
    def _parse_function(example_proto):
        features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
                    "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
        parsed_features = tf.parse_single_example(example_proto, features)
        image_decoded = tf.decode_raw(parsed_features["image"], tf.uint8) #char -> uint8
        image_reshaped = tf.reshape(image_decoded, [28, 28])
        padding = tf.constant([[2, 2], [2, 2]])
        image_padded = tf.pad(image_reshaped, padding, "CONSTANT") #padding to 32x32
        image = tf.cast(image_padded, tf.float32)
        label_one_hot = tf.one_hot(parsed_features["label"], depth=10, dtype=tf.int32)
        #label = tf.reshape(label_one_hot, [1, 10])
        return image, label_one_hot

    tf_train_dataset = tf.data.TFRecordDataset("./mnist_train.tfrecord")
    #tf_test_dataset = tf.data.TFRecordDataset("./mnist_test.tfrecord")
    tf_train_dataset = tf_train_dataset.map(_parse_function)
    #tf_test_dataset = tf_test_dataset.map(_parse_function)

    tf_train_dataset = tf_train_dataset.cache() # caches the parsed dataset (the result must be reassigned)
    tf_train_dataset = tf_train_dataset.shuffle(60000) # shuffle all the elements
    tf_train_dataset = tf_train_dataset.repeat(11) # repeats dataset this # times
    tf_train_dataset = tf_train_dataset.batch(32) # batch size
    
    iterator = tf_train_dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
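
To get a feeling for how much data this pipeline can deliver, the back-of-the-envelope sketch below counts the batches produced by `repeat(11)` and `batch(32)`. It assumes that the training file holds the 60000 MNIST training images (the value passed to `shuffle`); the variable names are only illustrative.

```python
import math

num_examples = 60000   # assumed size of the MNIST training set
epochs = 11            # value passed to repeat()
batch_size = 32        # value passed to batch()

batches_per_epoch = num_examples / batch_size
total_batches = math.ceil(num_examples * epochs / batch_size)

print(batches_per_epoch)  # 1875.0
print(total_batches)      # 20625
```

A training run of 2000 steps, like the one in the next section, therefore covers slightly more than one pass over the data.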

Training the model
---------------------

Now we have the dataset objects and we are ready to train the Estimator. Before starting the training we define a logging hook.

In [5]:
tensors_to_log = {"loss": "loss"} #, "probabilities": "softmax"}
logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=10)

In [6]:
lenet5.train(input_fn=my_input_fn, steps=2000, hooks=[logging_hook])

<tensorflow.python.estimator.estimator.Estimator at 0x7fe5f46c1d10>

Remember that you can follow the training on **TensorBoard** by running this command from a terminal: `tensorboard --logdir=/tmp/tf_model`. There is some useful information we may want to look at:

- *global_step/sec*: a performance indicator showing how many batches (gradient updates) we processed per second as the model trains.

- *loss*: the reported loss

- *accuracy (evaluation)*: pass `eval_metric_ops={'my_accuracy': accuracy}` to the `EstimatorSpec`

- *accuracy (training)*: use `tf.summary.scalar('accuracy', accuracy[1])`

- *images*: we can check input images using `tf.summary.image()`
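
A note on the accuracy metric: `tf.metrics.accuracy` returns a `(value, update_op)` pair and keeps two running counters, the number of correct predictions and the number of examples seen, which are accumulated batch after batch. The pure-Python class below is only a sketch of that bookkeeping; the class and method names are invented for this example and are not part of TensorFlow.

```python
class StreamingAccuracy:
    """Accumulates accuracy over several batches."""

    def __init__(self):
        self.total = 0.0  # correct predictions seen so far
        self.count = 0.0  # examples seen so far

    def update(self, labels, predictions):
        """Add one batch and return the running accuracy (like the update op)."""
        self.total += sum(1 for l, p in zip(labels, predictions) if l == p)
        self.count += len(labels)
        return self.value()

    def value(self):
        """Running accuracy without adding new data."""
        return self.total / self.count if self.count else 0.0

metric = StreamingAccuracy()
first = metric.update([3, 1, 4, 1], [3, 1, 4, 0])     # 3 of 4 correct
running = metric.update([5, 9, 2, 6], [5, 9, 2, 6])   # 4 of 4 correct
print(first, running)  # 0.75 0.875
```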


Evaluating the model on test set
-------------------------------------

In [None]:
def my_eval_input_fn():
    def _parse_function(example_proto):
        features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
                    "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
        parsed_features = tf.parse_single_example(example_proto, features)
        image_decoded = tf.decode_raw(parsed_features["image"], tf.uint8) #char -> uint8
        image_reshaped = tf.reshape(image_decoded, [28, 28])
        padding = tf.constant([[2, 2], [2, 2]])
        image_padded = tf.pad(image_reshaped, padding, "CONSTANT") #padding to 32x32
        image = tf.cast(image_padded, tf.float32)
        label_one_hot = tf.one_hot(parsed_features["label"], depth=10, dtype=tf.int32)
        #label = tf.reshape(label_one_hot, [1, 10])
        return image, label_one_hot

    tf_test_dataset = tf.data.TFRecordDataset("./mnist_test.tfrecord")
    tf_test_dataset = tf_test_dataset.map(_parse_function)

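    tf_test_dataset = tf_test_dataset.cache() # caches the parsed dataset (the result must be reassigned)
    tf_test_dataset = tf_test_dataset.repeat(1) # a single pass over the test set
    tf_test_dataset = tf_test_dataset.batch(32) # the model expects batched features and labels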
    
    iterator = tf_test_dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels

In [8]:
lenet5.evaluate(input_fn=my_input_fn, steps=2)

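{'accuracy': 0.890625, 'global_step': 2000, 'loss': 0.29600996}

Note that the call above passes `my_input_fn` with `steps=2`, so the reported accuracy is computed on two batches (64 images) of the training set. To evaluate on the test set, pass `input_fn=my_eval_input_fn` instead and omit `steps`, so that the whole test file is consumed.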