## Exercise 7

Build your own CNN and try to achieve the highest possible accuracy on MNIST.

## Approach

As per [Caffe2's MNIST tutorial](https://caffe2.ai/docs/tutorial-MNIST.html), we are going to build a small CNN based on the LeNet architecture. The table lists the layers from the output (top row) down to the input (bottom row):

Layer  | Type            | Maps | Size    | Kernel size | Stride | Activation
-------|-----------------|------|---------|-------------|--------|-----------
logits | Fully Connected | -    | 10      | -           | -      | none (softmax is applied in the loss)
F5     | Fully Connected | -    | 500     | -           | -      | relu
S4     | Max Pooling     | 100  | 4 x 4   | 2 x 2       | 2      | -
C3     | Convolution     | 100  | 8 x 8   | 5 x 5       | 1      | tanh
S2     | Max Pooling     | 20   | 12 x 12 | 2 x 2       | 2      | -
C1     | Convolution     | 20   | 24 x 24 | 5 x 5       | 1      | tanh
X      | Input           | 1    | 28 x 28 | -           | -      | -
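The spatial size of each layer can be derived with the usual rule for an unpadded ('valid') convolution or pooling window: output = floor((input - kernel) / stride) + 1. The following is a minimal sketch of that arithmetic; the helper name `valid_output_size` is made up for illustration and is not part of TensorFlow.

```python
def valid_output_size(n, kernel, stride):
    """Output width of an unpadded ('valid') convolution or pooling window."""
    return (n - kernel) // stride + 1

size = 28                             # X:  28 x 28 input image
c1 = valid_output_size(size, 5, 1)    # C1: 5 x 5 kernel, stride 1
s2 = valid_output_size(c1, 2, 2)      # S2: 2 x 2 pooling, stride 2
c3 = valid_output_size(s2, 5, 1)      # C3: 5 x 5 kernel, stride 1
s4 = valid_output_size(c3, 2, 2)      # S4: 2 x 2 pooling, stride 2
flat = s4 * s4 * 100                  # 100 feature maps flattened for F5

print(c1, s2, c3, s4, flat)  # 24 12 8 4 1600
```

Since the images are square, one dimension is enough to track.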

In [1]:
import tensorflow as tf

First define the data we will be feeding the network at each training step:

- `X` is a batch of 28 x 28 grayscale (= 1 channel) images. The MNIST dataset stores each image as a flat sequence of 28 x 28 = 784 pixel values, so we'll first reshape them into 2D images.
- `y` is a batch of class labels: one integer digit in the range 0-9 per image.
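To make the reshape concrete, here is a small pure-Python sketch that assumes the flat values are stored row by row (row-major order, which is also how `tf.reshape` fills its output). The values are stand-ins numbered 0 to 783, not real pixel intensities.

```python
WIDTH = HEIGHT = 28

# A stand-in "image": 784 values numbered 0..783 instead of real pixels.
flat = list(range(WIDTH * HEIGHT))

# Row-major reshape: every run of 28 consecutive values becomes one image row.
image = [flat[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]

print(len(flat))                    # 784
print(len(image), len(image[0]))    # 28 28
print(image[1][0])                  # 28: the first pixel of the second row
```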

In [2]:
X = tf.placeholder(tf.float32, shape=[None, 28 * 28 * 1])
y = tf.placeholder(tf.int32,   shape=[None])

Then we build the model according to the table above. This should be fairly straightforward, except perhaps for the last layer.

The last layer of the network contains ten output neurons. Like every neuron in the network, each of them multiplies its inputs $x_i$ by its weights $w_i$ and adds a bias term $b$:

\begin{equation}
z = \sum_i w_i x_i + b
\end{equation}

Note, however, that they have no activation function: they pass the result of this calculation on as-is, without first applying _tanh_ or _relu_. Their output values are therefore not limited to a fixed range such as [0, 1]; they can be any real number, positive or negative.
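As a rough illustration, the sketch below computes the output of a single neuron without an activation function. The function name `neuron_output` and all numbers are made up for this example.

```python
def neuron_output(weights, inputs, bias):
    """Weighted sum of the inputs plus a bias, with no activation applied."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

inputs = [0.5, -1.0, 2.0]   # made-up activations from the previous layer

print(neuron_output([1.0, 2.0, 3.0], inputs, 0.5))     # 5.0
print(neuron_output([-2.0, 1.0, -1.0], inputs, 0.0))   # -4.0
```

Both results fall outside [0, 1], which is why these raw outputs cannot be read directly as probabilities.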

Now, we want to teach each output neuron to output a higher value when it thinks the input image contains "its" digit, and a lower value when it does not.

We therefore treat these output values as [logits](https://stats.stackexchange.com/questions/52825/what-does-the-logit-value-actually-mean#52836), or _logarithmic odds_ (log-odds). Given a probability $p$, its odds are defined as:

\begin{equation}
\frac{p}{1 - p}
\end{equation}

The log-odds are simply the natural logarithm of the odds:

\begin{equation}
\ln \frac{p}{1 - p}
\end{equation}

The relationship can be shown graphically as well:

![Curve](https://i.stack.imgur.com/h6N7o.png)

The takeaway is that the logit increases monotonically with the probability, so we want to teach each output neuron to produce a higher _logit_ when it thinks the input image contains "its" digit.
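The relationship between probabilities and log-odds can also be checked numerically. This is a minimal sketch for a single probability $p$; the function names `logit` and `sigmoid` are chosen here for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities below 0.5 give negative logits, above 0.5 positive ones.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))
```

A probability of 0.5 corresponds to a logit of exactly 0, and applying `sigmoid` to a logit recovers the original probability.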

In [3]:
with tf.name_scope('cnn'):

    # Image: [784] --> [28 x 28 x 1].
    X_reshaped = tf.reshape(X, shape=[-1, 28, 28, 1])
    
    # Image: [28 x 28 x 1] --> [24 x 24 x 20].
    c1 = tf.layers.conv2d(
        X_reshaped,
        filters=20,
        kernel_size=[5, 5],
        strides=[1, 1],
        padding='valid',
        activation=tf.nn.tanh)

    # Image: [24 x 24 x 20] --> [12 x 12 x 20].
    s2 = tf.layers.max_pooling2d(
        c1,
        pool_size=[2, 2],
        strides=[2, 2],
        padding='valid')

    # Image: [12 x 12 x 20] --> [8 x 8 x 100].
    c3 = tf.layers.conv2d(
        s2,
        filters=100,
        kernel_size=[5, 5],
        strides=[1, 1],
        padding='valid',
        activation=tf.nn.tanh)  # tanh, as listed for C3 in the table above

    # Image: [8 x 8 x 100] --> [4 x 4 x 100].
    s4 = tf.layers.max_pooling2d(
        c3,
        pool_size=[2, 2],
        strides=[2, 2],
        padding='valid')

    # Image: [4 x 4 x 100] --> [1600].
    s4_reshaped = tf.reshape(s4, [-1, 4 * 4 * 100])

    # Image: [1600] --> [500].
    f5 = tf.layers.dense(
        s4_reshaped,
        units=500,
        activation=tf.nn.relu)

    # Image: [500] --> [10]
    logits = tf.layers.dense(
        f5,
        units=10,
        activation=None)

In order to learn something, we must define a loss or cost function that we will aim to minimize while learning. For each image we feed the network, the loss tells us in essence how large the difference was between the answer given by the network and the desired answer, that is, the ground truth. Ideally we can minimize the loss to a value close to zero.

Recall that when we feed the network a single image, we end up with ten _logits_ at the output end of the network. We also have one integer in the range 0-9 that indicates which digit is visible in that image. How do we calculate a loss from this combination?

**TODO:** Use one-hot encoding and the non-sparse cross entropy function below.
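One way to answer this question is softmax cross entropy: convert the ten logits into probabilities with softmax, then take the negative logarithm of the probability assigned to the true class. The pure-Python sketch below computes this per-image loss from an integer label and from a one-hot vector and shows that both give the same value. It is an illustration of the math under made-up logits, not TensorFlow's implementation, and the function names are chosen here for the example.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label):
    """Loss from an integer class label: -ln(probability of the true class)."""
    return -math.log(softmax(logits)[label])

def one_hot_cross_entropy(logits, one_hot):
    """Loss from a one-hot label vector: -sum(t_i * ln(p_i))."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [0.0] * 10
logits[3] = 4.0   # made-up logits: the network favours digit 3
label = 3
one_hot = [1.0 if i == label else 0.0 for i in range(10)]

print(sparse_cross_entropy(logits, label))      # small loss: correct class
print(one_hot_cross_entropy(logits, one_hot))   # same value as the line above
print(sparse_cross_entropy(logits, 7))          # larger loss: wrong class
```

The sparse form simply skips building the one-hot vector, since every term except the one for the true class is multiplied by zero.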


In [4]:
with tf.name_scope('loss'):

    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y,
        logits=logits)

    loss = tf.reduce_mean(xentropy)

In [5]:
with tf.name_scope('eval'):
    
    correct  = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [6]:
learning_rate = 0.001

with tf.name_scope('train'):
    
    optimizer   = tf.train.AdamOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [9]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data')

X_test = mnist.test.images
y_test = mnist.test.labels

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [11]:
from datetime import datetime

now         = datetime.utcnow().strftime('%Y%m%d%H%M%S')
root_logdir = 'tf_logs'
logdir      = '{}/run-{}/'.format(root_logdir, now)

In [12]:
loss_summary      = tf.summary.scalar('loss', loss)
acc_train_summary = tf.summary.scalar('acc_train', accuracy)
acc_test_summary  = tf.summary.scalar('acc_test',  accuracy)

file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

In [None]:
n_epochs   = 5
batch_size = 100
n_batches  = mnist.train.num_examples // batch_size

with tf.Session() as sess:
    tf.global_variables_initializer().run()

    for epoch in range(n_epochs):
        
        # Each epoch we train all batches.
        for batch in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

            # Evaluate the network every 10 batches, so we can visualize
            # its progress in TensorBoard.
            if batch % 10 == 0:
                step = epoch * n_batches + batch
                file_writer.add_summary(
                    loss_summary.eval(feed_dict={X: X_batch, y: y_batch}),
                    step)
                file_writer.add_summary(
                    acc_train_summary.eval(feed_dict={X: X_batch, y: y_batch}),
                    step)
                file_writer.add_summary(
                    acc_test_summary.eval(feed_dict={X: X_test,  y: y_test}),
                    step)
        
        print("\r{} of {} epochs".format(epoch + 1, n_epochs), end='')

# Flush any pending summaries to disk and close the log file.
file_writer.close()