## Exercise 7

Build your own CNN and try to achieve the highest possible accuracy on MNIST.

### Introduction

Although TensorFlow r1.3 was just released, I'll stick with r1.2, the version used in the book.

What will I try?

1. Input images are 28x28 pixels grayscale, so one channel.
2. I'll add a convolutional layer with a certain number of trainable 3x3 filters, using ReLU (Rectified Linear Unit) as the activation function.
    - See [`tf.layers.conv2d`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/layers/conv2d)
3. Then a pooling layer to shrink the feature maps produced by the convolutional layer.
4. I may repeat these two layers a few times, but for now I'll just try one of each.
5. End with a fully connected layer that maps to ten outputs, one per digit. Softmax?

Useful links:
* [TensorFlow r1.2 Python API](https://www.tensorflow.org/versions/r1.2/api_docs/python/)
* [MNIST For ML Beginners](https://www.tensorflow.org/get_started/mnist/beginners)
* [TensorFlow Mechanics 101](https://www.tensorflow.org/get_started/mnist/mechanics)

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

TensorFlow has a built-in loader for the MNIST example data set.

The `one_hot` parameter controls how the targets (labels) are encoded (see [this Quora question](https://www.quora.com/What-does-the-one_hot-True-parameter-on-the-MNIST-tensorflow-for-beginners-example-mean)). With `one_hot=True`, the label of an image showing the digit 3 is not the integer 3 but a vector of ten values that are all 0 except for a 1 at the index of class '3'. I load the data with `one_hot=False`, so each label stays a plain integer class index, which is the form `tf.nn.sparse_softmax_cross_entropy_with_logits` expects further down.
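
To make the two label encodings concrete, here is a minimal NumPy sketch. The three labels are made-up values for illustration, not data read from MNIST.

```python
import numpy as np

# Hypothetical integer labels for three images (digits 3, 0 and 9).
labels = np.array([3, 0, 9])

# One-hot encoding: row i is all zeros except for a 1.0 at index labels[i].
one_hot = np.eye(10, dtype=np.float32)[labels]
print(one_hot.shape)

# argmax along each row recovers the integer labels from the one-hot rows.
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

Integer labels are the more compact of the two, and converting in either direction is a one-liner.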

In [2]:
mnist = input_data.read_data_sets('MNIST_data/', one_hot=False)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The `mnist` variable is a collection of three data sets:
- one set to train with
- one set to validate our trained network, to know whether we are improving or not
- one set to test our validated network, at the very end

In [3]:
mnist

Datasets(train=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x110434cc0>, validation=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x10d132b70>, test=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x110426908>)

Create a placeholder `X`, which we will feed with the images. Note that `tf.layers.conv2d` requires a 4D-Tensor as input, consisting of:
- batch size
- height
- width
- channel count

We also need a placeholder `y` for the labels:

In [4]:
height   = 28
width    = 28
channels = 1
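# Images: (batch size, height, width, channels); None leaves the batch size open.
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
# Labels: one integer class index per image. Note the trailing comma: (None) is just None.
y = tf.placeholder(tf.int32, shape=(None,))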
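
As an illustration of the 4D layout, the sketch below reshapes a batch of flattened images into `(batch, height, width, channels)` with NumPy. The random array is only a stand-in for real MNIST pixels.

```python
import numpy as np

height, width, channels = 28, 28, 1

# Stand-in for a batch of 50 flattened images (hypothetical random data).
flat_batch = np.random.rand(50, height * width).astype(np.float32)

# -1 lets NumPy infer the batch size from the remaining dimensions.
images_4d = flat_batch.reshape(-1, height, width, channels)
print(images_4d.shape)  # (50, 28, 28, 1)
```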

Let's create a convolutional layer. I don't know yet what good values for the hyperparameters are, so as a first guess let's try 5 filters of size 5x5 with `'same'` padding:

In [5]:
conv_layer = tf.layers.conv2d(inputs=X, filters=5, kernel_size=5, padding='same', activation=tf.nn.relu)
conv_layer

<tf.Tensor 'conv2d/Relu:0' shape=(?, 28, 28, 5) dtype=float32>

So we get a Tensor containing 5 feature maps, one for each filter. Because of `padding='same'` and the default stride of 1, each feature map keeps the input size of 28x28 neurons.
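
For a rough sense of scale, this sketch counts the trainable parameters and the activations per image for the layer above. It assumes the layer has one bias per filter (the default for `tf.layers.conv2d`); the helper name is mine.

```python
def conv2d_param_count(kernel_size, in_channels, filters):
    """Weights for every (kernel position, input channel, filter) plus one bias per filter."""
    weights = kernel_size * kernel_size * in_channels * filters
    biases = filters
    return weights + biases

n_params = conv2d_param_count(kernel_size=5, in_channels=1, filters=5)
n_activations = 28 * 28 * 5  # one value per neuron in the 5 feature maps

print(n_params)       # 130
print(n_activations)  # 3920
```

So the convolutional layer itself is tiny; most of the values flowing through the network are activations, not weights.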

Next, a max-pooling layer to downsample the feature maps:

In [6]:
pooling_layer = tf.layers.max_pooling2d(inputs=conv_layer, pool_size=3, strides=3, padding='same')
pooling_layer

<tf.Tensor 'max_pooling2d/MaxPool:0' shape=(?, 10, 10, 5) dtype=float32>

The pooling layer contains the same number of feature maps as before, but each one is reduced in size from 28x28 to 10x10, because the stride is 3 and `padding='same'` rounds the output size up:

In [7]:
import math
math.ceil(28 / 3)

10
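
The same calculation can be generalized a little. The sketch below follows TensorFlow's documented output-size rules for `'same'` and `'valid'` padding along one dimension; the helper `output_size` is a hypothetical name, not part of the TensorFlow API.

```python
import math

def output_size(input_size, window, stride, padding):
    """Spatial output size along one dimension for a convolution or pooling window."""
    if padding == 'same':
        return math.ceil(input_size / stride)
    if padding == 'valid':
        return math.ceil((input_size - window + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")

print(output_size(28, window=3, stride=3, padding='same'))   # 10
print(output_size(28, window=3, stride=3, padding='valid'))  # 9
print(output_size(28, window=5, stride=1, padding='same'))   # 28
```

With `'valid'` padding the pooling layer would have produced 9x9 maps instead, because the last incomplete window is dropped rather than padded.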

So we have 10x10x5 = 500 neurons there. We need to feed these to a fully-connected network.

In [8]:
output_layer = tf.contrib.layers.fully_connected(inputs=pooling_layer, num_outputs=10, activation_fn=None)
output_layer

<tf.Tensor 'fully_connected/BiasAdd:0' shape=(?, 10, 10, 10) dtype=float32>
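
The shape `(?, 10, 10, 10)` suggests that the layer was applied along the last axis only. The NumPy sketch below mimics that with plain matrix products on random stand-in data (biases left out) and contrasts it with flattening first, which in TensorFlow could be done with something like `tf.reshape(pooling_layer, [-1, 10 * 10 * 5])`.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for the pooling output of a batch of 2 images: (batch, 10, 10, 5).
pooled = rng.rand(2, 10, 10, 5)

# Applied to the 4D tensor, a dense layer only mixes the last axis:
# a (5, 10) weight matrix maps 5 channels to 10 outputs at every position.
w_last_axis = rng.rand(5, 10)
per_position = pooled @ w_last_axis
print(per_position.shape)  # (2, 10, 10, 10)

# Flattening first gives one row of 500 values per image, so a
# (500, 10) weight matrix yields a single vector of 10 logits per image.
flat = pooled.reshape(len(pooled), -1)
w_flat = rng.rand(500, 10)
logits = flat @ w_flat
print(logits.shape)  # (2, 10)
```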

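I still need to define a cost or loss function and optimize it. Because `fully_connected` was applied directly to the 4D pooling output, it acts only on the last (channel) axis, so the output layer is a 4D-Tensor with shape (?, 10, 10, 10), meaning:
- a yet unknown number of instances, each with
- 10 rows of
- 10 columns of spatial positions taken over from the pooling layer, and
- 10 output values at each position (one per digit class 0-9).

The cross entropy function below expects `labels` to be a 1D-Tensor with shape (?,) and `logits` to be a 2D-Tensor with shape (?, 10). That's why we slice the 4D-Tensor down to a 2D-Tensor by keeping only spatial position (0, 0). Be aware that this throws away the outputs at the other 99 positions, so the loss only sees the top-left corner of the pooled feature maps; flattening the pooling output to shape (?, 500) before the fully connected layer would avoid this. For now, the slice looks like this: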

In [9]:
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=y,
    logits=output_layer[:,0,0,:])
loss = tf.reduce_mean(cross_entropy)
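
To show what the loss computes, here is a plain-Python sketch of softmax cross entropy with integer labels. It is an illustration of the formula, not TensorFlow's implementation, and the logits are made-up numbers.

```python
import math

def sparse_softmax_cross_entropy(labels, logits):
    """Per-example cross entropy for integer labels and unnormalized logits."""
    losses = []
    for label, row in zip(labels, logits):
        # Subtract the row maximum before exponentiating for numerical stability.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        # -log(softmax(row)[label]) equals log_sum_exp - row[label].
        losses.append(log_sum_exp - row[label])
    return losses

# Hypothetical logits for two examples with three classes each.
example_logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
example_labels = [0, 2]

per_example = sparse_softmax_cross_entropy(example_labels, example_logits)
mean_loss = sum(per_example) / len(per_example)  # the mean, like tf.reduce_mean
print(per_example, mean_loss)
```

A row of identical logits gives a loss of `log(number of classes)`, which is a handy baseline: with ten classes, a network that has learned nothing sits near `log(10)`, roughly 2.3.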

And train using good old gradient descent:

In [10]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(loss)
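
The update rule behind plain gradient descent fits in a few lines. This sketch applies it to a toy one-dimensional loss of my own choosing, using the same learning rate of 0.01; it is not how `GradientDescentOptimizer` is implemented internally.

```python
def gradient_descent(grad_fn, w, learning_rate=0.01, n_steps=1000):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    for _ in range(n_steps):
        w = w - learning_rate * grad_fn(w)
    return w

# Toy loss (hypothetical): f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # very close to the minimum at w = 3
```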

Let's train!

In [None]:
n_epochs = 5
batch_size = 50
n_batches = mnist.train.num_examples // batch_size

Note that we must `reshape` the images in the batch, since `images` will be of shape `(50, 784)` and our convolutional layer requires an input of shape `(?, 28, 28, 1)`:

In [None]:
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    for epoch in range(n_epochs):
        print("\repoch {}/{}".format(epoch + 1, n_epochs), end='')
        for batch in range(n_batches):
            images, labels = mnist.train.next_batch(batch_size)
            # -1 infers the batch dimension, so the reshape follows whatever next_batch returns
            reshaped = np.reshape(images, (-1, height, width, channels))
            sess.run(training_op, feed_dict={X: reshaped, y: labels})

epoch 5/5