Based on [Deep MNIST for Experts Tutorial](https://www.tensorflow.org/versions/r0.10/tutorials/mnist/pros/index.html)

This is the result of me going through the Deep MNIST for Experts Tutorial and adding some more explanations to areas I found confusing. I've also turned a lot of magic numbers into variables - both to explain them and to allow tweaking. I then decided to see what could be done with a smaller network - the tutorial says it may take half an hour to run, but clearly that's on beefier hardware than my mid-2014 13" MacBook Pro.

The original layer sizes in the tutorial (number of features or neurons per layer) are:
- First Convolutional Layer: 32
- Second Convolutional Layer: 64
- Fully Connected Layer: 1024

The tutorial is set to run for 20,000 iterations to achieve 99.2% accuracy.

I've reduced this to:
- First Convolutional Layer: 8
- Second Convolutional Layer: 16
- Fully Connected Layer: 64

I train for a maximum of 10001 iterations, with some extra stopping conditions that end training early when validation accuracy stops improving or starts to decline. This results in about 98% accuracy. It takes about 35s per 100 training iterations on my laptop, as compared to about 160-170s with the original parameters, and training will generally stop around 7000 iterations.

Basically, I was able to make the network quite a lot smaller and faster to train, while only making it a little bit dumber. This shows how that last little bit more performance can be quite expensive!
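To put a rough number on "quite a lot smaller", here is a small sketch that counts the trainable weights and biases for both sets of layer sizes. It assumes the architecture used further down in this notebook (5x5 patches, one input channel, a 7x7 image after two rounds of pooling, 10 output classes). The function name `count_parameters` is my own and is not part of the tutorial.

```python
def count_parameters(conv1_features, conv2_features, fc_size,
                     patch_size=5, pooled_size=7, num_classes=10):
    """Count weights plus biases for the conv-conv-fc-readout network."""
    # First conv layer: one input channel (greyscale image)
    conv1 = patch_size * patch_size * 1 * conv1_features + conv1_features
    # Second conv layer: its input channels are the first layer's features
    conv2 = patch_size * patch_size * conv1_features * conv2_features + conv2_features
    # Fully connected layer: input is the flattened pooled image
    fc = pooled_size * pooled_size * conv2_features * fc_size + fc_size
    # Readout layer: one output per digit
    readout = fc_size * num_classes + num_classes
    return conv1 + conv2 + fc + readout

original = count_parameters(32, 64, 1024)
reduced = count_parameters(8, 16, 64)
print(original)             # 3274634
print(reduced)              # 54314
print(original // reduced)  # 60
```

So the reduced network has roughly 60 times fewer parameters, most of the saving coming from the fully connected layer.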

In [2]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


Start Interactive Session:

In [3]:
import tensorflow as tf
sess = tf.InteractiveSession()

What does a batch look like?


In [4]:
mnist.train.next_batch(1)

(array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.

In [5]:
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
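The two placeholder shapes follow from how the data is laid out: each image is a 28x28 grid flattened into 784 values, and each label is a one-hot vector of length 10 (because of `one_hot=True` when loading the data). A minimal plain-Python sketch of that encoding, where `one_hot` is a hypothetical helper of mine and not a TensorFlow function:

```python
def one_hot(digit, num_classes=10):
    """Return a list with 1.0 at the digit's index and 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]

image_height, image_width = 28, 28
flattened_size = image_height * image_width
print(flattened_size)  # 784
print(one_hot(3))      # 1.0 in position 3, zeros elsewhere
```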

**Variables**

A variable lives in TensorFlow's computation graph, and is used or modified by the computation. Model parameters are typically defined as variables. Variables must be initialized before they are used within a session; this can be done for all variables at once with `tf.initialize_all_variables()`.


For this CNN, we want weights with a bit of noise, and we want the bias to be slightly positive to avoid dead ReLU neurons.

Truncated normal: random values from a normal distribution, but ensures that all values are within 2 stddev of the mean.

In [6]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
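To illustrate what "truncated" means here, this is a plain-Python sketch of the idea: draw from a normal distribution and re-draw anything that lands more than 2 standard deviations from the mean. It is only meant to show the behaviour, not how TensorFlow implements it, and `truncated_normal_samples` is a name I made up.

```python
import random

def truncated_normal_samples(n, mean=0.0, stddev=0.1, seed=None):
    """Draw n normal samples, re-drawing any that fall outside 2 stddev."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples

weights = truncated_normal_samples(1000, stddev=0.1, seed=0)
# With stddev=0.1, every sample lies within [-0.2, 0.2]
print(min(weights) >= -0.2 and max(weights) <= 0.2)  # True
```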

Convolutions with a stride of one, zero padded so the output is the same size as the input.

Max pooling over 2x2 blocks.

In [7]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1,2,2,1], padding='SAME')

First convolutional layer:

Compute a set of features (8 here, set by `conv1_num_features`) for each 5x5 patch in the image.
The 2x2 max pooling then reduces each 28x28 image to 14x14.

In [8]:
conv1_num_features = 8
patch_size = 5

W_conv1 = weight_variable([patch_size, patch_size, 1, conv1_num_features])
b_conv1 = bias_variable([conv1_num_features])

In [9]:
x_image = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

Second convolutional layer:

More features for each 5x5 patch of the image, which has now been compressed to 14x14 by pooling.

In [10]:
conv2_num_features = 16
W_conv2 = weight_variable([patch_size, patch_size, conv1_num_features, conv2_num_features])
b_conv2 = bias_variable([conv2_num_features])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

We've now reduced the image to 7x7 by applying 2x2 pooling again
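The 28 -> 14 -> 7 progression can be checked with a little arithmetic. With `padding='SAME'`, the output size along each spatial dimension is the input size divided by the stride, rounded up. A quick sketch (the helper name `same_padding_output_size` is mine):

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size under 'SAME' padding: ceil(input_size / stride)."""
    return int(math.ceil(float(input_size) / stride))

after_conv1 = same_padding_output_size(28, stride=1)           # conv keeps the size
after_pool1 = same_padding_output_size(after_conv1, stride=2)  # pooling halves it
after_conv2 = same_padding_output_size(after_pool1, stride=1)
after_pool2 = same_padding_output_size(after_conv2, stride=2)
print(after_conv1, after_pool1, after_conv2, after_pool2)  # 28 14 14 7
```

This is where the `7*7` in the fully connected layer's weight shape comes from.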

Add a fully connected layer of neurons to bring the pieces together and process the entire image.

In [11]:
fc_layer_size = 64
W_fc1 = weight_variable([7*7*conv2_num_features, fc_layer_size])
b_fc1 = bias_variable([fc_layer_size])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*conv2_num_features])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

Dropout to reduce overfitting

In [12]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
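As a rough illustration of what dropout does to the activations: each value is kept with probability `keep_prob` and scaled up by `1 / keep_prob`, otherwise it is set to zero, so the expected sum stays the same. The plain-Python function below is my own sketch of that behaviour, not TensorFlow's implementation.

```python
import random

def dropout(values, keep_prob, seed=None):
    """Keep each value with probability keep_prob (scaled by 1/keep_prob), else zero it."""
    rng = random.Random(seed)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, keep_prob=0.5, seed=0)
print(dropped)                                  # each entry is either 0.0 or doubled
print(dropout(activations, keep_prob=1.0))      # [1.0, 2.0, 3.0, 4.0]
```

This is why the training loop feeds `keep_prob: 0.5` when training and `keep_prob: 1.0` when measuring accuracy: with a keep probability of 1.0 the layer passes everything through unchanged.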

Readout layer - Softmax
Size is 10 because there are 10 digits

In [15]:
readout_size = 10
W_fc2 = weight_variable([fc_layer_size, readout_size])
b_fc2 = bias_variable([readout_size])

y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
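One thing to be aware of: the training cell below takes `tf.log` of this softmax output. If a predicted probability underflows to exactly zero, the log is undefined, which can derail training. That may be related to the occasional accuracy collapse mentioned in the training code, though that is a guess on my part. The plain-Python sketch below compares the two-step calculation with the log-sum-exp form, which avoids ever forming a zero probability. The function names are mine. TensorFlow also has a fused `tf.nn.softmax_cross_entropy_with_logits` op intended for this situation, which I haven't used here.

```python
import math

def naive_cross_entropy(logits, label_index):
    """-log(softmax(logits)[label]) computed as two separate steps."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label_index])  # raises ValueError if the probability is 0.0

def stable_cross_entropy(logits, label_index):
    """The same quantity via the log-sum-exp trick."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label_index]

# On ordinary inputs the two agree
ordinary = [2.0, 1.0, 0.1]
print(abs(naive_cross_entropy(ordinary, 0) - stable_cross_entropy(ordinary, 0)) < 1e-9)  # True

# On extreme inputs the naive version breaks, the stable one does not
extreme = [0.0, -800.0]
try:
    naive_cross_entropy(extreme, 1)
    naive_failed = False
except ValueError:
    naive_failed = True
print(naive_failed)                      # True
print(stable_cross_entropy(extreme, 1))  # 800.0
```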

Train & Evaluate the CNN model

- We will stop training when any of these happens:
    - After 10000 iterations
    - When the validation accuracy doesn't improve by at least 0.001 over the previous check for 3 validation cycles
    - When the validation accuracy drops more than 0.1 below the best seen so far and doesn't recover for 3 validation cycles (these cycles share a counter with the previous condition)
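The stopping rules can be tried out on their own, without training anything. The sketch below replays the same logic as the training loop over a list of validation accuracies and reports at which validation check training would stop. The function name `stopping_check` and the accuracy values are made up purely for illustration.

```python
def stopping_check(validation_accuracies, min_improvement=0.001,
                   max_drop=0.1, patience=3):
    """Return the index of the validation check that triggers a stop, or None."""
    previous = 0.0
    best = 0.0
    checks_without_improvement = 0
    for index, accuracy in enumerate(validation_accuracies):
        if accuracy > best:
            best = accuracy
        stalled = accuracy < previous + min_improvement
        collapsed = accuracy < best - max_drop
        if stalled or collapsed:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return index
        else:
            checks_without_improvement = 0
        previous = accuracy
    return None

# Accuracy plateaus: three checks in a row gain less than 0.001
print(stopping_check([0.88, 0.93, 0.95, 0.9505, 0.9504, 0.9506, 0.97]))  # 5
# Accuracy collapses and stays more than 0.1 below the best
print(stopping_check([0.9, 0.95, 0.1, 0.3, 0.5]))                        # 4
# Accuracy keeps improving: no early stop
print(stopping_check([0.5, 0.6, 0.7, 0.8]))                              # None
```

Note that in the second example the accuracy is actually climbing again (0.1, 0.3, 0.5), but it still counts as "not recovered" because it remains well below the best value.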


In [16]:
import time

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.initialize_all_variables())

max_iterations = 10001 # ensure we get the last validation!
prev_validation_accuracy = 0
max_validation_accuracy = 0

# stop if the validation accuracy doesn't improve by 0.001 for 3 consecutive validation checks
validation_difference = 0.001
max_validations_with_no_improvement = 3
validations_with_no_improvement = 0

saver = tf.train.Saver()
best_model_path = ""
start_time = time.time()
for i in range(max_iterations):
    batch = mnist.train.next_batch(500)

    if i%100 == 0 and i != 0:
        train_accuracy = accuracy.eval(feed_dict={
                x:batch[0], y_:batch[1], keep_prob:1.0
            })
        print("step %d, trainining accuracy %g %g seconds elapsed"%(i, train_accuracy, time.time()-start_time))
    if i%500 == 0 and i != 0:
        val = mnist.validation.next_batch(1000)
        validation_accuracy = accuracy.eval(feed_dict={
                x:val[0], y_:val[1], keep_prob:1.0
            })
        print("**step %d, validation accuracy %g %g seconds**"%(i, validation_accuracy, time.time()-start_time))
        if validation_accuracy > max_validation_accuracy:
            best_model_path = saver.save(sess, 'mnist-tutorial', global_step=i)
            max_validation_accuracy = validation_accuracy
        if (validation_accuracy < (prev_validation_accuracy + validation_difference)) \
            or validation_accuracy < max_validation_accuracy - 0.1: # Sometimes weights hit 0 and accuracy plummets
            validations_with_no_improvement += 1
            if validations_with_no_improvement >= max_validations_with_no_improvement:
                break
        else:
            validations_with_no_improvement = 0
        
        prev_validation_accuracy = validation_accuracy
           
    train_step.run(feed_dict={x:batch[0], y_:batch[1], keep_prob:0.5})

print("test accuracy on last model %g"%accuracy.eval(feed_dict={x:mnist.test.images, y_:mnist.test.labels, keep_prob:1.0}))
# load best model to test
saver.restore(sess, best_model_path)
print("test accuracy on best model %g %s"% (accuracy.eval(feed_dict={x:mnist.test.images, y_:mnist.test.labels, keep_prob:1.0}), best_model_path))


step 100, trainining accuracy 0.562 34.3138 seconds elapsed
step 200, trainining accuracy 0.716 67.8487 seconds elapsed
step 300, trainining accuracy 0.814 101.361 seconds elapsed
step 400, trainining accuracy 0.864 134.265 seconds elapsed
step 500, trainining accuracy 0.882 166.92 seconds elapsed
**step 500, validation accuracy 0.881 167.303 seconds**
step 600, trainining accuracy 0.886 201.842 seconds elapsed
step 700, trainining accuracy 0.928 234.789 seconds elapsed
step 800, trainining accuracy 0.918 267.706 seconds elapsed
step 900, trainining accuracy 0.91 300.732 seconds elapsed
step 1000, trainining accuracy 0.914 333.556 seconds elapsed
**step 1000, validation accuracy 0.929 333.906 seconds**
step 1100, trainining accuracy 0.912 367.366 seconds elapsed
step 1200, trainining accuracy 0.932 400.112 seconds elapsed
step 1300, trainining accuracy 0.932 432.929 seconds elapsed
step 1400, trainining accuracy 0.928 465.69 seconds elapsed
step 1500, trainining accuracy 0.966 498.451 