## A deep convolutional neural network to classify the MNIST dataset using Tensorflow
- To fully grasp how CNNs are implemented in tensorflow, we'll learn by classifying the simple MNIST dataset with 10 output classes. The basic idea of a CNN is the same even when we scale this up to the CIFAR-10 dataset.
- This is an expert-level technical guide to tensorflow based on "Deep MNIST for Experts" by Google (https://www.tensorflow.org/get_started/mnist/pros). Keras is not used, so that we can fully understand the nuts and bolts of tensorflow and be able to customize and tweak it completely.
- The theory of convolutional neural networks is not covered here, as it is beyond the scope of this guide. Please read up on kernels / filter sizes, strides, padding and max pooling in CNNs before proceeding.

by Kelvin Kong
- **Github**: https://github.com/kelvinAI
- **Linkedin**: https://linkedin.com/in/kelvinkong

In [1]:
#Obtain the MNIST dataset
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data\train-images-idx3-ubyte.gz
Extracting MNIST_data\train-labels-idx1-ubyte.gz
Extracting MNIST_data\t10k-images-idx3-ubyte.gz
Extracting MNIST_data\t10k-labels-idx1-ubyte.gz


In [2]:
#Enable interactive output for easier debugging
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
from IPython.core.debugger import set_trace
#Import necessary packages
import tensorflow as tf
import numpy as np

#### Peek into the mnist dataset data structure


In [3]:
mnist

Datasets(train=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x0000000008D78470>, validation=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x0000000008D78E10>, test=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x0000000008D78BE0>)

#### Datasets are stored in mnist.train, mnist.validation, mnist.test
The datasets can be obtained in batches through the next_batch(batch_size) method.

First, create placeholders for the input nodes x and the target output nodes y.

The next_batch method returns a tuple of two NumPy arrays: the images with shape (batch_size, 784) and the labels with shape (batch_size, 10). Each image is a 28 x 28 pixel picture with a single color channel (greyscale only) that has already been flattened.
The flattened input therefore has 784 ( 28 * 28 ) features, with 10 target output classes that correspond to the digits 0 - 9.

The target output y is a 2D tensor [ None , prediction ], where prediction is a one-hot, 10-dimensional vector that indicates which class the image belongs to.
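To make the one-hot label format concrete, here is a minimal pure-Python sketch. The helper names `one_hot` and `decode` are hypothetical and are not part of tensorflow or the MNIST loader; they only illustrate the encoding that `one_hot=True` produces for each label.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in the range [0, num_classes)")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


def decode(vector):
    """Recover the class index: the position of the largest value."""
    return max(range(len(vector)), key=vector.__getitem__)


encoded = one_hot(3)
print(encoded)          # the digit 3 as a 10-dimensional vector
print(decode(encoded))  # back to the class index
```

Decoding is simply an argmax, which is the same operation used later in this guide to compare predictions against targets.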

In [85]:
#Get an idea of the expected matrix shapes from the raw input to design our tensorflow placeholders
batch_size = 20
for i in range(100):
    batch = mnist.train.next_batch(batch_size)
    if i % 20 == 0:
        print("batch[0]:{}, batch[1]:{}".format(batch[0].shape, batch[1].shape))


batch[0]:(20, 784), batch[1]:(20, 10)
batch[0]:(20, 784), batch[1]:(20, 10)
batch[0]:(20, 784), batch[1]:(20, 10)
batch[0]:(20, 784), batch[1]:(20, 10)
batch[0]:(20, 784), batch[1]:(20, 10)


### For each batch
batch[0] contains the training images with dimensions [ batch_size, 784 ], where 784 is the flattened 28 * 28 pixel image and batch_size is the number of training examples that are processed together in a single training step. The batch size directly affects the amount of RAM used; setting it too high may cause your machine to freeze. Typical values are around 128 or 256, depending on your available RAM.

batch[1] contains the one-hot encoded target output class
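As a rough illustration of why batch_size affects memory, the sketch below estimates the size of a single activation tensor. It assumes float32 values (4 bytes each) and a tensor of shape (batch_size, 28, 28, 32), which is the output shape of the first convolutional layer built later in this guide. The function name `activation_bytes` is hypothetical, and real memory use is higher because tensorflow keeps several such tensors alive at once.

```python
def activation_bytes(batch_size, height, width, channels, bytes_per_value=4):
    """Bytes needed to hold one activation tensor of the given shape."""
    return batch_size * height * width * channels * bytes_per_value


MIB = 1024 * 1024

# Compare a few batch sizes, including the full 10000-image test set.
for batch_size in (20, 128, 256, 10000):
    size = activation_bytes(batch_size, 28, 28, 32)
    print("batch_size={:>5}: {:.2f} MiB".format(batch_size, size / MIB))
```

Feeding all 10000 test images at once needs close to 1 GiB for this one tensor alone, which is why evaluating on a subset can be necessary on a machine with limited RAM.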

In [86]:
#First create placeholders to receive input values

#Create some test input to validate the shape
test_x_input = np.random.random((batch_size, 784))
test_y_input = np.random.random((batch_size, 10))
print("Test x input shape: {}".format(test_x_input.shape))
print("Test y input shape: {}".format(test_y_input.shape))


#Create a tensorflow placeholder called x of shape (batch_size, features) that accepts a matrix with the shape of test_x_input, but use None
#as the batch_size so we can change it later on the fly
x = tf.placeholder(tf.float32, [None, test_x_input.shape[1]], name='x')
print("Validate shape of x: {}".format(x.get_shape().as_list()))

#Same for y, create a placeholder to hold target output classes
y = tf.placeholder(tf.float32, [None, test_y_input.shape[1]], name='y')
print("Validate shape of y: {}".format(y.get_shape().as_list()))


Test x input shape: (20, 784)
Test y input shape: (20, 10)
Validate shape of x: [None, 784]
Validate shape of y: [None, 10]


### Convolutional Neural Networks in Tensorflow
Convolutional layers can be computed easily in tensorflow using tf.nn.conv2d.
For first timers especially, we'll first go through the arguments needed by tf.nn.conv2d to clear up any confusion about how to use this function properly.

tf.nn.conv2d( x_image_input_in_4d, weights, strides , padding)

1. **x_image_input_in_4d** requires your image input tensor to be in 4 dimensions, specifically ( batch_size, image height, image width, color depth)
- for the MNIST dataset, the images are in greyscale, thus color depth == 1
- the MNIST dataset comes with already flattened images of 784 dimensions, thus we have to recover the height and width from these columns by reshaping each image to 28 * 28. This can be done with the tf.reshape command

2. We will create a new tensor **x_4d** by reshaping x, instead of overwriting x, to prevent confusion later on.
    
    Note: -1 in the reshape command is a special value that means "automatically compute this dimension". Only one -1 can be used in any single reshape.
    Thus when we fix the 1st, 2nd and 3rd dimensions to 28, 28 and 1, we can put -1 as the 0th dimension and it will be deduced automatically from the total number of elements, which in this case gives the batch_size
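
The way a -1 dimension is deduced can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not tensorflow's actual implementation, and the function name `infer_shape` is hypothetical.

```python
def infer_shape(total_elements, shape):
    """Replace a single -1 in `shape` with the size that makes the
    product of all dimensions equal to `total_elements`."""
    unknown = [i for i, dim in enumerate(shape) if dim == -1]
    if len(unknown) > 1:
        raise ValueError("only one dimension may be -1")

    # Product of all the dimensions that were given explicitly.
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim

    if not unknown:
        if known != total_elements:
            raise ValueError("shape does not match the number of elements")
        return list(shape)

    if known == 0 or total_elements % known != 0:
        raise ValueError("cannot infer the missing dimension")

    resolved = list(shape)
    resolved[unknown[0]] = total_elements // known
    return resolved


# A batch of 20 flattened images holds 20 * 784 values.
print(infer_shape(20 * 784, [-1, 28, 28, 1]))  # [20, 28, 28, 1]
```

Because 28 * 28 * 1 = 784, the only value that fits the remaining dimension is the number of images in the batch.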




In [87]:
# x input needs to be reshaped to 4 dimensions for tf.nn.conv2d.
# [ batch_size, height, width, depth(Channels, 1 for grayscale)]
x_4d = tf.reshape(x, [-1, 28, 28,1], name='x_4d')
print("x_4d shape: {}".format(x_4d.get_shape().as_list()))


x_4d shape: [None, 28, 28, 1]


### Create the Convolutional Neural Network + Max Pooling Graph

This is where the fun begins: the actual architecture of the convolutional neural network combined with max pooling, where the network will learn from the input image.

To facilitate understanding, (almost) the entire graph is created in the next cell. This helps first timers see the overall picture of the graph instead of scrolling blindly up and down the notebook in search of global variables that are defined in different cells. For larger networks, it is recommended to move some of the steps into reusable functions to improve readability and adhere to DRY.

I find it easier to visualize the flow of the entire tensorflow graph this way before adding sugar-coated helper methods while refactoring the program. One thing to note: most of the tensorflow variables and placeholders here are defined in the global scope, but in the end you will only use the last few of them in your tensorflow session, as the graph has already been generated internally.

This tutorial unfortunately does not explain the theory of convolutional neural networks in depth, as that is beyond its scope. Please read up on kernels / filter sizes, strides, padding and max pooling in CNNs if necessary.

In [88]:
#Create first convolutional layer with depth of 32
# ( x, kheight, kwidth, xdepth, output_depth)

#weight is a 4 dimensional filter, (height, width, input_depth, output depth)
output_depth_1 = 32

#First, understand that weights for convolutional neural networks in tensorflow are of type tf.Variable and must
#have 4 dimensions, (kernel height, kernel width, input_depth(color channels of input), output_depth)
weight_1 = tf.Variable(tf.truncated_normal((5,5,1, output_depth_1),stddev=0.1) )
print("Dimensions of the first Weight/Filter: {}".format(weight_1.get_shape()))
bias_1 = tf.Variable(tf.zeros(output_depth_1))
print("Bias dimensions of the first output layer will equal the number of output depth: {}".format(bias_1.get_shape()))

#Create the first convolutional layer, using x_4d as input
convo_layer1 = tf.nn.conv2d(x_4d, weight_1, strides=[1,1,1,1], padding='SAME' )
convo_layer1 = tf.nn.bias_add(convo_layer1, bias_1)
convo_layer1 = tf.nn.relu(convo_layer1) #Use relu activation function before applying max pooling

#Apply max pooling on the convolutional layer 1 . note that the standard values for kernel size and strides for max pooling 
#are ksize=[1,2,2,1] and strides=[1,2,2,1]
maxpool_layer1 = tf.nn.max_pool(convo_layer1, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')




#Obtain the depth of the maxpool_layer1 in the 3rd dimension
maxpool_layer1_depth = maxpool_layer1.get_shape().as_list()[3]

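#Create the 2nd convolutional layer with depth of 64 for each 5x5 patch (determined by kernel size)
convo_depth_2_output = 64

#Note: no stddev is passed here (or for the dense layers below), so tf.truncated_normal uses its default stddev=1.0,
#unlike weight_1 which uses stddev=0.1. This leads to very large initial outputs from the network.
weight_2 = tf.Variable(tf.truncated_normal((5,5,maxpool_layer1_depth, convo_depth_2_output)))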
print("Weight shapes for 2nd convolutional layer : {}".format(weight_2.get_shape().as_list()))
bias_2 = tf.Variable(tf.zeros(convo_depth_2_output))


#Create the 2nd convolutional layer by applying conv2d on top of the previous maxpool_layer
convo_layer2 = tf.nn.conv2d(maxpool_layer1, weight_2, strides=[1,1,1,1], padding='SAME') + bias_2
convo_layer2 = tf.nn.relu(convo_layer2) 

#Apply the same maxpool layer
maxpool_layer2 = tf.nn.max_pool(convo_layer2, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')


#Create the Dense / Fully Connected layer
#First flatten the maxpool_layer2 with dimensions (None, 7, 7, 64)
mx_layer2_dims = maxpool_layer2.get_shape().as_list()
print("Preview the maxpool_layer2 output dimensions: {}".format(mx_layer2_dims))

# Now, to apply a fully connected layer on this 4 dimensional tensor, we'll first have to flatten it by
# reshaping it to a 2d matrix whose second dimension is 7 * 7 * 64 (based on the previous maxpool output layer)
# We'll compute this automatically by obtaining the shape of the previous maxpool_layer2 and
# multiplying the 1st, 2nd and 3rd dimensions together. This is useful because if we change the architecture
# (for example the strides, padding or output depth), we do not have to manually change this value

fc_layer = tf.reshape(maxpool_layer2, [-1, mx_layer2_dims[1] * mx_layer2_dims[2] * mx_layer2_dims[3]])
print("Reshaped FC layer Shape: {}".format(fc_layer.get_shape()))

fc_layer_dims = fc_layer.get_shape().as_list()

# We'll create a first dense/hidden layer with 1024 output neurons, and a second dense layer with 
# 10 output neurons for final classification
fc_layer_1_output = 1024

weights_fc1 = tf.Variable(tf.truncated_normal([fc_layer_dims[1], fc_layer_1_output]))
print("Weights of Fully connected layer 1: {}".format(weights_fc1.get_shape().as_list()))
bias_fc1 = tf.Variable(tf.zeros([fc_layer_1_output]))
print("Fully connected layer 1 - Bias_fc1 shape : ", bias_fc1.get_shape().as_list())

#Apply a linear (affine) transformation to the dense layer input: multiply by the weights, add a bias, then apply RELU activation
fc_layer1_z = tf.nn.bias_add(tf.matmul(fc_layer, weights_fc1),bias_fc1)
fc_layer_1_relu = tf.nn.relu(fc_layer1_z)
print("fc_layer1_activation shape", fc_layer_1_relu.get_shape().as_list())

#add Dropout on the first dense layer to prevent overfitting
keep_prob = tf.placeholder(tf.float32)
fc_layer_1_drop = tf.nn.dropout(fc_layer_1_relu, keep_prob)

#Final output layer
dims_fc_l1_drop = fc_layer_1_drop.get_shape().as_list()
print("Fully connected layer 1 output shape:",dims_fc_l1_drop)

# Create output with 10 classes
# Simply a linear transformation of the 1st fully connected layer output, plus a bias. Note: no softmax activation function is applied here
weights_fc2 = tf.Variable(tf.truncated_normal((dims_fc_l1_drop[1], 10)))
print("Weights shape for FC layer 2 : {}".format(weights_fc2.get_shape().as_list()))
bias_fc2 = tf.Variable(tf.zeros([10]))

fc_layer2 = tf.nn.bias_add(tf.matmul(fc_layer_1_drop, weights_fc2),bias_fc2)
print("Final FC layer2 shape, must be the same as the one hot encoded y input: {}".format(fc_layer2.get_shape().as_list()))

print("\nConvolutional Neural Network Graph creation completed!")

Dimensions of the first Weight/Filter: (5, 5, 1, 32)
Bias dimensions of the first output layer will equal the number of output depth: (32,)
Weight shapes for 2nd convolutional layer : [5, 5, 32, 64]
Preview the maxpool_layer2 output dimensions: [None, 7, 7, 64]
Reshaped FC layer Shape: (?, 3136)
Weights of Fully connected layer 1: [3136, 1024]
Fully connected layer 1 - Bias_fc1 shape :  [1024]
fc_layer1_activation shape [None, 1024]
Fully connected layer 1 output shape: [None, 1024]
Weights shape for FC layer 2 : [1024, 10]
Final FC layer2 shape, must be the same as the one hot encoded y input: [None, 10]

Convolutional Neural Network Graph creation completed!
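
The shapes printed above can be checked by hand. The sketch below is plain Python rather than tensorflow, and the helper names `same_padding_output` and `parameter_count` are hypothetical. It relies on the fact that with padding='SAME' the output height and width equal ceil(input size / stride), regardless of the kernel size.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a conv or pooling op with padding='SAME'."""
    return math.ceil(size / stride)


size = 28
size = same_padding_output(size, 1)  # conv layer 1, stride 1 -> 28
size = same_padding_output(size, 2)  # max pool 1, stride 2   -> 14
size = same_padding_output(size, 1)  # conv layer 2, stride 1 -> 14
size = same_padding_output(size, 2)  # max pool 2, stride 2   -> 7
flattened = size * size * 64         # 64 = depth of conv layer 2
print("Flattened size:", flattened)


def parameter_count(weight_shape):
    """Number of weights plus one bias per output unit (last dimension)."""
    count = 1
    for dim in weight_shape:
        count *= dim
    return count + weight_shape[-1]


weight_shapes = {
    "conv1": (5, 5, 1, 32),
    "conv2": (5, 5, 32, 64),
    "fc1": (flattened, 1024),
    "fc2": (1024, 10),
}
total = sum(parameter_count(shape) for shape in weight_shapes.values())
print("Trainable parameters:", total)
```

Almost all of the trainable parameters sit in the first fully connected layer, which is also why dropout is applied there.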


### Now we'll need to create the cost function and optimizer for the graph

This can be done using tf.nn.softmax_cross_entropy_with_logits, where the labels argument is the target labels and logits refers to your neural network's final output layer, which in this case is fc_layer2. This function returns one loss value per example in the batch; the cross entropy we minimize is the mean of this vector.

Instead of using plain gradient descent, we'll use the more sophisticated AdamOptimizer with a learning rate of 0.0001 (1e-4) and ask it to minimize the cross_entropy. Understand that cross_entropy in this case is not a single number, but rather a node in the large graph that was created in the steps above, which includes fc_layer2, etc. By supplying cross_entropy to AdamOptimizer, tensorflow will automatically compute the gradients and update the weights to reduce the loss. (We'll have to treat it as a black box at this point, as defining and writing your own gradient descent is no longer feasible when it comes to CNNs, but be sure to understand how it works.)
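
For intuition, the loss can be written out in plain Python. This is a minimal sketch of the maths only, not tensorflow's implementation, and the function names are hypothetical. It applies softmax to the raw logits and takes the negative log-probability of the target class, using the log-sum-exp trick so that very large logits do not overflow.

```python
import math


def softmax_cross_entropy(logits, one_hot_label):
    """Cross entropy between softmax(logits) and a one-hot target."""
    largest = max(logits)
    # log(sum(exp(logits))), computed stably by shifting by the largest logit.
    log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
    # log softmax of each logit is (logit - log_sum_exp).
    return -sum(t * (v - log_sum_exp) for t, v in zip(one_hot_label, logits))


def mean_loss(batch_logits, batch_labels):
    """Average the per-example losses, as tf.reduce_mean does for the batch."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


# With all logits equal, every class gets probability 1/10, so the loss is ln(10).
uniform_logits = [0.0] * 10
target = [0.0] * 10
target[3] = 1.0
print(softmax_cross_entropy(uniform_logits, target))
```

A loss close to ln(10), roughly 2.3, is therefore what an uninformed 10-class classifier would produce, which gives a useful baseline when watching the loss during training.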


In [103]:
#Create cross_entropy to be used in Adam optimizer
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=fc_layer2))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)


### Understanding how to calculate accuracy and predictions (Optional, for understanding purposes)
Using this small tensorflow session, we'll understand how the accuracy tensor is created


In [172]:
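# Warning: calling this (particularly global_variables_initializer()) will reset the weights, even if they have already been trained!
# Note: mnist_test_x and mnist_test_labels are created in cell In [111] further down, which must be run first
# Example to obtain 5 predictions from the test dataset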
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    predictions = sess.run(fc_layer2, feed_dict={x: mnist_test_x[:5], keep_prob: 1.0})
    print("fc_layer2 output shape:{}".format(predictions.shape)) #5 rows, 10 columns
    
    #Peek into 5 rows of predictions
    print("Peek into the 5 rows of final fully connected layer 2 output predictions:\n{}".format(predictions))
    
    # Calling tf.argmax with axis = 1 on the predictions will return, for each row, the index/column that contains
    # the largest value
    
    pred_argmax = sess.run(tf.argmax(predictions,1))
    print("tf argmax on predictions returns 5 values that are indexes of the largest value in each prediction row:\n{}".format(pred_argmax))
    print("\nprediction[0] example:{}\ncolumn with highest prediction = pred_argmax[0] :{} ".format(predictions[0], pred_argmax[0]))
    
    
    print("\nPeek into 5 rows of target mnist_test_labels:\n{}".format(mnist_test_labels[:5]))
    actual_argmax = sess.run(tf.argmax(mnist_test_labels[:5], 1))
    print("Index of columns with the correct target: {}".format(actual_argmax))
    
    print("Predictions vs target:\n{}\n{}".format(pred_argmax,actual_argmax))
    
    # We'll run tf.equal, which returns a vector of True/False values. If the highest prediction index (in pred_argmax)
    # equals the index in the target vector, then the prediction is correct. Otherwise, if the network predicted a different
    # class, which results in a different index in pred_argmax, then the value will be False in that vector
    correct_predictions = sess.run(tf.equal(pred_argmax,actual_argmax))
    
    print("correct_predictions vector: {}".format(correct_predictions))

    # We can obtain the accuracy of the predictions by dividing the total count of TRUE values over the entire prediction set
    # First, cast the TRUE/False values to 1 or 0 using tf.cast to turn this TRUE/FALSE observation into a mathematical problem
    prediction_numbers = sess.run(tf.cast(correct_predictions, tf.float32))
    print("correct_predictions_in_numbers: {}".format(prediction_numbers))

    # Now calculate the accuracy by simply obtaining the mean of this vector, where 1 is a correct prediction and 0 is a wrong
    # prediction
    accuracy = sess.run(tf.reduce_mean(prediction_numbers))
    print("Accuracy: {}".format(accuracy))
    print("\nNote:The accuracy at this stage should be terrible and totally random, since we have not trained the model yet.")


fc_layer2 output shape:(5, 10)
Peek into the 5 rows of final fully connected layer 2 output predictions:
[[-1113.46716309  2792.99023438  2528.83447266  2069.26904297
    103.74853516  1540.29199219  1262.55993652  2380.5222168  -1811.08862305
    471.68908691]
 [  228.03097534  4267.12939453 -1583.92626953  -253.44909668
   -228.01464844   798.46337891  6654.45458984  3446.76025391
  -6469.01269531  -280.30090332]
 [ 1575.96398926  1041.82128906 -1055.50048828  1220.06860352
    289.98596191   500.32275391  1477.18493652  3492.81640625
  -1289.64929199  -410.78933716]
 [   78.37426758  3994.08032227  1848.76245117  2234.07055664
   1012.12451172  1830.78198242  4707.67041016  3467.58081055
  -3980.89819336  1479.26416016]
 [-1659.06079102  3980.03955078  1222.11865234   952.20593262
  -2790.08544922  2176.875       1928.78955078  3106.7800293  -4728.12011719
   1636.65942383]]
tf argmax on predictions returns 5 values that are indexes of the largest value in each prediction row:
[1 6 

### Creating correct predictions and accuracy tensors
Once we understand the above mechanics of predictions and accuracy, the two tensors are created in just 2 lines of code

In [171]:
correct_predictions = tf.equal(tf.argmax(fc_layer2,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
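
The same computation can be sketched in plain Python on a toy batch. The logits and labels below are made-up values with only 3 classes, and the function names are hypothetical; the point is only to mirror the argmax, equal, cast and mean steps.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


def batch_accuracy(logits, one_hot_labels):
    """Fraction of rows where the predicted class matches the target class."""
    correct = [argmax(p) == argmax(t) for p, t in zip(logits, one_hot_labels)]
    # Cast True/False to 1.0/0.0 and take the mean.
    return sum(1.0 if c else 0.0 for c in correct) / len(correct)


# Hypothetical toy batch: 4 examples, 3 classes.
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 0.9],
    [-0.5, 1.5, 0.0],
    [1.0, 0.9, 0.8],
]
toy_labels = [
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
]
print(batch_accuracy(toy_logits, toy_labels))  # 3 of 4 rows match: 0.75
```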


In [111]:
#Create a subset of test images
mnist_test_x = mnist.test.images[:1000]
mnist_test_x.shape
mnist_test_labels = mnist.test.labels[:1000]
mnist_test_labels.shape

print("We'll test our model on 1000 examples from the MNIST test database")


(1000, 784)

(1000, 10)

We'll test our model on 1000 examples from the MNIST test database


In [113]:
batch_size = 100

iterations = 10000
reporting_count = 10

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(iterations):
        batch = mnist.train.next_batch(batch_size)
        output = sess.run(train_step, feed_dict={x:batch[0], keep_prob:0.5, y:batch[1]})
        
        if i % int(iterations/reporting_count) == 0:
            # Note: although labelled "Training accuracy", this is measured on the 1000-example test subset
            print("Training accuracy:{} {}% completed".format(
                sess.run(accuracy,feed_dict={x:mnist_test_x, y:mnist_test_labels, keep_prob:1.0 }), i/iterations * 100 ))
    print("Train Complete!")

Training accuracy:0.06599999964237213 0.0% completed
Training accuracy:0.8830000162124634 10.0% completed
Training accuracy:0.9229999780654907 20.0% completed
Training accuracy:0.9359999895095825 30.0% completed
Training accuracy:0.949999988079071 40.0% completed
Training accuracy:0.9480000138282776 50.0% completed
Training accuracy:0.9520000219345093 60.0% completed
Training accuracy:0.9520000219345093 70.0% completed
Training accuracy:0.9490000009536743 80.0% completed
Training accuracy:0.9430000185966492 90.0% completed
Train Complete!
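
The training loop feeds keep_prob 0.5 while training and 1.0 when measuring accuracy. The sketch below shows, in plain Python, what that parameter does. It is an illustration only and not tensorflow's implementation; the function name `dropout` and the seeded `random.Random` generator are assumptions made for the example. Each value is kept with probability keep_prob and scaled by 1 / keep_prob, so the expected sum stays the same.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob, scaled by 1 / keep_prob;
    otherwise replace it with 0.0."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in the range (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # seeded only to make the example repeatable
activations = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(activations, 0.5, rng)  # some values zeroed, the rest doubled
eval_out = dropout(activations, 1.0, rng)   # nothing is dropped

print("training  :", train_out)
print("evaluation:", eval_out)
```

With keep_prob set to 1.0 the layer passes its input through unchanged, which is what we want when evaluating the model.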


### Final Accuracy after training using a larger test set

In [128]:
print("Total number of test images: {}".format(len(mnist.test.images)))
# Create a larger test set from the whole set (only necessary to overcome RAM limitations,
# otherwise use the full test set). Note: the slice [:-1000] keeps all but the last 1000 images, i.e. 9000 images
mnist_test_x_large = mnist.test.images[:-1000]
mnist_test_y_large = mnist.test.labels[:-1000]

Total number of test images: 10000


In [132]:
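# Caution: the trained weights lived in the training session above, which was closed when its `with` block ended.
# A new tf.Session does not inherit them, which is consistent with the near-chance accuracy shown below. Evaluate inside
# the training session (or save and restore with tf.train.Saver) instead. This cell also feeds the small subset mnist_test_x.
with tf.Session() as sess:
    output = sess.run(accuracy, feed_dict={x:mnist_test_x, y:mnist_test_labels, keep_prob:1.0 })
    print("Final Accuracy :{}".format(output))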

Final Accuracy :0.08299999684095383
