## A CNN Architecture for MNIST
conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax

(W−F+2P)/S+ 1
- W: input width
- F: filter width
- P: padding
- S: stride

__Variable scope:__
Since we’ll be dealing with multiple layers, it’s important to introduce variable scope. Think of a
variable scope something similar to a namespace. A variable name ‘weights’ in variable scope
‘conv1’ will become ‘conv1-weights’. The common practice is to create a variable scope for each
layer, so that if you have variable ‘weights’ in both convolution layer 1 and convolution layer 2,
there won’t be any name clash.

In variable scope, we don’t create variable using tf.Variable, but instead use tf.get_variable()
tf.get_variable(<name>, <shape> , <initializer>)

If a variable with that name already exists in that variable scope, we use that variable. If a
variable with that name doesn’t already exists in that variable scope, TensorFlow creates a new
variable. This setup makes it really easy to share variables across architecture. This will come in
extremely handy when you build complex models and you need to share large sets of variables.
Variable scopes help you initialize all of them in one place.

also look into name scope.

__Note:__ You can view progress in tensorboard via using summaries. http://localhost:6006/#scalars&_ignoreYOutliers=false

In [1]:
from __future__ import print_function
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import time 
import tensorflow as tf
import tensorflow.contrib.layers as layers
from tensorflow.examples.tutorials.mnist import input_data

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass


make_dir('checkpoints')
make_dir('checkpoints/convnet_mnist')

# Step 1: load MNIST data to data/mnist
mnist = input_data.read_data_sets("./data/mnist", one_hot=True)

# Step 2: Define paramaters for the model
N_CLASSES = 10
LEARNING_RATE = 0.001
BATCH_SIZE = 128
SKIP_STEP = 10
DROPOUT = 0.75
N_EPOCHS = 9

Extracting ./data/mnist/train-images-idx3-ubyte.gz
Extracting ./data/mnist/train-labels-idx1-ubyte.gz
Extracting ./data/mnist/t10k-images-idx3-ubyte.gz
Extracting ./data/mnist/t10k-labels-idx1-ubyte.gz


In [2]:
# Step 3: create placeholders for features and labels
# - Each image in the MNIST data is of shape 28*28 = 784 therefore, each image is represented with a 1x784 tensor
# - We'll be doing dropout for hidden layer so we'll need a placeholder for the dropout probability too
# - Use None for shape so we can change the batch_size once we've built the graph
with tf.name_scope('data'):
    X = tf.placeholder(tf.float32, [None, 784], name="X_placeholder")
    Y = tf.placeholder(tf.float32, [None, 10], name="Y_placeholder")

dropout = tf.placeholder(tf.float32, name='dropout')

__Note:__
- Use None for shape so we can change the batch_size once we've built the graph
- One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.

In [3]:
# Step 4 + 5: create weights + do inference
# the model is conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

with tf.variable_scope('conv1') as scope:
    # first, reshape the image to [BATCH_SIZE, 28, 28, 1] to make it work with tf.nn.conv2d
    # use the dynamic dimension -1
    
    images = tf.reshape(X, shape=[-1, 28, 28, 1]) 
    
    # TO DO

    # create kernel variable of dimension [5, 5, 1, 32]
    # use tf.truncated_normal_initializer()
    # For conv2d: Given an input tensor of shape [batch, in_height, in_width, in_channels] 
    # and a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]
    conv1_kernels =  tf.get_variable(shape=[5, 5, 1, 32], initializer=tf.truncated_normal_initializer(),   # YES mean=0, stddev=1 
                                     name='conv1kernels', dtype=tf.float32)
    
    # TO DO

    # create biases variable of dimension [32]
    # use tf.constant_initializer(0.0)
    #conv1_biases = tf.Variable([32], tf.constant_initializer(0.0), name='conv1biases', dtype=tf.float32)  # YES
    conv1_biases = tf.get_variable('biases', [32],
                        initializer=tf.random_normal_initializer())
    
    # TO DO 

    # apply tf.nn.conv2d. strides [1, 1, 1, 1], padding is 'SAME'
    
    conv1_output = tf.nn.conv2d(images, conv1_kernels, strides=[1,1,1,1], padding='SAME', name='conv1')
    # TO DO

    # apply relu on the sum of convolution output and biases
    relu1 = tf.nn.relu(tf.add(conv1_output, conv1_biases), name='relu1')    
    # TO DO 

    # output is of dimension BATCH_SIZE x 28 x 28 x 32

with tf.variable_scope('pool1') as scope:
    # apply max pool with ksize [1, 2, 2, 1], and strides [1, 2, 2, 1], padding 'SAME'
    # stride of 2,2 means non-overlapping
    max_pool1 = tf.nn.max_pool(relu1, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME', name='maxpool1')
    # TO DO

    # output is of dimension BATCH_SIZE x 14 x 14 x 32

with tf.variable_scope('conv2') as scope:
    # similar to conv1, except kernel now is of the size 5 x 5 x 32 x 64
    # 32 is the input channels (comes from previous layers)
    conv2_kernels = tf.get_variable('conv2_kernels', [5, 5, 32, 64], initializer=tf.truncated_normal_initializer())
    conv2_biases = tf.get_variable('conv2_biases', [64], initializer=tf.random_normal_initializer()) # TODO: WHY not 0?
    conv2_output = tf.nn.conv2d(max_pool1, conv2_kernels, strides=[1, 1, 1, 1], padding='SAME', name='conv2output')
    relu2 = tf.nn.relu(conv2_output + conv2_biases, name='relu2')

    # output is of dimension BATCH_SIZE x 14 x 14 x 64

with tf.variable_scope('pool2') as scope:
    # similar to pool1
    max_pool2 = tf.nn.max_pool(relu2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name='maxpool2')

    # output is of dimension BATCH_SIZE x 7 x 7 x 64

with tf.variable_scope('fc') as scope:
    # use weight of dimension 7 * 7 * 64 x 1024
    input_features = 7 * 7 * 64
    
    # create weights and biases
    #fc_weight = tf.Variable(tf.random_normal([input_features, 1024]), name='fcweight') YES
    #fc_bias = tf.Variable(tf.zeros([1024]), name='fcbias') YES
    
    fc_weight = tf.get_variable('fc_weight', [input_features, 1024],
                        initializer=tf.truncated_normal_initializer())
    fc_bias = tf.get_variable('fc_bias', [1024],
                        initializer=tf.constant_initializer(0.0))


    # TO DO

    # reshape pool2 to 2 dimensional
    max_pool2 = tf.reshape(max_pool2, [-1, input_features])

    # apply relu on matmul of pool2 and w + b
    fc = tf.nn.relu(tf.matmul(max_pool2, fc_weight) + fc_bias, name='fcrelu')
    
    # TO DO

    # apply dropout
    fc = tf.nn.dropout(fc, dropout, name='dropout')

with tf.variable_scope('softmaxlinear') as scope:
    # this you should know. get logits without softmax
    # you need to create weights and biases
    
   # W = tf.Variable(tf.random_normal(shape=[1024,10], mean=0, stddev=0.1), name="W") YES
    #b = tf.Variable(tf.zeros(shape=[10]), name="b") YES
    
    W = tf.get_variable('weights', [1024, N_CLASSES],
                        initializer=tf.truncated_normal_initializer())
    b = tf.get_variable('biases', [N_CLASSES],
                        initializer=tf.random_normal_initializer())
    
    logits = tf.matmul(fc, W) + b

    

    # TO DO

# Step 6: define loss function
# use softmax cross entropy with logits as the loss function
# compute mean cross entropy, softmax is applied internally
with tf.name_scope('loss'):
    # you should know how to do this too
    entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name="entropy")
    loss = tf.reduce_mean(entropy)

    # TO DO

    
with tf.name_scope('summaries'):
    tf.summary.scalar('loss', loss)
    tf.summary.histogram('histogram_loss', loss)
    summary_op = tf.summary.merge_all()
    
# Step 7: define training op
# using gradient descent with learning rate of LEARNING_RATE to minimize cost
# don't forgot to pass in global_step
    #optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)
    optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)
    lossOptimizer = optimizer.minimize(loss, global_step=global_step)
# TO DO

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    # to visualize using TensorBoard
    writer = tf.summary.FileWriter('./my_graph/mnist', sess.graph)
    ##### You have to create folders to store checkpoints
    ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint'))
    # if that checkpoint exists, restore from checkpoint
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    
    initial_step = global_step.eval()

    start_time = time.time()
    n_batches = int(mnist.train.num_examples / BATCH_SIZE)

    total_loss = 0.0
    for index in range(initial_step, n_batches * N_EPOCHS): # train the model n_epochs times
        X_batch, Y_batch = mnist.train.next_batch(BATCH_SIZE)
        _, loss_batch, summary = sess.run([lossOptimizer, loss, summary_op], 
                                feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) 
        writer.add_summary(summary, global_step=index)
        total_loss += loss_batch
        if (index + 1) % SKIP_STEP == 0:
            print('Average loss at step {}: {:5.1f}'.format(index + 1, total_loss / SKIP_STEP))
            total_loss = 0.0
            saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', index)
    
    print("Optimization Finished!") # should be around 0.35 after 25 epochs
    print("Total time: {0} seconds".format(time.time() - start_time))
    
    # test the model
    n_batches = int(mnist.test.num_examples/BATCH_SIZE)
    total_correct_preds = 0
    for i in range(n_batches):
        X_batch, Y_batch = mnist.test.next_batch(BATCH_SIZE)
        _, loss_batch, logits_batch = sess.run([lossOptimizer, loss, logits], 
                                        feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) 
        preds = tf.nn.softmax(logits_batch)
        correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y_batch, 1))
        accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))
        total_correct_preds += sess.run(accuracy)   
    
    print("Accuracy {0}".format(total_correct_preds/mnist.test.num_examples))

Average loss at step 10: 36946.6
Average loss at step 20: 21872.6
Average loss at step 30: 13436.8
Average loss at step 40: 9398.1
Average loss at step 50: 6879.5
Average loss at step 60: 5101.2
Average loss at step 70: 4285.3
Average loss at step 80: 4133.6
Average loss at step 90: 3540.7
Average loss at step 100: 3175.9
Average loss at step 110: 2846.9
Average loss at step 120: 2682.5
Average loss at step 130: 2582.2
Average loss at step 140: 1975.6
Average loss at step 150: 2077.3
Average loss at step 160: 1833.6
Average loss at step 170: 1928.1
Average loss at step 180: 1882.1
Average loss at step 190: 1331.1
Average loss at step 200: 1720.3
Average loss at step 210: 1362.8
Average loss at step 220: 1346.1
Average loss at step 230: 1430.8
Average loss at step 240: 1401.5
Average loss at step 250: 1264.5
Average loss at step 260: 1226.7
Average loss at step 270: 1130.2
Average loss at step 280: 1046.6
Average loss at step 290: 873.9
Average loss at step 300: 1140.8
Average loss at s

--------------------------------------------
Adam with no weight initialization: Total time: 5.83721089363 seconds Accuracy 0.8358

Adam with initialization
Average loss at step 10: 1802.0
Average loss at step 20: 699.4
Average loss at step 30: 445.7
Average loss at step 40: 288.4
Average loss at step 50: 224.2
Average loss at step 60: 179.3
Average loss at step 70: 166.3
Average loss at step 80: 131.9
Average loss at step 90: 122.3
Average loss at step 100:  95.3
Average loss at step 110:  89.0
Average loss at step 120:  87.8
Average loss at step 130:  69.9
Average loss at step 140:  72.3
Average loss at step 150:  56.3
Average loss at step 160:  54.9
Average loss at step 170:  66.5
Average loss at step 180:  49.7
Average loss at step 190:  52.7
Average loss at step 200:  55.5
Average loss at step 210:  44.3
Average loss at step 220:  45.9
Average loss at step 230:  46.4
Average loss at step 240:  37.0
Average loss at step 250:  38.6
Average loss at step 260:  42.6
Average loss at step 270:  34.6
Average loss at step 280:  31.8
Average loss at step 290:  31.8
Average loss at step 300:  31.1
Average loss at step 310:  31.4
Average loss at step 320:  26.2
Average loss at step 330:  30.7
Average loss at step 340:  25.5
Average loss at step 350:  26.8
Average loss at step 360:  23.8
Average loss at step 370:  26.7
Average loss at step 380:  23.8
Average loss at step 390:  21.8
Average loss at step 400:  23.6
Average loss at step 410:  19.2
Average loss at step 420:  18.6
Optimization Finished!
Total time: 284.903088093 seconds
Accuracy 0.9174

Adam
Average loss at step 10: 30222.2
Average loss at step 20: 16148.8
Average loss at step 30: 9587.3
Average loss at step 40: 6922.9
Average loss at step 50: 4953.3
Average loss at step 60: 4613.0
Average loss at step 70: 3682.1
Average loss at step 80: 3718.2
Average loss at step 90: 3408.3
Average loss at step 100: 2872.0
Average loss at step 110: 2445.8
Average loss at step 120: 2078.3
Average loss at step 130: 2231.1
Average loss at step 140: 1894.4
Average loss at step 150: 1677.1
Average loss at step 160: 2162.8
Average loss at step 170: 1588.3
Average loss at step 180: 1731.8
Average loss at step 190: 2008.5
Average loss at step 200: 1751.3
Average loss at step 210: 1478.8
Average loss at step 220: 1362.8
Average loss at step 230: 1276.3
Average loss at step 240: 1247.8
Average loss at step 250: 1045.4
Average loss at step 260: 1269.3
Average loss at step 270: 1031.2
Average loss at step 280: 1283.8
Average loss at step 290: 1029.7
Average loss at step 300: 861.0
Average loss at step 310: 1002.9
Average loss at step 320: 837.7
Average loss at step 330: 823.5
Average loss at step 340: 1034.0
Average loss at step 350: 1112.5
Average loss at step 360: 791.2
Average loss at step 370: 940.1
Average loss at step 380: 802.0
Average loss at step 390: 640.3
Average loss at step 400: 635.3
Average loss at step 410: 806.9
Average loss at step 420: 781.4
Optimization Finished!
Total time: 309.968480825 seconds
Accuracy 0.9208

## Important
Big blockers were:
1. Good optmizer (Adam vs Stochastic gradient descent)
2. Proper random initialization
3. Use of lower learning rate