# Deep Neural Network for MNIST:
This notebook constructs a 5-layer neural network to predict MNIST labels 0-4 based on exercise 8 in [Hands-on Machine Learning with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do)

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

### Read Data

In [2]:
mnist = input_data.read_data_sets("/tmp/data/")

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [3]:
X_train1 = mnist.train.images[mnist.train.labels < 5]
y_train1 = mnist.train.labels[mnist.train.labels < 5]
X_valid1 = mnist.validation.images[mnist.validation.labels < 5]
y_valid1 = mnist.validation.labels[mnist.validation.labels < 5]
X_test1 = mnist.test.images[mnist.test.labels < 5]
y_test1 = mnist.test.labels[mnist.test.labels < 5]

In [4]:
print len(X_train1), len(X_valid1), len(X_test1)

28038 2558 5139


### Construct DNN

Shape of inputs/outputs

In [5]:
n_inputs = 28*28  # MNIST
n_outputs = 5

In [30]:
tf.reset_default_graph()

In [31]:
X = tf.placeholder(shape=[None, n_inputs], dtype=tf.float32, name='X')
y = tf.placeholder(shape=None, dtype=tf.int64, name='X')

**Deep Neural Network Structure:**

This network has 5 layers with 100 neurons each. The layers are initialized using _He_ initialization and use an _ELU_ activation function.

In [37]:
# number of neurons by layer:
n_neur = 100
he_init = tf.contrib.layers.variance_scaling_initializer()

hidden_1 = tf.layers.dense(inputs=X,
                           units=n_neur,
                           kernel_initializer=he_init,
                           activation=tf.nn.elu,
                           name="hidden1")
hidden_2 = tf.layers.dense(inputs=hidden_1,
                           units=n_neur,
                           kernel_initializer=he_init,
                           activation=tf.nn.elu,
                           name="hidden2")
hidden_3 = tf.layers.dense(inputs=hidden_2,
                           units=n_neur,
                           kernel_initializer=he_init,
                           activation=tf.nn.elu,
                           name="hidden3")
hidden_4 = tf.layers.dense(inputs=hidden_3,
                           units=n_neur,
                           kernel_initializer=he_init,
                           activation=tf.nn.elu,
                           name="hidden4")
hidden_5 = tf.layers.dense(inputs=hidden_4,
                           units=n_neur,
                           kernel_initializer=he_init,
                           activation=tf.nn.elu,
                           name="hidden5")
logits = tf.layers.dense(inputs=hidden_5,
                         units=n_outputs,
                         kernel_initializer=he_init,
                         name='logits')
# define individual cross entropy and overall loss for a batch
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name='loss')

ValueError: Variable hidden1/kernel already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

  File "<ipython-input-32-16a3f1521573>", line 9, in <module>
    name="hidden1")
  File "/Users/sarahcarlisle/anaconda/envs/neural-nets/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/Users/sarahcarlisle/anaconda/envs/neural-nets/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
    if self.run_code(code, result):


**Optimization:**

Some hyper-parameters for optimizing the weights of this network:
- learning rate (_eta_) = 0.01
- optimizer = Adam optimization with default hyper-parameters

In [33]:
learning_rate = 0.01
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   beta1=0.9,
                                   beta2=0.999)
training_step = optimizer.minimize(loss, name='training_step')
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
saver = tf.train.Saver()

**Training**   
This training procedure implements early stopping to avoid overfitting. The stopping scheme works as follows: store the best loss value as a new one is encountered, if this value has not changed in 20 iterations, preserve the model that achieved this loss and stop training. I had initially set the batch size to 50 and got good results, which kind of thwarted the next stage of the exercise to implement some anti-overfitting strategies. So re-setting to 20 obtained the desired result of getting an overfit model to improve.

In [38]:
n_epochs = 1000
batch_size = 20
stopping_threshold = 20
iter_since_best = 0
best_loss_val = float('inf')

In [39]:
n_samples = X_train1.shape[0]

In [40]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        rand_idx = np.random.permutation(np.arange(n_samples))
        for iteration in range(n_samples // batch_size):
            batch_idx = rand_idx[(iteration*batch_size):((iteration+1)*batch_size)]
            X_batch = X_train1[batch_idx,:]
            y_batch = y_train1[batch_idx]
            sess.run(training_step, feed_dict={X: X_batch, y: y_batch})
        loss_val, acc_val = sess.run([loss, accuracy], feed_dict={X: X_valid1, y: y_valid1})
        print(epoch, "Train accuracy:", acc_train, "Val accuracy:", acc_val, )
        if epoch % 10 == 0:
            if loss_val < best_loss_val:
                save_path = saver.save(sess, "./my_mnist_model_0_to_4.ckpt")
                best_loss_val = loss_val
                iter_since_best = 0
        if iter_since_best >= stopping_threshold:
            break
        iter_since_best +=1
with tf.Session() as sess:
    saver.restore(sess, "./my_mnist_model_0_to_4.ckpt")
    acc_test = accuracy.eval(feed_dict={X: X_test1, y: y_test1})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

(0, 'Train accuracy:', 1.0, 'Val accuracy:', 0.94370604)
(1, 'Train accuracy:', 1.0, 'Val accuracy:', 0.96638)
(2, 'Train accuracy:', 1.0, 'Val accuracy:', 0.18725567)
(3, 'Train accuracy:', 1.0, 'Val accuracy:', 0.20914777)
(4, 'Train accuracy:', 1.0, 'Val accuracy:', 0.22009382)
(5, 'Train accuracy:', 1.0, 'Val accuracy:', 0.22009382)
(6, 'Train accuracy:', 1.0, 'Val accuracy:', 0.20914777)
(7, 'Train accuracy:', 1.0, 'Val accuracy:', 0.20914777)
(8, 'Train accuracy:', 1.0, 'Val accuracy:', 0.1927287)
(9, 'Train accuracy:', 1.0, 'Val accuracy:', 0.20914777)
(10, 'Train accuracy:', 1.0, 'Val accuracy:', 0.19077404)
(11, 'Train accuracy:', 1.0, 'Val accuracy:', 0.22009382)
(12, 'Train accuracy:', 1.0, 'Val accuracy:', 0.19077404)
(13, 'Train accuracy:', 1.0, 'Val accuracy:', 0.22009382)
(14, 'Train accuracy:', 1.0, 'Val accuracy:', 0.22009382)
(15, 'Train accuracy:', 1.0, 'Val accuracy:', 0.20914777)
(16, 'Train accuracy:', 1.0, 'Val accuracy:', 0.22009382)
(17, 'Train accuracy:', 1.0,

**Better performance, batch size 50**

In [41]:
n_epochs = 1000
batch_size = 50
stopping_threshold = 20
iter_since_best = 0
best_loss_val = float('inf')

In [42]:
n_samples = X_train1.shape[0]

In [43]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        rand_idx = np.random.permutation(np.arange(n_samples))
        for iteration in range(n_samples // batch_size):
            batch_idx = rand_idx[(iteration*batch_size):((iteration+1)*batch_size)]
            X_batch = X_train1[batch_idx,:]
            y_batch = y_train1[batch_idx]
            sess.run(training_step, feed_dict={X: X_batch, y: y_batch})
        loss_val, acc_val = sess.run([loss, accuracy], feed_dict={X: X_valid1, y: y_valid1})
        print(epoch, "Train accuracy:", acc_train, "Val accuracy:", acc_val, )
        if epoch % 10 == 0:
            if loss_val < best_loss_val:
                save_path = saver.save(sess, "./my_mnist_model_0_to_4.ckpt")
                best_loss_val = loss_val
                iter_since_best = 0
        if iter_since_best >= stopping_threshold:
            break
        iter_since_best +=1
with tf.Session() as sess:
    saver.restore(sess, "./my_mnist_model_0_to_4.ckpt")
    acc_test = accuracy.eval(feed_dict={X: X_test1, y: y_test1})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

(0, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97458953)
(1, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97615325)
(2, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98436278)
(3, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97928071)
(4, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97810787)
(5, 'Train accuracy:', 1.0, 'Val accuracy:', 0.86708367)
(6, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97615325)
(7, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97888976)
(8, 'Train accuracy:', 1.0, 'Val accuracy:', 0.77677876)
(9, 'Train accuracy:', 1.0, 'Val accuracy:', 0.77834243)
(10, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97849882)
(11, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98162627)
(12, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97849882)
(13, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98553556)
(14, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98279905)
(15, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98358095)
(16, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97732604)
(17, 'Train accuracy:', 

(I also tried a smaller learning rate which seemed to really boost things in terms of performance)

### Adding batch normalization

Trying to use Batch Normalization to improve convergence

In [49]:
tf.reset_default_graph()

In [50]:
X = tf.placeholder(shape=[None, n_inputs], dtype=tf.float32, name='X')
y = tf.placeholder(shape=None, dtype=tf.int64, name='X')

In [51]:
training = tf.placeholder_with_default(False, shape=(), name='training')

In [52]:
hidden_1 = tf.layers.dense(inputs=X,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden1")
bn_h1 = tf.layers.batch_normalization(hidden_1,training=training,momentum=0.9)
h1_act = tf.nn.elu(bn_h1)
hidden_2 = tf.layers.dense(inputs=h1_act,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden2")
bn_h2 = tf.layers.batch_normalization(hidden_2,training=training,momentum=0.9)
h2_act = tf.nn.elu(bn_h2)
hidden_3 = tf.layers.dense(inputs=h2_act,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden3")
bn_h3 = tf.layers.batch_normalization(hidden_3,training=training,momentum=0.9)
h3_act = tf.nn.elu(bn_h3)
hidden_4 = tf.layers.dense(inputs=h3_act,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden4")
bn_h4 = tf.layers.batch_normalization(hidden_4,training=training,momentum=0.9)
h4_act = tf.nn.elu(bn_h4)
hidden_5 = tf.layers.dense(inputs=h4_act,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden5")
bn_h5 = tf.layers.batch_normalization(hidden_5,training=training,momentum=0.9)
h5_act = tf.nn.elu(bn_h5)
pre_logits = tf.layers.dense(inputs=h5_act,
                             units=n_outputs,
                             kernel_initializer=he_init,
                             name='logits')
logits = tf.layers.batch_normalization(pre_logits,training=training,momentum=0.9) 
# define individual cross entropy and overall loss for a batch
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name='loss')

In [53]:
learning_rate = 0.01
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   beta1=0.9,
                                   beta2=0.999)
training_step = optimizer.minimize(loss, name='training_step')
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
saver = tf.train.Saver()

In [None]:
iter_since_best = 0
best_loss_val = float('inf')

In [54]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        rand_idx = np.random.permutation(np.arange(n_samples))
        for iteration in range(n_samples // batch_size):
            batch_idx = rand_idx[(iteration*batch_size):((iteration+1)*batch_size)]
            X_batch = X_train1[batch_idx,:]
            y_batch = y_train1[batch_idx]
            sess.run([training_step, extra_update_ops],
                     feed_dict={training:True, X: X_batch, y: y_batch}
                    )
        loss_val, acc_val = sess.run([loss, accuracy],
                                     feed_dict={X: X_valid1, y: y_valid1})
        print(epoch, "Train accuracy:", acc_train, "Val accuracy:", acc_val, )
        if epoch % 10 == 0:
            if loss_val < best_loss_val:
                save_path = saver.save(sess, "./my_mnist_model_0_to_4.ckpt")
                best_loss_val = loss_val
                iter_since_best = 0
        if iter_since_best >= stopping_threshold:
            break
        iter_since_best +=1
with tf.Session() as sess:
    saver.restore(sess, "./my_mnist_model_0_to_4.ckpt")
    acc_test = accuracy.eval(feed_dict={X: X_test1, y: y_test1})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

(0, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97849882)
(1, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98162627)
(2, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98436278)
(3, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99061769)
(4, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98514462)
(5, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98709929)
(6, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98201722)
(7, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98788118)
(8, 'Train accuracy:', 1.0, 'Val accuracy:', 0.9898358)
(9, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99022675)
(10, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99100858)
(11, 'Train accuracy:', 1.0, 'Val accuracy:', 0.9898358)
(12, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99061769)
(13, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98788118)
(14, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98788118)
(15, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99022675)
(16, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99179047)
(17, 'Train accuracy:', 1.

Looks like performance got better but convergence did not.

### Adding dropout

In [56]:
tf.reset_default_graph()

In [57]:
X = tf.placeholder(shape=[None, n_inputs], dtype=tf.float32, name='X')
y = tf.placeholder(shape=None, dtype=tf.int64, name='X')

In [58]:
training = tf.placeholder_with_default(False, shape=(), name='training')

In [59]:
dropout_rate = 0.5

In [60]:
hidden_1 = tf.layers.dense(inputs=X,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden1")
bn_h1 = tf.layers.batch_normalization(hidden_1,training=training,momentum=0.9)
h1_act = tf.nn.elu(bn_h1)
h1_drop = tf.layers.dropout(h1_act, dropout_rate, training=training)
hidden_2 = tf.layers.dense(inputs=h1_drop,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden2")
bn_h2 = tf.layers.batch_normalization(hidden_2,training=training,momentum=0.9)
h2_act = tf.nn.elu(bn_h2)
h2_drop = tf.layers.dropout(h2_act, dropout_rate, training=training)
hidden_3 = tf.layers.dense(inputs=h2_drop,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden3")
bn_h3 = tf.layers.batch_normalization(hidden_3,training=training,momentum=0.9)
h3_act = tf.nn.elu(bn_h3)
h3_drop = tf.layers.dropout(h3_act, dropout_rate, training=training)
hidden_4 = tf.layers.dense(inputs=h3_drop,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden4")
bn_h4 = tf.layers.batch_normalization(hidden_4,training=training,momentum=0.9)
h4_act = tf.nn.elu(bn_h4)
h4_drop = tf.layers.dropout(h4_act, dropout_rate, training=training)
hidden_5 = tf.layers.dense(inputs=h4_drop,
                           units=n_neur,
                           kernel_initializer=he_init,
                           name="hidden5")
bn_h5 = tf.layers.batch_normalization(hidden_5,training=training,momentum=0.9)
h5_act = tf.nn.elu(bn_h5)
h5_drop = tf.layers.dropout(h5_act, dropout_rate, training=training)
pre_logits = tf.layers.dense(inputs=h5_drop,
                             units=n_outputs,
                             kernel_initializer=he_init,
                             name='logits')
logits = tf.layers.batch_normalization(pre_logits,training=training,momentum=0.9) 
# define individual cross entropy and overall loss for a batch
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name='loss')

In [61]:
learning_rate = 0.01
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   beta1=0.9,
                                   beta2=0.999)
training_step = optimizer.minimize(loss, name='training_step')
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
saver = tf.train.Saver()

In [63]:
iter_since_best = 0
best_loss_val = float('inf')

In [64]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        rand_idx = np.random.permutation(np.arange(n_samples))
        for iteration in range(n_samples // batch_size):
            batch_idx = rand_idx[(iteration*batch_size):((iteration+1)*batch_size)]
            X_batch = X_train1[batch_idx,:]
            y_batch = y_train1[batch_idx]
            sess.run([training_step, extra_update_ops],
                     feed_dict={training:True, X: X_batch, y: y_batch}
                    )
        loss_val, acc_val = sess.run([loss, accuracy],
                                     feed_dict={X: X_valid1, y: y_valid1})
        print(epoch, "Train accuracy:", acc_train, "Val accuracy:", acc_val, )
        if epoch % 10 == 0:
            if loss_val < best_loss_val:
                save_path = saver.save(sess, "./my_mnist_model_0_to_4.ckpt")
                best_loss_val = loss_val
                iter_since_best = 0
        if iter_since_best >= stopping_threshold:
            break
        iter_since_best +=1
with tf.Session() as sess:
    saver.restore(sess, "./my_mnist_model_0_to_4.ckpt")
    acc_test = accuracy.eval(feed_dict={X: X_test1, y: y_test1})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

(0, 'Train accuracy:', 1.0, 'Val accuracy:', 0.96833462)
(1, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97498047)
(2, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97732604)
(3, 'Train accuracy:', 1.0, 'Val accuracy:', 0.97928071)
(4, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98397183)
(5, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98592651)
(6, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98436278)
(7, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98514462)
(8, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98553556)
(9, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98709929)
(10, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98553556)
(11, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98827207)
(12, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98944485)
(13, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98905396)
(14, 'Train accuracy:', 1.0, 'Val accuracy:', 0.98944485)
(15, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99100858)
(16, 'Train accuracy:', 1.0, 'Val accuracy:', 0.99139953)
(17, 'Train accuracy:', 

Dropout looks like it didn't do too much improvement over the last model with Batch Norm