<h1>Deep Learning</h1>
<h2>Assignment 3</h2><br/>
Previously in 2_fullyconnected.ipynb, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle


First reload the data we generated in 1_notmnist.ipynb.

In [2]:

pickle_file = 'notMNIST_clean.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

('Training set', (187318, 28, 28), (187318,))
('Validation set', (8918, 28, 28), (8918,))
('Test set', (8707, 28, 28), (8707,))


In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

('Training set', (187318, 784), (187318, 10))
('Validation set', (8918, 784), (8918, 10))
('Test set', (8707, 784), (8707, 10))


Reformat into a shape that's more adapted to the models we're going to train:

- data as a flat matrix
- labels as float 1-hot encodings.

In [4]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

<h2>Problem 1</h2><br/>
Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor t using nn.l2_loss(t). The right amount of regularization should improve your validation / test accuracy.

<h3>Applying l2 regularization for logistic models</h3>

In [7]:
train_subset = 10000
lambda_val = 0.003 #scaling value for controlling regularization on the weight value

graph = tf.Graph()
with graph.as_default():
    
    tf_train_dataset = tf.constant(train_dataset[:10000, :])
    tf_train_labels  = tf.constant(train_labels[:10000, :])
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    #variables
    weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    
    logits = tf.matmul(tf_train_dataset,weights) + biases
    
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits)) + lambda_val*tf.nn.l2_loss(weights)
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset,weights) + biases)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset,weights) + biases)

In [8]:
num_steps = 801

with tf.Session(graph=graph) as session:
  # This is a one-time operation which ensures the parameters get initialized as
  # we described in the graph: random weights for the matrix, zeros for the
  # biases. 
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    # Run the computations. We tell .run() that we want to run the optimizer,
    # and get the loss value and the training predictions returned as numpy
    # arrays.
    _, l, predictions = session.run([optimizer, loss, train_prediction])
    if (step % 100 == 0):
      print('Loss at step %d: %f' % (step, l))
      print('Training accuracy: %.1f%%' % accuracy(
        predictions, train_labels[:train_subset, :]))
      # Calling .eval() on valid_prediction is basically like calling run(), but
      # just to get that one numpy array. Note that it recomputes all its graph
      # dependencies.
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Loss at step 0: 26.103168
Training accuracy: 11.1%
Validation accuracy: 13.8%
Loss at step 100: 8.305917
Training accuracy: 71.4%
Validation accuracy: 69.7%
Loss at step 200: 5.868908
Training accuracy: 74.8%
Validation accuracy: 72.5%
Loss at step 300: 4.257320
Training accuracy: 76.8%
Validation accuracy: 73.9%
Loss at step 400: 3.157048
Training accuracy: 78.4%
Validation accuracy: 75.1%
Loss at step 500: 2.398562
Training accuracy: 79.8%
Validation accuracy: 76.3%
Loss at step 600: 1.872542
Training accuracy: 81.3%
Validation accuracy: 77.3%
Loss at step 700: 1.505772
Training accuracy: 82.5%
Validation accuracy: 78.3%
Loss at step 800: 1.248805
Training accuracy: 83.4%
Validation accuracy: 79.1%
Test accuracy: 85.9%


Accuracy increased by around 5 percent when we train the network for the same number of itertions using regularizing 

<h3>Experimenting with higher value of lambda </h3>

In [9]:
# Applying l2 regularization for logistic models
train_subset = 10000
lambda_val = 3 # We have taken a very high value

graph = tf.Graph()
with graph.as_default():
    
    tf_train_dataset = tf.constant(train_dataset[:10000, :])
    tf_train_labels  = tf.constant(train_labels[:10000, :])
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    #variables
    weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    
    logits = tf.matmul(tf_train_dataset,weights) + biases
    
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits)) + lambda_val*tf.nn.l2_loss(weights)
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset,weights) + biases)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset,weights) + biases)

In [10]:
num_steps = 801

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    _, l, predictions = session.run([optimizer, loss, train_prediction])
    if (step % 100 == 0):
      print('Loss at step %d: %f' % (step, l))
      print('Training accuracy: %.1f%%' % accuracy(
        predictions, train_labels[:train_subset, :]))
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Loss at step 0: 9140.834961
Training accuracy: 12.5%
Validation accuracy: 13.5%
Loss at step 100: 26.667292
Training accuracy: 15.5%
Validation accuracy: 10.8%
Loss at step 200: 22.777790
Training accuracy: 15.5%
Validation accuracy: 15.4%
Loss at step 300: 23.502842
Training accuracy: 15.8%
Validation accuracy: 11.0%
Loss at step 400: 22.101349
Training accuracy: 13.0%
Validation accuracy: 15.1%
Loss at step 500: 26.476952
Training accuracy: 15.2%
Validation accuracy: 11.1%
Loss at step 600: 21.613659
Training accuracy: 12.9%
Validation accuracy: 15.6%
Loss at step 700: 24.193426
Training accuracy: 16.9%
Validation accuracy: 12.1%
Loss at step 800: 23.919769
Training accuracy: 8.8%
Validation accuracy: 17.9%
Test accuracy: 18.8%


<b> We can see that the accuracy is very bad. It is because more loss is contributed by the regularizing loss that we have added in the loss function than the loss we are getting by comparing logits to labels. Thus, the models is underfitted as it is just trying to choose small weight values in any way to reduce the loss function.

Similarly, if we choose very small value of lambda, there will be little to no significance of adding regularizing loss to the loss function and no effect could be seen then </b>

<h3>Applying regularization to 1-hidden layer neural network</h3>


In [11]:
batch_size = 128
image_size = 28
num_labels = 10
hidden_layer_nodes = 1000
lambda_val = 0.001

graph = tf.Graph()
with graph.as_default():
    
    #create place holders for taking training input
    tf_train_dataset = tf.placeholder(dtype=tf.float32,shape=(batch_size,image_size*image_size))
    tf_train_labels = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_labels))
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    # Variables for computing hidden layer nodes values
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_layer_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_layer_nodes]))
    hidden_layer_data = tf.nn.relu(tf.matmul(tf_train_dataset,weights1)+biases1)
    
    
    #variables for computing logits for the output layer
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_layer_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
    logits = tf.matmul(hidden_layer_data,weights2)+biases2
    
    
    #calculating loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits)) \
            + lambda_val * tf.nn.l2_loss(weights1) + lambda_val * tf.nn.l2_loss(weights2)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset,weights1)+biases1),weights2)+biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset,weights1)+biases1),weights2)+biases2)

In [12]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 593.787781
Minibatch accuracy: 10.2%
Validation accuracy: 31.2%
Minibatch loss at step 500: 192.214767
Minibatch accuracy: 77.3%
Validation accuracy: 78.0%
Minibatch loss at step 1000: 112.514793
Minibatch accuracy: 78.1%
Validation accuracy: 81.0%
Minibatch loss at step 1500: 67.780785
Minibatch accuracy: 83.6%
Validation accuracy: 82.1%
Minibatch loss at step 2000: 40.501095
Minibatch accuracy: 85.2%
Validation accuracy: 84.1%
Minibatch loss at step 2500: 24.522041
Minibatch accuracy: 90.6%
Validation accuracy: 86.0%
Minibatch loss at step 3000: 15.073820
Minibatch accuracy: 84.4%
Validation accuracy: 86.4%
Test accuracy: 91.6%


I tried multiple lambda values, starting with a higher lambda value and then decreasing the value of lambda by 1/3 its previous value and checked the accuracy. Found 0.01 to better than other values. As lambda is a hyper parameter, experimentation with its values need to be done to get it most optimal value.

<h2>Problem 2</h2><br/>
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

<h3> Taking 1 hidden layer neural network passing only a small subset of the training data</h3>

In [13]:
batch_size = 128
image_size = 28
num_labels = 10
hidden_layer_nodes = 1000
lambda_val = 0.001

graph = tf.Graph()
with graph.as_default():
    
    #create place holders for taking training input
    tf_train_dataset = tf.placeholder(dtype=tf.float32,shape=(batch_size,image_size*image_size))
    tf_train_labels = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_labels))
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    # Variables for computing hidden layer nodes values
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_layer_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_layer_nodes]))
    hidden_layer_data = tf.nn.relu(tf.matmul(tf_train_dataset,weights1)+biases1)
    
    
    #variables for computing logits for the output layer
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_layer_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
    logits = tf.matmul(hidden_layer_data,weights2)+biases2
    
    
    #calculating loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits)) \
            + lambda_val * tf.nn.l2_loss(weights1) + lambda_val * tf.nn.l2_loss(weights2)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset,weights1)+biases1),weights2)+biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset,weights1)+biases1),weights2)+biases2)

In [14]:
num_steps = 3001
small_train_dataset = train_dataset[:2000,:]
small_train_labels  = train_labels[:2000]
indexes = np.arange(128)

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):

    np.random.shuffle(indexes)
    batch_data = small_train_dataset[indexes]
    batch_labels = small_train_labels[indexes]
    
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 649.880981
Minibatch accuracy: 10.2%
Minibatch loss at step 500: 186.195587
Minibatch accuracy: 100.0%
Minibatch loss at step 1000: 112.919128
Minibatch accuracy: 100.0%
Minibatch loss at step 1500: 68.480408
Minibatch accuracy: 100.0%
Minibatch loss at step 2000: 41.530254
Minibatch accuracy: 100.0%
Minibatch loss at step 2500: 25.186213
Minibatch accuracy: 100.0%
Minibatch loss at step 3000: 15.274346
Minibatch accuracy: 100.0%
Test accuracy: 69.9%


We took 200 images and trained the network, we can see that the model is giving 100% accuracy on our trained dataset. however, for test we are getting only 69% accuracy. 


<h1>Problem 3</h1><br/>
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides nn.dropout() for that, but you have to make sure it's only inserted during training.



<h3>1 hidden layer neural network with dropout applied to the first layer activations</h3>

In [15]:
batch_size = 128
image_size = 28
num_labels = 10
hidden_layer_nodes = 1000
print train_dataset.shape

graph = tf.Graph()
with graph.as_default():
    
    #create place holders for taking training input
    tf_train_dataset = tf.placeholder(dtype=tf.float32,shape=(batch_size,image_size*image_size))
    tf_train_labels = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_labels))
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    keep_prob = tf.constant(0.5)
    # Variables for computing hidden layer nodes values
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_layer_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_layer_nodes]))
    hidden_layer_data = tf.nn.dropout(tf.nn.relu(tf.matmul(tf_train_dataset,weights1)+biases1),keep_prob)
    
    
    #variables for computing logits for the output layer
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_layer_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
    logits = tf.matmul(hidden_layer_data,weights2)+biases2
    
    
    #calculating loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset,weights1)+biases1),weights2)+biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset,weights1)+biases1),weights2)+biases2)

(187318, 784)


In [16]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 548.001648
Minibatch accuracy: 6.2%
Validation accuracy: 25.4%
Minibatch loss at step 500: 38.624714
Minibatch accuracy: 75.8%
Validation accuracy: 78.4%
Minibatch loss at step 1000: 13.993723
Minibatch accuracy: 68.8%
Validation accuracy: 78.4%
Minibatch loss at step 1500: 19.152189
Minibatch accuracy: 66.4%
Validation accuracy: 77.8%
Minibatch loss at step 2000: 8.705633
Minibatch accuracy: 70.3%
Validation accuracy: 77.8%
Minibatch loss at step 2500: 5.419950
Minibatch accuracy: 77.3%
Validation accuracy: 78.3%
Minibatch loss at step 3000: 3.228109
Minibatch accuracy: 76.6%
Validation accuracy: 78.1%
Test accuracy: 84.5%


<h1>Problem 4</h1><br/>
Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is 97.1%.

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

global_step = tf.Variable(0)  # count the number of steps taken.
learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

<h3> Applying learning rate decay on a 1 hidden layer neural network</h3> 

In [21]:
#trying learning rate decay
batch_size = 128
image_size = 28
num_labels = 10
hidden_layer_nodes = 1000

graph = tf.Graph()
with graph.as_default():
    
    #create place holders for taking training input
    tf_train_dataset = tf.placeholder(dtype=tf.float32,shape=(batch_size,image_size*image_size))
    tf_train_labels = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_labels))
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    keep_prob = tf.constant(0.5)
    # Variables for computing hidden layer nodes values
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_layer_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_layer_nodes]))
    hidden_layer_data = tf.nn.dropout(tf.nn.relu(tf.matmul(tf_train_dataset,weights1)+biases1),keep_prob)
    
    
    #variables for computing logits for the output layer
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_layer_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
    logits = tf.matmul(hidden_layer_data,weights2)+biases2
    
    
    global_steps = tf.Variable(0,trainable=False)
    learning_rate = tf.train.exponential_decay(learning_rate=0.5,global_step=global_steps,decay_rate=0.9,decay_steps=500)
    #calculating loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits)) + \
        lambda_val * tf.nn.l2_loss(weights1) + lambda_val * tf.nn.l2_loss(weights2)
        
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset,weights1)+biases1),weights2)+biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset,weights1)+biases1),weights2)+biases2)

Added learning rate decay and training the network for more number of iterations will help the network learn better

In [22]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 842.687805
Minibatch accuracy: 7.8%
Validation accuracy: 33.5%
Minibatch loss at step 500: 199.993484
Minibatch accuracy: 75.0%
Validation accuracy: 79.2%
Minibatch loss at step 1000: 116.190338
Minibatch accuracy: 71.1%
Validation accuracy: 80.5%
Minibatch loss at step 1500: 70.170685
Minibatch accuracy: 69.5%
Validation accuracy: 81.2%
Minibatch loss at step 2000: 41.188198
Minibatch accuracy: 75.0%
Validation accuracy: 82.7%
Minibatch loss at step 2500: 24.621210
Minibatch accuracy: 85.9%
Validation accuracy: 84.4%
Minibatch loss at step 3000: 15.187391
Minibatch accuracy: 83.6%
Validation accuracy: 84.5%
Test accuracy: 90.4%


<h3>Creating a two layer neural network having 1024 nodes in the first layer and 100 nodes on the second layer</h3>

In [23]:
#trying learning rate decay
batch_size = 128
image_size = 28
num_labels = 10
hidden1_layer_nodes = 1024
hidden2_layer_nodes = 100
lambda_val = 0.001

graph = tf.Graph()
with graph.as_default():
    
    #create place holders for taking training input
    tf_train_dataset = tf.placeholder(dtype=tf.float32,shape=(batch_size,image_size*image_size))
    tf_train_labels = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_labels))
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    keep_prob = tf.constant(0.5)
    # Variables for computing first hidden layer nodes values
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden1_layer_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden1_layer_nodes]))
    hidden1_layer_data = tf.nn.dropout(tf.nn.relu(tf.matmul(tf_train_dataset,weights1)+biases1),keep_prob)
    
    # Variables for computing second hidden layer nodes values
    weights2 = tf.Variable(
        tf.truncated_normal([hidden1_layer_nodes, hidden2_layer_nodes]))
    biases2 = tf.Variable(tf.zeros([hidden2_layer_nodes]))
    hidden2_layer_data = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden1_layer_data,weights2)+biases2),keep_prob)
    
    #variables for computing logits for the output layer
    weights3 = tf.Variable(
        tf.truncated_normal([hidden2_layer_nodes, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    logits = tf.matmul(hidden2_layer_data,weights3)+biases3
    
    
    global_steps = tf.Variable(0,trainable=False)
    learning_rate = tf.train.exponential_decay(learning_rate=0.5,global_step=global_steps,decay_rate=0.9,decay_steps=5000,)
    #calculating loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits))
        
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    
    test_hidden2_layer_act = tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset,weights1)+biases1),weights2)+biases2
    test_prediction = tf.nn.softmax(tf.matmul(test_hidden2_layer_act,weights3)+biases3)
    
    valid_hidden2_layer_act = tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset,weights1)+biases1),weights2)+biases2
    valid_prediction = tf.nn.softmax(tf.matmul(valid_hidden2_layer_act,weights3)+biases3)

In [24]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3989.051758
Minibatch accuracy: 16.4%
Validation accuracy: 10.2%
Minibatch loss at step 500: 2.298987
Minibatch accuracy: 11.7%
Validation accuracy: 10.1%
Minibatch loss at step 1000: nan
Minibatch accuracy: 3.9%
Validation accuracy: 10.1%
Minibatch loss at step 1500: nan
Minibatch accuracy: 7.8%
Validation accuracy: 10.1%
Minibatch loss at step 2000: nan
Minibatch accuracy: 13.3%
Validation accuracy: 10.1%
Minibatch loss at step 2500: nan
Minibatch accuracy: 10.9%
Validation accuracy: 10.1%
Minibatch loss at step 3000: nan
Minibatch accuracy: 10.2%
Validation accuracy: 10.1%
Test accuracy: 10.2%


Strangely in a two layer network that is created above the accuracy of the model is as good as a random guess .
<b> It is due to the reason of weight initialization, proper weight initialization methods should be used in deep networks as done below<b/>

<h3>Applying better approach for weight initialization </h3>

In [29]:
#appliying proper weight initialization technique
#trying learning rate decay
batch_size = 128
image_size = 28
num_labels = 10
hidden1_layer_nodes = 1024
hidden2_layer_nodes = 100

graph = tf.Graph()
with graph.as_default():
    
    #create place holders for taking training input
    tf_train_dataset = tf.placeholder(dtype=tf.float32,shape=(batch_size,image_size*image_size))
    tf_train_labels = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_labels))
    tf_test_dataset = tf.constant(test_dataset)
    tf_valid_dataset = tf.constant(valid_dataset)
    
    
    # Variables for computing first hidden layer nodes values
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden1_layer_nodes],stddev=np.sqrt(2.0 / (image_size * image_size+hidden1_layer_nodes))))
    biases1 = tf.Variable(tf.zeros([hidden1_layer_nodes]))
    hidden1_layer_data = tf.nn.relu(tf.matmul(tf_train_dataset,weights1)+biases1)
    
    # Variables for computing second hidden layer nodes values
    weights2 = tf.Variable(
        tf.truncated_normal([hidden1_layer_nodes, hidden2_layer_nodes],stddev=np.sqrt(2.0 / (hidden1_layer_nodes+hidden1_layer_nodes))))
    biases2 = tf.Variable(tf.zeros([hidden2_layer_nodes]))
    hidden2_layer_data = tf.nn.relu(tf.matmul(hidden1_layer_data,weights2)+biases2)
    
    #variables for computing logits for the output layer
    weights3 = tf.Variable(
        tf.truncated_normal([hidden2_layer_nodes, num_labels],stddev=np.sqrt(2.0 / (hidden2_layer_nodes+num_labels))))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    logits = tf.matmul(hidden2_layer_data,weights3)+biases3
    
    
    global_steps = tf.Variable(0,trainable=False)
    learning_rate = tf.train.exponential_decay(learning_rate=0.2,global_step=global_steps,decay_rate=0.8,decay_steps=2000,staircase=True)
    #calculating loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels,logits=logits)) + \
        lambda_val*tf.nn.l2_loss(weights1) + lambda_val*tf.nn.l2_loss(weights2) + lambda_val*tf.nn.l2_loss(weights3)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    train_prediction = tf.nn.softmax(logits)
    
    test_hidden2_layer_act = tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset,weights1)+biases1),weights2)+biases2
    test_prediction = tf.nn.softmax(tf.matmul(test_hidden2_layer_act,weights3)+biases3)
    
    valid_hidden2_layer_act = tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset,weights1)+biases1),weights2)+biases2
    valid_prediction = tf.nn.softmax(tf.matmul(valid_hidden2_layer_act,weights3)+biases3)

In [26]:
num_steps = 10000

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 5000 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.710467
Minibatch accuracy: 17.2%
Validation accuracy: 28.1%
Minibatch loss at step 5000: 0.694132
Minibatch accuracy: 81.2%
Validation accuracy: 87.6%
Test accuracy: 93.3%
