Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in _notmnist.ipynb_.

In [2]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
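
The one-hot step in `reformat` relies on NumPy broadcasting. The following is a minimal sketch on a toy label vector; `num_labels = 4` and the three labels are hypothetical values chosen only for readability (the notebook uses 10 classes).

```python
import numpy as np

num_labels = 4                    # toy value; the notebook uses 10
labels = np.array([2, 0, 3])      # hypothetical integer class labels

# labels[:, None] has shape (3, 1) and np.arange(num_labels) has shape (4,).
# Broadcasting compares every label with every class index, which yields a
# (3, 4) boolean matrix with exactly one True per row.
one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
print(one_hot)
```

Taking `np.argmax(one_hot, 1)` recovers the original integer labels, which is what the `accuracy` helper relies on.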


In [4]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])
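
As a quick sanity check, the helper can be exercised on a hand-made example. The arrays below are hypothetical toy data (4 examples, 3 classes), not notMNIST values.

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows whose highest-scoring class matches the one-hot label.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

# Hypothetical toy data: the last row is misclassified (class 0 instead of 1).
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4],
                        [0.6, 0.3, 0.1]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]], dtype=np.float32)

print('Accuracy: %.1f%%' % accuracy(predictions, labels))  # prints "Accuracy: 75.0%"
```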

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.
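
To make the penalty concrete, here is a NumPy sketch of the quantity that `tf.nn.l2_loss` computes, namely half the sum of squared entries (no square root). The weight matrix, `data_loss` and `beta` below are hypothetical values used only for illustration.

```python
import numpy as np

def l2_loss(t):
    # Same definition as tf.nn.l2_loss: sum(t ** 2) / 2.
    return np.sum(np.square(t)) / 2.0

weights = np.array([[1.0, -2.0],
                    [0.5,  0.0]])
data_loss = 0.30   # hypothetical cross-entropy value
beta = 0.01        # hypothetical regularization strength

penalty = l2_loss(weights)               # (1 + 4 + 0.25 + 0) / 2 = 2.625
total_loss = data_loss + beta * penalty  # regularized loss
print(penalty, total_loss)
```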

---

---
L2 regularization of the logistic regression model 
---
Here `beta` is the regularization strength: the coefficient that multiplies the L2 penalty before it is added to the loss.
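
One way to see what `beta` does during training: the gradient of `beta * sum(w ** 2) / 2` with respect to `w` is `beta * w`, so each gradient descent step shrinks the weights by a constant factor. The sketch below isolates that effect under the simplifying assumption that the data term contributes no gradient; the starting weights are hypothetical.

```python
import numpy as np

learning_rate = 0.5   # the value passed to GradientDescentOptimizer in this notebook
beta = 0.05           # one of the beta values tried in this notebook

w = np.array([1.0, -2.0, 4.0])   # hypothetical starting weights
for _ in range(3):
    grad_penalty = beta * w              # d/dw [beta * sum(w ** 2) / 2]
    w = w - learning_rate * grad_penalty

# Each step multiplies w by (1 - learning_rate * beta) = 0.975.
print(w)
```

In a real run the data gradient is added on top of this shrinkage, so the weights settle where the two balance.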

In [21]:
# With gradient descent training, even this much data is prohibitive.
# Subset the training data for faster turnaround.
train_subset = 10000
beta=0.0

graph = tf.Graph()
with graph.as_default():

  # Input data.
  # Load the training, validation and test data into constants that are
  # attached to the graph.
  tf_train_dataset = tf.constant(train_dataset[:train_subset, :])
  tf_train_labels = tf.constant(train_labels[:train_subset])
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  # These are the parameters that we are going to be training. The weight
  # matrix will be initialized using random values following a (truncated)
  # normal distribution. The biases get initialized to zero.
  weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_labels]))
  biases = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  # We multiply the inputs with the weight matrix, and add biases. We compute
  # the softmax and cross-entropy (it's one operation in TensorFlow, because
  # it's very common, and it can be optimized). We take the average of this
  # cross-entropy across all training examples: that's our loss.
  logits = tf.matmul(tf_train_dataset, weights) + biases
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      logits=logits, labels=tf_train_labels))
  # Add the L2 penalty on the weights and biases, scaled by beta.
  loss += beta * (tf.nn.l2_loss(weights) + tf.nn.l2_loss(biases))
  
  # Optimizer.
  # We are going to find the minimum of this loss using gradient descent.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  # These are not part of training, but merely here so that we can report
  # accuracy figures as we train.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(
    tf.matmul(tf_valid_dataset, weights) + biases)
  test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)
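
For readers who want to see what the fused softmax and cross-entropy operation computes, here is a NumPy sketch. It is an illustration of the math, not TensorFlow's implementation; the function name `softmax_cross_entropy` and the toy logits are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    # One loss value per example: -sum_k labels_k * log(softmax_k).
    return -np.sum(labels * log_probs, axis=1)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)
loss = np.mean(per_example)   # the notebook averages with tf.reduce_mean
print(per_example, loss)
```

The second row has uniform logits over three classes, so its loss equals `log(3)`, a handy check when debugging.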

In [22]:
num_steps = 801

def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

with tf.Session(graph=graph) as session:
  # This is a one-time operation which ensures the parameters get initialized as
  # we described in the graph: random weights for the matrix, zeros for the
  # biases. 
  tf.initialize_all_variables().run()
  print('Initialized')
  for step in range(num_steps):
    # Run the computations. We tell .run() that we want to run the optimizer,
    # and get the loss value and the training predictions returned as numpy
    # arrays.
    _, l, predictions = session.run([optimizer, loss, train_prediction])
    if (step % 100 == 0):
      print('Loss at step %d: %f' % (step, l))
      print('Training accuracy: %.1f%%' % accuracy(
        predictions, train_labels[:train_subset, :]))
      # Calling .eval() on valid_prediction is basically like calling run(), but
      # just to get that one numpy array. Note that it recomputes all its graph
      # dependencies.
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Loss at step 0: 17.749409
Training accuracy: 10.2%
Validation accuracy: 12.3%
Loss at step 100: 2.337568
Training accuracy: 71.1%
Validation accuracy: 69.5%
Loss at step 200: 1.862736
Training accuracy: 74.3%
Validation accuracy: 72.2%
Loss at step 300: 1.613759
Training accuracy: 75.7%
Validation accuracy: 73.1%
Loss at step 400: 1.451089
Training accuracy: 76.7%
Validation accuracy: 73.7%
Loss at step 500: 1.332731
Training accuracy: 77.3%
Validation accuracy: 73.9%
Loss at step 600: 1.240709
Training accuracy: 77.9%
Validation accuracy: 74.1%
Loss at step 700: 1.165575
Training accuracy: 78.3%
Validation accuracy: 74.3%
Loss at step 800: 1.102303
Training accuracy: 78.7%
Validation accuracy: 74.4%
Test accuracy: 82.5%


In [23]:
# With gradient descent training, even this much data is prohibitive.
# Subset the training data for faster turnaround.
train_subset = 10000
beta=0.05

graph = tf.Graph()
with graph.as_default():

  # Input data.
  # Load the training, validation and test data into constants that are
  # attached to the graph.
  tf_train_dataset = tf.constant(train_dataset[:train_subset, :])
  tf_train_labels = tf.constant(train_labels[:train_subset])
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  # These are the parameters that we are going to be training. The weight
  # matrix will be initialized using random values following a (truncated)
  # normal distribution. The biases get initialized to zero.
  weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_labels]))
  biases = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  # We multiply the inputs with the weight matrix, and add biases. We compute
  # the softmax and cross-entropy (it's one operation in TensorFlow, because
  # it's very common, and it can be optimized). We take the average of this
  # cross-entropy across all training examples: that's our loss.
  logits = tf.matmul(tf_train_dataset, weights) + biases
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      logits=logits, labels=tf_train_labels))
  # Add the L2 penalty on the weights and biases, scaled by beta.
  loss += beta * (tf.nn.l2_loss(weights) + tf.nn.l2_loss(biases))
  
  # Optimizer.
  # We are going to find the minimum of this loss using gradient descent.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  # These are not part of training, but merely here so that we can report
  # accuracy figures as we train.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(
    tf.matmul(tf_valid_dataset, weights) + biases)
  test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [24]:
num_steps = 801

def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

with tf.Session(graph=graph) as session:
  # This is a one-time operation which ensures the parameters get initialized as
  # we described in the graph: random weights for the matrix, zeros for the
  # biases. 
  tf.initialize_all_variables().run()
  print('Initialized')
  for step in range(num_steps):
    # Run the computations. We tell .run() that we want to run the optimizer,
    # and get the loss value and the training predictions returned as numpy
    # arrays.
    _, l, predictions = session.run([optimizer, loss, train_prediction])
    if (step % 100 == 0):
      print('Loss at step %d: %f' % (step, l))
      print('Training accuracy: %.1f%%' % accuracy(
        predictions, train_labels[:train_subset, :]))
      # Calling .eval() on valid_prediction is basically like calling run(), but
      # just to get that one numpy array. Note that it recomputes all its graph
      # dependencies.
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Loss at step 0: 169.583466
Training accuracy: 9.6%
Validation accuracy: 12.4%
Loss at step 100: 1.746494
Training accuracy: 80.7%
Validation accuracy: 79.7%
Loss at step 200: 0.898433
Training accuracy: 81.5%
Validation accuracy: 80.5%
Loss at step 300: 0.893574
Training accuracy: 81.5%
Validation accuracy: 80.4%
Loss at step 400: 0.893545
Training accuracy: 81.5%
Validation accuracy: 80.4%
Loss at step 500: 0.893545
Training accuracy: 81.5%
Validation accuracy: 80.4%
Loss at step 600: 0.893545
Training accuracy: 81.5%
Validation accuracy: 80.4%
Loss at step 700: 0.893545
Training accuracy: 81.5%
Validation accuracy: 80.4%
Loss at step 800: 0.893545
Training accuracy: 81.5%
Validation accuracy: 80.4%
Test accuracy: 88.1%


In [31]:
# With gradient descent training, even this much data is prohibitive.
# Subset the training data for faster turnaround.
train_subset = 10000
beta=0.1

graph = tf.Graph()
with graph.as_default():

  # Input data.
  # Load the training, validation and test data into constants that are
  # attached to the graph.
  tf_train_dataset = tf.constant(train_dataset[:train_subset, :])
  tf_train_labels = tf.constant(train_labels[:train_subset])
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  # These are the parameters that we are going to be training. The weight
  # matrix will be initialized using random values following a (truncated)
  # normal distribution. The biases get initialized to zero.
  weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_labels]))
  biases = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  # We multiply the inputs with the weight matrix, and add biases. We compute
  # the softmax and cross-entropy (it's one operation in TensorFlow, because
  # it's very common, and it can be optimized). We take the average of this
  # cross-entropy across all training examples: that's our loss.
  logits = tf.matmul(tf_train_dataset, weights) + biases
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      logits=logits, labels=tf_train_labels))
  # Add the L2 penalty on the weights and biases, scaled by beta.
  loss += beta * (tf.nn.l2_loss(weights) + tf.nn.l2_loss(biases))
  
  # Optimizer.
  # We are going to find the minimum of this loss using gradient descent.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  # These are not part of training, but merely here so that we can report
  # accuracy figures as we train.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(
    tf.matmul(tf_valid_dataset, weights) + biases)
  test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [32]:
num_steps = 801

def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

with tf.Session(graph=graph) as session:
  # This is a one-time operation which ensures the parameters get initialized as
  # we described in the graph: random weights for the matrix, zeros for the
  # biases. 
  tf.initialize_all_variables().run()
  print('Initialized')
  for step in range(num_steps):
    # Run the computations. We tell .run() that we want to run the optimizer,
    # and get the loss value and the training predictions returned as numpy
    # arrays.
    _, l, predictions = session.run([optimizer, loss, train_prediction])
    if (step % 100 == 0):
      print('Loss at step %d: %f' % (step, l))
      print('Training accuracy: %.1f%%' % accuracy(
        predictions, train_labels[:train_subset, :]))
      # Calling .eval() on valid_prediction is basically like calling run(), but
      # just to get that one numpy array. Note that it recomputes all its graph
      # dependencies.
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Loss at step 0: 320.035522
Training accuracy: 8.8%
Validation accuracy: 11.2%
Loss at step 100: 1.025233
Training accuracy: 80.6%
Validation accuracy: 79.9%
Loss at step 200: 1.016166
Training accuracy: 80.5%
Validation accuracy: 79.9%
Loss at step 300: 1.016166
Training accuracy: 80.5%
Validation accuracy: 79.9%
Loss at step 400: 1.016166
Training accuracy: 80.5%
Validation accuracy: 79.9%
Loss at step 500: 1.016166
Training accuracy: 80.5%
Validation accuracy: 79.9%
Loss at step 600: 1.016166
Training accuracy: 80.5%
Validation accuracy: 79.9%
Loss at step 700: 1.016166
Training accuracy: 80.5%
Validation accuracy: 79.9%
Loss at step 800: 1.016166
Training accuracy: 80.5%
Validation accuracy: 79.9%
Test accuracy: 87.4%


---
Conclusion
---
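With beta = 0.05, the logistic model's final validation accuracy rises from 74.4% (beta = 0) to 80.4%, and its test accuracy from 82.5% to 88.1%. Doubling beta to 0.1 gives slightly lower figures (79.9% validation, 87.4% test), which suggests that the larger penalty is starting to underfit.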
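
The comparison can be written as a small selection step. This sketch copies the final validation accuracies from the three logistic regression runs above and picks beta on the validation set, so that the test set is kept for the final report only; the dictionary name is hypothetical.

```python
# Final validation accuracies (%) from the three runs above, keyed by beta.
validation_accuracy = {0.0: 74.4, 0.05: 80.4, 0.1: 79.9}

# Select the beta with the highest validation accuracy.
best_beta = max(validation_accuracy, key=validation_accuracy.get)
print('Best beta by validation accuracy:', best_beta)  # prints 0.05
```

Three values is a coarse grid; a finer sweep, for example on a logarithmic scale, may locate a better setting.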

---
L2 regularization of a neural network with one hidden layer
---

In [43]:
batch_size = 128
num_hidden_layers = 1024  # number of units in the single hidden layer, despite the name
beta = 0.05

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  weights_01 = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_layers]))
  biases_01 = tf.Variable(tf.zeros([num_hidden_layers]))
  weights_12 = tf.Variable(
    tf.truncated_normal([num_hidden_layers,num_labels]))
  biases_12 = tf.Variable(tf.zeros([num_labels]))
  
    
  # Training computation.
  z_01 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_01) + biases_01)
  z_02 = tf.matmul(z_01, weights_12) + biases_12
    
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      logits=z_02, labels=tf_train_labels))
  # Add the L2 penalty on the weight matrices only (biases are excluded).
  loss += beta * (tf.nn.l2_loss(weights_01) + tf.nn.l2_loss(weights_12))
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_02)
  
    
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.matmul(z_01_valid, weights_12) + biases_12
  valid_prediction = tf.nn.softmax(z_02_valid)

  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.matmul(z_01_test, weights_12) + biases_12
  test_prediction = tf.nn.softmax(z_02_test)
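
The graph above can be mirrored in a few lines of NumPy to check shapes and the penalty term. This is a sketch with hypothetical toy sizes and random inputs, not the notebook's 784-1024-10 network.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, num_inputs, num_hidden, num_labels = 4, 6, 5, 3  # toy sizes

x = rng.randn(batch_size, num_inputs)
weights_01 = rng.randn(num_inputs, num_hidden)
biases_01 = np.zeros(num_hidden)
weights_12 = rng.randn(num_hidden, num_labels)
biases_12 = np.zeros(num_labels)

# Hidden layer with ReLU, followed by a linear output layer (the logits).
hidden = np.maximum(x.dot(weights_01) + biases_01, 0.0)
logits = hidden.dot(weights_12) + biases_12

# L2 penalty on the weight matrices only, scaled by beta (biases excluded).
beta = 0.05
penalty = beta * (np.sum(weights_01 ** 2) / 2.0 + np.sum(weights_12 ** 2) / 2.0)
print(hidden.shape, logits.shape, penalty)
```

Because the penalty sums over every weight, it grows with the size of the weight matrices, which is one reason the step-0 loss of the full network is so large.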

In [44]:
num_steps = 20001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 1000 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 16010.475586
Minibatch accuracy: 10.2%
Validation accuracy: 24.2%
Minibatch loss at step 1000: 1.276807
Minibatch accuracy: 73.4%
Validation accuracy: 79.8%
Minibatch loss at step 2000: 1.095932
Minibatch accuracy: 78.1%
Validation accuracy: 76.0%
Minibatch loss at step 3000: 1.255746
Minibatch accuracy: 75.8%
Validation accuracy: 79.7%
Minibatch loss at step 4000: 1.029670
Minibatch accuracy: 81.2%
Validation accuracy: 80.4%
Minibatch loss at step 5000: 1.147048
Minibatch accuracy: 78.1%
Validation accuracy: 78.2%
Minibatch loss at step 6000: 1.224494
Minibatch accuracy: 75.8%
Validation accuracy: 76.7%
Minibatch loss at step 7000: 1.071976
Minibatch accuracy: 79.7%
Validation accuracy: 80.0%
Minibatch loss at step 8000: 1.135514
Minibatch accuracy: 72.7%
Validation accuracy: 77.9%
Minibatch loss at step 9000: 1.028635
Minibatch accuracy: 84.4%
Validation accuracy: 79.6%
Minibatch loss at step 10000: 1.193546
Minibatch accuracy: 78.9%
Validation a

In [39]:
# Training with bias regularization (beta = 0.01).
batch_size = 128
num_hidden_layers = 1024
beta=0.01

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  weights_01 = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_layers]))
  biases_01 = tf.Variable(tf.zeros([num_hidden_layers]))
  weights_12 = tf.Variable(
    tf.truncated_normal([num_hidden_layers,num_labels]))
  biases_12 = tf.Variable(tf.zeros([num_labels]))
  
    
  # Training computation.
  z_01 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_01) + biases_01)
  z_02 = tf.matmul(z_01, weights_12) + biases_12
    
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      logits=z_02, labels=tf_train_labels))
  # Add the L2 penalty on all weights and biases, scaled by beta.
  loss += beta * (tf.nn.l2_loss(weights_01) + tf.nn.l2_loss(biases_01)
                  + tf.nn.l2_loss(weights_12) + tf.nn.l2_loss(biases_12))
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_02)
  
    
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.matmul(z_01_valid, weights_12) + biases_12
  valid_prediction = tf.nn.softmax(z_02_valid)

  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.matmul(z_01_test, weights_12) + biases_12
  test_prediction = tf.nn.softmax(z_02_test)

In [40]:
num_steps = 20001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 1000 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3545.948486
Minibatch accuracy: 7.8%
Validation accuracy: 32.1%
Minibatch loss at step 1000: 1.034058
Minibatch accuracy: 79.7%
Validation accuracy: 83.3%
Minibatch loss at step 2000: 0.714004
Minibatch accuracy: 84.4%
Validation accuracy: 81.6%
Minibatch loss at step 3000: 0.831817
Minibatch accuracy: 80.5%
Validation accuracy: 82.9%
Minibatch loss at step 4000: 0.637632
Minibatch accuracy: 85.2%
Validation accuracy: 83.2%
Minibatch loss at step 5000: 0.842441
Minibatch accuracy: 82.0%
Validation accuracy: 82.5%
Minibatch loss at step 6000: 0.858152
Minibatch accuracy: 78.9%
Validation accuracy: 81.6%
Minibatch loss at step 7000: 0.699000
Minibatch accuracy: 85.2%
Validation accuracy: 82.5%
Minibatch loss at step 8000: 0.797049
Minibatch accuracy: 79.7%
Validation accuracy: 81.1%
Minibatch loss at step 9000: 0.683247
Minibatch accuracy: 84.4%
Validation accuracy: 82.7%
Minibatch loss at step 10000: 0.796071
Minibatch accuracy: 84.4%
Validation acc

In [41]:
# Training without bias regularization at beta = 0.001.
batch_size = 128
num_hidden_layers = 1024
beta=0.001

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  weights_01 = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_layers]))
  biases_01 = tf.Variable(tf.zeros([num_hidden_layers]))
  weights_12 = tf.Variable(
    tf.truncated_normal([num_hidden_layers,num_labels]))
  biases_12 = tf.Variable(tf.zeros([num_labels]))
  
    
  # Training computation.
  z_01 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_01) + biases_01)
  z_02 = tf.matmul(z_01, weights_12) + biases_12
    
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      logits=z_02, labels=tf_train_labels))
  # Add the L2 penalty on the weight matrices only (biases are excluded).
  loss += beta * (tf.nn.l2_loss(weights_01) + tf.nn.l2_loss(weights_12))
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_02)
  
    
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.matmul(z_01_valid, weights_12) + biases_12
  valid_prediction = tf.nn.softmax(z_02_valid)

  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.matmul(z_01_test, weights_12) + biases_12
  test_prediction = tf.nn.softmax(z_02_test)

In [42]:
num_steps = 20001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 1000 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 605.349792
Minibatch accuracy: 10.9%
Validation accuracy: 33.0%
Minibatch loss at step 1000: 115.583450
Minibatch accuracy: 79.7%
Validation accuracy: 80.3%
Minibatch loss at step 2000: 41.456627
Minibatch accuracy: 82.8%
Validation accuracy: 83.8%
Minibatch loss at step 3000: 15.530670
Minibatch accuracy: 88.3%
Validation accuracy: 86.5%
Minibatch loss at step 4000: 5.910212
Minibatch accuracy: 90.6%
Validation accuracy: 87.2%
Minibatch loss at step 5000: 2.623655
Minibatch accuracy: 84.4%
Validation accuracy: 87.8%
Minibatch loss at step 6000: 1.342641
Minibatch accuracy: 83.6%
Validation accuracy: 87.7%
Minibatch loss at step 7000: 0.778712
Minibatch accuracy: 90.6%
Validation accuracy: 87.8%
Minibatch loss at step 8000: 0.558711
Minibatch accuracy: 89.1%
Validation accuracy: 87.8%
Minibatch loss at step 9000: 0.478707
Minibatch accuracy: 92.2%
Validation accuracy: 88.1%
Minibatch loss at step 10000: 0.558133
Minibatch accuracy: 89.8%
Validation

---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

We restrict the available training data to the first 128 examples of the training set (a single minibatch) by fixing the minibatch offset at zero instead of advancing it at every step. The network therefore sees the same small subset at every step and soon overfits: minibatch accuracy reaches 100% while validation accuracy stalls at 64.8%.
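
To quantify the restriction, the sketch below counts how many distinct training examples each offset schedule touches over the run. The helper names `offsets` and `distinct_examples` are hypothetical; the sizes match the values used in this notebook.

```python
batch_size = 128
num_examples = 200000     # size of the training set loaded above
num_steps = 3001

def offsets(fixed):
    # fixed=True reproduces the overfitting setup; fixed=False is the usual schedule.
    for step in range(num_steps):
        yield 0 if fixed else (step * batch_size) % (num_examples - batch_size)

def distinct_examples(fixed):
    # Collect the index of every training example that appears in some minibatch.
    seen = set()
    for offset in offsets(fixed):
        seen.update(range(offset, offset + batch_size))
    return len(seen)

print(distinct_examples(fixed=True))    # 128: the same minibatch at every step
print(distinct_examples(fixed=False))   # nearly the whole training set
```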

---

In [11]:
batch_size = 128
num_hidden_layers = 1024

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  weights_01 = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_layers]))
  biases_01 = tf.Variable(tf.zeros([num_hidden_layers]))
  weights_12 = tf.Variable(
    tf.truncated_normal([num_hidden_layers,num_labels]))
  biases_12 = tf.Variable(tf.zeros([num_labels]))
  
    
  # Training computation.
  z_01 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_01) + biases_01)
  z_02 = tf.matmul(z_01, weights_12) + biases_12
    
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      logits=z_02, labels=tf_train_labels))
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_02)
  
    
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.matmul(z_01_valid, weights_12) + biases_12
  valid_prediction = tf.nn.softmax(z_02_valid)

  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.matmul(z_01_test, weights_12) + biases_12
  test_prediction = tf.nn.softmax(z_02_test)

In [12]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Keep the offset fixed at zero so that every step trains on the same
    # minibatch (the first batch_size examples). The usual schedule is kept
    # in the trailing comment for reference.
    offset = 0 #(step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 337.187286
Minibatch accuracy: 10.9%
Validation accuracy: 32.5%
Minibatch loss at step 500: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 64.8%
Minibatch loss at step 1000: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 64.8%
Minibatch loss at step 1500: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 64.8%
Minibatch loss at step 2000: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 64.8%
Minibatch loss at step 2500: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 64.8%
Minibatch loss at step 3000: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 64.8%
Test accuracy: 71.1%


---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?
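
Before wiring dropout into the graph, it may help to see the mechanism in isolation. The following NumPy sketch implements inverted dropout, the scheme `tf.nn.dropout` uses: units are kept with probability `keep_prob` and the survivors are scaled by `1 / keep_prob`. The helper `dropout` and the all-ones activations are hypothetical stand-ins.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    # Inverted dropout: zero each unit with probability (1 - keep_prob) and
    # scale the survivors by 1 / keep_prob so the expected activation is unchanged.
    if keep_prob >= 1.0:
        return x
    mask = rng.uniform(size=x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)
hidden = np.ones((1000, 100))   # hypothetical hidden-layer activations

train_out = dropout(hidden, keep_prob=0.5, rng=rng)   # stochastic, training only
eval_out = dropout(hidden, keep_prob=1.0, rng=rng)    # identity at evaluation

print(train_out.mean(), eval_out.mean())
```

Feeding `keep_prob` as 1.0 at evaluation time is what keeps the validation and test predictions deterministic.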

---

In [13]:

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  keep_prob = tf.placeholder(tf.float32)
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  weights_01 = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_layers]))
  biases_01 = tf.Variable(tf.zeros([num_hidden_layers]))
  weights_12 = tf.Variable(
    tf.truncated_normal([num_hidden_layers,num_labels]))
  biases_12 = tf.Variable(tf.zeros([num_labels]))
  
    
  # Training computation.
  logits = tf.matmul(tf_train_dataset, weights_01) + biases_01
  z_01 = tf.nn.relu(logits)
  h_drop = tf.nn.dropout(z_01, keep_prob)
  z_02 = tf.matmul(h_drop, weights_12) + biases_12
    
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(z_02, tf_train_labels))
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_02)
    
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.matmul(z_01_valid, weights_12) + biases_12
  valid_prediction = tf.nn.softmax(z_02_valid)

  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.matmul(z_01_test, weights_12) + biases_12
  test_prediction = tf.nn.softmax(z_02_test)

In [18]:
# we use the same restricted training set as before, i.e. we set offset to zero.

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = 0 # (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    
    if (step%500 == 0):
      # Reporting step: keep_prob = 1.0 disables dropout, so the printed
      # minibatch metrics are deterministic. Note that the optimizer still
      # runs here, i.e. this is one training step without dropout.
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 1.0}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    
    else:
      # Regular training step: keep each hidden activation with probability 0.5.
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 0.5}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)
    
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 328.940002
Minibatch accuracy: 8.6%
Validation accuracy: 34.0%
Minibatch loss at step 500: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 71.9%
Minibatch loss at step 1000: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 71.3%
Minibatch loss at step 1500: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 71.6%
Minibatch loss at step 2000: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 72.0%
Minibatch loss at step 2500: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 71.1%
Minibatch loss at step 3000: 0.000000
Minibatch accuracy: 100.0%
Validation accuracy: 71.3%
Test accuracy: 78.8%


---
Conclusion
---
Dropout regularization noticeably reduces the overfitting problem: test accuracy rises from 71.1% to 78.8%.

---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
 ---
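
To get a feel for what such a schedule does, the following plain-Python sketch mirrors the documented formula of `tf.train.exponential_decay`: `learning_rate * decay_rate ** (global_step / decay_steps)`, with integer division when `staircase=True`. The concrete numbers (start rate 0.1, `decay_steps` 1000, `decay_rate` 0.96) are the ones used in the cells of this notebook; the function itself is a stand-in written for illustration, not the TensorFlow implementation.

```python
def exponential_decay(base_rate, step, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` optimizer steps."""
    if staircase:
        # The rate only changes once every `decay_steps` steps.
        exponent = step // decay_steps
    else:
        exponent = step / float(decay_steps)
    return base_rate * decay_rate ** exponent

for step in (0, 999, 1000, 10000, 40000):
    rate = exponential_decay(0.1, step, 1000, 0.96, staircase=True)
    print("step %5d: learning rate %.4f" % (step, rate))
```

With these settings the rate falls slowly: it is still about two thirds of its initial value after 10000 steps and about a fifth after 40000 steps.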


Let's try splitting the hidden layer into two smaller hidden layers, using ordinary (L2) regularization and learning rate decay.
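
The regularization term used in the next cell adds `beta` times the sum of `tf.nn.l2_loss` over all weight matrices to the cross-entropy loss, where `tf.nn.l2_loss(w)` is `sum(w ** 2) / 2`. A small NumPy sketch of that arithmetic follows; the helper names and the toy weights are assumptions made for this example.

```python
import numpy as np

def l2_loss(w):
    """Same definition as tf.nn.l2_loss: half the sum of squared entries."""
    return 0.5 * np.sum(np.square(w))

def regularized_loss(data_loss, weights, beta):
    """Cross-entropy (data) loss plus the weighted L2 penalty."""
    return data_loss + beta * sum(l2_loss(w) for w in weights)

# Toy weights, chosen so the penalty is easy to check by hand.
w1 = np.full((2, 3), 2.0)   # l2_loss = 0.5 * 6 * 4 = 12.0
w2 = np.ones((3, 1))        # l2_loss = 0.5 * 3 * 1 = 1.5

total = regularized_loss(data_loss=0.5, weights=[w1, w2], beta=1e-4)
print("regularized loss: %.5f" % total)
```

Biases are left out of the penalty here, as in the notebook's loss.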

In [52]:
batch_size = 128
num_hidden_layer1 = 1024
num_hidden_layer2 = 128
beta=1e-4

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  #keep_prob = tf.placeholder(tf.float32)
  
  # Variables.
  weights_01 = tf.Variable(tf.truncated_normal([image_size * image_size, num_hidden_layer1],stddev=1.0/num_hidden_layer1))
  biases_01  = tf.Variable(tf.zeros([num_hidden_layer1]))
  weights_12 = tf.Variable(tf.truncated_normal([num_hidden_layer1,num_hidden_layer2],stddev=1.0/num_hidden_layer2))
  biases_12  = tf.Variable(tf.zeros([num_hidden_layer2]))
  weights_23 = tf.Variable(tf.truncated_normal([num_hidden_layer2,num_labels],stddev=1.0/num_labels))
  biases_23  = tf.Variable(tf.zeros([num_labels]))  
  
  global_step = tf.Variable(0)  # for learning rate decay
  
  # Training computation.
  z_01 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_01) + biases_01)
  z_02 = tf.nn.relu(tf.matmul(z_01, weights_12) + biases_12)
  # h_drop = tf.nn.dropout(z_02, keep_prob)  
  z_03 = tf.matmul(z_02, weights_23) + biases_23
     
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(z_03, tf_train_labels)) \
         + beta*(tf.nn.l2_loss(weights_01)+tf.nn.l2_loss(weights_12)+tf.nn.l2_loss(weights_23))
  
  #optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)  # fixed learning rate optimizer
  starter_learning_rate = 0.1
  learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 1000, 0.96, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_03)
      
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.nn.relu(tf.matmul(z_01_valid, weights_12) + biases_12)
  z_03_valid = tf.matmul(z_02_valid, weights_23) + biases_23
  valid_prediction = tf.nn.softmax(z_03_valid)
    
  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.nn.relu(tf.matmul(z_01_test, weights_12) + biases_12)
  z_03_test = tf.matmul(z_02_test, weights_23) + biases_23
  test_prediction = tf.nn.softmax(z_03_test)

In [53]:
num_steps =  10001
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    
    if (step%500 == 0):
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
      _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)   
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    
    else:
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)     
    
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.303339
Minibatch accuracy: 15.6%
Validation accuracy: 25.4%
Minibatch loss at step 500: 0.493493
Minibatch accuracy: 85.2%
Validation accuracy: 82.8%
Minibatch loss at step 1000: 0.683289
Minibatch accuracy: 82.0%
Validation accuracy: 84.1%
Minibatch loss at step 1500: 0.315254
Minibatch accuracy: 91.4%
Validation accuracy: 85.3%
Minibatch loss at step 2000: 0.434237
Minibatch accuracy: 86.7%
Validation accuracy: 85.5%
Minibatch loss at step 2500: 0.359008
Minibatch accuracy: 89.1%
Validation accuracy: 86.2%
Minibatch loss at step 3000: 0.470756
Minibatch accuracy: 85.2%
Validation accuracy: 87.0%
Minibatch loss at step 3500: 0.525446
Minibatch accuracy: 84.4%
Validation accuracy: 87.7%
Minibatch loss at step 4000: 0.273722
Minibatch accuracy: 91.4%
Validation accuracy: 87.9%
Minibatch loss at step 4500: 0.304876
Minibatch accuracy: 92.2%
Validation accuracy: 88.2%
Minibatch loss at step 5000: 0.464231
Minibatch accuracy: 85.9%
Validation accurac

---
Conclusions
---
1. Initializing the weights with **small** numbers was crucial for numerical stability; otherwise the loss turns into NaNs.
2. Two equal fully connected hidden layers of 256 units each lead to 95% test accuracy.
3. The regularization parameter seemed to be optimal for beta between 1e-5 and 1e-4 for these equal-sized hidden layers.
4. Adding learning rate decay on top of that didn't improve the test accuracy significantly.
5. Two fully connected hidden layers of 512 and 256 units lead to 95.1% test accuracy (no hyperparameter tuning).
6. Two fully connected hidden layers of 1024 and 128 units lead to 95.2% test accuracy (beta=1e-4).
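
The effect of the initialization scale is visible in the logs: the runs with small initial weights start at a loss of about 2.30, which is ln(10), the cross-entropy of a uniform guess over 10 classes, while the earlier runs with the default `stddev=1.0` start above 300. The NumPy sketch below reproduces that contrast on random logits. The logit scales (1e-3 and 300) are assumptions picked for illustration, and the numerically stable softmax used here will not itself produce NaNs.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    """Mean cross-entropy, computed with the usual max-shift for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1).mean()

rng = np.random.RandomState(0)
num_labels = 10
labels = np.eye(num_labels)[rng.randint(num_labels, size=128)]

small_logits = 1e-3 * rng.randn(128, num_labels)   # tiny initial weights
large_logits = 300.0 * rng.randn(128, num_labels)  # large initial weights

small_loss = softmax_cross_entropy(small_logits, labels)
large_loss = softmax_cross_entropy(large_logits, labels)

print("loss with small logits: %.4f (ln 10 = %.4f)" % (small_loss, np.log(10)))
print("loss with large logits: %.1f" % large_loss)
```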

---
Dropout regularization
---
Test the same network as before, but now with dropout regularization instead of the L2 penalty.

In [78]:
batch_size = 128
num_hidden_layer1 = 1024
num_hidden_layer2 = 128

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  keep_prob = tf.placeholder(tf.float32)
  
  # Variables.
  weights_01 = tf.Variable(tf.truncated_normal([image_size * image_size, num_hidden_layer1],stddev=1.0/num_hidden_layer1))
  biases_01  = tf.Variable(tf.zeros([num_hidden_layer1]))
  weights_12 = tf.Variable(tf.truncated_normal([num_hidden_layer1,num_hidden_layer2],stddev=1.0/num_hidden_layer2))
  biases_12  = tf.Variable(tf.zeros([num_hidden_layer2]))
  weights_23 = tf.Variable(tf.truncated_normal([num_hidden_layer2,num_labels],stddev=1.0/num_labels))
  biases_23  = tf.Variable(tf.zeros([num_labels]))  
  
  global_step = tf.Variable(0)  # for learning rate decay
  
  # Training computation.
  z_01 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_01) + biases_01)
  z_02 = tf.nn.relu(tf.matmul(z_01, weights_12) + biases_12)
  h_drop = tf.nn.dropout(z_02, keep_prob)  
  z_03 = tf.matmul(h_drop, weights_23) + biases_23
     
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(z_03, tf_train_labels)) 
  
  #optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)  # fixed learning rate optimizer
  starter_learning_rate = 0.1
  learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 1000, 0.96, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_03)
      
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.nn.relu(tf.matmul(z_01_valid, weights_12) + biases_12)
  z_03_valid = tf.matmul(z_02_valid, weights_23) + biases_23
  valid_prediction = tf.nn.softmax(z_03_valid)
    
  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.nn.relu(tf.matmul(z_01_test, weights_12) + biases_12)
  z_03_test = tf.matmul(z_02_test, weights_23) + biases_23
  test_prediction = tf.nn.softmax(z_03_test)

In [79]:
num_steps =  40001
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    
    if (step%5000 == 0):
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 1.0}
      _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)   
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    
    else:
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)     
    
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.302588
Minibatch accuracy: 4.7%
Validation accuracy: 28.5%
Minibatch loss at step 5000: 0.458427
Minibatch accuracy: 85.9%
Validation accuracy: 88.4%
Minibatch loss at step 10000: 0.339566
Minibatch accuracy: 89.1%
Validation accuracy: 89.5%
Minibatch loss at step 15000: 0.171588
Minibatch accuracy: 95.3%
Validation accuracy: 90.0%
Minibatch loss at step 20000: 0.189020
Minibatch accuracy: 93.0%
Validation accuracy: 90.5%
Minibatch loss at step 25000: 0.247092
Minibatch accuracy: 93.0%
Validation accuracy: 90.8%
Minibatch loss at step 30000: 0.172979
Minibatch accuracy: 96.1%
Validation accuracy: 90.9%
Minibatch loss at step 35000: 0.178730
Minibatch accuracy: 93.0%
Validation accuracy: 90.8%
Minibatch loss at step 40000: 0.191066
Minibatch accuracy: 94.5%
Validation accuracy: 91.0%
Test accuracy: 96.0%


---
Results of dropout regularization on a 2 hidden layer network
---
Fully connected networks with dropout applied to the output of the last hidden layer and no other regularization; keep_prob = 0.5. Test accuracies:

1.  256 x 256, 95.1% @ 10001 iterations
2.  512 x 256, 95.2% @ 10001 iterations
3. 1024 x 128, 95.2% @ 10001 iterations
4. 1024 x 128, 96.0% @ 40001 iterations

---
Conclusions
---
The networks trained here using dropout regularization match the performance of those trained using ordinary regularization. No hyperparameter tuning was necessary (apart from choosing keep_prob = 0.5).



---
Now try 3 hidden layers
---
Let's see whether we can still improve by adding another hidden layer.

In [76]:
batch_size = 128
num_hidden_layer1 = 1024
num_hidden_layer2 = 128
num_hidden_layer3 = 64

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  keep_prob = tf.placeholder(tf.float32)
  
  # Variables.
  weights_01 = tf.Variable(tf.truncated_normal([image_size * image_size, num_hidden_layer1],stddev=1.0/num_hidden_layer1))
  biases_01  = tf.Variable(tf.zeros([num_hidden_layer1]))
  weights_12 = tf.Variable(tf.truncated_normal([num_hidden_layer1,num_hidden_layer2],stddev=1.0/num_hidden_layer2))
  biases_12  = tf.Variable(tf.zeros([num_hidden_layer2]))
  weights_23 = tf.Variable(tf.truncated_normal([num_hidden_layer2,num_hidden_layer3],stddev=1.0/num_hidden_layer3))
  biases_23  = tf.Variable(tf.zeros([num_hidden_layer3]))  
  weights_34 = tf.Variable(tf.truncated_normal([num_hidden_layer3,num_labels],stddev=1.0/num_labels))
  biases_34  = tf.Variable(tf.zeros([num_labels]))  


  global_step = tf.Variable(0)  # for learning rate decay
  
  # Training computation.
  z_01 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_01) + biases_01)
  z_02 = tf.nn.relu(tf.matmul(z_01, weights_12) + biases_12)
  z_03 = tf.nn.relu(tf.matmul(z_02, weights_23) + biases_23)
  h_drop = tf.nn.dropout(z_03, keep_prob)  
  z_04 = tf.matmul(h_drop, weights_34) + biases_34
     
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(z_04, tf_train_labels)) 
  
  #optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)  # fixed learning rate optimizer
  starter_learning_rate = 0.1
  learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 1000, 0.96, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(z_04)
      
  z_01_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_01) + biases_01)
  z_02_valid = tf.nn.relu(tf.matmul(z_01_valid, weights_12) + biases_12)
  z_03_valid = tf.nn.relu(tf.matmul(z_02_valid, weights_23) + biases_23)
  z_04_valid = tf.matmul(z_03_valid, weights_34) + biases_34
  valid_prediction = tf.nn.softmax(z_04_valid)
    
  z_01_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_01) + biases_01)
  z_02_test = tf.nn.relu(tf.matmul(z_01_test, weights_12) + biases_12)
  z_03_test = tf.nn.relu(tf.matmul(z_02_test, weights_23) + biases_23)
  z_04_test = tf.matmul(z_03_test, weights_34) + biases_34
  test_prediction = tf.nn.softmax(z_04_test)

In [77]:
num_steps =  40001
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    
    if (step%5000 == 0):
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 1.0}
      _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)   
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    
    else:
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)     
    
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.302577
Minibatch accuracy: 13.3%
Validation accuracy: 10.0%
Minibatch loss at step 5000: 0.460902
Minibatch accuracy: 85.2%
Validation accuracy: 87.4%
Minibatch loss at step 10000: 0.382691
Minibatch accuracy: 89.1%
Validation accuracy: 89.0%
Minibatch loss at step 15000: 0.207812
Minibatch accuracy: 93.0%
Validation accuracy: 89.7%
Minibatch loss at step 20000: 0.201253
Minibatch accuracy: 93.0%
Validation accuracy: 90.3%
Minibatch loss at step 25000: 0.216620
Minibatch accuracy: 93.0%
Validation accuracy: 90.5%
Minibatch loss at step 30000: 0.164299
Minibatch accuracy: 94.5%
Validation accuracy: 90.7%
Minibatch loss at step 35000: 0.124943
Minibatch accuracy: 95.3%
Validation accuracy: 90.9%
Minibatch loss at step 40000: 0.191021
Minibatch accuracy: 93.8%
Validation accuracy: 90.7%
Test accuracy: 96.0%


---
Conclusions
---
We used dropout regularization and 10001 iterations.
Adding a third fully connected layer had no positive effect on the test accuracy for the following networks:
1.  256 x 256 x 256 : 94.5% , batch_size: 128 @ 10001 iterations
2.  512 x 256 x 128 : 94.7% , batch_size: 128 @ 10001 iterations
3. 1024 x 256 x 64  : 94.9% , batch_sizes: 128, 256 @ 10001 iterations



Thus there is no significant improvement over the network with 2 fully connected hidden layers in these experiments. But maybe longer training will improve the accuracy? Try 20001 and 40001 iterations.
1. 1024 x 256 x 64  : 95.4%, batch_size: 128 @ 20001 iterations
2. 1024 x 256 x 64  : 95.8%, batch_size: 128 @ 40001 iterations

So was it just the lack of training that prevented better results? 
To check we reduce the second hidden layer by a factor of two.

1. 1024 x 128 x 64  : 96.0%,  batch_size: 128 @ 40001 iterations

(For comparison we also have to train the two-hidden-layer networks longer. Result:
1024 x 128 : also 96.0%    batch_size: 128 @ 40001 iterations)

Longer training did significantly improve the test accuracy!
It's reasonable to assume that longer training will still improve the accuracies of either 2 or 3 hidden-layered models. But without a decent GPU I'm not going to test this any further.
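
To put the iteration counts into perspective, the short calculation below converts steps into passes over the training set, using the batch size of 128 and the 200000 training examples loaded at the top of the notebook. The helper name `epochs_seen` is made up for this sketch.

```python
def epochs_seen(num_steps, batch_size, num_train_examples):
    """Number of passes over the training set after `num_steps` minibatches."""
    return num_steps * batch_size / float(num_train_examples)

for num_steps in (10001, 20001, 40001):
    print("%5d steps ~ %.1f epochs" % (num_steps, epochs_seen(num_steps, 128, 200000)))
```

Even the longest run therefore sees each training example only about 26 times.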

