Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in `1_notmnist.ipynb`.

In [2]:
pickle_file = 'notMNIST_sanitized.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

('Training set', (200000, 28, 28), (200000,))
('Validation set', (9633, 28, 28), (9633,))
('Test set', (9604, 28, 28), (9604,))


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

('Training set', (200000, 784), (200000, 10))
('Validation set', (9633, 784), (9633, 10))
('Test set', (9604, 784), (9604, 10))
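The one-hot conversion in `reformat` relies on NumPy broadcasting, which is easy to check on a toy input. The sketch below uses three made-up labels (not taken from the dataset) and the same comparison expression.

```python
import numpy as np

num_labels = 10
labels = np.array([0, 2, 9])  # toy integer class labels, for illustration only

# labels[:, None] has shape (3, 1); np.arange(num_labels) has shape (10,).
# Broadcasting the comparison yields a (3, 10) boolean matrix with a single
# True per row, in the column whose index equals the label.
one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)

print(one_hot.shape)               # (3, 10)
print(np.argmax(one_hot, axis=1))  # [0 2 9]
```

Taking `argmax` along axis 1 recovers the integer labels, which is what the `accuracy` function below does for both predictions and labels.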


In [4]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) 
            / predictions.shape[0])

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.

---
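As a reference for what the penalty looks like numerically, here is a minimal NumPy sketch. It assumes the documented behaviour of `tf.nn.l2_loss`, namely `sum(t ** 2) / 2`. The helper names `l2_loss` and `regularized_loss` and the coefficient name `beta` are made up for illustration and are not part of the assignment code.

```python
import numpy as np

def l2_loss(t):
    """NumPy equivalent of tf.nn.l2_loss: sum(t ** 2) / 2."""
    return 0.5 * np.sum(np.square(t))

def regularized_loss(data_loss, weights, beta):
    """Data loss plus beta times the summed L2 penalties of the weight tensors.

    `beta` is a hypothetical name for the regularization strength.
    """
    return data_loss + beta * sum(l2_loss(w) for w in weights)

# Toy weight matrices, chosen so the arithmetic is easy to follow.
w1 = np.array([[1.0, -2.0], [0.5, 0.0]])
w2 = np.array([[3.0], [4.0]])

penalty1 = l2_loss(w1)  # 0.5 * (1 + 4 + 0.25 + 0) = 2.625
penalty2 = l2_loss(w2)  # 0.5 * (9 + 16) = 12.5
total = regularized_loss(0.7, [w1, w2], beta=0.01)  # 0.7 + 0.01 * 15.125
print(penalty1, penalty2, total)
```

Only the weight matrices are penalized here, not the biases, which matches the graph defined in the next cell.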

In [5]:
# From previous problem set, with modifications

batch_size = 128
num_relus = 1024
l2_w1par = 0.01
l2_w2par = 0.01


graph = tf.Graph()
with graph.as_default():

    # For training data, use a placeholder that will be fed a minibatch at runtime.
    tf_train_dataset = tf.placeholder(tf.float32, 
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
  
    # Variables.
    # Our network now has one set of weights and biases producing an output vector of
    # size `num_relus`. We apply the ReLUs and then feed that vector into a second
    # weight matrix + bias vector that produces a vector of size `num_labels`,
    # which is fed into our softmax.
    weights1 = tf.Variable(tf.truncated_normal([image_size * image_size, num_relus]))
    biases1 = tf.Variable(tf.zeros([num_relus]))
    weights2 = tf.Variable(tf.truncated_normal([num_relus, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

  
    # Training computation.
    logits1 = tf.matmul(tf_train_dataset, weights1) + biases1
    activations = tf.nn.relu(logits1)
    logits2 = tf.matmul(activations, weights2) + biases2

    # add L2 loss terms: https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#l2_loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits2, tf_train_labels) + 
                         l2_w1par * tf.nn.l2_loss(weights1) + 
                         l2_w2par * tf.nn.l2_loss(weights2))
  
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits2)
    # validation set...
    val_logits1 = tf.matmul(tf_valid_dataset, weights1) + biases1
    val_activations = tf.nn.relu(val_logits1)
    val_logits2 = tf.matmul(val_activations, weights2) + biases2
    valid_prediction = tf.nn.softmax(val_logits2)
    # test set...
    test_logits1 = tf.matmul(tf_test_dataset, weights1) + biases1
    test_activations = tf.nn.relu(test_logits1)
    test_logits2 = tf.matmul(test_activations, weights2) + biases2
    test_prediction = tf.nn.softmax(test_logits2) 

In [6]:
# From previous problem set, with modifications

num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                    valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3505.096924
Minibatch accuracy: 10.2%
Validation accuracy: 23.9%
Minibatch loss at step 500: 21.529495
Minibatch accuracy: 76.6%
Validation accuracy: 80.0%
Minibatch loss at step 1000: 0.915912
Minibatch accuracy: 85.9%
Validation accuracy: 79.4%
Minibatch loss at step 1500: 0.828583
Minibatch accuracy: 79.7%
Validation accuracy: 79.6%
Minibatch loss at step 2000: 0.682324
Minibatch accuracy: 88.3%
Validation accuracy: 79.9%
Minibatch loss at step 2500: 0.696696
Minibatch accuracy: 82.0%
Validation accuracy: 79.5%
Minibatch loss at step 3000: 0.637018
Minibatch accuracy: 87.5%
Validation accuracy: 79.9%
Test accuracy: 90.0%
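
The very large loss at step 0 is mostly the L2 penalty rather than the cross-entropy. The back-of-the-envelope check below is a sketch under two assumptions: that `tf.truncated_normal` is called with its default standard deviation of 1.0, and that it redraws values more than two standard deviations from the mean. The helper names are made up for illustration.

```python
import math

def truncated_normal_variance(a=2.0):
    """Variance of a standard normal distribution truncated to [-a, a]."""
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return 1.0 - 2.0 * a * pdf / (2.0 * cdf - 1.0)

def expected_l2_penalty(shape, coeff, var):
    """Expected value of coeff * sum(w ** 2) / 2 for zero-mean weights with variance var."""
    n = 1
    for d in shape:
        n *= d
    return coeff * 0.5 * n * var

var = truncated_normal_variance()
# Same shapes and coefficients as weights1 / weights2 in the graph above.
penalty = (expected_l2_penalty((784, 1024), 0.01, var)
           + expected_l2_penalty((1024, 10), 0.01, var))
print("weight variance: %.3f" % var)
print("expected initial L2 penalty: %.0f" % penalty)
```

Under these assumptions the expected penalty alone is a little over 3,100, so most of the 3505 reported at step 0 comes from regularizing the randomly initialized weights.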


---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---

In [7]:
train_dataset = train_dataset[:(128*3)]
train_labels = train_labels[:(128*3)]

In [8]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                    valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3598.753662
Minibatch accuracy: 5.5%
Validation accuracy: 17.6%
Minibatch loss at step 500: 21.056644
Minibatch accuracy: 100.0%
Validation accuracy: 68.5%
Minibatch loss at step 1000: 0.402568
Minibatch accuracy: 100.0%
Validation accuracy: 69.7%
Minibatch loss at step 1500: 0.242716
Minibatch accuracy: 100.0%
Validation accuracy: 70.3%
Minibatch loss at step 2000: 0.232524
Minibatch accuracy: 100.0%
Validation accuracy: 70.4%
Minibatch loss at step 2500: 0.227466
Minibatch accuracy: 100.0%
Validation accuracy: 70.4%
Minibatch loss at step 3000: 0.224395
Minibatch accuracy: 100.0%
Validation accuracy: 70.5%
Test accuracy: 81.4%


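Here, the minibatch accuracy saturates at 100% by step 500 while the validation accuracy stalls around 70%: the model memorizes the few training batches it sees and stops improving, so it performs poorly at the test stage (81.4%, against 90.0% with the full training set). Although, really, it does surprisingly well given how little data it has. (I guess 80% is pretty bad for character recognition, actually, but it is clearly better than the 10% you would get by guessing among 10 classes.)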
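
One detail worth noting about this run: with the offset formula used in the training loop, a 384-example training set yields only two distinct minibatches, so the last 128 examples are never fed to the model. A quick check in plain Python, reusing the same formula:

```python
batch_size = 128
num_examples = 128 * 3  # size of the restricted training set

# Same formula as in the training loop.
offsets = [(step * batch_size) % (num_examples - batch_size) for step in range(6)]
print(offsets)  # [0, 128, 0, 128, 0, 128]

# Collect every row index that is ever fed to the model.
seen = set()
for offset in set(offsets):
    seen.update(range(offset, offset + batch_size))
print(len(seen))  # 256
```

The modulus is `num_examples - batch_size = 256`, and every multiple of 128 is either 0 or 128 modulo 256, so the model effectively trains on 256 examples.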

---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

---
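To make the train/evaluation distinction concrete, here is a minimal NumPy sketch of dropout on a ReLU hidden layer. It assumes the scaling that `tf.nn.dropout` documents: each kept unit is multiplied by `1 / keep_prob`, so the expected activation is unchanged. The function names and the toy data are made up for illustration.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Keep each unit with probability keep_prob and scale the survivors
    by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.random_sample(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

def hidden_layer(x, w, b, keep_prob=None, rng=None):
    """ReLU hidden layer; dropout is applied only when keep_prob is given."""
    h = np.maximum(x.dot(w) + b, 0.0)
    if keep_prob is not None:
        h = dropout(h, keep_prob, rng)
    return h

rng = np.random.RandomState(0)
x = rng.rand(4, 5)   # toy batch: 4 examples, 5 features
w = rng.randn(5, 3)  # toy weights: 3 hidden units
b = np.zeros(3)

train_h = hidden_layer(x, w, b, keep_prob=0.75, rng=rng)  # stochastic, training only
eval_h = hidden_layer(x, w, b)                            # deterministic, for evaluation
print(train_h)
print(eval_h)
```

Every entry of `train_h` is either zero or the corresponding entry of `eval_h` divided by 0.75, while `eval_h` is identical on every call.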

In [18]:
# From previous problem set, with modifications

batch_size = 128
num_relus = 1024
l2_w1par = 0.01
l2_w2par = 0.01
dropout_keep_prob = 0.75


graph = tf.Graph()
with graph.as_default():

    # For training data, use a placeholder that will be fed a minibatch at runtime.
    tf_train_dataset = tf.placeholder(tf.float32, 
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
  
    # Variables.
    # Our network now has one set of weights and biases producing an output vector of
    # size `num_relus`. We apply the ReLUs and then feed that vector into a second
    # weight matrix + bias vector that produces a vector of size `num_labels`,
    # which is fed into our softmax.
    weights1 = tf.Variable(tf.truncated_normal([image_size * image_size, num_relus]))
    biases1 = tf.Variable(tf.zeros([num_relus]))
    weights2 = tf.Variable(tf.truncated_normal([num_relus, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

  
    # Training computation.
    logits1 = tf.matmul(tf_train_dataset, weights1) + biases1
    activations = tf.nn.dropout(tf.nn.relu(logits1), keep_prob=dropout_keep_prob)
    logits2 = tf.matmul(activations, weights2) + biases2

    # add L2 loss terms: https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#l2_loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits2, tf_train_labels) + 
                         l2_w1par * tf.nn.l2_loss(weights1) + 
                         l2_w2par * tf.nn.l2_loss(weights2))
  
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits2)
    # validation set...
    val_logits1 = tf.matmul(tf_valid_dataset, weights1) + biases1
    val_activations = tf.nn.relu(val_logits1)
    val_logits2 = tf.matmul(val_activations, weights2) + biases2
    valid_prediction = tf.nn.softmax(val_logits2)
    # test set...
    test_logits1 = tf.matmul(tf_test_dataset, weights1) + biases1
    test_activations = tf.nn.relu(test_logits1)
    test_logits2 = tf.matmul(test_activations, weights2) + biases2
    test_prediction = tf.nn.softmax(test_logits2) 

In [19]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                    valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3497.390137
Minibatch accuracy: 23.4%
Validation accuracy: 30.8%
Minibatch loss at step 500: 21.007128
Minibatch accuracy: 100.0%
Validation accuracy: 68.7%
Minibatch loss at step 1000: 0.409433
Minibatch accuracy: 100.0%
Validation accuracy: 70.1%
Minibatch loss at step 1500: 0.247708
Minibatch accuracy: 100.0%
Validation accuracy: 70.3%
Minibatch loss at step 2000: 0.242500
Minibatch accuracy: 100.0%
Validation accuracy: 70.6%
Minibatch loss at step 2500: 0.235747
Minibatch accuracy: 100.0%
Validation accuracy: 70.4%
Minibatch loss at step 3000: 0.229695
Minibatch accuracy: 100.0%
Validation accuracy: 70.6%
Test accuracy: 81.6%


Well, it depends on the dropout probability. A very high dropout probability degrades performance. "Middle" probabilities (e.g., 0.5) improve the behavior, but only slightly: in the run above, with `dropout_keep_prob = 0.75`, the test accuracy goes from 81.4% to 81.6%.

---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
 ---
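
For intuition about the schedule, here is a plain-Python sketch of the formula documented for `tf.train.exponential_decay`: `learning_rate * decay_rate ** (global_step / decay_steps)`. The values `decay_steps=1000` and `decay_rate=0.9` are arbitrary choices for illustration, not settings recommended by the assignment.

```python
def exponential_decay(base_rate, global_step, decay_steps, decay_rate, staircase=False):
    """Plain-Python version of the documented decay formula:
    base_rate * decay_rate ** (global_step / decay_steps)."""
    if staircase:
        # Integer division: the rate drops in discrete steps.
        exponent = global_step // decay_steps
    else:
        exponent = global_step / float(decay_steps)
    return base_rate * decay_rate ** exponent

# Illustrative schedule: multiply the rate by 0.9 every 1000 steps.
for step in (0, 1000, 3000, 10000):
    print(step, exponential_decay(0.5, step, decay_steps=1000, decay_rate=0.9))
```

In the TensorFlow graph, the schedule only takes effect if the resulting `learning_rate` tensor is the one passed to the optimizer and `global_step` is passed to `minimize`, which increments it at every training step.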


In [20]:
# restore data

pickle_file = 'notMNIST_sanitized.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)
    
image_size = 28
num_labels = 10

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

('Training set', (200000, 28, 28), (200000,))
('Validation set', (9633, 28, 28), (9633,))
('Test set', (9604, 28, 28), (9604,))
('Training set', (200000, 784), (200000, 10))
('Validation set', (9633, 784), (9633, 10))
('Test set', (9604, 784), (9604, 10))


In [36]:
# From previous problem set, with modifications

base_learning_rate = 0.1
batch_size = 128
num_relus1 = 1024
num_relus2 = 512
l2_w1par = 0.01
l2_w2par = 0.01
l2_w3par = 0.01
dropout_keep_prob1 = 0.90
dropout_keep_prob2 = 0.75

nonlin_actfun1 = tf.tanh      # we always get infinite gradient errors if we make the 1st layer a relu
nonlin_actfun2 = tf.nn.relu

graph = tf.Graph()
with graph.as_default():

    # For training data, use a placeholder that will be fed a minibatch at runtime.
    tf_train_dataset = tf.placeholder(tf.float32, 
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
  
    # Variables.
    # The network now has two hidden layers: the first weight matrix + bias vector
    # maps the input to `num_relus1` units, the second maps those to `num_relus2`
    # units, and the third produces the vector of size `num_labels` that is fed
    # into our softmax.
    weights1 = tf.Variable(tf.truncated_normal([image_size * image_size, num_relus1]))
    biases1 = tf.Variable(tf.zeros([num_relus1]))
    weights2 = tf.Variable(tf.truncated_normal([num_relus1, num_relus2]))
    biases2 = tf.Variable(tf.zeros([num_relus2]))
    weights3 = tf.Variable(tf.truncated_normal([num_relus2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))

  
    # Training computation.
    logits1 = tf.matmul(tf_train_dataset, weights1) + biases1
    activations = tf.nn.dropout(nonlin_actfun1(logits1), keep_prob=dropout_keep_prob1)
    logits2 = tf.matmul(activations, weights2) + biases2
    activations2 = tf.nn.dropout(nonlin_actfun2(logits2), keep_prob=dropout_keep_prob2)
    logits3 = tf.matmul(activations2, weights3) + biases3
    final_logits = logits3
    train_prediction = tf.nn.softmax(logits3)

    # add L2 loss terms: https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#l2_loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(final_logits, tf_train_labels) + 
                         l2_w1par * tf.nn.l2_loss(weights1) + 
                         l2_w2par * tf.nn.l2_loss(weights2) + 
                         l2_w3par * tf.nn.l2_loss(weights3))
  
    # Optimizer.
    global_step = tf.Variable(0)  # count the number of steps taken.
    # Decay the learning rate as global_step advances, starting from base_learning_rate.
    learning_rate = tf.train.exponential_decay(base_learning_rate, global_step,
                                               decay_steps=10000, decay_rate=0.96)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
  
    # validation set...
    val_logits1 = tf.matmul(tf_valid_dataset, weights1) + biases1
    val_activations = nonlin_actfun1(val_logits1)
    val_logits2 = tf.matmul(val_activations, weights2) + biases2
    val_activations2 = nonlin_actfun2(val_logits2)
    val_logits3 = tf.matmul(val_activations2, weights3) + biases3
    val_final_logits = val_logits3
    valid_prediction = tf.nn.softmax(val_final_logits)

    # test set...
    test_logits1 = tf.matmul(tf_test_dataset, weights1) + biases1
    test_activations = nonlin_actfun1(test_logits1)
    test_logits2 = tf.matmul(test_activations, weights2) + biases2
    test_activations2 = nonlin_actfun2(test_logits2)
    test_logits3 = tf.matmul(test_activations2, weights3) + biases3
    test_final_logits = test_logits3
    test_prediction = tf.nn.softmax(test_final_logits) 

In [37]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 100 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                    valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 5857.618652
Minibatch accuracy: 12.5%
Validation accuracy: 24.2%
Minibatch loss at step 100: 4282.854492
Minibatch accuracy: 70.3%
Validation accuracy: 71.8%
Minibatch loss at step 200: 3495.483398
Minibatch accuracy: 81.2%
Validation accuracy: 71.1%
Minibatch loss at step 300: 2847.058350
Minibatch accuracy: 75.0%
Validation accuracy: 70.9%
Minibatch loss at step 400: 2326.574219
Minibatch accuracy: 62.5%
Validation accuracy: 69.4%
Minibatch loss at step 500: 1904.387695
Minibatch accuracy: 63.3%
Validation accuracy: 69.2%
Minibatch loss at step 600: 1554.293945
Minibatch accuracy: 71.1%
Validation accuracy: 70.4%
Minibatch loss at step 700: 1271.488159
Minibatch accuracy: 71.1%
Validation accuracy: 71.1%
Minibatch loss at step 800: 1041.126953
Minibatch accuracy: 71.1%
Validation accuracy: 68.8%
Minibatch loss at step 900: 851.230103
Minibatch accuracy: 75.8%
Validation accuracy: 72.0%
Minibatch loss at step 1000: 697.046448
Minibatch accuracy: 6

Basically, the best result so far is still the network with just one ReLU hidden layer (almost 90%). Adding layers degraded performance in every case, but those runs used a constant learning rate. A decaying learning rate seems to help.

Actually, we got to 90% at 3001 steps, with accuracy still improving, although slowly. Decaying the learning rate is very important. Also, dropout seems to require more patience.