# Assignment 3: Hyperparameters, Optimizers, and Regularization


This notebook is meant to be an overview of common optimizer options, regularization techniques, and a little bit of hyperparameter tuning. We show you how to make a nice API with TensorFlow, use that API to make a network, and set up the network for training.

From this framework, we demonstrate experiments with different optimizers and regularization techniques. We outline a few important best practices of deep learning, especially in the design phase of the neural network.

In [None]:
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt

TensorFlow provides a nice interface that makes it easy to load MNIST. It also provides a convenient way for us to use standard train/val/test splits.

For those of you who took 189 or the decal a while ago, you might remember that these splits are delineated as follows:
* train: the data we train on
* val:   the data we use to adjust our hyperparameters
* test:  the data we use at the very end of our hyperparameter search to report the quality of our model. It shouldn't be looked at until you are done tuning hyperparameters, and it should never be used as a metric to adjust a model.
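
The MNIST loader below already hands us these three splits, so you won't need to build them yourself. Purely as an illustration of what "disjoint splits" means, here is a minimal sketch in plain Python. The helper `split_indices` and the sizes used are our own assumptions, not part of the TensorFlow interface.

```python
import random

def split_indices(n_examples, n_val, n_test, seed=0):
    """Shuffle example indices and carve out disjoint train/val/test index lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(n_examples=100, n_val=10, n_test=20)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 10 20
```

Because the three index lists never overlap, accuracy measured on `test_idx` is not contaminated by anything we tuned on `val_idx`.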

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('../data/mnist', one_hot=True)

### Boilerplate
Here I've prepared a few boilerplate functions that I like to use to make construction of nets easier. 

In [None]:
relu = tf.nn.relu
def weight_variable(shape,name=None):
    initial = # ___ YOUR CODE HERE ____
    return tf.Variable(initial, name=name)

def bias_variable(shape,name=None):
    initial = # ___ YOUR CODE HERE ____
    return tf.Variable(initial, name=name)

def fc_layer(x, W, b, act_fn=None):
    if act_fn is not None:
        # return the layer function (affine operation) - make sure you apply the act_fn
        return # ___ YOUR CODE HERE ____
    else:
        #return the affine function by itself
        return # ___ YOUR CODE HERE ____
def accuracy(preds, true):
    #compute accuracy using predicted and true
    return tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds, 1), tf.argmax(true, 1)), tf.float32))

If you've defined everything correctly, this test should pass without raising an assertion error.

In [None]:
def test_layers():
    tf.reset_default_graph()
    tf.set_random_seed(1)
    x = tf.random_normal([1,10])
    w = weight_variable([10,10])
    b = bias_variable([10])
    layer1 = fc_layer(x, w, b, act_fn=relu)
    with tf.Session() as sess:

        sess.run(tf.global_variables_initializer())
        output = sess.run(layer1)
        print(output)
        assert np.allclose(output,np.array([[ 0.17964056,  0.1712192 ,  0.0615531 ,  0.04394994,  0.,0.06671001,  0.36768118,  0.11596319,  0.,  0.03804523]], dtype=np.float32))
test_layers()

# Define the input
We define the input before anything else because it stays constant throughout all of our experiments.

In [None]:
# Define the placeholders for the dataset. X is the features and y are the labels
x =  # ___ YOUR CODE HERE ____
y_ =  # ___ YOUR CODE HERE ____

# Make a simple Fully Connected Net
Here you should build a fully-connected net. Read the docstring for more details.

Use the `layers` dictionary to return all of the layers you defined.

Additionally, make sure that you set `logits` to the output of the last layer BEFORE applying an `act_fn`. This means you'll have to set `act_fn` to `None` for that layer.


Make sure that you return `logits` as the first element of the returned tuple, `(logits, layers)`.
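
Returning raw logits matters because `tf.nn.softmax_cross_entropy_with_logits`, which we use for the loss later on, applies the softmax itself and expects unscaled logits. The sketch below shows the idea in plain Python for a single example. The function names and the toy values are our own, and this is not how TensorFlow implements the op internally.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, one_hot_label):
    """Cross-entropy computed directly from logits via the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_sum_exp) for z, t in zip(logits, one_hot_label))

logits = [2.0, 1.0, 0.1]   # toy logits for a 3-class problem
label = [1.0, 0.0, 0.0]    # one-hot label for class 0
probs = softmax(logits)
loss = cross_entropy_from_logits(logits, label)
print(probs, loss)
```

Working from logits lets the loss avoid taking the log of a probability that has underflowed to zero, which is one reason to keep the softmax out of the last layer.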

In [None]:
def fc_net(x):
    '''
    Simple Fully-connected network for MNIST
    Input: 784
    Hidden Layers
        (1): 800
        (2): 300
    Output: 10, softmax output
    
    Returns:
    tuple with the contents `(logits, layers)`
    
    logits: the op corresponding to the last fully-connect layer BEFORE applying softmax
    layers: dictionary containing all of the layers
    '''
    layers = {
        'input': x,
    }# (Ignore if you don't understand this yet) COPY Below this line, including return
    
    # define layer 1 parameters here 
    layers['fc1'] = # ___ YOUR CODE HERE ____
    
    # define layer 2 parameters here 
    layers['fc2'] = # ___ YOUR CODE HERE ____
    
    # define layer 3 parameters here 
    layers['fc3'] = # ___ YOUR CODE HERE ____
    
    layers['logits'] = logits = layers['fc3']
    layers['pred'] = tf.nn.softmax(layers['logits'])
    return logits, layers

Define the rest of the net

In [None]:
logits, net = fc_net(x)
entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits)
loss = tf.reduce_mean(entropy)
acc = accuracy(net['pred'], y_)

# Exploring different optimizers

In this section we look at different optimizers. They all stem from [Gradient Descent](https://en.wikipedia.org/wiki/Gradient_descent), but many modified optimizers use statistics about the preceding gradients to inform each update.

We look in particular at RMSProp and Adam, two optimizers that typically perform well and are therefore popular choices. Many others have performed very well in the past and may hold promise in the future. A good summary is compiled in this [video](https://www.youtube.com/watch?v=nhqo0u1a6fw).

## Standard SGD
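
Before plugging in TensorFlow's optimizer, it can help to see the update rule on its own: each step moves the parameters a small distance against the gradient computed on one mini-batch. The following is a minimal sketch in plain Python on a toy one-weight regression problem. The data, the helper `sgd_step`, and the learning rate of 0.1 are our own assumptions chosen so the toy converges quickly.

```python
import random

def sgd_step(w, grad, learning_rate):
    """One gradient descent update: step against the gradient."""
    return w - learning_rate * grad

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(256)]
data = [(x, 2.0 * x) for x in xs]  # toy targets generated with a true weight of 2.0

w = 0.0
learning_rate = 0.1
batch_size = 16
for epoch in range(50):
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # gradient of the mean squared error (w*x - y)^2 over the mini-batch
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        w = sgd_step(w, grad, learning_rate)

print(w)  # should end up very close to the true weight 2.0
```

The training loop later in this section has the same shape: shuffle, take a batch, compute a gradient, apply one update.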

In [None]:
# Start with the standard SGD Optimizer. Try with a learning rate of 1e-2
optimizer = # ___ YOUR CODE HERE ____

In [None]:

num_epochs = 25
batch_size = 64
train_acc_list = []
val_acc_list = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for i in range(num_epochs):
        n_batches = (int) (mnist.train.num_examples/batch_size)
        for e in range(n_batches):
            batch = mnist.train.next_batch(batch_size)
            optimizer.run(feed_dict={x: batch[0], y_: batch[1]})
        # calculate training and validation accuracy every epoch (i % 1 == 0 is always true)
        if i % 1 == 0:
            val_batch = mnist.validation.next_batch(mnist.validation.num_examples)
            train_batch = mnist.train.next_batch(1000)
            train_accuracy = acc.eval(feed_dict={
                x: train_batch[0], y_: train_batch[1]
            })

            val_accuracy = acc.eval(feed_dict={
                x: val_batch[0], y_: val_batch[1]
            })
            train_acc_list.append(train_accuracy)
            val_acc_list.append(val_accuracy)
            print('epoch %d \ttraining accuracy %g \tvalidation accuracy %g' % (i, train_accuracy, val_accuracy))
       

    print('test accuracy %g' % acc.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels}))

In [None]:
Xaxis = np.arange(num_epochs)
plt.plot(Xaxis, train_acc_list, color='r', label='train acc')
plt.plot(Xaxis, val_acc_list,color='g', label='val_acc')
plt.legend()
plt.show()

## RMSProp
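
RMSProp keeps a running average of squared gradients and divides each update by its square root, so a parameter whose gradients have been large takes proportionally smaller steps. Below is a minimal sketch of one common formulation in plain Python. The function name `rmsprop_step`, the toy objective, and the hyperparameter values are our own assumptions, and TensorFlow's implementation differs in details such as where epsilon is added.

```python
import math

def rmsprop_step(w, grad, mean_square, learning_rate=0.01, decay=0.9, epsilon=1e-10):
    """One RMSProp update for a single scalar parameter."""
    # exponential moving average of the squared gradient
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    # scale the step by the root of that average
    w = w - learning_rate * grad / (math.sqrt(mean_square) + epsilon)
    return w, mean_square

# toy objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, mean_square = 0.0, 0.0
for _ in range(1000):
    grad = 2.0 * (w - 3.0)
    w, mean_square = rmsprop_step(w, grad, mean_square)

print(w)  # ends up near the minimum at 3.0
```

Note that with a constant learning rate the step size does not shrink to zero near the minimum, so the iterate hovers around 3.0 rather than landing exactly on it.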

In [None]:
num_epochs = 25
batch_size = 64
learning_rate=1e-4

In [None]:
# Now try with RMSProp Optimizer. You should search the tensorflow documentation to find it.
# use the learning_rate parameter from the previous cell
optimizer = # ___ YOUR CODE HERE ____


In [None]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(num_epochs):
        n_batches = (int) (mnist.train.num_examples/batch_size)
        for e in range(n_batches):
            batch = mnist.train.next_batch(batch_size)
            optimizer.run(feed_dict={x: batch[0], y_: batch[1]})
        # calculate training and validation accuracy every epoch (i % 1 == 0 is always true)
        if i % 1 == 0:
            val_batch = mnist.validation.next_batch(mnist.validation.num_examples)
            train_batch = mnist.train.next_batch(1000)
            train_accuracy = acc.eval(feed_dict={
                x: train_batch[0], y_: train_batch[1]
            })
            val_accuracy = acc.eval(feed_dict={
                x: val_batch[0], y_: val_batch[1]
            })
            print('epoch %d \ttraining accuracy %g \tvalidation accuracy %g' % (i, train_accuracy, val_accuracy))
       

    print('test accuracy %g' % acc.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels}))

## Adam

Here we demonstrate the [Adam optimizer](https://arxiv.org/pdf/1412.6980.pdf). Adam keeps running estimates of the mean and the uncentered variance of the gradients, updated iteratively at each step and corrected for their bias toward zero early in training. This tends to produce a much smoother trajectory through parameter space.
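
To make the update concrete, here is a minimal sketch of a single-parameter Adam step in plain Python, following the update rule in the paper linked above. The function name `adam_step` and the toy objective are our own. The default hyperparameters are the ones suggested in the paper, while the learning rate of 0.05 in the demo is an assumption chosen so the toy converges quickly.

```python
import math

def adam_step(w, grad, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

# toy objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, learning_rate=0.05)

print(w)  # ends up close to the minimum at 3.0
```

In the notebook itself you only need the TensorFlow optimizer; the sketch is here so the terms "mean", "variance", and "bias correction" map onto specific lines.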

In [None]:
# hyperparameters
num_epochs = 25
batch_size = 64
learning_rate = 1e-4

In [None]:
# Now setup an ADAM optimizer. Look in the TensorFlow documentation for more details.
# use the learning_rate parameter from the previous cell
optimizer = # ___ YOUR CODE HERE ____


In [None]:
train_acc_list = []
val_acc_list = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for i in range(num_epochs):
        n_batches = (int) (mnist.train.num_examples/batch_size)
        for e in range(n_batches):
            batch = mnist.train.next_batch(batch_size)
            optimizer.run(feed_dict={x: batch[0], y_: batch[1]})
        # calculate training and validation accuracy every epoch (i % 1 == 0 is always true)
        if i % 1 == 0:
            val_batch = mnist.validation.next_batch(mnist.validation.num_examples)
            train_batch = mnist.train.next_batch(1000)
            train_accuracy = acc.eval(feed_dict={
                x: train_batch[0], y_: train_batch[1]
            })

            val_accuracy = acc.eval(feed_dict={
                x: val_batch[0], y_: val_batch[1]
            })
            train_acc_list.append(train_accuracy)
            val_acc_list.append(val_accuracy)
            print('epoch %d \ttraining accuracy %g \tvalidation accuracy %g' % (i, train_accuracy, val_accuracy))
       

    print('test accuracy %g' % acc.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels}))

# Regularization
Here we demonstrate the effects of several different regularization techniques. They are all intended to reduce the gap between training accuracy and validation accuracy by preventing the training steps from overfitting to the training dataset.

## L1 / L2 regularization
This is probably the simplest form of regularization. In this case, we add a penalty on the norm of the weights (the L2 norm by default) to the cost function we minimize. We borrow this regularization from ridge regression (L2 norm) and lasso (L1 norm), which apply the same idea to linear regression.

To use it, we have to add a term to the loss function, which is relatively straightforward to understand and use.
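
As a minimal sketch of what "adding a term to the loss" means, the plain-Python example below combines a data loss with a scaled penalty. The helper names and the toy weights are our own. `l2_penalty` mirrors `tf.nn.l2_loss`, which computes half the sum of squared entries, and `0.125` is the `reg_alpha` value used later in this section.

```python
def l2_penalty(weights):
    """Half the sum of squared weights, the same quantity tf.nn.l2_loss computes."""
    return 0.5 * sum(w * w for w in weights)

def l1_penalty(weights):
    """Sum of absolute weights (the lasso-style penalty)."""
    return sum(abs(w) for w in weights)

def regularized_loss(data_loss, weights, reg_alpha, penalty=l2_penalty):
    """Total loss = data loss + reg_alpha * penalty on the weights."""
    return data_loss + reg_alpha * penalty(weights)

weights = [0.5, -1.0, 2.0]
print(regularized_loss(1.0, weights, 0.125))              # 1.328125
print(regularized_loss(1.0, weights, 0.125, l1_penalty))  # 1.4375
```

The larger `reg_alpha` is, the more the optimizer is pushed toward small weights at the expense of fitting the training data.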

In [None]:
def fc_l_norm(x, reg=tf.nn.l2_loss):
    layers = {
        'input': x,
    }

    # COPY from fc_net here 
    # Change weight1, weight2 ... to whatever your weight variables are called.
    # TODO move this line to before the RETURN statement
    # layers['regularizer'] = reg(weight1) + reg(weight2) + reg(weight3) 

logits, net = fc_l_norm(x, tf.nn.l2_loss)

In [None]:
# hyperparameters
num_epochs = 25
batch_size = 64

# the constant coefficient of the regularization part of the loss
reg_alpha = 0.125

In [None]:

entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits)
regularizer = net['regularizer']
# make sure that you add the regularizer to the loss function, scaled by `reg_alpha`
loss = tf.reduce_mean(entropy)# ___ YOUR CODE HERE ____
optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss)
net_output = tf.nn.softmax(logits)
acc = accuracy(net_output, y_)

In [None]:
# hyperparameters
num_epochs = 25
batch_size = 64
dropout_prob = 1
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(num_epochs):
        n_batches = (int) (mnist.train.num_examples/batch_size)
        for e in range(n_batches):
            batch = mnist.train.next_batch(batch_size)
            optimizer.run(feed_dict={x: batch[0], y_: batch[1]})
        # calculate training and validation accuracy every epoch (i % 1 == 0 is always true)
        if i % 1 == 0:
            val_batch = mnist.validation.next_batch(mnist.validation.num_examples)
            train_batch = mnist.train.next_batch(1000)
            train_accuracy = acc.eval(feed_dict={
                x: train_batch[0], y_: train_batch[1]
            })
 
            val_accuracy = acc.eval(feed_dict={
                x: val_batch[0], y_: val_batch[1]
            })
            print('epoch %d \ttraining accuracy %g \tvalidation accuracy %g' % (i, train_accuracy, val_accuracy))
       

    print('test accuracy %g' % acc.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels}))

## Dropout
[Srivastava et al.](http://jmlr.org/papers/v15/srivastava14a.html)

Dropout provides an extremely interesting case of regularization. The technique is motivated by the following observation about neural networks: a neuron in the $k$th layer of an $n$-layer network may begin relying heavily on a few neurons in the $(k-1)$th layer. In turn, those neurons rely on a few neurons in the $(k-2)$th layer, and so on. This reliance on certain weights over others makes the model much more fragile. Misleading signals that come from the input may be amplified along these chains, a result of overfitting to the dataset.

Dropout attempts to prevent this issue by randomly preventing neurons from contributing to a forward pass, which also excludes them from the update in the subsequent backprop. A practitioner sets the dropout parameter as the probability that a neuron will not be included in the forward pass. This helps the model learn a representation that does not rely heavily on any single neuron, but rather on many, increasing the chances that more general information will be learned.


![dropout.jpeg](dropout.jpeg)
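
A minimal sketch of the mechanism in plain Python follows. It uses the "inverted dropout" convention that `tf.nn.dropout` also follows: each unit is kept with probability `keep_prob`, and kept activations are scaled by `1 / keep_prob` so the expected activation is unchanged. The function signature, the `training` flag, and the toy values are our own assumptions.

```python
import random

def dropout(activations, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob and rescale the survivors."""
    if not training or keep_prob >= 1.0:
        # at evaluation time every unit is kept and nothing is rescaled
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, keep_prob=0.8, rng=rng))                  # mix of 0.0 and 1.25
print(dropout(hidden, keep_prob=0.8, rng=rng, training=False))  # unchanged
```

This is why the training loop below feeds a value below 1 for `keep_prob` during training and feeds 1 when measuring accuracy.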


In [None]:
# copy your code from fc_net and add it here. Then move the line that starts with layers['dropout1']
# so that dropout sits in between 'fc1' and 'fc2'
def fc_net_dropout(x, keep_prob_op):
    layers = {
        'input': x,
        'keep_prob': keep_prob_op
    }

    # TODO put between two layers
    # layers['dropout1'] = tf.nn.dropout("TODO CHANGE WITH INPUT LAYER", layers['keep_prob'])
    # COPY fc_net() below the layer dict initialization
# only useful for dropout
keep_prob = tf.placeholder(tf.float32)
net, net_dict = fc_net_dropout(x, keep_prob)

In [None]:
entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=net)
loss = tf.reduce_mean(entropy)
optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss)
net_output = tf.nn.softmax(net)
acc = accuracy(net_output, y_)

In [None]:
# hyperparameters
num_epochs = 25
batch_size = 64
# NOTE: despite its name, this value is fed to `keep_prob`, i.e. the probability of KEEPING a unit
dropout_prob = 0.99

In [None]:
train_acc_dropout_list = []
val_acc_dropout_list = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(num_epochs):
        n_batches = (int) (mnist.train.num_examples/batch_size)
        for e in range(n_batches):
            batch = mnist.train.next_batch(batch_size)
            optimizer.run(feed_dict={x: batch[0], y_: batch[1],  net_dict['keep_prob'] : dropout_prob})
        # calculate training and validation accuracy every epoch (i % 1 == 0 is always true)
        if i % 1 == 0:
            val_batch = mnist.validation.next_batch(mnist.validation.num_examples)
            train_batch = mnist.train.next_batch(1000)
            train_acc = acc.eval(feed_dict={
                x: train_batch[0], y_: train_batch[1], net_dict['keep_prob'] : 1
            })
            #val_acc = acc.eval(feed_dict={
           #     x: mnist.validation.images, y_: mnist.validation.labels
           # })
            val_acc = acc.eval(feed_dict={
                x: val_batch[0], y_: val_batch[1], net_dict['keep_prob'] : 1
            })
            train_acc_dropout_list.append(train_acc)
            val_acc_dropout_list.append(val_acc)
            print('epoch %d \ttraining accuracy %g \tvalidation accuracy %g' % (i, train_acc, val_acc))
       

    print('test accuracy %g' % acc.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels, net_dict['keep_prob']: 1}))


Now it's important that we compare the results of training using dropout with regular training.

In [None]:
Xaxis = np.arange(num_epochs)
plt.plot(Xaxis, train_acc_list, color='r', label='train acc')
plt.plot(Xaxis, train_acc_dropout_list, color='b', label='train acc dropout')
plt.plot(Xaxis, val_acc_list,color='g', label='val acc')
plt.plot(Xaxis, val_acc_dropout_list,color='y', label='val acc dropout')
plt.legend()
plt.show()

In the following sections, we proceed with Adam for optimization because of its simplicity and because it is a common first choice. As an exercise, we recommend trying RMSProp and plotting its curve alongside Adam's. That way, you'll see how training evolves over time under each optimizer.

## Batch Normalization
[Batch Normalization](https://arxiv.org/abs/1502.03167) is a technique that is meant to preserve the statistics inside of a network. It normalizes a layer's output across each mini-batch so that it matches a certain mean and standard deviation. This stabilizes parameter growth, enabling higher learning rates. It also happens to regularize the model.
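
The core computation can be sketched in a few lines of plain Python: subtract the batch mean, divide by the batch standard deviation, then apply a scale and a shift. The function name, the single-feature setup, and the toy values are our own assumptions. In the paper, the scale `gamma` and shift `beta` are learned, and running averages of the statistics replace the batch statistics at inference time, neither of which is shown here.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize one feature across a batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + epsilon) + beta for x in batch]

activations = [1.0, 2.0, 3.0, 4.0]  # one feature's values across a batch of 4
normalized = batch_norm(activations)
print(normalized)  # roughly zero mean and unit variance
```

With `gamma=1.0` and `beta=0.0` the output has approximately zero mean and unit variance regardless of the scale of the input, which is the "certain mean and standard deviation" described above.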


In [None]:
# this one is a little bit harder. The syntax for batch normalization is similar to dropout.
# we just add it as a part of our graph. Try adding it to the input of fc2. You can also try adding it 
# to different parts of the net as well.
def fc_batch_norm(x):
    layers = {
        'input': x,
    }
    # COPY fc_net() below the layer dict initialization
net, net_dict = fc_batch_norm(x)

As the gradients are smaller, you should try a few different learning rates to see how that affects the training of the model. It's a lot easier to compare if you write the information of each run to a variable or a file.

In [None]:
# use `net`, the logits returned by fc_batch_norm above (not the stale `logits` from an earlier section)
entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=net)
loss = tf.reduce_mean(entropy)
optimizer = tf.train.AdamOptimizer(1e-3).minimize(loss)
net_output = tf.nn.softmax(net)
acc = accuracy(net_output, y_)

In [None]:
# hyperparameters
num_epochs = 25
batch_size = 64

In [None]:
train_acc_bn_list = []
val_acc_bn_list = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for i in range(num_epochs):
        n_batches = (int) (mnist.train.num_examples/batch_size)
        for e in range(n_batches):
            batch = mnist.train.next_batch(batch_size)
            optimizer.run(feed_dict={x: batch[0], y_: batch[1]})
            
        # calculate training and validation accuracy every epoch (i % 1 == 0 is always true)
        if i % 1 == 0:
            val_batch = mnist.validation.next_batch(mnist.validation.num_examples)
            train_batch = mnist.train.next_batch(1000)
            
            train_accuracy = acc.eval(feed_dict={
                x: train_batch[0], y_: train_batch[1]
            })

            val_accuracy = acc.eval(feed_dict={
                x: val_batch[0], y_: val_batch[1]
            })
            train_acc_bn_list.append(train_accuracy)
            val_acc_bn_list.append(val_accuracy)
            
            print('epoch %d \ttraining accuracy %g \tvalidation accuracy %g' % (i, train_accuracy, val_accuracy))
       

    print('test accuracy %g' % acc.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels}))


In [None]:
train_acc_bn_list

In [None]:
Xaxis = np.arange(num_epochs)
plt.plot(Xaxis, train_acc_list, color='r', label='train acc')
plt.plot(Xaxis, train_acc_dropout_list, color='b', label='train acc dropout')
plt.plot(Xaxis, train_acc_bn_list, color='k', label='train acc bn')
plt.plot(Xaxis, val_acc_list,color='g', label='val acc')
plt.plot(Xaxis, val_acc_dropout_list,color='y', label='val acc dropout')
plt.plot(Xaxis, val_acc_bn_list,color='#e12f12', label='val acc bn')
plt.legend()
plt.show()

# End Note
Now that you've seen a few of these techniques in action, you might have noticed that there was a lot of redundancy in our code. Additionally, you might've found it kind of annoying to compare the results for something trained using batch norm with another net trained with dropout instead. You also may have noticed that we separated regularization techniques from optimization techniques like Adam.

It turns out that Adam actually has some regularization properties: the running mean and variance of the gradients regularize the updates applied to the model. It's important to consider all the different ways your choices affect your model when you're experimenting with a new technique. If you know all of the knobs and how they work, you'll be able to find new patterns that work very well. This is a tedious process, but we want to show you how to manage a task of this scale with more convenient methods than those demonstrated here.

In the next assignment, we look at a different way to display hyperparameter tuning results using TensorBoard. This allows you to see the results of each experiment on the exact same plot. The tool is extremely useful for all of your deep learning needs. It will be the resource you use to debug your nets.

Additionally, we've included a copy of this assignment, but using CNN models instead. We've taken the liberty of condensing the CNN variations into a single model that's dictated by the parameters. We invite you to play around with different configurations of the CNN. You could add more fc layers, expand the size of some layers, or play around with different convolutional filters (I'd recommend that order in fact).

If you have an idea that you think is cool with this stuff, hit me up at philkuz@ml.berkeley.edu