1. [Linear Models are Limited](#Linear Models are Limited)
2. [2-Layer Neural Network](#2-Layer Neural Network)
    1. [Network of ReLUs](#Network of ReLUs)
    2. [TensorFlow ReLUs](#TensorFlow ReLUs)
    3. [Chain Rule and Backpropagation](#Chain Rule and Backpropagation)
3. [Deep Neural Network in TensorFlow](#Deep Neural Network in TensorFlow)
    1. [TensorFlow MNIST](#TensorFlow MNIST)
    2. [Learning Parameters](#Learning Parameters)
    3. [Hidden Layer Parameters](#Hidden Layer Parameters)
    4. [Weights and Biases](#Weights and Biases)
    5. [Input](#Input)
    6. [Multilayer Perceptron](#Multilayer Perceptron)
    7. [Optimizer](#Optimizer)
4. [Training a Deep Learning Network](#Training a Deep Learning Network)
5. [Save and Restore TensorFlow Models](#Save and Restore TensorFlow Models)
    1. [Saving Variables](#Saving Variables)
    2. [Save a Trained Model](#Save a Trained Model)
    3. [Load a Trained Model](#Load a Trained Model)
6. [Loading the Weights and Biases into a New Model](#Loading the Weights and Biases into a New Model)
    1. [Naming Error](#Naming Error)
7. [Regularization](#Regularization)
    1. [Investigation of Performance Under Validation Set](#Investigation of Performance Under Validation Set)
    2. [L2 Regularization](#L2 Regularization)
8. [Dropout](#Dropout)

# 1. Linear Models are Limited <a name='Linear Models are Limited'></a>

The model we've trained so far is simple, and it's also relatively limited.

__Quiz:__

How many trainable parameters did it actually have? Recall that the input was a 28 by 28 image and the output was 10 classes.

<img src='Figures4/Screen Shot 2017-03-26 at 18.00.25.png' width=400>

__Answer:__

$$
\begin{align}
\text{Total number of parameters} &= \text{size of } \mathbf{W} + \text{size of } \mathbf{b} \\
&= 28 \times 28 \times 10 + 10 \\
&= 7850
\end{align}
$$

If we have $N$ inputs and $K$ outputs, we have $(N+1)K$ parameters to use. In practice we might want to use many, many more parameters. On top of that, the model is linear, which means that the kinds of interactions we're capable of representing with it are somewhat limited. For example, if two inputs interact in an additive way, $y = x_1 + x_2$, the model can represent them well as a matrix multiply. But if two inputs interact in such a way that the outcome depends on the product of the two, $y = x_1 \times x_2$, we won't be able to model that efficiently with a linear model.
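As a quick sanity check of the count above, here is a minimal sketch in plain Python. The helper names `linear_param_count` and `interaction` are ours, introduced only for illustration. The second helper uses the fact that for any function of the form $w_1 x_1 + w_2 x_2 + b$ the combination $f(1,1) - f(1,0) - f(0,1) + f(0,0)$ is zero, so a non-zero value signals an interaction that a linear model cannot represent exactly.

```python
def linear_param_count(n_inputs, n_outputs):
    """Parameters in y = xW + b: an n_inputs x n_outputs matrix W plus n_outputs biases."""
    return (n_inputs + 1) * n_outputs


def interaction(f):
    """Zero for every function of the form w1*x1 + w2*x2 + b."""
    return f(1, 1) - f(1, 0) - f(0, 1) + f(0, 0)


print(linear_param_count(28 * 28, 10))   # 7850
print(interaction(lambda a, b: a + b))   # 0: additive, a linear model fits it
print(interaction(lambda a, b: a * b))   # 1: multiplicative, a linear model cannot fit it
```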

However, linear operations are really nice. Big matrix multiplies are exactly what GPUs were designed for, and numerically, linear operations are very stable.

<img src='Figures4/Screen Shot 2017-03-26 at 18.26.49.png' width=300>

We can show mathematically that small changes in the input can never yield big changes in the output. In addition, the derivatives are very nice too. The derivative of a linear function is constant.

<img src='Figures4/Screen Shot 2017-03-26 at 18.27.18.png' width=300>

This means that we can't get more stable numerically than a constant. So, we would like to keep the parameters inside big linear functions, but we would also want the entire model to be nonlinear.

We can't just keep multiplying our inputs by linear functions, because that's just equivalent to one big linear function as below.

<img src='Figures4/Screen Shot 2017-03-26 at 18.52.19.png' width=300>
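To make this concrete, here is a small sketch in plain Python with made-up matrices `w1` and `w2` (they are not taken from the model above). Applying the two matrices one after the other gives the same result as applying their product once, so stacking linear layers adds no expressive power.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


x = [[1.0, 2.0]]               # one input with two features
w1 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 3.0]]         # first linear map, 2 x 3
w2 = [[1.0], [2.0], [3.0]]     # second linear map, 3 x 1

two_steps = matmul(matmul(x, w1), w2)   # (x W1) W2
one_step = matmul(x, matmul(w1, w2))    # x (W1 W2), a single linear map

print(two_steps, one_step)
```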

So, we're going to have to introduce non-linearities.

# 2. 2-Layer Neural Network <a name='2-Layer Neural Network'></a>

### 2.1. Network of ReLUs <a name='Network of ReLUs'></a>

Rectified Linear Units (ReLUs) are literally the simplest non-linear functions. They're linear if $x$ is greater than $0$, and they're $0$ everywhere else.

<img src='Figures4/Screen Shot 2017-03-26 at 18.35.17.png' width = 400>

ReLUs have nice derivatives, as well. 

<img src='Figures4/Screen Shot 2017-03-26 at 18.37.28.png' width = 200>

When $x$ is less than zero, the value is 0. So, the derivative is 0 as well. When $x$ is greater than 0, the value is equal to x. So, the derivative is equal to 1. 
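In plain Python, the function and its derivative can be sketched as follows. The names `relu` and `relu_grad` are ours, and returning 0 for the derivative at exactly $x = 0$ is a convention we assume here, since the derivative is not defined at that point.

```python
def relu(x):
    """max(0, x): passes positive values through and clips the rest to 0."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 for x < 0 (0 is assumed at x == 0)."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.5, 3.0):
    print(x, relu(x), relu_grad(x))
```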

We can make our logistic classifier non-linear with a minimal change. Instead of having a single matrix
multiply as our classifier, we're going to insert a ReLU right in the middle.

<img src='Figures4/Screen Shot 2017-03-26 at 18.41.05.png' width=500>

We now have two matrices. One going from the inputs to the ReLUs, and another one connecting the ReLUs to the classifier. 

We've solved two of our problems. Our function is now nonlinear thanks
to the ReLUs in the middle, and we now have a new knob that we can tune,
the number $H$, which corresponds to the number of ReLU units that we have in the classifier. We can make it as big as we want.

__Note__: Depicted above is a "2-layer" neural network:

<img src='Figures4/RELU.png' width=500>

1. The first layer effectively consists of the set of weights and biases applied to X and passed through ReLUs. The output of this layer is fed to the next one, but is not observable outside the network, hence it is known as a hidden layer.

2. The second layer consists of the weights and biases applied to these intermediate outputs, followed by the softmax function to generate probabilities.

### 2.2. TensorFlow ReLUs <a name='TensorFlow ReLUs'></a>

<img src='Figures4/Screen Shot 2017-03-26 at 18.35.17.png' width = 400>

A rectified linear unit (ReLU) is a type of [activation function](https://en.wikipedia.org/wiki/Activation_function) that is defined as $f(x) = \max(0, x)$. The function returns 0 if $x$ is negative, otherwise it returns $x$. TensorFlow provides the ReLU function as __tf.nn.relu()__, as shown below.

```Python
# Hidden Layer with ReLU activation function
hidden_layer = tf.add(tf.matmul(features, hidden_weights), hidden_biases)
hidden_layer = tf.nn.relu(hidden_layer)

output = tf.add(tf.matmul(hidden_layer, output_weights), output_biases)
```

<img src='Figures4/insert-relu.png' width=500>

The above code applies the __tf.nn.relu()__ function to the __hidden_layer__, effectively turning off any negative values and acting like an on/off switch. Adding additional layers, like the output layer, after an activation function turns the model into a nonlinear function. This nonlinearity allows the network to solve more complex problems.

__Quiz:__

Use TensorFlow's ReLU function to turn the linear model below into a nonlinear model.

```Python
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model
linear1 = tf.add(tf.matmul(features, weights[0]), biases[0])
ReLU = tf.nn.relu(linear1)
linear2 = tf.add(tf.matmul(ReLU, weights[1]), biases[1])

# TODO: Print session results
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output = sess.run(linear2)
print(output)
```

```
[[  5.11000013   8.44000053]
 [  0.           0.        ]
 [ 24.01000214  38.23999786]]
```


__Answer:__

```Python
# Quiz Solution
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print session results
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits))
```
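As a cross-check that does not need TensorFlow, the same forward pass can be reproduced in plain Python. This is only a sketch: the helpers `matmul` and `relu_rows` are ours, and the biases are left out because they are all zero in the quiz.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu_rows(matrix):
    """Apply max(0, v) to every element."""
    return [[max(0.0, v) for v in row] for row in matrix]


hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]
features = [
    [1.0, 2.0, 3.0, 4.0],
    [-1.0, -2.0, -3.0, -4.0],
    [11.0, 12.0, 13.0, 14.0]]

hidden = relu_rows(matmul(features, hidden_layer_weights))
logits = matmul(hidden, out_weights)

for row in logits:
    print([round(v, 2) for v in row])
```

The second input row is entirely negative, so every hidden unit is switched off by the ReLU and the output row is all zeros, which matches the TensorFlow result.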

### 2.3. Chain Rule and Backpropagation <a name='Chain Rule and Backpropagation'></a>

<img src='Figures4/Screen Shot 2017-03-26 at 19.50.36.png' width = 400>

One reason to build this network by stacking simple operations, like multiplications, and sums, and ReLUs, on top of each other is
that it makes the math very simple.

The key mathematical insight is the chain rule. 

<img src='Figures4/Screen Shot 2017-03-26 at 19.52.42.png' width = 300>

If we have two functions that get composed, the chain rule tells us that we can compute the derivative of the composition simply by taking the product of the derivatives of the components. That's very powerful.
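A numerical check makes the rule tangible. The sketch below uses two example functions of our own choosing, $g(x) = 3x + 1$ and $f(u) = \sin(u)$, and compares the chain-rule derivative $f'(g(x)) \cdot g'(x)$ against a central finite difference.

```python
import math


def g(x):
    return 3.0 * x + 1.0


def g_prime(x):
    return 3.0


def f(u):
    return math.sin(u)


def f_prime(u):
    return math.cos(u)


def composed(x):
    return f(g(x))


def chain_rule_derivative(x):
    """Derivative of f(g(x)): product of the derivatives of the components."""
    return f_prime(g(x)) * g_prime(x)


x = 0.7
h = 1e-6
numeric = (composed(x + h) - composed(x - h)) / (2 * h)  # finite-difference estimate

print(chain_rule_derivative(x), numeric)
```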

As long as you know how to write the derivatives of your individual functions, there is a simple graphical way to combine them together and compute the derivative for the whole function.

<img src='Figures4/Screen Shot 2017-03-26 at 19.54.26.png' width = 500>

There is a way to write this chain rule that is very efficient computationally, with lots of data reuse, and that looks like a very simple data pipeline. Imagine the network is a stack of simple operations. Some have parameters like the matrix transforms, some don't, like the ReLUs.

<img src='Figures4/Screen Shot 2017-03-26 at 19.59.04.png' width = 400>

When we apply data to some input x, we have data flowing through the stack up to the predictions $y$. To compute the derivatives, we create another graph.

<img src='Figures4/Screen Shot 2017-03-26 at 20.01.01.png' width = 400>

The data in the new graph flows backwards through the network, gets combined using the chain rule that we saw before, and produces gradients. That graph can be derived completely automatically from the individual operations in the network. This is called back-propagation, and it's a very powerful concept. It makes computing derivatives of complex functions very efficient as long as the function is made up of simple blocks with simple derivatives.

<img src='Figures4/Screen Shot 2017-03-27 at 11.37.11.png' width = 500>

Running the model up to the predictions is often called the forward prop, and the model that goes backwards
is called the back prop. To run stochastic gradient descent, for every single little batch of data in the training set, we run the forward prop and then the back prop. That gives us gradients for each of the weights in the model. We then apply those gradients, scaled by the learning rate, to the original weights and update them. We repeat that all over again, many, many times. This is how the entire model gets optimized.

Keep this diagram in mind. In particular, each block of the back prop often takes about twice the memory that's needed for the forward prop, and twice the compute. That's important when we want to size the model and fit it in memory, for example.

<img src='Figures4/Screen Shot 2017-03-27 at 11.42.46.png' width = 500>
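The training loop described above (forward prop, back prop, update) can be sketched on a deliberately tiny model. This is an illustration under our own assumptions, not the network from this section: a one-weight linear model $\hat{y} = wx + b$ with a squared-error loss, synthetic data generated from $y = 2x + 1$, and batches visited in a fixed order so the run is deterministic.

```python
def forward(w, b, x):
    """Forward prop: compute the prediction."""
    return w * x + b


def backward(w, b, batch):
    """Back prop: gradients of the mean of 0.5 * (y_hat - y)**2 over the batch."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = forward(w, b, x) - y   # dL/dy_hat
        grad_w += error * x            # chain rule: dL/dy_hat * dy_hat/dw
        grad_b += error                # chain rule: dL/dy_hat * dy_hat/db
    return grad_w / len(batch), grad_b / len(batch)


# Synthetic data from y = 2x + 1 (made up for this example)
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 2

for epoch in range(200):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w, grad_b = backward(w, b, batch)
        # Apply the gradients, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

After training, `w` and `b` should end up close to the values 2 and 1 that generated the data.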

# 3. Deep Neural Network in TensorFlow <a name='Deep Neural Network in TensorFlow'></a>

We've seen how to build a logistic classifier using TensorFlow. Now we're going to see how to use the logistic classifier to build a deep neural network. In the following walkthrough, we'll step through TensorFlow code written to classify the handwritten digits in the MNIST database.

### 3.1. TensorFlow MNIST <a name='TensorFlow MNIST'></a>

The MNIST dataset is provided by TensorFlow, which batches and One-Hot encodes the data for us.

```Python
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("datasets/mnist/", one_hot=True, reshape=False)
```

```
Extracting datasets/mnist/train-images-idx3-ubyte.gz
Extracting datasets/mnist/train-labels-idx1-ubyte.gz
Extracting datasets/mnist/t10k-images-idx3-ubyte.gz
Extracting datasets/mnist/t10k-labels-idx1-ubyte.gz
```


### 3.2. Learning Parameters <a name='Learning Parameters'></a>

The focus here is on the architecture of multilayer neural networks, not parameter tuning, so here we'll just give the learning parameters.

```Python
import tensorflow as tf

# Parameters
learning_rate = 0.001
training_epochs = 20
batch_size = 128  # Decrease batch size if you don't have enough memory
display_step = 1

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)
```

### 3.3. Hidden Layer Parameters <a name='Hidden Layer Parameters'></a>

The variable **n\_hidden\_layer** determines the size of the hidden layer in the neural network. This is also known as the width of a layer.

```Python
n_hidden_layer = 256 # layer number of features
```

### 3.4. Weights and Biases <a name='Weights and Biases'></a>

Deep neural networks use multiple layers, with each layer requiring its own weights and bias. The **'hidden\_layer'** weights and bias are for the hidden layer. The __'out'__ weights and bias are for the output layer. If the neural network were deeper, there would be weights and biases for each additional layer.

```Python
# Store layers weight & bias
weights = {
    'hidden_layer': tf.Variable(tf.random_normal([n_input, n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_hidden_layer, n_classes]))
}
biases = {
    'hidden_layer': tf.Variable(tf.random_normal([n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
```

### 3.5. Input <a name='Input'></a>

The MNIST data is made up of 28px by 28px images with a single [channel](https://en.wikipedia.org/wiki/Channel_(digital_image%29). The __tf.reshape()__ function below reshapes the 28px by 28px matrices in _x_ into row vectors of 784px.

```Python
# tf Graph input
x = tf.placeholder("float", [None, 28, 28, 1])
y = tf.placeholder("float", [None, n_classes])

x_flat = tf.reshape(x, [-1, n_input])
```

### 3.6. Multilayer Perceptron <a name='Multilayer Perceptron'></a>

<img src='Figures4/multi-layer.png' width=500>

We've seen the linear function __tf.add(tf.matmul(x_flat, weights['hidden_layer']), biases['hidden_layer'])__ before, also known as __xw + b__. Combining linear functions together using a ReLU will give you a two layer network.

```Python
# Hidden layer with RELU activation
layer_1 = tf.add(tf.matmul(x_flat, weights['hidden_layer']),
    biases['hidden_layer'])
layer_1 = tf.nn.relu(layer_1)
# Output layer with linear activation
logits = tf.add(tf.matmul(layer_1, weights['out']), biases['out'])
```

### 3.7. Optimizer <a name='Optimizer'></a>

This is the same optimization technique used in the Intro to TensorFlow lab.

```Python
# Define loss and optimizer
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)
```
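To see what the cost in the cell above measures, here is a plain-Python sketch of the same computation: a softmax over the logits, the cross-entropy against a one-hot label, and the mean over the batch. The function names and the small batch of logits and labels are made up for illustration, and this is not how TensorFlow implements the op internally.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot_label):
    """-sum(label * log(probability)) for a single example."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


# Hypothetical batch: two examples, three classes
batch_logits = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
batch_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
cost = sum(losses) / len(losses)  # the mean over the batch

print(losses, cost)
```

For the second example all logits are equal, so each class gets probability 1/3 and the loss is $\log 3$.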

### 3.8. Session

The MNIST library in TensorFlow provides the ability to receive the dataset in batches. Calling the __mnist.train.next_batch()__ function returns a subset of the training data.

```Python
# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop)
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
```
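The number of steps per epoch depends on how the division is rounded. The sketch below assumes the training split holds 55,000 examples, which is our assumption about what `mnist.train.num_examples` returns for this loader, so substitute the real value if it differs.

```python
import math

num_examples = 55000   # assumed size of mnist.train
batch_size = 128

steps_rounded_down = int(num_examples / batch_size)       # rounding used in the loop above
steps_rounded_up = math.ceil(num_examples / batch_size)   # one extra step covers the remainder

print(steps_rounded_down, steps_rounded_up)
```

Rounding down skips the leftover examples that do not fill a whole batch in a given epoch; rounding up runs one more step to cover them.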

# 4. Training a Deep Learning Network <a name='Training a Deep Learning Network'></a>

So now we have a small neural network. It's not particularly deep,
just two layers. We can make it bigger and more complex by increasing the size of that hidden layer in the middle. But it turns out that increasing this $H$ is not particularly efficient in general. We need to make it very, very big, and then it gets really hard to train.

<img src='Figures4/Screen Shot 2017-03-27 at 13.25.24.png' width=500>

This is where the central idea of deep learning comes into play. Instead of making the hidden layer wider, we can add more layers and make the model deeper. There are two good reasons to do that. The first is __parameter efficiency__: we can typically get much more performance with fewer parameters by going deeper rather than wider. The second is that a lot of natural phenomena tend to have a hierarchical structure, which deep models naturally capture.
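To get a feel for the parameter budgets involved, the sketch below counts weights and biases for a wide two-layer network and a narrower three-layer one. The layer sizes are hypothetical, and the count says nothing about accuracy by itself; it only shows how quickly a wide hidden layer consumes parameters.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


wide = dense_param_count([784, 2048, 10])       # one very wide hidden layer
deep = dense_param_count([784, 256, 256, 10])   # two narrower hidden layers

print(wide, deep)
```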

<img src='Figures4/Screen Shot 2017-03-27 at 13.26.11.png' width=500>

If we poke at a model for images, for example, and visualize what the model learns, we'll often find very simple things at the lowest layers, like lines or edges. Once we move up, we tend to see more complicated things like geometric shapes. Go further up and we start seeing things like objects, faces.

This is very powerful because the model structure matches the kind of abstractions that we might expect to see in our data. As a result, the model has an easier time learning them.



# 5. Save and Restore TensorFlow Models <a name='Save and Restore TensorFlow Models'></a>

Training a model can take hours. But once we close our TensorFlow session, we lose all the trained weights and biases. If we were to reuse the model in the future, we would have to train it all over again!

Fortunately, TensorFlow gives us the ability to save our progress using a class called [tf.train.Saver](https://www.tensorflow.org/api_docs/python/tf/train/Saver). This class provides the functionality to save any [tf.Variable](https://www.tensorflow.org/api_docs/python/tf/Variable) to our file system.

### 5.1. Saving Variables <a name='Saving Variables'></a>

Let's start with a simple example of saving weights and bias Tensors. For the first example we'll just save two variables. Later examples will save all the weights in a practical model.

```Python
import tensorflow as tf

# The file path to save the data
save_file = './model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

# Class used to save and/or restore Tensor Variables
saver = tf.train.Saver()

with tf.Session() as sess:
    # Initialize all the Variables
    sess.run(tf.global_variables_initializer())

    # Show the values of weights and bias
    print('Weights:')
    print(sess.run(weights))
    print('Bias:')
    print(sess.run(bias))

    # Save the model
    saver.save(sess, save_file)
```

```
Weights:
[[-1.1214844  -0.21496792 -0.78367382]
 [-1.11662614 -0.88709235 -0.73158282]]
Bias:
[-0.21182585  0.50396538  0.25480953]
```


The Tensors __weights__ and __bias__ are set to random values using the [tf.truncated_normal()](https://www.tensorflow.org/api_docs/python/tf/truncated_normal) function. The values are then saved to the __save_file__ location, "model.ckpt", using the [tf.train.Saver.save()](https://www.tensorflow.org/api_docs/python/tf/train/Saver#save) function. (The ".ckpt" extension stands for "checkpoint".)

If we're using TensorFlow 0.11.0RC1 or newer, a file called "model.ckpt.meta" will also be created. This file contains the TensorFlow graph.

### 5.2. Save a Trained Model <a name='Save a Trained Model'></a>

Let's see how to train a model and save its weights.

First start with a model:

```Python
# Remove previous Tensors and Operations
tf.reset_default_graph()

from tensorflow.examples.tutorials.mnist import input_data
import numpy as np

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('datasets/mnist', one_hot=True)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```

```
Extracting datasets/mnist/train-images-idx3-ubyte.gz
Extracting datasets/mnist/train-labels-idx1-ubyte.gz
Extracting datasets/mnist/t10k-images-idx3-ubyte.gz
Extracting datasets/mnist/t10k-labels-idx1-ubyte.gz
```


Let's train that model, then save the weights:

```Python
import math

save_file = './train_model.ckpt'
batch_size = 128
n_epochs = 100

saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training cycle
    for epoch in range(n_epochs):
        total_batch = math.ceil(mnist.train.num_examples / batch_size)

        # Loop over all batches
        for i in range(total_batch):
            batch_features, batch_labels = mnist.train.next_batch(batch_size)
            sess.run(
                optimizer,
                feed_dict={features: batch_features, labels: batch_labels})

        # Print status for every 10 epochs
        if epoch % 10 == 0:
            valid_accuracy = sess.run(
                accuracy,
                feed_dict={
                    features: mnist.validation.images,
                    labels: mnist.validation.labels})
            print('Epoch {:<3} - Validation Accuracy: {}'.format(
                epoch,
                valid_accuracy))

    # Save the model
    saver.save(sess, save_file)
    print('Trained Model Saved.')
```

```
Epoch 0   - Validation Accuracy: 0.14180000126361847
Epoch 10  - Validation Accuracy: 0.2818000018596649
Epoch 20  - Validation Accuracy: 0.42899999022483826
Epoch 30  - Validation Accuracy: 0.5163999795913696
Epoch 40  - Validation Accuracy: 0.573199987411499
Epoch 50  - Validation Accuracy: 0.6126000285148621
Epoch 60  - Validation Accuracy: 0.6448000073432922
Epoch 70  - Validation Accuracy: 0.673799991607666
Epoch 80  - Validation Accuracy: 0.6953999996185303
Epoch 90  - Validation Accuracy: 0.7164000272750854
Trained Model Saved.
```


### 5.3. Load a Trained Model <a name='Load a Trained Model'></a>

Let's load the weights and bias from the saved checkpoint file, then check the test accuracy.

```Python
saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    saver.restore(sess, save_file)

    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: mnist.test.images, labels: mnist.test.labels})

print('Test Accuracy: {}'.format(test_accuracy))
```

```
Test Accuracy: 0.7372000217437744
```


# 6. Loading the Weights and Biases into a New Model <a name='Loading the Weights and Biases into a New Model'></a>

Sometimes we might want to adjust, or "fine-tune", a model that we have already trained and saved.

However, loading saved Variables directly into a modified model can generate errors. Let's go over how to avoid these problems.

### 6.1. Naming Error <a name='Naming Error'></a>

TensorFlow uses a string identifier for Tensors and Operations called __name__. If a name is not given, TensorFlow will create one automatically. TensorFlow will give the first node the name __< Type \>__, and then give the name **< Type >\_< number >** for the subsequent nodes. Let's see how this can affect loading a model with a different order of __weights__ and __bias__:

```Python
import tensorflow as tf

# Remove the previous weights and bias
tf.reset_default_graph()

save_file = './model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]))
weights = tf.Variable(tf.truncated_normal([2, 3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - ERROR
    saver.restore(sess, save_file)
```

```
Save Weights: Variable:0
Save Bias: Variable_1:0
Load Weights: Variable_1:0
Load Bias: Variable:0


InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [2,3]
	 [[Node: save/Assign = Assign[T=DT_FLOAT, _class=["loc:@Variable"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Variable, save/RestoreV2)]]
```



You'll notice that the __name__ properties for __weights__ and __bias__ are different than when you saved the model. This is why the code produces the "Assign requires shapes of both tensors to match" error. The code __saver.restore(sess, save\_file)__ is trying to load weight data into __bias__ and bias data into __weights__.
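The mismatch can be illustrated without TensorFlow. The toy sketch below is not TensorFlow's implementation: the helper `auto_names` merely mimics the naming pattern described above (without the `:0` suffix), and plain dictionaries of shapes stand in for the checkpoint and the new graph.

```python
def auto_names(count, type_name="Variable"):
    """Mimic the automatic naming pattern: Variable, Variable_1, Variable_2, ..."""
    return [type_name if i == 0 else "{}_{}".format(type_name, i)
            for i in range(count)]


# When saving: weights (shape [2, 3]) was created first, bias (shape [3]) second
saved_shapes = dict(zip(auto_names(2), [(2, 3), (3,)]))

# When loading: bias was created first, weights second
load_shapes = dict(zip(auto_names(2), [(3,), (2, 3)]))

# Restoring matches entries by name, so compare the shapes stored under each name
mismatches = [name for name in load_shapes
              if load_shapes[name] != saved_shapes[name]]

print(saved_shapes)
print(load_shapes)
print(mismatches)
```

Both names exist on each side, but each one points at a tensor of a different shape, which is exactly the situation the error message reports.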

Instead of letting TensorFlow set the __name__ property, let's set it manually:

```Python
import tensorflow as tf

tf.reset_default_graph()

save_file = './model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - No Error
    saver.restore(sess, save_file)

print('Loaded Weights and Bias successfully.')
```

That worked! The Tensor names match and the data loaded correctly.

# 7. Regularization <a name='Regularization'></a>

Why did we not figure out earlier that deep models were effective? There are many reasons, but mostly it's because deep models only really shine if
we have enough data to train them. It's only in recent years that large enough data sets have made their way to the academic world. Another reason is that we know better today how to train very, very big models using better regularization techniques.

There is a general issue when we're doing numerical optimization, the so-called skinny jeans problem. Skinny jeans look great, they fit perfectly, but they're really, really hard to get into. So, most people end up wearing jeans that are just a bit too big. It's exactly the same with deep networks. The network that's just the right size for the data is very, very hard to optimize. So in practice, we always try networks that are way too big for our data, and then we try our best to prevent them from overfitting.

### 7.1. Investigation of Performance Under Validation Set <a name='Investigation of Performance Under Validation Set'></a>

The first way we prevent overfitting is by looking at the performance on a validation set, and stopping training as soon as we stop improving. It's called early termination, and it's still the best way to prevent the network from over-optimizing on the training set.

<img src='Figures4/Screen Shot 2017-03-27 at 16.35.34.png' width=500>
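
Early termination can be sketched in a few lines of plain Python. This is only an illustration, assuming we record one validation loss per epoch; the `early_stopping_epoch` helper, its `patience` parameter, and the loss values are made up for the example.

```Python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights we would keep.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs (hypothetical parameter).
    """
    best_epoch = 0
    best_loss = float('inf')
    epochs_without_improvement = 0

    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation performance stopped improving

    return best_epoch


# Made-up validation losses: they improve, then start rising (overfitting).
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(losses))
```

In a real training loop we would also save the weights at the best epoch so that they can be restored after stopping.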

Another way is to apply regularization. Regularizing means applying artificial constraints on the network that implicitly reduce the number of free parameters while not making it more difficult to optimize.

In the skinny jeans analogy, think stretch pants. They fit just as well, but because they're flexible, they don't make things harder to fit in. The stretch pants of deep learning are called L2 regularization. The idea is to add another term to the loss, which penalizes large weights.

<img src='Figures4/Screen Shot 2017-03-27 at 16.37.50.png' width=400>

### 7.2. L2 Regularization <a name='L2 Regularization'></a>

L2 regularization is typically achieved by adding the squared L2 norm of the weights to the loss, multiplied by a small constant.

The nice thing about L2 regularization is that it's very, very simple. Because we just add it to the loss, the structure of our network doesn't have to change. We can even compute its derivative by hand. Remember that the squared L2 norm is the sum of the squares of the individual elements in a vector, so the derivative of $\frac{1}{2}\|W\|_2^2$ with respect to $W$ is just $W$.

<img src='Figures4/Screen Shot 2017-03-27 at 16.46.18.png' width=400>

We know that the derivative of $\frac{1}{2}x^2$ in one dimension is simply $x$. So when we take that derivative for each component of the vector, we get the same components back. Therefore, the answer is the third one, $W$.
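
As a quick numerical check of this result, the sketch below computes the penalty $\frac{1}{2}\|W\|_2^2$ for a small weight vector and compares a finite-difference estimate of its gradient with the weights themselves. The helper names, the weight values, the constant `beta`, and the `data_loss` placeholder are all illustrative assumptions.

```Python
def l2_penalty(weights):
    """Half the sum of squared weights: 0.5 * ||w||^2."""
    return 0.5 * sum(w * w for w in weights)


def numerical_gradient(f, weights, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at weights."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


weights = [0.5, -1.5, 2.0]  # made-up values
beta = 0.01                 # small regularization constant (assumed)
data_loss = 0.8             # placeholder for the unregularized loss

# The regularized loss is the original loss plus the scaled penalty.
total_loss = data_loss + beta * l2_penalty(weights)
gradient = numerical_gradient(l2_penalty, weights)

print(total_loss)
print(gradient)  # each entry is close to the corresponding weight
```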



# 8. Dropout <a name='Dropout'></a>

There's another important technique for regularization that only emerged relatively recently and works amazingly well. It's called dropout and it works like this.

Imagine that we have one layer that connects to another layer.

<img src='Figures4/Screen Shot 2017-03-27 at 17.37.43.png' width=400>

The values that go from one layer to the next are often called activations. Now take those activations, and for every example we train the network on, randomly set half of them to zero. 

<img src='Figures4/Screen Shot 2017-03-27 at 17.52.25.png' width=300>

Completely randomly, we basically take half of the data that's flowing through the network and just destroy it.

<img src='Figures4/Screen Shot 2017-03-27 at 17.55.56.png' width=300>

And then randomly again.

<img src='Figures4/Screen Shot 2017-03-27 at 17.56.42.png' width=300>

So what happens with dropout? 

<img src='Figures4/Screen Shot 2017-03-27 at 18.01.25.png' width=500>

The network can never rely on any given activation to be present, because it might be squashed at any given moment. So it is forced to learn a redundant representation for everything to make sure that at least some of the information remains.

It's like a game of whack-a-mole. One activation gets smashed, but there's always one or more that do the same job and that don't get killed. So everything remains fine at the end.

Forcing our network to learn redundant representations might sound very inefficient. But in practice, it makes things more robust and prevents overfitting. It also makes the network act as if it were taking the consensus over an ensemble of networks, which is always a good way to improve performance.

Dropout is one of the most important techniques to emerge in the last few years. If dropout doesn't work for us, we should probably be using a bigger network.

When we evaluate a network that's been trained with dropout, we obviously no longer want this randomness. We want something deterministic. Instead, we're going to want to take the consensus over these redundant models. We get the consensus opinion by averaging the activations.

<img src='Figures4/Screen Shot 2017-03-27 at 18.11.11.png' width=400>

We want $Y_e$ here to be the average of all the $y_t$s that we
got during training. 

Here's a trick to make sure this expectation holds. During training, not only do we zero out the activations that we drop, but we also scale the remaining activations by a factor of 2.

<img src='Figures4/Screen Shot 2017-03-27 at 18.16.14.png' width=400>

This way, when it comes time to average them during evaluation, we just remove the dropout and scaling operations from our neural net.

<img src='Figures4/Screen Shot 2017-03-27 at 18.19.46.png' width=400>

And the result is an average of these activations that is properly scaled.
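
To see why the factor of 2 works, consider a single activation $a$ that is kept with probability $\frac{1}{2}$ and write $\tilde{a}$ for its value during training (this notation is introduced here only for illustration):

$$
\begin{align}
\mathbb{E}[\tilde{a}] &= \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 2a \\
&= a
\end{align}
$$

So the expected value of the scaled training-time activation equals the plain activation that we use at evaluation time.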



### 8.1. TensorFlow Dropout

<img src='Figures4/dropout-node.jpg' width=400>
Figure 1: Taken from the paper "[Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)"

Dropout is a regularization technique for reducing overfitting. The technique temporarily drops units ([artificial neurons](https://en.wikipedia.org/wiki/Artificial_neuron)) from the network, along with all of those units' incoming and outgoing connections. Figure 1 illustrates how dropout works.

TensorFlow provides the [tf.nn.dropout()](https://www.tensorflow.org/api_docs/python/tf/nn/dropout) function, which we can use to implement dropout.

```Python
keep_prob = tf.placeholder(tf.float32) # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])
```

The code above illustrates how to apply dropout to a neural network.

The __tf.nn.dropout()__ function takes in two parameters:
1. __hidden_layer__: the tensor to which we would like to apply dropout
2. __keep_prob__: the probability of keeping (i.e. not dropping) any given unit

__keep_prob__ allows us to adjust the number of units to drop. To compensate for dropped units, __tf.nn.dropout()__ multiplies all units that are kept (i.e. not dropped) by $\frac{1}{\text{keep_prob}}$.

During training, a good starting value for keep_prob is 0.5.

During testing, use a keep_prob value of 1.0 to keep all units and maximize the power of the model.
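
The scaling behaviour described above can be sketched in plain Python. This illustrates the idea only and is not TensorFlow's implementation; the `dropout` helper, the seeded `random.Random` generator, and the sample activations are assumptions made for the example.

```Python
import random


def dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob; scale kept units by 1 / keep_prob."""
    output = []
    for a in activations:
        if rng.random() < keep_prob:
            output.append(a / keep_prob)  # kept: scaled up to compensate
        else:
            output.append(0.0)            # dropped
    return output


rng = random.Random(0)  # seeded so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

# Training: each unit is zeroed with probability 0.5, kept units are doubled.
print(dropout(activations, keep_prob=0.5, rng=rng))

# Testing: keep_prob=1.0 leaves every activation unchanged.
print(dropout(activations, keep_prob=1.0, rng=rng))

# Averaged over many random masks, the output approaches the input.
trials = 20000
totals = [0.0] * len(activations)
for _ in range(trials):
    for i, value in enumerate(dropout(activations, 0.5, rng)):
        totals[i] += value
averages = [t / trials for t in totals]
print(averages)
```

The last loop shows why a value of 1.0 is appropriate at test time: the unmodified activations already match the average of what the network saw during training.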

__Quiz 1:__

Take a look at the code snippet below. Do you see what's wrong?

There's nothing wrong with the syntax; however, the test accuracy is extremely low.

```Python
...

keep_prob = tf.placeholder(tf.float32) # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i in range(batches):
            ....

            sess.run(optimizer, feed_dict={
                features: batch_features,
                labels: batch_labels,
                keep_prob: 0.5})

    validation_accuracy = sess.run(accuracy, feed_dict={
        features: test_features,
        labels: test_labels,
        keep_prob: 0.5})
```

__Answer 1:__

__keep_prob__ should be set to 1.0 when evaluating validation accuracy. We should only drop units while training the model. During validation or testing, we should keep all of the units to maximize accuracy.

__Quiz 2:__

This quiz starts with the code from the ReLU quiz and adds a dropout layer. Build a model with a ReLU layer and a dropout layer, using the __keep_prob__ placeholder to pass in a probability of __0.5__. Print the logits from the model.

__Note__: Output will be different every time the code is run. This is caused by dropout randomizing the units it drops.

In [None]:
import tensorflow as tf

hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[0.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model with Dropout
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob=0.5)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print logits from a session
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    output = session.run(logits)
    print(output)

__Answer 2:__

```Python
# Quiz Solution
import tensorflow as tf

hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[0.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model with Dropout
keep_prob = tf.placeholder(tf.float32)
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print logits from a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits, feed_dict={keep_prob: 0.5}))
```