In [1]:
%load_ext nb_black


- in the last lesson we trained a simple logistic classifier on images
- now, we're going to take this classifier and turn it into a deep network
  - and it's going to be just a few lines of code, so make sure you understand well what was going on in the previous model
- in the second part you are going to take a small peek into how our optimizer does all the hard work for you, computing gradients for arbitrary functions
- and then we are going to look together at the important topic of regularization, which will enable us to train much, much larger models

# Number of Parameters

- the simple model that you've trained so far is nice, but it's also relatively limited
- here is a question for you
  - how many trainable parameters did it actually have?
  - as a reminder, each input was a 28 by 28 image, and the output was 10 classes

<img src="resources/model_question_number_of_parameters.png" style="width: 70%;">

- the matrix $W$ here takes the entire image as its input, so $28 \times 28$ pixels
- the output is of size $10$ (letters A - J), so the other dimension of the matrix is $10$
- the biases are just $1 \times 10$
- so the total number of parameters is $28 \times 28 \times 10 + 10 = 7,850$

- that's the case in general
- if you have $N$ inputs, and $K$ outputs, you have $(N+1)K$ parameters to use, not one more
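
- as a quick check of the $(N+1)K$ formula, here is a minimal sketch; the helper name `linear_param_count` is made up for illustration and is not part of the lesson code

```python
def linear_param_count(n_inputs, n_outputs):
    """Parameters of one linear layer: an n_inputs x n_outputs weight
    matrix plus one bias per output, i.e. (N + 1) * K."""
    return (n_inputs + 1) * n_outputs


# 28 x 28 pixel images as input, 10 output classes
print(linear_param_count(28 * 28, 10))  # 7850
```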

<img src="resources/number_of_parameters.png" style="width: 70%;">

- the thing is, you might want to use many, many more parameters in practice
- also, it's linear
  - this means that the kind of interactions that you're capable of representing with that model is somewhat limited
    - for example, if two inputs interact in an additive way ($Y = X_1 + X_2$), your model can represent them well as a matrix multiply
    - but if two inputs interact in the way that the outcome depends on the product ($Y = X_1 \times X_2$) of the two for example, you won't be able to model that efficiently with a linear model
  - linear operations are really nice though
    - big matrix multiplies are exactly what GPUs were designed for; they're relatively cheap and very, very fast
  - numerically, linear operations are very stable: $Y = WX \rightarrow \Delta Y \sim |W| \Delta X$
    - the change in the output ($\Delta Y$) is bounded by $|W|$ times the change in the input ($\Delta X$), so as long as the weights stay bounded, small changes in the input can never yield big changes in the output
  - the derivatives are very nice too
    - the derivative of a linear function ($Y = WX$) is constant
    - $\dfrac {dY}{dX} = W^T$ &nbsp;&nbsp;,&nbsp;&nbsp; $\dfrac {dY}{dW} = X^T$
    - you can't get more stable numerically than a constant


- we would like to keep our parameters inside big linear functions, but we would also want the entire model to be nonlinear
- we can't just keep multiplying our inputs by linear functions, because that's just equivalent to one big linear function
- so, we're going to have to introduce non-linearities
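
- to see why stacking linear functions buys nothing, here is a small sketch in plain Python; the `matmul` helper and the toy matrices `w1` and `w2` are assumptions chosen for illustration

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


x = [[1.0, 2.0]]                          # one input row with 2 features
w1 = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]   # first linear layer, 2 x 3
w2 = [[1.0], [2.0], [3.0]]                # second linear layer, 3 x 1

two_steps = matmul(matmul(x, w1), w2)     # apply the layers one after the other
one_step = matmul(x, matmul(w1, w2))      # collapse both layers into one matrix first

print(two_steps, one_step)  # both are [[29.0]]
```

- the two results agree, which is exactly why a nonlinearity has to sit between the two matrices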

# Rectified Linear Units (ReLU)

- let me introduce you to the lazy engineer's favorite non-linear function: **the rectified linear units**, or **ReLU** for short
- ReLUs are literally the simplest non-linear functions you can think of
- they're linear if $x > 0$, and they're $0$ everywhere else

<img src="resources/RELU_function.png" style="width: 50%;">

- ReLUs have nice derivatives, as well
  - when $x < 0$ the ReLU is $0$ so the derivative is $0$ as well
  - when $x > 0$ the ReLU is equal to $x$ so the derivative is equal to $1$
  - the derivative of the segment to the left is zero, since the value remains constant ($y=0$), and the derivative to the right is a constant ($=1$), since the function grows linearly ($y=x$)
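
- the function and its derivative can be sketched in a few lines of plain Python; the names `relu` and `relu_grad` are illustrative, and returning $0$ for the derivative at exactly $x = 0$ is a convention assumed here (the derivative is not defined at that point)

```python
def relu(x):
    """Rectified linear unit: x for positive inputs, 0 otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of the ReLU: 1 to the right of zero, 0 to the left.
    At exactly x == 0 this sketch returns 0 by convention."""
    return 1.0 if x > 0 else 0.0


for value in [-2.0, 0.5, 3.0]:
    print(value, relu(value), relu_grad(value))
```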

<img src="resources/RELU_function_derivative.png" style="width: 30%;">

# Network of ReLUs

- because we're lazy engineers, we're going to take something that works, our logistic classifier, and make the minimal change needed to make it nonlinear
- we're going to construct our new function in the simplest way that we can think of
- instead of having a single matrix multiply as our classifier, we're going to insert a ReLU right in the middle
- we now have two matrices, one going from the inputs to the ReLUs and another one connecting the ReLUs to the classifier
- we've solved two of our problems
  - our function is now nonlinear, thanks to the ReLU in the middle
  - and we now have a new knob that we can tune: the number $H$, which corresponds to the number of ReLU units that we have in the classifier; we can make it as big as we want

<img src="resources/network_of_RELUs.png" style="width: 80%;">

## Quiz: TensorFlow ReLu

- in this lesson, you'll learn how to build multilayer neural networks with TensorFlow
- adding a hidden layer to a network allows it to model more complex functions
- also, using a non-linear activation function on the hidden layer lets it model non-linear functions

<img src="resources/two_layer_neural_network_relu.png" style="width: 70%;">

- depicted above is a "2-layer" neural network:
  - the first layer effectively consists of the set of weights and biases applied to $X$ and passed through ReLUs
    - the output of this layer is fed to the next one, but is not observable outside the network, hence it is known as a hidden layer
  - the second layer consists of the weights and biases applied to these intermediate outputs, followed by the softmax function to generate probabilities

- a rectified linear unit (ReLU) is a type of [activation function](https://en.wikipedia.org/wiki/Activation_function) that is defined as `f(x) = max(0, x)`
  - the function returns $0$ if `x` is negative, otherwise it returns `x`
- TensorFlow provides the ReLU function as `tf.nn.relu()`

```python
# Hidden Layer with ReLU activation function
hidden_layer = tf.add(tf.matmul(features, hidden_weights), hidden_biases)
hidden_layer = tf.nn.relu(hidden_layer)

output = tf.add(tf.matmul(hidden_layer, output_weights), output_biases)
```

- the above code applies the `tf.nn.relu()` function to `hidden_layer`, effectively turning off any negative values and acting like an on/off switch
- adding additional layers, like the output layer, after an activation function turns the model into a nonlinear function
- this nonlinearity allows the network to solve more complex problems

In [2]:
# In this quiz, you'll use TensorFlow's ReLU function to turn the linear model below into a nonlinear model.

import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8],
]
out_weights = [[0.1, 0.6], [0.2, 0.1], [0.7, 0.9]]

# Weights and biases
weights = [tf.Variable(hidden_layer_weights), tf.Variable(out_weights)]
biases = [tf.Variable(tf.zeros(3)), tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable(
    [[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]]
)


In [3]:
# TODO: Create Model
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: save and print session results on a variable named "output"
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output = sess.run(logits)
    print(output)

[[ 5.1099997  8.44     ]
 [ 0.         0.       ]
 [24.01      38.24     ]]
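
- as a sanity check, the same forward pass can be reproduced without TensorFlow; this is only a sketch in plain Python, where `matmul` and `relu_rows` are helper names made up for illustration and the zero biases are left out

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu_rows(matrix):
    """Apply max(0, x) to every entry of a matrix."""
    return [[max(0.0, value) for value in row] for row in matrix]


hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8],
]
out_weights = [[0.1, 0.6], [0.2, 0.1], [0.7, 0.9]]
features = [[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]]

# The biases in the quiz are all zero, so they are omitted here
hidden = relu_rows(matmul(features, hidden_layer_weights))
logits = matmul(hidden, out_weights)

for row in logits:
    print([round(value, 2) for value in row])
```

- the second input row is entirely negative, so every hidden unit is switched off by the ReLU and both logits come out as $0$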



# No Neurons

- yes, I could talk about neural networks as metaphors for the brain
- it's nice and it might even be true, but it comes with a lot of baggage and it can sometimes lead you astray
- so I'm not going to talk about it at all in this course
- no need to be a wizard neuroscientist
- neural networks naturally make sense if you're simply a lazy engineer with a big GPU who just wants machine learning to work better

# The Chain Rule

- as long as you know how to write the derivatives of your individual functions, there is a simple graphical way to combine them together and compute the derivative for the whole function
- there is a way to write this chain rule that is very efficient computationally, with lots of data reuse and that looks like a very simple data pipeline
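
- a minimal numeric sketch of the chain rule, assuming the toy functions $g(x) = 3x + 1$ and $f(u) = u^2$ (both chosen here for illustration only); the intermediate value `u` computed on the way forward is reused on the way back, which is the data reuse mentioned above

```python
def g(x):
    return 3.0 * x + 1.0


def dg(x):
    return 3.0


def f(u):
    return u ** 2


def df(u):
    return 2.0 * u


def chain_rule_derivative(x):
    """d/dx f(g(x)) = f'(g(x)) * g'(x)."""
    u = g(x)              # forward: intermediate value, reused below
    return df(u) * dg(x)  # backward: multiply the local derivatives


def numerical_derivative(x, eps=1e-6):
    """Central finite difference, used only to check the result."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)


print(chain_rule_derivative(2.0))   # 42.0
print(numerical_derivative(2.0))    # approximately 42
```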

<img src="resources/chain_rule.png" style="width: 80%;">

# Backprop

- imagine your network is a stack of simple operations; like linear transforms, ReLUs, whatever you want
  - some have parameters, like the matrix transforms, some don't, like the ReLUs
- when you apply your model to some input $x$, you have data flowing through the stack up to your predictions $y$
- to compute the derivatives, you create another graph that looks like this:

<img src="resources/backprop.png" style="width: 80%;">

- the data in that new graph flows backwards through the network, gets combined using the chain rule that we saw before, and produces gradients
  - that graph can be derived completely automatically from the individual operations in your network
  - so, most deep learning frameworks will just do it for you
  - this is called back propagation and it's a very powerful concept
    - it makes computing derivatives of complex functions very efficient, as long as the function is made up of simple blocks with simple derivatives


- running the model up to the predictions is often called the *forward prop*
- and the model that goes backwards is called the *back prop*


- to recap, to run stochastic gradient descent, for every single little batch of data in your training set you're going to run the forward prop and then the back prop
  - that will give you gradients for each of the weights in your model
- then you're going to apply those gradients with learning rates to your original weights and update them
- you're going to repeat that all over again many, many times
- this is how your entire model gets optimized
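
- the recap above can be sketched by hand for a tiny network; this is an illustration under stated assumptions, not the framework's implementation: a scalar model $y = w_2 \cdot \mathrm{ReLU}(w_1 x)$, a squared-error loss, toy data with target $3x$, and made-up names such as `train_step`

```python
def train_step(w1, w2, x, target, learning_rate):
    """One SGD step on y = w2 * relu(w1 * x) with loss 0.5 * (y - target)^2."""
    # Forward prop: keep the intermediate values for the backward pass
    z = w1 * x
    h = max(0.0, z)
    y = w2 * h
    loss = 0.5 * (y - target) ** 2

    # Back prop: apply the chain rule from the loss back to each weight
    dy = y - target
    dw2 = dy * h
    dh = dy * w2
    dz = dh * (1.0 if z > 0 else 0.0)
    dw1 = dz * x

    # Update: move each weight against its gradient
    return w1 - learning_rate * dw1, w2 - learning_rate * dw2, loss


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # toy examples with target = 3 * x
w1, w2 = 0.5, 0.5
losses = []

for epoch in range(200):
    epoch_loss = 0.0
    for x, target in data:
        w1, w2, loss = train_step(w1, w2, x, target, learning_rate=0.01)
        epoch_loss += loss
    losses.append(epoch_loss)

print(losses[0], losses[-1], w1 * w2)
```

- the loss shrinks over the epochs and the product `w1 * w2` approaches $3$, the slope of the toy data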


- keep in mind this diagram
- in particular, each block of the back prop often takes about twice the memory that's needed for the forward prop and twice the compute
  - that's important when you want to size your model and fit in memory for example

# Deep Neural Network in TensorFlow

- in the following walkthrough, we'll step through TensorFlow code written to classify the digits in the MNIST database
- if you would like to run the network on your computer, the file is provided [here](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58a61a3a_multilayer-perceptron/multilayer-perceptron.zip)
- you can find this and many more examples of TensorFlow at Aymeric Damien's GitHub [repository](https://github.com/aymericdamien/TensorFlow-Examples)

## TensorFlow MNIST

```python
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)
```

- you'll use the MNIST dataset provided by TensorFlow, which batches and One-Hot encodes the data for you

## Learning Parameters

```python
import tensorflow as tf

# Parameters
learning_rate = 0.001
training_epochs = 20
batch_size = 128  # Decrease batch size if you don't have enough memory
display_step = 1

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)
```

- the focus here is on the architecture of multilayer neural networks, not parameter tuning, so here we'll just give you the learning parameters

## Hidden Layer Parameters

```python
n_hidden_layer = 256 # layer number of features
```

- the variable `n_hidden_layer` determines the size of the hidden layer in the neural network
  - this is also known as the width of a layer

## Weights and Biases

```python
# Store layers weight & bias
weights = {
    'hidden_layer': tf.Variable(tf.random_normal([n_input, n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_hidden_layer, n_classes]))
}
biases = {
    'hidden_layer': tf.Variable(tf.random_normal([n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
```

- deep neural networks use multiple layers with each layer requiring its own weight and bias
  - the `'hidden_layer'` weight and bias is for the hidden layer
  - the `'out'` weight and bias is for the output layer
  - if the neural network were deeper, there would be weights and biases for each additional layer
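
- to get a feel for the size of this model, the parameter count can be worked out from the layer shapes; a small sketch, where `layer_shapes` and `total_parameters` are illustrative names not used elsewhere in the walkthrough

```python
n_input = 784         # 28 * 28 pixels
n_hidden_layer = 256  # width of the hidden layer
n_classes = 10        # digits 0-9

# (inputs, outputs) of each layer: hidden layer first, then output layer
layer_shapes = [(n_input, n_hidden_layer), (n_hidden_layer, n_classes)]

# Each layer has an n_in x n_out weight matrix plus n_out biases
total_parameters = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)

print(total_parameters)  # 203530
```

- compare this with the $7,850$ parameters of the single linear layer from the beginning of the lesson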

## Input

```python
# tf Graph input
x = tf.placeholder("float", [None, 28, 28, 1])
y = tf.placeholder("float", [None, n_classes])

x_flat = tf.reshape(x, [-1, n_input])
```

- the MNIST data is made up of 28px by 28px images with a single [channel](https://en.wikipedia.org/wiki/Channel_(digital_image%29)
- the `tf.reshape()` function above reshapes the 28px by 28px matrices in `x` into row vectors of 784px

## Multilayer Perceptron

<img src="resources/multi_layer_perceptron.png" style="width: 70%;">

```python
# Hidden layer with RELU activation
layer_1 = tf.add(tf.matmul(x_flat, weights['hidden_layer']), biases['hidden_layer'])
layer_1 = tf.nn.relu(layer_1)
# Output layer with linear activation
logits = tf.add(tf.matmul(layer_1, weights['out']), biases['out'])
```

- you've seen the linear function `tf.add(tf.matmul(x_flat, weights['hidden_layer']), biases['hidden_layer'])` before, also known as `xw + b`
- combining linear functions together using a ReLU will give you a two layer network

## Optimizer

```python
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
```

- this is the same optimization technique used in the Intro to TensorFlow lab

## Session

```python
# Initializing the variables
init = tf.global_variables_initializer()


# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop); the cost op is not fetched here
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
```

- the MNIST library in TensorFlow provides the ability to receive the dataset in batches
- calling the `mnist.train.next_batch()` function returns a subset of the training data
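
- the idea of walking through a dataset in batches can be sketched in plain Python; `iterate_batches` is a hypothetical helper written for illustration and is not how `mnist.train.next_batch()` is implemented (this sketch does not shuffle, for instance)

```python
import math


def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(10))  # stand-in for 10 training examples
batch_sizes = [len(batch) for batch in iterate_batches(examples, 4)]

print(batch_sizes)                      # [4, 4, 2]
print(int(len(examples) / 4))           # 2, rounds down and skips the partial batch
print(math.ceil(len(examples) / 4))     # 3, counts the partial batch too
```

- the session code above computes `total_batch` with `int(...)`, which rounds down; using `math.ceil` instead also counts the last, smaller batch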

## Deeper Neural Network

<img src="resources/deeper_neural_network.png" style="width: 70%;">

- that's it!
- going from one layer to two is easy
- adding more layers to the network allows you to solve more complicated problems

# Training a Deep Learning Network

- so now you have a small neural network; it's not particularly deep, just 2 layers
- you can make it bigger, more complex by increasing the size of that hidden layer in the middle, but it turns out that increasing this $H$ is not particularly efficient in general
- you need to make it very, very big, and then it gets really hard to train
- this is where the central idea of deep learning comes into play
- instead you can also add more layers and make your model deeper

<img src="resources/adding_more_layers.png" style="width: 70%;">

- there are lots of good reasons to do that
- one is parameter efficiency
  - you can typically get much more performance with fewer parameters by going deeper, rather than wider

<img src="resources/deeper_vs_wider.png" style="width: 70%;">
  
- another one is that a lot of natural phenomena, that you might be interested in, tend to have a hierarchical structure which deep models naturally capture
  - if you poke at a model for images, for example, and visualize what the model learns you'll often find very simple things at the lowest layers, like lines or edges
  - once you move up, you tend to see more complicated things like geometric shapes
  - go further up, and you start seeing things like objects, faces
  - this is very powerful, because the model structure matches the kind of abstractions that you might expect to see in your data, and as a result the model has an easier time learning them

<img src="resources/hierarchical_structure.png" style="width: 70%;">

# Save and Restore TensorFlow Models

- training a model can take hours
- but once you close your TensorFlow session, you lose all the trained weights and biases
- if you were to reuse the model in the future, you would have to train it all over again!
- fortunately, TensorFlow gives you the ability to save your progress using a class called `tf.train.Saver`
  - this class provides the functionality to save any `tf.Variable` to your file system

## Saving Variables

- let's start with a simple example of saving `weights` and `bias` Tensors
- for the first example you'll just save two variables
- later examples will save all the weights in a practical model

```python
import tensorflow as tf

# The file path to save the data
save_file = "./model.ckpt"

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

# Class used to save and/or restore Tensor Variables
saver = tf.train.Saver()

with tf.Session() as sess:
    # Initialize all the Variables
    sess.run(tf.global_variables_initializer())

    # Show the values of weights and bias
    print("Weights:")
    print(sess.run(weights))
    print("Bias:")
    print(sess.run(bias))

    # Save the model
    saver.save(sess, save_file)
```

- the Tensors `weights` and `bias` are set to random values using the `tf.truncated_normal()` function
- the values are then saved to the `save_file` location, "model.ckpt", using the `tf.train.Saver.save()` function (the ".ckpt" extension stands for "checkpoint")


- if you're using TensorFlow 0.11.0RC1 or newer, a file called "model.ckpt.meta" will also be created
  - this file contains the TensorFlow graph

## Loading Variables

```python
# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

# Class used to save and/or restore Tensor Variables
saver = tf.train.Saver()

with tf.Session() as sess:
    # Load the weights and bias
    saver.restore(sess, save_file)

    # Show the values of weights and bias
    print("Weight:")
    print(sess.run(weights))
    print("Bias:")
    print(sess.run(bias))
```

- you'll notice you still need to create the weights and bias Tensors in Python
- the `tf.train.Saver.restore()` function loads the saved data into weights and bias
  - since `tf.train.Saver.restore()` sets all the TensorFlow Variables, you don't need to call `tf.global_variables_initializer()`

## Save a Trained Model

- first start with a model:

```python
# Remove previous Tensors and Operations
tf.reset_default_graph()

from tensorflow.examples.tutorials.mnist import input_data
import numpy as np

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('.', one_hot=True)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```

- let's train that model, then save the weights:

```python
import math

save_file = './train_model.ckpt'
batch_size = 128
n_epochs = 100

saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training cycle
    for epoch in range(n_epochs):
        total_batch = math.ceil(mnist.train.num_examples / batch_size)

        # Loop over all batches
        for i in range(total_batch):
            batch_features, batch_labels = mnist.train.next_batch(batch_size)
            sess.run(
                optimizer,
                feed_dict={features: batch_features, labels: batch_labels})

        # Print status for every 10 epochs
        if epoch % 10 == 0:
            valid_accuracy = sess.run(
                accuracy,
                feed_dict={
                    features: mnist.validation.images,
                    labels: mnist.validation.labels})
            print('Epoch {:<3} - Validation Accuracy: {}'.format(
                epoch,
                valid_accuracy))

    # Save the model
    saver.save(sess, save_file)
    print('Trained Model Saved.')
```

## Load a Trained Model

- let's load the weights and bias from the saved checkpoint file, then check the test accuracy

```python
saver = tf.train.Saver()

# Launch the graph
with tf.Session() as sess:
    saver.restore(sess, save_file)

    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: mnist.test.images, labels: mnist.test.labels})

print('Test Accuracy: {}'.format(test_accuracy))
```

# Finetuning

- sometimes you might want to adjust, or "finetune" a model that you have already trained and saved
- however, loading saved Variables directly into a modified model can generate errors
- let's go over how to avoid these problems

## Naming Error

- TensorFlow uses a string identifier for Tensors and Operations called `name`
- if a name is not given, TensorFlow will create one automatically
- TensorFlow will give the first node the name `<Type>`, and then give the name `<Type>_<number>` for the subsequent nodes
- let's see how this can affect loading a model with a different order of weights and bias:

```python
import tensorflow as tf

# Remove the previous weights and bias
tf.reset_default_graph()

save_file = 'model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]))
bias = tf.Variable(tf.truncated_normal([3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]))
weights = tf.Variable(tf.truncated_normal([2, 3]))

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - ERROR
    saver.restore(sess, save_file)
```

- the code above prints out the following:

```text
Save Weights: Variable:0
Save Bias: Variable_1:0
Load Weights: Variable_1:0
Load Bias: Variable:0
...
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match.
...
```

- you'll notice that the `name` properties for `weights` and `bias` are different than when you saved the model
  - this is why the code produces the "Assign requires shapes of both tensors to match" error
  - the code `saver.restore(sess, save_file)` is trying to load weight data into bias and bias data into weights

- instead of letting TensorFlow set the name property, let's set it manually:

```python
import tensorflow as tf

tf.reset_default_graph()

save_file = 'model.ckpt'

# Two Tensor Variables: weights and bias
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Save Weights: {}'.format(weights.name))
print('Save Bias: {}'.format(bias.name))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, save_file)

# Remove the previous weights and bias
tf.reset_default_graph()

# Two Variables: weights and bias
bias = tf.Variable(tf.truncated_normal([3]), name='bias_0')
weights = tf.Variable(tf.truncated_normal([2, 3]), name='weights_0')

saver = tf.train.Saver()

# Print the name of Weights and Bias
print('Load Weights: {}'.format(weights.name))
print('Load Bias: {}'.format(bias.name))

with tf.Session() as sess:
    # Load the weights and bias - No Error
    saver.restore(sess, save_file)

print('Loaded Weights and Bias successfully.')
```

- the code above prints out the following:

```text
Save Weights: weights_0:0
Save Bias: bias_0:0
Load Weights: weights_0:0
Load Bias: bias_0:0
Loaded Weights and Bias successfully.
```

- that worked! The Tensor names match and the data loaded correctly

# Regularization

- why did we not figure out earlier that deep models were effective?
  - many reasons, but mostly because deep models only really shine if you have enough data to train them
- it's only in recent years that large enough data sets have made their way to the academic world


- we know better today how to train very, very big models using better regularization techniques
- there is a general issue when you're doing numerical optimization which I call the *skinny jeans problem*
  - skinny jeans look great, they fit perfectly, but they're really, really hard to get into
  - so most people end up wearing jeans that are just a bit too big
  - it's exactly the same with deep networks
    - the network that's just the right size for your data is very, very hard to optimize
    - so in practice, we always try networks that are way too big for our data and then we try our best to prevent them from overfitting

- the first way we prevent overfitting is by looking at the performance on our validation set and stopping training as soon as we stop improving
  - it's called early termination, and it's still the best way to prevent your network from over-optimizing on the training set
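
- a minimal sketch of early termination, assuming you record one validation accuracy per epoch; the function name `early_stopping_epoch` and the `patience` parameter (how many epochs without improvement to tolerate) are assumptions added for illustration

```python
def early_stopping_epoch(validation_accuracies, patience=2):
    """Return the index of the epoch whose weights you would keep.

    Training stops once `patience` epochs have passed without beating
    the best validation accuracy seen so far."""
    best_epoch, best_accuracy = 0, float("-inf")
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_epoch, best_accuracy = epoch, accuracy
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


history = [0.80, 0.85, 0.88, 0.87, 0.86, 0.90]  # made-up validation accuracies
print(early_stopping_epoch(history))  # 2
```

- with `patience=2` the loop stops at epoch 4 and keeps the weights from epoch 2; it never sees the later improvement at epoch 5, which is the trade-off a larger `patience` is meant to soften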

<img src="resources/early_termination.png" style="width: 70%;">

- another way is to apply regularization
  - regularizing means applying artificial constraints on your network, that implicitly reduce the number of free parameters while not making it more difficult to optimize
  - in the skinny jeans analogy, think stretch pants; they fit just as well, but because they're flexible, they don't make things harder to fit in


- the stretch pants of deep learning are called **L2 Regularization**
  - the idea is to add another term ($\beta \dfrac{1}{2} ||w||_2^2$) to the loss ($\mathcal{L}$), which penalizes large weights
  - it's typically achieved by adding the squared L2 norm of your weights to the loss, multiplied by a small constant

<img src="resources/L2_regularization.png" style="width: 70%;">

- the nice thing about L2 Regularization is that it’s very, very simple
  - because you just add it to your loss, the structure of your network doesn’t have to change
  - you can even compute its derivative by hand
  - remember that the squared L2 norm is the sum of the squares of the individual elements in a vector


- the formula for the derivative of L2 norm of a vector:

<img src="resources/L2_norm_derivative_of_vector.png" style="width: 30%;">

- the derivative of $\dfrac{1}{2}x^2$ in one dimension, is simply $x$
  - so when you take that derivative for each of the components of your vector, you get the same components
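
- the penalty term and its derivative can be written out directly; a small sketch in plain Python, where the function names are illustrative and `beta = 0.5` is chosen only to keep the arithmetic round (in practice it is a small constant)

```python
def l2_penalty(weights, beta):
    """beta * 1/2 * ||w||_2^2: half the sum of squared weights, scaled by beta."""
    return beta * 0.5 * sum(w * w for w in weights)


def l2_penalty_grad(weights, beta):
    """Gradient of the penalty: each component is just beta * w_i."""
    return [beta * w for w in weights]


weights = [3.0, -4.0]
print(l2_penalty(weights, beta=0.5))       # 6.25
print(l2_penalty_grad(weights, beta=0.5))  # [1.5, -2.0]
```

- during training this gradient is simply added to the gradient of the original loss, so every weight gets pulled toward zero in proportion to its size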

# Dropout

- there's another important technique for regularization that only emerged relatively recently and works amazingly well
  - it's called **Dropout**
- dropout works like this:
  - imagine that you have one layer that connects to another layer
  - the values that go from one layer to the next are often called *activations*
  - now take those activations and randomly, for every example you train your network on, set half of them to $0$
    - completely and randomly, you basically take half of the data that's flowing through your network, and just destroy it and then randomly do the process again


- so what happens with dropout?
  - your network can never rely on any given activation to be present, because they might be squashed at any given moment
  - so it is forced to learn a redundant representation for everything to make sure that at least some of the information remains
  - it's like a game of whack-a-mole
    - one activation gets smashed, but there is always one or more that do the same job and that don't get killed, so everything remains fine at the end


- forcing your network to learn redundant representations might sound very inefficient
- in practice, it makes things more robust, and prevents overfitting
- it also makes your network act as if taking the consensus over an ensemble of networks, which is always a good way to improve performance


- dropout is one of the most important techniques to emerge in the last few years
- if dropout doesn't work for you, you should probably be using a bigger network

- when you evaluate a network that's been trained with dropout, you obviously no longer want this randomness; you want something deterministic
- you're going to want to take the consensus over these redundant models
  - you get the consensus opinion by averaging the activations
  - you want $y_e$ to be the average of all the $y_t$ values that you got during training ($y_e \sim E(y_t)$)


- here's a trick to make sure this expectation holds
- during training, not only do you zero out the activations that you drop, but you also scale the remaining activations by a factor of $2$

<img src="resources/consensus opinion_1.png" style="width: 70%;">

- this way, when it comes time to average them during evaluation, you just remove these dropouts and scaling operations from your neural net
- the result is an average of these activations that is properly scaled
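
- the scaling trick can be checked numerically; this is a sketch in plain Python under stated assumptions (a hypothetical `dropout` helper, a fixed random seed, and four made-up activations), not TensorFlow's implementation

```python
import random


def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and scale
    the survivors by 1 / keep_prob (a factor of 2 for keep_prob = 0.5)."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # seeded so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

# One training-time pass: every value is either 0 or twice the original
print(dropout(activations, keep_prob=0.5, rng=rng))

# Averaged over many passes, the scaled activations match the originals,
# which is why evaluation can simply drop the dropout and scaling steps
n_trials = 10000
totals = [0.0] * len(activations)
for _ in range(n_trials):
    for i, value in enumerate(dropout(activations, keep_prob=0.5, rng=rng)):
        totals[i] += value
averages = [total / n_trials for total in totals]
print(averages)
```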

<img src="resources/consensus opinion_2.png" style="width: 70%;">

# TensorFlow Dropout

<img src="resources/dropout-node.jpeg" style="width: 50%;">

- dropout is a regularization technique for reducing overfitting
- the technique temporarily drops units ([artificial neurons](https://en.wikipedia.org/wiki/Artificial_neuron)) from the network, along with all of those units' incoming and outgoing connections


- TensorFlow provides the `tf.nn.dropout()` function, which you can use to implement dropout

```python
keep_prob = tf.placeholder(tf.float32) # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])
```

- the code above illustrates how to apply dropout to a neural network
- the `tf.nn.dropout()` function takes in two parameters:
  - `hidden_layer`: the tensor to which you would like to apply dropout
  - `keep_prob`: the probability of keeping (i.e. not dropping) any given unit


- `keep_prob` allows you to adjust the number of units to drop
- in order to compensate for dropped units, `tf.nn.dropout()` multiplies all units that are kept (i.e. not dropped) by `1/keep_prob`


- during training, a good starting value for `keep_prob` is $0.5$
- during testing, use a `keep_prob` value of $1.0$ to keep all units and maximize the power of the model

## TensorFlow Dropout Quiz 1

- take a look at the code snippet below; do you see what's wrong?
- there's nothing wrong with the syntax, however the test accuracy is extremely low

```python
...

keep_prob = tf.placeholder(tf.float32) # probability to keep units

hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i in range(batches):
            ....

            sess.run(optimizer, feed_dict={
                features: batch_features,
                labels: batch_labels,
                keep_prob: 0.5})

    validation_accuracy = sess.run(accuracy, feed_dict={
        features: test_features,
        labels: test_labels,
        keep_prob: 0.5})
```

- `keep_prob` should be set to $1.0$ when evaluating validation accuracy
  - you should only drop units while training the model
  - during validation or testing, you should keep all of the units to maximize accuracy

## TensorFlow Dropout Quiz 2

- this quiz will be starting with the code from the ReLU Quiz and applying a dropout layer
  - build a model with a ReLU layer and dropout layer using the `keep_prob` placeholder to pass in a probability of $0.5$
  - print the logits from the model


**Note:** Output will be different every time the code is run. This is caused by dropout randomizing the units it drops.

In [4]:
import tensorflow as tf
from test import *

tf.set_random_seed(123456)


hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8],
]
out_weights = [[0.1, 0.6], [0.2, 0.1], [0.7, 0.9]]


# set random seed
tf.set_random_seed(123456)


# Weights and biases
weights = [tf.Variable(hidden_layer_weights), tf.Variable(out_weights)]
biases = [tf.Variable(tf.zeros(3)), tf.Variable(tf.zeros(2))]


# Input
features = tf.Variable(
    [[0.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0, 14.0]]
)


In [5]:
# TODO: Create Model with Dropout
keep_prob = tf.placeholder(tf.float32)
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])


# TODO: save and print session results as variable named "output"
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output = sess.run(logits, feed_dict={keep_prob: 0.5})
    print(output)

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
[[ 8.46      9.400001]
 [ 0.        0.      ]
 [14.280001 33.100002]]

