#### Chapter Setup

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import tensorflow as tf  # needed by reset_graph() below

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "deep"

## Vanishing/Exploding Gradients Problems

The **Vanishing Gradients Problem** arises when the Gradient Descent algorithm is applied to deep neural networks: the gradients shrink as backpropagation moves down toward the lower layers, so the updates leave the weights of those layers virtually unchanged throughout the training phase.

The **Exploding Gradients Problem** is the opposite: the gradients grow as backpropagation moves down toward the lower layers, so the weights of those layers receive such large updates at every iteration that the algorithm diverges.

These problems can arise from the combination of the sigmoid (logistic) activation function and the normal distribution ($\mu = 0$, $\sigma = 1$) used to initialize layer weights. With this combination, the variance of each layer's outputs is greater than the variance of its inputs. The variance therefore keeps growing as the signal moves forward through the network, until the activation function saturates in the deeper layers: its inputs take extremely positive or negative values, where its derivative approaches 0. With such a small derivative, almost no gradient is propagated back, so the weights of the lower layers remain virtually unchanged.
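
The saturation effect can be checked numerically. The following is a minimal sketch using only the standard library; the names `sigmoid` and `sigmoid_grad` are illustrative and not part of the original notebook, and the connection weights are deliberately ignored to isolate the effect of the activation function.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0 and collapses as |z| grows (saturation).
for z in (0.0, 2.0, 5.0, 10.0):
    print("z = %5.1f  derivative = %.6f" % (z, sigmoid_grad(z)))

# Backpropagation multiplies in one such factor per layer (weights ignored
# here), so even the best case of 0.25 per layer decays geometrically.
n_layers = 10
best_case_factor = sigmoid_grad(0.0) ** n_layers
print("best-case product over %d layers: %.2e" % (n_layers, best_case_factor))
```

Even without saturation the logistic derivative never exceeds 0.25, which is one reason deep stacks of logistic layers are hard to train.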

### *Xavier and He Initialization*

For a signal to flow properly through a DNN, the variance of the inputs and outputs of each layer must remain virtually the same in both directions (forward propagation and backpropagation). However, this is not possible in practice unless the layer has an equal number of input and output connections. A good compromise is to initialize the connection weights following the *Xavier initialization* strategy when using the logistic activation function:


$\large \text{Normal Distribution w/ } \mu = 0 \text{ and } \sigma = \sqrt{\frac{2}{n_{inputs} + n_{outputs}}}$ 

OR

$\large \text{Uniform Distribution between } [ -r, +r ] \text{ where } r = \sqrt{\frac{6}{n_{inputs} + n_{outputs}}}$ 

Using this strategy speeds up the training of DNNs. For the ReLU activation function you can use *He initialization*. By default the $fully\_connected()$ function uses Xavier initialization.
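
The two Xavier formulas above describe distributions with the same variance, since a uniform distribution on $[-r, +r]$ has variance $r^2 / 3$. The sketch below checks this with the standard library only; the helper names `xavier_sigma` and `xavier_radius` and the layer size are illustrative assumptions, not part of any library.

```python
import math
import random

def xavier_sigma(n_inputs, n_outputs):
    """Standard deviation for the normal-distribution variant."""
    return math.sqrt(2.0 / (n_inputs + n_outputs))

def xavier_radius(n_inputs, n_outputs):
    """Half-width r for the uniform-distribution variant."""
    return math.sqrt(6.0 / (n_inputs + n_outputs))

n_inputs, n_outputs = 300, 100  # illustrative layer size
sigma = xavier_sigma(n_inputs, n_outputs)
r = xavier_radius(n_inputs, n_outputs)

# Variance of a uniform distribution on [-r, r] is r^2 / 3.
uniform_variance = r ** 2 / 3.0

# Draw one weight per connection and measure the variance empirically.
rng = random.Random(42)
weights = [rng.uniform(-r, r) for _ in range(n_inputs * n_outputs)]
mean = sum(weights) / len(weights)
empirical_variance = sum((w - mean) ** 2 for w in weights) / len(weights)

print("sigma^2            = %.6f" % sigma ** 2)
print("r^2 / 3            = %.6f" % uniform_variance)
print("empirical variance = %.6f" % empirical_variance)
```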

### *Nonsaturating Activation Function*

The exploding/vanishing gradients problems are in part due to a poor choice of activation function. The ReLU function behaves better in DNNs than the logistic function because it does not saturate for positive values, but it has drawbacks of its own. One is the **dying ReLU** problem, wherein some nodes in your DNN effectively die during training and output 0 no matter the input.

The ReLU function has several variants. The *Leaky ReLU* function ($\text{LeakyReLU}_{\alpha}(z) = \text{max}(\alpha z, \, z)$) uses the hyperparameter $\alpha$, the slope of the function for $z < 0$, to avoid the problem of dying nodes. Nodes can still go into a long coma, but unlike dead nodes they can come back to life if given enough time.

Another activation function is the *exponential linear unit* (ELU): 

$\large \begin{gather*} \text{ELU}_\alpha (z) =
\begin{cases}
  \alpha \, (e^{\,z} - 1) & \text{ if } z < 0\\    
  z & \text{ if } z \geq 0 \\     
\end{cases}
\end{gather*}$

The ELU function has several advantages: its average output is close to 0, which helps alleviate the vanishing gradients problem; it has a nonzero gradient when $z<0$, so nodes won't die; and it is smooth everywhere (for $\alpha = 1$), which speeds up convergence. It is slower to compute than the ReLU function, but during training this is offset by its faster convergence. Both the leaky ReLU and the ELU functions can be used in TensorFlow:

In [None]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

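# Layer sizes and input placeholder so that this cell runs on its own
n_inputs = 28 * 28
n_hidden1 = 300
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

def leaky_relu(z, name=None):
    # Leaky ReLU with alpha = 0.01
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = fully_connected(X, n_hidden1, activation_fn=tf.nn.elu)  # ELU
hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)  # Leaky ReLU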
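
To see how the two functions behave on concrete values, here is a minimal scalar sketch that follows the formulas above and needs only the standard library. The default values $\alpha = 0.01$ for the leaky ReLU and $\alpha = 1$ for the ELU are assumptions chosen for illustration.

```python
import math

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z); valid for 0 < alpha < 1."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """alpha * (e^z - 1) for z < 0, and z otherwise."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-5.0, -1.0, 0.0, 2.0):
    print("z = %5.1f  leaky_relu = %8.4f  elu = %8.4f"
          % (z, leaky_relu(z), elu(z)))
```

For negative inputs the leaky ReLU keeps a small constant slope, while the ELU levels off toward $-\alpha$.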

### *Batch Normalization*

The **Batch Normalization** algorithm consists of adding an operation in the model just before the activation function of each layer. The operation zero-centers and normalizes the inputs, then scales and shifts the result using two new parameters per layer. This reduces the chances of the exploding/vanishing gradients problems reappearing later in training. The algorithm is given below:

1. $\large \quad \mu_B = \frac{1}{m_B} \, \sum_{i=1}^{m_B} \, x^{(i)}$
2. $\large \quad \sigma_B^2 = \frac{1}{m_B} \, \sum_{i=1}^{m_B} \, (x^{(i)} - \mu_B)^2$
3. $\large \quad \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
4. $\large \quad z^{(i)} = \gamma \hat{x}^{(i)} + \beta$

Using Batch Normalization can make your DNN more accurate, make it converge in fewer training iterations, and reduce the need for other regularization techniques, since it also acts as a regularizer. However, it makes predictions slower, as each layer performs extra computations, and it makes your model more complex.
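
The four steps above can be sketched in plain Python for a mini-batch of scalar inputs (a single feature). This is only an illustration of the arithmetic: the function name `batch_norm` and the values of `gamma`, `beta`, and `epsilon` are assumptions, and it omits the moving averages that are used at prediction time.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Apply steps 1-4 to a list of scalars and return the z values."""
    m_B = len(batch)
    mu_B = sum(batch) / m_B                                 # step 1: mean
    sigma2_B = sum((x - mu_B) ** 2 for x in batch) / m_B    # step 2: variance
    x_hat = [(x - mu_B) / math.sqrt(sigma2_B + epsilon)     # step 3: normalize
             for x in batch]
    return [gamma * xh + beta for xh in x_hat]              # step 4: scale, shift

batch = [1.0, 2.0, 3.0, 4.0]
z = batch_norm(batch, gamma=2.0, beta=0.5)

z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))
print("z      =", z)
print("mean   = %.4f (close to beta)" % z_mean)
print("stddev = %.4f (close to gamma)" % z_std)
```

After normalization the outputs have a mean of roughly `beta` and a standard deviation of roughly `gamma`, whatever the scale of the original inputs.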

Here's an implementation of this algorithm in TensorFlow:

In [None]:
import tensorflow as tf

reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)

The calls to the batch normalization function are very repetitive. This can be fixed using Python's `partial()` function from the `functools` module:

In [None]:
from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)
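
For readers unfamiliar with `functools.partial`, here is a minimal standalone sketch of what it does. The `scale` function is a made-up example, not part of TensorFlow.

```python
from functools import partial

def scale(x, factor, offset=0.0):
    """Toy function standing in for a call with repeated keyword arguments."""
    return factor * x + offset

# Pre-fill the keyword arguments once...
double_plus_one = partial(scale, factor=2.0, offset=1.0)

# ...then call with only the remaining argument.
print(double_plus_one(3.0))              # same as scale(3.0, factor=2.0, offset=1.0)
print(double_plus_one(3.0, offset=0.0))  # pre-filled keywords can still be overridden
```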

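Note the $training$ placeholder: it defaults to False, so you must feed it True during training (so that the layers normalize with the statistics of the current mini-batch) and leave it at False when your model is making predictions (so that they use the moving averages instead). With $tf.layers.batch\_normalization()$, the operations that update the moving averages are collected in $tf.GraphKeys.UPDATE\_OPS$ and must be run alongside the training operation.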

### *Gradient Clipping*

**Gradient Clipping** is a technique wherein the gradients are clipped during backpropagation so that they never exceed some threshold.

To do this in TensorFlow, you must clip the gradients before they are applied. The $minimize()$ function both computes and applies the gradients, so instead you must ask the optimizer to compute the gradients, clip them, and then apply the clipped gradients:

In [None]:
threshold = 1.0

# Assumes `learning_rate` and `loss` are already defined for your model
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)  # Compute
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)  # Clip
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)  # Apply
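
The clipping step itself is simple element-wise arithmetic. The following plain-Python sketch mirrors what clipping by value does to each gradient component; the helper name `clip_by_value` and the gradient values are made up for illustration.

```python
def clip_by_value(gradient, threshold):
    """Limit every component of the gradient to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradient]

gradient = [0.3, -2.5, 7.0, -0.9]  # made-up gradient components
clipped = clip_by_value(gradient, threshold=1.0)
print(clipped)
```

Components already inside the interval are left untouched, and the rest are moved to the nearest bound. Because each component is clipped independently, the direction of the gradient vector can change.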

## Reusing Pretrained Layers

It is generally a good idea to look for an existing DNN that tackles a problem similar to yours and reuse its lower layers. This makes training faster, as you can freeze the weights of the reused layers and train only the new portion of the DNN.

### *Reusing a TensorFlow Model*

The following code is an example of reusing only specific portions of a previously trained TensorFlow model:

In [None]:
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden[123]") # regular expression
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")

### *Reusing Models from Other Frameworks*

If the model was trained using another framework, you will have to load its weights into Python and assign them to the corresponding variables in TensorFlow:

In [None]:
n_inputs = 2
n_hidden1 = 3

original_w = [...] # Load the weights from the other framework
original_b = [...] # Load the biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
# [...] Build the rest of the model

# Get a handle on the variables of layer hidden1
with tf.variable_scope("", default_name="", reuse=True):  # root scope
    hidden1_weights = tf.get_variable("hidden1/kernel")
    hidden1_biases = tf.get_variable("hidden1/bias")

# Create dedicated placeholders and assignment nodes
original_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden1))
original_biases = tf.placeholder(tf.float32, shape=n_hidden1)
assign_hidden1_weights = tf.assign(hidden1_weights, original_weights)
assign_hidden1_biases = tf.assign(hidden1_biases, original_biases)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    sess.run(assign_hidden1_weights, feed_dict={original_weights: original_w})
    sess.run(assign_hidden1_biases, feed_dict={original_biases: original_b})
    # [...] Train the model on your new task

### *Freezing the Lower Layers*

To freeze the lower layers, give the $minimize()$ function the list of variables to train and leave out the variables of the frozen layers:

In [None]:
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="hidden[34]|outputs")
training_op = optimizer.minimize(loss, var_list=train_vars)
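
The `scope` argument is a regular expression that is matched against the start of each variable name. The sketch below imitates that filtering with the standard `re` module, assuming it behaves like `re.match`; the variable names and the helper `select_by_scope` are hypothetical.

```python
import re

# Hypothetical variable names for a network with four hidden layers
variable_names = [
    "hidden1/kernel", "hidden1/bias",
    "hidden2/kernel", "hidden2/bias",
    "hidden3/kernel", "hidden3/bias",
    "hidden4/kernel", "hidden4/bias",
    "outputs/kernel", "outputs/bias",
]

def select_by_scope(names, scope):
    """Keep the names whose beginning matches the regular expression."""
    return [name for name in names if re.match(scope, name)]

trainable = select_by_scope(variable_names, "hidden[34]|outputs")
reused = select_by_scope(variable_names, "hidden[123]")

print("trained:", trainable)
print("reused: ", reused)
```

With these patterns, layers 1 and 2 are reused and stay frozen, layer 3 is reused but trained further, and layer 4 and the output layer are trained only.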

### *Tweaking, Dropping, or Replacing the Upper Layers*

* You almost always have to replace the output layer of the original model
* Freeze all copied layers, then train your model. If your model underperforms, try unfreezing the topmost frozen layer
* The more training data you have, the more layers you can unfreeze

### *Model Zoos*

Model Zoos contain models trained by other people. Look in [TensorFlow's model zoo](https://github.com/tensorflow/models) or in [Caffe's model zoo](https://github.com/BVLC/caffe/wiki/Model-Zoo) (look [here](https://github.com/ethereon/caffe-tensorflow) for help converting Caffe models to TensorFlow models) for models to base your projects on.

### *Unsupervised Pretraining*

### *Pretraining on an Auxiliary Task*