# TRAINING THEORY

## VANISHING GRADIENTS

Unfortunately, gradients often get smaller and smaller as the backpropagation algorithm progresses
down to the lower layers. As a result, the Gradient Descent update leaves the lower
layer connection weights virtually unchanged, and training never converges to a good
solution. This is called the vanishing gradients problem. In some cases, the opposite
can happen: the gradients can grow bigger and bigger, so many layers get insanely
large weight updates and the algorithm diverges. This is the exploding gradients problem,
which is mostly encountered in recurrent neural networks (see Chapter 14).
More generally, deep neural networks suffer from unstable gradients; different layers
may learn at widely different speeds.
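
To make the effect concrete, here is a minimal NumPy sketch (not from the original text) that backpropagates through a stack of logistic layers and records the norm of each layer's weight gradient. The layer count, layer width, weight scale, and the helper name `layer_gradient_norms` are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(n_layers=10, n_units=50, seed=42):
    """Return the weight-gradient norm of each layer, lowest layer first."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 1.0 / np.sqrt(n_units), (n_units, n_units))
               for _ in range(n_layers)]
    # Forward pass on one random instance, keeping every layer's output.
    activations = [rng.normal(size=(1, n_units))]
    for W in weights:
        activations.append(sigmoid(activations[-1] @ W))
    # Backward pass, starting from an arbitrary upstream gradient of ones.
    grad_out = np.ones((1, n_units))
    norms = []
    for W, a_in, a_out in zip(reversed(weights),
                              reversed(activations[:-1]),
                              reversed(activations[1:])):
        delta = grad_out * a_out * (1.0 - a_out)      # through the logistic function
        norms.append(np.linalg.norm(a_in.T @ delta))  # gradient w.r.t. W
        grad_out = delta @ W.T                        # gradient w.r.t. the layer input
    return norms[::-1]

norms = layer_gradient_norms()
print("lowest layer: %.2e, top layer: %.2e" % (norms[0], norms[-1]))
```

Because the derivative of the logistic function never exceeds 0.25, each layer tends to shrink the gradient signal, so the lowest layer should receive a far smaller gradient than the top one.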

#### Xavier and He Initialization

Using the Xavier initialization strategy can speed up training considerably, and it is
one of the tricks that led to the current success of Deep Learning. Some recent papers
have provided similar strategies for different activation functions, as shown in
Table 11-1. The initialization strategy for the ReLU activation function (and its variants,
including the ELU activation described shortly) is sometimes called He initialization
(after the last name of its author).
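
As a rough illustration, the two strategies can be sketched in NumPy. Table 11-1 is not reproduced in these notes, so the formulas below are the commonly cited normal-distribution forms (Glorot and Bengio for Xavier, He et al. for He) and may differ by a constant factor from the table's variants. The function names are hypothetical.

```python
import numpy as np

def xavier_normal(n_inputs, n_outputs, rng):
    # Assumed form: std = sqrt(2 / (n_inputs + n_outputs))
    std = np.sqrt(2.0 / (n_inputs + n_outputs))
    return rng.normal(0.0, std, size=(n_inputs, n_outputs))

def he_normal(n_inputs, n_outputs, rng):
    # Assumed form (fan-in only): std = sqrt(2 / n_inputs)
    std = np.sqrt(2.0 / n_inputs)
    return rng.normal(0.0, std, size=(n_inputs, n_outputs))

rng = np.random.default_rng(0)
W_xavier = xavier_normal(300, 100, rng)
W_he = he_normal(300, 100, rng)
print(W_xavier.std(), W_he.std())
```

The only difference between the two is the scale of the random weights, chosen so that the signal variance stays roughly stable from layer to layer for the intended activation function.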

#### Nonsaturating Activation Functions


So which activation function should you use for the hidden layers
of your deep neural networks? Although your mileage will vary, in
general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.
If you care a lot about runtime performance, then you may prefer
leaky ReLUs over ELUs. If you don’t want to tweak yet another
hyperparameter, you may just use the default α values suggested
earlier (0.01 for the leaky ReLU, and 1 for ELU). If you have spare
time and computing power, you can use cross-validation to evaluate
other activation functions, in particular RReLU if your network
is overfitting, or PReLU if you have a huge training set.
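
For reference, the two nonsaturating functions with the default α values mentioned above can be written in a few lines of NumPy. This is a sketch for experimenting outside TensorFlow, not the library's implementation.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # For alpha < 1, max(alpha * z, z) is z when z >= 0 and alpha * z otherwise.
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    # alpha * (exp(z) - 1) for negative inputs, identity otherwise.
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z))
print(elu(z))
```

Both functions keep a nonzero slope for negative inputs, which is what prevents units from dying the way plain ReLU units can.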

TensorFlow offers an elu() function that you can use to build your neural network.
Simply set the activation_fn argument when calling the fully_connected() function,
like this:

`hidden1 = fully_connected(X, n_hidden1, activation_fn=tf.nn.elu)`

TensorFlow does not have a predefined function for leaky ReLUs, but it is easy
enough to define:

In [None]:
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)

Although using He initialization along with ELU (or any variant of ReLU) can significantly
reduce the vanishing/exploding gradients problems at the beginning of training,
it doesn’t guarantee that they won’t come back during training.

### Batch Normalization (good for very deep networks)

The authors of the Batch Normalization paper (Sergey Ioffe and Christian Szegedy, 2015) demonstrated that this technique considerably improved all the deep
neural networks they experimented with. The vanishing gradients problem was
strongly reduced, to the point that they could use saturating activation functions such
as the tanh and even the logistic activation function. The networks were also much
less sensitive to the weight initialization. They were able to use much larger learning
rates, significantly speeding up the learning process. Specifically, they note that
“Applied to a state-of-the-art image classification model, Batch Normalization achieves
the same accuracy with 14 times fewer training steps, and beats the original
model by a significant margin. […] Using an ensemble of batch-normalized networks,
we improve upon the best published result on ImageNet classification: reaching
4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of
human raters.” Finally, like a gift that keeps on giving, Batch Normalization also acts
like a regularizer, reducing the need for other regularization techniques (such as
dropout, described later in the chapter).

Batch Normalization does, however, add some complexity to the model (although it
removes the need for normalizing the input data since the first hidden layer will take
care of that, provided it is batch-normalized). Moreover, there is a runtime penalty:
the neural network makes slower predictions due to the extra computations required
at each layer. So if you need predictions to be lightning-fast, you may want to check
how well plain ELU + He initialization perform before playing with Batch Normalization.

#### Implementing Batch Normalization with TensorFlow


TensorFlow provides a batch_normalization() function that simply centers and
normalizes the inputs, but you must compute the mean and standard deviation yourself
(based on the mini-batch data during training or on the full dataset during testing,
as just discussed) and pass them as parameters to this function, and you must
also handle the creation of the scaling and offset parameters (and pass them to this
function). It is doable, but not the most convenient approach. Instead, you should use
the batch_norm() function, which handles all this for you. You can either call it
directly or tell the fully_connected() function to use it, such as in the following
code:

In [None]:
import tensorflow as tf
from tensorflow.contrib.layers import batch_norm, fully_connected

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

# Boolean placeholder: True during training (normalize with the current mini-batch
# mean and std), False at test time (use the moving averages computed during training)
is_training = tf.placeholder(tf.bool, shape=(), name='is_training')

bn_params = {
    'is_training': is_training,
    'decay': 0.99,
    'updates_collections': None
    }

hidden1 = fully_connected(X, n_hidden1, scope="hidden1",
                          normalizer_fn=batch_norm, normalizer_params=bn_params)
hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2",
                          normalizer_fn=batch_norm, normalizer_params=bn_params)
logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope="outputs",
                         normalizer_fn=batch_norm, normalizer_params=bn_params)

To avoid repeating the same parameters over
and over again, you can create an argument scope using the arg_scope() function:
the first parameter is a list of functions, and the other parameters will be passed to
these functions automatically. The last three lines of the preceding code can be modified
like so:

In [None]:
with tf.contrib.framework.arg_scope(
        [fully_connected],
        normalizer_fn=batch_norm,
        normalizer_params=bn_params):
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, scope="outputs", activation_fn=None)

The rest of the construction phase is the same as in Chapter 10: define the cost function,
create an optimizer, tell it to minimize the cost function, define the evaluation
operations, create a Saver, and so on.
The execution phase is also pretty much the same, with one exception. Whenever you
run an operation that depends on the batch_norm layer, you need to set the is_training
placeholder to True or False:

In [None]:
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        [...]
        for X_batch, y_batch in zip(X_batches, y_batches):
            sess.run(training_op, feed_dict={is_training: True, X: X_batch, y: y_batch})
        accuracy_score = accuracy.eval(feed_dict={is_training: False, X: X_test_scaled, y: y_test})
        print(accuracy_score)
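
To see what the is_training switch changes, here is a minimal NumPy sketch of the computation a batch-normalized layer performs. It assumes the usual formulation: normalize with mini-batch statistics during training while updating moving averages with the decay factor, and normalize with those moving averages at test time. The function name `batch_norm_forward` and the `eps` smoothing term are assumptions, not part of the TensorFlow API.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, moving_mean, moving_var,
                       is_training, decay=0.99, eps=1e-5):
    """Return the normalized output plus the (possibly updated) moving statistics."""
    if is_training:
        mean = X.mean(axis=0)  # statistics of the current mini-batch
        var = X.var(axis=0)
        moving_mean = decay * moving_mean + (1 - decay) * mean
        moving_var = decay * moving_var + (1 - decay) * var
    else:
        mean, var = moving_mean, moving_var  # statistics accumulated during training
    X_hat = (X - mean) / np.sqrt(var + eps)  # center and normalize
    return gamma * X_hat + beta, moving_mean, moving_var  # scale and offset

rng = np.random.default_rng(0)
X_batch = rng.normal(5.0, 3.0, size=(64, 4))
gamma, beta = np.ones(4), np.zeros(4)
train_out, mm, mv = batch_norm_forward(X_batch, gamma, beta,
                                       np.zeros(4), np.ones(4), is_training=True)
test_out, _, _ = batch_norm_forward(X_batch, gamma, beta, mm, mv, is_training=False)
```

In training mode the output of each feature has roughly zero mean and unit variance, whatever the scale of the inputs. In test mode the result depends only on the stored moving averages, so a single instance can be normalized without a batch.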

### Gradient Clipping
A popular technique to lessen the exploding gradients problem is to simply clip the
gradients during backpropagation so that they never exceed some threshold (this is
mostly useful for recurrent neural networks; see Chapter 14). This is called Gradient
Clipping. In general, people now prefer Batch Normalization, but it’s still useful to
know about Gradient Clipping and how to implement it.

In [None]:
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var) for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)
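
The effect of clipping is easy to inspect on a toy gradient vector in NumPy. The sketch below mirrors the component-wise clipping used above, and also shows clipping by norm, a variant not covered in these notes that rescales the whole vector instead. Both helper names are illustrative.

```python
import numpy as np

def clip_by_value(grad, threshold):
    # Clip every component independently to [-threshold, threshold].
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, threshold):
    # Rescale the whole vector if its L2 norm exceeds the threshold.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print(by_value, by_norm)
```

Component-wise clipping can change the direction of the gradient (here the two components end up almost equal), whereas clipping by norm keeps the direction and only shortens the step.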

## REUSING PRETRAINED LAYERS

It is generally not a good idea to train a very large DNN from scratch: instead, you
should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then just reuse the lower layers of this network:
this is called transfer learning. It will not only speed up training considerably, but will
also require much less training data.

If the input pictures of your new task don’t have the same size as
the ones used in the original task, you will have to add a preprocessing
step to resize them to the size expected by the original
model. More generally, transfer learning will only work well if the
inputs have similar low-level features.

#### Reusing a TensorFlow Model
If the original model was trained using TensorFlow, you can simply restore it and
train it on the new task:

In [None]:
[...] # construct the original model
with tf.Session() as sess:
    saver.restore(sess, "./my_original_model.ckpt")
    [...] # Train it on your new task

However, in general you will want to reuse only part of the original model (as we will
discuss in a moment). A simple solution is to configure the Saver to restore only a
subset of the variables from the original model. For example, the following code
restores only hidden layers 1, 2, and 3:

In [None]:
[...] # build new model with the same definition as before for hidden layers 1-3
init = tf.global_variables_initializer()
reuse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="hidden[123]")
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars]) # checkpoint name -> variable
original_saver = tf.train.Saver(reuse_vars_dict) # saver to restore the original model
new_saver = tf.train.Saver() # saver to save the new model
with tf.Session() as sess:
    sess.run(init)
    original_saver.restore(sess, "./my_original_model.ckpt") # restore layers 1 to 3
    [...] # train the new model
    new_saver.save(sess, "./my_new_model.ckpt") # save the whole model

First we build the new model, making sure to copy the original model’s hidden layers
1 to 3. We also create a node to initialize all variables. Then we get the list of all variables
that were just created with "trainable=True" (which is the default), and we
keep only the ones whose scope matches the regular expression "hidden[123]" (i.e.,
we get all trainable variables in hidden layers 1 to 3). Next we create a dictionary
mapping the name of each variable in the original model to the corresponding variable
in the new model (generally you want to keep the exact same names). Then we create a Saver
that will restore only these variables, and we create another Saver to save the entire
new model, not just layers 1 to 3. We then start a session and initialize all variables in
the model, then restore the variable values from the original model’s layers 1 to 3.
Finally, we train the model on the new task and save it.
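
The scope filter is just a regular expression matched against the start of each variable name, which can be checked in plain Python. The sketch below uses hypothetical variable names in the `<scope>/<name>:0` form; to my understanding get_collection() applies `re.match` in this way, but treat that as an assumption.

```python
import re

# Hypothetical variable names for a network with four hidden layers.
variable_names = [
    "hidden1/weights:0", "hidden1/biases:0",
    "hidden2/weights:0", "hidden3/weights:0",
    "hidden4/weights:0", "outputs/weights:0",
]

def filter_by_scope(names, scope):
    # re.match anchors the pattern at the beginning of the name only.
    pattern = re.compile(scope)
    return [name for name in names if pattern.match(name)]

reused = filter_by_scope(variable_names, "hidden[123]")
# Checkpoints are assumed to key variables by name without the ":0" suffix.
checkpoint_keys = [name.rsplit(":", 1)[0] for name in reused]
print(checkpoint_keys)
```

Because the match is anchored only at the start, a pattern such as "hidden[123]" would also select a scope named "hidden10", so it is worth printing the selected names before restoring.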

The more similar the tasks are, the more layers you want to reuse
(starting with the lower layers). For very similar tasks, you can try
keeping all the hidden layers and just replace the output layer.

#### Freezing the Lower Layers
It is likely that the lower layers of the first DNN have learned to detect low-level features
in pictures that will be useful across both image classification tasks, so you can
just reuse these layers as they are. It is generally a good idea to “freeze” their weights
when training the new DNN: if the lower-layer weights are fixed, then the higher-layer
weights will be easier to train (because they won’t have to learn a moving target).
To freeze the lower layers during training, the simplest solution is to give the optimizer
the list of variables to train, excluding the variables from the lower layers:

In [None]:
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="hidden[34]|outputs")
training_op = optimizer.minimize(loss, var_list=train_vars)

The first line gets the list of all trainable variables in hidden layers 3 and 4 and in the
output layer. This leaves out the variables in the hidden layers 1 and 2. Next we provide
this restricted list of trainable variables to the optimizer’s minimize() function.
Ta-da! Layers 1 and 2 are now frozen: they will not budge during training (these are
often called frozen layers).
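
The same idea can be mimicked without TensorFlow: a gradient step that only touches the parameters whose name matches the trainable pattern. This is a toy sketch with made-up parameter values, and `sgd_step` is a hypothetical helper rather than an optimizer API.

```python
import re
import numpy as np

def sgd_step(params, grads, learning_rate, trainable_pattern):
    """Apply one gradient descent step to the parameters matching the pattern."""
    pattern = re.compile(trainable_pattern)
    for name in params:
        if pattern.match(name):  # frozen parameters are simply skipped
            params[name] = params[name] - learning_rate * grads[name]
    return params

layer_names = ["hidden1", "hidden2", "hidden3", "hidden4", "outputs"]
params = {name: np.ones(2) for name in layer_names}
grads = {name: np.full(2, 0.5) for name in layer_names}
params = sgd_step(params, grads, learning_rate=0.1,
                  trainable_pattern="hidden[34]|outputs")
print(params)
```

After the step, hidden1 and hidden2 still hold their initial values even though gradients were available for them, which is exactly what passing a restricted var_list achieves.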

#### Caching the Frozen Layers
Since the frozen layers won’t change, it is possible to cache the output of the topmost
frozen layer for each training instance. Since training goes through the whole dataset
many times, this will give you a huge speed boost as you will only need to go through
the frozen layers once per training instance (instead of once per epoch). For example,
you could first run the whole training set through the lower layers (assuming you
have enough RAM).

Then during training, instead of building batches of training instances, you would
build batches of outputs from hidden layer 2 and feed them to the training operation.

The last line of the code below runs the training operation defined earlier (which freezes layers 1 and 2),
and feeds it a batch of outputs from the second hidden layer (as well as the targets for
that batch). Since we give TensorFlow the output of hidden layer 2, it does not try to
evaluate it (or any node it depends on).

In [None]:
import numpy as np

# Run the whole training set through the frozen layers once and cache the result
hidden2_outputs = sess.run(hidden2, feed_dict={X: X_train})

n_epochs = 100
n_batches = 500
for epoch in range(n_epochs):
    shuffled_idx = np.random.permutation(len(hidden2_outputs))
    hidden2_batches = np.array_split(hidden2_outputs[shuffled_idx], n_batches)
    y_batches = np.array_split(y_train[shuffled_idx], n_batches)
    for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
        sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})
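
The caching trick does not depend on TensorFlow, so it can be tried on a toy NumPy model. In this sketch the frozen part is a single fixed ReLU layer and the trainable part is a linear layer fitted by mini-batch gradient descent. The data, sizes, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
W_frozen = rng.normal(size=(8, 4))  # frozen lower layer, never updated

def frozen_layers(X):
    return np.maximum(0.0, X @ W_frozen)  # ReLU layer

# Synthetic targets that depend on the frozen features, plus a little noise.
W_true = rng.normal(size=(4, 1))
y_train = frozen_layers(X_train) @ W_true + 0.1 * rng.normal(size=(200, 1))

# Run the training set through the frozen layer once, before training starts.
cached = frozen_layers(X_train)

W_top = np.zeros((4, 1))  # trainable top layer
n_epochs, n_batches, learning_rate = 20, 10, 0.01
initial_loss = np.mean((cached @ W_top - y_train) ** 2)
for epoch in range(n_epochs):
    idx = rng.permutation(len(cached))
    for h_batch, y_batch in zip(np.array_split(cached[idx], n_batches),
                                np.array_split(y_train[idx], n_batches)):
        error = h_batch @ W_top - y_batch
        W_top -= learning_rate * 2 * h_batch.T @ error / len(h_batch)
final_loss = np.mean((cached @ W_top - y_train) ** 2)
print(initial_loss, final_loss)
```

The frozen layer is evaluated a single time, while the training loop only ever sees the cached features. The saving grows with the number of epochs and with the cost of the frozen layers.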

Start by freezing all the reused layers, train the model, and see how it performs.
Then try unfreezing one or two of the top hidden layers to let backpropagation tweak
them and see if performance improves. The more training data you have, the more
layers you can unfreeze.
If you still cannot get good performance, and you have little training data, try dropping
the top hidden layer(s) and freezing all remaining hidden layers again. You can iterate
until you find the right number of layers to reuse. If you have plenty of training
data, you may try replacing the top hidden layers instead of dropping them, and
even adding more hidden layers.

## FASTER OPTIMIZERS

The most popular faster optimizers are Momentum optimization,
Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam
optimization.

Adam usually yields good results, but it has 3 hyperparameters. The defaults are good, but if you want to tweak them you need to dig deeper into what they do, and to do that you need to understand the other optimizers first.
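
As a starting point for that digging, here is a minimal NumPy sketch of the Adam update rule as it is usually stated. I am assuming the 3 hyperparameters in question are the learning rate and the two decay rates `beta1` and `beta2`; `epsilon` is only a small smoothing term. The function name `adam_minimize` and the toy objective are illustrative.

```python
import numpy as np

def adam_minimize(grad_fn, theta, n_steps=500, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = np.zeros_like(theta)  # running mean of the gradients (momentum part)
    s = np.zeros_like(theta)  # running mean of the squared gradients (RMSProp part)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        s_hat = s / (1 - beta2 ** t)
        theta = theta - learning_rate * m_hat / (np.sqrt(s_hat) + epsilon)
    return theta

# Toy objective: squared distance to a target point, whose gradient is 2 * (theta - target).
target = np.array([3.0, -2.0])
theta = adam_minimize(lambda th: 2 * (th - target), np.zeros(2))
print(theta)
```

Seen this way, `beta1` plays the role of the momentum coefficient and `beta2` the role of the RMSProp decay rate, which is why understanding those two optimizers comes first.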

## REGULARIZATION (avoiding overfitting)