## Chapter 11. Training Deep Neural Nets

Difficulties of training a large deep neural network:
* The _vanishing gradients problem_ (or the related _exploding gradients problem_) makes lower layers very hard to train.
* Training is extremely slow.
* The model risks overfitting the training set.

### Vanishing/Exploding Gradients Problems

_Vanishing gradients_ problem: gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution.

_Exploding gradients_ problem: the gradients can grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is mostly encountered in recurrent neural networks.

More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

"Understanding the Difficulty of Training Deep Feedforward Neural Networks" by Xavier Glorot and Yoshua Bengio found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time, namely random initialization using a normal distribution with a mean of 0 and a standard deviation of 1. In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. 

#### Xavier and He Initialization

For the signal to flow properly, the variance of the outputs of each layer needs to be equal to the variance of its inputs, and the gradients need to have equal variance before and after flowing through a layer in the reverse direction. Both conditions cannot be guaranteed unless the layer has an equal number of inputs and outputs, but a good compromise is to initialize the connection weights randomly as follows:

_Xavier (Glorot) initialization (when using the logistic activation function)_

>Normal distribution with mean $0$ and variance $\ \sigma^2 = \frac{1}{fan_{avg}}$
<br>
>or a uniform distribution between $-r$ and $+r$, with $\ r=\sqrt{\frac{3}{fan_{avg}}}$

where $fan_{avg} = (fan_{in} + fan_{out})/2$ is the average of the number of input and output connections for the layer whose weights are being initialized (these numbers are called the *fan-in* and *fan-out*).

*LeCun initialization*: replace $fan_{avg}$ with $fan_{in}$


The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called *He* initialization. The SELU activation function should be used with LeCun initialization (preferably with a normal distribution).

*Initialization parameters for each type of activation function*

| Initialization   |   Activation functions   |   $\sigma^2$ (Normal)   |
|------|------|------|
| Glorot | None, Tanh, Logistic, Softmax  | $1/fan_{avg}$   | 
| He     | ReLU & variants | $2/fan_{in}$ |
| LeCun  | SELU            | $1/fan_{in}$ |

When $fan_{in} = fan_{out} = n$, the Glorot parameters simplify to:

$$\sigma = \frac{1}{\sqrt{n}} \quad \text{or} \quad r = \frac{\sqrt{3}}{\sqrt{n}}$$
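
As a quick numeric check of the table above, the following sketch computes the standard deviation $\sigma$ (normal distribution) and the bound $r$ (uniform distribution) for each scheme. The helper name `init_params` and the layer size (300 inputs, 100 outputs) are illustrative assumptions, not part of Keras.

```python
import math

# (variance numerator, fan mode) for each scheme, taken from the table above
SCHEMES = {
    "glorot": (1.0, "fan_avg"),
    "he": (2.0, "fan_in"),
    "lecun": (1.0, "fan_in"),
}

def init_params(scheme, fan_in, fan_out):
    """Return (sigma, r): std dev for a normal init, bound for a uniform init."""
    scale, mode = SCHEMES[scheme]
    fan = (fan_in + fan_out) / 2 if mode == "fan_avg" else fan_in
    variance = scale / fan
    sigma = math.sqrt(variance)
    r = math.sqrt(3 * variance)  # Uniform(-r, r) has variance r**2 / 3
    return sigma, r

for name in SCHEMES:
    sigma, r = init_params(name, fan_in=300, fan_out=100)
    print(f"{name:>6}: sigma={sigma:.4f}, r={r:.4f}")
```

The uniform bound follows from the variance of a uniform distribution on $[-r, r]$, which is $r^2/3$.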


<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/tab11-1.png" width=400px alt="tab11-1" style="padding-bottom:1.0em;padding-top:2.0em;"></center>Table 11-1. Initialization parameters for each type of activation function</div>

To use He initialization in Keras, set `kernel_initializer="he_normal"` (or `"he_uniform"`) when creating a layer:

In [1]:
import tensorflow as tf
import tensorflow.keras as keras

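# He initialization with a normal distribution (based on fan-in)
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# He-style scaling with a uniform distribution, based on fan_avg instead of fan_in
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)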

<tensorflow.python.keras.layers.core.Dense at 0x7fce28626090>

<font color=blue>*NOTE*</font>
>He initialization considers only the fan-in, not the average between fan-in and fan-out like in Xavier initialization. Fan-in is also the default mode of `keras.initializers.VarianceScaling`, but you can change this by setting the argument `mode="fan_avg"`.

#### Nonsaturating Activation Functions

The ReLU activation function behaves much better than the sigmoid activation function in deep networks, because it does not saturate for positive values (and also because it is quite fast to compute).

Problem, *dying ReLUs*: during training, some neurons effectively die, meaning they stop outputting anything other than $0$. If a neuron's weights get updated such that the weighted sum of its inputs is negative for all instances in the training set, it starts outputting $0$. Since the gradient of ReLU is $0$ for negative inputs, Gradient Descent no longer affects the neuron.

To solve this problem, use a variant such as the *leaky ReLU*:

$$ LeakyReLU_\alpha(z)=\max(\alpha z, z) $$

where the hyperparameter $\alpha$ defines how much the function "leaks" (the slope for $z < 0$) and is typically set to $0.01$. Reported results for leaky ReLU variants:
- A huge leak ($\alpha = 0.2$) seemed to result in better performance than a small leak ($\alpha = 0.01$).
- *Randomized leaky ReLU* (RReLU), where $\alpha$ is picked randomly in a given range during training and fixed to an average value during testing, also performed well and seemed to act as a regularizer (reducing the risk of overfitting the training set).
- *Parametric leaky ReLU* (PReLU), where $\alpha$ is learned during training (instead of being a hyperparameter), strongly outperformed ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
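
The leaky ReLU formula is small enough to sketch directly. The function names below are illustrative, and the slope returned at exactly $z = 0$ is a convention chosen for this sketch, since the function has a kink there.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: max(alpha * z, z), assuming 0 <= alpha <= 1."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope of leaky ReLU: 1 for z > 0, alpha otherwise (convention at z == 0)."""
    return 1.0 if z > 0 else alpha

for z in (-2.0, 0.0, 3.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Because the slope for negative inputs is $\alpha$ rather than $0$, a neuron whose weighted sum is negative still receives a gradient and can recover.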

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig11-2.png" width=400px alt="fig11-2" style="padding-bottom:1.0em;padding-top:2.0em;"></center>Figure 11-2. Leaky ReLU</div>

*Exponential linear unit* (ELU). 

$$ELU_\alpha(z) =
\begin{cases}
\alpha(\exp(z)-1) & \text{if } z < 0 \\
z & \text{if } z \ge 0
\end{cases}$$

With ELU, training time was reduced and the network performed better on the test set.

Major differences with ReLU:
- It takes on negative values when $z<0$, which allows the unit to have an average output closer to $0$. This helps alleviate the vanishing gradients problem. The hyperparameter $\alpha$ defines the value that ELU approaches when $z$ is a large negative number; it is usually set to $1$.
- It has a nonzero gradient for $z<0$, which avoids the dying units issue.
- If $\alpha = 1$, the function is smooth everywhere, including around $z=0$, which helps speed up Gradient Descent, since it does not bounce as much left and right of $z=0$.

Main drawback: it is slower to compute than ReLU and its variants (due to the exponential function). During training this is compensated by the faster convergence rate. However, at test time an ELU network is slower than a ReLU network.
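
As a minimal sketch of the ELU definition and its derivative (function names are illustrative), the code below shows why the function avoids dead units and why $\alpha = 1$ makes it smooth at $z = 0$.

```python
import math

def elu(z, alpha=1.0):
    """ELU: alpha * (exp(z) - 1) for z < 0, z otherwise."""
    return alpha * math.expm1(z) if z < 0 else z

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: alpha * exp(z) for z < 0, 1 otherwise."""
    return alpha * math.exp(z) if z < 0 else 1.0

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z={z:5.1f}  elu={elu(z):8.4f}  grad={elu_grad(z):.4f}")
```

For negative $z$ the gradient is small but never exactly zero, and with $\alpha = 1$ the left slope at $z = 0$ is $\exp(0) = 1$, matching the right slope. The call to `math.exp` is also where the extra computation cost comes from.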

<font color=blue>*TIP*</font>
>ELU > leaky ReLU > ReLU > tanh > logistic. Use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.

To use the leaky ReLU activation function, you must create a `LeakyReLU` instance like this:

In [2]:
leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
layer = keras.layers.Dense(10, activation=leaky_relu,
                           kernel_initializer="he_normal")

To use SELU, a scaled version of the ELU activation function, set `activation="selu"` and use LeCun initialization:

In [3]:
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

#### Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training.

*Batch Normalization* (BN) addresses the vanishing/exploding gradients problems, and more generally the problem that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change (the *Internal Covariate Shift* problem).

In order to zero-center and normalize the inputs, the algorithm needs to estimate the mean and standard deviation of each input over the current mini-batch.

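Each batch-normalized layer has four parameter vectors. Only the first two are learned through backpropagation; the last two are moving averages estimated during training and used after training in place of the mini-batch statistics:
- $\gamma$, the output scale (trainable)
- $\beta$, the output offset (trainable)
- $\mu$, the input mean (non-trainable moving average)
- $\sigma$, the input standard deviation (non-trainable moving average)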
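
The training-time computation can be sketched in plain Python as follows. This is a simplified illustration, not how Keras implements the layer: the function name, the list-of-lists batch format, and the smoothing term `eps` (added to the variance to avoid division by zero) are assumptions of this sketch.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-3):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of instances, each a list of feature values.
    Returns (outputs, batch_mean, batch_variance).
    """
    n = len(batch)
    n_features = len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(n_features)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n
           for j in range(n_features)]
    outputs = [
        [gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
         for j in range(n_features)]
        for row in batch
    ]
    return outputs, mean, var

# Two features with very different scales end up on a comparable scale.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
out, mu, var = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in out:
    print([round(v, 3) for v in row])
```

After training, the mini-batch statistics would be replaced by the moving averages of the mean and variance, so that a prediction for one instance does not depend on the other instances in its batch.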

The vanishing gradients problem is strongly reduced. The networks are also much less sensitive to the weight initialization. Much larger learning rates can be used, significantly speeding up the learning process. BN also acts like a regularizer.

Batch Normalization does, however, add some complexity to the model. Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization perform before playing with Batch Normalization.

##### Implementing Batch Normalization with TensorFlow

TensorFlow provides the `tf.nn.batch_normalization()` function, but you must compute the mean and standard deviation yourself. Instead, you should use the `keras.layers.BatchNormalization` layer, which handles this for you:

In [4]:
import tensorflow.keras as keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_4 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_5 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_6 (Dense)              (None, 10)                1

In [5]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

The BN algorithm uses *exponential decay* to compute the running averages, which is why it requires the *momentum* parameter: given a new value $v$, the running average $\hat v$ is updated as

$$ \hat v \leftarrow \hat v \times momentum + v \times (1-momentum)$$
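
A short sketch of this update rule, assuming a momentum of $0.99$ and a constant mini-batch mean of $2.0$ purely for illustration (the function name is hypothetical):

```python
def update_running_average(running, value, momentum=0.99):
    """Exponentially decaying running average, as in the update rule above."""
    return running * momentum + value * (1 - momentum)

moving_mean = 0.0
for step in range(500):
    batch_mean = 2.0  # stand-in for the mean measured on the current mini-batch
    moving_mean = update_running_average(moving_mean, batch_mean)

print(moving_mean)  # close to 2.0, but not there yet
```

A momentum close to $1$ makes the running average change slowly: each new value contributes only a fraction $1 - momentum$, so many mini-batches are needed before the estimate settles.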

#### Gradient Clipping

A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This is mostly useful for recurrent neural networks.
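
Two common variants are clipping each gradient component to a range and rescaling the whole gradient vector when its norm is too large. The sketch below illustrates both on a plain list; the function names are illustrative. In Keras, the optimizers accept `clipvalue` and `clipnorm` arguments for the same purpose.

```python
import math

def clip_by_value(grads, threshold):
    """Clip every component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, threshold):
    """Rescale the whole vector if its l2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    return [g * threshold / norm for g in grads]

grads = [0.9, 100.0]
print(clip_by_value(grads, 1.0))  # direction changes: components become comparable
print(clip_by_norm(grads, 1.0))   # direction preserved: first component almost vanishes
```

Clipping by value can change the direction of the gradient vector, while clipping by norm keeps the direction and only shrinks the length.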

### Reusing Pretrained Layers

*Transfer Learning*: find an existing neural network that accomplishes a similar task, then just reuse its lower layers. 

Benefits:
1. speed up training considerably.
2. require much less training data.

<font color=blue>*NOTE*</font>
>1. If the input size of the new task is different from the one used in the original task, you need to add a preprocessing step to resize the inputs to the size expected by the original model. 
>2. Generally, transfer learning will only work well if the inputs have similar low-level features.

#### Reusing a TensorFlow Model

#### Reusing Models from Other Frameworks

#### Freezing the Lower Layers

#### Caching the Frozen Layer

#### Tweaking, Dropping, or Replacing the Upper Layers

#### Model Zoos

https://github.com/tensorflow/models

https://github.com/ethereon/caffe-tensorflow

#### Unsupervised Pretraining

If you have plenty of unlabeled training data, train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as *Restricted Boltzmann Machines* (RBMs) or autoencoders. Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can fine-tune the network using supervised learning (i.e., with backpropagation).

#### Pretraining on an Auxiliary Task

One last option is to train a first neural network on an auxiliary task for which you can easily obtain or
generate labeled training data, then reuse the lower layers of that network for your actual task. The first
neural network's lower layers will learn feature detectors that will likely be reusable by the second
neural network.

*max margin learning*: train a first network to output a score for each training instance, and use a cost
function that ensures that a good instance's score is greater than a bad instance's score by at least some
margin.
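
One common way to write such a cost is a hinge-style loss on the score difference. The sketch below is an illustration under that assumption; the function name and the default margin of $1.0$ are hypothetical choices.

```python
def max_margin_loss(good_score, bad_score, margin=1.0):
    """Zero once the good score beats the bad score by at least `margin`."""
    return max(0.0, margin - (good_score - bad_score))

print(max_margin_loss(3.0, 1.0))  # margin satisfied, no penalty
print(max_margin_loss(1.5, 1.0))  # good instance wins, but by less than the margin
print(max_margin_loss(0.0, 1.0))  # bad instance scores higher, largest penalty
```

The loss only pushes on pairs that violate the margin, so training effort concentrates on instances the network still confuses.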

### Faster Optimizers

Ways to speed up large deep neural network training:
- Apply a good initialization strategy for the connection weights
- Use a good activation function
- Use Batch Normalization
- Reuse parts of a pretrained network (possibly built on an auxiliary task or using unsupervised learning)
- Use a faster optimizer than the regular Gradient Descent optimizer (Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and Adam and Nadam optimization).

#### Momentum Optimization

\begin{equation} \label{eq1}
\begin{split}
& 1. \ \ \mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta})  \\
& 2. \ \ \mathbf{\theta} \leftarrow \mathbf{\theta} + \mathbf{m}
\end{split}
\end{equation}

- $\mathbf{m}$ - *momentum vector* 
- $\beta$ - *momentum*, between 0 (high friction) and 1 (no friction); a typical value is 0.9

In deep neural networks that don't use Batch Normalization, the upper layers will often end up having inputs with very different scales, so using Momentum optimization helps a lot. It can also help roll past local optima.

In [6]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
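
To see the two update rules in action, here is a minimal sketch on a toy cost function, an elongated bowl that is 10 times steeper along the second parameter. The cost function, starting point, and step count are assumptions chosen for illustration.

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]

def gd_step(theta, eta=0.01):
    """Plain Gradient Descent step, for comparison."""
    return [t - eta * g for t, g in zip(theta, gradient(theta))]

def momentum_step(theta, m, eta=0.01, beta=0.9):
    """One Momentum optimization step."""
    g = gradient(theta)
    m = [beta * mi - eta * gi for mi, gi in zip(m, g)]  # 1. update the momentum vector
    theta = [ti + mi for ti, mi in zip(theta, m)]       # 2. move by the momentum vector
    return theta, m

theta_gd = [5.0, 5.0]
theta_mom, m = [5.0, 5.0], [0.0, 0.0]
for _ in range(500):
    theta_gd = gd_step(theta_gd)
    theta_mom, m = momentum_step(theta_mom, m)

print("plain GD:", theta_gd)
print("momentum:", theta_mom)
```

With the same learning rate, plain Gradient Descent is still creeping along the shallow first dimension after 500 steps, while the momentum version has built up speed and is essentially at the minimum.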

#### Nesterov Accelerated Gradient

\begin{equation} \label{eq2}
\begin{split}
& 1. \ \ \mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta}+\beta \mathbf{m})  \\
& 2. \ \ \mathbf{\theta} \leftarrow \mathbf{\theta} + \mathbf{m}
\end{split}
\end{equation}

The idea is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. 

In [7]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
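
The only change from regular Momentum optimization is where the gradient is measured. A minimal sketch, using an assumed toy cost function (an elongated bowl) and illustrative hyperparameters:

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]

def nesterov_step(theta, m, eta=0.01, beta=0.9):
    """One Nesterov Accelerated Gradient step."""
    # Measure the gradient slightly ahead, at theta + beta * m.
    lookahead = [ti + beta * mi for ti, mi in zip(theta, m)]
    g = gradient(lookahead)
    m = [beta * mi - eta * gi for mi, gi in zip(m, g)]  # 1. update the momentum vector
    theta = [ti + mi for ti, mi in zip(theta, m)]       # 2. move by the momentum vector
    return theta, m

theta, m = [5.0, 5.0], [0.0, 0.0]
for _ in range(500):
    theta, m = nesterov_step(theta, m)

print(theta)
```

Since the momentum vector usually points toward the optimum, the look-ahead gradient is slightly more accurate than the local one, which tends to damp oscillations.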

#### AdaGrad

\begin{equation} \label{eq3}
\begin{split}
& 1. \ \ \mathbf{s} \leftarrow \mathbf{s} + \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \otimes \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
& 2. \ \ \mathbf{\theta} \leftarrow \mathbf{\theta} - \eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \oslash \sqrt{\mathbf{s}+\epsilon}
\end{split}
\end{equation}

The algorithm can correct its direction to point a bit more toward the global optimum by scaling down the gradient vector along the steepest dimensions.

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum. One additional benefit is that it requires much less tuning of the learning rate hyperparameter $\eta$.

AdaGrad often performs well for simple quadratic problems, but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though Keras has an `Adagrad` optimizer, you should not use it to train deep neural networks.
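
The per-dimension scaling is easy to see on a toy problem. In the sketch below, the cost function (an elongated bowl, 10 times steeper along the second parameter), the learning rate, and `eps` are assumptions chosen for illustration.

```python
import math

def gradient(theta):
    """Gradient of the toy cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]

def adagrad_step(theta, s, eta=0.5, eps=1e-10):
    """One AdaGrad step."""
    g = gradient(theta)
    s = [si + gi * gi for si, gi in zip(s, g)]  # 1. accumulate squared gradients
    theta = [ti - eta * gi / math.sqrt(si + eps)
             for ti, gi, si in zip(theta, g, s)]  # 2. per-dimension scaled update
    return theta, s

theta, s = [5.0, 5.0], [0.0, 0.0]
theta1, s1 = adagrad_step(theta, s)
theta2, s2 = adagrad_step(theta1, s1)
print(theta1)  # both coordinates moved by about eta, despite 10x different slopes
print(s1, s2)  # s only ever grows, so the effective learning rate keeps shrinking
```

The first step moves both parameters by the same amount even though the second gradient component is 10 times larger, which is the direction correction described above. The ever-growing `s` is also the reason the algorithm can stall on long training runs.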

#### RMSProp

\begin{equation} \label{eq4}
\begin{split}
& 1. \ \ \mathbf{s} \leftarrow \beta \mathbf{s} + (1-\beta) \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \otimes \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
& 2. \ \ \mathbf{\theta} \leftarrow \mathbf{\theta} - \eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \oslash \sqrt{\mathbf{s}+\epsilon}
\end{split}
\end{equation}

Although AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step.

The decay rate $\beta$ is typically set to 0.9. 

Except on very simple problems, this optimizer almost always performs much better than AdaGrad. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.

In [8]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
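
A minimal sketch of the RMSProp update on an assumed toy cost function (an elongated bowl); the learning rate, `eps`, and step count are illustrative choices. Note that the decay rate called $\beta$ in the equations corresponds to the `rho` argument in Keras.

```python
import math

def cost(theta):
    """Toy cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def gradient(theta):
    """Gradient of the toy cost."""
    return [theta[0], 10.0 * theta[1]]

def rmsprop_step(theta, s, eta=0.01, beta=0.9, eps=1e-10):
    """One RMSProp step."""
    g = gradient(theta)
    # 1. exponentially decaying average of squared gradients
    s = [beta * si + (1 - beta) * gi * gi for si, gi in zip(s, g)]
    # 2. per-dimension scaled update
    theta = [ti - eta * gi / math.sqrt(si + eps)
             for ti, gi, si in zip(theta, g, s)]
    return theta, s

theta, s = [5.0, 5.0], [0.0, 0.0]
for _ in range(2000):
    theta, s = rmsprop_step(theta, s)

print(theta, cost(theta))
```

Unlike AdaGrad, old squared gradients fade out of `s`, so the effective learning rate does not shrink toward zero as training goes on.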

#### Adam and Nadam Optimization

\begin{equation} \label{eq5}
\begin{split}
& 1. \ \ \mathbf{m} \leftarrow \beta_1 \mathbf{m} - (1-\beta_1) \nabla_{\mathbf{\theta}}J(\mathbf{\theta})  \\
& 2. \ \ \mathbf{s} \leftarrow \beta_2 \mathbf{s} + (1-\beta_2) \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \otimes \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
& 3. \ \ \hat{\mathbf{m}} \leftarrow \frac{\mathbf{m}}{1-\beta_1^t} \\
& 4. \ \ \hat{\mathbf{s}} \leftarrow \frac{\mathbf{s}}{1-\beta_2^t} \\
& 5. \ \ \mathbf{\theta} \leftarrow \mathbf{\theta} + \eta \hat{\mathbf{m}} \oslash \sqrt{\hat{\mathbf{s}}+\epsilon}
\end{split}
\end{equation}

*Adam*: adaptive moment estimation.

It combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of
an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients.

Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter $\eta$. You can often use the default value $\eta = 0.001$, making Adam even easier to use than Gradient Descent.

In [9]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
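
The five Adam steps can be sketched on a toy problem as follows. The sketch uses the standard form of step 1, with the gradient measured at the current $\mathbf{\theta}$, and keeps the sign convention of these notes (negative sign in step 1, positive sign in step 5). The cost function and `eps` value are assumptions chosen for illustration.

```python
import math

def cost(theta):
    """Toy cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def gradient(theta):
    """Gradient of the toy cost."""
    return [theta[0], 10.0 * theta[1]]

def adam_step(theta, m, s, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam step; t is the iteration number, starting at 1."""
    g = gradient(theta)
    m = [beta1 * mi - (1 - beta1) * gi for mi, gi in zip(m, g)]       # 1. decaying mean of gradients
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, g)]  # 2. decaying mean of squared gradients
    m_hat = [mi / (1 - beta1 ** t) for mi in m]                       # 3. bias correction for m
    s_hat = [si / (1 - beta2 ** t) for si in s]                       # 4. bias correction for s
    theta = [ti + eta * mh / math.sqrt(sh + eps)
             for ti, mh, sh in zip(theta, m_hat, s_hat)]              # 5. parameter update
    return theta, m, s

theta, m, s = [5.0, 5.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 1001):
    theta, m, s = adam_step(theta, m, s, t)

print(theta, cost(theta))
```

Steps 3 and 4 matter mostly at the start of training: since `m` and `s` are initialized at zero, they would otherwise be biased toward zero during the first iterations.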

#### Learning Rate Scheduling

Finding a good learning rate can be tricky. If you set it way too high, training may actually diverge. If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it
slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never really settling down. If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig11-8.png" width=400px alt="fig11-8" style="padding-bottom:1.0em;padding-top:2.0em;"></center>Figure 11-8. Learning curves for various learning rates $\eta$</div>

Learning Schedules:
- Power scheduling
- Exponential scheduling
- Piecewise constant scheduling
- Performance scheduling
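
The first three schedules can be written as simple functions of the iteration (or epoch) number. The formulas below are commonly used forms and are assumptions of this sketch, as are all the constants: power scheduling as $\eta(t) = \eta_0 / (1 + t/s)^c$ and exponential scheduling as $\eta(t) = \eta_0 \cdot 0.1^{t/s}$.

```python
def power_schedule(t, eta0=0.01, s=1000, c=1.0):
    """Learning rate drops to eta0/2 after s steps (for c=1), then more slowly."""
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=1000):
    """Learning rate is divided by 10 every s steps."""
    return eta0 * 0.1 ** (t / s)

def piecewise_constant_schedule(epoch, boundaries=(5, 15),
                                values=(0.01, 0.005, 0.001)):
    """Constant learning rate within each range of epochs."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]

for t in (0, 1000, 2000):
    print(t, power_schedule(t), exponential_schedule(t))
print([piecewise_constant_schedule(e) for e in (0, 5, 20)])
```

Performance scheduling does not follow a fixed formula: it measures the validation error every few steps and reduces the learning rate by some factor when the error stops dropping. In Keras this is what the `ReduceLROnPlateau` callback does.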

### Avoiding Overfitting Through Regularization

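Two regularization techniques have already been covered: early stopping (Chapter 10) and Batch Normalization, which was designed to fix unstable gradients but also acts as a regularizer. This section covers other popular techniques.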

#### $l_1$ and $l_2$ Regularization

Use $l_1$ and $l_2$ regularization to constrain a neural network's connection weights.

In [10]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))  # or l1(...) / l1_l2(...)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])
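
What the regularizer adds to the loss can be sketched in a few lines. The weights and the data loss value below are made-up numbers for illustration; the factor $0.01$ matches the one passed to `keras.regularizers.l2` above.

```python
def l1_penalty(weights, factor):
    """l1 penalty: factor times the sum of absolute weights."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """l2 penalty: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]
data_loss = 0.3  # hypothetical loss measured on the training batch
total_loss = data_loss + l2_penalty(weights, 0.01)

print(l1_penalty(weights, 0.01), l2_penalty(weights, 0.01), total_loss)
```

Because the penalty grows with the size of the weights, minimizing the total loss pushes the weights toward smaller values. The $l_1$ penalty tends to drive some weights to exactly zero, which is why it is used to obtain sparse models.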

#### Dropout

*Dropout* is one of the most popular regularization techniques for deep neural networks.

It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability $p$ of being temporarily "dropped out," meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter $p$ is called the *dropout rate*, and it is typically set to $50\%$. After training, neurons don't get dropped anymore.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig11-9.png" width=400px alt="fig11-8" style="padding-bottom:1.0em;padding-top:2.0em;"></center>Figure 11-9. Dropout regularization</div>

Neurons trained with dropout cannot
co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there is a total of $2^N$ possible networks (where $N$ is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent since they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging **ensemble** of all these smaller neural networks.

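There is one small but important technical detail. Suppose $p = 50\%$, in which case during testing a neuron will be connected to twice as many input neurons as it was (on average) during training. To compensate for this fact, we need to multiply each neuron's input connection weights by 0.5 after training. More generally, we multiply each input connection weight by the *keep probability* $(1 - p)$ after training, or equivalently divide each neuron's output by the keep probability during training.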
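
The need for this compensation can be checked numerically. The sketch below drops the inputs of a single neuron at random during "training" and compares the average weighted sum with the test-time output computed from rescaled weights. The inputs, weights, seed, and number of steps are made-up values for illustration.

```python
import random

def dropout_mask(n, p, rng):
    """Return n values: 0.0 (dropped, with probability p) or 1.0 (kept)."""
    return [0.0 if rng.random() < p else 1.0 for _ in range(n)]

def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

p = 0.5
inputs = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, -0.25, 0.1, 0.2]

# Average weighted sum seen by the neuron over many training steps with dropout.
rng = random.Random(0)
n_steps = 20000
train_avg = sum(
    weighted_sum([x * k for x, k in zip(inputs, dropout_mask(len(inputs), p, rng))],
                 weights)
    for _ in range(n_steps)
) / n_steps

# At test time no input is dropped, so the weights are scaled by the keep probability.
test_weights = [w * (1 - p) for w in weights]
test_output = weighted_sum(inputs, test_weights)

print(train_avg, test_output)
```

Without the rescaling, the neuron would receive a total input signal about twice as large at test time as the one it was trained on.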

In [11]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones. Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.

Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.

#### Monte-Carlo (MC) Dropout

In short, MC Dropout is a fantastic technique that boosts dropout models and provides better uncertainty estimates. And of course, since it is just regular dropout during training, it also acts like a regularizer.
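
The idea is to keep dropout active at prediction time, make many stochastic predictions for the same input, and average them; the spread of the predictions gives a rough uncertainty estimate. The sketch below illustrates this on a toy linear model with dropout applied to its inputs. The model, weights, and sample count are assumptions for illustration; with a Keras model, the stochastic passes would typically come from calling the model with `training=True`.

```python
import random
import statistics

def stochastic_forward(x, weights, p, rng):
    """One forward pass of a toy linear model with dropout kept active.

    Surviving inputs are divided by the keep probability, so the expected
    output matches the prediction made without dropout.
    """
    keep_prob = 1 - p
    return sum(w * (xi / keep_prob if rng.random() >= p else 0.0)
               for w, xi in zip(weights, x))

def mc_dropout_predict(x, weights, p=0.2, n_samples=1000, seed=42):
    """Return the mean and standard deviation over many stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, weights, p, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

x = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.1]
mean, std = mc_dropout_predict(x, weights)
print(mean, std)
```

The mean plays the role of the ensemble prediction, while a large standard deviation signals that the sampled sub-networks disagree about this input.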

#### Max-Norm Regularization

For each neuron, it constrains the weights $\mathbf{w}$ of the incoming connections such that $\lVert \mathbf{w} \rVert_2 \le r$, where $r$ is the max-norm hyperparameter and $\lVert \cdot \rVert_2$ is the $l_2$ norm.
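
The constraint is typically enforced by rescaling the weight vector after each training step whenever its norm exceeds $r$. A minimal sketch of that rescaling (the function name is illustrative); in Keras, the same effect is obtained by passing `kernel_constraint=keras.constraints.max_norm(1.)` to a layer.

```python
import math

def max_norm_constraint(w, r=1.0):
    """Rescale w so that its l2 norm is at most r; leave it unchanged otherwise."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= r:
        return list(w)
    return [wi * r / norm for wi in w]

print(max_norm_constraint([3.0, 4.0]))  # norm 5 is rescaled down to norm 1
print(max_norm_constraint([0.3, 0.4]))  # norm 0.5 is already within the limit
```

Reducing $r$ increases the amount of regularization, since it leaves the weights less room to grow.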

### Summary and Practical Guidelines

Don't forget to standardize the input features! Try to reuse parts of a pretrained neural network if you can find one that solves a similar problem, or use unsupervised pretraining if you have a lot of unlabeled data, or pretraining on an auxiliary task if you have a lot of labeled data for a similar task.

Default DNN configuration:

| Hyperparameter | Default value |
|------|------|
|   Kernel initializer:  | LeCun initializer|
|   Activation function: | SELU |
|   Normalization:       | None (self-normalization) |
|   Regularization:      | Early stopping |
|   Optimizer:           | Nadam |
|   Learning rate schedule: | Performance scheduling |

- If your model self-normalizes:
    - If it overfits the training set, then you should add alpha dropout (and always use early stopping as well). Do not use other regularization methods, or else they would break self-normalization.
- If your model cannot self-normalize (e.g., it is a recurrent net or it contains skip connections):
    - You can try using ELU (or another activation function) instead of SELU; it may perform better. Make sure to change the initialization method accordingly (e.g., He init for ELU or ReLU).
    - If it is a deep network, you should use Batch Normalization after every hidden layer. If it overfits the training set, you can also try using max-norm or $l_2$ regularization.
- If you need a sparse model, you can use $l_1$ regularization (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Nadam optimization, along with $l_1$ regularization. In any case, this will break self-normalization, so you will need to switch to BN if your model is deep.
- If you need a low-latency model (one that performs lightning-fast predictions), you may need to use fewer layers, avoid Batch Normalization, and possibly replace the SELU activation function with the leaky ReLU. Having a sparse model will also help. You may also want to reduce the float precision from 32 bits to 16 bits (or even 8 bits).
- If you are building a risk-sensitive application, or inference latency is not very important in your application, you can use MC Dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates.