```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow import keras
import tensorflow as tf
```

# 11. Training Deep Neural Networks
In `Chapter 10` we introduced artificial neural networks and trained our first deep neural networks. But they were shallow nets, with just a few hidden layers. What if you need to tackle a complex problem, such as detecting hundreds of types of objects in high-resolution images? You may need to train a much deeper `DNN`, perhaps with 10 layers or many more, each containing hundreds of neurons, linked by hundreds of thousands of connections. Training such a deep network isn’t a walk in the park. Here are some of the problems you could run into:

+ You may be faced with the tricky vanishing gradients problem or the related exploding gradients problem. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backward through the `DNN` during training. Both of these problems make lower layers very hard to train.

+ You might not have enough training data for such a large network, or it might be too costly to label.

+ Training may be extremely slow.

+ A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.

In this chapter we will go through each of these problems and present techniques to solve them. We will start by exploring the vanishing and exploding gradients problems and some of their most popular solutions. Next, we will look at transfer learning and unsupervised pretraining, which can help you tackle complex tasks even when you have little labeled data. Then we will discuss various optimizers that can speed up training large models tremendously. Finally, we will go through a few popular regularization techniques for large neural networks.

With these tools, you will be able to train very deep nets. Welcome to Deep Learning!

## 11.1 The Vanishing/Exploding Gradients Problems
As we discussed in `Chapter 10`, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a `Gradient Descent` step.

Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the `Gradient Descent` update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good solution. We call this the *vanishing gradients problem*. In some cases, the opposite can happen: the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges. This is the *exploding gradients problem*, which surfaces in recurrent neural networks (see `Chapter 15`). More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

This unfortunate behavior was empirically observed long ago, and it was one of the reasons deep neural networks were mostly abandoned in the early 2000s. It wasn’t clear what caused the gradients to be so unstable when training a `DNN`, but some light was shed in a 2010 paper by Xavier Glorot and Yoshua Bengio. The authors found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time (i.e., a normal distribution with a mean of 0 and a standard deviation of 1). In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This saturation is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in deep networks).
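
The effect of that older initialization scheme can be illustrated with a rough NumPy sketch. It is not the experiment from the paper: the layer size (100), the batch size (1,000), and the 0.01/0.99 saturation thresholds are arbitrary choices made here for illustration. It draws weights from a normal distribution with standard deviation 1 and checks how far the variance of the weighted sums drifts from the variance of the inputs, and how many logistic units end up saturated after a single layer.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units = 100                                   # illustrative fan-in and fan-out

X = rng.normal(size=(1000, n_units))            # standardized inputs
W = rng.normal(0.0, 1.0, size=(n_units, n_units))  # mean 0, std 1 initialization

Z = X @ W                                       # weighted sums of the layer
A = 1.0 / (1.0 + np.exp(-Z))                    # logistic activations

input_var = X.var()
z_var = Z.var()
# Fraction of activations sitting very close to 0 or 1
saturated = np.mean((A < 0.01) | (A > 0.99))

print(f"input variance:        {input_var:.2f}")
print(f"weighted-sum variance: {z_var:.2f}")
print(f"saturated units:       {saturated:.0%}")
```

With these settings the variance of the weighted sums is roughly `n_units` times the input variance, so most logistic units are already saturated after one layer.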

Looking at the logistic activation function (see `Figure 11-1`), you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in it has virtually no gradient to propagate back through the network; and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.

<img src="images/11_01.png" style="width:600px;"/>
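
The saturation described above can be checked numerically. The sketch below uses only the standard library; the helper names `sigmoid` and `sigmoid_derivative` are illustrative. It evaluates the derivative of the logistic function at a few points and shows that even the best case (0.25, reached at $z = 0$) shrinks quickly when one such factor is multiplied in per layer, ignoring the weights.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:4.1f}  sigmoid = {sigmoid(z):.5f}  "
          f"derivative = {sigmoid_derivative(z):.2e}")

# The derivative never exceeds 0.25, so a chain of 10 such factors
# is at most 0.25 ** 10 (weights are ignored in this simplification).
best_case_after_10_layers = 0.25 ** 10
print(f"best case after 10 layers: {best_case_after_10_layers:.2e}")
```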

### 11.1.1 Glorot and He Initialization
In their paper, Glorot and Bengio propose a way to significantly alleviate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don’t want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction (please check out the paper if you are interested in the mathematical details). It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the fan-in and fan-out of the layer), but Glorot and Bengio proposed a good compromise that has proven to work very well in practice: the connection weights of each layer must be initialized randomly as described in `Equation 11-1`, where $\text{fan}_{avg} = \displaystyle\frac{(\text{fan}_{in}+\text{fan}_{out})}{2}$. This initialization strategy is called `Xavier initialization` or `Glorot initialization`, after the paper’s first author.

<img src="images/e_11_01.png" style="width:500px;"/>
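
As a rough illustration of `Equation 11-1`, the NumPy sketch below samples a weight matrix either from a normal distribution with variance $1/\text{fan}_{avg}$ or from a uniform distribution on $[-r, r]$ with $r = \sqrt{3/\text{fan}_{avg}}$. The function names and the layer shape (300 inputs, 100 neurons) are illustrative, and this is not the `Keras` implementation, which may differ in details such as truncating the normal distribution.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Sample weights from N(0, 1 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = np.sqrt(1.0 / fan_avg)
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Sample weights uniformly from [-r, r] with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
W_normal = glorot_normal(300, 100, rng)
W_uniform = glorot_uniform(300, 100, rng)

target_variance = 1.0 / ((300 + 100) / 2)
print("target variance: ", target_variance)
print("normal variant:  ", W_normal.var())
print("uniform variant: ", W_uniform.var())
```

Both variants should produce an empirical variance close to the target, since a uniform distribution on $[-r, r]$ has variance $r^2/3$.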

If you replace $\text{fan}_{avg}$ with $\text{fan}_{in}$ in `Equation 11-1`, you get an initialization strategy that Yann LeCun proposed in the 1990s. He called it `LeCun initialization`. Genevieve Orr and Klaus-Robert Müller even recommended it in their 1998 book *Neural Networks: Tricks of the Trade* (Springer). `LeCun initialization` is equivalent to `Glorot initialization` when $\text{fan}_{in} = \text{fan}_{out}$. It took over a decade for researchers to realize how important this trick is. Using `Glorot initialization` can speed up training considerably, and it is one of the tricks that led to the success of Deep Learning.

Some papers have provided similar strategies for different activation functions. These strategies differ only by the scale of the variance and whether they use $\text{fan}_{avg}$ or $\text{fan}_{in}$, as shown in `Table 11-1` (for the uniform distribution, just compute $r = \sqrt{3\sigma^{2}}$). The initialization strategy for the `ReLU activation` function (and its variants, including the `ELU activation` described shortly) is sometimes called `He initialization`, after the paper’s first author. The `SELU activation` function will be explained later in this chapter. It should be used with `LeCun initialization` (preferably with a normal distribution, as we will see).

<img src="images/t_11_01.png" style="width:500px;"/>
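
The three strategies can be summarized in a few lines of plain Python. The sketch below assumes the commonly cited variances: $1/\text{fan}_{avg}$ for Glorot, $2/\text{fan}_{in}$ for He, and $1/\text{fan}_{in}$ for LeCun. The `STRATEGIES` dictionary and the `init_params` helper are hypothetical names used only for this example.

```python
import math

# strategy name -> (variance scale, which fan the variance is divided by)
STRATEGIES = {
    "glorot": (1.0, "fan_avg"),
    "he": (2.0, "fan_in"),
    "lecun": (1.0, "fan_in"),
}

def init_params(strategy, fan_in, fan_out):
    """Return (variance of the normal variant, r of the uniform variant)."""
    scale, mode = STRATEGIES[strategy]
    fan = fan_in if mode == "fan_in" else (fan_in + fan_out) / 2
    variance = scale / fan
    r = math.sqrt(3 * variance)   # uniform on [-r, r] has the same variance
    return variance, r

for name in STRATEGIES:
    variance, r = init_params(name, fan_in=100, fan_out=50)
    print(f"{name:7s} variance = {variance:.4f}  r = {r:.4f}")
```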

By default, `Keras` uses `Glorot initialization` with a uniform distribution. When creating a layer, you can change this to `He initialization` by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"` like this:

```python
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
```

If you want `He initialization` with a uniform distribution but based on $\text{fan}_{avg}$ rather than $\text{fan}_{in}$, you can use the `VarianceScaling` initializer like this:

```python
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
```

### 11.1.2 Nonsaturating Activation Functions
One of the insights in the 2010 paper by Glorot and Bengio was that the problems with unstable gradients were in part due to a poor choice of activation function. Until then most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks, in particular the `ReLU activation` function, mostly because it does not saturate for positive values (and because it is fast to compute).

Unfortunately, the `ReLU activation` function is not perfect. It suffers from a problem known as `dying ReLUs`: during training, some neurons effectively `die`, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting zeros, and `Gradient Descent` does not affect it anymore because the gradient of the `ReLU` function is zero when its input is negative.
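
A small NumPy sketch makes the mechanism concrete. The inputs, weights, bias, and upstream gradients below are made up for illustration: the inputs are non-negative (as they would be after a previous `ReLU` layer) and the weights are all negative, so the weighted sum is negative for every instance and no gradient reaches the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single neuron with 5 inputs and 1,000 training instances
X = rng.uniform(0.0, 1.0, size=(1000, 5))       # non-negative inputs
w = np.array([-0.5, -0.2, -0.1, -0.3, -0.4])    # weights after a bad update
b = -0.1

z = X @ w + b                        # weighted sums: negative for every instance
a = np.maximum(0.0, z)               # ReLU outputs: all zeros
relu_grad = (z > 0).astype(float)    # derivative of ReLU with respect to z

# Whatever gradient arrives from the layers above, it is multiplied by
# relu_grad, so nothing flows back to the weights.
upstream = rng.normal(size=1000)     # arbitrary dLoss/da values
grad_w = (upstream * relu_grad) @ X

print("largest output:       ", a.max())
print("gradient w.r.t. w:    ", grad_w)
```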

To solve this problem, you may want to use a variant of the `ReLU` function, such as the `leaky ReLU`. This function is defined as $\text{LeakyReLU}_{\alpha}(z) = \max(\alpha z, z)$ (see `Figure 11-2`). The hyperparameter $\alpha$ defines how much the function `leaks`: it is the slope of the function for $z < 0$ and is typically set to 0.01. This small slope ensures that `leaky ReLUs` never die; they can go into a long coma, but they have a chance to eventually wake up. A 2015 paper compared several variants of the `ReLU activation` function, and one of its conclusions was that the leaky variants always outperformed the strict `ReLU activation` function. In fact, setting $\alpha = 0.2$ (a huge leak) seemed to result in better performance than $\alpha = 0.01$ (a small leak). The paper also evaluated the `randomized leaky ReLU` (`RReLU`), where $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing. `RReLU` also performed fairly well and seemed to act as a regularizer (reducing the risk of overfitting the training set). Finally, the paper evaluated the `parametric leaky ReLU` (`PReLU`), where $\alpha$ is learned during training (instead of being a hyperparameter, it becomes a parameter that backpropagation can modify like any other parameter). `PReLU` was reported to strongly outperform `ReLU` on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

<img src="images/11_02.png" style="width:600px;"/>
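
The definition $\max(\alpha z, z)$ translates directly into NumPy. This is a minimal sketch assuming $0 < \alpha < 1$; the function names are illustrative and unrelated to the `Keras` layer of the same name.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """LeakyReLU_alpha(z) = max(alpha * z, z), assuming 0 < alpha < 1."""
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope is 1 for z > 0 and alpha otherwise, so it is never zero."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-10.0, -1.0, 0.0, 3.0])
print(leaky_relu(z))               # small leak
print(leaky_relu(z, alpha=0.2))    # "huge" leak
print(leaky_relu_grad(z))
```

Because the slope for negative inputs is $\alpha$ rather than 0, some gradient always reaches the weights, which is why such a unit can recover.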

Last but not least, a 2015 paper by Djork-Arné Clevert et al. proposed a new activation function called the `exponential linear unit` (`ELU`) that outperformed all the `ReLU` variants in the authors’ experiments: training time was reduced, and the neural network performed better on the test set. `Figure 11-3` graphs the function, and `Equation 11-2` shows its definition.

<img src="images/11_03.png" style="width:600px;"/>

The `ELU activation` function looks a lot like the `ReLU` function, with a few major differences:
+ It takes on negative values when $z < 0$, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem. The hyperparameter $\alpha$ defines the value that the `ELU function` approaches when $z$ is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter.

+ It has a nonzero gradient for $z < 0$, which avoids the dead neurons problem.

+ If $\alpha$ is equal to 1 then the function is smooth everywhere, including around $z = 0$, which helps speed up `Gradient Descent` since it does not bounce as much to the left and right of $z = 0$.
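
These three properties can be checked with a short NumPy sketch. It assumes the usual definition of the function, $\text{ELU}_{\alpha}(z) = \alpha(\exp(z) - 1)$ for $z < 0$ and $z$ for $z \ge 0$; the helper names are illustrative.

```python
import numpy as np

def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, z otherwise."""
    z = np.asarray(z, dtype=float)
    # np.where evaluates both branches, so clip z before exp() to avoid
    # overflow warnings for large positive inputs.
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

def elu_grad(z, alpha=1.0):
    """alpha * exp(z) for z < 0, 1 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

print(elu([-50.0, -1.0, 0.0, 3.0]))       # approaches -alpha for very negative z
print(elu_grad([-5.0, -1.0]))             # small but nonzero for z < 0
print(elu_grad(-1e-9), elu_grad(1e-9))    # slopes match around 0 when alpha = 1
print(elu_grad(-1e-9, alpha=2.0))         # slope jumps at 0 when alpha != 1
```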

The main drawback of the `ELU activation` function is that it is slower to compute than the `ReLU function` and its variants (due to the use of the exponential function). Its faster convergence rate during training compensates for that slow computation, but still, at test time an `ELU` network will be slower than a `ReLU` network.

Then, a 2017 paper by Günter Klambauer et al. introduced the `Scaled ELU` (`SELU`) activation function: as its name suggests, it is a scaled variant of the `ELU activation` function. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the `SELU activation` function, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, the `SELU activation` function often significantly outperforms other activation functions for such neural nets (especially deep ones). There are, however, a few conditions for self-normalization to happen (see the paper for the mathematical justification):
+ The input features must be standardized (mean 0 and standard deviation 1).

+ Every hidden layer’s weights must be initialized with LeCun normal initialization. In Keras, this means setting `kernel_initializer="lecun_normal"`.

+ The network’s architecture must be sequential. Unfortunately, if you try to use `SELU` in nonsequential architectures, such as recurrent networks (see `Chapter 15`) or networks with skip connections (i.e., connections that skip layers, such as in `Wide & Deep nets`), self-normalization will not be guaranteed, so `SELU` will not necessarily outperform other activation functions.

+ The paper only guarantees self-normalization if all layers are dense, but some researchers have noted that the `SELU activation` function can improve performance in convolutional neural nets as well (see `Chapter 14`).
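
Self-normalization can be observed in a simple forward-pass simulation. The NumPy sketch below makes several assumptions: the layer width (100), depth (50), and batch size (500) are arbitrary, the constants are the rounded values of $\alpha$ and the scale from the `SELU` paper, and no training takes place. It feeds standardized inputs through a stack of dense layers with LeCun normal weights and tracks the mean and standard deviation of each layer's output.

```python
import numpy as np

# Rounded constants from the SELU paper
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """scale * ELU_alpha(z)."""
    return SCALE * np.where(z < 0, ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

rng = np.random.default_rng(42)
n_neurons = 100                              # illustrative layer width
Z = rng.normal(size=(500, n_neurons))        # standardized inputs

stats = []
for layer in range(50):
    # LeCun normal initialization: variance 1 / fan_in
    W = rng.normal(0.0, np.sqrt(1.0 / n_neurons), size=(n_neurons, n_neurons))
    Z = selu(Z @ W)
    stats.append((Z.mean(), Z.std()))

for layer in (0, 9, 24, 49):
    mean, std = stats[layer]
    print(f"layer {layer + 1:2d}: mean = {mean:+.2f}, std = {std:.2f}")
```

In this setting the mean should stay near 0 and the standard deviation near 1 across all 50 layers, instead of collapsing or blowing up.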

> **NOTES**
> 
> So, which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, in general `SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic`. If the network’s architecture prevents it from self-normalizing, then `ELU` may perform better than `SELU` (since `SELU` is not smooth at $z = 0$). If you care a lot about runtime latency, then you may prefer `leaky ReLU`. If you don’t want to tweak yet another hyperparameter, you may use the default $\alpha$ values used by `Keras` (e.g., 0.3 for `leaky ReLU`). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as `RReLU` if your network is overfitting or `PReLU` if you have a huge training set. That said, because `ReLU` is the most used activation function (by far), many libraries and hardware accelerators provide `ReLU`-specific optimizations; therefore, if speed is your priority, `ReLU` might still be the best choice.

To use the `leaky ReLU activation` function, create a `LeakyReLU` layer and add it to your model just after the layer you want to apply it to:


```python
model = keras.models.Sequential([
    # [...]
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),  # applied to the output of the Dense layer above
    # [...]
])
```

For `PReLU`, replace `LeakyReLU(alpha=0.2)` with `PReLU()`. There is currently no official implementation of `RReLU` in `Keras`, but you can fairly easily implement your own (to learn how to do that, see the exercises at the end of `Chapter 12`).

For `SELU activation`, set `activation="selu"` and `kernel_initializer="lecun_normal"` when creating a layer:

```python
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")
```