## Chapter 11. Training Deep Neural Nets

Training a large deep neural network raises several difficulties:
* The _vanishing gradients problem_ (or the related _exploding gradients problem_) makes the lower layers very hard to train.
* Training is extremely slow.
* The network risks overfitting the training set.

### Vanishing/Exploding Gradients Problems

_Vanishing gradients_ problem: gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution.

_Exploding gradients_ problem: the gradients can grow bigger and bigger, so many layers get very large weight updates and the algorithm diverges. This is mostly encountered in recurrent neural networks.

More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.
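
One way to see how gradients can vanish with a saturating activation such as the logistic sigmoid: its derivative never exceeds 0.25, so each layer can only shrink the part of the gradient that flows through the activation. The sketch below is a simplified illustration in plain Python. The helper names are hypothetical, and it ignores the weight matrices, which also scale the gradient.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and shrinks quickly as |z| grows (saturation).
max_grad = sigmoid_grad(0.0)
saturated_grad = sigmoid_grad(5.0)

# Upper bound on the factor contributed by the activations alone
# when backpropagating through 10 logistic layers.
bound_10_layers = max_grad ** 10

print(max_grad, saturated_grad, bound_10_layers)
```

Even in the best case, ten such layers scale the gradient by less than one millionth, which is why the lower layers barely move.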

"Understanding the Difficulty of Training Deep Feedforward Neural Networks" by Xavier Glorot and Yoshua Bengio found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time, namely random initialization using a normal distribution with a mean of 0 and a standard deviation of 1. In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. 

#### Xavier and He Initialization

For the signal to flow properly, the variance of the outputs of each layer needs to be equal to the variance of its inputs, and the gradients also need to have equal variance before and after flowing through a layer in the reverse direction. Both conditions cannot be guaranteed unless the layer has as many inputs as outputs, but a good compromise is to initialize the connection weights randomly as follows:

_Xavier (Glorot) initialization (when using the logistic activation function)_

>Normal distribution with mean $0$ and standard deviation $\ \sigma = \sqrt{\frac{2}{n_{in}+n_{out}}}$
<br>
>or a uniform distribution between $-r$ and $+r$, with $\ r=\sqrt{\frac{6}{n_{in}+n_{out}}}$

   where $n_{in}$ and $n_{out}$ are the number of input and output connections for the layer whose weights are being initialized (also called _fan-in_ and _fan-out_).

When $n_{in} \approx n_{out}$ (writing $n$ for their common value), these simplify to

$$\sigma = \frac{1}{\sqrt{n}} \quad \text{or} \quad r = \frac{\sqrt{3}}{\sqrt{n}}$$
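
The two Xavier variants can be checked numerically. The following is a small sketch in plain Python (the function names are illustrative, not part of any library). It computes $\sigma$ and $r$ for a layer and confirms that the normal and uniform versions give the weights the same variance.

```python
import math

def glorot_normal_std(n_in, n_out):
    """Standard deviation for the normal variant: sqrt(2 / (n_in + n_out))."""
    return math.sqrt(2.0 / (n_in + n_out))

def glorot_uniform_limit(n_in, n_out):
    """Limit r for the uniform variant: sqrt(6 / (n_in + n_out))."""
    return math.sqrt(6.0 / (n_in + n_out))

n_in, n_out = 784, 300  # e.g. MNIST inputs feeding a 300-unit hidden layer
sigma = glorot_normal_std(n_in, n_out)
r = glorot_uniform_limit(n_in, n_out)

# A uniform distribution on [-r, r] has variance r**2 / 3,
# which matches sigma**2 from the normal variant.
uniform_variance = r ** 2 / 3.0

print(sigma, r, uniform_variance)
```

With equal fan-in and fan-out, the same functions reproduce the simplified formulas: `glorot_normal_std(100, 100)` equals $1/\sqrt{100}$.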

The initialization strategy for the ReLU activation function (and its variants, including the ELU activation) is sometimes called _He initialization_.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/tab11-1.png" width=400px alt="tab11-1" style="padding-bottom:1.0em;padding-top:2.0em;"></center>Table 11-1. Initialization parameters for each type of activation function</div>

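To use He initialization with the TensorFlow 1.x API:

In [2]:
import tensorflow as tf

tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

# scale=2.0 with the default mode="fan_in" gives He initialization
# (the default scale=1.0 would give a variance of 1 / fan_in instead).
he_init = tf.variance_scaling_initializer(scale=2.0)
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")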

<font color=blue>_NOTE_</font>
>He initialization considers only the fan-in, not the average between fan-in and fan-out like in Xavier initialization. This is also the default for the `variance_scaling_initializer()` function, but you can change this by setting the argument `mode="fan_avg"`.
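
The difference between the two modes can be sketched in plain Python. This assumes the usual variance-scaling convention, where the weight variance is `scale / n` and `n` is either the fan-in or the average of fan-in and fan-out. The function below is a hypothetical helper written for illustration, not a TensorFlow API.

```python
import math

def variance_scaling_std(n_in, n_out, scale=2.0, mode="fan_in"):
    """Standard deviation sqrt(scale / n), where n depends on the mode."""
    if mode == "fan_in":
        n = n_in
    elif mode == "fan_avg":
        n = (n_in + n_out) / 2.0
    else:
        raise ValueError("mode must be 'fan_in' or 'fan_avg'")
    return math.sqrt(scale / n)

# He initialization: scale of 2, fan-in only.
he_std = variance_scaling_std(784, 300)

# Xavier (Glorot) initialization: scale of 1, average of fan-in and fan-out.
glorot_std = variance_scaling_std(784, 300, scale=1.0, mode="fan_avg")

print(he_std, glorot_std)
```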

#### Nonsaturating Activation Functions

The ReLU activation function behaves much better than the sigmoid activation function in deep networks, because it does not saturate for positive values (and also because it is quite fast to compute).

It suffers, however, from a problem known as *dying ReLUs*: during training, some neurons effectively die, meaning they stop outputting anything other than $0$. If a neuron's weights get updated such that the weighted sum of its inputs is negative for all instances in the training set, it starts outputting $0$, and since the gradient of ReLU is $0$ for negative inputs, Gradient Descent is unlikely to bring it back to life.

To solve this problem, use a variant of ReLU such as the *leaky ReLU*:

$$ \text{LeakyReLU}_\alpha(z)=\max(\alpha z, z) $$

where the hyperparameter $\alpha$ is the slope of the function for $z<0$ and is typically set to $0.01$. Several variants have been compared:
- A huge leak ($\alpha = 0.2$) seemed to result in better performance than a small leak ($\alpha = 0.01$).
- *Randomized leaky ReLU* (RReLU), where $\alpha$ is picked randomly in a given range during training and fixed to an average value during testing. It also performed well and seemed to act as a regularizer (reducing the risk of overfitting the training set).
- *Parametric leaky ReLU* (PReLU), where $\alpha$ is learned during training (instead of being a hyperparameter). It strongly outperformed ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
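
The leaky ReLU and its slope are easy to write out by hand. This is a minimal sketch in plain Python with illustrative function names. It shows why a leaky unit cannot die: the slope for negative inputs is $\alpha$ rather than $0$.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: max(alpha * z, z), for 0 <= alpha <= 1."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope is 1 for positive inputs and alpha otherwise, so it is never 0."""
    return 1.0 if z > 0 else alpha

for z in (-2.0, 0.5, 3.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Setting `alpha=0.0` recovers the plain ReLU, whose slope for negative inputs is exactly $0$.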

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig11-2.png" width=400px alt="fig11-2" style="padding-bottom:1.0em;padding-top:2.0em;"></center>Figure 11-2. *Leaky ReLU*</div>

*Exponential linear unit* (ELU):

$$\text{ELU}_\alpha(z) =
\begin{cases}
\alpha(\exp(z)-1) & \text{if } z < 0 \\
z & \text{if } z \ge 0
\end{cases}$$

In experiments, ELU reduced training time and the network performed better on the test set.

Major differences with ReLU:
- It takes on negative values when $z<0$, which allows the unit to have an average output closer to $0$. This helps alleviate the vanishing gradients problem. The hyperparameter $\alpha$ defines the value that the function approaches when $z$ is a large negative number; it is usually set to $1$.
- It has a nonzero gradient for $z<0$, which avoids the dying units issue.
- If $\alpha=1$, the function is smooth everywhere, including around $z=0$, which helps speed up Gradient Descent, since it does not bounce as much left and right of $z=0$.
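
These properties can be checked with a direct implementation. The following is a small sketch in plain Python (illustrative function names), using $\alpha = 1$ by default.

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

# Negative outputs are bounded below by -alpha.
print(elu(-1.0), elu(-10.0))

# The gradient stays nonzero for negative inputs.
print(elu_grad(-1.0))

# With alpha = 1 the slope just left of 0 is close to the slope on the right (1).
print(elu_grad(-1e-9), elu_grad(0.0))
```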

Main drawback: it is slower to compute than ReLU and its variants (because of the exponential function). During training this is compensated by the faster convergence rate. However, at test time an ELU network is slower than a ReLU network.

<font color=blue>*TIP*</font>
>ELU > leaky ReLU > ReLU > tanh > logistic. Use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.

To use the leaky ReLU activation function, create a `LeakyReLU` instance like this:

In [None]:
from tensorflow import keras

# LeakyReLU is a layer; the instance is callable, so it can serve as the activation.
leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
layer = keras.layers.Dense(10, activation=leaky_relu,
                           kernel_initializer="he_normal")

The SELU activation function is a scaled version of ELU. It is used here with LeCun normal initialization:

In [None]:
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")
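
To make "scaled version of ELU" concrete, here is a sketch of SELU in plain Python. The two constants are the ones published with the self-normalizing networks paper that introduced SELU. They are not stated in these notes, so treat them as an outside assumption, and the function name is illustrative.

```python
import math

# Constants from the SELU paper (assumed here, not given in the notes).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    if z > 0:
        return SELU_SCALE * z
    return SELU_SCALE * SELU_ALPHA * (math.exp(z) - 1.0)

for z in (-5.0, -1.0, 0.0, 1.0):
    print(z, selu(z))
```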

#### Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training.

*Batch Normalization* (BN) addresses the vanishing/exploding gradients problems, and more generally the problem that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change (the *Internal Covariate Shift* problem).

The technique adds an operation to each layer that zero-centers and normalizes its inputs, then scales and shifts the result. In order to zero-center and normalize the inputs, the algorithm needs to estimate the mean and standard deviation of the inputs over the current mini-batch.

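Each batch-normalized layer has four parameter vectors. Only the first two are learned through backpropagation; the last two are estimated during training and used in place of the mini-batch statistics at test time:
- $\gamma$, scale (trainable)
- $\beta$, offset (trainable)
- $\mu$, mean (moving average, not trainable)
- $\sigma$, standard deviation (moving average, not trainable)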
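
The training-time computation for a single feature can be sketched in a few lines of plain Python. This is a simplified illustration rather than the Keras implementation: the function name is hypothetical, and `eps` is a small smoothing term assumed here to avoid division by zero.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    m = len(values)
    mean = sum(values) / m                           # mini-batch mean
    var = sum((v - mean) ** 2 for v in values) / m   # mini-batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(batch, gamma=2.0, beta=1.0)

# After the transform, the feature has mean close to beta
# and standard deviation close to gamma.
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((o - out_mean) ** 2 for o in out) / len(out))
print(out_mean, out_std)
```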

With BN, the vanishing gradients problem is strongly reduced, and the networks are much less sensitive to the weight initialization. Much larger learning rates can be used, significantly speeding up the learning process. BN also acts like a regularizer.

Batch Normalization does, however, add some complexity to the model. Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization performs before playing with Batch Normalization.

##### Implementing Batch Normalization with TensorFlow

TensorFlow provides the `tf.nn.batch_normalization()` function, but you must compute the mean and standard deviation yourself. Instead, use the higher-level `tf.layers.batch_normalization()` function or, as in the Keras model that follows, the `keras.layers.BatchNormalization` layer.

In [8]:
import tensorflow.keras as keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_5 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_10 (Batc (None, 784)               3136      
_________________________________________________________________
dense_8 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_11 (Batc (None, 300)               1200      
_________________________________________________________________
dense_9 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_12 (Batc (None, 100)               400       
_________________________________________________________________
dense_10 (Dense)             (None, 10)               

In [9]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization_10/gamma:0', True),
 ('batch_normalization_10/beta:0', True),
 ('batch_normalization_10/moving_mean:0', False),
 ('batch_normalization_10/moving_variance:0', False)]
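
The parameter counts in the summary follow directly from these four variables per feature. As a quick arithmetic check in plain Python (the helper name is illustrative):

```python
layer_inputs = [784, 300, 100]  # features seen by each BatchNormalization layer

# gamma, beta, moving mean, moving variance: one value per feature each.
bn_params = [4 * n for n in layer_inputs]

# Only gamma and beta are trained by backpropagation.
bn_trainable = [2 * n for n in layer_inputs]
non_trainable_total = sum(p - t for p, t in zip(bn_params, bn_trainable))

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

print(bn_params)               # [3136, 1200, 400]
print(dense_params(784, 300))  # 235500
print(dense_params(300, 100))  # 30100
print(non_trainable_total)     # 2368
```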

The BN algorithm uses *exponential decay* to compute the running averages, which is why it requires the *momentum* parameter: given a new value $v$, the running average $\hat v$ is updated as

$$ \hat v \leftarrow \hat v \times \text{momentum} + v \times (1-\text{momentum})$$
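
The update rule is a one-liner. In this plain Python sketch, the function name and the momentum value of 0.9 are illustrative choices. The running average starts at 0 and moves toward a constant batch mean of 10.

```python
def update_running_average(v_hat, v, momentum):
    """Exponential moving average: keep `momentum` of the old value."""
    return v_hat * momentum + v * (1.0 - momentum)

v_hat = 0.0
history = []
for batch_mean in [10.0, 10.0, 10.0]:
    v_hat = update_running_average(v_hat, batch_mean, momentum=0.9)
    history.append(v_hat)

print(history)
```

A momentum closer to 1 makes the running average smoother but slower to catch up with the current statistics.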

#### Gradient Clipping

A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold (this is mostly useful for recurrent neural networks).
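
Two common ways to clip are sketched below in plain Python (illustrative function names, operating on a flat list of gradient components). Clipping by value caps each component independently, which can change the direction of the gradient vector. Clipping by norm rescales the whole vector and preserves its direction.

```python
import math

def clip_by_value(grads, threshold):
    """Clip every component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

grads = [0.9, -100.0]
by_value = clip_by_value(grads, 1.0)
by_norm = clip_by_norm(grads, 1.0)

print(by_value)
print(by_norm)
```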

### Reusing Pretrained Layers

*Transfer Learning*: find an existing neural network that accomplishes a similar task, then just reuse its lower layers. 

Benefits:
1. It speeds up training considerably.
2. It requires much less training data.

<font color=blue>*NOTE*</font>
1. If the inputs of the new task do not have the same size as those of the original task, you need to add a preprocessing step to resize them to the size expected by the original model.
2. Generally, transfer learning will only work well if the inputs have similar low-level features.

#### Reusing a TensorFlow Model

#### Reusing Models from Other Frameworks

#### Freezing the Lower Layers

#### Caching the Frozen Layer

#### Tweaking, Dropping, or Replacing the Upper Layers

#### Model Zoos

https://github.com/tensorflow/models

https://github.com/ethereon/caffe-tensorflow

#### Unsupervised Pretraining

If you have plenty of unlabeled training data, train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as *Restricted Boltzmann Machines* (RBMs) or autoencoders. Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can fine-tune the network using supervised learning (i.e., with backpropagation).

#### Pretraining on an Auxiliary Task

One last option is to train a first neural network on an auxiliary task for which you can easily obtain or
generate labeled training data, then reuse the lower layers of that network for your actual task. The first
neural network's lower layers will learn feature detectors that will likely be reusable by the second
neural network.

*Max margin learning*: train a first network to output a score for each training instance, and use a cost
function that ensures that a good instance's score is greater than a bad instance's score by at least some
margin.
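
One common way to express such a cost is a hinge on the score difference. This is a minimal sketch in plain Python, where the function name and the default margin of 1 are assumptions made for illustration.

```python
def max_margin_loss(good_score, bad_score, margin=1.0):
    """Zero when the good score beats the bad score by at least `margin`."""
    return max(0.0, margin - (good_score - bad_score))

print(max_margin_loss(3.0, 1.0))  # margin satisfied, no penalty
print(max_margin_loss(1.2, 1.0))  # good score wins, but by too little
print(max_margin_loss(0.0, 2.0))  # bad score wins, large penalty
```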

### Faster Optimizers

Ways to speed up large deep neural network training:
- Apply a good initialization strategy for the connection weights
- Use a good activation function
- Use Batch Normalization
- Reuse parts of a pretrained network (possibly built on an auxiliary task or using unsupervised learning)
- Use a faster optimizer than the regular Gradient Descent optimizer (Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, Adam, and Nadam optimization)

#### Momentum Optimization

