# Chapter 11: Training Deep Neural Networks
So far we have looked at shallow networks and simple deep neural networks. What if we want to tackle a complex problem, like detecting hundreds of types of objects in high-resolution images? We will need much deeper neural networks.

Some problems:
1. Vanishing and exploding gradients, which make the lower layers hard to train
2. Not enough training data, or data that is too costly to label
 - solution: transfer learning and unsupervised pretraining
3. Training can be extremely slow
 - solution: different optimizers
4. Risk of severely overfitting the training set because of the millions of parameters
 - solution: regularization techniques
 
This is deep learning!

## Vanishing and Exploding Gradient Problems
Recall that the backpropagation algorithm goes from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with respect to each parameter (connection) in the network, it uses these gradients to update each parameter with a gradient descent step.

But the gradients often get smaller and smaller as the algorithm progresses down to the lower layers (near the input). As a result, the gradient descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called the **vanishing gradient problem**.

The **exploding gradient problem** is the opposite: the gradients grow bigger and bigger until the layers receive extremely large weight updates and the algorithm diverges. This surfaces most often in recurrent neural networks (RNNs).

More generally, deep neural networks may suffer from unstable gradients, where different layers learn at widely different speeds.

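In a 2010 paper, Glorot and Bengio identified a culprit: the combination of the logistic (sigmoid) activation function with the random weight initialization that was popular at the time. With this combination, the variance of each layer's outputs is greater than the variance of its inputs, so the variance keeps increasing as the signal feeds forward until the activation function saturates in the top layers. When the logistic function saturates, its output is pinned near 0 or 1 and its derivative is extremely close to 0, so backpropagation has almost no gradient to pass down to the lower layers.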
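To make the saturation argument concrete, here is a minimal sketch using only the standard library. The helper names `sigmoid` and `sigmoid_grad` are illustrative, not from the paper. It shows that the logistic derivative is at most 0.25 (at z = 0) and shrinks quickly as |z| grows, so the activation's contribution to a gradient that is multiplied through many layers can become tiny.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and decays quickly as |z| grows (saturation).
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_grad(z):.5f}")

# Upper bound on the product of activation derivatives across n sigmoid layers.
# This ignores the weight terms, so it is only the activation's contribution.
for n_layers in (1, 5, 10, 20):
    print(f"{n_layers:2d} layers: at most {0.25 ** n_layers:.3e}")
```

The bound ignores the connection weights, which can shrink or amplify the gradient further; it is only meant to show why saturating activations make the problem worse.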

### Glorot and He Initialization (Connection weight initialization)
Glorot and Bengio propose a way to alleviate the vanishing/exploding gradient problems: the signal must flow "properly in both directions" (feeding forward and backpropagating). We need:

1. The variance of the outputs of each layer to be equal to the variance of its inputs.
2. The gradients to have equal variance before and after flowing through a layer when backpropagating.

It is not possible to guarantee both conditions unless the layer has an equal number of inputs and neurons, but Glorot and Bengio propose a compromise: *initialize the connection weights randomly with one of the following strategies*:

**Glorot Initialization:**

*Normal distribution with mean 0 and variance* $\sigma^2 = \frac{1}{fan_{avg}}$

Or a *uniform distribution between* $[-r, +r]$ *with* $r = \sqrt{\frac{3}{fan_{avg}}}$

where $fan_{avg} = (fan_{in} + fan_{out}) / 2$

and $fan_{in}$ is **the number of inputs to a layer**, and $fan_{out}$ is **the number of neurons (outputs) in a layer**.
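The two Glorot strategies can be sketched in plain Python. The function names and the 300-input, 100-neuron layer are made up for illustration; a real network would use the framework's built-in initializer. The sketch also checks empirically that both strategies give the same variance, since a uniform distribution on $[-r, +r]$ has variance $r^2/3 = 1/fan_{avg}$.

```python
import math
import random
import statistics

def glorot_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from N(0, 1 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1.0 / fan_avg)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-r, +r), r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = math.sqrt(3.0 / fan_avg)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(42)
fan_in, fan_out = 300, 100                 # hypothetical layer sizes
target = 1.0 / ((fan_in + fan_out) / 2)    # sigma^2 = 1 / fan_avg

for name, init in (("normal", glorot_normal), ("uniform", glorot_uniform)):
    weights = init(fan_in, fan_out, rng)
    flat = [w for row in weights for w in row]
    print(f"{name:8s} empirical variance={statistics.pvariance(flat):.5f}  target={target:.5f}")
```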

Glorot initialization can speed up training considerably, and it is one of the tricks that contributed to the success of deep learning.

Note: Replacing $fan_{avg}$ with $fan_{in}$ gives LeCun initialization, which is equivalent to Glorot initialization when $fan_{in} = fan_{out}$.

Some papers provide similar **connection-weight initialization strategies for different activation functions**. The strategies really only differ by the variance's scale and whether it uses $fan_{avg}$ or $fan_{in}$.

Table 11-1. Initialization parameters for each type of activation function

|Initialization| Activation Functions| $\sigma^2$ (Normal)|
|--------------|---------------------|--------------------|
| Glorot| None, Tanh, Logistic, Softmax| $\frac{1}{fan_{avg}}$|
|He| ReLU and variants (and ELU)| $\frac{2}{fan_{in}}$|
|LeCun| SELU| $\frac{1}{fan_{in}}$|
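As a quick check of how the three rows differ, the sketch below evaluates the table's $\sigma^2$ formulas for one layer. `init_variance` is a hypothetical helper written for this note, and the 300-input, 100-neuron layer is an arbitrary example.

```python
def init_variance(scheme, fan_in, fan_out):
    """Return sigma^2 of the normal distribution for a scheme from Table 11-1."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1 / fan_avg,  # none, tanh, logistic, softmax
        "he": 2 / fan_in,       # ReLU and variants
        "lecun": 1 / fan_in,    # SELU
    }
    return variances[scheme]

# Hypothetical layer with 300 inputs and 100 neurons.
for scheme in ("glorot", "he", "lecun"):
    print(f"{scheme:7s} sigma^2 = {init_variance(scheme, 300, 100):.5f}")
```

When $fan_{in} = fan_{out}$, the Glorot and LeCun rows give the same value, and He is always twice LeCun.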

Keras uses Glorot initialization with a uniform distribution by default. You can switch to He initialization by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"` when creating a layer:

In [None]:
from tensorflow import keras  # needed once per notebook session
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

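We can also choose between $fan_{in}$ and $fan_{avg}$ in the equations of Table 11-1 by using the `VarianceScaling` initializer. For example, He initialization with a uniform distribution based on $fan_{avg}$:

Note: `scale=2.` is the numerator in the He initialization equation (see Table 11-1).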

In [None]:
he_avg_init = keras.initializers.VarianceScaling(scale=2., 
                                                 mode='fan_avg',
                                                 distribution='uniform')

keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

## Nonsaturating Activation Functions
Although the sigmoid activation roughly resembles the behavior of biological neurons, in practice other activation functions such as ReLU behave better in deep networks: ReLU does not saturate for positive values and it is fast to compute.

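The ReLU function is not perfect, though. It suffers from the *dying ReLUs* problem: during training, some neurons "die", meaning they stop outputting anything other than 0. This happens when a neuron's weights are updated so that the weighted sum of its inputs is negative for every training instance; ReLU outputs 0 for negative input and its gradient there is also 0, so gradient descent can no longer change the neuron. In some cases, more than half of a network's neurons will die.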

One **solution** to dying ReLUs is the **leaky ReLU**: its small slope for negative inputs ensures that a leaky ReLU never dies. **A 2015 paper concluded that, in its experiments, *the leaky variants always outperformed the strict ReLU activation***.

The hyperparameter $\alpha$ sets how much the leaky ReLU "leaks": it is the slope of the function for negative inputs. $\alpha = 0.2$ is a huge leak, and $\alpha = 0.02$ is a small leak.
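A minimal scalar sketch of the two functions, using the standard definition $\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$. The function names are illustrative, and the loop compares the two leak sizes mentioned above.

```python
def relu(z):
    """Strict ReLU: 0 for negative inputs, identity otherwise."""
    return max(0.0, z)

def leaky_relu(z, alpha):
    """max(alpha * z, z): identity for z >= 0, slope alpha for z < 0.

    Assumes 0 <= alpha < 1, otherwise max() would pick the wrong branch.
    """
    return max(alpha * z, z)

for z in (-10.0, -1.0, 0.0, 2.0):
    print(f"z={z:6.1f}  relu={relu(z):5.2f}  "
          f"leaky(0.2)={leaky_relu(z, 0.2):5.2f}  leaky(0.02)={leaky_relu(z, 0.02):5.2f}")
```

Because the negative branch has slope $\alpha$ instead of 0, the gradient never vanishes completely there, which is what keeps the neuron trainable.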

Two other Leaky ReLU variants:
- Randomized Leaky ReLU (RReLU)
 - $\alpha$ is picked randomly within a given range during training and fixed to an average value during testing
 - It performs well and seems to reduce the risk of overfitting
- Parametric Leaky ReLU (PReLU)
 - $\alpha$ is a parameter that is learned
 - Strongly outperforms other variants on huge image datasets, but tends to overfit small ones.
 
In 2015 a new activation function was proposed: the **Exponential Linear Unit** (ELU). In the authors' experiments it outperformed all the ReLU variants. The drawback is that it is slower to compute than ReLU and its variants, although during training this is offset by a faster convergence rate.
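For reference, ELU is defined as $z$ for $z \ge 0$ and $\alpha(e^{z} - 1)$ for $z < 0$. The scalar sketch below uses the common choice $\alpha = 1$ as its default; the function name is illustrative. The exponential is what makes it slower to compute than ReLU.

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    # math.expm1(z) computes exp(z) - 1 accurately for small |z|.
    return z if z >= 0 else alpha * math.expm1(z)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z={z:5.1f}  elu={elu(z):8.5f}")
```

For very negative inputs the output approaches $-\alpha$ rather than 0, so the function takes negative values and keeps a nonzero gradient for $z < 0$.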

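A 2017 paper showed that if you build a neural network exclusively from a stack of dense layers, and all hidden layers use the SELU activation function (Scaled ELU), the network *self-normalizes*: the output of each layer tends to preserve mean 0 and standard deviation 1 during training, which alleviates the vanishing/exploding gradient problems.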

*Conditions for self-normalization*:
 - Inputs must be standardized (mean 0 and standard deviation 1)
 - Every hidden layer's weights must be initialized with LeCun normal initialization: `kernel_initializer="lecun_normal"`
 - The architecture must be sequential. You can still try SELU in other architectures (such as recurrent networks or networks with skip connections), but self-normalization is not guaranteed
 - The paper only guarantees self-normalization when all layers are dense, but SELU has been reported to work well in CNNs, too.
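The conditions above can be tried out on a toy example. The sketch below pushes standardized random inputs through a stack of bias-free dense layers with LeCun normal weights and SELU activations, then reports the mean and standard deviation of the final activations. It uses only the standard library so it runs anywhere, and the layer sizes are kept small for speed. The helper names, sizes and seed are arbitrary choices for illustration, and this is a forward pass at initialization only, not a trained network.

```python
import math
import random
import statistics

# Constants from the SELU paper.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * math.expm1(z))

def dense_selu(inputs, weights):
    """Apply one bias-free dense layer followed by SELU to a batch of rows."""
    columns = list(zip(*weights))  # one tuple of incoming weights per neuron
    return [[selu(sum(x * w for x, w in zip(row, col))) for col in columns]
            for row in inputs]

def run_stack(n_layers=15, width=50, batch=100, seed=42):
    """Return (mean, std) of the activations after n_layers SELU layers."""
    rng = random.Random(seed)
    # Standardized inputs: every feature drawn from N(0, 1).
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    sigma = math.sqrt(1.0 / width)  # LeCun normal: variance 1 / fan_in
    for _ in range(n_layers):
        weights = [[rng.gauss(0.0, sigma) for _ in range(width)] for _ in range(width)]
        acts = dense_selu(acts, weights)
    flat = [a for row in acts for a in row]
    return statistics.fmean(flat), statistics.pstdev(flat)

mean, std = run_stack()
print(f"after 15 SELU layers: mean={mean:.2f}, std={std:.2f}")
```

With these settings the statistics should stay in the neighborhood of mean 0 and standard deviation 1 instead of collapsing or blowing up, though the exact values depend on the seed and the layer width.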
 

**For hidden layers, the general order of preference for activation functions** is:

SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic

- If your network cannot self-normalize then ELU > SELU
- If you need fast runtime then Leaky ReLU > SELU and ELU
- If you have spare time then RReLU and PReLU are worth checking with k-fold CV

To use leaky ReLU:

In [None]:
leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
layer = keras.layers.Dense(10, activation=leaky_relu,
                           kernel_initializer="he_normal")

# For PReLU, substitute LeakyReLU with PReLU()

In [None]:
# For SELU activation
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

## Batch Normalization