# Chapter 11: Training Deep Neural Networks
So far we have looked at shallow networks and simple deep neural networks. What if we want to model a complex problem, like detecting hundreds of objects in high-resolution images? We will need much deeper neural networks.

Some problems we may run into:
1. Vanishing and exploding gradients, which make the lower layers hard to train
2. Not enough training data, or data that is too costly to label
 - solution: transfer learning and unsupervised pretraining
3. Extremely slow training
 - solution: different optimizers
4. Risk of severely overfitting the training set, because the model has millions of parameters
 - solution: regularization techniques
 
This is deep learning!

## Vanishing and Exploding Gradient Problems
Recall that the backpropagation algorithm works from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with respect to each parameter (connection weight) in the network, it uses these gradients to update each parameter with a gradient descent step.

But the gradients often get smaller and smaller as the algorithm progresses toward the lower layers (near the input). As a result, the gradient descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called the **vanishing gradient problem**.

The **exploding gradient problem** is the opposite: the gradients grow larger and larger as they propagate backward, so the connection weights receive extremely large updates and training diverges. This mostly happens with RNNs (recurrent neural networks).

More generally, deep neural networks may suffer from unstable gradients, where different layers learn at very different speeds.
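As a toy illustration of why depth makes gradients unstable: by the chain rule, the gradient that reaches a lower layer is roughly a product of per-layer factors. The sketch below assumes, purely for illustration, that every layer contributes the same constant factor. The helper `gradient_scale` and the factor values are hypothetical and do not come from a real network.

```python
def gradient_scale(per_layer_factor: float, n_layers: int) -> float:
    """Scale of the gradient after backpropagating through n_layers layers,
    assuming (hypothetically) each layer multiplies it by the same factor."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale


# A factor below 1 shrinks the gradient geometrically (vanishing);
# a factor above 1 grows it geometrically (exploding).
for factor in (0.5, 1.0, 1.5):
    scale = gradient_scale(factor, 20)
    print(f"factor={factor}: gradient scale after 20 layers = {scale:.3e}")
```

Even a mild per-layer shrink or growth compounds quickly with depth, which is why the lower layers are the ones that suffer most.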

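In a 2010 paper, Glorot and Bengio identified one culprit behind unstable gradients: the combination of the logistic (sigmoid) activation function and the random weight initialization commonly used at the time. With this combination, the variance of each layer's outputs is larger than the variance of its inputs, so the variance keeps growing as the signal feeds forward until the activation function saturates in the upper layers. Saturation means the function's output is pinned near 0 or 1, where its derivative is extremely close to 0, so there is almost no gradient left to backpropagate and the gradient vanishes.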
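To see what saturation looks like numerically, the following minimal sketch evaluates the logistic function and its derivative at a few input values. The function names and the chosen inputs are illustrative assumptions, not part of the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z: float) -> float:
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative peaks at 0.25 for z = 0 and shrinks toward 0
# as |z| grows and the output approaches 0 or 1.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_derivative(z):.2e}")
```

Since the derivative never exceeds 0.25 and is nearly 0 for large inputs, a saturated logistic layer passes back only a tiny fraction of the gradient it receives.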

### Glorot and He Initialization (Connection weight init)
Glorot and Bengio propose a way to alleviate the vanishing/exploding gradient problems: the signal must flow "properly in both directions" (feeding forward and backpropagating). We need:

1. The variance of the outputs of each layer to be equal to the variance of its inputs.
2. The gradients to have equal variance before and after flowing through a layer when backpropagating.

It is not possible to guarantee both unless the layer has an equal number of inputs and neurons, but Glorot and Bengio propose a compromise that works well in practice: initialize the connection weights randomly using one of the following strategies:

**Glorot Initialization:**

*Normal distribution with mean 0 and variance* $\sigma^2 = \frac{1}{fan_{avg}}$

Or a *uniform distribution between* $-r$ *and* $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$

where $fan_{avg} = (fan_{in} + fan_{out}) / 2$

and $fan_{in}$ is **the number of inputs to a layer**, and $fan_{out}$ is **the number of neurons (outputs) in a layer**.
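The two Glorot strategies can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a library implementation: the function names `glorot_normal`, `glorot_uniform`, and `variance`, the layer sizes, and the seed are all hypothetical choices made for the example.

```python
import math
import random


def glorot_normal(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """Weights drawn from a normal distribution with mean 0 and variance 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1.0 / fan_avg)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]


def glorot_uniform(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """Weights drawn uniformly from [-r, +r] with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = math.sqrt(3.0 / fan_avg)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]


def variance(matrix: list[list[float]]) -> float:
    """Population variance of all entries in a weight matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(42)           # arbitrary seed, for repeatability only
fan_in, fan_out = 300, 100        # hypothetical layer: 300 inputs, 100 neurons
target = 1.0 / ((fan_in + fan_out) / 2)

print(f"target variance : {target:.5f}")
print(f"normal  variance: {variance(glorot_normal(fan_in, fan_out, rng)):.5f}")
print(f"uniform variance: {variance(glorot_uniform(fan_in, fan_out, rng)):.5f}")
```

Both strategies target the same variance: a uniform distribution on $[-r, +r]$ has variance $r^2 / 3$, which equals $\frac{1}{fan_{avg}}$ when $r = \sqrt{\frac{3}{fan_{avg}}}$.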

Glorot initialization can speed up training considerably, and it is one of the techniques that contributed to the current success of deep learning.

Note: Replacing $fan_{avg}$ with $fan_{in}$ gives LeCun initialization, which is equivalent to Glorot initialization when $fan_{in} = fan_{out}$.

Some papers provide similar **connection-weight initialization strategies for different activation functions**. These strategies differ only in the scale of the variance and in whether they use $fan_{avg}$ or $fan_{in}$.

Table 11-1. Initialization parameters for each type of activation function

|Initialization| Activation Functions| $\sigma^2$ (Normal)|
|--------------|---------------------|--------------------|
| Glorot| None, Tanh, Logistic, Softmax| $\frac{1}{fan_{avg}}$|
|He| ReLU and variants (and ELU)| $\frac{2}{fan_{in}}$|
|LeCun| SELU| $\frac{1}{fan_{in}}$|
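Table 11-1 can be read as a lookup of two settings per strategy: a scale factor and which fan value divides it. The sketch below encodes that reading. The names `INIT_RULES` and `init_variance` are hypothetical helpers introduced for illustration, not part of any library.

```python
# Each strategy is a (scale, mode) pair: variance = scale / fan,
# where fan is either fan_in or fan_avg. Hypothetical encoding of Table 11-1.
INIT_RULES = {
    "glorot": (1.0, "fan_avg"),  # none, tanh, logistic, softmax
    "he": (2.0, "fan_in"),       # ReLU and variants
    "lecun": (1.0, "fan_in"),    # SELU
}


def init_variance(name: str, fan_in: int, fan_out: int) -> float:
    """Variance of the normal distribution used by the named initialization."""
    scale, mode = INIT_RULES[name]
    fan = (fan_in + fan_out) / 2 if mode == "fan_avg" else fan_in
    return scale / fan


# Hypothetical layer with 300 inputs and 100 neurons.
for name in INIT_RULES:
    print(f"{name:>6}: sigma^2 = {init_variance(name, 300, 100):.5f}")
```

When `fan_in` equals `fan_out`, the Glorot and LeCun entries produce the same variance, and the He variance is exactly twice as large.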