# Vanishing/Exploding Gradients Problems

Gradients often get smaller and smaller as backpropagation progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower-layer connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. In some cases, the opposite can happen: the gradients grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks.
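
To get a feel for both failure modes, here is a toy sketch. It assumes, purely for illustration, that every layer multiplies the backpropagated gradient by the same constant factor; real networks have no such fixed factor, and `gradient_scale` and `per_layer_factor` are hypothetical names.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Overall scaling of a gradient after passing back through n_layers,
    assuming (for illustration only) the same factor at every layer."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale


vanishing = gradient_scale(0.5, 10)  # factor below 1: shrinks toward 0
exploding = gradient_scale(1.5, 10)  # factor above 1: grows rapidly
```

With a factor below 1, the gradient that reaches the lowest layers is a tiny fraction of its value at the output; with a factor above 1, it blows up instead.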

## Nonsaturating Activation Functions

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. During training, if a neuron's weights get updated such that the weighted sum of the neuron's inputs is negative for every training instance, it will start outputting 0. Once this happens, the neuron is unlikely to recover, because the gradient of ReLU is 0 when its input is negative.

To solve this problem, you may want to use a variant of the ReLU function, such as Leaky ReLU.
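
As a minimal sketch of the idea, leaky ReLU keeps a small slope for negative inputs instead of outputting 0. The slope `alpha=0.01` is an assumed default chosen for illustration, not a value fixed by the text above.

```python
def relu(z):
    """Plain ReLU: output and gradient are both 0 for negative inputs."""
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a small slope alpha (assumed value) for negative inputs."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Gradient of leaky ReLU: never exactly 0, so the unit can still learn."""
    return 1.0 if z > 0 else alpha
```

Because the gradient for negative inputs is `alpha` rather than 0, a neuron whose weighted sum has become negative still receives a weight update and can recover.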

**Exponential linear unit (ELU)**

It looks a lot like the ReLU function, with a few major differences:
* First, it takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem. The hyperparameter α defines the value that the ELU function approaches (namely −α) when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.
* Second, it has a nonzero gradient for z < 0, which avoids the dying units issue.
* Third, when α = 1 the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
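
The three properties above can be checked on a small sketch of ELU, using its usual definition: ELU(z) = z for z > 0, and α(exp(z) − 1) otherwise. Here `alpha` stands for the hyperparameter from the first bullet, and the function names are illustrative.

```python
import math


def elu(z, alpha=1.0):
    """ELU: identity for positive z, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)


def elu_grad(z, alpha=1.0):
    """Gradient of ELU: 1 for positive z, alpha * exp(z) otherwise (never 0)."""
    return 1.0 if z > 0 else alpha * math.exp(z)


large_negative = elu(-50.0)    # approaches -alpha
slope_near_zero = elu_grad(-1e-9)  # close to 1 when alpha = 1, so no kink at z = 0
```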

The main drawback of the ELU activation function is that it is slower to compute than ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

## Batch Normalization

Although using ELU or any variant of ReLU can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it does not guarantee that they won't come back later during training.

Batch Normalization is a technique designed to address the vanishing/exploding gradients problems and, more generally, the problem that the distribution of each layer's inputs changes during training as the parameters of the previous layers change.

The technique consists of adding an operation in the model just before the activation function of each layer: it zero-centers and normalizes the inputs, then scales and shifts the result using two new parameters per layer (one for scaling, the other for shifting). In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.

At test time, there is no mini-batch to compute the empirical mean and standard deviation, so instead you simply use the whole training set's mean and standard deviation. In practice, these are typically estimated efficiently during training using a moving average.
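
The training-time and test-time operations can be sketched for a single input feature as follows. This is a simplified illustration, not a framework implementation: `gamma` and `beta` stand for the scaling and shifting parameters, `eps` is an assumed small constant added to avoid division by zero, and all function names are hypothetical.

```python
import math


def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    Returns the outputs plus the batch mean and variance.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    std = math.sqrt(var + eps)
    out = [gamma * (x - mean) / std + beta for x in batch]
    return out, mean, var


def batch_norm_predict(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """At test time, reuse statistics estimated from the training data."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


outputs, batch_mean, batch_var = batch_norm_train([1.0, 2.0, 3.0, 4.0])
```

In a real model, `gamma` and `beta` would be learned by Gradient Descent along with the other weights, and the statistics passed to `batch_norm_predict` would come from the whole training set rather than from a single mini-batch.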

Batch Normalization also acts like a regularizer, reducing the need for other regularization techniques (such as dropout).

Batch Normalization does, however, add some complexity to the model. Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer.

## Gradient Clipping

A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold.
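
A minimal sketch of clipping, assuming the gradients are held in a flat list of floats. `clip_by_value` matches the description above; `clip_by_norm` is a common variant that is not described in these notes and is included only for comparison. Both function names are illustrative.

```python
import math


def clip_by_value(grads, threshold):
    """Clip each gradient component into [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]


def clip_by_norm(grads, threshold):
    """Variant (not from the text): rescale the whole gradient vector so
    that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    return [g * threshold / norm for g in grads]


clipped_values = clip_by_value([0.5, -3.0, 10.0], 1.0)
clipped_norm = clip_by_norm([3.0, 4.0], 1.0)
```

Clipping by value can change the direction of the gradient vector, since each component is clipped independently, whereas clipping by norm preserves the direction and only shortens the vector.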

# Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch. Instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then reuse the lower layers of that network. This is called transfer learning. It not only speeds up training considerably, but also requires much less training data.

The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, you can try keeping all the hidden layers and just replacing the output layer.

## Reusing Models from Other Frameworks

If the model was trained using another framework, you will need to load the weights manually, then assign them to the appropriate variables.
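
The manual assignment step might look roughly like the sketch below. Everything here is hypothetical: the weights are plain lists, the variable names (`fc1.weight`, `hidden1/kernel`) are invented, and `assign_weights` is not a function from any framework. A real workflow would load the weights with the source framework's own tools first.

```python
def assign_weights(model_vars, foreign_weights, name_map):
    """Return a copy of model_vars with mapped weights copied in.

    name_map maps a weight name in the other framework's checkpoint to a
    variable name in the new model. All names here are hypothetical.
    """
    updated = dict(model_vars)
    for foreign_name, var_name in name_map.items():
        if var_name not in updated:
            raise KeyError(f"unknown model variable: {var_name}")
        values = foreign_weights[foreign_name]
        if len(values) != len(updated[var_name]):
            raise ValueError(f"shape mismatch for {var_name}")
        updated[var_name] = list(values)
    return updated


model_vars = {"hidden1/kernel": [0.0, 0.0], "output/kernel": [0.0]}
foreign_weights = {"fc1.weight": [0.4, -0.2]}
name_map = {"fc1.weight": "hidden1/kernel"}
new_vars = assign_weights(model_vars, foreign_weights, name_map)
```

Checking names and shapes explicitly is worthwhile here, because frameworks often differ in naming conventions and in how they lay out weight tensors.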

## Freezing the Lower Layers

It is generally a good idea to "freeze" the weights of the reused lower layers when training the new DNN: if the lower-layer weights are fixed, then the higher-layer weights will be easier to train (because they won't have to learn a moving target).
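
Conceptually, freezing just means excluding some parameters from the Gradient Descent update. The sketch below assumes one scalar weight per layer and uses made-up layer names; `sgd_step` is an illustrative helper, not a framework API.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one Gradient Descent step, leaving frozen parameters untouched."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Hypothetical network with one scalar weight per layer.
params = {"hidden1": 0.5, "hidden2": -0.3, "output": 0.1}
grads = {"hidden1": 1.0, "hidden2": 1.0, "output": 1.0}
frozen = {"hidden1", "hidden2"}  # reused lower layers

updated = sgd_step(params, grads, frozen)
```

Only `output` changes after the step; the frozen layers keep their pretrained values, so the output layer always sees the same features for a given input.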

## Caching the Frozen Layers

Since the frozen layers won't change, it is possible to cache the output of the topmost frozen layer for each training instance. Since training goes through the whole dataset many times, this will give you a huge speed boost, as you will only need to go through the frozen layers once per training instance (instead of once per epoch).
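
The saving can be seen in a toy sketch. It assumes the frozen layers are a fixed function of the input and that only a single top-layer weight `w` is trained; the function names, the learning rate, and the data are invented for illustration.

```python
def frozen_layers(x):
    """Stand-in for the frozen lower layers: a fixed transform of the input."""
    frozen_layers.calls += 1  # count how often the frozen layers are run
    return max(0.0, 2.0 * x - 1.0)


frozen_layers.calls = 0


def train_top_layer(inputs, targets, n_epochs=50, lr=0.01):
    """Train a single top-layer weight w on cached frozen-layer outputs."""
    # Run the frozen layers once per training instance and cache the result.
    cached = [frozen_layers(x) for x in inputs]
    w = 0.0
    for _ in range(n_epochs):
        for h, y in zip(cached, targets):
            grad = 2.0 * (w * h - y) * h  # gradient of the squared error
            w -= lr * grad
    return w


inputs = [1.0, 2.0, 3.0]
targets = [3.0, 9.0, 15.0]  # equal to 3 times the frozen-layer outputs
w = train_top_layer(inputs, targets)
```

Here the frozen layers run 3 times in total instead of 150 times (3 instances over 50 epochs), while the top layer still trains on exactly the same features.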