# The Vanishing/Exploding Gradients Problems

## Definitions:
- vanishing gradients: gradients get smaller and smaller as the algorithm progresses down to the lower layers (closer to input layer); as a result, the gradient descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution
- exploding gradients: (generally surfaces in recurrent neural networks) the gradients grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges
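
Both definitions can be seen in one toy model: by the chain rule, the gradient that reaches a lower layer is roughly a product of one factor per layer above it. The sketch below assumes, purely for illustration, that every layer contributes the same factor (`per_layer_factor` is a hypothetical name, not something from these notes). A factor below 1 shrinks the gradient toward zero, and a factor above 1 blows it up.

```python
def backpropagated_scale(per_layer_factor, n_layers):
    """Scale of the gradient after flowing back through n_layers layers,
    assuming (for illustration only) each layer multiplies it by the same factor."""
    return per_layer_factor ** n_layers


for factor in (0.25, 1.0, 1.5):  # hypothetical per-layer factors
    for depth in (5, 20, 50):
        scale = backpropagated_scale(factor, depth)
        print(f"factor={factor:<4} depth={depth:<2} -> gradient scale {scale:.3e}")
```

The value 0.25 is not arbitrary: it is the largest value the derivative of the logistic sigmoid can take, so it is an optimistic factor for a sigmoid layer.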


## Preface:
- The backpropagation algorithm works by going from the output layer to the input layer (forward pass first, then backward pass)
- After computing the gradient of the cost function with regard to each parameter in the network, the algorithm uses these gradients to update each parameter with a gradient descent step
- DNNs suffer from unstable gradients; different layers may learn at widely different speeds.
- Vanishing/exploding gradients were among the reasons why DNNs were mostly abandoned in the early 2000s
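
To make the gradient descent update concrete, here is a minimal sketch of a plain update applied layer by layer. The layer names, weights, gradients, and learning rate are all made-up values; the only point is that a layer receiving a tiny gradient barely moves, which is what "learning at different speeds" looks like in practice.

```python
def gradient_descent_step(weights, gradients, learning_rate):
    """Return updated weights: w <- w - learning_rate * gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


# Hypothetical weights and gradients for three layers of a network.
layers = {
    "lower (near input)": ([0.5, -0.3], [1e-7, -2e-7]),
    "middle": ([0.5, -0.3], [1e-3, -2e-3]),
    "upper (near output)": ([0.5, -0.3], [1e-1, -2e-1]),
}

for name, (weights, gradients) in layers.items():
    updated = gradient_descent_step(weights, gradients, learning_rate=0.1)
    largest_change = max(abs(u - w) for u, w in zip(updated, weights))
    print(f"{name:<20} largest weight change: {largest_change:.1e}")
```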


## Problem: 
- Caused by a combination of the logistic sigmoid activation function and the weight initialization technique that was most popular at the time (a Gaussian distribution with a mean of 0 and a std. dev. of 1)
    - this activation function and initialization scheme caused the variance of the outputs of each layer to be much greater than the variance of its inputs
    - Going forward in the network, the variance (how far the numbers in the set are spread from the mean) keeps increasing after each layer until the activation function saturates at the top layers
    - the logistic function has a mean of 0.5, not 0, which makes it perform slightly worse than hyperbolic tangent (tanh) activation function
- In the logistic activation function when the inputs are large (negative or positive) the value is extremely close to 0 or 1 and the derivatives are extremely close to 0.
    - With such a large variance at the top layers (the ranges are far away from the mean of 0.5), there would basically be no gradient to propagate back through the network's lowest layer.
    - The small gradient that exists will keep getting diluted as backpropagation progresses down through the top layers, so nothing gets left for the lower layers (**thus vanishing**)
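
The two effects described above (variance growth under a standard normal initialization, and near-zero sigmoid derivatives once the inputs are large) can be reproduced with a small simulation. This is a sketch under assumed values: the layer sizes, the seed, and the helper names are hypothetical, and only a single dense layer is simulated.

```python
import math
import random
import statistics


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def layer_preactivations(inputs, n_neurons, rng):
    """Weighted sums of one dense layer whose weights are drawn from N(0, 1)."""
    return [
        sum(rng.gauss(0.0, 1.0) * x for x in inputs)
        for _ in range(n_neurons)
    ]


rng = random.Random(42)       # fixed seed, arbitrary choice
fan_in, n_neurons = 100, 500  # hypothetical layer sizes
inputs = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
z = layer_preactivations(inputs, n_neurons, rng)

input_variance = statistics.pvariance(inputs)
z_variance = statistics.pvariance(z)
mean_grad = statistics.fmean([sigmoid_grad(v) for v in z])

print(f"variance of inputs:        {input_variance:.2f}")
print(f"variance of weighted sums: {z_variance:.2f}")
print(f"mean sigmoid derivative:   {mean_grad:.4f} (maximum possible is 0.25)")
```

With these settings the weighted sums should be far more spread out than the inputs, so most neurons sit in the flat parts of the sigmoid and pass back very little gradient.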
    
    
## Solution:
- Glorot and Bengio (who identified the cause of the problem) proposed a way to alleviate it, so that the signal flows properly in the forward direction when making predictions and properly backward when backpropagating gradients
    - Don't let the signal die out, nor make it explode or saturate
- they propose that: 
    1.  the variance of the inputs and the outputs of each layer should be the same
    2. gradients have equal variance before and after flowing through a layer in the reverse direction
- It is **not actually possible** to do both unless # of inputs == # of neurons (*fan-in* == *fan-out*)
- Compromise that works:
    - **connection weights initialization: random, where fan_avg=(fan_in+fan_out)/2**
        - called *Xavier initialization* or *Glorot initialization*
            - normal distribution with mean 0 and variance sigma^2=1/fan_avg
            - or uniform distribution between -r and +r, with r=sqrt(3/fan_avg)
- different activation functions have different initializations and variance requirements (activation functions -> initialization -> variance of the normal distribution):
    - No activation, tanh, logistic (sigmoid), softmax -> Glorot init -> 1/fan_avg variance
    - ReLU + variants -> He init -> 2/fan_in
    - SELU -> LeCun init -> 1/fan_in
- "*He initialization*" - the initialization strategy for the ReLU activation function and its variants
    - the weights are initialized with the size of the previous layer in mind, which helps gradient descent reach a good minimum of the cost function faster and more efficiently
    - The weights are still random but differ in range depending on the size of the previous layer of neurons
    - This provides a controlled initialization hence the faster and more efficient gradient descent
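
The variance rules listed above translate directly into a few lines of code. The sketch below is not how Keras implements its initializers; it is a plain-Python illustration with hypothetical function names and layer sizes, which computes the target variance for each scheme, samples weights, and checks that the sampled variance lands near the target.

```python
import math
import random
import statistics


def fan_avg(fan_in, fan_out):
    return (fan_in + fan_out) / 2


def init_variance(scheme, fan_in, fan_out):
    """Target weight variance for each scheme listed in the notes."""
    if scheme == "glorot":
        return 1.0 / fan_avg(fan_in, fan_out)
    if scheme == "he":
        return 2.0 / fan_in
    if scheme == "lecun":
        return 1.0 / fan_in
    raise ValueError(f"unknown scheme: {scheme}")


def uniform_limit(variance):
    """Limit r such that U(-r, r) has the given variance (variance = r**2 / 3)."""
    return math.sqrt(3.0 * variance)


def sample_weights(scheme, fan_in, fan_out, rng, distribution="normal"):
    """Draw fan_in * fan_out weights with the scheme's target variance."""
    variance = init_variance(scheme, fan_in, fan_out)
    n_weights = fan_in * fan_out
    if distribution == "normal":
        std = math.sqrt(variance)
        return [rng.gauss(0.0, std) for _ in range(n_weights)]
    r = uniform_limit(variance)
    return [rng.uniform(-r, r) for _ in range(n_weights)]


rng = random.Random(0)
fan_in, fan_out = 300, 100  # hypothetical layer sizes
for scheme in ("glorot", "he", "lecun"):
    target = init_variance(scheme, fan_in, fan_out)
    weights = sample_weights(scheme, fan_in, fan_out, rng)
    sampled = statistics.pvariance(weights)
    print(f"{scheme:<7} target variance {target:.5f}, sampled {sampled:.5f}")
```

For the Glorot case, `uniform_limit(1 / fan_avg)` gives r = sqrt(3 / fan_avg), which matches the uniform variant in the list above.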

#### Keras default: Glorot initialization + uniform distribution


#### Changing initializations of a layer:

In [None]:
from tensorflow.keras.layers import Dense

# He initialization (normal distribution) paired with the ReLU activation
Dense(10, activation="relu", kernel_initializer="he_normal")

# Nonsaturating Activation Functions
- Glorot and Bengio found that unstable gradients were caused partly by a poor choice of activation function
    - The ReLU activation function works well in DNNs because it does not saturate for positive values

### Problem with ReLU:
- **dying ReLUs:**
    - neurons lose the ability to output anything other than 0 during training
        - can be caused by a large learning rate
    - It dies when its weights are tweaked so that its weighted sum (input W*X) is negative for all instances in the training set
        - *Does the bias make it 0?*
        - Gradient descent won't affect it anymore because its gradient is zero when input is negative
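
A dead neuron can be reproduced by hand. The sketch below uses a hypothetical three-instance training set and hypothetical parameters, standing in for a neuron whose weights were pushed too far by a large update. The weighted sum here includes the bias term, which is an assumption of this sketch. Because every weighted sum is negative, every gradient is zero and no update can move the parameters.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def weighted_sum(weights, bias, x):
    """Weighted sum of the inputs plus the bias (bias included by assumption)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def neuron_gradients(weights, bias, instances, upstream=1.0):
    """Gradients of a single ReLU neuron's parameters, summed over instances.

    `upstream` stands in for the gradient arriving from the layers above.
    """
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x in instances:
        local = upstream * relu_grad(weighted_sum(weights, bias, x))
        grad_w = [g + local * xi for g, xi in zip(grad_w, x)]
        grad_b += local
    return grad_w, grad_b


# Hypothetical training set and parameters after an overly large update.
instances = [[0.5, 1.0], [1.5, 0.2], [0.1, 0.9]]
weights, bias = [-1.0, -2.0], -0.5

sums = [weighted_sum(weights, bias, x) for x in instances]
grad_w, grad_b = neuron_gradients(weights, bias, instances)
print("weighted sums:", sums)         # all negative for these values
print("gradients:", grad_w, grad_b)   # all zero, so no update can revive it
```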
        
### Solution for ReLU:
- **Leaky ReLU**
    - LeakyReLU_alpha(z)=max(alpha*z, z); *"_alpha is a subscript"*
    - hyperparameter **alpha** defines how much the function "leaks"
        - It is the slope of the function when z < 0
        - typically set to 0.01
        - A small slope ensures that leaky ReLUs **never die**; they can go into a long coma, but they have a chance to wake up
- Leaky variants of ReLU were reported to always outperform the strict ReLU activation function
    - randomized leaky ReLU (RReLU)
        - alpha is picked randomly in a given range during training and is fixed to an average value during testing
    - parametric leaky ReLU (PReLU)
        - alpha is allowed to be optimized during training so it can be modified by backpropagation
- **Exponential Linear Unit (ELU)**
    - outperformed all ReLU variants in the authors' experiments
        - reduced training time
        - NN performed better on the test set
    - **ELU_alpha(z) = alpha(exp(z)-1) if z < 0 else z**
    - **ELU vs ReLU:**
        - takes on negative values when z < 0, which allows the mean output to be closer to 0 and helps alleviate the vanishing gradients problem
            - alpha (usually set to 1) defines the values that ELU approaches when z is a large negative number
        - Nonzero gradient for z < 0 which avoids dead neurons
        - if alpha = 1 (large negative inputs converge to -1), the function is smooth everywhere, including at z=0 (where strict ReLU has a kink), which speeds up gradient descent since it does not bounce as much to the left and right of z=0 (no abrupt change in slope there)
    - **Drawbacks:**
        - Slower to compute than ReLU and its variants because of the exponential function
            - Faster convergence rate *during training* compensates for it though
        - Slower test time than ReLU
- **Scaled ELU (SELU)**
    - scaled variant of the ELU activation func.
    - If a NN is built exclusively on a stack of dense (all neurons in prev layer connect to all neurons in current layer) layers, and if all hidden layers use the SELU activation function, then the network will **self-normalize**
        - self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard dev. of 1 during training, which solves the vanishing/exploding gradients problem
    - Will often significantly outperform other activation functions for NNs (especially DNNs)
    - **Conditions for use:**
        - Input features must be standardized (mean of 0 and standard dev of 1)
        - Every hidden layer must be initialized with the LeCun normal initialization
        - The NN must have sequential architecture, so it excludes RNNs and networks with skip connections (e.g. Wide & Deep nets). Otherwise, SELU will not self-normalize
        - SELU can help CNNs as well.
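
The formulas in this list are short enough to write out directly. The sketch below is a plain-Python illustration rather than any library's implementation: the function names and the numerical `slope` helper are hypothetical, and the two SELU constants are the values published with the activation, which these notes do not state. It does not demonstrate self-normalization, only the shape of each function and its slope for negative inputs.

```python
import math

# Constants published with the SELU activation (not stated in the notes above).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(z):
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): slope alpha for z < 0, slope 1 otherwise (needs alpha < 1)."""
    return max(alpha * z, z)


def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, z otherwise."""
    return alpha * math.expm1(z) if z < 0 else z


def selu(z):
    """ELU with a specific alpha, multiplied by a fixed scale."""
    return SELU_SCALE * elu(z, SELU_ALPHA)


def slope(fn, z, eps=1e-6):
    """Central-difference estimate of the derivative of fn at z."""
    return (fn(z + eps) - fn(z - eps)) / (2 * eps)


for z in (-10.0, -1.0, 0.5):
    print(
        f"z={z:>5}: relu={relu(z):.4f} leaky={leaky_relu(z):.4f} "
        f"elu={elu(z):.4f} selu={selu(z):.4f}"
    )

# Slopes for a negative input: ReLU is flat there, the others are not.
for fn in (relu, leaky_relu, elu, selu):
    print(f"{fn.__name__:<10} slope at z=-1: {slope(fn, -1.0):.4f}")
```

The nonzero slopes for negative inputs are what keep leaky ReLU, ELU, and SELU neurons from dying the way a strict ReLU neuron can.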
        

### So which activation function to choose?
- **SELU > ELU > leaky ReLU (and variants) > ReLU > tanh > logistic**