In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras

# The Vanishing/Exploding Gradients Problem

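During backpropagation, the chain rule multiplies gradients layer by layer as they flow backward through a neural network. As a result, gradients can sometimes

* Grow larger and larger (exploding gradients)
* Grow smaller and smaller (vanishing gradients)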

This means that different layers can learn at very different speeds. Two common causes are

* A poor weight initialization technique
* A bad choice of activation function (sigmoid, for example)
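
To make the chain-rule effect concrete, here is a minimal sketch of a toy network with one sigmoid neuron per layer. The helper `backprop_factor`, the layer count, and the weight values are hypothetical and chosen only for illustration; a real network multiplies matrices rather than scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def backprop_factor(pre_activations, weight):
    """Product of the per-layer chain-rule factors weight * sigmoid'(z)."""
    factor = 1.0
    for z in pre_activations:
        factor *= weight * sigmoid_grad(z)
    return factor

n_layers = 10
zs = [0.0] * n_layers  # assumed pre-activations, where the sigmoid slope is largest

vanishing = backprop_factor(zs, weight=1.0)  # each layer scales the gradient by 0.25
exploding = backprop_factor(zs, weight=8.0)  # each layer scales the gradient by 2.0

print(vanishing, exploding)
```

Even in the best case for the sigmoid (slope 0.25), ten layers with unit weights shrink the gradient by a factor of about a million, while large weights make it grow instead.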




## Weight Initialization Technique

For signals to flow properly in both directions, two conditions should hold:

1. The variance of each layer's outputs should equal the variance of its inputs.
2. The gradients should have equal variance before and after flowing through a layer.

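Both conditions cannot hold exactly unless a layer has as many inputs as outputs, but there are two ways to approximate them in practice (Glorot initialization). Let $f_{in}$ be the number of input neurons and $f_{out}$ be the number of output neurons. Then $f_{avg}=\frac{1}{2}(f_{in}+f_{out})$.

1. Draw weights from a normal distribution with $\mu=0$ and $\sigma^2 = 1/f_{avg}$
2. Draw weights from a uniform distribution between $-r$ and $r$ with $r=\sqrt{3/f_{avg}}$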
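
The two options above give the same weight variance, since a uniform distribution on $[-r, r]$ has variance $r^2/3 = 1/f_{avg}$. The following is a rough sketch that checks this numerically using only the standard library. The function names and the layer sizes (300 inputs, 100 outputs) are assumptions made for illustration, not part of Keras.

```python
import math
import random
import statistics

def glorot_normal_std(fan_in, fan_out):
    # Standard deviation for the normal variant: sigma^2 = 1 / fan_avg
    fan_avg = 0.5 * (fan_in + fan_out)
    return math.sqrt(1.0 / fan_avg)

def glorot_uniform_limit(fan_in, fan_out):
    # Limit r for the uniform variant: r = sqrt(3 / fan_avg)
    fan_avg = 0.5 * (fan_in + fan_out)
    return math.sqrt(3.0 / fan_avg)

fan_in, fan_out = 300, 100          # hypothetical layer sizes
rng = random.Random(0)              # fixed seed for reproducibility
n_weights = fan_in * fan_out

sigma = glorot_normal_std(fan_in, fan_out)
r = glorot_uniform_limit(fan_in, fan_out)

normal_w = [rng.gauss(0.0, sigma) for _ in range(n_weights)]
uniform_w = [rng.uniform(-r, r) for _ in range(n_weights)]

target_var = 1.0 / (0.5 * (fan_in + fan_out))
print(target_var, statistics.pvariance(normal_w), statistics.pvariance(uniform_w))
```

Both empirical variances should land close to `target_var`, which is why the two variants are interchangeable as far as the variance conditions are concerned.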

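By default, Keras uses Glorot initialization with a uniform distribution. You can change this when creating a layer through the `kernel_initializer` argument, for example to switch to He initialization: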

In [2]:
keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal')

<tensorflow.python.keras.layers.core.Dense at 0x2057ff7ef28>

See the table on page 334 and the code on page 335 for more initialization options.

## Activation Functions

ReLU is not always the best choice: for $x<0$ its slope is $0$, so neurons can "die" (they stop being updated during gradient descent) once they enter this region. **ELU** and **SELU** often work better. **ELU** replaces the flat $0$ segment of ReLU on the negative $x$ axis with an exponential curve, $\alpha(e^x-1)$, that levels off at $-\alpha$. **SELU** is a scaled variant of **ELU**. For the network to self-normalize, **SELU** requires

1. Input features must be standardized (mean 0, standard deviation 1).
2. Hidden layer weights must be initialized with LeCun normal initialization. This means that $\sigma^2=1/f_{in}$ instead of $1/f_{avg}$.
3. The network architecture must be sequential (no skip connections, recurrent layers, or other non-sequential topologies).

For implementations of these activation functions, see pages 337-338.
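
As a quick illustration of how these functions differ from ReLU on the negative axis, here is a minimal scalar sketch in plain Python. The function names are my own; the two SELU constants are the standard values from the SELU definition. In Keras itself, the built-in `"elu"` and `"selu"` activation strings (with `kernel_initializer="lecun_normal"` for SELU) would normally be used instead.

```python
import math

# Standard SELU constants (alpha and scale).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(x):
    # Flat at 0 for negative inputs, so the gradient there is exactly 0.
    return max(0.0, x)

def elu(x, alpha=1.0):
    # Identity for x > 0; alpha * (exp(x) - 1) otherwise, which tends to -alpha.
    return x if x > 0 else alpha * math.expm1(x)

def selu(x):
    # SELU is ELU with a specific alpha, multiplied by a fixed scale.
    return SELU_SCALE * elu(x, alpha=SELU_ALPHA)

for x in (-5.0, -1.0, 0.0, 1.0):
    print(x, relu(x), elu(x), selu(x))
```

Unlike ReLU, both ELU and SELU keep a nonzero slope for negative inputs, so a neuron that drifts into that region can still receive gradient updates.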

## Batch Normalization

Batch normalization is a technique that zero-centers and normalizes each layer's inputs, just before or after the activation function. This helps ensure that the vanishing/exploding gradients problem does not come back later in training.

See pages 338-342 for more mathematical details, but essentially it acts as a standard scaler between layers.

It is called "batch" normalization because it standardizes the inputs using

$$\hat{x} = (x-\mu_B)/\sigma_B $$

where $\mu_B$ and $\sigma_B$ (vectors, with one entry per input) are the mean and standard deviation computed over only the current batch of data (using the full dataset is not practical for stochastic gradient descent). The normalized values are then rescaled and shifted using

$$ \text{Input} = \gamma \otimes \hat{x} + \beta $$

where $\otimes$ is element-wise multiplication. Thus $\gamma$ (scale) and $\beta$ (shift) are learned parameters that play a role similar to effective neuron weights.
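
The two formulas above can be sketched for a single batch in plain Python as follows. This is only an illustration of the arithmetic, not how Keras implements the layer. The function name `batch_norm_forward`, the sample data, and the small constant `eps` (added under the square root to avoid division by zero, and not shown in the formulas above) are assumptions.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta.

    batch: list of rows, each row a list of feature values.
    eps:   assumed small smoothing constant, not part of the formulas above.
    """
    n = len(batch)
    n_features = len(batch[0])
    # Per-feature batch mean and variance (mu_B and sigma_B ** 2).
    mu = [sum(row[j] for row in batch) / n for j in range(n_features)]
    var = [sum((row[j] - mu[j]) ** 2 for row in batch) / n for j in range(n_features)]
    out = []
    for row in batch:
        x_hat = [(row[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(n_features)]
        out.append([gamma[j] * x_hat[j] + beta[j] for j in range(n_features)])
    return out, mu, var

# Hypothetical batch of 3 instances with 2 features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized, mu_B, var_B = batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
shifted, _, _ = batch_norm_forward(batch, gamma=[2.0, 2.0], beta=[5.0, 5.0])
print(mu_B, normalized, shifted)
```

With $\gamma=1$ and $\beta=0$ every feature ends up with roughly zero mean and unit variance over the batch; other values of $\gamma$ and $\beta$ let the network pick a different scale and offset.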



**Important for Coding**: Batch normalization can be accomplished as follows.

In [3]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

One parameter to tweak is **momentum**, which controls how the moving averages of the mean and standard deviation are updated from batch to batch. Let $v$ be the current batch's $\mu_B$ or $\sigma_B$ and $\hat{v}$ the corresponding moving average:

$$\hat{v} \leftarrow \hat{v} \times \text{momentum} + v \times (1-\text{momentum}) $$

If momentum is 0, the moving average is just the value from the current batch. The closer momentum is to 1 (the Keras default is 0.99), the more information the moving average retains from previous batches (like exponential smoothing in time series analysis).
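
The update rule above is a one-liner. The sketch below applies it to a few made-up batch means to show the limiting cases. The function name `update_moving_average` and the sample values are hypothetical.

```python
def update_moving_average(v_hat, v, momentum=0.99):
    # v_hat: current moving average, v: statistic from the current batch.
    return v_hat * momentum + v * (1.0 - momentum)

# Limiting cases of the formula.
only_current = update_moving_average(5.0, 2.0, momentum=0.0)  # ignores history
never_updates = update_moving_average(5.0, 2.0, momentum=1.0)  # ignores the batch

# Smoothing a short, hypothetical sequence of batch means.
batch_means = [1.0, 3.0, 2.0]
running = 0.0
for v in batch_means:
    running = update_moving_average(running, v, momentum=0.5)

print(only_current, never_updates, running)
```

A momentum of exactly 1 would freeze the moving average at its initial value, which is why values slightly below 1 are used in practice.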

Note that after training (for example, when evaluating on the test set) the batch normalization layers use the final moving averages of $\mu$ and $\sigma$ rather than the statistics of the current batch.