# Optimization difficulties: Vanishing and exploding gradients

In this session we explore the problem of vanishing and exploding gradients, which frequently cause optimization difficulties, especially in very deep architectures (those with many layers).

What are vanishing gradients? As you may remember from session 02, backpropagation repeatedly multiplies the gradient of the loss function with respect to a layer's output by that layer's weight matrix. This means that if the weights have a small magnitude, each multiplication scales the gradients down. If this happens layer after layer, the gradients become too small after a few layers to drive effective learning in the earlier layers. This is especially important for recurrent neural networks, where the effective depth is not just the number of layers but the number of layers times the number of time steps!

Conversely, if the weights are initialized with a variance that is too high, gradients can explode: backpropagating through each layer scales the gradients up, leading to unstable learning and possibly even numeric overflow.
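Both failure modes can be reproduced in a few lines. The following is a minimal NumPy sketch under simplifying assumptions: the layers are purely linear (no activation functions), the weights are Gaussian, and the names `backprop_norms` and `scale` are illustrative rather than part of the session code. It pushes a unit-norm gradient backwards through 20 layers and records its norm after each step.

```python
import numpy as np

def backprop_norms(scale, depth=20, width=64, seed=0):
    """Norm of a gradient after each backward step through `depth` linear layers.

    Weights are drawn from N(0, (scale / sqrt(width))**2), so each layer
    multiplies the expected gradient norm by roughly `scale`.
    """
    rng = np.random.default_rng(seed)
    grad = rng.standard_normal(width)
    grad /= np.linalg.norm(grad)          # start from a unit-norm gradient
    norms = [1.0]
    for _ in range(depth):
        weights = rng.standard_normal((width, width)) * scale / np.sqrt(width)
        grad = weights.T @ grad           # backward pass of a linear layer y = W x
        norms.append(float(np.linalg.norm(grad)))
    return norms

for scale in (0.5, 1.0, 2.0):
    print(f"scale={scale}: final gradient norm {backprop_norms(scale)[-1]:.3e}")
```

With `scale` below 1 the norm shrinks roughly geometrically with depth, with `scale` above 1 it grows roughly geometrically, and only values near 1 keep it in a usable range.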

Ideally, the learning dynamics sit somewhere between unstable and overly stable learning, a regime sometimes referred to as the "edge of chaos", and we want to initialize the weights accordingly.

You may also remember from session 02 that we also multiply gradients by the derivative of the activation function. If we use sigmoid or tanh activation functions (saturating non-linearities in general), the gradients w.r.t. the inputs become vanishingly small when the output approaches saturation. This can help with exploding gradient issues but it does not help with vanishing gradients. Partially linear activation functions such as the rectified linear unit (ReLU) are better in this regard. ReLUs multiply the gradient either by 0 (if the input is negative) or by 1 (if the input is positive).
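To make the size of these factors concrete, here is a small sketch using only the Python standard library. The helper names are illustrative, and the ReLU derivative at exactly 0 is set to 0 by convention (it is mathematically undefined there).

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)**2, at most 1."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (convention at 0)."""
    return 1.0 if x > 0 else 0.0

for x in (-5.0, 0.0, 0.5, 5.0):
    print(f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  "
          f"tanh'={tanh_grad(x):.4f}  relu'={relu_grad(x):.0f}")

# Upper bound on the factor contributed by 10 stacked sigmoid activations
print("largest possible factor through 10 sigmoids:", 0.25 ** 10)
```

Even in the best case (all pre-activations at 0), ten sigmoid layers scale the gradient by at most 0.25 to the power of 10, which is below one millionth, whereas active ReLU units pass the gradient through unchanged.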

ML researchers have come up with a plethora of tools to avoid vanishing gradients or circumvent them to some degree. The main options for dealing with vanishing gradients are:

- smart scaling of the variance of the initial weights
- partially linear activation functions 
- using skip connections in deep architectures, allowing for gradients to flow more directly from the output layer to the earlier layers
- batch, layer, instance and group normalization, rescaling and off-setting the input in different ways such that gradients can flow better

# Initialization

Several initialization schemes are in common use. One of the most frequently used is **He-Uniform initialization** (we used this in the example of session 02), where the weights are sampled from a uniform distribution with min=-limit and max=limit, where limit = sqrt(6 / n_inputs).

In [15]:
import tensorflow as tf
tensor = tf.random.uniform((32,10))
layer = tf.keras.layers.Dense(32,kernel_initializer=tf.keras.initializers.HeUniform())
layer(tensor)
print("variance of weights", layer.trainable_variables[0].numpy().var())

tensor = tf.random.uniform((32,100))
layer = tf.keras.layers.Dense(128,kernel_initializer=tf.keras.initializers.HeUniform())
layer(tensor)
print("variance of weights", layer.trainable_variables[0].numpy().var())

variance of weights 0.19253153
variance of weights 0.019861732


**Glorot Uniform initialization** (the default kernel initializer of Keras layers such as `Dense`) is the same as He-Uniform but with limit = sqrt(6 / (n_inputs + n_outputs)).

In [16]:
tensor = tf.random.uniform((32,10))
layer = tf.keras.layers.Dense(32,kernel_initializer=tf.keras.initializers.GlorotUniform())
layer(tensor)
print("variance of weights", layer.trainable_variables[0].numpy().var())

tensor = tf.random.uniform((32,100))
layer = tf.keras.layers.Dense(128,kernel_initializer=tf.keras.initializers.GlorotUniform())
layer(tensor)
print("variance of weights", layer.trainable_variables[0].numpy().var())

variance of weights 0.04799015
variance of weights 0.008802493
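The printed variances can be checked by hand. A uniform distribution on [-limit, limit] has variance limit² / 3, so He-Uniform yields 2 / n_inputs and Glorot Uniform yields 2 / (n_inputs + n_outputs). The sketch below evaluates these formulas for the two layer shapes used above; the helper names are illustrative and not part of any library.

```python
import math

def he_uniform_limit(n_inputs):
    return math.sqrt(6.0 / n_inputs)

def glorot_uniform_limit(n_inputs, n_outputs):
    return math.sqrt(6.0 / (n_inputs + n_outputs))

def uniform_variance(limit):
    # Var[U(-limit, limit)] = (2 * limit)**2 / 12 = limit**2 / 3
    return limit ** 2 / 3.0

for n_in, n_out in ((10, 32), (100, 128)):
    he = uniform_variance(he_uniform_limit(n_in))
    glorot = uniform_variance(glorot_uniform_limit(n_in, n_out))
    print(f"{n_in}->{n_out}: He variance {he:.4f}, Glorot variance {glorot:.4f}")
```

The theoretical values (0.2 and 0.02 for He-Uniform, about 0.0476 and 0.0088 for Glorot Uniform) are close to the sampled variances printed by the cells above; the small differences come from the finite number of sampled weights.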


# Residual and skip connections

In [192]:
# no skip or residual connections
inputs = tf.random.uniform((32,128))  # define the inputs before they are watched by the tape

layer1 = tf.keras.layers.Dense(32, activation="relu")
layer2 = tf.keras.layers.Dense(32, activation="relu")
layer3 = tf.keras.layers.Dense(32, activation="relu")
layer4 = tf.keras.layers.Dense(32, activation="relu")

with tf.GradientTape() as tape:
    tape.watch(inputs)
    layer1_out = layer1(inputs)
    
    layer2_out = layer2(layer1_out)
    
    layer3_out = layer3(layer2_out)
    
    layer4_out = layer4(layer3_out)
    mean = tf.reduce_mean(layer4_out)
tf.print("gradient norm for the input", tf.norm(tape.gradient(mean, inputs),ord="euclidean"))

gradient norm for the input 0.0202007275


In [193]:
inputs = tf.random.uniform((32,128))

layer1 = tf.keras.layers.Dense(32, activation="relu")
layer2 = tf.keras.layers.Dense(32, activation="relu")
layer3 = tf.keras.layers.Dense(32, activation="relu")
layer4 = tf.keras.layers.Dense(32, activation="relu")

with tf.GradientTape() as tape:
    tape.watch(inputs)
    layer1_out = layer1(inputs)
    layer2_out = layer2(layer1_out)
    layer3_out = layer3(layer2_out+layer1_out) # residual connection
    layer4_out = layer4(layer3_out+layer2_out) # residual connection
    mean = tf.reduce_mean(layer4_out)
tf.print("gradient norm for the input", tf.norm(tape.gradient(mean, inputs),ord="euclidean"))

gradient norm for the input 0.0324114449


In [194]:
inputs = tf.random.uniform((32,128))

layer1 = tf.keras.layers.Dense(32, activation="relu")
layer2 = tf.keras.layers.Dense(32, activation="relu")
layer3 = tf.keras.layers.Dense(32, activation="relu")
layer4 = tf.keras.layers.Dense(32, activation="relu")

with tf.GradientTape() as tape:
    tape.watch(inputs)
    layer1_out = layer1(inputs)
    
    layer2_out = layer2(tf.concat([layer1_out, 
                                   inputs],axis=-1)) # skip connection
    
    layer3_out = layer3(tf.concat([layer2_out, 
                                   layer1_out, 
                                   inputs],axis=-1)) # skip connection
    
    layer4_out = layer4(tf.concat([layer3_out, 
                                   layer2_out, 
                                   layer1_out],axis=-1)) # skip connection
    mean = tf.reduce_mean(layer4_out)
tf.print("gradient norm for the input", tf.norm(tape.gradient(mean, inputs),ord="euclidean"))

gradient norm for the input 0.0306733567
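One way to see why an additive connection helps: a block computing `x + f(x)` has Jacobian `I + J_f`, so the gradient always has a direct identity path in addition to the path through the layer's weights. The following NumPy sketch illustrates this under simplifying assumptions that differ from the cells above: the layers are linear, every layer has its own residual connection, and the weights are deliberately small (`scale=0.5`). The function name and parameters are illustrative.

```python
import numpy as np

def input_gradient_norm(residual, depth=20, width=64, scale=0.5, seed=0):
    """Norm of the gradient w.r.t. the input of a deep stack of linear layers."""
    rng = np.random.default_rng(seed)
    grad = np.ones(width) / np.sqrt(width)    # unit-norm gradient at the output
    for _ in range(depth):
        weights = rng.standard_normal((width, width)) * scale / np.sqrt(width)
        through_layer = weights.T @ grad      # path through the layer's weights
        # y = x + W x has Jacobian I + W, so the identity path adds `grad` itself
        grad = grad + through_layer if residual else through_layer
    return float(np.linalg.norm(grad))

plain = input_gradient_norm(residual=False)
with_skip = input_gradient_norm(residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {with_skip:.3e}")
```

In the plain stack the gradient shrinks at every layer, while in the residual stack the identity path keeps it from vanishing. Concatenation-based skip connections give earlier activations a similarly short path to the output.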


# Normalization

**Batch normalization** uses statistics over the batch dimension to rescale and shift the inputs. It behaves differently during training and testing (batch statistics are used during training, moving averages of them at test time), which is why you need to set the `training` argument (as with dropout).

1. Calculation of Mean and Variance:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i $$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2$$

2. Normalization:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

3. Scaling and Shifting:

$$y_i = \gamma \hat{x}_i + \beta$$

where:
- $X$ is the mini-batch of activations in a layer,
- $m$ is the number of examples in the mini-batch,
- $x_i$ represents the $i$-th element in the mini-batch,
- $\mu$ is the mean of the mini-batch,
- $\sigma^2$ is the variance of the mini-batch,
- $\epsilon$ is a small value for numerical stability,
- $\hat{x}_i$ is the normalized value of $x_i$,
- $\gamma$ and $\beta$ are learnable parameters for scaling and shifting the normalized values,
- $y_i$ is the final output after normalization, scaling, and shifting.
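The three steps above can be sketched in NumPy as follows. This is a minimal training-mode version for inputs of shape (batch, features); the function name `batch_norm_train` and the value of `eps` are assumptions made for illustration, and the moving averages used at test time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization for inputs of shape (batch, features)."""
    mu = x.mean(axis=0)                      # step 1: per-feature mean over the batch
    var = x.var(axis=0)                      # step 1: per-feature variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 2: normalize
    return gamma * x_hat + beta              # step 3: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print("per-feature mean:", y.mean(axis=0).round(4))
print("per-feature var: ", y.var(axis=0).round(4))
```

With $\gamma = 1$ and $\beta = 0$, every feature of the output has a mean of approximately 0 and a variance slightly below 1 (because of $\epsilon$), regardless of the scale and offset of the input.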

**Layer Normalization** does not compute any statistics over the batch dimension and thus also works with small batch sizes. Like batch normalization, it learns affine parameters $\gamma$ and $\beta$, but it maintains a separate pair for each normalized input value. When it normalizes over several axes (for example height, width and channels), it therefore has more learnable parameters than batch normalization, which keeps one pair per feature or channel.

1. Calculation of Mean and Variance:

$$\mu = \frac{1}{H} \sum_{i=1}^{H} x_i$$

$$\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2$$

where:
- $H$ is the number of units/neurons in the layer (the mean and variance are computed over the units of the same layer).

2. Normalization:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

3. Scaling and Shifting:
$$y_i = \gamma_i \hat{x}_i + \beta_i$$

where:
- $x_i$ represents the $i$-th element in the layer's units,
- $\mu$ is the mean of the layer's units,
- $\sigma^2$ is the variance of the layer's units,
- $\epsilon$ is a small value for numerical stability,
- $\hat{x}_i$ is the normalized value of $x_i$,
- $\gamma_i$ and $\beta_i$ are learnable parameters for scaling and shifting the normalized values,
- $y_i$ is the final output after normalization, scaling, and shifting.
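A corresponding NumPy sketch for layer normalization, assuming inputs of shape (batch, H) and normalization over the last axis only; the function name `layer_norm` and the value of `eps` are illustrative. The last line checks that an example is normalized identically whether or not the rest of the batch is present.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-3):
    """Layer normalization over the last axis of inputs shaped (batch, H)."""
    mu = x.mean(axis=-1, keepdims=True)      # step 1: mean over the units of each example
    var = x.var(axis=-1, keepdims=True)      # step 1: variance over the units of each example
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 2: normalize
    return gamma * x_hat + beta              # step 3: per-unit scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 8))
gamma, beta = np.ones(8), np.zeros(8)

full_batch = layer_norm(x, gamma, beta)
single_example = layer_norm(x[:1], gamma, beta)
print("per-example mean:", full_batch.mean(axis=-1)[:3].round(4))
print("same result without the rest of the batch:",
      np.allclose(full_batch[:1], single_example))
```

Compared with a batch normalization implementation, the only structural change is the axis over which the statistics are computed.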

The key difference between Batch Normalization and Layer Normalization is that Batch Normalization normalizes the input using the statistics from a mini-batch of data, whereas Layer Normalization normalizes the input using statistics calculated over the same layer's units (i.e., across the feature dimension). Layer Normalization is commonly used in recurrent neural networks (RNNs), where sequence lengths and effective mini-batch sizes may vary significantly, making Batch Normalization less suitable.


**Input normalization** (for example, standardizing each input feature to zero mean and unit variance) helps the network center its activations around zero at initialization.
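A minimal sketch of one common form of input normalization, per-feature standardization, assuming that the statistics are estimated on the training set and then reused for all other data. The helper `fit_standardizer` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_standardizer(train_x, eps=1e-8):
    """Return a function that standardizes inputs with training-set statistics."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    return lambda x: (x - mean) / (std + eps)

rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 255.0, size=(1000, 3))   # synthetic raw features
test_x = rng.uniform(0.0, 255.0, size=(200, 3))

standardize = fit_standardizer(train_x)
train_z = standardize(train_x)
test_z = standardize(test_x)     # reuse the training statistics, do not refit
print("train mean:", train_z.mean(axis=0).round(3))
print("train std: ", train_z.std(axis=0).round(3))
```

Reusing the training statistics for the test data keeps both splits on the same scale and avoids leaking information from the test set into preprocessing.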