# Training Deep Neural Networks and How to Solve Common Problems

## The Vanishing/Exploding Gradient Problem

During backpropagation, gradients are computed layer by layer, from the output layer back toward the input layer. They often get smaller and smaller as they reach the lower layers, so the parameters of those layers barely change and training never converges to a good solution. This is called <b>The Vanishing Gradient Problem</b>. The opposite can also happen: the gradients can grow bigger and bigger until the weight updates become huge and training diverges, which is called <b>The Exploding Gradient Problem</b>. Most deep neural networks suffer from unstable gradients, but there are a few ways to mitigate the issue.
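
To get a feel for the scale of the effect, here is a toy sketch. It assumes, purely for illustration, that every layer multiplies the backpropagated gradient by the same constant factor; in a real network the per-layer factor depends on the weights and on the activation derivatives. The function name `gradient_scale` is hypothetical.

```python
def gradient_scale(per_layer_factor: float, n_layers: int) -> float:
    """Magnitude of a unit gradient after flowing back through n_layers,
    assuming each layer scales it by the same constant factor (toy model)."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale


vanishing = gradient_scale(0.5, 20)  # factor < 1: shrinks exponentially
exploding = gradient_scale(1.5, 20)  # factor > 1: grows exponentially

print(f"factor 0.5 over 20 layers: {vanishing:.2e}")
print(f"factor 1.5 over 20 layers: {exploding:.2e}")
```

Even a modest per-layer factor compounds quickly with depth, which is why the techniques below aim to keep that factor close to 1.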

### Glorot and He initialization

In [1]:
import tensorflow as tf

### Keras uses Glorot Uniform initialization by default ###

### He Init
dense = tf.keras.layers.Dense(50, activation='relu', 
                             kernel_initializer='he_normal')

### OR ###
dense = tf.keras.layers.Dense(50, activation='relu', 
                             kernel_initializer=tf.keras.initializers.HeNormal())



In [2]:
### He init with uniform distribution based on fan avg
he_avg_init = tf.keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
dense = tf.keras.layers.Dense(50, activation='relu', 
                             kernel_initializer=he_avg_init)
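
As a rough sketch of what these initializers compute: each one targets a weight variance of `scale / fan`, where `fan` is the number of inputs (`fan_in`), the number of outputs (`fan_out`), or their average (`fan_avg`). The helper `variance_scaling` below is a hypothetical pure-Python re-derivation of that arithmetic, not the Keras implementation, and the layer size is made up for illustration.

```python
import math


def variance_scaling(fan_in, fan_out, scale, mode, distribution):
    """Return the target weight variance and, for a uniform distribution,
    the limit r such that weights are sampled from [-r, r]."""
    fan = {
        "fan_in": fan_in,
        "fan_out": fan_out,
        "fan_avg": (fan_in + fan_out) / 2,
    }[mode]
    variance = scale / fan
    # A uniform distribution on [-r, r] has variance r**2 / 3.
    limit = math.sqrt(3 * variance) if distribution == "uniform" else None
    return variance, limit


fan_in, fan_out = 100, 50  # hypothetical layer: 100 inputs, 50 units

glorot_var, glorot_limit = variance_scaling(fan_in, fan_out, 1.0, "fan_avg", "uniform")
he_var, _ = variance_scaling(fan_in, fan_out, 2.0, "fan_in", "normal")
he_avg_var, he_avg_limit = variance_scaling(fan_in, fan_out, 2.0, "fan_avg", "uniform")

print(f"Glorot uniform:     variance={glorot_var:.4f}, limit={glorot_limit:.4f}")
print(f"He normal:          variance={he_var:.4f}")
print(f"He fan_avg uniform: variance={he_avg_var:.4f}, limit={he_avg_limit:.4f}")
```

The He variants double the variance relative to Glorot or LeCun, which compensates for ReLU-style activations zeroing out roughly half of their inputs.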

### Better Activation Functions

One of the reasons unstable gradients occur is a poor choice of activation function. <br>
ReLU is a good activation function because it is quick to compute and does not saturate for positive values. However, ReLU can suffer from a problem called <b>dying ReLUs</b>: during training, some neurons "die", meaning they only ever output 0. This happens when a neuron's weights get tweaked in such a way that its weighted input sum is negative for every instance in the training set. ReLU outputs 0 for all negative inputs, so its gradient is also 0 there and gradient descent can no longer update the neuron. In some cases you may find that half of a network's neurons are dead, especially with a large learning rate. <br>
To solve this, you can use a variant of ReLU called <b>Leaky ReLU</b>, where the hyperparameter $\alpha$ sets the slope for negative inputs: <br>
$$
\text{LeakyReLU}(x) = \begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$ <br>

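Because the slope for x < 0 is small but non-zero, the gradient never becomes exactly 0, so neurons never die completely: they may go quiet for a while, but they always have a chance to recover.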
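
A minimal pure-Python sketch of the difference, using the same slope of 0.2 as the Keras cells in this section. The function names are illustrative and not part of any library.

```python
def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken as 0 at x == 0)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.2) -> float:
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.2) -> float:
    """Derivative of leaky ReLU (using the x <= 0 branch at x == 0)."""
    return 1.0 if x > 0 else negative_slope


for x in (-3.0, -1.0, 2.0):
    print(f"x={x:+.1f}  leaky_relu={leaky_relu(x):+.2f}  "
          f"relu_grad={relu_grad(x):.1f}  leaky_relu_grad={leaky_relu_grad(x):.1f}")
```

For negative inputs the ReLU gradient is exactly 0, so no learning signal reaches the neuron, while the leaky ReLU gradient stays at `negative_slope`.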

In [3]:
leaky_relu = tf.keras.layers.LeakyReLU(negative_slope=0.2)
dense = tf.keras.layers.Dense(50, activation=leaky_relu, kernel_initializer='he_normal')

In [4]:
### You can use LeakyReLU as a layer in the network instead of an activation function. It makes no difference
model = tf.keras.models.Sequential([
    # [...]  # more layers
    tf.keras.layers.Dense(50, kernel_initializer="he_normal"),  # no activation
    tf.keras.layers.LeakyReLU(negative_slope=0.2),  # activation as a separate layer
    # [...]  # more layers
])

In [5]:
### Using PReLU instead
prelu = tf.keras.layers.PReLU()

A problem with ReLU, LeakyReLU, and PReLU is that they are not smooth functions: their derivative changes abruptly at x = 0. This discontinuity can make gradient descent bounce around the optimum, which slows down convergence.

#### ELU and SELU

<b>Exponential Linear Unit (ELU)</b> is another activation function that can outperform ReLU and its variations in some cases. <br>
$$
\text{ELU}(x) = \begin{cases}
x & \text{if } x > 0 \\
\alpha (\exp(x) - 1) & \text{if } x \leq 0
\end{cases}
$$<br>
Its advantages are the following. It takes on negative values when x < 0, which lets the unit's average output sit closer to 0 and helps alleviate the vanishing gradients problem. It has a non-zero gradient for x < 0, which avoids dead neurons. Finally, when $ \alpha = 1 $ the function is smooth everywhere, including around x = 0, which helps gradient descent converge faster. <br>
<b>Note: Use He initialization with ELU. ELU is also slower to compute than ReLU because of the exponential.</b>
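
The smoothness claim can be checked directly: the derivative of ELU just left of 0 is $\alpha e^{x} \approx \alpha$, while just right of 0 it is 1, so the two sides only agree when $\alpha = 1$. The sketch below is a plain-Python illustration with hypothetical function names, not the Keras implementation.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of ELU with respect to x."""
    return 1.0 if x > 0 else alpha * math.exp(x)


eps = 1e-9
for alpha in (0.5, 1.0):
    left, right = elu_grad(-eps, alpha), elu_grad(eps, alpha)
    print(f"alpha={alpha}: slope left of 0 = {left:.3f}, right of 0 = {right:.3f}")

# For very negative inputs the output saturates near -alpha,
# but the gradient stays positive.
print(f"elu(-10) = {elu(-10.0):.5f}, elu_grad(-10) = {elu_grad(-10.0):.2e}")
```

The last line also shows the trade-off: the gradient for very negative inputs is positive but tiny, and each evaluation needs an `exp` call, which is where the extra compute cost comes from.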

In [6]:
dense = tf.keras.layers.Dense(50, activation='elu', 
                             kernel_initializer='he_normal')

<b>Scaled ELU (SELU)</b> is a scaled variant of ELU. If you build a deep neural network whose hidden layers are just a stack of Dense layers using SELU, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which addresses the vanishing/exploding gradients problem. As a result, SELU can outperform other activation functions in such networks. <br>
Considerations to keep in mind about SELU:
- The input features must be standardized: mean 0 and standard deviation 1.
- Every hidden layer's weights must be initialized using LeCun normal init.
- The self-normalizing property is only guaranteed with plain MLPs. If you try SELU with other architectures, like recurrent neural networks or networks with skip connections (e.g., Wide & Deep nets), it will probably not outperform ELU.
- You cannot use regularization techniques such as ℓ1 or ℓ2 regularization, max-norm, batch normalization, or regular dropout with SELU, because they break self-normalization.
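
The first two constraints exist because self-normalization relies on each neuron's weighted input sum being roughly standard normal: standardized inputs combined with LeCun initialization (weight variance `1 / fan_in`) give pre-activations with variance close to 1. The sketch below checks numerically that, under the idealized assumption that the pre-activation is exactly N(0, 1), SELU's two constants map it back to mean 0 and variance 1. The helper names are hypothetical; the constants are the standard SELU values.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: float) -> float:
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def gaussian_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def expectation(f, lo: float = -10.0, hi: float = 10.0, steps: int = 20_000) -> float:
    """Trapezoidal estimate of E[f(z)] for z ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) * gaussian_pdf(lo) + f(hi) * gaussian_pdf(hi))
    for i in range(1, steps):
        z = lo + i * h
        total += f(z) * gaussian_pdf(z)
    return total * h


mean_out = expectation(selu)
var_out = expectation(lambda z: selu(z) ** 2) - mean_out ** 2
print(f"mean of SELU output:     {mean_out:.4f}")
print(f"variance of SELU output: {var_out:.4f}")
```

If the inputs are not standardized or the weights use a different variance, the pre-activations no longer start near N(0, 1), and this argument no longer applies as stated.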

In [7]:
dense = tf.keras.layers.Dense(50, activation="selu",
                              kernel_initializer="lecun_normal")

Because of these constraints, SELU is not widely used, and it is often outperformed by other activation functions such as GELU, Swish, and Mish.

#### GELU, Swish, and Mish

<b>GELU</b> is a smooth variant of the ReLU activation function. Its more complex, curvy shape seems to make it easier for gradient descent to fit complex patterns. However, it is more computationally expensive than the other activation functions, and the performance boost doesn't always justify the extra cost. <br>
<b>Sigmoid linear unit (SiLU), aka Swish</b>, is very close to GELU. The generalized Swish function adds one extra hyperparameter, $\beta$, that scales the input of the sigmoid and can make it more effective in certain cases. <br>
<b>Mish</b> is a smooth, nonconvex, and nonmonotonic variant of ReLU. It is similar to Swish and GELU.<br>
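
For reference, the three functions can be sketched in a few lines of plain Python using their usual definitions: GELU(x) = x·Φ(x), where Φ is the standard normal CDF; Swish(x) = x·sigmoid(βx); and Mish(x) = x·tanh(softplus(x)). These are illustrative scalar versions with hypothetical names, not the Keras implementations.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    # x times the standard normal CDF, written with the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float, beta: float = 1.0) -> float:
    # beta = 1 gives SiLU.
    return x * sigmoid(beta * x)


def mish(x: float) -> float:
    # softplus(x) = log(1 + exp(x))
    return x * math.tanh(math.log1p(math.exp(x)))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  swish={swish(x):+.4f}  mish={mish(x):+.4f}")
```

All three dip slightly below zero for moderately negative inputs and then climb back toward zero as x decreases further, which is what makes them nonmonotonic.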

<b style="color: blue; font-size: 1.2em;">Which activation function should you use?</b><br>

- ReLU is a good default for simple tasks: it's often just as good as the more sophisticated activation functions, plus it's fast to compute and many libraries and hardware accelerators provide ReLU-specific optimizations.
- Swish is probably a better default for more complex tasks.
- Mish can give you slightly better results than Swish, but at the cost of more computation time.
- If you care about runtime latency, then LeakyReLU or parametrized leaky ReLU (PReLU) might be a better option.
- For deep MLPs, give SELU a try, but make sure to respect its constraints.
- If you have time and computing power, use cross-validation to compare activation functions and pick the best one.

### Batch Normalization

Batch normalization is another great technique for addressing the vanishing/exploding gradient problem.