# Activation Functions in Deep Networks

## What Is an Activation Function?
- An activation function is the **non-linear transformation** placed between linear layers.
- Deep networks = alternating linear layers and non-linear activations.
- This structure gives deep networks their **expressive power**.
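
To make the role of the non-linearity concrete, here is a minimal sketch in plain Python. The scalar weights `w1`, `b1`, `w2`, `b2` are arbitrary values chosen for illustration, not taken from any real model. Two stacked linear layers with nothing in between collapse into a single linear layer, while placing a ReLU between them gives a different, piecewise-linear map.

```python
# Illustrative scalar "layers"; the weights are arbitrary assumptions.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

def linear(x, w, b):
    return w * x + b

def stacked_linear(x):
    # Two linear layers with no activation in between.
    return linear(linear(x, w1, b1), w2, b2)

def collapsed_linear(x):
    # The same map written as ONE linear layer: w = w2*w1, b = w2*b1 + b2.
    return linear(x, w2 * w1, w2 * b1 + b2)

def stacked_with_relu(x):
    # A ReLU between the layers makes the overall map piecewise linear.
    hidden = max(0.0, linear(x, w1, b1))
    return linear(hidden, w2, b2)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([stacked_linear(x) for x in xs])
print([collapsed_linear(x) for x in xs])
print([stacked_with_relu(x) for x in xs])
```

The first two printed lists agree, while the third differs wherever the hidden pre-activation is negative.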

There are many different functions people sandwich between linear layers, including:

<br>

<img src="./images/zoo.png" width="500" style="display: block; margin: auto;">

<br>

## ReLU (Rectified Linear Unit)
- Defined as:
  $$
  \text{ReLU}(x) = \max(0, x)
  $$
- Behavior:
  - Linear for $x > 0$
  - Zero for $x < 0$
  - Gradient is:
    - 1 for $x > 0$
    - 0 for $x < 0$
    - Undefined at $x = 0$, but handled arbitrarily in practice
- Pros:
  - Very fast and simple
  - Easy to compute
- Cons:
  - **"Dead ReLU" problem**: If a unit's pre-activation becomes negative for every input and stays negative, its output and its gradient are both zero, so no gradient flows back and the unit never recovers
- Mitigation:
  - Use careful **initialization**
  - Try smaller **learning rates**
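
The behavior above can be sketched in a few lines of plain Python. This is an illustration only: the weight `w`, bias `b`, and inputs are hypothetical values, and the gradient at exactly zero is assumed to be 0, which is one common convention.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Assumed convention: the gradient at x == 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

# A single unit whose hypothetical bias has been pushed far negative.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 3.0]

for x in inputs:
    z = w * x + b              # pre-activation, negative for every input here
    grad_w = relu_grad(z) * x  # chain rule: d relu(z) / dw = relu'(z) * x
    print(x, relu(z), grad_w)
```

Every output and every weight gradient is zero for this unit, which is exactly the "dead" state: gradient descent has no signal to move `w` or `b` back into the active region.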

<br>

<img src="./images/relu2.png" width="500" style="display: block; margin: auto;">

<br>

## Leaky ReLU
- Modification of ReLU:
  $$
  \text{LeakyReLU}(x) = \max(\alpha x, x)
  $$
- Adds a small slope $0 < \alpha < 1$ on the negative side (the $\max$ form above only matches this definition when $\alpha < 1$).
- Pros:
  - Mitigates the "dead ReLU" problem by allowing a small gradient when $x < 0$
- Cons:
  - Requires tuning of $\alpha$
  - Can allow unwanted signal "leakage" through negative values
- Variants:
  - **Parametric ReLU (PReLU)**: learns $\alpha$ during training
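
A minimal plain-Python sketch of Leaky ReLU and its gradient follows. The default `alpha=0.01` is a commonly used value and is an assumption here, not something these notes prescribe.

```python
def leaky_relu(x, alpha=0.01):
    # Equivalent to max(alpha * x, x) as long as 0 < alpha < 1.
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The negative side keeps a small, constant gradient instead of 0.
    return 1.0 if x > 0 else alpha

for x in [-5.0, -1.0, 0.5, 3.0]:
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

PReLU uses the same function but treats `alpha` as a trainable parameter rather than a fixed constant.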

<br>

<img src="./images/lrelu.png" width="500" style="display: block; margin: auto;">

<br>

## ELU (Exponential Linear Unit)
- Defined as:
  $$
  \text{ELU}(x) =
  \begin{cases}
  x & \text{if } x > 0 \\
  \alpha(\exp(x) - 1) & \text{if } x \leq 0
  \end{cases}
  $$
- Pros:
  - Smooth transition at zero (exactly smooth when $\alpha = 1$) and non-zero gradients even for $x < 0$
  - Helps maintain mean activations near zero
- Cons:
  - Requires computation of $\exp(x)$, which is more expensive than ReLU's simple comparison
  - Requires tuning of $\alpha$
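
The piecewise definition above translates directly into plain Python. This is a sketch under the assumption `alpha=1.0`, a common choice that makes the gradient continuous at zero.

```python
import math

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    # For x <= 0 the gradient is alpha * exp(x): small, but never exactly 0.
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in [-5.0, -1.0, 0.0, 2.0]:
    print(x, elu(x), elu_grad(x))
```

For very negative inputs the output saturates toward `-alpha`, which is what pulls the mean activation closer to zero than ReLU does.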

<br>

<img src="./images/elu.png" width="500" style="display: block; margin: auto;">

<br>

## GELU (Gaussian Error Linear Unit)
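- Defined as:
  $$
  \text{GELU}(x) = x \, \Phi(x) = \frac{x}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
  $$
  where $\Phi$ is the standard Gaussian CDF, so it combines the linear function $x$ with the Gaussian error function.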
- Behavior:
  - Approximately linear (close to $x$) for large positive $x$
  - Dampens signal for $x < 0$ by multiplying with a value near zero
- Unique Feature:
  - Has a small **dip below zero**, so the function is non-monotonic: two different inputs may have the same output.
    This was a concern at first, but deep networks have been shown to perform better with the dip.
- Pros:
  - Empirically shown to perform well in modern deep networks
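  - Gradient is non-zero almost everywhere (it vanishes only at the bottom of the dip and decays toward zero for very negative $x$)
- Cons:
  - More expensive than ReLU (requires evaluating the Gaussian CDF, i.e. the error function)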
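
As an illustration, the exact form of GELU, $x$ multiplied by the standard Gaussian CDF, can be written with `math.erf` from the Python standard library. This is a sketch of the exact formula; deep learning libraries may also offer faster approximations.

```python
import math

def gelu(x):
    # x * Phi(x), with Phi (standard normal CDF) expressed through erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3.0, -0.75, -0.1, 0.0, 1.0, 5.0]:
    print(x, gelu(x))
```

The printed values for negative inputs show the dip: the output first decreases as `x` moves left from zero, then rises back toward zero, so the function is not monotonic.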

<br>

<img src="./images/gelu.png" width="500" style="display: block; margin: auto;">

<br>

## Sigmoid and Tanh (Legacy Functions)
- **Sigmoid**:
  $$
  \sigma(x) = \frac{1}{1 + e^{-x}}
  $$
- **Tanh**:
  $$
  \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  $$
- These are rarely used as hidden-layer activations today because:
  - They **saturate** on both ends (output approaches constant values)
  - Gradients become very small in saturation zones → hard to train
- They are only effective if activations are kept in a narrow central range
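
To illustrate saturation, the sketch below evaluates the derivatives of both functions in plain Python, using the standard identities $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$. The sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# Gradients shrink rapidly as |x| moves away from the central range.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x), tanh_grad(x))
```

The sigmoid gradient peaks at 0.25 and the tanh gradient at 1.0, both at $x = 0$; by $x = 10$ both are close to zero, which is why stacked saturating layers pass back very little gradient.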

## Summary: Which Activation Should You Use?

| Activation | Pros | Cons | Use Case |
|------------|------|------|----------|
| **ReLU** | Fast, simple | Can "die" (zero gradients) | Default starting point |
| **Leaky ReLU** | Prevents dead neurons | Needs tuning | Debugging ReLU issues |
| **PReLU** | Learns alpha | Rarely used now | Specialized networks |
| **ELU** | Smooth, always has gradient | Slower | Useful if mean-zero activations help |
| **GELU** | Strong empirical results | Slower, more complex | State-of-the-art networks |
| **Sigmoid/Tanh** | Historically common | Saturation, vanishing gradients | Avoid unless necessary |

## Practical Tips
- **Start with ReLU**.
- If ReLU units are dying (many outputs stuck at 0):
  - Try **Leaky ReLU** or **PReLU**.
- Avoid **Sigmoid** and **Tanh** unless you have a specific reason.
- For **high-performance models** (e.g., Transformers such as BERT), consider **GELU**.
- PyTorch default initializations are usually good.
- Use a **small learning rate** so that large updates do not blow up activations or push ReLU units into the dead region.
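
One way to check whether ReLU units are dying is to measure how many units output zero for every sample in a batch. The helper below is a hypothetical diagnostic written in plain Python for illustration; `dead_unit_fraction` and the toy `batch` are not part of any library.

```python
def dead_unit_fraction(activations):
    """Fraction of units that are 0 for every sample.

    activations: list of rows (one per sample), each a list of
    post-ReLU outputs (one per unit).
    """
    num_units = len(activations[0])
    dead = 0
    for j in range(num_units):
        if all(row[j] == 0.0 for row in activations):
            dead += 1
    return dead / num_units

# Toy batch: 3 samples, 4 units. Units 0 and 2 never fire.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 2.1, 0.0, 0.0],
]
print(dead_unit_fraction(batch))  # 0.5
```

A unit that is zero on one batch is not necessarily dead for good, so a check like this is more informative when repeated over several batches.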

