### Activation Functions in Neural Networks

**1. What are Activation Functions?**

*   An activation function acts as a **mathematical gate** between the input received by a neuron and the output it produces.
*   In an Artificial Neural Network (ANN), each neuron performs a **weighted sum of its inputs**, adds a bias, and then passes this scalar value through a function, which is referred to as the activation function or transfer function.
*   It **decides whether a neuron will be activated or not, and if so, to what extent**.
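
The computation described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the helper name `neuron_output`, the choice of sigmoid as the example activation, and the input values are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values chosen so that the weighted sum is exactly zero.
out = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Swapping the `activation` argument for another function changes how strongly the neuron responds to the same weighted sum.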

**2. Why are Activation Functions Necessary?**

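*   **To introduce non-linearity:** Without activation functions, a neural network can only capture **linear data patterns**. However many layers are stacked, a composition of linear layers is itself a single linear map, so the network behaves like a linear regression model (or, with a sigmoid applied only at the output, like logistic regression) and can only fit a straight line or hyperplane.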
*   **To solve non-linear problems:** Activation functions are crucial for performing non-linear regression and solving classification problems where data is not linearly separable.
*   Without non-linear activation functions, the relationship between the input and output would be a first-degree polynomial (linear), making the model incapable of learning complex, non-linear relationships.
*   The **Universal Approximation Theorem** states that a network with at least one hidden layer and a suitable non-linear activation function can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy, given enough hidden units.
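
To see why a network without activation functions stays linear, the sketch below stacks two layers with no activation between them and checks that the result matches a single layer with combined weights. The weights, the input, and the helper names `linear` and `matmul` are hypothetical and chosen only for illustration.

```python
def linear(W, b, x):
    """Affine layer y = W x + b for small list-of-rows matrices."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def matmul(A, B):
    """Matrix product A B for list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Hypothetical weights for two stacked layers with no activation in between.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 1.0]], [-1.0]

x = [3.0, 4.0]
stacked = linear(W2, b2, linear(W1, b1, x))

# The same mapping written as one layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = linear(W2, b2, b1)
single = linear(W, b, x)
```

Here `stacked` and `single` agree, which is the sense in which extra linear layers add no expressive power.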

**3. Ideal Qualities of an Activation Function**

An ideal activation function should possess the following qualities:

*   **Non-linear:** Essential for capturing non-linear patterns in data.
*   **Differentiable:** It must be possible to calculate its derivative, which is critical for **gradient descent** and the **backpropagation algorithm** used to update network weights during training.
*   **Computationally Inexpensive:** Derivatives should be simple, easy, and fast to calculate to prevent slow training.
*   **Zero-Centered:** The output (activation) of the function should have a **mean close to zero**. This normalisation helps the neural network converge faster. If the activations are not zero-centered (e.g., all positive or all negative), training can become slower and more restricted.
*   **Non-Saturating:** A saturating function "squeezes" its input into a small, fixed range (e.g., 0-1 or -1 to 1). This can lead to the **Vanishing Gradient Problem**, where gradients become extremely small, preventing effective weight updates and stopping training. A non-saturating function, like ReLU in its positive region, does not suffer from this issue.

**4. Types of Activation Functions**

### A. Sigmoid Activation Function

*   **Description:** The Sigmoid function produces an **S-shaped curve**.
*   **Formula:** `sigmoid(x) = 1 / (1 + e^-x)`.
*   **Output Range:** **0 to 1**. For very large positive inputs, the output approaches 1; for very large negative inputs, it approaches 0. If x is 0, the output is 0.5.
*   **Derivative:** The derivative is at its maximum (0.25) when x is 0. For x values outside the range of approximately -3 to 3, the derivative falls below about 0.05 and keeps shrinking towards zero.

*   **Advantages:**
    *   **Probabilistic Interpretation:** The 0-1 output range allows it to be interpreted as a probability, making it suitable for the **output layer in binary classification problems**.
    *   **Non-linear:** It is a non-linear function, capable of capturing non-linear data patterns.
    *   **Differentiable:** It is differentiable everywhere, making derivative calculation straightforward for backpropagation.

*   **Disadvantages:**
    *   **Saturating Function:** It squeezes inputs into the 0-1 range, making it a saturating function.
        *   **Vanishing Gradient Problem:** This saturation leads to **vanishing gradients**. When the input `x` (weighted sum) is large in magnitude (strongly positive or strongly negative), the derivative becomes almost zero. Consequently, weight updates during backpropagation become negligible, effectively stopping the training process. This is the **primary reason it is rarely used in hidden layers today**.
    *   **Non-Zero Centered Output:** The output values are all positive (between 0 and 1), meaning the mean is not zero.
        *   This results in **slower training** and convergence problems. When all activations feeding into a neuron are positive, the gradients for that neuron's incoming weights all share the same sign (all positive or all negative). This restricts the directions in which those weights can be updated in a single step, forcing the optimisation algorithm to take a "zigzag" path, which increases training time.
    *   **Computationally Expensive:** The exponential calculations in its formula make it computationally more intensive compared to simpler functions.
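
As a rough illustration of the saturation argument, the sketch below evaluates the sigmoid derivative, using the standard identity `sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))`, and shows how quickly a product of such factors shrinks across layers. The layer counts are arbitrary, and the code is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)   # the maximum of the derivative
tail = sigmoid_grad(6.0)   # deep in the saturated region

# Best case for the activation-derivative factor after n sigmoid layers:
# every layer sits exactly at the peak, so the factor is peak ** n.
best_case = {n: peak ** n for n in (1, 5, 10)}
```

Even in this best case the factor after ten layers is below one in a million, which is why deep stacks of sigmoid layers train slowly.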

### B. Tanh (Hyperbolic Tangent) Activation Function

*   **Description:** The Tanh function also has an **S-shaped curve**, similar to Sigmoid.
*   **Formula:** `tanh(x) = (e^x - e^-x) / (e^x + e^-x)`.
*   **Output Range:** **-1 to 1**.
*   **Derivative:** The derivative formula is `1 - tanh(x)^2`. The maximum value of its derivative is 1, reached at x = 0.

*   **Advantages:**
    *   **Non-linear:** It is a non-linear function capable of capturing complex relationships.
    *   **Differentiable:** It is differentiable, allowing for easy derivative calculation.
    *   **Zero-Centered Output:** Unlike Sigmoid, Tanh's output is **zero-centered** (ranging from -1 to 1), with both negative and positive activations. This helps in faster training and convergence compared to Sigmoid, as it alleviates the gradient restriction problem.

*   **Disadvantages:**
    *   **Saturating Function:** Like Sigmoid, Tanh is a saturating function.
        *   **Vanishing Gradient Problem:** It still suffers from the **Vanishing Gradient Problem** for inputs that are large in magnitude (strongly positive or strongly negative), as its derivative also approaches zero in those regions. This fundamental issue was not resolved by Tanh.
    *   **Computationally Expensive:** It involves exponential calculations, making it computationally more expensive.
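
The zero-centring and saturation points can be checked numerically. The sketch below compares the mean output of Tanh and Sigmoid over a symmetric grid of inputs and evaluates the Tanh derivative at two points; the grid on [-5, 5] is an arbitrary choice for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Symmetric grid of inputs on [-5, 5] (an arbitrary choice for illustration).
xs = [i / 10.0 for i in range(-50, 51)]

mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)      # close to 0
mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)     # close to 0.5

grad_at_zero = tanh_grad(0.0)   # the maximum of the derivative
grad_in_tail = tanh_grad(5.0)   # nearly zero: saturation
```

On symmetric inputs Tanh averages out to roughly zero while Sigmoid averages to roughly 0.5, yet both derivatives collapse in the tails.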

### C. ReLU (Rectified Linear Unit) Activation Function

*   **Description:** ReLU is the most widely used activation function in hidden layers today.
*   **Formula:** `ReLU(x) = max(0, x)`.
    *   For **negative input values (x < 0)**, the output is **0**.
    *   For **non-negative input values (x >= 0)**, the output is **x**.

*   **Advantages:**
    *   **Non-linear:** Although each piece of ReLU is linear, the function as a whole is non-linear because of the kink at x = 0. Combining multiple ReLUs can create complex non-linear patterns.
    *   **Non-Saturating (in positive region):** For positive inputs, ReLU does not saturate. This significantly mitigates the **Vanishing Gradient Problem**, as the gradient for positive inputs is constant (1), preventing it from becoming zero.
    *   **Computationally Inexpensive:** Its simple formula (a simple comparison and output) makes it extremely **computationally efficient** for both forward and backward passes. No exponentials are involved.
    *   **Faster Convergence:** Its non-saturating nature and computational efficiency lead to **faster training and convergence** compared to Sigmoid and Tanh.

*   **Disadvantages:**
    *   **Not Completely Differentiable:** ReLU is not differentiable at x = 0 (the point where the slope changes abruptly). However, this is typically handled in coding by assigning a derivative of 0 or 1 at x = 0.
    *   **Not Zero-Centered:** Like Sigmoid, ReLU's output is not zero-centered (all activations are non-negative).
        *   This issue can be addressed by using techniques like **Batch Normalisation**, which normalises the outputs of layers before passing them to the next layer.
    *   **Dying ReLU Problem:** This is a problem where neurons can become "inactive" and stop learning. (This is covered in Section 5, The Dying ReLU Problem.)
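
A minimal sketch of ReLU and the derivative convention mentioned above. The parameter name `grad_at_zero` is hypothetical; it only makes explicit that the value used at x = 0 is a convention rather than a true derivative.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x, grad_at_zero=0.0):
    """Derivative of ReLU; the value at x == 0 is a chosen convention."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return grad_at_zero

outputs = [relu(x) for x in (-2.0, 0.0, 3.0)]
grads = [relu_grad(x) for x in (-2.0, 0.0, 3.0)]
```

Both functions need only a comparison, which is where the computational advantage over Sigmoid and Tanh comes from.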

***


### Activation Functions in Neural Networks (Part 2)

**5. The Dying ReLU Problem**

Despite its advantages, the Rectified Linear Unit (ReLU) activation function, commonly used in hidden layers, has a significant drawback known as the **Dying ReLU Problem**.

*   **What is a "Dead Neuron"?**
    *   Sometimes, if a ReLU activation function is used, certain neurons can output zero for **any input**.
    *   This neuron effectively "dies" because it stops learning; its output does not change based on the input.
    *   A dead neuron is permanently dead and will always output zero. It's as if the neuron is no longer part of the neural network.

*   **Impact of the Dying ReLU Problem:**
    *   As a rule of thumb, if **more than 50%** of neurons experience this problem, the model's performance degrades significantly.
    *   The model struggles to capture patterns and high-level representations in the data, leading to a **low-level representation** that is not effective.
    *   In the worst-case scenario, if 100% of neurons die, the network essentially ceases to exist and learns nothing, making the application of a neural network pointless.

*   **Why Does Dying ReLU Occur? (Mathematical Intuition)**
    *   Consider a simple setup with two inputs (`X1`, `X2`), one hidden ReLU neuron, and an output layer.
    *   The output of a ReLU neuron is `max(0, Z1)`, where `Z1 = W1*X1 + W2*X2 + B1` (weighted sum plus bias).
    *   The problem arises when this weighted sum, `Z1`, becomes **negative**.
    *   If `Z1 < 0`, then the ReLU output is 0, and more critically, its **derivative with respect to Z1 is also 0**.
    *   During **backpropagation**, to update the weights (e.g., W1, W2), the update rule involves the **derivative of the loss with respect to the weights**. By the chain rule, this calculation includes the derivative of the activation output `A1 = max(0, Z1)` with respect to `Z1`.
    *   If this derivative (`dA1/dZ1`) is 0, then `dL/dZ1` is 0, and the entire gradient for `W1` and `W2` becomes 0.
    *   As a result, `W1`, `W2`, and the bias `B1` will **not be updated** in subsequent training cycles (`W_new = W_old - learning_rate * gradient`).
    *   Since the weights are not updating, the neuron cannot learn, effectively becoming "dead".

*   **Reasons for `Z1` Becoming Negative:**
    *   **High Learning Rate:** If the learning rate is set too high, during weight updates, the weights (e.g., W1, W2) can become **very negative**. This can cause `Z1` (W1*X1 + W2*X2 + B1) to become negative in subsequent cycles.
    *   **High Negative Bias:** If the bias term (`B1`) becomes very negative (either initialised as such or driven negative by a high learning rate during updates), it can pull the entire `Z1` sum into the negative region, regardless of the positive inputs.

*   **Why is it Permanent?**
    *   Once `Z1` becomes negative, the neuron's output is 0, and its gradient is 0.
    *   Since the gradient is 0, the weights associated with that neuron (W1, W2) **will not update**.
    *   Even if the input data (X1, X2) changes, the small, normalised input values (typically between 0 and 1) are usually **insufficient to overcome a very negative bias or very negative weights** and make `Z1` positive again.
    *   Therefore, the neuron **cannot recover** and remains permanently dead.

*   **Solutions to the Dying ReLU Problem:**
    *   **Set Low Learning Rates:** Reduce the learning rate to prevent weights from becoming excessively negative.
    *   **Set Positive Biases:** Initialise biases with a small positive value (e.g., 0.01).
    *   **Use ReLU Variants:** Instead of standard ReLU, use its variants that are designed to mitigate this problem.
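
The mechanism above can be reproduced with a single ReLU neuron trained by gradient descent. This is a toy sketch under stated assumptions: the squared-error loss, the training data, the starting weights, and the helper name `train` are all hypothetical and chosen so that one run starts with a strongly negative bias and the other with a small positive one.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def train(w, b, data, lr=0.1, epochs=100):
    """Plain SGD on one ReLU neuron with loss 0.5 * (a - y)^2."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            a = relu(z)
            # Chain rule: dL/dz = dL/da * da/dz, and da/dz is 0 whenever z < 0.
            dL_dz = (a - y) * relu_grad(z)
            w = [wi - lr * dL_dz * xi for wi, xi in zip(w, x)]
            b = b - lr * dL_dz
    return w, b

# Hypothetical normalised inputs in [0, 1] with arbitrary targets.
data = [([0.1, 0.9], 1.0), ([0.8, 0.4], 0.5), ([1.0, 1.0], 2.0)]

dead_w, dead_b = train([0.3, 0.2], -5.0, data)    # large negative bias
alive_w, alive_b = train([0.3, 0.2], 0.01, data)  # small positive bias
```

With the bias at -5.0, `Z1` is negative for every sample, so the parameters come back unchanged after training; with the small positive bias they move as expected.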

**6. ReLU Variants**

ReLU variants modify the function's behaviour for negative inputs to prevent the gradient from becoming zero. These fall into two categories: **linear variants** and **non-linear variants**.

### A. Leaky ReLU (Linear Variant)

*   **Formula:**
    *   `f(x) = x` if `x >= 0`
    *   `f(x) = 0.01x` if `x < 0`
*   **Description:** For negative inputs, Leaky ReLU does not output zero, but a **small, non-zero fraction** (typically 0.01) of the input.
*   **Derivative:**
    *   `1` for `x >= 0`
    *   `0.01` for `x < 0`
    *   This ensures that a small gradient always flows through the network, even for negative inputs, preventing neurons from dying.

*   **Advantages:**
    *   **No Dying ReLU Problem:** The non-zero gradient for negative inputs ensures weights can still be updated.
    *   **Non-Saturating:** It's unbounded on both sides, avoiding saturation.
    *   **Computationally Inexpensive:** No exponential terms, making it fast to compute.
    *   **Close to Zero-Centred:** Provides both negative and positive outputs, leading to a mean activation closer to zero than standard ReLU.

*   **Disadvantage:**
    *   The fixed slope of `0.01` for negative inputs is an arbitrary choice.
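
A minimal sketch of Leaky ReLU and its derivative, following the formulas above. The parameter name `slope` is an assumption made for this example.

```python
def leaky_relu(x, slope=0.01):
    """x for x >= 0, slope * x otherwise."""
    return x if x >= 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    """1 for x >= 0, slope otherwise, so the gradient is never exactly zero."""
    return 1.0 if x >= 0 else slope

neg_out = leaky_relu(-2.0)        # small negative output instead of 0
neg_grad = leaky_relu_grad(-2.0)  # small but non-zero gradient
```

Because `neg_grad` is non-zero, a neuron whose weighted sum is negative still receives a (small) weight update.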

### B. Parametric ReLU (PReLU) (Linear Variant)

*   **Formula:**
    *   `f(x) = x` if `x >= 0`
    *   `f(x) = αx` if `x < 0`
*   **Description:** PReLU is similar to Leaky ReLU, but the slope `α` for negative inputs is a **learnable parameter**.
    *   The model learns the optimal value of `α` during training, adjusting it like other network weights.

*   **Advantages:**
    *   All advantages of Leaky ReLU.
    *   **Increased Flexibility:** Allows the network to learn the best negative slope for the given data, potentially leading to better performance than a fixed `0.01`.

*   **Disadvantage:**
    *   No disadvantage specific to PReLU is noted here beyond those of Leaky ReLU; `α` does add a small number of extra parameters to learn.
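
To make "learnable slope" concrete, the sketch below takes one gradient-descent step on `α` for a single negative input. It relies on the fact that the derivative of `αx` with respect to `α` is `x` for `x < 0` (and 0 otherwise). The starting value of `α`, the learning rate, the target, and the squared-error loss are hypothetical choices for the example.

```python
def prelu(x, alpha):
    """x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x

def prelu_grads(x, alpha):
    """Return (d f / d x, d f / d alpha)."""
    if x >= 0:
        return 1.0, 0.0
    return alpha, x

# One gradient step on alpha for a single negative pre-activation.
alpha, lr = 0.25, 0.1
x, target = -2.0, -1.0

out = prelu(x, alpha)
_, dout_dalpha = prelu_grads(x, alpha)
dloss_dout = out - target                  # loss = 0.5 * (out - target)^2
alpha_new = alpha - lr * dloss_dout * dout_dalpha
```

After the step, `prelu(x, alpha_new)` is closer to the target than before, which is the sense in which the slope is adjusted like any other weight.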

### C. Exponential Linear Unit (ELU) (Non-Linear Variant)

*   **Formula:**
    *   `f(x) = x` if `x >= 0`
    *   `f(x) = α(e^x - 1)` if `x < 0`
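    *   `α` is a positive hyperparameter. Values between `0.1` and `0.3` are suggested here, while common framework implementations (e.g., PyTorch and Keras) default to `1.0`.
*   **Description:** For positive inputs, it behaves like ReLU. For negative inputs, it decays smoothly and saturates towards `-α` as x becomes more negative.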
*   **Derivative:**
    *   `1` for `x >= 0`
    *   `αe^x` for `x < 0` (which is `f(x) + α`)

*   **Advantages:**
    *   **Better Performance:** Experiments have shown ELU often outperforms ReLU on various datasets.
    *   **Close to Zero-Centred:** Outputs tend to be closer to zero mean, which aids faster convergence.
    *   **No Dying ReLU Problem:** Provides non-zero gradients for negative inputs.
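    *   **Smooth Around Zero:** The function is continuous everywhere. Its derivative is also continuous at x = 0 when `α = 1`; for other values of `α` the slope jumps from `α` to `1` at zero. A smooth transition can be beneficial for optimisation.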

*   **Disadvantage:**
    *   **Computationally Expensive:** The exponential term (`e^x`) makes it more computationally intensive than ReLU or Leaky ReLU.
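
A minimal sketch of ELU and its derivative, including a numerical check of the identity `f'(x) = f(x) + α` for negative inputs. The value `α = 0.3` is taken from the range quoted above and is otherwise arbitrary.

```python
import math

def elu(x, alpha):
    """x for x >= 0, alpha * (e^x - 1) otherwise."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha):
    """1 for x >= 0, alpha * e^x otherwise."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

alpha = 0.3  # assumed value for illustration
x = -1.5

# For x < 0 the derivative equals f(x) + alpha, up to rounding error.
identity_gap = abs(elu_grad(x, alpha) - (elu(x, alpha) + alpha))

# Far into the negative region the output saturates near -alpha.
far_negative = elu(-20.0, alpha)
```

The identity means the backward pass can reuse the forward output instead of recomputing the exponential.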

### D. Scaled Exponential Linear Unit (SELU) (Non-Linear Variant)

*   **Formula:**
    *   `f(x) = λx` if `x >= 0`
    *   `f(x) = λ(α(e^x - 1))` if `x < 0`
    *   SELU is essentially ELU scaled by a fixed factor `λ`. The values for `λ` (approximately `1.0507`) and `α` (approximately `1.67326`) are pre-defined and fixed.
*   **Description:** This function aims to normalise the activations of layers automatically.
*   **Derivative:**
    *   `λ` for `x >= 0`
    *   `λ(αe^x)` for `x < 0`

*   **Advantages:**
    *   **Self-Normalising:** SELU is designed to normalise activations across layers automatically. Under the conditions assumed in the original paper (notably, suitably initialised weights), the mean of activations tends towards zero and the variance towards one, which can lead to **faster convergence** and **better generalisation** without explicit normalisation techniques like Batch Normalisation.
    *   Includes all benefits of ELU.

*   **Disadvantages:**
    *   **Less Widely Adopted:** SELU was introduced in 2017 and has seen less use and study in practice than more established functions such as ReLU.
    *   **Complex Theory:** The original paper detailing SELU is quite complex, which might hinder its broader adoption.
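
The self-normalising idea can be checked empirically in a simplified setting: if the inputs to SELU are standard normal, the outputs should again have mean close to 0 and variance close to 1. The sketch below uses the rounded constants quoted above; the sample size and random seed are arbitrary, and it tests only the activation itself, not a full network.

```python
import math
import random

LAMBDA = 1.0507   # rounded constant from the formula above
ALPHA = 1.67326   # rounded constant from the formula above

def selu(x):
    """lambda * x for x >= 0, lambda * alpha * (e^x - 1) otherwise."""
    return LAMBDA * x if x >= 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [selu(rng.gauss(0.0, 1.0)) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With these constants `mean` stays near 0 and `var` near 1, which plain ELU with an arbitrary `α` would not generally achieve.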

***