### Weight Initialisation Techniques in Deep Learning

**1. Introduction to Weight Initialisation**

*   Weight initialisation is a crucial step in setting up a neural network for training.
*   It involves assigning initial values to all the **weights** and **biases** (parameters) of the network before the optimisation algorithm begins.
*   The video discusses the importance of correct weight initialisation and divides the topic into two parts: "What not to do?" (current video) and "What to do?" (next video).

**2. Why is Weight Initialisation Important?**

*   The initial parameter values strongly **influence how well the neural network will train and perform**.
*   Incorrect initialisation of weights can lead to several problems:
    *   **Vanishing Gradient Problem:** Gradients become extremely small, leading to negligible weight updates and stalled training.
    *   **Exploding Gradient Problem:** Gradients become excessively large, causing unstable training and making it difficult to converge.
    *   **Slow Convergence:** The model takes a very long time to reach an optimal solution.
*   These problems were serious enough to hold back the training of deep networks for years. Analysis published around 2010 (Glorot and Bengio) traced much of the difficulty to two main culprits: the **Sigmoid activation function** and **wrong weight initialisation techniques**.
*   Weight initialisation has remained an active research topic since then.

**3. The Neural Network Training Process (Review)**

*   The overall process involves:
    1.  **Initialising all parameters** (weights and biases).
    2.  Choosing an optimisation algorithm.
    3.  Repeating the following steps:
        *   **Forward Propagation:** Input is passed through the network to calculate the output (`y_hat`).
        *   **Loss Calculation:** The difference between `y_hat` and the true output is quantified as loss.
        *   **Gradient Calculation:** Gradients for each parameter are computed based on the loss.
        *   **Parameter Update:** Gradients are used to update all parameters via gradient descent.
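
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code: it assumes a single linear layer, a mean-squared-error loss, plain batch gradient descent, and made-up toy data, so that each step of the loop is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): 100 samples, 3 features.
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3).reshape(-1, 1)

# 1. Initialise all parameters (small random weights, purely for illustration).
W = rng.normal(scale=0.1, size=(3, 1))
b = np.zeros((1, 1))

# 2. Choose an optimisation algorithm: plain batch gradient descent.
learning_rate = 0.1
losses = []

# 3. Repeat: forward propagation, loss, gradients, parameter update.
for epoch in range(200):
    y_hat = X @ W + b                        # forward propagation
    loss = np.mean((y_hat - y) ** 2)         # loss calculation (MSE)
    grad_y_hat = 2 * (y_hat - y) / len(X)    # dLoss/dy_hat
    grad_W = X.T @ grad_y_hat                # gradient calculation
    grad_b = grad_y_hat.sum(axis=0, keepdims=True)
    W -= learning_rate * grad_W              # parameter update
    b -= learning_rate * grad_b
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```

Everything that follows in these notes concerns step 1: how the starting values of `W` are chosen.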

**4. Weight Initialisation Techniques to AVOID**

The video focuses on techniques that **should not be used** as they lead to significant problems during training.

### A. Zero Initialisation

*   **Description:** Setting all weights (and biases) to **zero** initially.
*   **Problem with ReLU and Tanh:**
    *   If all weights and biases are zero, the weighted sum (`Z`) for any neuron will be zero.
    *   For **ReLU**, `max(0, 0)` is 0. So, all neurons in a layer will output 0.
    *   For **Tanh**, `tanh(0)` is 0. So, all neurons will also output 0.
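    *   During backpropagation, every weight gradient is then zero. The gradient of a weight is the error signal at its neuron multiplied by the incoming activation `A`, and `A` is 0 in every hidden layer. The error signal reaching earlier layers is also multiplied by the next layer's weights, which are all 0. (For ReLU the derivative at `Z = 0` is additionally taken as 0. For Tanh the derivative at `Z = 0` is 1, so it is the zero activations and zero weights that remove the gradient, not the derivative.)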
    *   This results in **no update to the weights** (`W_new = W_old - learning_rate * gradient`), meaning the weights remain at zero forever and **no training occurs**.
    *   This applies to both ReLU and Tanh, essentially halting the training process.
*   **Problem with Sigmoid:**
    *   If `Z` is zero, `sigmoid(0)` is **0.5**.
    *   All neurons in the hidden layer will output 0.5.
    *   Crucially, **all neurons in a given layer will have the exact same activation value (0.5)**.
    *   Consequently, the **gradients calculated for weights connected to a specific input will be identical**. For example, if two weights connect one input to two different neurons in the next layer, their gradients will be the same, and they will update by the same amount.
    *   This means that neurons in the same layer will **learn the exact same features** and behave identically, as if they were a **single neuron**.
    *   The hidden layer therefore has the capacity of a single neuron, so the network loses most of its ability to learn complex, non-linear patterns and behaves much like a simple **perceptron**.
    *   Even adding more neurons to the hidden layer will not help, as they will all behave as one.
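
Both failure modes can be checked numerically. The sketch below is an illustration under assumptions that are not in the lecture: a network with 2 inputs, 3 hidden units, and 1 linear output, a mean-squared-error loss, random toy data, and a hypothetical helper `hidden_layer_gradients` that computes the weight gradients by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer_gradients(X, y, W1, b1, W2, b2, activation):
    """Return (dW1, dW2) for a 1-hidden-layer regression net with MSE loss."""
    z1 = X @ W1 + b1
    if activation == "tanh":
        a1 = np.tanh(z1)
        da1 = 1.0 - a1 ** 2
    else:  # sigmoid
        a1 = sigmoid(z1)
        da1 = a1 * (1.0 - a1)
    y_hat = a1 @ W2 + b2
    d_out = 2 * (y_hat - y) / len(X)     # dLoss/dy_hat
    dW2 = a1.T @ d_out                   # proportional to the hidden activations
    d_hidden = (d_out @ W2.T) * da1      # proportional to the next layer's weights
    dW1 = X.T @ d_hidden
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = rng.normal(size=(50, 1))

# Zero initialisation of every weight and bias.
W1, b1 = np.zeros((2, 3)), np.zeros((1, 3))
W2, b2 = np.zeros((3, 1)), np.zeros((1, 1))

dW1_tanh, dW2_tanh = hidden_layer_gradients(X, y, W1, b1, W2, b2, "tanh")
dW1_sig, dW2_sig = hidden_layer_gradients(X, y, W1, b1, W2, b2, "sigmoid")

print("tanh    dW2:", dW2_tanh.ravel())   # all zeros: the weights cannot move
print("sigmoid dW2:", dW2_sig.ravel())    # non-zero, but the same for every hidden unit
```

With Tanh every weight gradient is zero. With Sigmoid the output-layer gradients are non-zero but identical for all three hidden units, which is the symmetry described above.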

### B. Non-Zero Constant Initialisation

*   **Description:** Setting all weights (and biases) to a **single non-zero constant value** (e.g., 0.5).
*   **Problem:**
    *   Similar to the Sigmoid zero initialisation case, if all weights are initialised to the same constant (e.g., 0.5), every neuron in a hidden layer receives the same inputs through the same weights, so all neurons in that layer produce the **same output/activation**.
    *   This leads to **identical gradients** for weights originating from the same input and going to different neurons in the next layer.
    *   As a result, these weights will update by the same amount and remain identical throughout training.
    *   Each hidden layer again collapses to the capacity of a single neuron, so the network behaves much like a simple **perceptron** and cannot capture complex non-linear patterns, irrespective of the activation function (ReLU and Tanh exhibit this behaviour too).
    *   In short, you end up with the same "single neuron" problem as with Sigmoid and zero initialisation, but with a non-zero output.
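
The symmetry can be observed by training a small network from a constant start. The sketch below is illustrative only and makes assumptions beyond the notes: a 3-4-1 network with Tanh hidden units, a mean-squared-error loss, a made-up `sin` target, and 500 steps of batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
y = np.sin(X[:, :1])                       # hypothetical non-linear target

# Every weight and bias starts at the same constant, 0.5.
W1, b1 = np.full((3, 4), 0.5), np.full((1, 4), 0.5)
W2, b2 = np.full((4, 1), 0.5), np.full((1, 1), 0.5)

learning_rate = 0.05
for step in range(500):
    a1 = np.tanh(X @ W1 + b1)              # forward pass
    y_hat = a1 @ W2 + b2
    d_out = 2 * (y_hat - y) / len(X)       # backward pass (MSE loss)
    dW2 = a1.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * (1.0 - a1 ** 2)
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0, keepdims=True)
    W1 -= learning_rate * dW1              # gradient descent update
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Each column of W1 holds the incoming weights of one hidden unit.
print(W1)
columns_identical = np.allclose(W1, W1[:, :1])
print("hidden units still identical:", columns_identical)
```

The weights do move away from 0.5, but all four columns of `W1` stay equal, so the four hidden units compute the same function throughout training.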

### C. Small Random Initialisation

*   **Description:** Initialising weights with **small random values** (e.g., random values multiplied by 0.01). The example considered is a layer with many inputs (e.g., 500), with the random values drawn from a normal distribution.
*   **Problem with Tanh:**
    *   When many small random weights are multiplied by input values (even if inputs are normalised between -1 and 1) and then summed, the resulting `Z` (weighted sum) will be a **very small number**, close to zero.
    *   Around zero, **Tanh** is approximately linear (`tanh(Z) ≈ Z`, and its derivative is at its maximum of 1), so the output activation is also **very close to zero**, and it shrinks further with every layer it passes through.
    *   During backpropagation, each weight gradient is proportional to the incoming (near-zero) activation, and the error signal is multiplied by the small weights at every layer on the way back, so the gradients become **extremely small and approach zero**.
    *   This leads to the **Vanishing Gradient Problem**, where weight updates are negligible, and training essentially stops or becomes incredibly slow.
*   **Problem with Sigmoid:**
    *   While Sigmoid outputs around 0.5 for `Z` close to zero, its derivative is at most 0.25, and the error signal is also multiplied by the small weights at every layer. In a deep network (many layers) these small factors multiply together, so Sigmoid suffers from the **Vanishing Gradient Problem** as well, and training is very slow or stalls.
*   **Problem with ReLU:**
    *   For ReLU, small values in `Z` would still result in small outputs (if `Z > 0`) or zero outputs (if `Z <= 0`), but the non-saturating nature in the positive region (`gradient = 1`) means the **Vanishing Gradient Problem is less severe** than with Tanh or Sigmoid.
    *   However, small initial weights with ReLU will lead to **extremely slow training and convergence**. The network will take a very long time (e.g., thousands of epochs) to learn anything substantial.
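
The shrinking signal is easy to measure. The sketch below is a rough illustration, assuming a stack of 10 Tanh layers of width 500, standard-normal inputs, and standard-normal weights scaled by 0.01; only the forward pass is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_layers = 500, 10

# Inputs with zero mean and unit variance (an assumption for this sketch).
a = rng.normal(size=(1000, n_units))
activation_std = []

for layer in range(n_layers):
    W = rng.normal(size=(n_units, n_units)) * 0.01   # small random weights
    a = np.tanh(a @ W)
    activation_std.append(float(a.std()))

for layer, std in enumerate(activation_std, start=1):
    print(f"layer {layer:2d}: activation std = {std:.2e}")
```

Each layer scales the spread of the activations by roughly `sqrt(500) * 0.01 ≈ 0.22`, so after a few layers the activations, and the weight gradients that are proportional to them, are effectively zero.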

### D. Large Random Initialisation

*   **Description:** Initialising weights with **large random values** (e.g., between 0 and 1, which is considered "large" in deep learning initialisation context). This is again considered for layers with many inputs.
*   **Problem with Tanh and Sigmoid (Saturation):**
    *   If `Z` (the weighted sum) becomes very large (positive or negative) due to large weights and inputs, both **Tanh** and **Sigmoid** activation functions will **saturate**.
    *   Saturation means the output will be pushed to the extreme ends of their range (e.g., 1 or -1 for Tanh, 0 or 1 for Sigmoid).
    *   In the saturated regions, the **derivatives of these functions are almost zero**.
    *   This again leads to the **Vanishing Gradient Problem** and/or **slow convergence**, as gradients become too small to effectively update weights.
*   **Problem with ReLU (Unstable Gradients):**
    *   Since ReLU is **non-saturating** in the positive direction, a large `Z` will result in a proportionally large output.
    *   If activations are large, the gradients calculated during backpropagation will also be **very large**.
    *   Large gradients cause the weight updates to be very substantial, making the training **unstable**. The optimisation algorithm takes huge steps, often overshooting the optimal solution and failing to converge efficiently.
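
Both effects show up in a simple forward-pass simulation. The sketch below assumes unscaled standard-normal weights as the "large" case (the notes mention other ranges), 10 layers of width 500, and standard-normal inputs; the 0.99 saturation threshold is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_layers = 500, 10
x = rng.normal(size=(1000, n_units))

# Large weights: standard normal with no scaling (an assumption for this sketch).
a_tanh, a_relu = x, x
for layer in range(n_layers):
    W = rng.normal(size=(n_units, n_units))
    a_tanh = np.tanh(a_tanh @ W)
    a_relu = np.maximum(0.0, a_relu @ W)

# Fraction of Tanh outputs pushed close to -1 or +1, where the derivative is near 0.
saturated = float(np.mean(np.abs(a_tanh) > 0.99))
relu_mean = float(a_relu.mean())

print(f"tanh: {saturated:.0%} of activations are saturated")
print(f"ReLU: mean activation after {n_layers} layers = {relu_mean:.2e}")
```

Most Tanh outputs end up saturated, while the ReLU activations grow by orders of magnitude from layer to layer, which is what makes the corresponding gradients unstable.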

**5. Summary of What Not to Do**

*   Do not initialise weights to **zero**.
*   Do not initialise weights to a **non-zero constant value**.
*   Do not initialise weights with **small random values**.
*   Do not initialise weights with **large random values**.

***

## Deep Learning: Weight Initialisation Techniques (part 2)

### 1. Introduction: The Importance of Weight Initialisation

Weight initialisation is a crucial step in training neural networks. Incorrect initialisation can severely hinder the training process and lead to poor model performance. This lecture will cover common pitfalls in weight initialisation and introduce two widely used heuristic techniques: **Xavier/Glorot initialisation** and **He initialisation**.

### 2. What NOT to do: Common Pitfalls in Weight Initialisation

Based on experimental findings, certain weight initialisation strategies are problematic:

*   **Zero Initialisation**:
    *   **Problem**: Setting all weights to zero.
    *   **Reason**: With Tanh and ReLU, all activations and therefore all weight gradients are zero, so the weights never update and the model does not learn. With Sigmoid, the gradients are non-zero but identical for all neurons in a layer, so those neurons update in lockstep and behave as a single neuron.
*   **Constant Non-Zero Initialisation**:
    *   **Problem**: Setting all weights to the same non-zero constant (e.g., 0.5).
    *   **Reason**: Similar to zero initialisation, multiple weights will update together. If a hidden layer has multiple nodes, they will effectively act as a single node, preventing the model from capturing non-linear relationships and making it behave like a linear model.
*   **Small Random Weights**:
    *   **Problem**: Generating random weights from a very small range (e.g., `np.random.randn(n_in, n_out) * 0.01`).
    *   **Reason**: When inputs (e.g., -1 to 1 or 0 to 1) are multiplied by very small weights and then summed over many connections, the resulting sum (output of the node before activation) becomes very small. This can lead to the **vanishing gradient problem** and **slow convergence**, meaning the model takes a long time to reach an optimal solution.
*   **Large Random Weights**:
    *   **Problem**: Generating random weights from a very large range (e.g., between -3 and 3 without scaling).
    *   **Reason**: When inputs are multiplied by large weights and summed, the resulting sum can become very large (e.g., -250 to 250). Passed through Tanh or Sigmoid, such a value **saturates** the activation function (pushes the output to its maximum or minimum value), where the derivative is almost zero, which leads to the **vanishing gradient problem**. With ReLU, which does not saturate for positive inputs, the large values instead produce very large gradients, i.e. the **exploding gradient problem**.

### 3. The Need for an "Intermediate Solution"

*   The key learning is that while **random weights** are essential, the **range** or **spread (variance)** of these random numbers is critically important.
*   Weights must not be too small (causing vanishing gradients/slow convergence) nor too large (causing saturation/exploding gradients).
*   An "intermediate solution" with weights in a "good range" or with a "good variance" is required.
*   This variance should **depend on the neural network's architecture**, specifically the number of inputs to a node. If there are many inputs, individual weights should be smaller to prevent the sum from becoming too large. If there are few inputs, weights can be slightly larger.
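
The dependence on the number of inputs can be made explicit with a short variance calculation. This is a sketch under simplifying assumptions that the notes do not state: the weights $w_i$ and inputs $x_i$ of a neuron are mutually independent, each group is identically distributed with zero mean, and the activation function is ignored.

$$
z = \sum_{i=1}^{n_{in}} w_i x_i
\quad\Longrightarrow\quad
\operatorname{Var}(z) = n_{in}\,\operatorname{Var}(w)\,\operatorname{Var}(x)
$$

For example, with $n_{in} = 500$ and weights of standard deviation $0.01$, $\operatorname{Var}(z) = 500 \times 0.0001 \times \operatorname{Var}(x) = 0.05\,\operatorname{Var}(x)$, so the signal shrinks at every layer. With unscaled weights ($\operatorname{Var}(w) = 1$) it grows by a factor of 500 instead. Choosing $\operatorname{Var}(w) = 1/n_{in}$ gives $\operatorname{Var}(z) = \operatorname{Var}(x)$, which is the "intermediate" scale the heuristics below aim for.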

### 4. Heuristics for Weight Initialisation

Fortunately, researchers have developed practical, experimentally validated heuristics (the lecture calls them "jugaad", a Hindi word for a pragmatic workaround). The two main techniques are:

*   **Xavier/Glorot Initialisation**
*   **He Initialisation**

Both come in **Normal** (Gaussian distribution) and **Uniform** (uniform distribution) versions.

### 5. Xavier/Glorot Initialisation

*   **When to Use**: **Xavier initialisation works best with Tanh and Sigmoid activation functions**.
*   **Intuition**: It aims to keep the variance of the activations consistent across layers. In the simplified form used here, the variance of the weights is set to `1/n_in`, where `n_in` is the number of inputs coming into each neuron of the current layer (the size of the previous layer). The standard deviation is therefore `sqrt(1/n_in)`.
*   **Normal Version Formula**: Weights are drawn from a normal distribution with a mean of 0 and a standard deviation of `sqrt(1 / n_in)`.
    *   Example: `np.random.randn(n_in, n_out) * np.sqrt(1 / n_in)`.
    *   *Note*: The original Glorot formulation, which Keras's `glorot_normal` follows, also accounts for the number of outputs and uses a variance of `2 / (n_in + n_out)`, i.e. a standard deviation of `sqrt(2 / (n_in + n_out))`.
*   **Uniform Version Formula**: Weights are drawn from a uniform distribution within the range `[-limit, limit]`, where `limit` is defined as `sqrt(6 / (n_in + n_out))`.
    *   Here, `n_out` is the number of output nodes from the current layer (or number of outputs from a specific neuron).
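
The two formulas translate directly into NumPy. The helper names `xavier_normal` and `xavier_uniform` below are hypothetical, and the 10-layer Tanh stack of width 500 is an assumed test setup, used only to check that the forward signal neither collapses nor blows up.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Normal version as given in the notes: mean 0, std = sqrt(1 / n_in)."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng):
    """Uniform version: U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_units = 500
a = rng.normal(size=(1000, n_units))       # standard-normal inputs (assumed)

stds = []
for layer in range(10):
    a = np.tanh(a @ xavier_normal(n_units, n_units, rng))
    stds.append(float(a.std()))

print("activation std per layer:", [round(s, 3) for s in stds])

# A uniform distribution on [-limit, limit] has variance limit**2 / 3,
# which here equals 2 / (n_in + n_out).
W_uniform = xavier_uniform(300, 100, rng)
print("uniform variance:", float(W_uniform.var()), "target:", 2 / (300 + 100))
```

The activation spread decays only slowly with depth, in contrast to the rapid collapse seen with weights scaled by 0.01.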

### 6. He Initialisation

*   **When to Use**: **He initialisation works best with ReLU and its variants** (e.g., Leaky ReLU).
*   **Intuition**: Similar to Xavier, it keeps the variance of the activations consistent across layers, but the scaling is tuned for ReLU, which outputs zero for all negative inputs.
*   **Normal Version Formula**: Weights are drawn from a normal distribution with a mean of 0 and a standard deviation of `sqrt(2 / n_in)`.
    *   The `2` in the numerator accounts for ReLU's property of effectively "killing" half the activations (setting them to zero).
*   **Uniform Version Formula**: Weights are drawn from a uniform distribution within the range `[-limit, limit]`, where `limit` is defined as `sqrt(6 / n_in)`.
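
The effect of the factor `2` can be checked with a forward-pass simulation. The helper names below (`he_normal`, `he_uniform`, `one_over_n_normal`, `relu_signal_rms`) are hypothetical, and the setup (10 ReLU layers of width 500, standard-normal inputs) is an assumption made for illustration.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """Normal version: mean 0, std = sqrt(2 / n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def he_uniform(n_in, n_out, rng):
    """Uniform version: U(-limit, limit) with limit = sqrt(6 / n_in)."""
    limit = np.sqrt(6.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def one_over_n_normal(n_in, n_out, rng):
    """Comparison only: std = sqrt(1 / n_in), i.e. without the factor 2."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def relu_signal_rms(init_fn, x, rng, n_layers=10):
    """Root-mean-square activation after n_layers ReLU layers of equal width."""
    a = x
    for _ in range(n_layers):
        a = np.maximum(0.0, a @ init_fn(a.shape[1], a.shape[1], rng))
    return float(np.sqrt(np.mean(a ** 2)))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 500))

rms_he = relu_signal_rms(he_normal, x, rng)
rms_he_uniform = relu_signal_rms(he_uniform, x, rng)
rms_plain = relu_signal_rms(one_over_n_normal, x, rng)

print(f"He normal:      {rms_he:.3f}")
print(f"He uniform:     {rms_he_uniform:.3f}")
print(f"std sqrt(1/n):  {rms_plain:.3f}")
```

With He scaling the signal magnitude stays close to that of the input, whereas the `sqrt(1 / n_in)` scaling loses roughly half of the signal power at every ReLU layer.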

### 7. Keras Implementation

Keras makes it very easy to implement these initialisation techniques:

*   You set the `kernel_initializer` parameter for each layer.
*   **Examples**:
    *   For He Normal: `kernel_initializer='he_normal'`.
    *   For He Uniform: `kernel_initializer='he_uniform'`.
    *   For Glorot/Xavier Normal: `kernel_initializer='glorot_normal'`.
    *   For Glorot/Xavier Uniform: `kernel_initializer='glorot_uniform'`.
*   **Default Keras Initialisation**: If `kernel_initializer` is not specified, layers such as `Dense` default to **Glorot Uniform** (`'glorot_uniform'`).
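
A minimal sketch of how these options might be combined in a model follows. The architecture (20 input features, two ReLU hidden layers, one Sigmoid output) is hypothetical, and the import assumes the Keras bundled with TensorFlow 2.

```python
from tensorflow import keras

# Hypothetical architecture: 20 input features, two hidden layers, one output.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    # ReLU hidden layers: He initialisation.
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(32, activation="relu", kernel_initializer="he_uniform"),
    # Sigmoid output: Glorot initialisation, written out explicitly here.
    keras.layers.Dense(1, activation="sigmoid", kernel_initializer="glorot_uniform"),
])

# Show which initialiser each layer ended up with.
for layer in model.layers:
    print(layer.name, type(layer.kernel_initializer).__name__)
```

Biases are controlled separately through `bias_initializer`, which is left at its default here.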

### 8. Conclusion

Choosing the correct weight initialisation technique based on your **activation function** is crucial for stable and efficient training of deep neural networks. Xavier/Glorot is preferred for Tanh/Sigmoid, while He is preferred for ReLU. Experimenting with these techniques and observing their impact on model training is recommended.

***