## Deep Learning: Batch Normalisation

### 1. Introduction to Batch Normalisation

**Batch Normalisation** is an **algorithmic method** designed to **speed up neural network training** and enhance its **stability**. This technique was introduced in 2015 and has since become widely adopted in neural networks.

The core idea involves **normalising the activation vectors from hidden layers** by using the mean and variance calculated from the current training mini-batch. This normalisation step is typically applied **right before or right after a non-linear activation function**. By normalising these activations (aiming for a mean of 0 and standard deviation of 1), it effectively provides "normalised inputs" to subsequent layers, analogous to how input data is generally pre-processed.

### 2. Why Use Batch Normalisation? (Necessity and Benefits)

Batch Normalisation addresses several critical challenges in deep learning training:

#### a) Analogy with Input Normalisation for Faster and More Stable Training

*   **Problem with Unnormalised Inputs**: When input features have vastly different scales (e.g., CGPA on a 1-10 scale, IQ on a 1-100 scale), using unnormalised data can lead to a **cost function contour that is "stretched" or non-uniform**. This means the contour appears elongated in some directions and compressed in others.
*   **Consequences**: A stretched cost function contour **prevents the use of high learning rates** because it can cause the optimisation algorithm to "overshoot" the minimum in the stretched direction, leading to **slow and unstable training**. Optimisers must take small steps, resulting in longer convergence times.
*   **Solution**: Normalising inputs (typically to a mean of 0 and standard deviation of 1) transforms the cost function contour into a more **uniform and spherical shape**. This allows for **stable and faster convergence** even with higher learning rates, as the optimiser can take larger, more efficient steps towards the minimum.
*   **Batch Norm's Extension**: The intuition is that if normalising the initial network inputs is beneficial, then normalising the *outputs* of hidden layers (which serve as inputs to subsequent layers) should similarly make training faster and more stable.
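
The input-normalisation step described above can be sketched in NumPy. This is a minimal illustration rather than a prescribed pipeline: the feature values, the `standardise` helper, and the small `eps` guard are assumptions made up for the example.

```python
import numpy as np

# Hypothetical feature matrix: column 0 = CGPA (roughly 1-10),
# column 1 = IQ (roughly 1-100). Values are invented for illustration.
X = np.array([
    [7.2, 85.0],
    [8.1, 92.0],
    [6.5, 70.0],
    [9.0, 98.0],
])

def standardise(X, eps=1e-8):
    """Standardise each column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # eps guards against division by zero for a constant column.
    return (X - mean) / (std + eps), mean, std

X_std, mean, std = standardise(X)
print(X_std.mean(axis=0))  # each entry is close to 0
print(X_std.std(axis=0))   # each entry is close to 1
```

After this transformation both features live on the same scale, which is the property Batch Normalisation tries to reproduce for the inputs of every hidden layer.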

#### b) Addressing Internal Covariate Shift (The Primary Motivation)

*   **Covariate Shift**: This occurs when the **distribution of the input features changes** between the training and testing phases. Even if the relationship between inputs (X) and outputs (Y) remains the same, a model trained on one input distribution might perform poorly on a new distribution, necessitating re-training. For example, a model trained to distinguish roses from other flowers using only red roses may fail when tested on roses of various colours, because the input distribution has shifted.
*   **Internal Covariate Shift (ICS)**: This is a specific type of covariate shift that occurs *within* a neural network during training.
    *   **Definition**: It is "the change in the distribution of network activations due to the change in network parameters during training".
    *   **Mechanism**: As a neural network trains, the weights of earlier layers are constantly updated. These updates modify the *outputs* of those layers, which then become the *inputs* for subsequent layers. Consequently, the **distribution of inputs to later layers is continuously shifting**, making it difficult for these deeper layers to learn stable mappings. This instability means that a layer constantly has to adapt to new input distributions from the preceding layers.
    *   **Consequences of ICS**: Leads to **unstable training** and **slow convergence**. It also forces the use of **lower learning rates** and demands **very careful weight initialisation**.
*   **Batch Norm's Solution to ICS**: By normalising the activations of *every hidden layer it is applied to* (specifically, their weighted sums before activation) to a mean of 0 and a standard deviation of 1 within each mini-batch, Batch Normalisation **keeps the input distribution for subsequent layers much more consistent**. This "stable ground" allows the rest of the network to learn more effectively, thereby **reducing the impact of ICS** and leading to faster, more stable learning.

#### c) Other Significant Advantages

1.  **More Stable Training**: Batch Normalisation allows for a **wider range of hyperparameter values** (e.g., learning rates, weight initialisation schemes) without destabilising the training process.
2.  **Faster Training**: It enables the use of **higher learning rates**, which results in faster convergence and requires **fewer epochs** to achieve optimal accuracy.
3.  **Regularisation Effect**: Batch Normalisation can act as a **weak regulariser**, helping to **reduce overfitting**. This occurs because the mean and standard deviation are calculated per batch, introducing a slight randomness or noise into the activations. This noise discourages the model from becoming overly reliant on specific input patterns present in a single batch. However, it is **not a strong regulariser** like dropout and should not replace dedicated regularisation techniques.
4.  **Reduced Importance of Weight Initialisation**: Batch Normalisation makes the training process **less sensitive to the initialisation of weights**. By normalising activations, it effectively improves the shape of the cost function contour, making it easier for the network to find the optimal solution regardless of the initial weight values.
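
The source of the regularisation noise mentioned in point 3 can be made concrete with a small sketch. The batches and the `normalise` helper below are hypothetical; the point is only that the same raw value is mapped to different normalised values depending on which other examples share its mini-batch.

```python
import numpy as np

def normalise(z, eps=1e-3):
    """Normalise a 1-D batch of pre-activations with its own mean and variance."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)

shared_value = 1.5  # the same example appears in both batches

# Hypothetical mini-batches for a single neuron; the last entry is the shared example.
batch_a = np.array([0.2, -1.0, 0.5, shared_value])
batch_b = np.array([2.0, 3.0, 2.5, shared_value])

out_a = normalise(batch_a)[-1]
out_b = normalise(batch_b)[-1]

# In batch_a the value is above the batch mean, in batch_b it is below it,
# so the normalised outputs have opposite signs.
print(out_a, out_b)
```

Because batches are reshuffled every epoch, each example sees slightly different statistics each time, which behaves like a mild noise injection.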

### 3. How Batch Normalisation Works (During Training)

Batch Normalisation is applied on a **layer-by-layer basis** and is primarily used with **mini-batch gradient descent**. It is an optional technique; you can choose which layers to apply it to.

For a given layer, the process typically involves two main stages for each neuron:

#### Stage 1: Normalising the Weighted Sum (Z)

1.  **Calculate Weighted Sum (Z)**: For each neuron in the layer, first calculate the weighted sum of its inputs (`Z = WX + B`).
2.  **Mini-Batch Processing**: Batch Normalisation operates on a **mini-batch** of data points. If the mini-batch size is `m` (e.g., 4 points), you get `m` weighted sums (`Z`) for each neuron.
3.  **Calculate Batch Mean (μ_B)**: Compute the mean of these `m` weighted sums for the current neuron across the mini-batch.
    *   Formula: `μ_B = (1/m) * Σ Z_i` (where `i` goes from 1 to `m`).
4.  **Calculate Batch Variance (σ_B²)**: Compute the variance of these `m` weighted sums for the current neuron across the mini-batch.
    *   Formula: `σ_B² = (1/m) * Σ (Z_i - μ_B)²`.
5.  **Normalise Z**: Use the calculated `μ_B` and `σ_B` to normalise each `Z_i` in the mini-batch.
    *   Formula: `Z_normalised_i = (Z_i - μ_B) / √(σ_B² + ε)`.
    *   An epsilon (`ε`) term is added to the denominator to **prevent division by zero**.
    *   After this step, the `Z_normalised` values have a mean of 0 and a standard deviation of approximately 1 (very slightly below 1, because of the `ε` term).
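
The five steps of Stage 1 can be sketched in NumPy as follows. This is an illustrative sketch, not a library implementation: the function name `batch_norm_stage1`, the example matrix `Z`, and the choice `eps=1e-3` are assumptions for the example. Rows are the `m` data points of the mini-batch and columns are neurons, so every statistic is computed per neuron along `axis=0`.

```python
import numpy as np

def batch_norm_stage1(Z, eps=1e-3):
    """Normalise pre-activations Z of shape (m, units), one neuron per column."""
    mu_B = Z.mean(axis=0)                       # step 3: batch mean per neuron
    var_B = Z.var(axis=0)                       # step 4: batch variance (1/m form)
    Z_norm = (Z - mu_B) / np.sqrt(var_B + eps)  # step 5: normalise
    return Z_norm, mu_B, var_B

# Hypothetical weighted sums for a mini-batch of m = 4 points and 3 neurons.
Z = np.array([
    [1.0, 20.0, -3.0],
    [2.0, 22.0, -1.0],
    [3.0, 18.0, -2.0],
    [4.0, 24.0, -4.0],
])

Z_norm, mu_B, var_B = batch_norm_stage1(Z)
print(mu_B)                 # per-neuron batch means
print(var_B)                # per-neuron batch variances
print(Z_norm.mean(axis=0))  # close to 0 for every neuron
print(Z_norm.std(axis=0))   # close to, but slightly below, 1
```

Note that `np.var` uses the `1/m` (population) form by default, which matches the variance formula above.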

#### Stage 2: Scaling and Shifting with Learnable Parameters (Gamma and Beta)

1.  **Scaling and Shifting**: The normalised `Z` values (`Z_normalised`) are then scaled by a learnable parameter `γ` (gamma) and shifted by another learnable parameter `β` (beta).
    *   Formula: `Z_output = γ * Z_normalised + β`.
    *   In Keras, `γ` is initialised to 1 and `β` to 0 by default, so the layer starts out as plain normalisation.
2.  **Flexibility**: These `γ` and `β` parameters provide **flexibility** to the neural network. While normalisation brings values to a mean of 0 and std dev 1, the network might sometimes prefer a different distribution for optimal learning. `γ` and `β` allow the network to **undo the normalisation** if it's not beneficial, or to scale and shift the normalised values to a more optimal distribution.
3.  **Learnable Parameters**: `γ` and `β` are **learnable parameters** that are updated during backpropagation, just like weights (W) and biases (B). Each neuron has its **own independent `γ` and `β` parameters**.
4.  **Activation**: Finally, the `Z_output` (after scaling and shifting) is passed through the chosen activation function (e.g., ReLU, Tanh) to produce the neuron's activation.
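
A minimal NumPy sketch of Stage 2 is given below, under the assumption of one `γ` and one `β` per neuron stored as vectors. The helper names (`batch_norm_forward`, `relu`) and the data are hypothetical. The second half illustrates the "undo" claim: choosing `γ = √(σ_B² + ε)` and `β = μ_B` recovers the original `Z` exactly.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-3):
    """Normalise Z per neuron (column), then scale by gamma and shift by beta."""
    mu_B = Z.mean(axis=0)
    var_B = Z.var(axis=0)
    Z_norm = (Z - mu_B) / np.sqrt(var_B + eps)
    return gamma * Z_norm + beta

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical weighted sums: mini-batch of 4 points, 2 neurons.
Z = np.array([[1.0, 20.0], [2.0, 22.0], [3.0, 18.0], [4.0, 24.0]])
eps = 1e-3

# Default initialisation: gamma = 1, beta = 0 (pure normalisation).
gamma = np.ones(2)
beta = np.zeros(2)
A = relu(batch_norm_forward(Z, gamma, beta, eps))  # step 4: activation

# Parameter values that undo the normalisation for this batch.
gamma_undo = np.sqrt(Z.var(axis=0) + eps)
beta_undo = Z.mean(axis=0)
Z_recovered = batch_norm_forward(Z, gamma_undo, beta_undo, eps)
print(np.allclose(Z_recovered, Z))  # True
```

In a real network `γ` and `β` are not set by hand like this; they are updated by backpropagation, and the identity mapping is just one of the solutions they can reach.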

### 4. How Batch Normalisation Works (During Testing/Prediction)

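During training, `μ_B` and `σ_B²` are calculated from the current mini-batch. During testing or prediction, however, you often have **only a single input point** (not a mini-batch), so batch statistics are not meaningful: the variance of a single point is zero. Even when several inputs are available, a prediction should not depend on which other examples happen to be in the same batch.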

To address this, Batch Normalisation uses **Exponentially Weighted Moving Averages (EWMA)**:

1.  **EWMA Tracking During Training**: Throughout the training process, the algorithm maintains **moving averages of the batch means (μ_EWMA)** and **moving averages of the batch variances (σ_EWMA²)** across all mini-batches and epochs. These are non-learnable parameters.
2.  **Using EWMA for Testing**: When the model is deployed for testing or prediction, instead of calculating batch-specific means and variances, it uses the **final, aggregated `μ_EWMA` and `σ_EWMA²` values** collected during training.
3.  **Inference Calculation**: For a new test input `X_test`, the normalisation becomes:
    *   `Z_normalised = (Z_test - μ_EWMA) / √(σ_EWMA² + ε)`
    *   `Z_output = γ * Z_normalised + β`
    *   The `γ` and `β` parameters are the **final learned values** from training.
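
The inference calculation can be sketched as below. All stored values (`moving_mean`, `moving_var`, `gamma`, `beta`) are made-up numbers standing in for what training would have produced, and `batch_norm_inference` is a hypothetical helper name.

```python
import numpy as np

def batch_norm_inference(z_test, moving_mean, moving_var, gamma, beta, eps=1e-3):
    """Apply Batch Normalisation with fixed statistics collected during training."""
    z_norm = (z_test - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * z_norm + beta

# Hypothetical stored values for a layer with 3 neurons.
moving_mean = np.array([2.5, 21.0, -2.5])
moving_var = np.array([1.25, 5.0, 1.25])
gamma = np.array([1.0, 0.5, 2.0])
beta = np.array([0.0, 0.1, -1.0])

# A single test point works, because no batch statistics are needed.
z_test = np.array([2.5, 21.0, -2.5])
out_single = batch_norm_inference(z_test, moving_mean, moving_var, gamma, beta)

# The same point inside a batch gives the same result: the output no longer
# depends on the other examples in the batch.
batch = np.stack([z_test, z_test + 1.0])
out_batch = batch_norm_inference(batch, moving_mean, moving_var, gamma, beta)
print(np.allclose(out_batch[0], out_single))  # True
```

Here `z_test` was chosen equal to the stored mean, so the normalised value is 0 and the output reduces to `beta`.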

In summary, during training, Batch Normalisation uses **batch-specific statistics** (μ_B, σ_B²), and during testing, it uses **global statistics** (μ_EWMA, σ_EWMA²) estimated from the training set.

A Batch Normalisation layer typically involves **four parameters per neuron**:
*   **Learnable Parameters**: `γ` (gamma) and `β` (beta).
*   **Non-learnable Parameters**: `μ_EWMA` (moving average of mean) and `σ_EWMA²` (moving average of variance).
Therefore, for a hidden layer with 3 units, a Batch Normalisation layer would have `3 * 4 = 12` parameters, with 6 being learnable and 6 non-learnable.
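
The parameter arithmetic above generalises to any layer width. The helper below is a hypothetical illustration of that counting rule, not a Keras function.

```python
def batch_norm_param_count(units):
    """Count Batch Normalisation parameters for a layer with `units` neurons."""
    learnable = 2 * units      # gamma and beta per neuron
    non_learnable = 2 * units  # moving mean and moving variance per neuron
    return {
        "learnable": learnable,
        "non_learnable": non_learnable,
        "total": learnable + non_learnable,
    }

counts = batch_norm_param_count(3)
print(counts)  # {'learnable': 6, 'non_learnable': 6, 'total': 12}
```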

### 5. Exponentially Weighted Moving Averages (EWMA)

When training with mini-batches, the **batch mean** and **batch variance** fluctuate from one batch to the next.
To stabilise inference, Batch Normalisation keeps **running averages** of both using **Exponentially Weighted Moving Averages (EWMA)**.

---

#### 🔹 Formula for EWMA

- **Running Mean**  
```
μ_t = momentum * μ_(t-1) + (1 - momentum) * μ_batch
```

- **Running Variance**  
```
σ²_t = momentum * σ²_(t-1) + (1 - momentum) * σ²_batch
```

where:
- `μ_(t-1), σ²_(t-1)` → previous running estimates  
- `μ_batch, σ²_batch` → mean/variance of the current batch  
- `momentum` → smoothing factor (default in Keras = 0.99)  

---

#### 🔹 Intuition
- `momentum = 0.9` → **90% old value + 10% new value**  
- `momentum = 0.99` → updates are **slower** → smoother estimates  
- Helps reduce noise from mini-batches and makes inference stable  
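
The effect of `momentum` can be seen in a small simulation. This is a sketch under stated assumptions: the "batch means" are an artificial sequence alternating between 4.5 and 5.5 (true mean 5.0), the running estimate starts at 0, and `ewma_update` is a hypothetical helper implementing the running-mean formula above.

```python
def ewma_update(running, batch_value, momentum=0.99):
    """One EWMA step: keep `momentum` of the old value, blend in the rest."""
    return momentum * running + (1.0 - momentum) * batch_value

# Artificial noisy batch means: 4.5, 5.5, 4.5, 5.5, ... (average 5.0).
batch_means = [4.5 if t % 2 == 0 else 5.5 for t in range(2000)]

running_fast = 0.0  # momentum = 0.9
running_slow = 0.0  # momentum = 0.99
for mu_batch in batch_means:
    running_fast = ewma_update(running_fast, mu_batch, momentum=0.9)
    running_slow = ewma_update(running_slow, mu_batch, momentum=0.99)

# Both estimates end near 5.0, but the higher momentum tracks it more smoothly.
print(abs(running_fast - 5.0))
print(abs(running_slow - 5.0))
```

The trade-off is that a higher momentum also takes longer to move away from its initial value, so it needs more training steps before the running estimate is reliable.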

---

#### 🔹 In Keras
```python
from tensorflow.keras.layers import BatchNormalization

bn = BatchNormalization(momentum=0.99, epsilon=1e-3)
```

- `momentum=0.99` →  
  ```
  μ_t = 0.99 * μ_(t-1) + 0.01 * μ_batch
  ```

---

### 6. Keras Implementation

Implementing Batch Normalisation in Keras is straightforward:

*   You simply add a `BatchNormalization` layer to your model using `model.add(BatchNormalization())`.
*   It is typically placed **after a convolutional or dense layer and before the activation function**, although placing it after the activation is also an option. Both placements, right before or right after the non-linear function, are used in practice.
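*   **Example Code Structure** (here each `Dense` layer already applies ReLU, so the `BatchNormalization` layer that follows it normalises the outputs *after* the activation):
    ```python
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, BatchNormalization

    model = Sequential()
    model.add(Dense(3, activation='relu', input_dim=2))  # First hidden layer (2 input features)
    model.add(BatchNormalization())  # Normalises the first hidden layer's activations
    model.add(Dense(2, activation='relu'))  # Second hidden layer
    model.add(BatchNormalization())  # Normalises the second hidden layer's activations
    model.add(Dense(1, activation='sigmoid'))  # Output layer for binary classification
    ```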
*   When a `BatchNormalization` layer is added, Keras automatically manages the learnable `gamma` and `beta` parameters, as well as the non-learnable `moving_mean` and `moving_variance` parameters (the EWMAs).
*   In practice, models that use `BatchNormalization` often reach a given accuracy in **fewer epochs** than the same model without it, and sometimes reach **higher accuracy** as well. The size of the gain depends on the architecture, the dataset, and the batch size.

***