## I. Context: The Final Component of Transformers

The study of Layer Normalization is the **last topic** required before moving on to the full structure of the Transformer architecture. Previous studies have already covered the other key components:

1.  **Embeddings** (Converting words to numerical vectors).
2.  **Attention** (Self-Attention and Multi-Head Attention).
3.  **Positional Encoding** (Adding word order information).

In the original Transformer architecture, Layer Normalization is applied after each sub-layer: once after the Attention step and once after the feed-forward step (the "Add & Norm" blocks).

## II. Normalization in Deep Learning

Normalization is a process in deep learning that transforms data so that it has specific statistical properties, such as a fixed mean and spread, or values confined to a given range.

### A. Types of Normalization

1.  **Standardization:** Transforms data so that the resulting column has a **mean of zero** ($\mu=0$) and a **standard deviation of one** ($\sigma=1$).
2.  **Min-Max Scaling:** Transforms data to fall within a predefined range, typically $[0, 1]$, using $\frac{x - x_{\min}}{x_{\max} - x_{\min}}$.
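
The two transformations can be compared on a single column. The following is a minimal NumPy sketch; the function names and the `ages` column are hypothetical, and it assumes the column is not constant (otherwise both formulas divide by zero).

```python
import numpy as np

def standardize(column):
    """Rescale a 1-D array to mean 0 and standard deviation 1."""
    return (column - column.mean()) / column.std()

def min_max_scale(column, low=0.0, high=1.0):
    """Rescale a 1-D array linearly into the range [low, high]."""
    span = column.max() - column.min()
    return low + (column - column.min()) * (high - low) / span

# Hypothetical feature column, for illustration only.
ages = np.array([20.0, 30.0, 40.0, 50.0])

z = standardize(ages)      # centered on 0, unit spread
m = min_max_scale(ages)    # smallest value maps to 0, largest to 1
```

Standardization keeps no fixed bounds but fixes the mean and spread; min-max scaling fixes the bounds but not the mean.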

### B. Application Points

Normalization is applied to two main components in a Deep Neural Network:

1.  **Input Data:** Normalizing the input features ($F_1, F_2, F_3$, etc.).
2.  **Hidden Layer Activations:** Normalizing the outputs (activations) of the hidden layers.

### C. Benefits of Normalization

Normalization is applied to improve the training process:

1.  **Improved Training Stability and Acceleration:** Normalization keeps inputs and activations on a comparable scale, which helps to **stabilize and accelerate** the training process. It reduces the likelihood of extreme values that can cause gradients to **explode or vanish**.
2.  **Faster Convergence:** Models converge more quickly because the gradients have more consistent magnitudes.
3.  **Mitigating Internal Covariate Shift (ICS):** Normalization helps solve the problem of ICS.
    *   **Covariate Shift:** This is when the distribution of the input data changes between the training phase and the prediction phase.
    *   **Internal Covariate Shift:** This occurs within the deep learning model. Since weights are constantly updated during backpropagation, the output (activation) of one hidden layer constantly changes its distribution, meaning the *next* hidden layer receives an input whose distribution is shifting over time. Normalization counters this internal shift.
4.  **Regularization Effect:** Some forms of normalization (specifically Batch Normalization) provide a regularization effect.

## III. Batch Normalization (BN) Review

Batch Normalization is an earlier technique for normalizing hidden layer activations.

### A. The Mechanism
BN calculates the mean ($\mu$) and standard deviation ($\sigma$) **across the batch**.

1.  **Data Collection:** Pre-activation scores ($Z$) are calculated for all nodes across all data points in the batch.
2.  **Statistical Calculation:** For a specific node (a column of $Z$ values), BN calculates the $\mu$ and $\sigma$ for all the numbers in that column.
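3.  **Transformation:** The standardization is applied: $\hat{Z} = \frac{Z - \mu}{\sigma}$. In practice a small constant $\epsilon$ is added under the square root, $\sqrt{\sigma^2 + \epsilon}$, to avoid division by zero.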
4.  **Scaling and Shifting:** The result is then scaled by a learnable parameter, **Gamma ($\gamma$)**, and shifted by another learnable parameter, **Beta ($\beta$)**. Every node has its own $\gamma$ and $\beta$ (e.g., $\gamma_1, \beta_1$).
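
The four steps above can be sketched in NumPy as follows. This is an illustrative training-time version only: the function name `batch_norm`, the `eps` value, and the example matrix are assumptions, and the running averages that real implementations keep for inference are omitted.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Z has shape (batch_size, num_nodes); statistics are per column."""
    mu = Z.mean(axis=0)                      # one mean per node
    var = Z.var(axis=0)                      # one variance per node
    Z_hat = (Z - mu) / np.sqrt(var + eps)    # standardize each column
    return gamma * Z_hat + beta              # per-node scale and shift

# Hypothetical pre-activations: 4 data points, 3 nodes.
Z = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])
gamma = np.ones(3)    # learnable in a real network; fixed here
beta = np.zeros(3)    # learnable in a real network; fixed here

out = batch_norm(Z, gamma, beta)
```

With `gamma` set to ones and `beta` to zeros, every column of `out` has mean close to 0 and standard deviation close to 1, regardless of the column's original scale.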

## IV. Why Batch Normalization Fails in Transformers

The most significant reason why Batch Normalization is not used in the Transformer architecture is that it **does not work well with sequential data** and the **Self-Attention** mechanism.

### A. The Padding Problem
<img src="./images/la1.png">

1.  **Batching and Variable Lengths:** To process sequences (sentences) efficiently, they are processed in batches. Since sentences have variable lengths (e.g., 2 words vs. 4 words), they must all be brought to the length of the longest sentence in the batch.
2.  **The Role of Padding:** Equalizing the lengths requires adding **padding vectors** (which are typically composed of zeros) to the shorter sentences.
3.  **Zero Contamination:** When Self-Attention processes the batch, these padding vectors (zeros) also pass through the QKV matrix multiplication. In this simplified view, the output contextual embeddings contain **rows of zero values** at the padded positions.
4.  **Inaccurate Statistics:** Batch Normalization computes statistics **down the columns**, one embedding dimension at a time, across every row in the batch. Each padding row therefore contributes a zero to every column. When most sentences are much shorter than the longest one (e.g., a batch padded to 100 words in which most sentences have only a few), each column consists largely of zeros. Since these zeros are forced additions and **not part of the original data**, the $\mu$ and $\sigma$ calculated for a column are **not a true representation of the data**. This distorts the normalization and can lead to model instability.
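
The distortion can be seen on a tiny batch. This is a sketch under the same simplification as the text, namely that padded positions are exactly zero when the statistics are computed; the shapes, the random data, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4

# Six real token vectors: a 2-word sentence and a 4-word sentence.
real_tokens = rng.normal(loc=5.0, scale=1.0, size=(6, d_model))

# Padded batch of shape (batch=2, seq_len=4, d_model); padding stays zero.
batch = np.zeros((2, 4, d_model))
batch[0, :2] = real_tokens[:2]    # short sentence, followed by 2 padding rows
batch[1, :4] = real_tokens[2:]    # long sentence, no padding

# Column-wise statistics over every row in the batch.
flat = batch.reshape(-1, d_model)          # 8 rows, 2 of them padding
mu_with_padding = flat.mean(axis=0)
mu_real_only = real_tokens.mean(axis=0)

print(mu_with_padding)
print(mu_real_only)
```

Two of the eight rows are zeros, so each column mean is pulled down to 6/8 of the mean of the real tokens. The larger the share of padding, the further the statistics drift from the real data.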

## V. Layer Normalization (LN): The Solution

Layer Normalization provides the necessary stability by changing the direction of the statistical calculation.

### A. The Mechanism
Layer Normalization calculates the mean ($\mu$) and standard deviation ($\sigma$) **across the features** (horizontally, across the row).

1.  **Direction:** Normalization occurs **across the feature dimension** (or embedding dimension) rather than across the batch.
2.  **Calculation:** For a single word (represented by a row of embedding dimensions), $\mu$ and $\sigma$ are calculated using **only the numbers in that single row**.
3.  **Scaling and Shifting:** Every embedding dimension (feature) has its own $\gamma$ and $\beta$ parameters, which are applied after standardization.
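
The mechanism can be sketched in NumPy as follows. The function name `layer_norm`, the `eps` value, and the example rows are assumptions chosen for illustration; this is not taken from any particular library.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """X has shape (num_tokens, d_model); statistics are per row."""
    mu = X.mean(axis=-1, keepdims=True)      # one mean per token
    var = X.var(axis=-1, keepdims=True)      # one variance per token
    X_hat = (X - mu) / np.sqrt(var + eps)    # standardize each row
    return gamma * X_hat + beta              # per-feature scale and shift

# Two hypothetical token embeddings with very different scales.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
gamma = np.ones(4)    # learnable in a real network; fixed here
beta = np.zeros(4)    # learnable in a real network; fixed here

out = layer_norm(X, gamma, beta)
```

The only change from a column-wise version is the axis of the mean and variance. Each row is normalized using its own numbers, so the two rows above end up almost identical even though one is ten times larger than the other.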

### B. The Solution to Padding
Because LN calculates statistics **row by row**, the padding zeros do not distort the statistical representation of the actual data points.

*   When calculating $\mu$ and $\sigma$ for a valid word (a non-padding row), **no padding zeros from other sentences** are included in the calculation.
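*   The padding zeros only affect the calculation of the row that *itself* is entirely padding. With the usual small $\epsilon$ in the denominator, the standardized values of such a row are all zero (so its output is just $\beta$), which is acceptable because that row carries no real data.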
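
A quick check of both points, as a minimal sketch: the helper below omits $\gamma$ and $\beta$ for brevity, and the token values are hypothetical.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Row-wise standardization only (no gamma/beta, for brevity)."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

# One real token, normalized on its own.
token = np.array([[0.5, -1.0, 2.0, 3.5]])
alone = layer_norm(token)

# The same token, followed by three all-zero padding rows.
padded = np.vstack([token, np.zeros((3, 4))])
with_padding = layer_norm(padded)

print(alone[0])
print(with_padding[0])     # same values as the line above
print(with_padding[1:])    # padding rows stay at zero
```

The real token's normalized values do not change when padding rows are added, and the padding rows standardize to zero because `eps` keeps the denominator non-zero.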

Because the misleading padding zeros are excluded from the statistics of every real token, Layer Normalization gives stable representations of the data regardless of how much padding a batch contains. This makes it the standard choice for Transformer models processing sequential data.