**Let's Be Normal**
#### Normalization 

![Normalization](media/norm.jpg)


**Understanding Layer Normalization Through a Soccer Story**

Imagine a soccer team composed of three groups of players: **Defenders**, **Midfielders**, and a **Striker**. Each group represents a layer in a neural network, and the ultimate goal of the team is to score a **goal** — but crucially, the goal isn’t just whether they score; it’s **how well-timed** the goal is.

**The Game Setup: Timing is Everything**

The team tries to learn how to score a goal **at the right moment**. The coach watches the timing carefully and measures by how many seconds the team misses the perfect moment. A goal scored too early or too late counts as a miss of that many seconds.

---

*First Pass: The Timing Miss*

On the first attempt, the team scores a goal but misses the perfect timing by **2 seconds**. The coach gives feedback:

> “You need to play faster to get the timing right.”

All players — defenders, midfielders, and striker — agree to play faster.

---

*Second Pass: Overcompensation and Oscillation*

Trying to play faster, the **defenders** kick the ball too far ahead to the midfielders, making it harder for the midfielders to control the ball properly. The coordination between groups breaks down, and the team **fails to score properly**.

The team is oscillating — defenders try one extreme, midfielders react differently, and the striker is out of sync. The feedback loop causes instability and makes it **take a long time for the team to learn proper coordination**.

---

**What’s Happening Here in Neural Network Terms?**

* The players are like **neurons in different layers**.
* The timing feedback is like the **loss signal** measured after the output.
* The team’s attempt to “play faster” without considering the whole coordination is similar to how **changing activations without normalization** can cause instability.
* The oscillation and slow learning represent the challenge of **internal covariate shift** — layers have to constantly adapt to changing inputs from previous layers.

---

**How Layer Normalization Helps**

**Layer Normalization** acts like a **coach who rescales each group's play to a standard level before the ball is passed on**:

* Within each group (layer), the players' (neurons') outputs are re-centered and rescaled together, so the group plays at a **standardized level** no matter how the previous pass went.
* This **stabilizes the inputs** going into each layer, so players are always "on the same page".
* The team can adjust their pace and coordination **more smoothly and reliably**.
* As a result, the team learns faster and scores goals at the right time with fewer oscillations.

---


Without LayerNorm, the team struggles with timing and coordination, leading to unstable and slow learning. With LayerNorm, each player adjusts their game consistently, allowing the team to synchronize, learn faster, and hit their target timing accurately.
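Setting the analogy aside, the operation itself is small. The following is a minimal sketch, assuming PyTorch and a made-up batch of 4 samples with 8 features each (the tensor names and sizes are illustrative, not from the original notebook). For every sample it subtracts the mean of that sample's features and divides by their standard deviation, then compares the result with `nn.LayerNorm`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical activations: 4 samples, 8 features each, deliberately off-center and spread out.
x = torch.randn(4, 8) * 5.0 + 3.0

# Manual LayerNorm: statistics are computed per sample, across its features.
eps = 1e-5  # same default epsilon as nn.LayerNorm
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + eps)

# Built-in layer; its learnable scale and shift start at 1 and 0, so the outputs should match.
layer_norm = nn.LayerNorm(8)
builtin = layer_norm(x)

print(torch.allclose(manual, builtin, atol=1e-5))
print(manual.mean(dim=-1))  # per-sample means, all close to 0
```

Because the statistics come from a single sample, the result does not depend on what else is in the batch.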


#### Why do we need normalization in deep learning?

1. **Problem: Internal Covariate Shift**

- During training, the distribution of activations in intermediate layers keeps changing as model weights update.
- This forces each layer to constantly adapt to changing input distributions — called internal covariate shift.
- This slows down training and makes it harder for the model to converge.

2. **Vanishing / Exploding Gradients**

- Without normalization, activation values can grow very large or shrink toward zero as they pass through many layers.
- This leads to vanishing gradients (updates too small to learn from) or exploding gradients (unstable updates).
- Normalization helps keep activations in a stable range, improving gradient flow.

3. **Faster and More Stable Training**

- Normalization techniques (BatchNorm, LayerNorm, RMSNorm, etc.) rescale activations to a consistent range at every layer.
- This helps networks train faster and more stably, and they often generalize better.
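The drift in activation scale can be seen in a toy experiment. This is only a sketch under stated assumptions: 20 randomly initialized bias-free `nn.Linear` layers of width 64, no nonlinearity, no training, and PyTorch's default initialization (the helper `final_std` is a name made up for this example). With this particular initialization the signal shrinks layer by layer; with larger weights it would blow up instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

width, depth = 64, 20
x = torch.randn(256, width)  # hypothetical batch of unit-variance inputs

def final_std(use_norm: bool) -> float:
    """Push x through `depth` random linear layers and return the output std."""
    h = x
    with torch.no_grad():
        for _ in range(depth):
            h = nn.Linear(width, width, bias=False)(h)  # default PyTorch init
            if use_norm:
                h = nn.LayerNorm(width)(h)  # re-standardize after every layer
    return h.std().item()

plain_std = final_std(use_norm=False)
normed_std = final_std(use_norm=True)
print(f"without normalization: {plain_std:.2e}")
print(f"with LayerNorm:        {normed_std:.2e}")
```

In this setup the unnormalized stack ends with activations several orders of magnitude smaller than its input, while the normalized stack stays near a standard deviation of 1.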



**Common Normalization Techniques**

| Technique | Normalizes over                      | Key idea                                                                              | Typical usage                                   |
| --------- | ------------------------------------ | ------------------------------------------------------------------------------------- | ----------------------------------------------- |
| BatchNorm | Batch dimension, per feature/channel | Standardize with the batch mean/std during training; use running statistics at inference | CNNs, image models; sensitive to small batches  |
| LayerNorm | Feature dimension, per sample        | Standardize with the mean/std of each sample's features                               | Transformers, RNNs; independent of batch size   |
| RMSNorm   | Feature dimension, per sample        | Divide by the root mean square of the features (no mean centering)                    | Transformers; cheaper than LayerNorm            |
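
The difference between the three techniques is mostly a matter of which axis the statistics are taken over. A small sketch, assuming PyTorch and a made-up batch of 6 samples with 4 features (RMSNorm is written by hand here rather than taken from a library):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(6, 4) * 3.0 + 2.0  # hypothetical batch: 6 samples, 4 features

# BatchNorm (training mode): statistics per feature, taken across the batch (dim 0).
batch_out = nn.BatchNorm1d(4)(x)

# LayerNorm: statistics per sample, taken across the features (dim -1).
layer_out = nn.LayerNorm(4)(x)

# RMSNorm by hand: divide each sample by the RMS of its features, no centering.
rms_out = x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)

print(batch_out.mean(dim=0))               # ~0 for each of the 4 features
print(layer_out.mean(dim=-1))              # ~0 for each of the 6 samples
print(rms_out.pow(2).mean(dim=-1).sqrt())  # ~1 for each sample; means are not forced to 0
```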






#### RMSNorm — What is It and Why?

RMSNorm normalizes the input vector by its root mean square (RMS) value instead of mean and variance:

$$
\text{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}
$$

Normalized vector and scaled output:

$$
\hat{x} = \frac{x}{\text{RMS}(x)}, \qquad y = \hat{x} \odot g
$$

where $g \in \mathbb{R}^d$ is a learned scaling parameter applied element-wise.

---

**Why RMSNorm?**

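* **No mean subtraction**, unlike LayerNorm: one fewer reduction per vector, so it is simpler and cheaper to compute.
* Normalizes magnitude only: before scaling by $g$, every output vector has RMS 1 (L2 norm $\sqrt{d}$), whatever the scale of the input.
* Works well in transformer architectures, and is sometimes reported to train more stably than LayerNorm.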
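
For comparison, here is the standard LayerNorm computation. The symbols $\mu$, $\sigma^2$, $\gamma$, $\beta$, and $\epsilon$ are introduced only for this comparison and are not used elsewhere in this section:

$$
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta
$$

When the features already have zero mean ($\mu = 0$), the variance reduces to $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} x_i^2 = \text{RMS}(x)^2$, so RMSNorm gives the same result as LayerNorm without the shift $\beta$ (up to $\epsilon$). The two differ only in whether the mean is removed first.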

---


**Worked example: suppose the input vector is**

$$
x = [3, 4]
$$

Dimension $d = 2$.

---

**Step 1: Compute RMS**

$$
\text{RMS}(x) = \sqrt{\frac{3^2 + 4^2}{2}} = \sqrt{\frac{9 + 16}{2}} = \sqrt{\frac{25}{2}} = \sqrt{12.5} \approx 3.5355
$$

---

**Step 2: Normalize vector**

$$
\hat{x} = \frac{x}{\text{RMS}(x)} = \left[\frac{3}{3.5355}, \frac{4}{3.5355}\right] \approx [0.849, 1.131]
$$

---

**Step 3: Scale with learned parameter $g = [g_1, g_2]$**

Assume $g = [1, 1]$ (no scaling initially):

$$
y = \hat{x} \odot g = [0.849, 1.131]
$$
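
The three steps can be checked numerically. A short sketch, assuming PyTorch; the variable names mirror the symbols used in the steps:

```python
import torch

x = torch.tensor([3.0, 4.0])
g = torch.ones(2)  # g = [1, 1], as assumed in Step 3

rms = torch.sqrt(torch.mean(x**2))  # Step 1: sqrt(12.5)
x_hat = x / rms                     # Step 2: normalize
y = x_hat * g                       # Step 3: element-wise scale

print(rms)  # approximately 3.5355
print(y)    # approximately [0.8485, 1.1314]
```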

---

| Aspect         | Normalization in general             | RMSNorm specifically                          |
| -------------- | ------------------------------------ | --------------------------------------------- |
| Problem solved | Stabilizes activations and gradients | Normalizes magnitude without mean subtraction |
| Computation    | Mean and variance, or RMS            | RMS only; simpler and cheaper                 |
| Use case       | Deep networks, transformers          | Effective alternative to LayerNorm            |
| Benefits       | Faster, more stable training         | Less computation, often with comparable stability |


```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fix the RNG seed so the random inputs are reproducible
```

- Compute the RMS per sample, over the last (feature) dimension.
- Normalize by dividing the input by its RMS.
- Scale by the learnable parameter $g$.
- Add a small epsilon inside the square root for numerical stability.

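```python
class RMSNorm(nn.Module):
    """Root-mean-square normalization over the last dimension."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps  # keeps the square root away from zero
        self.g = nn.Parameter(torch.ones(dim))  # learnable per-feature scale, initialized to 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS of each vector along the feature dimension; shape (..., 1)
        rms = torch.sqrt(torch.mean(x**2, dim=-1, keepdim=True) + self.eps)
        x_norm = x / rms
        return self.g * x_norm
```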


```python
dim = 3
inputs = torch.randn(2, 2, dim)  # a 2 x 2 grid of vectors, each of size dim
rms_norm = RMSNorm(dim)
outputs = rms_norm(inputs)
print(f"Shape:{outputs.shape}\n{outputs}")
```

```
Shape:torch.Size([2, 2, 3])
tensor([[[-0.2992, -1.1023,  1.3021],
         [-1.2794,  0.2985, -1.1287]],

        [[-1.0082,  1.3676, -0.3365],
         [ 1.3013,  0.5074,  1.0242]]], grad_fn=<MulBackward0>)
```
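
A quick sanity check on any RMSNorm output is that every normalized vector has an RMS close to 1, and that rescaling the input barely changes the result. The sketch below assumes PyTorch and uses a standalone function, `rms_normalize` (a name chosen for this example), that repeats the same computation without the learned scale so the block runs on its own:

```python
import torch

def rms_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Functional RMS normalization over the last dimension (no learned scale)."""
    return x / torch.sqrt(torch.mean(x**2, dim=-1, keepdim=True) + eps)

torch.manual_seed(0)
inputs = torch.randn(2, 2, 3)
outputs = rms_normalize(inputs)

# Each 3-element vector should now have an RMS close to 1.
out_rms = torch.sqrt(torch.mean(outputs**2, dim=-1))
print(out_rms)

# Rescaling the input should leave the output (almost) unchanged.
scaled_outputs = rms_normalize(100.0 * inputs)
print(torch.allclose(outputs, scaled_outputs, atol=1e-3))
```

The match is only approximate because of the epsilon term, which matters more when the input values are very small.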
