Normalization is a fundamental technique used in deep learning to stabilize and accelerate training.  

### 🔍 Why Normalize?

- Reduces **internal covariate shift**, the drift in each layer's input distribution during training, by keeping activations within a stable range.
- Helps gradients flow more reliably through the network.
- Often leads to **faster convergence**, **better generalization**, and **less sensitivity** to hyperparameters such as the learning rate and weight initialization.

### 🧠 Two Common Types

| Type         | Normalizes Across                  | Commonly Used In        |
|--------------|------------------------------------|--------------------------|
| BatchNorm    | The batch dimension (per feature)  | CNNs, vision models      |
| LayerNorm    | The feature dimension (per sample) | Transformers, NLP        |
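
The difference in the table comes down to which axis the statistics are computed over. The short sketch below uses a made-up `(batch_size, num_features)` tensor (the values are illustrative, not from the notebook) to show the two choices side by side:

```python
import torch

# Illustrative input: 2 samples, 3 features each
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# BatchNorm-style statistics: one mean per feature, taken over the batch (dim=0)
batch_mean = x.mean(dim=0)   # shape (num_features,)

# LayerNorm-style statistics: one mean per sample, taken over the features (dim=1)
layer_mean = x.mean(dim=1)   # shape (batch_size,)

print(batch_mean)  # tensor([2.5000, 3.5000, 4.5000])
print(layer_mean)  # tensor([2., 5.])
```

Because the BatchNorm-style mean mixes values from different samples, its result depends on what else is in the batch; the LayerNorm-style mean depends only on the sample itself.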

We'll now implement both `BatchNorm1d` and `LayerNorm` to understand how they work.

*Note:* More details about normalization can be found in the corresponding PDF file, where the topic is covered in **greater depth**.


# 1. Batch Normalization

In [None]:
import torch

def manual_batch_norm(x, gamma, beta, eps=1e-5):
  """
  x     : input tensor of shape (batch_size, num_features)
  gamma : learnable scale of shape (num_features)
  beta  : learnable shift of shape (num_features)
  eps   : constant to avoid division by zero -> numerical stability

  returns normalized tensor of shape (batch_size, num_features)"""

  # Step 1 : Compute Mean and Variance per feature (across the batch -> axis = 0)
  mean = x.mean(dim=0, keepdim=True)
  var = x.var(dim=0, unbiased=False, keepdim=True)  # biased variance (divide by batch_size)

  # Step 2 : Normalize
  x_normalized = (x - mean) / torch.sqrt(var + eps)

  # Step 3 : Scale & Shift
  out = gamma * x_normalized + beta

  return out

In [None]:
# Test
x = torch.tensor(
    [[1, 2],
    [3, 4],
    [5, 6]], dtype=float
)

gamma = torch.tensor([1, -2], dtype=float) # scale
beta = torch.tensor([0, 0], dtype=float) # shift

out = manual_batch_norm(x, gamma, beta)

print(out)

tensor([[-1.2247,  2.4495],
        [ 0.0000,  0.0000],
        [ 1.2247, -2.4495]], dtype=torch.float64)
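
As a sanity check, the same per-feature computation can be compared with PyTorch's built-in `torch.nn.functional.batch_norm`. The sketch below is a minimal comparison under two assumptions: the built-in is called in training mode (`training=True`), so it normalizes with the current batch statistics, and no running statistics are tracked (`None` is passed for both). The manual part is recomputed inline so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]], dtype=torch.float64)
gamma = torch.tensor([1.0, -2.0], dtype=torch.float64)  # scale
beta = torch.zeros(2, dtype=torch.float64)              # shift
eps = 1e-5

# Manual version: per-feature mean and biased variance over the batch
mean = x.mean(dim=0, keepdim=True)
var = x.var(dim=0, unbiased=False, keepdim=True)
manual = gamma * (x - mean) / torch.sqrt(var + eps) + beta

# Built-in version: training mode, no running mean/variance tracked
builtin = F.batch_norm(x, None, None, weight=gamma, bias=beta,
                       training=True, eps=eps)

print(torch.allclose(manual, builtin))  # True
```

In evaluation mode the built-in layer would use stored running statistics instead of the batch statistics, which this manual version does not model.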


# 2. Layer Normalization

In [None]:
def manual_layer_normalization(x, gamma, beta, eps=1e-5):
  """
  x       : input tensor of shape (batch_size, num_features)
  gamma   : learnable scale of shape (num_features)
  beta    : learnable shift of shape (num_features)
  eps     : constant to avoid division by zero -> numerical stability

  returns normalized tensor of shape (batch_size, num_features)"""

  # Step 1 : Calculate mean and variance per sample (across the features of the same sample)
  mean = x.mean(dim=1, keepdim=True)
  # Use the biased variance (divide by num_features), as torch.nn.LayerNorm does;
  # the default unbiased=True would divide by num_features - 1 instead
  variance = x.var(dim=1, keepdim=True, unbiased=False)

  # Step 2 : Normalize
  x_hat = (x - mean) / torch.sqrt(variance + eps)

  # Step 3 : Scale & Shift
  out = gamma * x_hat + beta

  return out

In [None]:
# Test
x = torch.tensor(
    [[1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]], dtype=float
)

gamma = torch.ones(3)
beta = torch.zeros(3)

out = manual_layer_normalization(x, gamma, beta)

print(out)

tensor([[-1.2247,  0.0000,  1.2247],
        [-1.2247,  0.0000,  1.2247],
        [-1.2247,  0.0000,  1.2247],
        [-1.2247,  0.0000,  1.2247]], dtype=torch.float64)
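
As a sanity check, a per-sample normalization that uses the biased variance (dividing by the number of features rather than by one less) can be compared with PyTorch's built-in `torch.nn.functional.layer_norm`. The sketch below recomputes the manual version inline so it runs on its own; the two-row input is illustrative, and all tensors are assumed to share the same dtype.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]], dtype=torch.float64)
gamma = torch.ones(3, dtype=torch.float64)   # scale
beta = torch.zeros(3, dtype=torch.float64)   # shift
eps = 1e-5

# Manual version: per-sample mean and biased variance over the features
mean = x.mean(dim=1, keepdim=True)
var = x.var(dim=1, unbiased=False, keepdim=True)
manual = gamma * (x - mean) / torch.sqrt(var + eps) + beta

# Built-in version, normalizing over the last dimension (3 features)
builtin = F.layer_norm(x, normalized_shape=(3,), weight=gamma, bias=beta, eps=eps)

print(torch.allclose(manual, builtin))  # True
```

Switching the manual variance to the unbiased estimator (`unbiased=True`) would give values of about ±1.0 for these rows instead of ±1.2247, so the check above would no longer pass.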
