# RMS Normalization Implementation

## Problem Statement

**Title**: Implement RMS Normalization for Large Language Models

**Description**: You are tasked with implementing **Root Mean Square (RMS) Normalization**, a normalization technique used in Large Language Models (LLMs) like LLaMA, to stabilize training by scaling activations based on their root mean square value. RMSNorm is a simpler alternative to LayerNorm, omitting mean subtraction and bias terms. Your implementation should be integrated into a simple transformer-like block to demonstrate its effect on stabilizing activations. Use PyTorch to define the RMSNorm class and a small model, and train it on a synthetic dataset to verify that normalization maintains stable output scales while learning a target transformation.

## Mathematical Definition

For an input vector x ∈ R^d (dimension d), RMSNorm computes:

```
RMS(x) = √((1/d) * ∑_{i=1}^d x_i^2 + ε)
```

```
x̂_i = (x_i / RMS(x)) * g_i
```

where:
- ε: Small constant (e.g., 10^-5) for numerical stability
- g_i: Learnable scale parameter (vector of size d)
- x̂_i: Normalized output

For a batch X ∈ R^{n×d} (n samples), compute RMS per sample:

```
RMS(X_j) = √((1/d) * ∑_{i=1}^d X_{j,i}^2 + ε)
```

```
X̂_{j,i} = (X_{j,i} / RMS(X_j)) * g_i
```
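
As a quick numeric check of the per-sample formula, the sketch below applies it to a tiny batch that can be verified by hand. The helper name `rms_norm` is illustrative rather than part of the problem statement, and the gain g is fixed to ones.

```python
import torch

def rms_norm(x, g, eps=1e-5):
    # Hypothetical helper: RMS per sample is sqrt(mean of squares over the last dim + eps)
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * g

X = torch.tensor([[3.0, 4.0], [1.0, 1.0]])  # n = 2 samples, d = 2 features
g = torch.ones(2)                           # gain fixed to ones for the check
X_hat = rms_norm(X, g)

# Row 0: mean(x^2) = (9 + 16) / 2 = 12.5, so RMS is about 3.5355
# Row 1: mean(x^2) = 1, so RMS is about 1 and the row is left almost unchanged
print(X_hat)
print(X_hat.pow(2).mean(dim=-1))  # mean square per row, close to 1
```

After normalization every row has a mean square of roughly 1, regardless of its original magnitude.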

## Requirements

- Implement an `RMSNorm` class inheriting from `torch.nn.Module`
- Define `forward` to compute RMS normalization with learnable scale parameters
- Integrate RMSNorm into a simple model (e.g., RMSNorm followed by a linear layer)
- Use a synthetic dataset of random embeddings (100 samples, 64 dimensions)
- Train the model with Mean Squared Error (MSE) loss and Adam optimizer
- Evaluate the model's output stability (e.g., variance of normalized outputs) and loss convergence
- Provide detailed **Purpose** and **Theory** comments for each line of code

## Constraints

- Use only PyTorch for tensor operations and model definition (no scikit-learn or other ML libraries)
- Handle batch inputs (X ∈ R^{n×d})
- Ensure numerical stability with ε = 10^-5
- Ensure compatibility with PyTorch's autograd for training
- Use a learning rate of 0.01 and train for 1000 epochs

## Synthetic Dataset

- **Input**: X ∈ R^{100×64}, random embeddings drawn from a normal distribution N(0,1)
- **Target**: y = 0.5 · X + noise, where noise ~ N(0, 0.1²) (standard deviation 0.1), shape [100,64]
- **Test Data**: 2 samples, shape [2,64], to verify model generalization
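
One possible shape for the data-generation step is sketched below. The function name follows the outline used elsewhere in this document, but its signature, the four-value return, and the noise-free test target are assumptions made for illustration.

```python
import torch

def generate_synthetic_data(n_samples=100, d_model=64, n_test=2, noise_std=0.1, seed=42):
    # Assumed signature: a local generator keeps the global RNG state untouched
    gen = torch.Generator().manual_seed(seed)
    X_train = torch.randn(n_samples, d_model, generator=gen)
    noise = noise_std * torch.randn(n_samples, d_model, generator=gen)
    y_train = 0.5 * X_train + noise
    X_test = torch.randn(n_test, d_model, generator=gen)
    y_test = 0.5 * X_test  # assumption: noise-free reference for the test inputs
    return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = generate_synthetic_data()
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Keeping the test target noise-free makes the test error a direct measure of how well the learned mapping matches 0.5 · X.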

## Expected Output

- **Loss**: Decreases from ~0.1 to ~0.01 over 1000 epochs, indicating convergence
- **Normalized outputs**: Per-sample RMS equal to 1 before the learnable scale is applied, so the variance is close to 1 for zero-mean inputs such as this dataset
- **Test predictions**: Approximate y = 0.5 · X_test, with stable magnitudes due to RMSNorm

## Implementation Guidelines

### RMSNorm Class Structure

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-5):
        """
        Purpose: Initialize RMS Normalization layer
        Theory: RMSNorm normalizes by RMS value instead of mean/variance
        """
        super().__init__()
        # Initialize learnable scale parameters (g), starting at ones
        self.scale = torch.nn.Parameter(torch.ones(dim))
        # Initialize epsilon for numerical stability
        self.eps = eps

    def forward(self, x):
        """
        Purpose: Apply RMS normalization to input tensor
        Theory: Normalize by RMS and apply learnable scaling
        """
        # Compute RMS per sample (over the feature dimension)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        # Apply normalization and scaling, then return the normalized output
        return x / rms * self.scale
```

### Model Architecture

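```python
import torch

class SimpleRMSModel(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        """
        Purpose: Simple model with RMSNorm + Linear layer
        Theory: Demonstrates RMSNorm's stabilization effect
        """
        super().__init__()
        # RMSNorm layer (uses the RMSNorm class from the previous section)
        self.norm = RMSNorm(input_dim)
        # Linear transformation layer
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # Apply RMSNorm
        x = self.norm(x)
        # Apply linear transformation and return output
        return self.linear(x)
```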

### Training Pipeline

1. **Data Generation**: Create synthetic dataset with target transformation
2. **Model Initialization**: Initialize `SimpleRMSModel` with appropriate dimensions
3. **Training Loop**: 
   - Forward pass through model
   - Compute MSE loss
   - Backpropagation and optimization
   - Track loss convergence
4. **Evaluation**: Assess normalization stability and test predictions
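
The training-loop steps above can be sketched as a small reusable function. The signature of `train_model`, the `log_every` argument, and the returned loss history are assumptions for illustration. A plain `nn.Linear` stands in for the RMSNorm model so the block stays self-contained, and the demo runs only 200 epochs to keep it quick.

```python
import torch
import torch.nn as nn

def train_model(model, X, y, epochs=1000, lr=0.01, log_every=100):
    # Assumed signature: full-batch training with MSE loss and Adam
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for epoch in range(epochs):
        optimizer.zero_grad()             # clear gradients from the previous step
        loss = criterion(model(X), y)     # forward pass and MSE loss
        loss.backward()                   # backpropagation
        optimizer.step()                  # parameter update
        history.append(loss.item())       # track loss convergence
        if (epoch + 1) % log_every == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")
    return history

torch.manual_seed(0)
X = torch.randn(100, 64)
y = 0.5 * X + 0.1 * torch.randn(100, 64)
model = nn.Linear(64, 64)  # stand-in model; any nn.Module with matching shapes works
history = train_model(model, X, y, epochs=200)
```

Returning the loss history rather than only printing it makes it straightforward to check convergence programmatically.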

## Key Implementation Details

### Numerical Stability
- Add epsilon (ε = 1e-5) to prevent division by zero
- Use stable computation of RMS to avoid numerical issues
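
To illustrate why ε matters, the sketch below normalizes a degenerate all-zero sample with and without it. The variable names are illustrative, and the `torch.rsqrt` variant is shown only as an equivalent way to write the same computation.

```python
import torch

x = torch.zeros(1, 4)  # degenerate all-zero sample
eps = 1e-5

mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
unsafe = x / torch.sqrt(mean_sq)               # 0 / 0 produces NaN
safe = x / torch.sqrt(mean_sq + eps)           # 0 / sqrt(eps) produces 0
safe_rsqrt = x * torch.rsqrt(mean_sq + eps)    # same value via reciprocal square root

print(torch.isnan(unsafe).all().item())  # True
print(torch.isnan(safe).any().item())    # False
```

Adding ε inside the square root, as in the mathematical definition, keeps both the forward value and the gradient finite for such inputs.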

### Batch Processing
- Compute RMS per sample (across feature dimension)
- Maintain batch dimension throughout computation
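
A minimal sketch of the shape handling follows, assuming the feature dimension is always last. Reducing over `dim=-1` with `keepdim=True` lets the same expression serve both `[batch, d]` inputs and `[batch, seq_len, d]` inputs; the 3D case is an illustration beyond the 2D dataset used in this problem.

```python
import torch

eps = 1e-5
g = torch.ones(64)  # gain vector, broadcast over all leading dimensions

for shape in [(100, 64), (8, 16, 64)]:  # [batch, d] and [batch, seq_len, d]
    x = torch.randn(*shape)
    # keepdim=True leaves a trailing dimension of size 1 so the division broadcasts
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    out = x / rms * g
    print(tuple(x.shape), tuple(rms.shape), tuple(out.shape))
```

In both cases the RMS tensor has one value per sample (or per token), and the output keeps the input shape.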

### Learnable Parameters
- Initialize scale parameters (g) appropriately
- Ensure parameters are registered for gradient computation

### Gradient Flow
- Maintain differentiability for backpropagation
- Verify gradients flow through normalization layer
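
One way to check parameter registration and gradient flow is sketched below. It uses a compact `RMSNorm` equivalent to the class described in this document and `torch.autograd.gradcheck`, which compares analytical gradients with finite differences with respect to the inputs; the small sizes and double precision are choices made for the check, not requirements of the task.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))  # registered via nn.Parameter

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.scale

torch.manual_seed(0)
norm = RMSNorm(8).double()  # gradcheck expects double precision
x = torch.randn(4, 8, dtype=torch.double, requires_grad=True)

# The scale vector should appear among the module's parameters
param_names = [name for name, _ in norm.named_parameters()]
print(param_names)

# Backpropagate a scalar and confirm both the input and the scale received gradients
norm(x).sum().backward()
print(x.grad.shape, norm.scale.grad.shape)

# Numerical check of the gradient with respect to the input x
ok = torch.autograd.gradcheck(norm, (x,), eps=1e-6, atol=1e-4)
print(ok)
```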

## Evaluation Metrics

- **MSE Loss**: Training loss convergence over epochs
- **Variance Analysis**: Per-sample variance of the normalized outputs before scaling (should be ~1 for zero-mean inputs)
- **Test Predictions**: Model generalization on unseen data
- **Stability Check**: Output magnitude consistency
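
The metrics above could be gathered by a helper along the following lines. The signature of `evaluate_model` and the keys of the returned dictionary are assumptions, and the per-sample RMS of the predictions is used here as one simple proxy for output magnitude consistency. `nn.Identity` stands in for a trained model so the block runs on its own.

```python
import torch
import torch.nn as nn

def evaluate_model(model, X, y):
    # Assumed signature: returns a dict of scalar metrics
    model.eval()
    with torch.no_grad():
        preds = model(X)
        mse = nn.functional.mse_loss(preds, y).item()
        out_rms = preds.pow(2).mean(dim=-1).sqrt()  # one magnitude per sample
    return {
        "mse": mse,
        "output_rms_mean": out_rms.mean().item(),
        "output_rms_std": out_rms.std().item(),  # small value means consistent magnitudes
    }

torch.manual_seed(0)
X = torch.randn(100, 64)
metrics = evaluate_model(nn.Identity(), X, X)  # stand-in model that returns its input
print(metrics)
```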

## Code Structure

```python
import torch

# 1. RMSNorm implementation
class RMSNorm(torch.nn.Module):
    # Implementation with detailed comments
    ...

# 2. Simple model with RMSNorm
class SimpleRMSModel(torch.nn.Module):
    # Model architecture
    ...

# 3. Data generation
def generate_synthetic_data():
    # Create training and test datasets
    ...

# 4. Training function
def train_model():
    # Training loop with loss tracking
    ...

# 5. Evaluation function
def evaluate_model():
    # Compute evaluation metrics
    ...

# 6. Main execution
if __name__ == "__main__":
    # Run complete pipeline
    ...
```

## Expected Results

### Training Convergence
- Initial loss: ~0.1
- Final loss: ~0.01
- Smooth convergence over 1000 epochs

### Normalization Effect
- Normalized output variance ≈ 1.0
- Stable activation magnitudes
- Consistent scaling across batches

### Test Performance
- Predictions approximate y = 0.5 · X_test
- Stable output magnitudes
- Good generalization to unseen data

## Comparison with LayerNorm

| Feature | RMSNorm | LayerNorm |
|---------|---------|-----------|
| Mean Subtraction | No | Yes |
| Bias Term | No | Yes |
| Computational Cost | Lower | Higher |
| Parameters | d (scale only) | 2d (scale + bias) |
| Stability | Good | Excellent |
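
The first rows of the table can be checked directly. The sketch below compares PyTorch's `nn.LayerNorm` with an inline RMS normalization on inputs that have a deliberately non-zero mean; the shift of 3.0 and the variable names are choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
x = torch.randn(4, d) + 3.0  # inputs with a clearly non-zero mean

layer_norm = nn.LayerNorm(d, eps=1e-5)  # learnable scale and bias
g = torch.ones(d)                       # RMSNorm gain only (kept at ones here)

with torch.no_grad():
    ln_out = layer_norm(x)
    rms_out = x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-5) * g

n_ln_params = sum(p.numel() for p in layer_norm.parameters())
print(n_ln_params, g.numel())         # 2d versus d
print(ln_out.mean(dim=-1))            # close to 0: LayerNorm subtracts the mean
print(rms_out.mean(dim=-1))           # clearly non-zero: RMSNorm keeps the mean
print(rms_out.pow(2).mean(dim=-1))    # close to 1: unit RMS per sample
```

For inputs whose mean is far from zero, unit RMS no longer implies unit variance, which is worth keeping in mind when reading the variance metric.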

## Usage Example

```python
# Create model
model = SimpleRMSModel(input_dim=64, output_dim=64)

# Generate data (assumes generate_synthetic_data returns both training and test splits)
X_train, y_train, X_test, y_test = generate_synthetic_data()

# Train model
train_model(model, X_train, y_train, epochs=1000)

# Evaluate
evaluate_model(model, X_test, y_test)
```

## Deliverables

1. Complete `RMSNorm` class implementation
2. `SimpleRMSModel` integration
3. Synthetic dataset generation
4. Training pipeline with loss tracking
5. Evaluation metrics and analysis
6. Detailed code comments explaining purpose and theory
7. Results validation and comparison




import torch
# Purpose: Import PyTorch for tensor operations and neural network functionality.
# Theory: PyTorch provides tensors with GPU support and autograd for automatic differentiation, essential for RMSNorm and training.

import torch.nn as nn
# Purpose: Import neural network modules to define RMSNorm and model classes.
# Theory: nn.Module enables custom layers like RMSNorm to integrate with PyTorch’s autograd and parameter management.

import torch.optim as optim
# Purpose: Import optimization algorithms like Adam for updating model parameters.
# Theory: Adam uses adaptive learning rates (momentum and squared gradients), effective for training with normalization layers.

# Set random seed for reproducibility
torch.manual_seed(42)
# Purpose: Fix the random seed to ensure consistent data generation and model initialization.
# Theory: Reproducibility aligns with previous TorchLeet problems (e.g., KL Divergence, DNN), using seed 42 for consistency.

# Generate synthetic data
n_samples, d_model = 100, 64
# Purpose: Define dataset size (100 samples) and embedding dimension (64).
# Theory: Simulates transformer input embeddings (e.g., hidden states in LLMs). 64 is a typical small hidden size for testing.

X = torch.randn(n_samples, d_model)
# Purpose: Generate random input embeddings, shape [100, 64].
# Theory: Normal distribution (mean 0, std 1) mimics activations in transformer layers, suitable for testing normalization.

y = X * 0.5 + torch.randn(n_samples, d_model) * 0.1
# Purpose: Generate target outputs as a scaled version of X with Gaussian noise, shape [100, 64].
# Theory: Simulates a regression task where the model learns y = 0.5 * X + noise. Noise (std 0.1) adds realism.

# Define RMSNorm
class RMSNorm(nn.Module):
    # Purpose: Define RMSNorm layer for normalizing activations in LLMs.
    # Theory: Normalizes inputs by their root mean square, scaling with learnable parameters, to stabilize training.
    
    def __init__(self, d_model, eps=1e-5):
        # Purpose: Initialize RMSNorm with embedding dimension and epsilon.
        # Theory: d_model is the feature dimension (e.g., 64); eps prevents division by zero in RMS computation.
        
        super(RMSNorm, self).__init__()
        # Purpose: Call parent nn.Module constructor to set up the module.
        # Theory: Registers parameters and enables autograd integration for training.
        
        self.eps = eps
        # Purpose: Store epsilon as an instance variable for numerical stability.
        # Theory: Small constant (1e-5) ensures the RMS denominator is non-zero, preventing division errors.
        
        self.scale = nn.Parameter(torch.ones(d_model))
        # Purpose: Initialize learnable scale parameters as a vector of ones, shape [d_model].
        # Theory: Scales normalized outputs, allowing the model to learn optimal magnitudes. Initialized to 1 for stable starting point.
    
    def forward(self, x):
        # Purpose: Compute RMSNorm for input tensor x.
        # Theory: Normalizes x by dividing by RMS(x) = sqrt(mean(x^2) + ε), then scales with learnable scale. x shape: [batch_size, d_model].
        
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        # Purpose: Compute RMS for each sample in the batch.
        # Theory: mean(x^2, dim=-1) averages squared elements along the feature dimension; ε is added before the square root for stability. Shape: [batch_size, 1].
        
        x_norm = x / rms
        # Purpose: Normalize input by dividing by RMS.
        # Theory: Rescales each sample (row) to have unit RMS, keeping activation magnitudes consistent across samples. Output shape: [batch_size, d_model].
        
        return x_norm * self.scale
        # Purpose: Apply learnable scale to normalized output.
        # Theory: Element-wise multiplication with scale parameter allows adaptive magnitudes. Output shape: [batch_size, d_model].

# Define a simple model with RMSNorm
class SimpleTransformerBlock(nn.Module):
    # Purpose: Define a transformer-like block with RMSNorm and a linear layer.
    # Theory: Mimics a transformer’s feed-forward block, testing RMSNorm’s stabilization in a realistic setting.
    
    def __init__(self, d_model):
        super(SimpleTransformerBlock, self).__init__()
        # Purpose: Initialize parent nn.Module class.
        # Theory: Ensures proper parameter registration for autograd and model training.
        
        self.norm = RMSNorm(d_model)
        # Purpose: Initialize RMSNorm layer for input normalization.
        # Theory: Normalizes inputs before linear transformation, as in LLMs like LLaMA.
        
        self.linear = nn.Linear(d_model, d_model)
        # Purpose: Initialize a linear layer to transform normalized inputs.
        # Theory: Applies z = Wx + b, where W is [d_model, d_model], b is [d_model], simulating a feed-forward layer.
    
    def forward(self, x):
        # Purpose: Define forward pass through normalization and linear layers.
        # Theory: Normalizes inputs to stabilize training, then applies a linear transformation.
        
        x = self.norm(x)
        # Purpose: Apply RMSNorm to input tensor.
        # Theory: Stabilizes activations, ensuring consistent scales before further processing.
        
        return self.linear(x)
        # Purpose: Apply linear transformation to normalized input.
        # Theory: Outputs transformed embeddings, shape [batch_size, d_model], for downstream tasks.

# Initialize the model, loss function, and optimizer
model = SimpleTransformerBlock(d_model)
# Purpose: Create an instance of the transformer block.
# Theory: nn.Linear initializes its weights with PyTorch's default Kaiming-uniform scheme, and the RMSNorm scale starts at ones; all are tracked by autograd.

criterion = nn.MSELoss()
# Purpose: Define Mean Squared Error loss for regression.
# Theory: MSE = (1/n) * sum((y_pred - y_true)^2), suitable for regression, consistent with DNN regression (Day 4).

optimizer = optim.Adam(model.parameters(), lr=0.01)
# Purpose: Initialize Adam optimizer with learning rate 0.01.
# Theory: Adam keeps running estimates of the gradient's first and second moments (defaults β1=0.9, β2=0.999) and uses them to scale each parameter's step.

# Training loop
epochs = 1000
# Purpose: Set the number of training iterations to 1000 epochs.
# Theory: Sufficient epochs ensure convergence, aligning with previous TorchLeet problems (e.g., KL Divergence).

for epoch in range(epochs):
    # Purpose: Iterate over the dataset for training.
    # Theory: Each epoch updates parameters to minimize loss, testing RMSNorm’s effect on stability.
    
    # Forward pass
    predictions = model(X)
    # Purpose: Compute model predictions by passing input through the model.
    # Theory: X [100, 64] produces predictions [100, 64] via RMSNorm and linear layer.
    
    loss = criterion(predictions, y)
    # Purpose: Calculate MSE loss between predictions and targets.
    # Theory: Computes scalar loss for optimization, measuring prediction accuracy.
    
    # Backward pass and optimization
    optimizer.zero_grad()
    # Purpose: Reset gradients of all parameters to zero.
    # Theory: Prevents gradient accumulation from previous iterations, ensuring correct updates.
    
    loss.backward()
    # Purpose: Compute gradients of the loss with respect to model parameters.
    # Theory: Autograd backpropagates through linear layer, RMSNorm, and scale parameters.
    
    optimizer.step()
    # Purpose: Update model parameters using computed gradients.
    # Theory: Adam applies adaptive updates to minimize loss, leveraging momentum.
    
    # Log progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        # Purpose: Print training progress to monitor convergence.
        # Theory: Loss monitoring helps detect issues like instability or poor learning rates.
        
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")
        # Purpose: Display epoch number and loss value.
        # Theory: loss.item() extracts the scalar loss for readable output.

# Evaluate normalized output variance
with torch.no_grad():
    # Purpose: Disable gradient tracking for evaluation.
    # Theory: Saves memory and computation during inference.
    
    rms = torch.sqrt(torch.mean(X ** 2, dim=-1, keepdim=True) + model.norm.eps)
    normalized_output = X / rms
    # Purpose: Recompute the normalized activations without the learned scale.
    # Theory: model.norm(X) returns x_norm * scale, and scale has moved away from ones during training, so it would not measure the pre-scaling variance.
    
    variance = torch.var(normalized_output, dim=-1).mean().item()
    # Purpose: Calculate the mean per-sample variance of the normalized activations.
    # Theory: Each sample has unit RMS; because X is drawn with zero mean, its per-sample variance is close to 1.
    
    print(f"Normalized Output Variance (before scaling): {variance:.4f}")
    # Purpose: Print the variance to verify normalization.
    # Theory: A value near 1 confirms that RMSNorm produces unit-RMS activations before the scale parameters are applied.

# Testing on new data
X_test = torch.randn(2, d_model)
# Purpose: Generate test inputs, shape [2, 64].
# Theory: Tests model generalization on new random embeddings.

with torch.no_grad():
    # Purpose: Disable gradient tracking for test inference.
    # Theory: Ensures efficient evaluation without gradient computation.
    
    predictions = model(X_test)
    # Purpose: Compute predictions for test inputs.
    # Theory: Outputs [2, 64], approximating 0.5 * X_test (the noise term cannot be predicted), with stable scales.
    
    print(f"Test Predictions (first 5 dims): {predictions[:, :5].tolist()}")
    # Purpose: Print first 5 dimensions of test predictions for readability.
    # Theory: Shows model output, expected to align with target transformation.