# Implement Sinusoidal Positional Embeddings from Scratch

Description: Implement Sinusoidal Positional Embeddings as described in Vaswani et al. (2017) to provide Transformers with sequence order information, since attention mechanisms are inherently order-agnostic. The class SinusoidalPositionalEmbedding(nn.Module) generates deterministic positional encodings using sine and cosine functions with varying frequencies, stored in a non-trainable buffer tensor of shape (max_seq_len, d_model). The forward method returns encodings for an input tensor’s sequence length, broadcastable over the batch dimension. The implementation must use only PyTorch operations, avoid external libraries like Hugging Face or fairseq, and support sequences up to max_seq_len. Tests will verify shape correctness and numerical properties, ensuring compatibility with Transformer inputs.

Mathematical Definition:

Inputs:

max_seq_len: Maximum sequence length (e.g., 100).
d_model: Embedding dimension (e.g., 64).
Input tensor $ x \in \mathbb{R}^{N \times L \times d_{\text{model}}} $, where $ N $ is batch size, $ L \leq \text{max\_seq\_len} $.


Positional Encoding:

For position $ pos \in \{0, 1, \ldots, \text{max\_seq\_len}-1\} $ and pair index $ i \in \{0, 1, \ldots, d_{\text{model}}/2 - 1\} $ (so that $ 2i $ and $ 2i+1 $ together cover every dimension):
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$$

Frequency term: $ \omega_i = 10000^{-2i / d_{\text{model}}} = e^{-(2i / d_{\text{model}}) \cdot \ln(10000)} $.
Resulting tensor: $ PE \in \mathbb{R}^{\text{max\_seq\_len} \times d_{\text{model}}} $.
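
As a cross-check on the definition above, the formula can be evaluated element by element in plain Python. This is only a sketch for building intuition: the helper name `reference_pe` is hypothetical, and the required implementation should still use vectorized PyTorch operations.

```python
import math

def reference_pe(max_seq_len, d_model):
    """Hypothetical helper: evaluate PE(pos, 2i) and PE(pos, 2i+1) from the definition."""
    assert d_model % 2 == 0, "d_model must be even so dimensions pair up"
    table = []
    for pos in range(max_seq_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):           # i indexes (sin, cos) pairs
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)        # even dimension
            row[2 * i + 1] = math.cos(angle)    # odd dimension
        table.append(row)
    return table

pe_small = reference_pe(max_seq_len=4, d_model=4)
print(pe_small[0])  # [0.0, 1.0, 0.0, 1.0], since sin(0) = 0 and cos(0) = 1

# The power form and the exp/log form of the frequency term agree.
d_model = 4
for i in range(d_model // 2):
    power_form = 10000 ** (-2 * i / d_model)
    exp_form = math.exp(-(2 * i / d_model) * math.log(10000))
    assert math.isclose(power_form, exp_form, rel_tol=1e-9)
```

The exp/log form is the one usually computed in practice, because it maps directly onto `torch.exp` applied to a vector of even indices.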


Forward Pass:

Given input $ x \in \mathbb{R}^{N \times L \times d_{\text{model}}} $, return $ PE[:L, :] $ with a leading singleton batch axis, i.e. a tensor in $ \mathbb{R}^{1 \times L \times d_{\text{model}}} $ that is broadcastable over the batch dimension.


Output:

Positional encodings of shape $ (1, L, d_{\text{model}}) $, added to token embeddings in a Transformer.
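
To illustrate why a leading axis of size 1 is enough, the sketch below adds a `(1, L, d_model)` tensor to an `(N, L, d_model)` tensor. Both tensors are stand-ins invented for the illustration (they are not real encodings or token embeddings); the point is only the broadcasting behaviour.

```python
import torch

N, L, d_model = 3, 5, 8

# Stand-in for PE[:L, :] with a batch axis of size 1 (values are arbitrary).
pe_slice = torch.arange(L * d_model, dtype=torch.float32).reshape(1, L, d_model)

# Stand-in for token embeddings; zeros make the broadcast result easy to inspect.
token_emb = torch.zeros(N, L, d_model)

out = token_emb + pe_slice  # the size-1 batch axis is broadcast over N
print(out.shape)  # torch.Size([3, 5, 8])

# Every batch element received the same positional slice.
assert all(torch.equal(out[n], pe_slice[0]) for n in range(N))
```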



Requirements:

Implement SinusoidalPositionalEmbedding(nn.Module):

Initialize with max_seq_len and d_model.
Compute $ PE $ tensor using sine/cosine functions.
Register $ PE $ as a non-trainable buffer via self.register_buffer.
In forward(x), return encodings for the input’s sequence length.


Use PyTorch operations (torch.sin, torch.cos, torch.arange, torch.exp).
Test with max_seq_len=100, d_model=64, and sequence length 50:

Verify output shape: $ (1, 50, 64) $.
Check numerical properties (e.g., sine/cosine alternation, bounded values).


Provide detailed Purpose and Theory comments.
Avoid integrating with token embeddings (focus on positional encodings only).
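
The slicing step in `forward` can be sketched in isolation as follows. The helper `slice_encodings` and the explicit length check are assumptions added for illustration: the requirements only state that lengths up to max_seq_len must be supported, and they do not say what should happen beyond that.

```python
import torch

max_seq_len, d_model = 6, 4
pe = torch.zeros(max_seq_len, d_model)  # stand-in for the registered PE buffer

def slice_encodings(pe: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return PE[:L] with a leading batch axis of size 1."""
    seq_len = x.shape[1]
    if seq_len > pe.shape[0]:
        # Assumption: fail loudly instead of letting the slice silently truncate.
        raise ValueError(f"sequence length {seq_len} exceeds max_seq_len {pe.shape[0]}")
    return pe[:seq_len, :].unsqueeze(0)

print(slice_encodings(pe, torch.rand(2, 3, d_model)).shape)  # torch.Size([1, 3, 4])

# Plain slicing past the end does not raise; it clamps to the buffer length.
print(pe[:10].shape)  # torch.Size([6, 4])
```

Without a check of this kind, an over-long input would only fail later, when the truncated encodings are broadcast against the input.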

Constraints:

Use only PyTorch operations (no Hugging Face, fairseq, or built-in positional encoding modules).
Ensure $ PE $ is not a trainable parameter.
Support sequence lengths $ L \leq \text{max\_seq\_len} $.
Output shape: $ (1, L, d_{\text{model}}) $.
Ensure numerical stability (values in $[-1, 1]$).
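
The "not a trainable parameter" constraint can be checked on a toy module. `TinyBufferDemo` below is a hypothetical class used only to show how a tensor registered with `register_buffer` differs from an `nn.Parameter`.

```python
import torch
import torch.nn as nn

class TinyBufferDemo(nn.Module):
    """Hypothetical module holding one buffer and no parameters."""
    def __init__(self):
        super().__init__()
        self.register_buffer("pe", torch.zeros(4, 2))

demo = TinyBufferDemo()
param_names = [name for name, _ in demo.named_parameters()]
buffer_names = [name for name, _ in demo.named_buffers()]

print(param_names)                # []
print(buffer_names)               # ['pe']
print("pe" in demo.state_dict())  # True: buffers are saved with the model by default
print(demo.pe.requires_grad)      # False: an optimizer has nothing to update
```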

Synthetic Dataset:

Inputs:

Input tensor: Random tensor of shape $(3, 4, 8)$, generated with torch.rand and seed 42 (provided code).
Test case: max_seq_len=100, d_model=64, sequence length 50.


Test Cases:

Shape test: Output shape $(1, 50, 64)$.
Numerical test: Verify even indices use sine, odd use cosine.
Boundary test: Sequence length 1 and max_seq_len.
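
One way the numerical test could go beyond a single position is sketched below. It rebuilds the table with the same vectorized formula and checks properties that hold for every position; the tolerances are assumptions chosen for float32 arithmetic.

```python
import math
import torch

max_seq_len, d_model = 100, 64
position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
)
pe = torch.zeros(max_seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

# Each (sin, cos) pair shares one angle, so sin^2 + cos^2 = 1 for every pair.
pair_norm = pe[:, 0::2] ** 2 + pe[:, 1::2] ** 2
assert torch.allclose(pair_norm, torch.ones_like(pair_norm), atol=1e-5)

# At position 0 every even dimension is sin(0) = 0 and every odd one is cos(0) = 1.
assert torch.equal(pe[0, 0::2], torch.zeros(d_model // 2))
assert torch.equal(pe[0, 1::2], torch.ones(d_model // 2))

# Spot-check one entry against the power form of the definition (2i = 10).
expected = math.sin(7 / 10000 ** (10 / d_model))
assert abs(pe[7, 10].item() - expected) < 1e-5

# Boundedness.
assert pe.abs().max().item() <= 1.0
```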

In [1]:
import torch
# Purpose: Import PyTorch for tensor operations.
# Theory: Provides tensor computations and autograd support.

import torch.nn as nn
# Purpose: Import neural network module for nn.Module.
# Theory: Base class for positional embedding module.

import math
# Purpose: Import math for logarithmic computations.
# Theory: Used to compute frequency terms.

# Set random seed for reproducibility
torch.manual_seed(42)
# Purpose: Fix random seed for consistent testing.
# Theory: Ensures reproducible results across runs.

class SinusoidalPositionalEmbedding(nn.Module):
    def __init__(self, max_seq_len: int, d_model: int):
        """
        Initializes the sinusoidal positional embedding.
        
        Args:
            max_seq_len (int): Maximum sequence length.
            d_model (int): Embedding dimension.
        """
        # Purpose: Initialize the positional embedding module.
        # Theory: Sets up fixed sinusoidal encodings for positions.
        
        super().__init__()
        # Purpose: Initialize parent nn.Module class.
        # Theory: Enables module functionality (e.g., buffer registration).
        
        assert d_model % 2 == 0, "d_model must be even for sin/cos pairs"
        # Purpose: Ensure d_model supports even/odd sin/cos splitting.
        # Theory: Each pair of dimensions uses sin and cos.
        
        # Create positional encoding tensor
        pe = torch.zeros(max_seq_len, d_model)
        # Purpose: Initialize PE tensor of shape (max_seq_len, d_model).
        # Theory: Stores encodings for all positions up to max_seq_len.
        
        # Compute position indices: [0, 1, ..., max_seq_len-1]
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        # Purpose: Create position indices tensor of shape (max_seq_len, 1).
        # Theory: Represents sequence positions for encoding.
        
        # Compute frequency terms: exp(-2i/d_model * ln(10000))
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # Purpose: Compute frequency terms of shape (d_model/2,).
        # Theory: Defines geometric frequency progression for encodings.
        
        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Purpose: Fill PE tensor with sin/cos values.
        # Theory: Even dims use sin(pos/10000^(2i/d_model)), odd use cos.
        
        # Register PE as a buffer
        self.register_buffer("pe", pe)
        # Purpose: Store PE as a non-trainable buffer.
        # Theory: Ensures PE is saved with model but not optimized.
    
    def forward(self, x):
        """
        Returns the positional embedding for the input tensor's sequence length.
        
        Args:
            x (Tensor): Input tensor of shape (batch_size, seq_len, d_model).
        
        Returns:
            Tensor: Positional embeddings of shape (1, seq_len, d_model).
        """
        # Purpose: Return positional encodings for input sequence length.
        # Theory: Slices PE tensor to match input seq_len, broadcasts over batch.
        
        return self.pe[:x.shape[1], :].unsqueeze(0)
        # Purpose: Slice PE and add batch dimension.
        # Theory: Shape (1, seq_len, d_model), broadcastable to (N, seq_len, d_model).

# Test implementation
if __name__ == "__main__":
    # Purpose: Test SinusoidalPositionalEmbedding for correctness.
    # Theory: Verifies shape, numerical properties, and buffer registration.
    
    # Test parameters
    max_seq_len, d_model = 100, 64
    seq_len = 50
    batch_size = 3
    # Purpose: Define test dimensions.
    # Theory: Matches problem requirements, tests realistic values.
    
    # Initialize module
    pos_emb = SinusoidalPositionalEmbedding(max_seq_len, d_model)
    # Purpose: Create positional embedding instance.
    # Theory: Initializes PE tensor with sin/cos encodings.
    
    # Test shape
    x = torch.rand(batch_size, seq_len, d_model)
    pe = pos_emb(x)
    # Purpose: Generate positional encodings for input tensor.
    # Theory: Tests forward pass with seq_len=50.
    
    print("Input shape:", x.shape)
    print("Output shape:", pe.shape)
    assert pe.shape == (1, seq_len, d_model), f"Expected shape (1, {seq_len}, {d_model}), got {pe.shape}"
    # Purpose: Verify output shape.
    # Theory: Ensures shape (1, seq_len, d_model).
    
    print("Shape test passed!")
    # Purpose: Confirm shape test success.
    # Theory: Validates correct tensor slicing.
    
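    # Test numerical correctness
    assert torch.allclose(pe[0, 0, 0], torch.sin(torch.tensor(0.0)), atol=1e-6), "PE[0,0] should be sin(0)"
    assert torch.allclose(pe[0, 0, 1], torch.cos(torch.tensor(0.0)), atol=1e-6), "PE[0,1] should be cos(0)"
    assert torch.allclose(pe[0, 1, 0], torch.sin(torch.tensor(1.0)), atol=1e-6), "PE[1,0] should be sin(1)"
    assert torch.allclose(pe[0, 1, 1], torch.cos(torch.tensor(1.0)), atol=1e-6), "PE[1,1] should be cos(1)"
    assert pe.abs().max() <= 1.0, "PE values should be in [-1, 1]"
    # Purpose: Spot-check sin/cos alternation at positions 0 and 1, and verify boundedness.
    # Theory: For dims 0 and 1 the frequency is 1, so PE[pos, 0] = sin(pos) and PE[pos, 1] = cos(pos).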
    
    print("Numerical test passed!")
    # Purpose: Confirm numerical test success.
    # Theory: Validates correct encoding computation.
    
    # Test buffer registration
    assert "pe" in dict(pos_emb.named_buffers()), "PE should be registered as a buffer"
    assert not pos_emb.pe.requires_grad, "PE buffer should not require gradients"
    assert not any(p.requires_grad for p in pos_emb.parameters()), "No trainable parameters expected"
    # Purpose: Verify PE is a buffer, not a parameter.
    # Theory: Ensures non-trainable property.
    
    print("Buffer registration test passed!")
    # Purpose: Confirm buffer test success.
    # Theory: Validates correct module setup.
    
    # Test boundary cases
    x_short = torch.rand(batch_size, 1, d_model)
    x_max = torch.rand(batch_size, max_seq_len, d_model)
    pe_short = pos_emb(x_short)
    pe_max = pos_emb(x_max)
    # Purpose: Test sequence lengths 1 and max_seq_len.
    # Theory: Ensures robustness across sequence lengths.
    
    assert pe_short.shape == (1, 1, d_model), "Short sequence shape incorrect"
    assert pe_max.shape == (1, max_seq_len, d_model), "Max sequence shape incorrect"
    # Purpose: Verify boundary case shapes.
    # Theory: Confirms slicing works for edge cases.
    
    print("Boundary test passed!")
    # Purpose: Confirm boundary test success.
    # Theory: Validates support for all valid sequence lengths.
    
    # Test integration with token embeddings (usage demonstration only)
    token_emb = torch.rand(batch_size, seq_len, d_model)
    input_emb = token_emb + pos_emb(token_emb)
    # Purpose: Simulate Transformer input by adding PE to token embeddings.
    # Theory: Demonstrates typical usage, though not required.
    
    print("Input embedding shape:", input_emb.shape)
    assert input_emb.shape == (batch_size, seq_len, d_model), "Input embedding shape incorrect"
    # Purpose: Verify combined embedding shape.
    # Theory: Ensures PE broadcasts correctly over batch.
    
    print("Integration test passed!")
    # Purpose: Confirm integration test success.
    # Theory: Validates PE’s role in Transformer pipeline.

Input shape: torch.Size([3, 50, 64])
Output shape: torch.Size([1, 50, 64])
Shape test passed!
Numerical test passed!
Buffer registration test passed!
Boundary test passed!
Input embedding shape: torch.Size([3, 50, 64])
Integration test passed!
