# Notebook 05: Building a Transformer Encoder Block

## Assembling the Complete Architecture

Now we'll combine everything we've learned to build a complete transformer encoder! In this notebook:

1. **Multi-Head Self-Attention** - The core mechanism
2. **Feed-Forward Networks** - Position-wise transformations
3. **Layer Normalization** - Stabilizing training
4. **Residual Connections** - Enabling deep networks
5. **Complete Encoder Block** - Putting it all together

This is the building block behind encoder models such as BERT; decoder-only LLMs such as GPT use a closely related block with causal masking!

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Part 1: Multi-Head Self-Attention (Review)

Let's implement a clean, reusable version:

In [None]:
class MultiHeadAttention(nn.Module):
    """Multi-head self-attention mechanism."""
    
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        # Linear layers for Q, K, V, and output
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        batch_size = query.size(0)
        
        # Linear projections in batch
        Q = self.W_Q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_K(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_V(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        attn_output = torch.matmul(attn_weights, V)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, -1, self.d_model)
        
        # Final linear layer
        output = self.W_O(attn_output)
        
        return output

# Test
mha = MultiHeadAttention(d_model=512, n_heads=8).to(device)
x = torch.randn(2, 10, 512, device=device)
output = mha(x, x, x)
print(f"✅ Multi-Head Attention: {x.shape} → {output.shape}")
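
The `mask` argument is applied to `scores`, whose shape is `(batch, n_heads, seq_len, seq_len)`, so whatever is passed in has to broadcast to that shape. The following is a minimal sketch of building a padding mask from token ids and applying the same `masked_fill` and softmax steps as in `forward()`. The padding id `pad_id = 0` and the toy token tensor are assumptions for illustration; the notebook itself does not define a padding token.

```python
import torch
import torch.nn.functional as F

pad_id = 0  # hypothetical padding id, not defined in the notebook
tokens = torch.tensor([[5, 7, 9, 0, 0],
                       [3, 4, 0, 0, 0]])        # (batch=2, seq_len=5)

# True = real token, False = padding. Shape (batch, 1, 1, seq_len)
# broadcasts over heads and over query positions.
mask = (tokens != pad_id).unsqueeze(1).unsqueeze(2)

n_heads = 4
scores = torch.randn(2, n_heads, 5, 5)          # stand-in for Q @ K^T / sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)    # same fill as in forward()
attn_weights = F.softmax(scores, dim=-1)

# Total attention weight that sequence 0 places on its padded key positions
pad_weight = attn_weights[0, :, :, 3:].sum().item()
print(mask.shape, pad_weight)
```

A mask built this way can be passed as the fourth argument, as in `mha(x, x, x, mask)`. One caveat: `-1e9` is outside the float16 range, so under half precision a fill value such as `torch.finfo(scores.dtype).min` is likely the safer choice.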

## Part 2: Position-wise Feed-Forward Network

### The FFN Architecture

After attention, each position passes through an identical feed-forward network:

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

Or with GELU activation (modern variant):

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

In [None]:
class PositionwiseFeedForward(nn.Module):
    """Position-wise feed-forward network."""
    
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = self.linear1(x)         # (batch, seq_len, d_ff)
        x = F.gelu(x)               # GELU activation
        x = self.dropout(x)
        x = self.linear2(x)         # (batch, seq_len, d_model)
        return x

# Test
ffn = PositionwiseFeedForward(d_model=512, d_ff=2048).to(device)
x = torch.randn(2, 10, 512, device=device)
output = ffn(x)
print(f"✅ Feed-Forward Network: {x.shape} → {output.shape}")
print(f"Parameters: {sum(p.numel() for p in ffn.parameters()):,}")
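
"Position-wise" means the same two linear layers are applied to every position independently, with no mixing across the sequence. The sketch below checks that property and the closed-form parameter count, using an `nn.Sequential` stand-in (without dropout) rather than the class above so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 512, 2048
ffn_sketch = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.GELU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 10, d_model)
with torch.no_grad():
    full = ffn_sketch(x)            # all positions at once
    single = ffn_sketch(x[:, 3])    # position 3 on its own

# Position 3 gets the same result whether or not its neighbours are present
same = torch.allclose(full[:, 3], single, atol=1e-4)

# Two weight matrices plus two bias vectors
expected_params = 2 * d_model * d_ff + d_ff + d_model
actual_params = sum(p.numel() for p in ffn_sketch.parameters())
print(same, f"{expected_params:,}", f"{actual_params:,}")
```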

## Part 3: Layer Normalization

### Why Normalization?

Layer norm stabilizes training by normalizing across features:

$$\text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:
- $\mu$ = mean across features
- $\sigma^2$ = variance across features
- $\gamma, \beta$ = learnable scale and shift

In [None]:
# PyTorch provides LayerNorm, but let's understand it:

def manual_layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Manual layer normalization for understanding (omits the learnable gamma and beta)."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

# Compare manual vs PyTorch
x = torch.randn(2, 10, 512, device=device)
manual = manual_layer_norm(x)
pytorch_ln = nn.LayerNorm(512, device=device)
pytorch = pytorch_ln(x)

print(f"Manual LayerNorm mean: {manual.mean(dim=-1)[0, 0]:.6f} (should be ~0)")
print(f"Manual LayerNorm std: {manual.std(dim=-1)[0, 0]:.6f} (should be ~1)")
# At initialization gamma = 1 and beta = 0, so the two results should agree
print(f"Max |manual - PyTorch|: {(manual - pytorch).abs().max().item():.2e}")
print(f"\n✅ Layer Normalization understood!")
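
The formula also includes the learnable scale $\gamma$ and shift $\beta$. A small sketch of the full expression, checked against `nn.LayerNorm` after setting its `weight` ($\gamma$) and `bias` ($\beta$) to non-default values. The sizes and value ranges here are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(16)
with torch.no_grad():
    ln.weight.uniform_(0.5, 1.5)    # gamma, moved away from its default of 1
    ln.bias.uniform_(-0.5, 0.5)     # beta, moved away from its default of 0

x = torch.randn(2, 5, 16)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

# gamma * (x - mu) / sqrt(sigma^2 + eps) + beta
manual_full = ln.weight * (x - mean) / torch.sqrt(var + ln.eps) + ln.bias
max_diff = (manual_full - ln(x)).abs().max().item()
print(f"max difference vs nn.LayerNorm: {max_diff:.2e}")
```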

## Part 4: Residual Connections

### The Power of Skip Connections

Residual connections enable deep networks:

$$\text{output} = \text{Sublayer}(x) + x$$

Benefits:
- **Gradient flow:** Direct path for gradients
- **Identity mapping:** Easy to learn identity if needed
- **Stable training:** Prevents degradation in deep networks

In [None]:
class SublayerConnection(nn.Module):
    """Residual connection with layer normalization."""
    
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        """Apply residual connection to any sublayer with the same size."""
        # Pre-norm: LayerNorm → Sublayer → Dropout → Residual
        return x + self.dropout(sublayer(self.norm(x)))

# Test
sublayer_conn = SublayerConnection(512).to(device)
ffn = PositionwiseFeedForward(512, 2048).to(device)
x = torch.randn(2, 10, 512, device=device)
output = sublayer_conn(x, ffn)
print(f"✅ Residual Connection: {x.shape} → {output.shape}")
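
The "identity mapping" and "gradient flow" benefits listed above can be seen in a deliberately extreme case: a sublayer whose weights are all zero. This is only a sketch with a toy `nn.Linear` sublayer, not a measurement on a real network.

```python
import torch
import torch.nn as nn

d_model = 8
sublayer = nn.Linear(d_model, d_model)
nn.init.zeros_(sublayer.weight)     # the sublayer contributes nothing
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 4, d_model, requires_grad=True)
out = x + sublayer(x)               # residual connection
out.sum().backward()

# With a silent sublayer the block is exactly the identity ...
is_identity = torch.allclose(out.detach(), x.detach())
# ... and the gradient still reaches x through the skip path
grad_is_ones = torch.allclose(x.grad, torch.ones_like(x))
print(is_identity, grad_is_ones)
```

Without the `x +` term, the same zeroed sublayer would output all zeros and pass no gradient back to `x`.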

## Part 5: Complete Transformer Encoder Block

### Architecture

```
Input
  ↓
LayerNorm → Multi-Head Attention → Dropout → (+) Residual
  ↓
LayerNorm → Feed-Forward → Dropout → (+) Residual
  ↓
Output
```

In [None]:
class TransformerEncoderBlock(nn.Module):
    """Complete transformer encoder block."""
    
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, n_heads, dropout)
        
        # Feed-forward network
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        
        # Layer normalization for both sub-layers
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Args:
            x: Input tensor (batch, seq_len, d_model)
            mask: Optional attention mask
        
        Returns:
            Output tensor (batch, seq_len, d_model)
        """
        # Sub-layer 1: pre-norm multi-head self-attention with residual
        normed = self.norm1(x)
        attn_output = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_output)
        
        # Sub-layer 2: pre-norm feed-forward with residual
        ff_output = self.feed_forward(self.norm2(x))
        x = x + self.dropout(ff_output)
        
        return x

# Test single encoder block
encoder_block = TransformerEncoderBlock(
    d_model=512,
    n_heads=8,
    d_ff=2048,
    dropout=0.1
).to(device)

x = torch.randn(4, 20, 512, device=device)
output = encoder_block(x)

print(f"✅ Transformer Encoder Block")
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Parameters: {sum(p.numel() for p in encoder_block.parameters()):,}")
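
The diagram above describes the pre-norm ordering. The original Transformer paper instead applied LayerNorm after the residual addition (post-norm). A minimal sketch of the two orderings follows, with a plain `nn.Linear` standing in for the sublayer. The helper names `pre_norm_step` and `post_norm_step` are hypothetical and exist only for this comparison.

```python
import torch
import torch.nn as nn

def pre_norm_step(x, norm, sublayer):
    """Pre-norm: normalize the sublayer input, leave the residual stream untouched."""
    return x + sublayer(norm(x))

def post_norm_step(x, norm, sublayer):
    """Post-norm: add the residual first, then normalize the sum."""
    return norm(x + sublayer(x))

torch.manual_seed(0)
d_model = 64
x = 5.0 * torch.randn(2, 10, d_model)   # deliberately large-scale input
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)

with torch.no_grad():
    pre = pre_norm_step(x, norm, sublayer)
    post = post_norm_step(x, norm, sublayer)

pre_std = pre.std(dim=-1).mean().item()
post_std = post.std(dim=-1).mean().item()
print(f"pre-norm output std:  {pre_std:.2f}")
print(f"post-norm output std: {post_std:.2f}")
```

The post-norm output is normalized by construction, while the pre-norm output keeps the scale of the residual stream. That is one reason a stack of pre-norm blocks usually ends with a final LayerNorm.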

## Part 6: Stacking Multiple Encoder Blocks

Transformers stack multiple blocks (6 in the original base Transformer encoder, 12 in BERT-base, and up to 96 in very large models such as GPT-3, which stacks decoder blocks):

In [None]:
class TransformerEncoder(nn.Module):
    """Stack of N encoder blocks."""
    
    def __init__(self, n_layers: int, d_model: int, n_heads: int, 
                 d_ff: int, dropout: float = 0.1):
        super().__init__()
        
        self.layers = nn.ModuleList([
            TransformerEncoderBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        self.norm = nn.LayerNorm(d_model)
        
    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Pass through all encoder layers."""
        for layer in self.layers:
            x = layer(x, mask)
        
        return self.norm(x)

# Create a 6-layer transformer encoder (sizes of the original base Transformer; BERT-base uses 12 layers and d_model=768)
encoder = TransformerEncoder(
    n_layers=6,
    d_model=512,
    n_heads=8,
    d_ff=2048,
    dropout=0.1
).to(device)

x = torch.randn(2, 50, 512, device=device)
output = encoder(x)

print(f"✅ 6-Layer Transformer Encoder")
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Total parameters: {sum(p.numel() for p in encoder.parameters()):,}")
print(f"Parameter breakdown:")
for i, layer in enumerate(encoder.layers):
    params = sum(p.numel() for p in layer.parameters())
    print(f"  Layer {i+1}: {params:,} parameters")
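
As a cross-check, PyTorch ships its own encoder stack. The sketch below configures `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` with the same sizes, GELU activation, pre-norm ordering (`norm_first=True`), and a final LayerNorm. It is a comparison point under those assumptions, not a drop-in replacement for the classes in this notebook: the built-in layer packs the Q, K, V projections into a single weight, so state dicts are not interchangeable.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff, n_layers = 512, 8, 2048, 6

layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=n_heads,
    dim_feedforward=d_ff,
    dropout=0.1,
    activation="gelu",
    batch_first=True,    # inputs are (batch, seq_len, d_model)
    norm_first=True,     # pre-norm ordering
)
builtin = nn.TransformerEncoder(layer, num_layers=n_layers, norm=nn.LayerNorm(d_model))
builtin.eval()           # disable dropout for the shape check

x = torch.randn(2, 50, d_model)
with torch.no_grad():
    out = builtin(x)

n_params = sum(p.numel() for p in builtin.parameters())
print(out.shape, f"{n_params:,}")
```

With these sizes each layer holds 3,152,384 parameters (attention 1,050,624, feed-forward 2,099,712, two LayerNorms 2,048), which should match the per-layer figure printed for the hand-written encoder above.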

## Part 7: Positional Encoding

### Why Positional Encoding?

Self-attention is **permutation equivariant**: shuffling the input tokens just shuffles the outputs in the same way, so the layer itself doesn't know the order of tokens!
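
This order-blindness can be checked directly. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the class from Part 1, purely so the snippet is self-contained. The sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
attn.eval()

x = torch.randn(1, 6, 32)        # (batch, seq_len, d_model), no positional info
perm = torch.randperm(6)         # a random reordering of the 6 tokens

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_shuffled, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Shuffling the inputs only shuffles the outputs in the same way
order_blind = torch.allclose(out[:, perm], out_shuffled, atol=1e-5)
print(order_blind)
```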

Solution: Add position information to embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

In [None]:
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding."""
    
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter, but part of state)
        self.register_buffer('pe', pe)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Add positional encoding to input."""
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Visualize positional encodings
pos_enc = PositionalEncoding(d_model=128, max_len=100)
pe_matrix = pos_enc.pe[0].cpu().numpy()

plt.figure(figsize=(14, 6))
plt.imshow(pe_matrix.T, cmap='RdBu', aspect='auto')
plt.colorbar(label='Value')
plt.xlabel('Position', fontsize=12)
plt.ylabel('Dimension', fontsize=12)
plt.title('Positional Encoding Matrix', fontsize=14, fontweight='bold')
plt.show()

print("\n💡 Key Properties:")
print("   - Unique encoding for each position")
print("   - Smooth variation across positions")
print("   - Different frequencies capture different scales")
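
The `exp`/`log` expression for `div_term` is just a vectorized way of writing $1 / 10000^{2i/d}$. A short sketch that evaluates the formula entry by entry and compares it with the vectorized construction. The small `d_model` and `max_len` are chosen only to keep the loop quick.

```python
import math
import torch

d_model, max_len = 16, 50

# Vectorized form, as in PositionalEncoding above
position = torch.arange(max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Direct evaluation of the formula, one entry at a time
pe_direct = torch.zeros(max_len, d_model)
for pos in range(max_len):
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe_direct[pos, 2 * i] = math.sin(angle)
        pe_direct[pos, 2 * i + 1] = math.cos(angle)

matches = torch.allclose(pe, pe_direct, atol=1e-4)

# sin^2 + cos^2 = 1 for each (2i, 2i+1) pair, so every row has squared norm d_model / 2
row_norms_sq = (pe ** 2).sum(dim=-1)
print(matches, row_norms_sq[:3])
```

Note that the `0::2` / `1::2` assignment assumes an even `d_model`.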

## Part 8: Complete Transformer with Embeddings

Putting it all together:

In [None]:
class Transformer(nn.Module):
    """Complete transformer model."""
    
    def __init__(self, vocab_size: int, d_model: int, n_layers: int, 
                 n_heads: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        
        # Token embedding
        self.embedding = nn.Embedding(vocab_size, d_model)
        
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len, dropout)
        
        # Transformer encoder
        self.encoder = TransformerEncoder(n_layers, d_model, n_heads, d_ff, dropout)
        
        # Scale embeddings
        self.scale = math.sqrt(d_model)
        
    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Args:
            x: Input token indices (batch, seq_len)
            mask: Optional attention mask
        
        Returns:
            Encoded representations (batch, seq_len, d_model)
        """
        # Embed and scale
        x = self.embedding(x) * self.scale
        
        # Add positional encoding
        x = self.pos_encoding(x)
        
        # Pass through encoder
        x = self.encoder(x, mask)
        
        return x

# Create a small transformer
model = Transformer(
    vocab_size=10000,
    d_model=512,
    n_layers=6,
    n_heads=8,
    d_ff=2048,
    dropout=0.1
).to(device)

# Test
input_tokens = torch.randint(0, 10000, (4, 30), device=device)
output = model(input_tokens)

print(f"✅ Complete Transformer Model")
print(f"Input tokens shape: {input_tokens.shape}")
print(f"Output embeddings shape: {output.shape}")
print(f"\nModel Summary:")
print(f"  Vocabulary size: 10,000")
print(f"  Model dimension: 512")
print(f"  Number of layers: 6")
print(f"  Number of heads: 8")
print(f"  Feed-forward size: 2048")
print(f"  Total parameters: {sum(p.numel() for p in model.parameters()):,}")
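
To get a feel for the `sqrt(d_model)` embedding scale, it helps to compare magnitudes. The sketch below measures the root-mean-square value of embeddings before and after scaling, under PyTorch's default `nn.Embedding` initialization (standard normal). It only illustrates relative scale against the sinusoidal encodings, which lie in [-1, 1]; it is not a claim about why the scaling was originally introduced.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 512
embedding = nn.Embedding(10000, d_model)    # default init: N(0, 1)
tokens = torch.randint(0, 10000, (4, 30))

with torch.no_grad():
    raw = embedding(tokens)
    scaled = raw * math.sqrt(d_model)

raw_rms = raw.pow(2).mean().sqrt().item()
scaled_rms = scaled.pow(2).mean().sqrt().item()
print(f"raw RMS: {raw_rms:.2f}, scaled RMS: {scaled_rms:.2f}, positional values in [-1, 1]")
```

With this default initialization the scaled embeddings dominate the positional signal. With a smaller initialization standard deviation, the same factor would instead bring the two closer together.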

## Exercise Section

### Exercise 1: Analyze Parameter Distribution
Calculate what percentage of parameters are in:
- Embeddings
- Attention layers
- Feed-forward layers
- Layer norms

In [None]:
# TODO: Analyze parameter distribution

### Exercise 2: Implement Decoder Block
Create a transformer decoder block with:
- Masked self-attention
- Cross-attention to encoder
- Feed-forward network

In [None]:
# TODO: Implement decoder block

### Exercise 3: Gradient Flow Analysis
Measure gradient magnitudes with and without:
- Residual connections
- Layer normalization

Show how they stabilize training.

In [None]:
# TODO: Analyze gradient flow

## Summary

### Key Takeaways

✅ **Transformer Architecture:**
- Multi-head attention captures relationships
- Feed-forward adds non-linearity
- Layer norm and residuals stabilize training

✅ **Components:**
- **Embeddings:** Convert tokens to vectors
- **Positional Encoding:** Add position information
- **Encoder Blocks:** Process sequences
- **Layer Norm:** Normalize activations

✅ **Design Choices:**
- Pre-norm vs post-norm (we used pre-norm)
- GELU activation (smoother than ReLU)
- Dropout for regularization

✅ **Scaling:**
- Stack layers for depth
- Increase d_model for capacity
- More heads for diverse attention

### Next Steps

In **Notebook 06**, we'll:
- Train a transformer on real data
- Character-level language modeling
- Text generation
- Monitor training dynamics

## Further reading (Archive.org)

For a higher-level view of Transformer encoder blocks, search Archive.org for:

- "transformer architecture deep learning"
- "self-attention networks"
- "sequence modeling transformers"

Pair these with up-to-date Transformer surveys to see how multi-head self-attention, feed-forward layers, residual connections, and layer normalization fit together in full encoder/decoder stacks, complementing the minimal encoder block implemented in this notebook.