# 1.1 - Deep Dive: The Transformer Architecture

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/madeforai/madeforai/blob/main/docs/paths/developer/01-transformer-architecture.ipynb)

---

**💻 The Engineer Path | Module 1 of 7**

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.), reshaped natural language processing by dispensing with recurrence and convolutions entirely. Instead, it relies solely on attention mechanisms to model global dependencies between input and output positions.

## 📚 What You'll Learn

- Self-attention mechanism explained
- Multi-head attention implementation
- Positional encoding strategies
- Complete Transformer block in PyTorch

> **Prerequisites**: Basic understanding of vector embeddings and matrix multiplication. We'll use PyTorch for implementation.

## 🔧 Setup

First, let's install and import the necessary libraries.

In [None]:
# Install required packages (uncomment if running in Colab)
# !pip install torch numpy matplotlib -q

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import math

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {device}")
print("✅ Setup complete!")

---

## 1. The Self-Attention Mechanism

At the core of the Transformer is the **self-attention** mechanism. When encoding a given token, it lets the model weigh every token in the same sequence by how relevant it is to that token.

### The Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- **Q** (Query): What we're looking for
- **K** (Key): What we match against
- **V** (Value): What we retrieve
- **d_k**: Dimension of the keys; dividing by $\sqrt{d_k}$ keeps the scores from growing with the dimension and saturating the softmax
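
To see why the scores are divided by $\sqrt{d_k}$, the sketch below estimates the variance of dot products between random vectors with unit-variance entries. It uses only the standard library so the arithmetic stays visible; the helper name `dot_product_variance` and the sample count are illustrative choices, not part of the notebook.

```python
import math
import random
import statistics

def dot_product_variance(d_k, scale, n_samples=2000, seed=0):
    """Estimate Var(scale * q.k) for q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
    return statistics.pvariance(dots)

for d_k in (4, 64, 256):
    raw = dot_product_variance(d_k, scale=1.0)
    scaled = dot_product_variance(d_k, scale=1.0 / math.sqrt(d_k))
    print(f"d_k={d_k:3d}  unscaled variance ~ {raw:7.1f}  scaled variance ~ {scaled:4.2f}")
```

The unscaled variance grows roughly in proportion to `d_k`, while the scaled variance stays near 1. Without the scaling, large scores would push the softmax toward a near one-hot distribution with very small gradients.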

In [None]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
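        Q: Queries of shape (..., seq_len_q, d_k)
        K: Keys of shape (..., seq_len_k, d_k)
        V: Values of shape (..., seq_len_k, d_v)
        mask: Optional tensor broadcastable to (..., seq_len_q, seq_len_k);
            positions where mask == 0 are excluded (e.g. future tokens)
        
    Returns:
        Output of shape (..., seq_len_q, d_v) and attention weights
        of shape (..., seq_len_q, seq_len_k)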
    """
    d_k = Q.size(-1)
    
    # Compute attention scores: QK^T / sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    # Apply mask if provided (e.g. a causal mask in a decoder, or a padding mask)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    
    # Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    
    # Multiply by values
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

In [None]:
# Example: Self-attention on a simple sequence
batch_size = 1
seq_len = 4
d_model = 8

# Create sample input (random vectors for 4 tokens)
torch.manual_seed(42)
x = torch.randn(batch_size, seq_len, d_model)

# For simplicity, use the raw input as Q, K, and V.
# A real layer would first apply separate learned linear projections.
Q = K = V = x

# Compute attention
output, attn_weights = scaled_dot_product_attention(Q, K, V)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("\nAttention weights (row = query token, column = key token):")
print(attn_weights.squeeze().numpy().round(3))

In [None]:
# Visualize attention weights
plt.figure(figsize=(8, 6))
plt.imshow(attn_weights.squeeze().numpy(), cmap='Blues', aspect='auto')
plt.colorbar(label='Attention Weight')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.title('Self-Attention Weights')
plt.xticks(range(seq_len), [f'Token {i}' for i in range(seq_len)])
plt.yticks(range(seq_len), [f'Token {i}' for i in range(seq_len)])
plt.tight_layout()
plt.show()
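
The `mask` argument of `scaled_dot_product_attention` is not exercised in the example above. As a minimal sketch, a causal (look-ahead) mask can be built with `torch.tril`. The helper `causal_attention_weights` is a hypothetical, self-contained stand-in that repeats the scoring, masking, and softmax steps so the cell runs on its own.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention_weights(x):
    """Self-attention weights for x of shape (batch, seq_len, d_k) under a causal mask."""
    seq_len, d_k = x.size(-2), x.size(-1)
    scores = torch.matmul(x, x.transpose(-2, -1)) / math.sqrt(d_k)
    # Lower-triangular matrix of ones: position i may attend to positions <= i
    mask = torch.tril(torch.ones(seq_len, seq_len))
    scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1)

torch.manual_seed(0)
weights = causal_attention_weights(torch.randn(1, 4, 8))
print(weights.squeeze(0))
```

Every row still sums to 1, but all entries above the diagonal are exactly zero, and the first token can only attend to itself. Passing the same lower-triangular `mask` to `scaled_dot_product_attention(Q, K, V, mask)` applies the same restriction.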

---

## 2. Multi-Head Attention

Multi-head attention runs the attention mechanism **multiple times in parallel** with different learned projections. This allows the model to attend to information from different representation subspaces.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

In [None]:
class MultiHeadAttention(nn.Module):
    """Multi-Head Attention mechanism."""
    
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        # Linear projections
        Q = self.W_q(Q)
        K = self.W_k(K)
        V = self.W_v(V)
        
        # Reshape to (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Compute attention
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        
        # Final linear projection
        output = self.W_o(attn_output)
        
        return output, attn_weights
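
The `view` and `transpose` calls in `forward` carry most of the shape bookkeeping. The sketch below isolates them in two hypothetical helpers, `split_heads` and `merge_heads`, to show that each head receives a contiguous slice of the feature dimension and that splitting followed by merging is lossless.

```python
import torch

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)."""
    batch, seq_len, d_model = x.shape
    d_k = d_model // num_heads
    return x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)

def merge_heads(x):
    """(batch, num_heads, seq_len, d_k) -> (batch, seq_len, num_heads * d_k)."""
    batch, num_heads, seq_len, d_k = x.shape
    # transpose leaves the tensor non-contiguous, so .contiguous() is needed before .view
    return x.transpose(1, 2).contiguous().view(batch, seq_len, num_heads * d_k)

x = torch.randn(2, 10, 512)
heads = split_heads(x, num_heads=8)

print(heads.shape)                                 # torch.Size([2, 8, 10, 64])
print(torch.equal(heads[:, 1], x[:, :, 64:128]))   # True: head 1 sees features 64..127
print(torch.equal(merge_heads(heads), x))          # True: the round trip is lossless
```

Because the head dimension sits next to the batch dimension after the transpose, a single batched `matmul` computes attention for all heads at once.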

In [None]:
# Test Multi-Head Attention
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model, num_heads)
x = torch.randn(batch_size, seq_len, d_model)

output, attn_weights = mha(x, x, x)  # Self-attention

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print("\n✅ Multi-Head Attention working!")

---

## 3. Positional Encoding

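Self-attention on its own is order-agnostic: permuting the input tokens simply permutes the outputs. Because Transformers process all positions in parallel, we need to inject information about token positions. We use sinusoidal positional encoding: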

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
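
As a quick check of these formulas, the sketch below evaluates them directly with the standard library. The function name `positional_encoding_value` is illustrative. It also confirms the identity $10000^{-2i/d_{model}} = \exp\left(-2i \ln(10000)/d_{model}\right)$, which is the log-space form used for `div_term` in the implementation that follows.

```python
import math

def positional_encoding_value(pos, dim, d_model):
    """Sinusoidal encoding for one (position, dimension) pair."""
    i = dim // 2  # pair index: dimensions 2i and 2i+1 share one frequency
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 8
for pos in range(3):
    row = [round(positional_encoding_value(pos, dim, d_model), 3) for dim in range(d_model)]
    print(pos, row)

# The same frequency computed directly and in log space
i = 1
direct = 1.0 / (10000 ** (2 * i / d_model))
log_space = math.exp(2 * i * (-math.log(10000.0) / d_model))
print(math.isclose(direct, log_space))  # True
```

At position 0 every sine entry is 0 and every cosine entry is 1; higher dimensions oscillate more slowly as the position increases.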

In [None]:
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding."""
    
    def __init__(self, d_model, max_seq_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1).float()
        
        # Compute the div term
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * 
            (-math.log(10000.0) / d_model)
        )
        
        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Slice div_term so an odd d_model (one fewer odd column) also works
        pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])
        
        # Add batch dimension and register as buffer
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        """Add positional encoding to input."""
        return x + self.pe[:, :x.size(1), :]

In [None]:
# Visualize positional encoding
pe_module = PositionalEncoding(d_model=64, max_seq_len=100)
pe_matrix = pe_module.pe.squeeze().numpy()

plt.figure(figsize=(12, 5))
plt.imshow(pe_matrix[:50, :].T, cmap='RdBu', aspect='auto')
plt.colorbar(label='Value')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('Positional Encoding Visualization')
plt.tight_layout()
plt.show()

---

## 4. Complete Transformer Block

A Transformer block combines:
1. Multi-Head Self-Attention
2. Layer Normalization
3. Feed-Forward Network
4. Residual Connections

In [None]:
class TransformerBlock(nn.Module):
    """A single encoder-style Transformer block (post-LN: LayerNorm after each residual add)."""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # the original paper used ReLU; GELU is a common later choice
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        
        return x
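
Because `TransformerBlock` contains `nn.Dropout`, its output is stochastic in training mode, which is the default for a newly created module. The sketch below uses a small stand-in layer (an assumption made to keep the example short) to show that calling `.eval()` disables dropout and makes repeated forward passes identical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(16, 16), nn.Dropout(0.1))
x = torch.randn(2, 16)

layer.train()  # dropout active: two passes usually differ
print("train mode, identical outputs:", torch.equal(layer(x), layer(x)))

layer.eval()   # dropout disabled: the layer is deterministic
print("eval mode, identical outputs:", torch.equal(layer(x), layer(x)))  # True
```

The same applies to the full block: call `transformer_block.eval()` before comparing outputs or running inference.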

In [None]:
# Test the complete Transformer block
d_model = 512
num_heads = 8
d_ff = 2048
seq_len = 20
batch_size = 4

transformer_block = TransformerBlock(d_model, num_heads, d_ff)
x = torch.randn(batch_size, seq_len, d_model)

output = transformer_block(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("\n✅ Transformer Block working!")

# Count parameters
total_params = sum(p.numel() for p in transformer_block.parameters())
print(f"\nTotal parameters: {total_params:,}")
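
The printed total can be cross-checked by hand. The sketch below tallies the parameters implied by the layer definitions in this notebook: bias-free attention projections, a two-layer FFN with biases, and two LayerNorms with a weight and bias each. The variable names are illustrative.

```python
d_model, d_ff = 512, 2048

# Four bias-free d_model x d_model projections: W_q, W_k, W_v, W_o
attention_params = 4 * d_model * d_model

# Two linear layers with biases: d_model -> d_ff and d_ff -> d_model
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

# Two LayerNorms, each with a learnable weight and bias of size d_model
layernorm_params = 2 * (2 * d_model)

expected_total = attention_params + ffn_params + layernorm_params
print(f"attention: {attention_params:,}")   # 1,048,576
print(f"ffn:       {ffn_params:,}")         # 2,099,712
print(f"layernorm: {layernorm_params:,}")   # 2,048
print(f"total:     {expected_total:,}")     # 3,150,336
```

If the layer definitions are unchanged, this should match the `Total parameters` line printed by the test cell. Note that `num_heads` does not affect the count, and roughly two thirds of the parameters sit in the feed-forward network.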

---

## 🎯 Key Takeaways

1. **Self-Attention** lets each token attend to every token in the sequence (itself included) in parallel
2. **Multi-Head Attention** learns multiple attention patterns simultaneously
3. **Positional Encoding** injects sequence order information
4. **Transformer Blocks** combine attention, a feed-forward network, layer normalization, and residual connections

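> **💡 Optimization Tip**: When a GPU is available, move both the model and its input tensors to it with `.to(device)`, using the `device` selected in Setup. Hard-coding `.to('cuda')` raises an error on machines without a GPU, and a model and input placed on different devices will also fail.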
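
A minimal, device-agnostic way to follow this tip is sketched below. `nn.Linear` stands in for any model (an assumption for brevity); the same pattern applies to the modules defined in this notebook.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(512, 512).to(device)       # moves parameters and buffers
x = torch.randn(4, 20, 512, device=device)   # create the input directly on the device

with torch.no_grad():                        # inference only, no autograd bookkeeping
    y = model(x)

print(y.device.type == device.type)  # True
```

Creating tensors with `device=device` avoids an extra host-to-device copy compared with creating them on the CPU and calling `.to(device)` afterwards.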

## 📖 Next Steps

Continue to **Module 2: Tokenization & Embeddings** to learn how text is converted to the vectors that Transformers process.

---

*© 2026 MadeForAI. Learn more at [madeforai.github.io](https://madeforai.github.io/madeforai)*