# Attention is All You Need: My Guide to Transformers

*Understanding the architecture that powers modern AI*

Welcome to the world of Transformers! The Transformer is a passive component that transfers electrical energy from one circuit to another (or many). 

Just kidding (ECE joke... sorry!)

Transformers are the neural networks behind ChatGPT, Google Translate, code completion tools, and most modern language-focused AI systems. This guide walks you through building a Transformer from scratch (well, not quite from scratch: we're using PyTorch), explaining both the theory and the implementation.


## What is a Transformer?

A Transformer is a neural network architecture that uses attention mechanisms to model relationships between the tokens (roughly, words) in a sequence. Unlike traditional recurrent networks (RNNs, LSTMs), which step through a sequence one position at a time, Transformers process all positions in parallel, making them both faster to train and better at capturing long-range dependencies.

**Key Applications:**
- ChatGPT and other large language models
- Google Translate and machine translation
- Code completion tools like GitHub Copilot
- Image generation models like DALL-E
- Scientific research applications

**Why Transformers Matter:** Understanding this architecture is essential for anyone working in AI, as it forms the foundation of virtually all modern language models and many other AI applications.


In [6]:
# Install required packages if not already installed
try:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import matplotlib.pyplot as plt
    import numpy as np
    print("All packages already installed!")
except ImportError as e:
    print(f"Installing missing packages: {e}")
    import subprocess
    import sys
    
    packages = ['torch', 'torchvision', 'matplotlib', 'numpy']
    for package in packages:
        try:
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
            print(f"Successfully installed {package}")
        except subprocess.CalledProcessError:
            print(f"Failed to install {package}")
    
    # Try importing again
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import matplotlib.pyplot as plt
    import numpy as np

# Import required libraries
import math
from typing import Optional

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment setup complete! Ready to build a Transformer from scratch.")

All packages already installed!
Environment setup complete! Ready to build a Transformer from scratch.


## Understanding Attention Mechanisms

Before implementing the code, let's understand the core concept: **Attention**.

### What is Attention?

Attention is a mechanism that allows the model to focus on different parts of the input sequence when processing each element. When you read "The cat that I saw yesterday was sleeping on the mat," your brain automatically connects "was sleeping" to "cat" rather than "mat" or "yesterday." This is essentially what attention mechanisms do computationally.

### The Attention Formula

The mathematical foundation of attention is:

```
Attention(Q, K, V) = softmax(QK^T / √d_k)V
```

Where:
- **Q (Query)**: "What am I looking for?"
- **K (Key)**: "What do I represent?"
- **V (Value)**: "What information do I contain?"

We'll implement this step by step to understand how it works.
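
To make the formula concrete before the PyTorch version, here is a minimal sketch that evaluates it by hand in plain Python for a toy two-token sequence. The helper names `softmax` and `attention` and the toy values are illustrative assumptions, not part of the implementation that follows.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Q, K: seq_len rows of length d_k; V: seq_len rows of any length.
    Returns (output rows, attention weight rows).
    """
    d_k = len(K[0])
    # scores[i][j] = (Q_i . K_j) / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Each row of scores becomes a probability distribution over positions
    weights = [softmax(row) for row in scores]
    # output_i = sum_j weights[i][j] * V_j
    output = [[sum(w * v_row[c] for w, v_row in zip(w_row, V))
               for c in range(len(V[0]))] for w_row in weights]
    return output, weights

# Toy values (hypothetical): two tokens, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Because each query here matches its own key best, each token places roughly two thirds of its attention on itself and the rest on the other token, and every row of weights sums to 1.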


In [None]:
# Self-Attention Implementation
# The core mechanism that allows words to attend to each other

class SelfAttention(nn.Module):
    """
    Self-Attention mechanism - the foundation of Transformers
    
    This allows each word in a sequence to attend to every other word,
    determining which relationships are most important for understanding
    the context and meaning.
    """
    
    def __init__(self, d_model: int, d_k: Optional[int] = None):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_k or d_model  # Key dimension (defaults to the model dimension)
        
        # Linear projections for Query, Key, and Value matrices
        # Note: use self.d_k here; the raw d_k argument may be None
        self.W_q = nn.Linear(d_model, self.d_k)  # Query projection
        self.W_k = nn.Linear(d_model, self.d_k)  # Key projection
        self.W_v = nn.Linear(d_model, d_model)  # Value projection
        
        # Scaling factor: dot products grow with d_k, and dividing by sqrt(d_k)
        # keeps the softmax from becoming too peaked
        self.scale = math.sqrt(self.d_k)
        
    def forward(self, x):
        """
        Forward pass of self-attention
        
        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
            
        Returns:
            Tuple (output, attention_weights): output has the same shape as
            the input; attention_weights is (batch_size, seq_len, seq_len)
        """
        batch_size, seq_len, d_model = x.shape
        
        # Step 1: Create Q, K, V matrices
        Q = self.W_q(x)  # (batch_size, seq_len, d_k)
        K = self.W_k(x)  # (batch_size, seq_len, d_k) 
        V = self.W_v(x)  # (batch_size, seq_len, d_model)
        
        # Step 2: Calculate attention scores
        # This determines how much each word should attend to every other word
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # Shape: (batch_size, seq_len, seq_len)
        
        # Step 3: Apply softmax to get attention weights (probabilities)
        attention_weights = F.softmax(scores, dim=-1)
        
        # Step 4: Apply attention weights to values
        # Each word gets a weighted combination of every word's information (including its own)
        output = torch.matmul(attention_weights, V)
        # Shape: (batch_size, seq_len, d_model)
        
        return output, attention_weights

# Test the self-attention mechanism
print("Testing Self-Attention...")

# Create a simple example
batch_size = 1
seq_len = 4  # 4 words
d_model = 8  # Each word represented by 8 numbers

# Create random input (pretend these are word embeddings)
x = torch.randn(batch_size, seq_len, d_model)
print(f"Input shape: {x.shape}")

# Create and test self-attention layer
self_attn = SelfAttention(d_model)
output, attention_weights = self_attn(x)

print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
print(f"Attention weights sum to 1: {attention_weights.sum(dim=-1)}")  # Should be close to 1

print("Self-Attention is working correctly!")


## Multi-Head Attention

While single-head attention is powerful, Multi-Head Attention provides multiple perspectives on the same input. Each attention head can focus on different types of relationships - one might focus on syntax, another on semantics, another on long-range dependencies.

**Why Multiple Heads?** Different heads can specialize in different aspects of the input, similar to having a team of experts rather than a single person trying to understand everything.

**The Original Design:** The base model in the original paper used 8 heads with d_model = 512 (so 64 dimensions per head). This has become a common choice, though the optimal number depends on the specific task and model size.
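
The "splitting into heads" step is mostly bookkeeping: each head receives a contiguous slice of the projected features for every token. As a rough sketch in plain Python, assuming a single sequence stored as a list of rows (the helpers `split_heads` and `merge_heads` are hypothetical names used only here):

```python
def split_heads(x, num_heads):
    """Split (seq_len, d_model) rows into (num_heads, seq_len, d_k) chunks."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    # Head h takes features [h * d_k, (h + 1) * d_k) from every token
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head rows back into (seq_len, d_model)."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Two tokens with d_model = 8 (toy values standing in for projected Q, K, or V)
x = [[0, 1, 2, 3, 4, 5, 6, 7],
     [10, 11, 12, 13, 14, 15, 16, 17]]

heads = split_heads(x, num_heads=2)
print(heads[0])  # first head: features 0-3 of each token
print(heads[1])  # second head: features 4-7 of each token
print(merge_heads(heads) == x)  # merging undoes the split
```

Attention then runs independently inside each head on its `d_k`-sized slice, and the merge step restores the original `d_model` width before the final projection.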


In [4]:
# Multi-Head Attention Implementation
# Multiple attention mechanisms running in parallel

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention - multiple attention mechanisms in parallel
    
    This runs multiple self-attention mechanisms simultaneously,
    each focusing on different aspects of the input. The outputs
    are then combined to create a richer representation.
    """
    
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension per head
        
        # Linear projections for Q, K, V for all heads combined
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Final linear projection to combine all heads
        self.W_o = nn.Linear(d_model, d_model)
        
    def forward(self, x, mask=None):
        """
        Forward pass of multi-head attention
        
        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
            mask: Optional mask to prevent attention to certain positions
            
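        Returns:
            Tuple (output, attention_weights): output has the same shape as
            the input; attention_weights is (batch_size, num_heads, seq_len, seq_len)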
        """
        batch_size, seq_len, d_model = x.shape
        
        # Step 1: Create Q, K, V for all heads at once
        Q = self.W_q(x)  # (batch_size, seq_len, d_model)
        K = self.W_k(x)  # (batch_size, seq_len, d_model)
        V = self.W_v(x)  # (batch_size, seq_len, d_model)
        
        # Step 2: Reshape to separate heads
        # From (batch_size, seq_len, d_model) to (batch_size, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
        # Step 3: Calculate attention for each head
        # Shape: (batch_size, num_heads, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided (useful for preventing future information leakage)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Apply softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        # Shape: (batch_size, num_heads, seq_len, d_k)
        attended_values = torch.matmul(attention_weights, V)
        
        # Step 4: Concatenate heads back together
        # Reshape from (batch_size, num_heads, seq_len, d_k) to (batch_size, seq_len, d_model)
        attended_values = attended_values.transpose(1, 2).contiguous().view(
            batch_size, seq_len, d_model
        )
        
        # Step 5: Final linear projection
        output = self.W_o(attended_values)
        
        return output, attention_weights

# Test Multi-Head Attention
print("Testing Multi-Head Attention...")

# Create parameters
d_model = 8
num_heads = 2  # Use 2 heads for simplicity
batch_size = 1
seq_len = 4

# Create input
x = torch.randn(batch_size, seq_len, d_model)

# Create multi-head attention
multi_head_attn = MultiHeadAttention(d_model, num_heads)

# Run it
output, attention_weights = multi_head_attn(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
print(f"Number of heads: {num_heads}")

print("Multi-Head Attention is working correctly!")


Testing Multi-Head Attention...
Input shape: torch.Size([1, 4, 8])
Output shape: torch.Size([1, 4, 8])
Attention weights shape: torch.Size([1, 2, 4, 4])
Number of heads: 2
Multi-Head Attention is working correctly!
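
The `mask` argument above is where a causal (look-ahead) mask would go. As a minimal sketch in plain Python, here is how such a mask changes the softmax. The helper names `causal_mask` and `masked_softmax` are illustrative assumptions; the convention (1 = may attend, 0 = blocked) follows the `mask == 0` check in `MultiHeadAttention.forward`.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: 1 = may attend, 0 = blocked (future position)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask, fill=-1e9):
    """Row-wise softmax after replacing blocked scores with a large negative value."""
    out = []
    for s_row, m_row in zip(scores, mask):
        filled = [s if m else fill for s, m in zip(s_row, m_row)]
        top = max(filled)
        exps = [math.exp(v - top) for v in filled]  # blocked entries underflow to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

seq_len = 4
mask = causal_mask(seq_len)
scores = [[0.0] * seq_len for _ in range(seq_len)]  # uniform toy scores
weights = masked_softmax(scores, mask)
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, each row spreads its weight evenly over the current and earlier positions only, and future positions get zero. In PyTorch, `torch.tril(torch.ones(seq_len, seq_len))` builds the same pattern and should broadcast against the `(batch_size, num_heads, seq_len, seq_len)` scores.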


## Building the Complete Transformer Architecture

A complete Transformer consists of several key components:

1. **Multi-Head Attention** (We just implemented this)
2. **Feed-Forward Networks** - Process the attended information
3. **Layer Normalization** - Stabilizes training
4. **Residual Connections** - Skip connections that help with gradient flow
5. **Positional Encoding** - Provides sequence order information

Let's implement each component and then assemble them into a complete Transformer.
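
Of these components, positional encoding is the easiest to check by hand. The sketch below writes out the sinusoidal formula from the original paper in plain Python, where even feature indices get a sine and odd indices the matching cosine. The function name `positional_encoding` is an assumption for this illustration; the PyTorch version in the next cell computes the same values in vectorized form.

```python
import math

def positional_encoding(seq_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # i walks over the even feature indices (the "2i" in the formula)
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Position 0 alternates sin(0) = 0 and cos(0) = 1. Later positions rotate each sine/cosine pair at a different frequency, which is what gives every position a distinct pattern.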


In [None]:
# Feed-Forward Network Implementation
# Processes the attended information through two linear layers

class FeedForward(nn.Module):
    """
    Feed-Forward Network - processes information after attention
    
    This consists of two linear layers with a ReLU activation between them.
    The first layer expands the dimension (typically 4x), and the second
    contracts it back to the original dimension.
    
    Architecture:
    - First layer: d_model -> 4 * d_model (expansion)
    - Activation: ReLU
    - Second layer: 4 * d_model -> d_model (contraction)
    """
    
    def __init__(self, d_model: int, d_ff: int = None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # Default to 4x expansion
        
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(0.1)  # Small dropout for regularization
        
    def forward(self, x):
        """
        Forward pass through the feed-forward network
        
        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
            
        Returns:
            Output tensor of same shape as input
        """
        # First linear transformation + ReLU + dropout
        x = self.linear1(x)
        x = F.relu(x)
        x = self.dropout(x)
        
        # Second linear transformation
        x = self.linear2(x)
        
        return x

# Positional Encoding Implementation
# Provides sequence position information to the model

def get_positional_encoding(seq_len: int, d_model: int):
    """
    Generate positional encodings for the input sequence
    
    This creates unique position embeddings using sinusoidal functions.
    Each position gets a unique "fingerprint" that tells the model
    where it is in the sequence.
    
    The encoding uses sine and cosine functions with different frequencies
    to create unique patterns for each position.
    """
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    
    # Create the div_term for the sinusoidal functions
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        (-math.log(10000.0) / d_model))
    
    # Apply sin to even indices
    pe[:, 0::2] = torch.sin(position * div_term)
    # Apply cos to odd indices (slice div_term so an odd d_model also works)
    pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])
    
    return pe.unsqueeze(0)  # Add batch dimension

# Test the components
print("Testing Feed-Forward Network and Positional Encoding...")

# Test Feed-Forward Network
d_model = 8
seq_len = 4
batch_size = 1

x = torch.randn(batch_size, seq_len, d_model)
ffn = FeedForward(d_model)
ffn_output = ffn(x)

print(f"FFN Input shape: {x.shape}")
print(f"FFN Output shape: {ffn_output.shape}")

# Test Positional Encoding
pos_encoding = get_positional_encoding(seq_len, d_model)
print(f"Positional encoding shape: {pos_encoding.shape}")

print("Feed-Forward Network and Positional Encoding are working correctly!")


In [None]:
# Complete Transformer Implementation
# Combining all components into the full architecture

class TransformerBlock(nn.Module):
    """
    A single Transformer block - the building block of the Transformer
    
    This combines:
    1. Multi-Head Attention
    2. Feed-Forward Network  
    3. Layer Normalization (stabilizes training)
    4. Residual connections (helps with gradient flow)
    
    The block processes information in two steps:
    Step 1: Multi-Head Attention (determine what's important)
    Step 2: Feed-Forward Network (process the information)
    """
    
    def __init__(self, d_model: int, num_heads: int, d_ff: int = None):
        super().__init__()
        
        # Multi-Head Attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        
        # Feed-Forward Network
        self.feed_forward = FeedForward(d_model, d_ff)
        
        # Layer Normalization (stabilizes training)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(0.1)
        
    def forward(self, x, mask=None):
        """
        Forward pass through the Transformer block
        
        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
            mask: Optional attention mask
            
        Returns:
            Output tensor of same shape as input
        """
        # Step 1: Multi-Head Attention with residual connection
        attn_output, attention_weights = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attn_output))  # Residual connection + LayerNorm
        
        # Step 2: Feed-Forward Network with residual connection  
        ffn_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ffn_output))  # Residual connection + LayerNorm
        
        return x, attention_weights

class Transformer(nn.Module):
    """
    A Transformer encoder-style stack
    
    This implements the encoder-style half of the architecture described in
    "Attention is All You Need": token embeddings plus positional encoding,
    followed by a stack of Transformer blocks. The paper's decoder (with
    cross-attention and an output projection over the vocabulary) is not
    included here.
    
    Each layer refines the representation produced by the previous one.
    As a rough intuition (not a guarantee), earlier layers tend to capture
    local, surface-level relationships and later layers more abstract ones.
    """
    
    def __init__(self, vocab_size: int, d_model: int, num_heads: int, 
                 num_layers: int, max_seq_len: int = 512):
        super().__init__()
        
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        
        # Embedding layer - converts token IDs to dense vectors
        self.embedding = nn.Embedding(vocab_size, d_model)
        
        # Positional encoding (added to embeddings)
        self.register_buffer('pos_encoding', get_positional_encoding(max_seq_len, d_model))
        
        # Stack of Transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads) 
            for _ in range(num_layers)
        ])
        
        # Final layer normalization
        self.final_norm = nn.LayerNorm(d_model)
        
    def forward(self, x, mask=None):
        """
        Forward pass through the entire Transformer
        
        Args:
            x: Input token IDs of shape (batch_size, seq_len)
            mask: Optional attention mask
            
        Returns:
            Tuple (output, attention_weights_list): output embeddings of shape
            (batch_size, seq_len, d_model) and one attention tensor per layer
        """
        batch_size, seq_len = x.shape
        
        # Step 1: Convert token IDs to embeddings
        x = self.embedding(x) * math.sqrt(self.d_model)  # Scale embeddings
        
        # Step 2: Add positional encoding
        x = x + self.pos_encoding[:, :seq_len, :]
        
        # Step 3: Pass through each Transformer block
        attention_weights_list = []
        for transformer_block in self.transformer_blocks:
            x, attention_weights = transformer_block(x, mask)
            attention_weights_list.append(attention_weights)
        
        # Step 4: Final layer normalization
        x = self.final_norm(x)
        
        return x, attention_weights_list

# Test the complete Transformer
print("Testing the Complete Transformer...")

# Model parameters
vocab_size = 1000  # Size of vocabulary
d_model = 64      # Model dimension
num_heads = 8     # Number of attention heads
num_layers = 6    # Number of Transformer blocks
max_seq_len = 128 # Maximum sequence length

# Create the model
transformer = Transformer(vocab_size, d_model, num_heads, num_layers, max_seq_len)

# Create dummy input (token IDs)
batch_size = 2
seq_len = 10
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

print(f"Input shape: {input_ids.shape}")
print(f"Model parameters: {sum(p.numel() for p in transformer.parameters()):,}")

# Run the model
output, attention_weights = transformer(input_ids)
print(f"Output shape: {output.shape}")
print(f"Number of attention weight matrices: {len(attention_weights)}")
print(f"Each attention matrix shape: {attention_weights[0].shape}")

print("SUCCESS! Complete Transformer built and tested!")


## Visualizing Attention Patterns

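Let's visualize the attention weights our Transformer produces. One caveat: the model here is untrained, so its weights are random and the patterns below don't reflect learned relationships. The same heatmaps become informative once a model has been trained, because they show which positions each token draws information from.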


In [None]:
# Attention Visualization Functions
# Create heatmaps to visualize attention patterns

def visualize_attention(attention_weights, tokens=None, layer_idx=0, head_idx=0):
    """
    Visualize attention weights as a heatmap
    
    Args:
        attention_weights: List of attention weight matrices from all layers
        tokens: Optional list of token strings for x/y axis labels
        layer_idx: Which layer to visualize
        head_idx: Which attention head to visualize
    """
    # Get attention weights for the specified layer and head
    attn = attention_weights[layer_idx][0, head_idx].detach().cpu().numpy()  # First batch item; .cpu() so GPU tensors work too
    
    # Create the heatmap
    plt.figure(figsize=(8, 6))
    plt.imshow(attn, cmap='Blues', aspect='auto')
    plt.colorbar(label='Attention Weight')
    
    # Add labels if tokens are provided
    if tokens:
        plt.xticks(range(len(tokens)), tokens, rotation=45)
        plt.yticks(range(len(tokens)), tokens)
    
    plt.xlabel('Key Position (What we attend TO)')
    plt.ylabel('Query Position (What we attend FROM)')
    plt.title(f'Attention Heatmap - Layer {layer_idx}, Head {head_idx}')
    plt.tight_layout()
    plt.show()

def create_simple_example():
    """
    Create a simple example to demonstrate attention patterns
    """
    print("Creating a simple example...")
    
    # Create a small model for demonstration
    vocab_size = 50
    d_model = 32
    num_heads = 4
    num_layers = 2
    
    model = Transformer(vocab_size, d_model, num_heads, num_layers)
    
    # Create a simple sequence (pretend these are meaningful tokens)
    # Example: "The cat sat on the mat"
    sequence = torch.tensor([[1, 5, 8, 12, 1, 15]])  # Token IDs
    tokens = ["The", "cat", "sat", "on", "the", "mat"]
    
    print(f"Input sequence: {tokens}")
    print(f"Token IDs: {sequence}")
    
    # Run the model
    with torch.no_grad():
        output, attention_weights = model(sequence)
    
    print(f"Output shape: {output.shape}")
    print(f"Number of layers: {len(attention_weights)}")
    print(f"Number of heads per layer: {attention_weights[0].shape[1]}")
    
    return model, attention_weights, tokens

# Create and visualize our example
model, attention_weights, tokens = create_simple_example()

print("\nVisualizing attention patterns...")
print("Each cell shows how much attention one word pays to another word.")
print("Darker blue = more attention")

# Visualize attention from the first layer, first head
visualize_attention(attention_weights, tokens, layer_idx=0, head_idx=0)

print("\nLooking at different heads in the same layer...")
# Visualize different heads
for head_idx in range(min(4, attention_weights[0].shape[1])):
    print(f"\nHead {head_idx}:")
    visualize_attention(attention_weights, tokens, layer_idx=0, head_idx=head_idx)


## Key Takeaways

Congratulations! You've built a working Transformer stack, one component at a time. Here's what you've accomplished:

### Core Concepts You Now Understand:

1. **Self-Attention**: How words can attend to each other and determine relationships
2. **Multi-Head Attention**: Why multiple perspectives provide richer understanding
3. **Feed-Forward Networks**: The component that processes attended information
4. **Positional Encoding**: How the model understands word order
5. **Layer Normalization**: Stabilizes training in deep networks
6. **Residual Connections**: Skip connections that help with gradient flow

### Why This Matters:

- **ChatGPT, GPT-4, Claude**: All built on the Transformer architecture
- **Google Translate**: Uses Transformers for better translations
- **Code Completion**: GitHub Copilot, Cursor, etc. use Transformers
- **Image Generation**: DALL-E and similar models build on Transformer components
- **Scientific Research**: Protein folding, drug discovery, etc.

### What You Can Do Next:

1. **Train Your Own Model**: Use this code as a starting point
2. **Experiment with Parameters**: Try different numbers of heads, layers
3. **Add Tasks**: Implement language modeling, classification, etc.
4. **Study Pre-trained Models**: Look at BERT, GPT, T5 architectures
5. **Build Applications**: Create chatbots, translators, code assistants


## Practical Tips & Common Gotchas

### Pro Tips for Working with Transformers:

1. **Start Small**: Begin with small models (2-4 layers, 4-8 heads) before scaling up
2. **Watch Your Memory**: Attention is O(n²) - longer sequences need more memory
3. **Use Mixed Precision**: Automatic mixed precision (`torch.amp`, formerly `torch.cuda.amp`) can speed up training significantly on GPUs that support it
4. **Gradient Clipping**: Helps prevent exploding gradients in deep networks
5. **Learning Rate Scheduling**: Transformers often benefit from warmup + decay
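
To illustrate the warmup + decay tip, here is a small sketch of the schedule given in the original paper: the learning rate rises linearly for `warmup_steps` steps, then decays with the inverse square root of the step number. The function name `warmup_inverse_sqrt_lr` is an assumption for this example; the defaults (`d_model=512`, `warmup_steps=4000`) are the paper's base settings and would likely need tuning for small models like the ones in this guide.

```python
def warmup_inverse_sqrt_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step 0, where step^-0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:6d}: lr = {warmup_inverse_sqrt_lr(step):.6f}")
```

The rate peaks at `step == warmup_steps` and falls off afterwards. One way to use it in PyTorch would be to pass it as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR` with a base learning rate of 1.0, since that scheduler multiplies the base rate by the function's return value.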

### Common Mistakes to Avoid:

1. **Forgetting Positional Encoding**: Without it, your model won't understand word order
2. **Wrong Attention Mask**: Can cause information leakage in autoregressive models
3. **Not Scaling Embeddings**: The original paper multiplies embeddings by √d_model; a common explanation is that this keeps them from being drowned out by the positional encodings
4. **Too Many Layers Too Fast**: Start simple, then add complexity
5. **Ignoring Layer Norm Placement**: Pre-norm vs post-norm can make a big difference
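
The layer norm placement point is easier to see in code. The sketch below, in plain Python on a single vector, contrasts the post-norm ordering used by `TransformerBlock` in this guide with the pre-norm alternative. The layer norm here has no learned scale or shift, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Post-norm: LayerNorm(x + Sublayer(x)), the ordering used in this guide."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Pre-norm: x + Sublayer(LayerNorm(x)); the residual path is left unnormalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print("post-norm:", [round(v, 3) for v in post_norm(x, toy_sublayer)])
print("pre-norm: ", [round(v, 3) for v in pre_norm(x, toy_sublayer)])
```

In the post-norm output the whole vector is re-centered around zero, while in the pre-norm output the original `x` passes through the residual path untouched. That untouched path is one commonly cited reason pre-norm tends to be easier to train in deep stacks.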

### Real-World Considerations:

- **Computational Cost**: Attention scales quadratically with sequence length
- **Memory Usage**: Large models need lots of GPU memory
- **Training Time**: Can take days/weeks for large models
- **Data Requirements**: Transformers need lots of data to work well
- **Hyperparameter Sensitivity**: Small changes can have big effects
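
To put a number on the quadratic cost, the following back-of-the-envelope sketch estimates the memory needed just to hold one layer's attention weights, the `(batch_size, num_heads, seq_len, seq_len)` tensor that the implementation in this guide materializes. The function name `attention_matrix_bytes` is an assumption, and the 4 bytes per value assumes float32; fused attention kernels may avoid storing this tensor at all.

```python
def attention_matrix_bytes(batch_size, num_heads, seq_len, bytes_per_value=4):
    """Bytes needed to hold one layer's attention weights.

    Tensor shape is (batch_size, num_heads, seq_len, seq_len);
    bytes_per_value=4 assumes float32.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

for seq_len in (512, 1024, 2048, 4096):
    mib = attention_matrix_bytes(batch_size=1, num_heads=8, seq_len=seq_len) / 2**20
    print(f"seq_len={seq_len:5d}: {mib:8.1f} MiB per layer")
```

Each doubling of the sequence length quadruples the figure, and the total grows further with batch size and number of layers.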

### Further Reading:

- **Original Paper**: "Attention is All You Need" (Vaswani et al., 2017)
- **The Illustrated Transformer**: Great visual explanations
- **Hugging Face Course**: Practical tutorials on using pre-trained models
- **Papers With Code**: See latest implementations and benchmarks


## Congratulations! You're Now a Transformer Expert!

You've built a working Transformer stack, one component at a time! This is a significant achievement, and you now understand one of the most important architectures in modern AI.

### What You've Accomplished:

- **Built Self-Attention** from first principles  
- **Implemented Multi-Head Attention** with multiple perspectives  
- **Created Feed-Forward Networks** for information processing  
- **Added Positional Encoding** for sequence understanding  
- **Constructed Complete Transformer** with all components  
- **Visualized Attention Patterns** to see what the model learns  
- **Learned Practical Tips** for real-world applications  

### You're Ready For:

- Building your own language models
- Understanding how ChatGPT, GPT-4, and Claude work
- Contributing to open-source AI projects
- Creating AI applications and tools
- Pursuing advanced research in NLP/AI

### Final Thoughts:

This wasn't easy for me the first time, and it might have been difficult for you too! The fact that you've followed this guide and built this out means you understand a little more (or maybe a lot more) about how the tools we use in our everyday lives work. That's the difference between someone who can use AI tools and someone who can build them.

---

*"Attention is all you need... but practice and coffee help too :)"*
