# Chapter 3: Attention Mechanisms - The Engine of Transformers

**Portfolio Project: Building LLMs from Scratch on AWS** 🧠

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yourusername/llm-from-scratch-aws/blob/main/03_Attention_Mechanisms.ipynb)

---

## 📋 Chapter Overview

Implement the attention mechanism - the core innovation of transformers:

- Self-attention fundamentals
- Scaled dot-product attention
- Multi-head attention
- Causal masking for autoregressive models
- Efficient attention implementations

**Learning Objectives:**

✅ Understand self-attention conceptually  
✅ Implement attention from scratch  
✅ Build multi-head attention  
✅ Apply causal masking  
✅ Optimize for efficiency  

**AWS Services:** SageMaker Notebook  
**Estimated Cost:** $0.10 - $0.50

---

## 🔧 Environment Setup

### Cell Purpose: Install packages and import libraries

In [None]:
# Install and import required packages
import sys

IN_COLAB = 'google.colab' in sys.modules
IN_SAGEMAKER = '/opt/ml' in sys.executable

if IN_COLAB or IN_SAGEMAKER:
    !pip install -q torch matplotlib numpy
    print("✅ Packages installed!")

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from importlib.metadata import version

print("="*60)
print("ENVIRONMENT")
print("="*60)
print(f"PyTorch: {version('torch')}")
print(f"CUDA: {torch.cuda.is_available()}")
print("="*60)

## 3.1 Self-Attention Fundamentals

Self-attention lets each token attend to every token in the sequence, including itself.

### Key Concepts:

- **Query (Q)**: What am I looking for?
- **Key (K)**: What do I contain?
- **Value (V)**: What information do I provide?

### Intuition:

```
"The cat sat on the mat"

For token "cat":
- Query: "I need context about this animal"
- Keys: All tokens advertise their content
- Values: Actual information from relevant tokens
- Attention: "The" and "sat" are most relevant to "cat"
```

### Mathematics:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```

Where:

- Q, K, V are linear projections of the input
- √d_k is the scaling factor (square root of the key dimension)
- softmax normalizes each row of attention weights to sum to 1

### Cell Purpose: Implement simple self-attention from scratch

In [None]:
# Simple Self-Attention (no trainable weights)
# Example sentence: "Your journey starts with one step"
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your
   [0.55, 0.87, 0.66], # journey
   [0.57, 0.85, 0.64], # starts
   [0.22, 0.58, 0.33], # with
   [0.77, 0.25, 0.10], # one
   [0.05, 0.80, 0.55]] # step
)

print("="*60)
print("SIMPLE SELF-ATTENTION")
print("="*60)
print(f"Input shape: {inputs.shape}")
print(f"Input (first 3 tokens):\n{inputs[:3]}\n")

# Step 1: Compute attention scores (using token 2 as query)
query = inputs[1]  # "journey"
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print(f"Query token: 'journey' (index 1)")
print(f"Attention scores: {attn_scores_2}\n")

# Step 2: Normalize with softmax
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print(f"Attention weights (normalized):")
print(f"{attn_weights_2}")
print(f"Sum: {attn_weights_2.sum():.4f}\n")

# Step 3: Compute context vector
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

print(f"Context vector for 'journey':")
print(f"{context_vec_2}")
print("="*60)

### Cell Purpose: Visualize attention patterns

In [None]:
# Compute attention for all tokens
def compute_attention_scores(inputs):
    """Compute attention scores for all token pairs"""
    attn_scores = torch.empty(inputs.shape[0], inputs.shape[0])

    for i, query in enumerate(inputs):
        for j, key in enumerate(inputs):
            attn_scores[i, j] = torch.dot(query, key)

    return attn_scores

def compute_attention_weights(attn_scores):
    """Normalize scores to weights using softmax"""
    return torch.softmax(attn_scores, dim=-1)

# Compute full attention matrix
attn_scores = compute_attention_scores(inputs)
attn_weights = compute_attention_weights(attn_scores)

print(f"Attention weight matrix shape: {attn_weights.shape}")
print(f"\nAttention weights:\n{attn_weights}\n")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
tokens = ['Your', 'journey', 'starts', 'with', 'one', 'step']

# Heatmap
im = ax1.imshow(attn_weights.numpy(), cmap='Blues')
ax1.set_xticks(range(len(tokens)))
ax1.set_yticks(range(len(tokens)))
ax1.set_xticklabels(tokens, rotation=45)
ax1.set_yticklabels(tokens)
ax1.set_xlabel('Keys', fontweight='bold')
ax1.set_ylabel('Queries', fontweight='bold')
ax1.set_title('Attention Weight Matrix', fontsize=13, fontweight='bold')

# Add values
for i in range(len(tokens)):
    for j in range(len(tokens)):
        text = ax1.text(j, i, f'{attn_weights[i, j].item():.2f}',
                       ha="center", va="center", color="black", fontsize=8)

plt.colorbar(im, ax=ax1, label='Attention Weight')

# Bar chart for "journey"
ax2.barh(tokens, attn_weights[1].numpy(), color='#3498db', edgecolor='black')
ax2.set_xlabel('Attention Weight', fontweight='bold')
ax2.set_title('Attention for "journey"', fontsize=13, fontweight='bold')
ax2.invert_yaxis()
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Attention allows each token to gather context from all others!")
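
The nested loops in `compute_attention_scores` can also be written as a single matrix product. The sketch below is an illustration under that assumption, not a replacement for the cell above; the names `scores_loop`, `scores_matmul`, and `context_vecs` are introduced here only for the comparison.

```python
import torch

# Same toy embeddings as in the cells above
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

# Loop version: one dot product per (query, key) pair
n = inputs.shape[0]
scores_loop = torch.empty(n, n)
for i in range(n):
    for j in range(n):
        scores_loop[i, j] = torch.dot(inputs[i], inputs[j])

# Matrix version: row i of `inputs` against every row of `inputs`
scores_matmul = inputs @ inputs.T

# Softmax over the key dimension, then a weighted sum of the inputs
weights = torch.softmax(scores_matmul, dim=-1)
context_vecs = weights @ inputs  # one context vector per token

print(torch.allclose(scores_loop, scores_matmul))
print(context_vecs.shape)
```

The matrix form is what the later classes in this chapter use, since it extends directly to batches and heads.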

## 3.2 Scaled Dot-Product Attention

Add trainable parameters and scaling for better performance.

### Improvements:

1. **Learnable Projections**: W_q, W_k, W_v matrices
2. **Scaling**: Divide by √d_k so dot products do not grow with the key dimension and saturate the softmax
3. **Batch Processing**: Handle multiple sequences

### Formula:

```python
Q = X @ W_q  # Query projection
K = X @ W_k  # Key projection
V = X @ W_v  # Value projection

scores = (Q @ K.T) / sqrt(d_k)
weights = softmax(scores)
output = weights @ V
```
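
To see why the √d_k factor matters, here is a small sketch that assumes query and key entries are independent standard-normal values (an assumption for illustration; real projections need not behave this way). Under that assumption a dot product over `d_k` dimensions has a standard deviation of about √d_k, and dividing by √d_k brings it back to about 1.

```python
import torch

torch.manual_seed(0)

d_k = 64
# Hypothetical queries and keys with standard-normal entries
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

raw = (q * k).sum(dim=-1)   # 1000 unscaled dot products
scaled = raw / d_k ** 0.5   # the same scores divided by sqrt(d_k)

print(f"std of raw scores:    {raw.std().item():.2f}")     # roughly sqrt(64) = 8
print(f"std of scaled scores: {scaled.std().item():.2f}")  # roughly 1

# Larger scores make the softmax more peaked: compare the largest weight
peak_raw = torch.softmax(raw[:8], dim=0).max()
peak_scaled = torch.softmax(scaled[:8], dim=0).max()
print(f"max weight without scaling: {peak_raw.item():.3f}")
print(f"max weight with scaling:    {peak_scaled.item():.3f}")
```

A more peaked softmax pushes most weights toward zero, which in turn gives small gradients for those positions during training.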

### Cell Purpose: Implement scaled dot-product attention with trainable weights

In [None]:
# Scaled Dot-Product Attention
class SelfAttention_v1(nn.Module):
    """Self-attention with trainable projections"""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Scaled dot-product
        attn_scores = queries @ keys.T
        attn_scores = attn_scores / keys.shape[-1]**0.5

        attn_weights = torch.softmax(attn_scores, dim=-1)

        context_vectors = attn_weights @ values
        return context_vectors

# Test
torch.manual_seed(123)
d_in, d_out = 3, 2
sa_v1 = SelfAttention_v1(d_in, d_out)

# Example input
x = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

print("="*60)
print("SCALED DOT-PRODUCT ATTENTION")
print("="*60)
print(f"Input shape: {x.shape}")
print(f"Output shape: {sa_v1(x).shape}")
print(f"\nOutput:\n{sa_v1(x)}")
print("="*60)

# Visualize attention weights
# (the projections are randomly initialized; nothing has been trained yet)
with torch.no_grad():
    queries = sa_v1.W_query(x)
    keys = sa_v1.W_key(x)
    attn_scores = queries @ keys.T / keys.shape[-1]**0.5
    attn_weights = torch.softmax(attn_scores, dim=-1)

plt.figure(figsize=(8, 6))
im = plt.imshow(attn_weights.numpy(), cmap='viridis')
plt.colorbar(im, label='Attention Weight')
plt.xlabel('Key Position', fontweight='bold')
plt.ylabel('Query Position', fontweight='bold')
plt.title('Attention Weights (Untrained Projections)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
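
The formula in this section writes `Q = X @ W_q`, while `SelfAttention_v1` uses `nn.Linear`. The two agree once you account for how `nn.Linear` stores its weight: with shape `(d_out, d_in)`, applied as `x @ weight.T`. A minimal check, using a random input `x` made up for this sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

d_in, d_out = 3, 2
W_query = nn.Linear(d_in, d_out, bias=False)

x = torch.rand(6, d_in)  # hypothetical input: 6 tokens, 3 features each

q_layer = W_query(x)              # what the module computes
q_manual = x @ W_query.weight.T   # the same projection written as a matmul

print(W_query.weight.shape)       # torch.Size([2, 3])
print(torch.allclose(q_layer, q_manual))
```

So `W_q` in the formula corresponds to `W_query.weight.T` in the code.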

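## 3.3 Causal Self-Attention

For autoregressive models (like GPT), we need to prevent attending to future tokens.

### Why Causal Masking?

- During training: Model shouldn't see future tokens
- Simulates generation: Only the current and past tokens are available
- Implementation: Mask with -inf before softmax

### Mask Structure:

```
[[1, 0, 0, 0],   After softmax:  [[1.0, 0.0, 0.0, 0.0],
 [1, 1, 0, 0],    ->              [0.5, 0.5, 0.0, 0.0],
 [1, 1, 1, 0],                    [0.33, 0.33, 0.33, 0.0],
 [1, 1, 1, 1]]                    [0.25, 0.25, 0.25, 0.25]]
```

In this diagram, 1 marks a position the query may attend to, and the right-hand matrix assumes equal scores for all visible positions. The implementation in this chapter stores the complementary mask, built with `torch.triu(..., diagonal=1)`, where 1 marks a future position to be filled with -inf.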
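
The "after softmax" matrix can be reproduced with a few lines. This sketch assumes all raw scores are equal (zeros here), which is the case the diagram depicts; the variable names are chosen for this example only.

```python
import torch

num_tokens = 4
scores = torch.zeros(num_tokens, num_tokens)  # equal scores for every pair

# True above the diagonal: key position j > query position i (the future)
future = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()

# -inf becomes exactly 0 after softmax, so future positions get no weight
masked_scores = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)
# rows: [1, 0, 0, 0], [0.5, 0.5, 0, 0],
#       [0.33, 0.33, 0.33, 0], [0.25, 0.25, 0.25, 0.25]
```

Masking before the softmax, rather than zeroing weights afterwards, means each row still sums to 1 without a separate renormalization step.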

### Cell Purpose: Implement causal self-attention with masking

In [None]:
# Causal Self-Attention
class CausalAttention(nn.Module):
    """Causal self-attention for autoregressive models"""
    def __init__(self, d_in, d_out, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)

        # Create causal mask
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Scaled attention scores
        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores = attn_scores / keys.shape[-1]**0.5

        # Apply causal mask
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
        )

        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

# Test
torch.manual_seed(123)
context_length = 6
ca = CausalAttention(d_in=3, d_out=2, context_length=context_length, dropout=0.1)

# Batch of sequences
batch = x.unsqueeze(0)  # Add batch dimension

print("="*60)
print("CAUSAL ATTENTION")
print("="*60)
print(f"Input shape: {batch.shape}")
print(f"Output shape: {ca(batch).shape}")
print("="*60)

# Visualize causal mask
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))

# Raw mask
mask = torch.triu(torch.ones(6, 6), diagonal=1)
ax1.imshow(mask.numpy(), cmap='Greys', vmin=0, vmax=1)
ax1.set_title('Causal Mask\n(1 = masked)', fontsize=11, fontweight='bold')
ax1.set_xlabel('Key Position')
ax1.set_ylabel('Query Position')

# Attention scores before softmax
with torch.no_grad():
    queries = ca.W_query(batch)
    keys = ca.W_key(batch)
    scores = queries @ keys.transpose(1, 2) / keys.shape[-1]**0.5
    scores_masked = scores.clone()
    scores_masked.masked_fill_(mask.bool(), -torch.inf)

im2 = ax2.imshow(scores[0].numpy(), cmap='RdBu_r', aspect='auto')
ax2.set_title('Scores Before Mask', fontsize=11, fontweight='bold')
ax2.set_xlabel('Key Position')
plt.colorbar(im2, ax=ax2)

im3 = ax3.imshow(scores_masked[0].numpy(), cmap='RdBu_r', aspect='auto', vmin=-10, vmax=2)
ax3.set_title('Scores After Mask\n(-inf for future)', fontsize=11, fontweight='bold')
ax3.set_xlabel('Key Position')
plt.colorbar(im3, ax=ax3)

plt.tight_layout()
plt.show()

print("🔒 Causal masking ensures we only attend to past tokens!")
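
`CausalAttention` applies `nn.Dropout` to the attention weights, so in training mode its output changes from call to call. The sketch below isolates that step with a made-up uniform weight matrix and `p=0.5` (both chosen only for illustration): surviving entries are rescaled by `1 / (1 - p)`, and in eval mode dropout is a no-op.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

dropout = nn.Dropout(p=0.5)
weights = torch.full((4, 4), 0.25)  # hypothetical uniform attention weights

dropout.train()
dropped = dropout(weights)  # each entry is either 0 or 0.25 / (1 - 0.5) = 0.5

dropout.eval()
kept = dropout(weights)     # unchanged: dropout is disabled in eval mode

print(dropped)
print(torch.equal(kept, weights))
```

Rows of `dropped` no longer sum to exactly 1, which is expected. Call `.eval()` on the attention module when you need deterministic outputs, for example at inference time.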

## 3.4 Multi-Head Attention

Instead of one attention mechanism, use multiple "heads" in parallel.

### Benefits:

- **Diverse Representations**: Each head can learn different patterns
- **Parallel Processing**: All heads are computed simultaneously
- **Better Performance**: Multi-head attention outperformed single-head attention in the ablations of the original Transformer paper

### Architecture:

```
Input → [Head 1, Head 2, ..., Head h] → Concat → Linear → Output
```

### Example:

- 12 heads with d_model=768
- Each head: 768/12 = 64 dimensions
- Concatenated: 12 × 64 = 768 dimensions
- Final projection: 768 → 768
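
The 768 / 12 = 64 arithmetic maps onto a `view` plus `transpose`. This is a shape-only sketch with a random tensor standing in for projected queries; the batch size and sequence length are arbitrary values picked for the example.

```python
import torch

b, num_tokens = 2, 5              # arbitrary batch size and sequence length
d_model, num_heads = 768, 12
head_dim = d_model // num_heads   # 768 / 12 = 64

x = torch.randn(b, num_tokens, d_model)  # stand-in for projected Q, K, or V

# Split the last dimension into heads, then move heads next to the batch axis
heads = x.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(heads.shape)   # torch.Size([2, 12, 5, 64])

# Concatenate: undo the transpose and flatten heads back into d_model
merged = heads.transpose(1, 2).contiguous().view(b, num_tokens, d_model)
print(merged.shape)  # torch.Size([2, 5, 768])
```

Splitting and merging only rearranges values, so `merged` equals `x` exactly. The `.contiguous()` call is needed because `view` cannot operate on the non-contiguous tensor that `transpose` returns.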

### Cell Purpose: Implement complete multi-head attention

In [None]:
#####################################
class MultiHeadAttention(nn.Module):
    """Multi-head self-attention mechanism"""
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        # QKV projections
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split into heads
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose for attention
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Scaled dot-product attention
        attn_scores = queries @ keys.transpose(2, 3)

        # Apply causal mask
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Combine heads
        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)

        return context_vec
#####################################

# Test Multi-Head Attention
torch.manual_seed(123)
batch_size = 2
context_length = 6
d_in = 3

# Create sample batch
batch = torch.randn(batch_size, context_length, d_in)

# Multi-head attention parameters
d_out = 8
num_heads = 2

mha = MultiHeadAttention(
    d_in=d_in,
    d_out=d_out,
    context_length=context_length,
    dropout=0.1,
    num_heads=num_heads,
    qkv_bias=False
)

print("="*60)
print("MULTI-HEAD ATTENTION")
print("="*60)
print(f"Input shape: {batch.shape}")
print(f"Number of heads: {num_heads}")
print(f"Head dimension: {d_out // num_heads}")
print(f"Output shape: {mha(batch).shape}")
print("="*60)

# Visualize multi-head attention patterns
with torch.no_grad():
    # Get attention weights from each head
    b, num_tokens, d = batch.shape

    queries = mha.W_query(batch)
    keys = mha.W_key(batch)

    # Reshape for heads
    queries = queries.view(b, num_tokens, num_heads, mha.head_dim).transpose(1, 2)
    keys = keys.view(b, num_tokens, num_heads, mha.head_dim).transpose(1, 2)

    # Compute attention for each head
    attn_scores = queries @ keys.transpose(2, 3)
    mask = mha.mask.bool()[:num_tokens, :num_tokens]
    attn_scores.masked_fill_(mask, -torch.inf)
    attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

# Plot attention from each head
fig, axes = plt.subplots(1, num_heads, figsize=(12, 5))
if num_heads == 1:
    axes = [axes]

for head_idx in range(num_heads):
    ax = axes[head_idx]
    im = ax.imshow(attn_weights[0, head_idx].numpy(), cmap='viridis', aspect='auto')
    ax.set_title(f'Head {head_idx+1}', fontsize=12, fontweight='bold')
    ax.set_xlabel('Key Position')
    ax.set_ylabel('Query Position')
    plt.colorbar(im, ax=ax, label='Attention')

plt.suptitle('Multi-Head Attention Patterns (First Sample)',
             fontsize=14, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()

print("🎯 Each head learns different attention patterns!")
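
On the "efficient attention implementations" item from the chapter overview: PyTorch 2.0 and later ship `torch.nn.functional.scaled_dot_product_attention`, which fuses the score, mask, softmax, and weighted-sum steps. The sketch below assumes such a PyTorch version is installed and compares the fused call against the manual steps used in `MultiHeadAttention`, with random tensors standing in for the per-head queries, keys, and values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

b, num_heads, num_tokens, head_dim = 2, 2, 6, 4
# Hypothetical per-head tensors, shaped (batch, heads, tokens, head_dim)
q = torch.randn(b, num_heads, num_tokens, head_dim)
k = torch.randn(b, num_heads, num_tokens, head_dim)
v = torch.randn(b, num_heads, num_tokens, head_dim)

# Manual path: scale, mask future positions, softmax, weighted sum
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
future = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
manual = weights @ v

# Fused path (requires PyTorch >= 2.0); dropout is off by default
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(manual, fused, atol=1e-5))
```

The fused function can dispatch to memory-efficient kernels on supported hardware, which avoids materializing the full attention matrix. Whether that happens depends on the device, dtype, and PyTorch build.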

## 📝 Chapter Summary

### What We Implemented:

1. ✅ **Self-Attention**: Basic attention mechanism
2. ✅ **Scaled Attention**: With trainable projections
3. ✅ **Causal Masking**: For autoregressive models
4. ✅ **Multi-Head**: Parallel attention heads
5. ✅ **Complete Implementation**: A working multi-head causal attention module

### Key Insights:

- Attention allows dynamic context aggregation
- Scaling by √d_k keeps the softmax from saturating as the key dimension grows
- Causal masking enables autoregressive training
- Multi-head provides diverse representations

### Mathematical Summary:

```python
# For each head h:
Q_h = X @ W_q_h
K_h = X @ W_k_h
V_h = X @ W_v_h
Attention_h = softmax(Q_h K_h^T / √d_k) V_h

# Combine heads:
MultiHead = Concat(Attention_1, ..., Attention_h) @ W_o
```

### Performance Characteristics:

- Time Complexity: O(n²d), where n = sequence length and d = model dimension
- Space Complexity: O(n²) for the attention matrix
- Parallelizable: All heads computed simultaneously

### Next Steps:

➡️ **Chapter 4**: Combine attention with feed-forward networks to build the complete GPT model!

---

## 🔗 Resources

**Papers and Tutorials:**

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)

**Ready for Chapter 4? Let's build GPT! 🤖**