# Lesson 9c: Transformers & Attention Mechanisms

In Lesson 9b, we learned how RNNs process sequences one token at a time, maintaining hidden state as they go. This sequential processing creates two problems: it's slow (can't parallelize), and it struggles with long-range dependencies (information from 100 tokens ago gets diluted).

Transformers solve both problems by processing all tokens simultaneously and using attention to directly connect any two positions in the sequence, regardless of distance.

**What you'll learn:**
- How attention lets models focus on relevant parts of the input
- The math behind self-attention and multi-head attention
- The complete Transformer architecture (encoder-decoder)
- Why positional encodings matter for sequence modeling
- Differences between BERT (bidirectional) and GPT (autoregressive)
- Practical NLP with Hugging Face Transformers
- Vision Transformers and beyond language

**Prerequisites:**
- Neural networks (Lesson 3a, 3b)
- RNNs helpful but not required (Lesson 9b)
- Basic linear algebra (matrix multiplication, dot products)

**Why this matters:**
Transformers power ChatGPT, Claude, BERT, and most modern NLP systems. Since 2017, they've expanded beyond language into computer vision (Vision Transformers), protein folding (AlphaFold), and multimodal tasks. If you want to work with modern AI systems, you need to understand how they work.

---


## 1. Installation & Setup

This notebook requires the Hugging Face Transformers library, the de facto standard for working with Transformer models in 2025. We'll automatically install all required dependencies.

In [None]:
# Auto-install required packages
import sys
import subprocess

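import re

# pip names whose import name differs
IMPORT_NAMES = {'scikit-learn': 'sklearn'}

def install_package(requirement):
    """Install a requirement with pip if its module cannot be imported."""
    # Drop extras and version specifiers: 'transformers>=4.35.0' -> 'transformers'
    name = re.split(r'[\[<>=!~]', requirement)[0].strip()
    module = IMPORT_NAMES.get(name, name.replace('-', '_'))
    try:
        __import__(module)  # checks presence only, not the installed version
        print(f"✓ {name} already installed")
    except ImportError:
        print(f"Installing {requirement}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", requirement])
        print(f"✓ {requirement} installed successfully")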

# Required packages
packages = [
    'transformers>=4.35.0',
    'torch>=2.0.0',
    'numpy>=1.24.0',
    'matplotlib>=3.7.0',
    'scikit-learn>=1.3.0',
    'pandas>=2.0.0',
    'seaborn>=0.12.0',
    'datasets>=2.14.0'
]

for package in packages:
    install_package(package)

print("\n✓ All dependencies installed successfully!")

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, List, Optional
import warnings
import math

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hugging Face Transformers
import transformers  # module import so transformers.__version__ works below
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
    pipeline
)
from datasets import load_dataset

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Configuration
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Device: {device}")
print(f"GPU Available: {torch.cuda.is_available()}")

---

## 2. Understanding Attention: The Core Mechanism

### 2.1 The Problem with RNNs

Before Transformers, sequence modeling relied on RNNs/LSTMs, which had critical limitations:

1. **Sequential processing**: Must process tokens one-by-one, cannot parallelize
2. **Limited context**: Struggle with long-range dependencies (even with LSTMs)
3. **Information bottleneck**: Entire sequence compressed into fixed-size hidden state

**Example failure case:**
```
"The animal didn't cross the street because it was too tired."
```
To resolve what "it" refers to, the model must link it back to "animal" several tokens earlier. In longer texts the referent can be hundreds of tokens away, which is where RNNs struggle.

### 2.2 The Attention Mechanism: How It Works

**Key innovation**: Instead of processing sequentially, attention allows the model to **look at all positions simultaneously** and decide which parts to focus on.

**Attention in one sentence:**
*"For each word, compute how much I should attend to every other word, then create a weighted combination."*

### 2.3 Attention Mathematics

The attention mechanism computes three matrices from the input:

- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I contain?"
- **Value (V)**: "What information do I provide?"

**Scaled Dot-Product Attention formula:**

```
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
```

Where:
- `Q·K^T`: Compute similarity scores between all query-key pairs
- `√d_k`: Scaling factor (keeps large dot products from saturating the softmax, where gradients become very small)
- `softmax(...)`: Convert scores to probability distribution
- `... · V`: Weight values by attention scores
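To see why the scaling term matters, here is a minimal sketch using only the standard library. The score values are made up for illustration; it compares a softmax over raw scores with a softmax over the same scores divided by √d_k for d_k = 64.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 8.0, 0.0, -8.0]  # hypothetical unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [2.0, 1.0, 0.0, -1.0]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("unscaled:", [round(w, 4) for w in raw_weights])
print("scaled:  ", [round(w, 4) for w in scaled_weights])
```

Without scaling, almost all of the weight lands on a single position, so the other positions receive near-zero gradient. With scaling, the distribution stays soft enough for every position to contribute.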

Let's implement attention from scratch to understand it:


In [None]:
# Implement scaled dot-product attention from scratch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix of shape (batch_size, seq_len, d_k)
        K: Key matrix of shape (batch_size, seq_len, d_k)
        V: Value matrix of shape (batch_size, seq_len, d_v)
        mask: Optional mask to prevent attention to certain positions
    
    Returns:
        output: Attention-weighted values
        attention_weights: Attention scores (for visualization)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute attention scores (Q · K^T)
    scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # Step 2: Scale by √d_k (prevents softmax saturation)
    scores = scores / math.sqrt(d_k)
    
    # Step 3: Apply mask if provided (a large negative score becomes ~0 after softmax)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Step 4: Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    
    # Step 5: Weight values by attention scores
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

# Demonstrate with simple example
print("Demonstrating Scaled Dot-Product Attention:\n")
print("="*60)

# Create simple inputs (batch_size=1, seq_len=4, d_model=8)
seq_len = 4
d_model = 8

# Random query, key, value matrices
Q = torch.randn(1, seq_len, d_model)
K = torch.randn(1, seq_len, d_model)
V = torch.randn(1, seq_len, d_model)

# Compute attention
output, attn_weights = scaled_dot_product_attention(Q, K, V)

print(f"Input shapes:")
print(f"  Query (Q): {Q.shape}")
print(f"  Key (K):   {K.shape}")
print(f"  Value (V): {V.shape}")
print(f"\nOutput shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print("\nAttention weights (how much each position attends to others):")
print(attn_weights[0].detach().numpy())
print("\nNote: Each row sums to 1.0 (probability distribution)")
print(f"Row sums: {attn_weights[0].sum(dim=-1).numpy()}")
print("="*60)

# Visualize attention weights
plt.figure(figsize=(8, 6))
sns.heatmap(attn_weights[0].detach().numpy(), 
           annot=True, fmt='.2f', cmap='YlOrRd',
           xticklabels=[f'Pos {i}' for i in range(seq_len)],
           yticklabels=[f'Pos {i}' for i in range(seq_len)],
           cbar_kws={'label': 'Attention Weight'})
plt.xlabel('Key Position', fontsize=12)
plt.ylabel('Query Position', fontsize=12)
plt.title('Attention Weight Matrix\n(Each row shows where that position attends)', 
         fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('attention_weights.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Insight:")
print("Each position can attend to ALL other positions simultaneously.")
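print("All positions are processed in parallel, and any two positions are linked in a single step.")
print("The context is still bounded: the score matrix grows quadratically with sequence length.")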

---

## 3. Multi-Head Attention: Learning Multiple Representations

### 3.1 Why Multiple Heads?

**Problem with single attention**: A single head produces one attention distribution per position, so different kinds of relationships between words have to compete for the same weights.

**Solution**: Use **multiple attention heads** in parallel, each learning different aspects:
- Head 1 might learn syntactic dependencies (subject-verb agreement)
- Head 2 might learn semantic relationships (synonyms, antonyms)
- Head 3 might learn positional patterns (nearby words)

### 3.2 Multi-Head Attention Architecture

**Process:**
1. Project Q, K, V into `h` different subspaces (using learned linear projections)
2. Compute attention in each subspace independently (parallel)
3. Concatenate all heads
4. Apply final linear projection

**Mathematical formulation:**
```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O

where head_i = Attention(Q·W^Q_i, K·W^K_i, V·W^V_i)
```

**Standard configuration (BERT-base, GPT-2 small):**
- Model dimension `d_model = 768`
- Number of heads `h = 12`
- Each head dimension `d_k = d_model / h = 64`
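The arithmetic behind this configuration can be checked with a short sketch. The helper name `multi_head_attention_shape` is hypothetical, and it assumes the common layout of four `d_model × d_model` linear layers (query, key, value, output), each with a bias.

```python
def multi_head_attention_shape(d_model, num_heads):
    """Return (d_k, parameter_count) for a multi-head attention block.

    Assumes four d_model x d_model linear layers (W_q, W_k, W_v, W_o),
    each with a bias vector.
    """
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads
    params_per_projection = d_model * d_model + d_model  # weights + bias
    return d_k, 4 * params_per_projection

for d_model, num_heads in [(768, 12), (512, 8)]:
    d_k, n_params = multi_head_attention_shape(d_model, num_heads)
    print(f"d_model={d_model}, heads={num_heads}: d_k={d_k}, parameters={n_params:,}")
```

Under these assumptions the number of heads does not change the parameter count. It only changes how the `d_model` dimensions are partitioned among the heads.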

Let's implement multi-head attention:

In [None]:
# Implement multi-head attention from scratch

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention mechanism.
    
    This is the core component of Transformer architecture.
    """
    
    def __init__(self, d_model=512, num_heads=8):
        """
        Args:
            d_model: Dimension of the model (embedding size)
            num_heads: Number of attention heads
        """
        super(MultiHeadAttention, self).__init__()
        
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension per head
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)
    
    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, d_k).
        Transpose to shape (batch_size, num_heads, seq_len, d_k)
        """
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        return x.transpose(1, 2)
    
    def forward(self, Q, K, V, mask=None):
        """
        Forward pass of multi-head attention.
        
        Args:
            Q, K, V: Query, Key, Value tensors of shape (batch_size, seq_len, d_model)
            mask: Optional attention mask
        
        Returns:
            output: Attention output
            attention_weights: Attention scores from all heads
        """
        batch_size = Q.shape[0]
        
        # 1. Linear projections
        Q = self.W_q(Q)  # (batch_size, seq_len, d_model)
        K = self.W_k(K)
        V = self.W_v(V)
        
        # 2. Split into multiple heads
        Q = self.split_heads(Q, batch_size)  # (batch_size, num_heads, seq_len, d_k)
        K = self.split_heads(K, batch_size)
        V = self.split_heads(V, batch_size)
        
        # 3. Scaled dot-product attention for all heads in parallel
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)
        
        # 4. Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, -1, self.d_model)
        
        # 5. Final linear projection
        output = self.W_o(attn_output)
        
        return output, attn_weights

# Demonstrate multi-head attention
print("\nMulti-Head Attention Demonstration:\n")
print("="*70)

# Create multi-head attention module
d_model = 512
num_heads = 8
batch_size = 2
seq_len = 10

mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

# Create sample input
sample_input = torch.randn(batch_size, seq_len, d_model)

# Forward pass
output, attn_weights = mha(sample_input, sample_input, sample_input)

print(f"Configuration:")
print(f"  Model dimension (d_model): {d_model}")
print(f"  Number of heads: {num_heads}")
print(f"  Dimension per head (d_k): {d_model // num_heads}")
print(f"\nInput shape: {sample_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"  (batch_size, num_heads, seq_len, seq_len)")

# Count parameters
total_params = sum(p.numel() for p in mha.parameters())
print(f"\nTotal parameters in multi-head attention: {total_params:,}")
print("="*70)

# Visualize attention from different heads
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for head_idx in range(8):
    sns.heatmap(attn_weights[0, head_idx].detach().numpy(), 
               cmap='YlOrRd', ax=axes[head_idx],
               cbar=False, square=True)
    axes[head_idx].set_title(f'Head {head_idx + 1}', fontsize=11, fontweight='bold')
    axes[head_idx].set_xlabel('Key', fontsize=9)
    axes[head_idx].set_ylabel('Query', fontsize=9)

plt.suptitle('Attention Patterns from 8 Different Heads\n(Randomly initialized weights, so patterns are not yet meaningful)',
            fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('multihead_attention_patterns.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Insight:")
print("Each head has its own projections, so each produces a different attention pattern.")
print("These heads are untrained; after training, heads can specialize in different relationships.")

---

## 4. Positional Encoding: Injecting Sequence Order

### 4.1 The Position Problem

**Critical issue**: Attention has no notion of word order!

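- Without position information, "The cat chased the mouse" and "The mouse chased the cat" give every token exactly the same output representation; only the order of the outputs changes
- But word order carries crucial meaning: it tells us who chased whom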

**Solution**: Add **positional encodings** to the input embeddings.
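The order-blindness described above can be demonstrated with a minimal sketch. It uses a simplified single-head attention with Q = K = V = X and no learned projections, and the 2-d embeddings are made-up values for illustration.

```python
import math

def self_attention(X):
    """Single-head self-attention with Q = K = V = X (no projections, no positions)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical 2-d embeddings for three tokens
cat, chased, mouse = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

out_a = self_attention([cat, chased, mouse])  # "cat chased mouse"
out_b = self_attention([mouse, chased, cat])  # "mouse chased cat"

# The representation of "cat" is the same in both sentences; only its row index differs.
print([round(v, 4) for v in out_a[0]])
print([round(v, 4) for v in out_b[2]])
```

Because the output for each token depends only on the set of tokens present, the model cannot tell the two sentences apart until position information is added to the embeddings.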

### 4.2 Sinusoidal Positional Encoding

The original Transformer paper ("Attention Is All You Need") uses sinusoidal functions:

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

Where:
- `pos`: Position in the sequence (0, 1, 2, ...)
- `i`: Dimension-pair index (0, 1, 2, ..., d_model/2 - 1)

**Why sinusoidal?**
- Different frequencies for different dimensions
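- For a fixed offset k, PE(pos + k) is a linear function of PE(pos), which may make relative positions easy to learn
- May extrapolate to sequences longer than those seen in training (a hypothesis in the original paper, not a guarantee)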

**Alternative**: Learned positional embeddings (used in BERT, GPT)
- Treat positions as vocabulary items
- Learn embedding for each position
- More flexible but limited to max sequence length
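As a small worked example of the sinusoidal formula, the sketch below computes the encoding for the first few positions with `d_model = 4`, using only the standard library. The function name `positional_encoding` is chosen here for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding vector for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dimensions 2i and 2i+1
    return pe

for pos in range(3):
    print(pos, [round(v, 4) for v in positional_encoding(pos, d_model=4)])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. The first dimension pair changes quickly from one position to the next, while the second pair changes slowly because its wavelength is 100 times longer.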

In [None]:
# Implement positional encoding

class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding as described in 'Attention Is All You Need'.
    """
    
    def __init__(self, d_model, max_len=5000):
        """
        Args:
            d_model: Dimension of the model
            max_len: Maximum sequence length
        """
        super(PositionalEncoding, self).__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Compute the division term for each dimension
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            (-math.log(10000.0) / d_model))
        
        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter, but should be saved)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Add positional encoding to input.
        
        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
        """
        seq_len = x.shape[1]
        return x + self.pe[:, :seq_len, :]

# Visualize positional encodings
print("\nPositional Encoding Visualization:\n")
print("="*60)

d_model = 128
max_len = 100

pos_encoder = PositionalEncoding(d_model, max_len)
positional_encoding = pos_encoder.pe[0].numpy()

print(f"Positional encoding shape: {positional_encoding.shape}")
print(f"  (max_len={max_len}, d_model={d_model})")
print("="*60)

# Plot positional encodings
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Heatmap of positional encodings
im1 = ax1.imshow(positional_encoding.T, cmap='RdBu', aspect='auto')
ax1.set_xlabel('Position in Sequence', fontsize=11)
ax1.set_ylabel('Embedding Dimension', fontsize=11)
ax1.set_title('Positional Encoding Matrix\n(Sinusoidal Patterns)', 
             fontsize=12, fontweight='bold')
plt.colorbar(im1, ax=ax1)

# Plot encodings for specific positions
positions_to_plot = [0, 10, 30, 60]
for pos in positions_to_plot:
    ax2.plot(positional_encoding[pos, :50], label=f'Position {pos}', linewidth=2)

ax2.set_xlabel('Embedding Dimension', fontsize=11)
ax2.set_ylabel('Encoding Value', fontsize=11)
ax2.set_title('Positional Encodings for Different Positions\n(First 50 dimensions)', 
             fontsize=12, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('positional_encoding.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Insights:")
print("• Different dimensions oscillate at different frequencies")
print("• Each position has a unique encoding pattern")
print("• The model can learn to use these patterns to understand position and distance")

---

## 5. The Complete Transformer Architecture

### 5.1 Transformer Building Blocks

The complete Transformer consists of:

**Encoder:**
1. Input Embedding + Positional Encoding
2. N × Encoder Layers, each containing:
   - Multi-Head Self-Attention
   - Add & Normalize (residual connection + layer normalization)
   - Feed-Forward Network (2 linear layers with ReLU)
   - Add & Normalize

**Decoder:**
1. Output Embedding + Positional Encoding
2. N × Decoder Layers, each containing:
   - Masked Multi-Head Self-Attention (can't see future)
   - Add & Normalize
   - Multi-Head Cross-Attention (attends to encoder output)
   - Add & Normalize
   - Feed-Forward Network
   - Add & Normalize
3. Linear + Softmax (output probabilities)
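The "can't see future" constraint in the decoder's masked self-attention can be sketched as follows. This is a simplified standard-library version with made-up scores, not the batched tensor implementation a real model would use.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)  # always finite: position i can see itself
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scores: all zeros, so every visible position gets equal weight
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

The first position can only attend to itself, and each later position spreads its weight over itself and everything before it. In the `scaled_dot_product_attention` function from Section 2, a lower-triangular matrix of ones passed as `mask` would play the same role, since zeros mark blocked positions there.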

**Standard configuration (original paper):**
- N = 6 encoder and decoder layers
- d_model = 512
- num_heads = 8
- d_ff = 2048 (feed-forward hidden dimension)
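To get a feel for the size of this configuration, the following sketch counts the parameters of one encoder layer. It assumes every linear layer has a bias and each LayerNorm has a learned scale and shift; token embeddings and the decoder are not counted. The helper name `encoder_layer_params` is hypothetical.

```python
def encoder_layer_params(d_model=512, d_ff=2048):
    """Parameter count for one encoder layer: attention + feed-forward + two LayerNorms."""
    attention = 4 * (d_model * d_model + d_model)  # W_q, W_k, W_v, W_o with biases
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for each of the two norms
    return attention + feed_forward + layer_norms

per_layer = encoder_layer_params()
print(f"Per encoder layer: {per_layer:,}")
print(f"6-layer encoder stack: {6 * per_layer:,}")
```

Under these assumptions the feed-forward network holds roughly twice as many parameters as the attention block, and `num_heads` does not appear in the count at all.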

### 5.2 Key Innovations

**Residual Connections:**
```
output = LayerNorm(x + Sublayer(x))
```
The skip path lets gradients flow directly to earlier layers, which makes deep stacks trainable.

**Layer Normalization:**
Normalizes across the feature dimension of each token independently, which stabilizes training.

**Feed-Forward Networks:**
```
FFN(x) = ReLU(x·W_1 + b_1)·W_2 + b_2
```
Applied independently to each position, adds non-linearity and capacity.
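The "Add & Normalize" step can be traced by hand on a single token vector. The sketch below is a simplified version without the learned scale and shift that `nn.LayerNorm` includes, and the sublayer output is a made-up value.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """LayerNorm(x + Sublayer(x)) for a single token vector."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # hypothetical sublayer output
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])
```

The residual sum is `[1.5, 1.5, 3.5, 3.5]`, which has mean 2.5 and variance 1, so the normalized result is close to `[-1, -1, 1, 1]`. The statistics are computed per token, which is why the result does not depend on other positions or on other examples in the batch.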

Let's build a simplified Transformer encoder:

In [None]:
# Implement Transformer Encoder Layer

class FeedForward(nn.Module):
    """Position-wise feed-forward network."""
    
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class TransformerEncoderLayer(nn.Module):
    """Single Transformer encoder layer."""
    
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super(TransformerEncoderLayer, self).__init__()
        
        # Multi-head self-attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        
        # Feed-forward network
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Forward pass through encoder layer.
        
        Implements: x → Self-Attention → Add&Norm → FFN → Add&Norm
        """
        # Self-attention with residual connection and layer norm
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x

class TransformerEncoder(nn.Module):
    """Full Transformer encoder (stack of N encoder layers)."""
    
    def __init__(self, num_layers=6, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super(TransformerEncoder, self).__init__()
        
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        self.norm = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        """Pass input through all encoder layers."""
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

# Demonstrate Transformer Encoder
print("\nTransformer Encoder Architecture:\n")
print("="*70)

# Standard configuration
encoder = TransformerEncoder(
    num_layers=6,
    d_model=512,
    num_heads=8,
    d_ff=2048,
    dropout=0.1
)

# Sample input
batch_size = 2
seq_len = 20
d_model = 512

sample_input = torch.randn(batch_size, seq_len, d_model)
encoder_output = encoder(sample_input)

print(f"Configuration:")
print(f"  Number of layers: 6")
print(f"  Model dimension: 512")
print(f"  Number of heads: 8")
print(f"  Feed-forward dimension: 2048")
print(f"\nInput shape: {sample_input.shape}")
print(f"Output shape: {encoder_output.shape}")

# Count parameters
total_params = sum(p.numel() for p in encoder.parameters())
print(f"\nTotal parameters: {total_params:,}")
print("="*70)

print("\nThis is a simplified version of the encoder stack introduced in 2017.")
print("It omits token embeddings, positional encoding, and the decoder; BERT and GPT build on the same blocks.")

---

## 6. BERT vs GPT: Two Paradigms

### 6.1 BERT: Bidirectional Encoder Representations

**Architecture**: Encoder-only Transformer

**Key features:**
- **Bidirectional**: Can see both left and right context simultaneously
- **Masked Language Modeling (MLM)**: Training objective masks random tokens, predicts them
- **Next Sentence Prediction (NSP)**: Learns relationships between sentence pairs

**Use cases:**
- Text classification (sentiment analysis, spam detection)
- Named entity recognition
- Question answering
- Any task requiring full sequence understanding

**Example BERT models (2025):**
- BERT-base: 110M parameters, 12 layers
- BERT-large: 340M parameters, 24 layers
- RoBERTa: Optimized BERT variant
- DeBERTa: BERT variant with disentangled attention; a strong encoder baseline

### 6.2 GPT: Generative Pre-trained Transformer

**Architecture**: Decoder-only Transformer

**Key features:**
- **Unidirectional (causal)**: Can only see left context (past tokens)
- **Autoregressive generation**: Predicts next token given previous tokens
- **Zero-shot and few-shot learning**: Can perform tasks without fine-tuning
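The difference between GPT's next-token objective and BERT's masked language modeling comes down to how inputs and targets are built from the same text. The sketch below uses a toy word-level tokenization, and the masked positions are chosen by hand for illustration (BERT samples about 15% of tokens at random).

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Causal LM (GPT-style): predict token t+1 from tokens 0..t
causal_inputs = tokens[:-1]
causal_targets = tokens[1:]

# Masked LM (BERT-style): hide chosen positions and predict only those
masked_positions = {1, 4}  # hypothetical choice for this example
mlm_inputs = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
mlm_targets = [t if i in masked_positions else None for i, t in enumerate(tokens)]

print("Causal inputs: ", causal_inputs)
print("Causal targets:", causal_targets)
print("MLM inputs:    ", mlm_inputs)
print("MLM targets:   ", mlm_targets)
```

In the causal setup every position yields a training signal but only sees the left context. In the masked setup only the hidden positions are scored, but the model sees context on both sides of each one.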

**Use cases:**
- Text generation (stories, code, dialogue)
- Language modeling
- Completion tasks
- Instruction following (ChatGPT, Claude)

**Example autoregressive models (2025):**
- GPT-2: 1.5B parameters
- GPT-3: 175B parameters
- GPT-4: Multimodal, exact size undisclosed
- Claude (Anthropic): Assistant model trained with the Constitutional AI approach

### 6.3 Comparison Table

| Aspect | BERT | GPT |
|--------|------|-----|
| Architecture | Encoder-only | Decoder-only |
| Attention | Bidirectional | Causal (masked) |
| Training | Masked LM | Next token prediction |
| Strengths | Understanding | Generation |
| Fine-tuning | Required for most tasks | Often works zero-shot |
| Inference speed | Faster (one parallel pass over the input) | Slower for generation (one pass per new token) |

Let's use both models with Hugging Face:

In [None]:
# Using pre-trained BERT for text classification

print("\nDemonstrating BERT for Sentiment Analysis:\n")
print("="*70)

# Load pre-trained BERT model for sentiment analysis
# Using a distilled version for faster inference
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1
)

# Test sentences
test_sentences = [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "What a terrible waste of time. I hated this film.",
    "The movie was okay, nothing special but not bad either.",
    "Best film I've seen this year! Highly recommend it!",
    "Disappointed. The plot made no sense and acting was poor."
]

print("BERT Sentiment Analysis Results:\n")
for sentence in test_sentences:
    result = sentiment_pipeline(sentence)[0]
    sentiment = result['label']
    confidence = result['score']
    print(f"Text: {sentence}")
    print(f"  → {sentiment} (confidence: {confidence:.3f})\n")

print("="*70)
print("\nKey: BERT excels at classification tasks due to bidirectional context!")

In [None]:
# Using GPT-2 for text generation

print("\nDemonstrating GPT-2 for Text Generation:\n")
print("="*70)

# Load GPT-2 text generation pipeline
generator = pipeline(
    "text-generation",
    model="gpt2",
    device=0 if torch.cuda.is_available() else -1
)

# Prompts for generation
prompts = [
    "Artificial intelligence will",
    "The future of machine learning is",
    "Deep learning models are"
]

print("GPT-2 Text Generation Results:\n")
for prompt in prompts:
    generated = generator(
        prompt,
        max_length=50,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True
    )[0]['generated_text']
    
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated}\n")
    print("-" * 70)

print("\nKey: GPT excels at generation tasks due to autoregressive training!")
print("="*70)

---

## 7. Fine-Tuning Transformers: Practical Implementation

### 7.1 Transfer Learning with Transformers

**The standard workflow (2025):**
1. Start with pre-trained model (BERT, RoBERTa, etc.)
2. Add task-specific head (classification, NER, etc.)
3. Fine-tune on your dataset
4. Evaluate and iterate

**Why this works:**
- Pre-trained models learn general language understanding
- Fine-tuning adapts to specific task/domain
- Requires much less data than training from scratch
- BERT was pre-trained on about 3.3B words; fine-tuning often works with only thousands of labeled examples

Let's fine-tune BERT on a text classification task:

In [None]:
# Fine-tune BERT for text classification

print("\nFine-Tuning BERT for Custom Classification:\n")
print("="*70)

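# Create synthetic dataset (in practice, use real data)
# Note: each sentence is repeated 20 times, so the random split below puts
# copies of the same sentences in both train and test. The reported accuracy
# therefore demonstrates the workflow only, not generalization.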
texts_positive = [
    "This product exceeded my expectations!",
    "Absolutely love it, highly recommend.",
    "Best purchase I've made this year.",
    "Outstanding quality and fast delivery.",
    "Very satisfied with this item."
] * 20  # Repeat for more samples

texts_negative = [
    "Terrible product, complete waste of money.",
    "Do not buy this, very disappointed.",
    "Poor quality and broke immediately.",
    "Worst purchase ever, asking for refund.",
    "Not as described, very unhappy."
] * 20

# Combine and create labels
all_texts = texts_positive + texts_negative
all_labels = [1] * len(texts_positive) + [0] * len(texts_negative)

# Shuffle
indices = np.random.permutation(len(all_texts))
all_texts = [all_texts[i] for i in indices]
all_labels = [all_labels[i] for i in indices]

# Split data
train_texts, test_texts, train_labels, test_labels = train_test_split(
    all_texts, all_labels, test_size=0.2, random_state=42
)

print(f"Dataset:")
print(f"  Training samples: {len(train_texts)}")
print(f"  Test samples: {len(test_texts)}")
print(f"  Positive: {sum(train_labels)}, Negative: {len(train_labels) - sum(train_labels)}")
print("="*70)

In [None]:
# Prepare data for training

# Load tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128)

# Create PyTorch dataset
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels)

train_dataset = CustomDataset(train_encodings, train_labels)
test_dataset = CustomDataset(test_encodings, test_labels)

print("\nData prepared for training!")
print(f"Sample tokenized input:")
print(f"  Tokens: {tokenizer.convert_ids_to_tokens(train_encodings['input_ids'][0][:20])}")
print(f"  Input IDs length: {len(train_encodings['input_ids'][0])}")

In [None]:
# Load model and define training

# Load pre-trained model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

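import inspect

# The evaluation-schedule argument was renamed from `evaluation_strategy` to
# `eval_strategy` in newer transformers releases; use whichever this version accepts.
_ta_params = inspect.signature(TrainingArguments.__init__).parameters
eval_key = "eval_strategy" if "eval_strategy" in _ta_params else "evaluation_strategy"

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=5,  # on one device: 160 samples / batch 16 = 10 steps per epoch, 30 total
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    **{eval_key: "epoch"},
)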

# Define metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, predictions)
    return {'accuracy': accuracy}

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

print("\n" + "="*70)
print("FINE-TUNING BERT MODEL")
print("="*70)
print(f"Model: {model_name}")
print(f"Training samples: {len(train_dataset)}")
print(f"Epochs: 3")
print(f"Batch size: 16")
print("="*70 + "\n")

# Train
train_result = trainer.train()

print("\n✓ Fine-tuning completed!")

In [None]:
# Evaluate fine-tuned model

# Evaluate on test set
eval_results = trainer.evaluate()

print("\n" + "="*60)
print("FINE-TUNED MODEL EVALUATION")
print("="*60)
print(f"Test Accuracy: {eval_results['eval_accuracy']*100:.2f}%")
print(f"Test Loss: {eval_results['eval_loss']:.4f}")
print("="*60)

# Test on custom examples
test_examples = [
    "This is amazing quality for the price!",
    "Complete garbage, don't waste your money.",
    "Decent product, works as expected."
]

print("\nPredictions on custom examples:\n")
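model.eval()  # disable dropout for inference
for text in test_examples:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    # The Trainer may have moved the model to the GPU, so move the inputs to the same device
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    confidence = torch.softmax(outputs.logits, dim=1)[0][prediction].item()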
    
    sentiment = "Positive" if prediction == 1 else "Negative"
    print(f"Text: {text}")
    print(f"  → {sentiment} (confidence: {confidence:.3f})\n")

print("This covers the core fine-tuning workflow: tokenize, train, evaluate, predict.")

---

## 8. Vision Transformers (ViT): Beyond NLP

### 8.1 Transformers for Computer Vision

**Breakthrough (2020)**: "An Image is Worth 16x16 Words" paper showed Transformers can match or exceed CNNs on image tasks!

**Vision Transformer (ViT) architecture:**
1. Split image into fixed-size patches (e.g., 16×16 pixels)
2. Flatten each patch into a vector
3. Linear projection to embedding dimension
4. Add positional embeddings
5. Feed through standard Transformer encoder
6. Use [CLS] token for classification

**Example**: 224×224 image → 14×14 grid of 16×16 patches → 196 tokens

**Key insight**: Treat image patches exactly like words in a sentence!
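
The patch pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual ViT implementation: the image is random, the projection and positional embeddings are untrained random matrices, and the names `patchify`, `W_proj`, `cls_token`, and `pos_embed` are made up for this example.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))           # stand-in for a real image

patches = patchify(image)                    # steps 1-2: split and flatten
embed_dim = 768                              # assumed embedding size
W_proj = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ W_proj                    # step 3: linear projection

cls_token = np.zeros((1, embed_dim))         # learnable in a real model
tokens = np.concatenate([cls_token, tokens], axis=0)

pos_embed = rng.normal(scale=0.02, size=tokens.shape)  # learnable in a real model
encoder_input = tokens + pos_embed           # step 4: add positions

print("patches:", patches.shape)
print("encoder input:", encoder_input.shape)
```

From here on, `encoder_input` is handled like any other token sequence by the Transformer encoder.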

### 8.2 ViT vs CNNs (2025)

**Vision Transformers advantages:**
- Better scaling with data (excel on huge datasets)
- Global receptive field from layer 1
- Attention maps can be visualized to inspect what the model attends to
- Can handle other image sizes by interpolating the positional embeddings

**CNNs still better when:**
- Small datasets (< 1M images)
- Resource constraints (mobile/edge)
- Fine-grained localization needed

**Hybrid approaches and ViT variants (2025):**
- Swin Transformer: Hierarchical ViT with shifted windows
- CoAtNet: Combines convolution and attention
- BEiT: Self-supervised ViT pre-training

Let's use a pre-trained Vision Transformer:

In [None]:
# Using Vision Transformer for image classification

print("\nDemonstrating Vision Transformer (ViT):\n")
print("="*70)

# Load pre-trained ViT model
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

# Load ViT model (pre-trained on ImageNet-21k, fine-tuned on ImageNet-1k)
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vit_model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

print("Vision Transformer Model Loaded:")
print(f"  Model: google/vit-base-patch16-224")
print(f"  Patch size: 16×16")
print(f"  Image size: 224×224")
print(f"  Number of patches: (224/16)² = 196")
print(f"  Parameters: {sum(p.numel() for p in vit_model.parameters()):,}")
print("="*70)

# Download a sample image
url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
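try:
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    # Convert to RGB: a PNG may carry an alpha channel, but ViT expects 3 channels
    image = Image.open(response.raw).convert("RGB")
    
    # Process image (resize + normalize)
    inputs = processor(images=image, return_tensors="pt")
    
    # Make prediction (no gradients needed for inference)
    with torch.no_grad():
        outputs = vit_model(**inputs)
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    predicted_class = vit_model.config.id2label[predicted_class_idx]
    confidence = torch.softmax(logits, dim=1)[0][predicted_class_idx].item()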
    
    print("\nImage Classification Result:")
    print(f"  Predicted class: {predicted_class}")
    print(f"  Confidence: {confidence:.3f}")
    
    # Display image
    plt.figure(figsize=(8, 6))
    plt.imshow(image)
    plt.title(f'Vision Transformer Prediction:\n{predicted_class} ({confidence:.1%} confidence)',
             fontsize=13, fontweight='bold')
    plt.axis('off')
    plt.tight_layout()
    plt.savefig('vit_prediction.png', dpi=150, bbox_inches='tight')
    plt.show()
    
except Exception as e:
    print(f"Could not download or classify the image: {e}")
    print("ViT model loaded successfully and ready for image classification!")

print("\nKey Insight: Vision Transformers treat images as sequences of patches,")
print("applying the same attention mechanism used for language!")

---

## 9. Production Best Practices & 2025 Landscape

### 9.1 Choosing the Right Model

**For text classification / understanding:**
- Small dataset (< 10K): DistilBERT (fast, efficient)
- Medium dataset: RoBERTa-base
- Large dataset / SOTA: DeBERTa-v3-large

**For text generation:**
- GPT-2: Lightweight, runs locally
- GPT-3.5/4: API-based, production-ready
- Open-source alternatives: LLaMA 2, Mistral, Falcon

**For computer vision:**
- ViT-base: Standard choice, good balance
- Swin Transformer: Better for detection/segmentation
- DINOv2: Excellent self-supervised features

**For multi-modal:**
- CLIP: Image-text alignment
- Flamingo/BLIP: Visual question answering
- GPT-4V: State-of-the-art multimodal (2025)

### 9.2 Optimization Techniques

**Model Compression:**
- **Distillation**: Train smaller model to mimic larger one (DistilBERT is 40% smaller, 60% faster)
- **Quantization**: Convert FP32 → INT8 (weights take 4× less memory, usually with only a small accuracy loss)
- **Pruning**: Remove unimportant weights
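
To make the 4× figure for quantization concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. It is only an illustration under simple assumptions (one scale for the whole tensor, random weights that are not all zero); the helper names `quantize_int8` and `dequantize` are invented for this example, and real toolchains use more refined schemes.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 using a single scale for the whole tensor."""
    scale = np.abs(weights).max() / 127.0   # assumes weights are not all zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)  # stand-in weights

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"Size ratio (FP32 / INT8): {w.nbytes / q.nbytes:.1f}x")
print(f"Max absolute rounding error: {np.abs(w - w_hat).max():.6f}")
```

The rounding error per weight is bounded by half the scale, which is why accuracy often degrades only slightly.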

**Efficient Training:**
- **Mixed precision**: Use FP16 for speed, FP32 for stability
- **Gradient accumulation**: Simulate large batch sizes
- **Parameter-efficient fine-tuning**: LoRA, adapters (train only a small fraction of the parameters, often well under 1%)
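
The idea behind LoRA can be sketched for a single weight matrix: the pre-trained matrix stays frozen, and only two small low-rank matrices are trained. The sizes, the `alpha / rank` scaling, and the name `lora_forward` below are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 768, 768, 8, 16   # hypothetical sizes

W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init

def lora_forward(x):
    """Frozen projection plus a scaled low-rank update (B @ A)."""
    base = x @ W.T
    update = (x @ A.T) @ B.T * (alpha / rank)
    return base + update

x = rng.normal(size=(4, d_in))

# With B initialized to zero, the adapted layer starts out identical to the original
print("Same output at init:", np.allclose(lora_forward(x), x @ W.T))

trainable = A.size + B.size
print(f"Trainable: {trainable:,} vs frozen: {W.size:,} ({trainable / W.size:.2%})")
```

For this one matrix the trainable share is about 2%. Across a whole model the share is smaller, because the low-rank update is typically attached to only some of the weight matrices while everything else stays frozen.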

**Inference Optimization:**
- **ONNX Runtime**: Cross-platform optimized inference
- **TensorRT**: NVIDIA GPU optimization
- **Batching**: Process multiple requests together

### 9.3 Common Pitfalls

**❌ Don't:**
- Use very long sequences unnecessarily (attention is O(n²))
- Fine-tune on tiny datasets (< 100 examples)
- Ignore maximum sequence length (truncation issues)
- Apply text normalization that conflicts with the model (e.g., lowercasing input for a cased model; uncased tokenizers already lowercase for you)
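
The O(n²) cost is easy to quantify with back-of-the-envelope arithmetic. The sketch below estimates the memory needed to store the attention score matrices of one layer, assuming 12 heads, FP32 values, a batch of one, and that the full matrix is materialized; the function name `attention_scores_bytes` is made up for this example.

```python
def attention_scores_bytes(seq_len, num_heads=12, bytes_per_value=4, batch_size=1):
    """Bytes needed to hold one layer's (seq_len x seq_len) score matrices."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

for n in (128, 512, 2048, 4096):
    mib = attention_scores_bytes(n) / 2**20
    print(f"seq_len={n:>5}: {mib:8.2f} MiB per layer")
```

Going from 512 to 4096 tokens is 8× more tokens but 64× more attention memory, which is why padding everything to the maximum length is wasteful.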

**✅ Do:**
- Start with pre-trained models
- Use appropriate tokenizers for each model
- Monitor for overfitting with validation sets
- Implement proper error handling for production

### 9.4 State-of-the-Art (2025)

**Language models:**
- GPT-4: Multimodal; parameter count not publicly disclosed
- Claude 3: Constitutional AI, long context (200K tokens)
- Gemini Ultra: Google's multimodal flagship
- LLaMA 3: Leading open-source model

**Vision:**
- SAM (Segment Anything): Universal segmentation
- DINOv2: Best self-supervised vision features
- Stable Diffusion 3: Text-to-image generation

**Trends:**
- Mixture of Experts (MoE): Activate subset of parameters
- Sparse attention: Reduce O(n²) complexity
- Multimodal fusion: Images, text, audio, video
- Reinforcement learning from human feedback (RLHF)

In [None]:
# Production-ready Transformer pipeline

class TransformerProductionPipeline:
    """
    Production-ready pipeline for Transformer models.
    
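    Best practices included:
    - Model caching (from_pretrained caches downloaded weights locally)
    - Batch processing
    - GPU acceleration when available
    - Inference in eval mode without gradient tracking
    - Proper tokenization (truncation + padding)
    
    Not implemented here (add for real deployments):
    error handling and performance monitoring.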
    """
    
    def __init__(self, model_name, task_type='classification', max_length=512):
        self.model_name = model_name
        self.task_type = task_type
        self.max_length = max_length
        
        # Load tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        if task_type == 'classification':
            self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        else:
            self.model = AutoModel.from_pretrained(model_name)
        
        # Move to GPU if available
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()  # Set to evaluation mode
    
    def preprocess(self, texts):
        """
        Preprocess texts with proper tokenization.
        
        Best practices:
        - Truncation to max length
        - Padding for batch processing
        - Return attention masks
        """
        if isinstance(texts, str):
            texts = [texts]
        
        encodings = self.tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {k: v.to(self.device) for k, v in encodings.items()}
    
    def predict(self, texts, batch_size=32):
        """
        Make predictions with batch processing.
        
        Args:
            texts: List of text strings
            batch_size: Batch size for processing
        
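        Returns:
            For classification, a list of (predicted_label_id, confidence) tuples;
            otherwise a list with one last_hidden_state tensor per batch.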
        """
        if isinstance(texts, str):
            texts = [texts]
        
        all_predictions = []
        
        # Process in batches
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            
            # Preprocess
            inputs = self.preprocess(batch_texts)
            
            # Inference (no gradient computation)
            with torch.no_grad():
                outputs = self.model(**inputs)
            
            # Get predictions
            if self.task_type == 'classification':
                predictions = torch.argmax(outputs.logits, dim=-1)
                confidences = torch.softmax(outputs.logits, dim=-1)
                all_predictions.extend([
                    (pred.item(), conf[pred].item()) 
                    for pred, conf in zip(predictions, confidences)
                ])
            else:
                all_predictions.append(outputs.last_hidden_state.cpu())
        
        return all_predictions
    
    def get_model_info(self):
        """Get model information for monitoring."""
        total_params = sum(p.numel() for p in self.model.parameters())
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        
        return {
            'model_name': self.model_name,
            'total_parameters': total_params,
            'trainable_parameters': trainable_params,
            'device': str(self.device),
            'max_length': self.max_length
        }

# Demonstrate production pipeline
print("\n" + "="*70)
print("PRODUCTION TRANSFORMER PIPELINE")
print("="*70)

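# Named prod_pipeline to avoid shadowing transformers.pipeline if it was imported earlier.
# Note: distilbert-base-uncased ships without a fine-tuned classification head, so the
# head is randomly initialized here; load a fine-tuned checkpoint for real predictions.
prod_pipeline = TransformerProductionPipeline(
    model_name='distilbert-base-uncased',
    task_type='classification',
    max_length=512
)

info = prod_pipeline.get_model_info()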
print(f"\nModel Information:")
for key, value in info.items():
    if 'parameter' in key:
        print(f"  {key}: {value:,}")
    else:
        print(f"  {key}: {value}")

print("\n✓ Production pipeline ready!")
print("\nFeatures:")
print("  • Batch processing for efficiency")
print("  • GPU acceleration (if available)")
print("  • Proper tokenization and truncation")
print("  • Evaluation mode for inference")
print("  • Model caching")
print("="*70)

---

## 10. Summary & Key Takeaways

### What We Learned:

**Attention Mechanism:**
- Allows models to focus on relevant parts of input dynamically
- Scaled dot-product attention: `Attention(Q,K,V) = softmax(QK^T/√d_k)V`
- Connects any two positions directly, regardless of their distance in the sequence
- Removes sequential dependencies, so all positions can be processed in parallel
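
As a compact recap of the formula, here is a minimal NumPy sketch of scaled dot-product attention for a single sequence and a single head. The optional boolean `mask` argument (True = may attend) and the random inputs are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# Bidirectional: every position attends to every position
out, weights = scaled_dot_product_attention(Q, K, V)

# Causal: position i attends only to positions <= i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
out_causal, weights_causal = scaled_dot_product_attention(Q, K, V, mask=causal_mask)

print("Each row of weights sums to 1:", np.allclose(weights.sum(axis=-1), 1.0))
print("Causal weights above the diagonal are zero:",
      np.allclose(np.triu(weights_causal, k=1), 0.0))
```

All rows are computed in one matrix product, which is what makes the operation parallel across positions.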

**Multi-Head Attention:**
- Multiple attention heads learn different aspects of relationships
- Each head operates in its own subspace
- Concatenate and project to combine information
- Typical: 8 heads in the original Transformer, 12 in BERT-base; larger models use more
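
The "own subspace" and "concatenate" steps are just reshapes. The sketch below shows only that bookkeeping (the attention computation inside each head and the final output projection are omitted), with hypothetical helper names `split_heads` and `merge_heads` and BERT-base-like sizes assumed.

```python
import numpy as np

def split_heads(x, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, head_dim)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    return x.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

def merge_heads(x):
    """(num_heads, seq_len, head_dim) -> (seq_len, num_heads * head_dim)."""
    num_heads, seq_len, head_dim = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, num_heads * head_dim)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 768))        # 10 tokens, d_model = 768 (assumed)

heads = split_heads(x, num_heads=12)  # each head sees a 64-dimensional slice
restored = merge_heads(heads)

print("Per-head shape:", heads.shape)
print("Round trip is lossless:", np.array_equal(restored, x))
```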

**Positional Encoding:**
- Critical for injecting sequence order information
- Sinusoidal encoding: works for arbitrary lengths
- Learned embeddings: more flexible, limited to max length
- Added to input embeddings before first layer
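
For reference, the sinusoidal scheme from "Attention Is All You Need" uses sine on even dimensions and cosine on odd dimensions, with wavelengths that grow geometrically. A minimal NumPy sketch, assuming an even `d_model` (the function name `sinusoidal_positions` is illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=50, d_model=16)
print("Shape:", pe.shape)

# The same function works for any length, including ones never seen in training
pe_long = sinusoidal_positions(seq_len=5000, d_model=16)
print("First 50 rows are unchanged:", np.allclose(pe_long[:50], pe))
```

Because the encoding is a fixed function of the position, nothing has to be learned or stored per position, in contrast to learned embeddings with a fixed maximum length.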

**Transformer Architecture:**
- Encoder: Bidirectional self-attention for understanding
- Decoder: Causal self-attention for generation
- Residual connections + layer normalization = stable deep networks
- Feed-forward networks add capacity and non-linearity

**BERT vs GPT:**
- BERT: Encoder-only, bidirectional, best for understanding
- GPT: Decoder-only, causal, best for generation
- Both use transfer learning: pre-train then fine-tune
- Choose based on task: classification → BERT, generation → GPT

**Vision Transformers:**
- Treat image patches as tokens
- Can match or exceed CNNs with sufficient data
- Hybrid approaches combine convolution + attention
- State-of-the-art for many vision tasks in 2025

**Production Best Practices:**
- Start with pre-trained models from Hugging Face
- Use appropriate model for task and constraints
- Optimize with distillation, quantization, efficient fine-tuning
- Implement batch processing and GPU acceleration
- Monitor sequence lengths and truncation
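
One lightweight way to monitor truncation is to compare token counts against `max_length` before training or serving. The helper below is a hypothetical sketch: `truncation_report` is not a library function, and the token lengths are invented (in practice they would come from the tokenizer).

```python
def truncation_report(token_lengths, max_length):
    """Summarize how many sequences (and tokens) would be cut at max_length."""
    truncated = [n for n in token_lengths if n > max_length]
    tokens_lost = sum(n - max_length for n in truncated)
    total_tokens = sum(token_lengths)
    return {
        "num_sequences": len(token_lengths),
        "num_truncated": len(truncated),
        "truncated_fraction": len(truncated) / len(token_lengths) if token_lengths else 0.0,
        "tokens_lost_fraction": tokens_lost / total_tokens if total_tokens else 0.0,
    }

lengths = [12, 87, 130, 512, 700]   # hypothetical per-example token counts
report = truncation_report(lengths, max_length=128)

for key, value in report.items():
    print(f"{key}: {value}")
```

If a large share of tokens is lost, consider a larger `max_length`, chunking long inputs, or a model with a longer context window.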

### Why Transformers Dominate (2025):

**✅ Advantages:**
- Parallel processing (much faster than RNNs)
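- Direct access to every position in the context window (whose length is bounded in practice by the O(n²) cost and the maximum length seen in training)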
- Transfer learning works exceptionally well
- State-of-the-art across NLP, vision, multi-modal
- Scales beautifully with data and compute

**⚠️ Limitations:**
- O(n²) attention complexity (expensive for long sequences)
- Requires substantial pre-training compute
- Large memory footprint
- Can be overkill for simple tasks

### Next Steps:

- **Practice**: Fine-tune models on your own datasets
- **Explore**: Try different architectures (T5, BART, Mistral, LLaMA)
- **Optimize**: Experiment with distillation, quantization, LoRA
- **Advanced topics**: RLHF, mixture of experts, sparse attention, chain-of-thought

---

## 11. Exercises & Further Exploration

### Exercise 1: Fine-tune BERT
Download a real dataset (e.g., IMDB reviews, AG News) and fine-tune BERT:
- Try different pre-trained models (BERT, RoBERTa, DeBERTa)
- Experiment with hyperparameters (learning rate, batch size, epochs)
- Compare performance vs training time

### Exercise 2: Visualize Attention
Use BertViz or similar tools to visualize attention patterns:
- Examine which words attend to each other
- Compare patterns across different layers
- Identify syntactic vs semantic attention heads

### Exercise 3: Build a Custom Transformer
Implement a mini-Transformer from scratch:
- 2 encoder layers, 2 decoder layers
- Train on a simple seq2seq task
- Understand every component deeply

### Exercise 4: Compare Architectures
Benchmark different models on the same task:
- BERT, RoBERTa, DistilBERT, ELECTRA
- Measure accuracy, speed, memory usage
- Find the best accuracy-efficiency trade-off
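
A possible starting point for the speed measurements is a small timing harness like the one below. It is a sketch under assumptions: `benchmark` and `predict_fn` are hypothetical names, and the dummy function stands in for a real model call such as a wrapper around your model's prediction method.

```python
import statistics
import time

def benchmark(predict_fn, inputs, warmup=1, repeats=5):
    """Time predict_fn(inputs) and report median and best wall-clock latency."""
    for _ in range(warmup):          # warm-up runs are not timed
        predict_fn(inputs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(inputs)
        timings.append(time.perf_counter() - start)
    median_s = statistics.median(timings)
    return {
        "median_s": median_s,
        "min_s": min(timings),
        "per_item_ms": 1000 * median_s / max(len(inputs), 1),
    }

def dummy_predict(texts):
    """Placeholder for a real model call."""
    return [len(t.split()) for t in texts]

texts = ["a short review", "another slightly longer review text"] * 100
result = benchmark(dummy_predict, texts, warmup=2, repeats=5)
print({k: round(v, 6) for k, v in result.items()})
```

Use the same inputs, batch size, and device for every model so the numbers are comparable, and record accuracy alongside the timings.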

### Further Reading:

**Essential Papers:**
- "Attention Is All You Need" (Vaswani et al., 2017) - **THE foundational paper**
- "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018)
- "Language Models are Few-Shot Learners" (GPT-3, Brown et al., 2020)
- "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT, Dosovitskiy et al., 2020)

**Resources:**
- Hugging Face Transformers documentation and course
- "The Illustrated Transformer" by Jay Alammar
- Stanford CS224N: Natural Language Processing with Deep Learning
- "Transformers from Scratch" tutorials

**Communities:**
- Hugging Face forums
- r/MachineLearning subreddit
- Papers With Code (track SOTA)

---

**CONGRATULATIONS!** 🎉

You now understand the architecture that powers ChatGPT, Claude, BERT, GPT-4, and most modern NLP systems in 2025. Transformers are among the most important machine learning innovations of the past decade.

**You have achieved legendary status in modern machine learning!** 🚀

With a solid understanding of classical ML (linear regression through ensemble methods), modern deep learning (CNNs, RNNs, Transformers), interpretability, and ethics, you are well equipped to start building production AI systems.

**Keep learning, keep building, and keep pushing the boundaries of what's possible with AI!**
