# 🤖 Transformer Architecture: From Theory to Practice
## Interactive Learning Notebook Based on Lecture 13

This notebook provides hands-on implementations of core Transformer architecture concepts, including:
- Self-Attention Mechanism
- Multi-Head Attention
- Positional Encoding
- Complete Transformer Implementation
- Practical Applications

**Author**: Ho-min Park  
**Contact**: homin.park@ghent.ac.kr

---

## Part 1: Setup and Environment Configuration
### 1.1 Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# For text processing
from collections import Counter
import re
import math
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Setup complete! ✅")

## Part 2: Core Concepts - Understanding Attention Mechanism

### Exercise 1: Implementing Scaled Dot-Product Attention
#### 📚 Concept
Self-attention allows the model to look at every other position in the input sequence when encoding the current position. The key formula is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- **Q (Query)**: What information am I looking for?
- **K (Key)**: What information do I contain?
- **V (Value)**: The actual information I store
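
To make the formula concrete, here is a minimal sketch that evaluates it by hand for one head and one short sequence. It uses plain Python lists rather than PyTorch so every arithmetic step is visible. The names `softmax`, `toy_attention`, and the example matrices are illustrative assumptions, not part of the lecture code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(Q, K, V):
    """Scaled dot-product attention for a single head (illustrative helper).

    Q: list of query vectors, K: list of key vectors, V: list of value vectors.
    Returns (output, weights), with one row per query.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is a weighted average of the value vectors
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(d_v)]
        for row in weights
    ]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]              # 2 queries
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 keys
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
toy_output, toy_weights = toy_attention(Q, K, V)
for row in toy_weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Each row of `toy_weights` is a probability distribution over the keys, and the first query puts equal weight on the two keys it matches equally well.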

In [None]:
class ScaledDotProductAttention(nn.Module):
    """
    Scaled Dot-Product Attention implementation
    As described in 'Attention Is All You Need' (Vaswani et al., 2017)
    """
    def __init__(self, temperature=1.0, dropout=0.1):
        super().__init__()
        self.temperature = temperature
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, q, k, v, mask=None):
        # q, k, v: [batch_size, n_heads, seq_len, d_k]
        batch_size, n_heads, len_q, d_k = q.size()
        
        # Calculate attention scores, scaled by sqrt(d_k) and the optional temperature
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.temperature * d_k ** 0.5)
        
        # Apply mask if provided (for padding or causal masking)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Apply softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Apply attention weights to values
        output = torch.matmul(attention_weights, v)
        
        return output, attention_weights

# Test the implementation
def test_scaled_attention():
    batch_size, n_heads, seq_len, d_k = 2, 8, 10, 64
    
    # Create random Q, K, V matrices
    q = torch.randn(batch_size, n_heads, seq_len, d_k)
    k = torch.randn(batch_size, n_heads, seq_len, d_k)
    v = torch.randn(batch_size, n_heads, seq_len, d_k)
    
    # Create attention module; eval() disables dropout so each row of weights sums to 1
    attention = ScaledDotProductAttention()
    attention.eval()
    
    # Forward pass
    output, weights = attention(q, k, v)
    
    print(f"Input shapes: Q={q.shape}, K={k.shape}, V={v.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    print(f"Attention weights sum per position: {weights[0, 0, 0].sum().item():.4f}")
    
    return output, weights

output, attention_weights = test_scaled_attention()

### Visualization: Attention Weight Patterns

In [None]:
def visualize_attention_weights(attention_weights, title="Attention Weight Patterns"):
    """
    Visualize attention weight patterns using both matplotlib and plotly
    """
    # Take first batch, first head for visualization (move to CPU in case the tensor is on a GPU)
    weights = attention_weights[0, 0].detach().cpu().numpy()
    
    # Create subplots
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=("Heatmap View", "3D Surface View"),
        specs=[[{"type": "heatmap"}, {"type": "surface"}]]
    )
    
    # Heatmap
    fig.add_trace(
        go.Heatmap(
            z=weights,
            colorscale='Viridis',
            showscale=True,
            colorbar=dict(x=0.45)
        ),
        row=1, col=1
    )
    
    # 3D Surface
    fig.add_trace(
        go.Surface(
            z=weights,
            colorscale='Viridis',
            showscale=False
        ),
        row=1, col=2
    )
    
    fig.update_layout(
        title_text=title,
        height=400,
        showlegend=False
    )
    
    # Update axes labels
    fig.update_xaxes(title_text="Key Position", row=1, col=1)
    fig.update_yaxes(title_text="Query Position", row=1, col=1)
    
    fig.show()
    
    # Also create matplotlib version for static view
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    sns.heatmap(weights, cmap='YlOrRd', cbar_kws={'label': 'Attention Weight'})
    plt.title('Attention Pattern Heatmap')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    
    plt.subplot(1, 2, 2)
    # Show attention distribution for selected positions
    positions_to_show = [0, len(weights)//2, len(weights)-1]
    for pos in positions_to_show:
        plt.plot(weights[pos], label=f'Query pos {pos}', marker='o', markersize=4)
    plt.xlabel('Key Position')
    plt.ylabel('Attention Weight')
    plt.title('Attention Distribution for Selected Positions')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

visualize_attention_weights(attention_weights)

### 🎯 Your Turn: Exercise 1
Modify the attention mechanism to implement **masked self-attention** for causal language modeling. 
The mask should prevent positions from attending to subsequent positions (look-ahead mask).
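
Before writing the PyTorch version, it may help to see what a causal mask does to the attention weights. The sketch below is plain Python and deliberately does not use `torch.triu`, so it does not replace the exercise. The helpers `causal_mask` and `masked_softmax` are hypothetical names, and the convention `True = may attend` is assumed to match the `mask == 0` check in `ScaledDotProductAttention`.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when query i may attend to key j, i.e. j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over one row that gives zero weight to disallowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)  # a causal row always allows at least position 0
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

demo_mask = causal_mask(4)
uniform_scores = [[0.0] * 4 for _ in range(4)]  # identical scores everywhere
demo_weights = [masked_softmax(r, a) for r, a in zip(uniform_scores, demo_mask)]
for row in demo_weights:
    print([round(w, 2) for w in row])
```

With identical scores, query `i` spreads its weight evenly over positions `0..i` and assigns exactly zero to every later position, which is the lower-triangular pattern the exercise asks for.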

In [None]:
# TODO: Implement causal masking
def create_causal_mask(seq_len):
    """
    Create a causal mask to prevent attending to future positions
    Hint: Use torch.triu to create upper triangular matrix
    """
    # Your code here
    mask = None  # Replace with your implementation
    
    # Solution (uncomment to see):
    # mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    # mask = ~mask  # Invert: True where attention is allowed
    
    return mask

# Test your implementation
# seq_len = 5
# mask = create_causal_mask(seq_len)
# print("Causal mask (True = can attend):")
# print(mask)

---
### Exercise 2: Multi-Head Attention Implementation
#### 📚 Concept
Multi-head attention allows the model to jointly attend to information from different representation subspaces. Instead of performing a single attention function, we linearly project the queries, keys and values h times with different projections.

**Key insight**: Different heads can learn different types of relationships (for example syntactic or semantic ones). This specialization emerges during training and is not guaranteed.
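
The reshaping step is the part of multi-head attention that most often causes confusion, so here is a small sketch of the shape bookkeeping on nested lists. It assumes each head receives a contiguous slice of size `d_k = d_model / n_heads` from every token vector. The function names `split_heads` and `merge_heads` are illustrative only.

```python
def split_heads(x, n_heads):
    """Split each d_model-sized token vector into n_heads chunks of size d_k.

    x: [seq_len][d_model]  ->  [n_heads][seq_len][d_k]
    """
    d_model = len(x[0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_k = d_model // n_heads
    return [[tok[h * d_k:(h + 1) * d_k] for tok in x] for h in range(n_heads)]

def merge_heads(heads):
    """Inverse of split_heads: [n_heads][seq_len][d_k] -> [seq_len][d_model]."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# 3 tokens with d_model = 8, split across 4 heads (d_k = 2)
tokens = [[float(t * 8 + i) for i in range(8)] for t in range(3)]
heads = split_heads(tokens, n_heads=4)
print(len(heads), len(heads[0]), len(heads[0][0]))  # n_heads, seq_len, d_k
```

Splitting and then merging returns the original token vectors, so no information is lost by running attention per head.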

In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention module
    """
    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.attention = ScaledDotProductAttention(dropout=dropout)
        
    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        
        # Linear projections, then reshape to [batch_size, n_heads, seq_len, d_k].
        # Using -1 for the sequence dimension lets keys/values differ in length from queries.
        Q = self.W_q(q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        attn_output, attn_weights = self.attention(Q, K, V, mask)
        
        # Concatenate heads back to [batch_size, seq_len, d_model]
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        
        # Final linear projection. The residual connection, dropout and LayerNorm
        # ("Add & Norm") are applied by the enclosing encoder block, not here,
        # so that they are not applied twice.
        output = self.W_o(attn_output)
        
        return output, attn_weights

# Test Multi-Head Attention
def test_multihead_attention():
    batch_size, seq_len, d_model = 2, 10, 512
    n_heads = 8
    
    # Create random input
    x = torch.randn(batch_size, seq_len, d_model)
    
    # Create MHA module; eval() disables dropout so the returned weights are proper distributions
    mha = MultiHeadAttention(d_model, n_heads)
    mha.eval()
    
    # Forward pass (self-attention: q=k=v=x)
    output, attn_weights = mha(x, x, x)
    
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Attention weights shape: {attn_weights.shape}")
    print(f"Number of parameters: {sum(p.numel() for p in mha.parameters()):,}")
    
    return output, attn_weights

mha_output, mha_weights = test_multihead_attention()

### Analyzing Multi-Head Patterns
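
The analysis in this section summarizes each head by the entropy of its attention rows. As a rough guide to reading those numbers, the sketch below computes the same quantity for three hand-made distributions. The function name `attention_entropy` and the example vectors are assumptions made for illustration.

```python
import math

def attention_entropy(weights, eps=1e-10):
    """Shannon entropy (in nats) of one attention row; eps avoids log(0)."""
    return -sum(w * math.log(w + eps) for w in weights)

uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]    # attention mostly on one key
one_hot = [1.0, 0.0, 0.0, 0.0]       # attention entirely on one key

for name, dist in [("uniform", uniform), ("peaked", peaked), ("one_hot", one_hot)]:
    print(f"{name:8s} entropy = {attention_entropy(dist):.3f}")
```

A uniform row over `n` keys has the maximum entropy `log(n)`, while a row focused on a single key has entropy close to zero. Heads with randomly initialized weights will typically sit near the uniform end.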

In [None]:
def analyze_head_specialization(mha_weights):
    """
    Analyze what different attention heads focus on
    """
    # Take first batch for analysis
    weights = mha_weights[0].detach().numpy()  # Shape: [n_heads, seq_len, seq_len]
    n_heads = weights.shape[0]
    
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    axes = axes.flatten()
    
    for head_idx in range(min(n_heads, 8)):
        ax = axes[head_idx]
        im = ax.imshow(weights[head_idx], cmap='Blues', aspect='auto')
        ax.set_title(f'Head {head_idx + 1}')
        ax.set_xlabel('Key Position')
        ax.set_ylabel('Query Position')
        plt.colorbar(im, ax=ax, fraction=0.046)
    
    plt.suptitle('Attention Patterns Across Different Heads', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Analyze attention statistics
    print("\n📊 Attention Statistics per Head:")
    print("-" * 50)
    for head_idx in range(n_heads):
        head_weights = weights[head_idx]
        # Calculate entropy (how distributed the attention is)
        entropy = -np.sum(head_weights * np.log(head_weights + 1e-10), axis=-1).mean()
        # Calculate max attention (how focused the attention is)
        max_attn = head_weights.max(axis=-1).mean()
        print(f"Head {head_idx + 1}: Entropy={entropy:.3f}, Max Attention={max_attn:.3f}")

analyze_head_specialization(mha_weights)

---
### Exercise 3: Positional Encoding
#### 📚 Concept
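Since self-attention is permutation-equivariant (shuffling the input tokens merely shuffles the outputs in the same way, so token order carries no signal), we need to inject positional information. The original Transformer uses sinusoidal positional encoding: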

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
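
As a sanity check on the two formulas, the following sketch evaluates them directly for a tiny `d_model`, without the `exp`/`log` reformulation used for efficiency in tensor code. The function name `sinusoidal_pe` is hypothetical, and an even `d_model` is assumed.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Positional encoding vector for one position, straight from the formulas."""
    pe = []
    for even_idx in range(0, d_model, 2):      # even_idx plays the role of 2i
        angle = pos / (10000 ** (even_idx / d_model))
        pe.append(math.sin(angle))             # dimension 2i
        pe.append(math.cos(angle))             # dimension 2i + 1
    return pe

for pos in range(3):
    print(pos, [round(v, 4) for v in sinusoidal_pe(pos, d_model=4)])
```

Position 0 always encodes to alternating zeros and ones, and higher dimensions change more slowly with `pos` because their wavelengths are longer.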

In [None]:
class PositionalEncoding(nn.Module):
    """
    Sinusoidal Positional Encoding
    """
    def __init__(self, d_model, max_seq_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1).float()
        
        # Create div_term for the sinusoidal pattern
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                           -(math.log(10000.0) / d_model))
        
        # Apply sin to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cos to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension and register as buffer
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # Add positional encoding to input embeddings
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len, :]

def visualize_positional_encoding():
    """
    Visualize the sinusoidal positional encoding patterns
    """
    d_model = 512
    max_len = 100
    
    # Create positional encoding
    pe_module = PositionalEncoding(d_model, max_len)
    pe = pe_module.pe[0, :max_len, :].numpy()
    
    # Create interactive visualization
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            "Full Positional Encoding Matrix",
            "Encoding for Different Positions",
            "Sinusoidal Patterns (First 8 dimensions)",
            "Frequency Analysis"
        )
    )
    
    # 1. Full matrix heatmap
    fig.add_trace(
        go.Heatmap(z=pe.T, colorscale='RdBu', showscale=True),
        row=1, col=1
    )
    
    # 2. Encoding for specific positions
    positions_to_plot = [0, 10, 25, 50, 99]
    for pos in positions_to_plot:
        fig.add_trace(
            go.Scatter(y=pe[pos, :128], name=f'Pos {pos}', mode='lines'),
            row=1, col=2
        )
    
    # 3. Sinusoidal patterns
    for dim in range(0, 8, 2):
        fig.add_trace(
            go.Scatter(y=pe[:, dim], name=f'Dim {dim} (sin)', mode='lines'),
            row=2, col=1
        )
        fig.add_trace(
            go.Scatter(y=pe[:, dim+1], name=f'Dim {dim+1} (cos)', 
                      mode='lines', line=dict(dash='dash')),
            row=2, col=1
        )
    
    # 4. Frequency spectrum
    # Calculate frequency content
    for dim_idx in [0, 64, 128, 256]:
        freq_content = np.abs(np.fft.fft(pe[:, dim_idx]))[:50]
        fig.add_trace(
            go.Scatter(y=freq_content, name=f'Dim {dim_idx}', mode='lines'),
            row=2, col=2
        )
    
    # Update layout
    fig.update_layout(height=800, showlegend=True, 
                     title_text="Positional Encoding Analysis")
    # The heatmap plots pe.T, so columns (x) are positions and rows (y) are dimensions
    fig.update_xaxes(title_text="Position", row=1, col=1)
    fig.update_yaxes(title_text="Dimension", row=1, col=1)
    fig.update_xaxes(title_text="Dimension Index", row=1, col=2)
    fig.update_xaxes(title_text="Position", row=2, col=1)
    fig.update_xaxes(title_text="Frequency", row=2, col=2)
    
    fig.show()
    
    # Also create matplotlib visualization
    plt.figure(figsize=(15, 8))
    
    # Heatmap
    plt.subplot(2, 3, 1)
    plt.imshow(pe.T[:64, :], aspect='auto', cmap='coolwarm')
    plt.colorbar(label='Encoding Value')
    plt.xlabel('Position')
    plt.ylabel('Dimension')
    plt.title('Positional Encoding (First 64 dims)')
    
    # Line plots for different positions
    plt.subplot(2, 3, 2)
    for pos in [0, 10, 50, 99]:
        plt.plot(pe[pos, :128], label=f'Position {pos}', alpha=0.7)
    plt.xlabel('Dimension')
    plt.ylabel('Encoding Value')
    plt.title('Encoding Values at Different Positions')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Sinusoidal patterns
    plt.subplot(2, 3, 3)
    plt.plot(pe[:, 0], label='Dim 0 (sin)', color='blue')
    plt.plot(pe[:, 1], label='Dim 1 (cos)', color='red')
    plt.plot(pe[:, 8], label='Dim 8 (sin)', color='green')
    plt.plot(pe[:, 9], label='Dim 9 (cos)', color='orange')
    plt.xlabel('Position')
    plt.ylabel('Encoding Value')
    plt.title('Sinusoidal Patterns')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Distance matrix
    plt.subplot(2, 3, 4)
    # Calculate euclidean distances between positions
    distances = np.zeros((50, 50))
    for i in range(50):
        for j in range(50):
            distances[i, j] = np.linalg.norm(pe[i] - pe[j])
    plt.imshow(distances, cmap='viridis', aspect='auto')
    plt.colorbar(label='Euclidean Distance')
    plt.xlabel('Position')
    plt.ylabel('Position')
    plt.title('Distance Matrix Between Positions')
    
    # Relative position encoding
    plt.subplot(2, 3, 5)
    base_pos = 25
    relative_distances = [np.linalg.norm(pe[base_pos] - pe[base_pos + offset]) 
                         for offset in range(-25, 25)]
    plt.plot(range(-25, 25), relative_distances, 'o-')
    plt.xlabel('Relative Position')
    plt.ylabel('Euclidean Distance')
    plt.title(f'Distance from Position {base_pos}')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

visualize_positional_encoding()

### 🎯 Your Turn: Exercise 3
Implement **learnable positional embeddings** as an alternative to sinusoidal encoding. 
Compare with the sinusoidal approach.

In [None]:
class LearnablePositionalEmbedding(nn.Module):
    """
    Learnable Positional Embeddings (like BERT)
    """
    def __init__(self, d_model, max_seq_len=5000):
        super().__init__()
        # TODO: Create learnable embedding matrix
        # Hint: Use nn.Embedding
        self.pos_embedding = None  # Replace with your implementation
        
        # Solution (uncomment to see):
        # self.pos_embedding = nn.Embedding(max_seq_len, d_model)
        
    def forward(self, x):
        """
        x: [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, d_model = x.shape
        
        # TODO: Create position indices and add embeddings
        # positions = None  # Your code here
        # return None  # Your code here
        
        # Solution (uncomment to see):
        # positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        # return x + self.pos_embedding(positions)
        
        pass

# Test your implementation
# lpe = LearnablePositionalEmbedding(512, 1000)
# test_input = torch.randn(2, 50, 512)
# output = lpe(test_input)
# print(f"Output shape: {output.shape}")

## Part 3: Building a Complete Transformer

### Exercise 4: Transformer Encoder Block
#### 📚 Concept
A Transformer encoder block consists of:
1. Multi-Head Self-Attention
2. Add & Norm
3. Feed-Forward Network
4. Add & Norm
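
The "Add & Norm" steps in the list above can be sketched for a single token vector as follows. This is a simplified version that assumes the learnable gain and bias of layer normalization are left at their initial values of 1 and 0. The names `layer_norm` and `add_and_norm` are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]                 # input to the sub-layer
sublayer_out = [0.5, -0.5, 0.5, -0.5]    # e.g. attention or FFN output
normed = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in normed])
```

The residual path lets the input pass through unchanged, and the normalization keeps activations on a comparable scale from layer to layer.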

In [None]:
class FeedForward(nn.Module):
    """
    Position-wise Feed-Forward Network
    FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
    """
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.ReLU()
        
    def forward(self, x):
        # Two linear transformations with ReLU activation
        return self.linear2(self.dropout(self.activation(self.linear1(x))))

class TransformerEncoderBlock(nn.Module):
    """
    Single Transformer Encoder Block
    """
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, n_heads, dropout)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Multi-Head Attention with residual connection
        attn_output, attn_weights = self.mha(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-Forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        
        return x, attn_weights

# Test encoder block
def test_encoder_block():
    batch_size, seq_len, d_model = 2, 20, 512
    
    # Create input
    x = torch.randn(batch_size, seq_len, d_model)
    
    # Create encoder block
    encoder = TransformerEncoderBlock(d_model)
    
    # Forward pass
    output, attn_weights = encoder(x)
    
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Output statistics:")
    print(f"  Mean: {output.mean().item():.4f}")
    print(f"  Std: {output.std().item():.4f}")
    print(f"  Min: {output.min().item():.4f}")
    print(f"  Max: {output.max().item():.4f}")
    
    return output, attn_weights

encoder_output, encoder_weights = test_encoder_block()

### Exercise 5: Complete Transformer Encoder
#### 📚 Concept
Stack multiple encoder blocks to create a deep Transformer encoder.
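
A quick way to check a stacked encoder is to estimate its parameter count by hand. The sketch below assumes a standard block with four `d_model x d_model` projections (with biases), a two-layer feed-forward network, and a configurable number of LayerNorm modules. `encoder_block_params` and `n_layer_norms` are hypothetical names, and the figure printed by the notebook may differ slightly if a block holds a different number of LayerNorm modules.

```python
def encoder_block_params(d_model, d_ff, n_layer_norms=2):
    """Rough parameter count for one encoder block (illustrative helper)."""
    # W_q, W_k, W_v, W_o: each d_model x d_model weights plus d_model biases
    attn = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Each LayerNorm has a gain and a bias vector of size d_model
    norms = n_layer_norms * 2 * d_model
    return attn + ffn + norms

per_block = encoder_block_params(d_model=512, d_ff=2048)
print(f"{per_block:,} parameters per block, {6 * per_block:,} for 6 blocks")
```

Note that the number of heads does not appear: splitting `d_model` across heads changes the shapes used during attention but not the number of weights. Sinusoidal positional encodings add no parameters either, since they are fixed.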

In [None]:
class TransformerEncoder(nn.Module):
    """
    Complete Transformer Encoder with N stacked layers
    """
    def __init__(self, n_layers=6, d_model=512, n_heads=8, d_ff=2048, 
                 max_seq_len=5000, dropout=0.1):
        super().__init__()
        
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        
        # Stack of encoder blocks
        self.layers = nn.ModuleList([
            TransformerEncoderBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Add positional encoding
        x = self.pos_encoding(x)
        x = self.dropout(x)
        
        # Store attention weights from each layer
        attention_weights = []
        
        # Pass through each encoder layer
        for layer in self.layers:
            x, attn = layer(x, mask)
            attention_weights.append(attn)
        
        return x, attention_weights

# Create and test complete encoder
def test_transformer_encoder():
    batch_size, seq_len, d_model = 2, 30, 512
    n_layers = 6
    
    # Create input embeddings
    x = torch.randn(batch_size, seq_len, d_model)
    
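    # Create encoder; eval() disables dropout so the attention maps analyzed later are proper distributions
    encoder = TransformerEncoder(n_layers=n_layers, d_model=d_model)
    encoder.eval()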
    
    # Forward pass
    output, all_attn_weights = encoder(x)
    
    print(f"Encoder with {n_layers} layers:")
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Number of attention weight matrices: {len(all_attn_weights)}")
    print(f"Total parameters: {sum(p.numel() for p in encoder.parameters()):,}")
    
    return output, all_attn_weights

transformer_output, all_attention_weights = test_transformer_encoder()

### Analyzing Layer-wise Attention Patterns

In [None]:
def analyze_layerwise_attention(all_attention_weights):
    """
    Analyze how attention patterns change across layers
    """
    n_layers = len(all_attention_weights)
    
    # Create subplots for first 6 layers
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for layer_idx in range(min(n_layers, 6)):
        # Take first batch, first head for visualization
        weights = all_attention_weights[layer_idx][0, 0].detach().numpy()
        
        ax = axes[layer_idx]
        im = ax.imshow(weights, cmap='Reds', aspect='auto')
        ax.set_title(f'Layer {layer_idx + 1}')
        ax.set_xlabel('Key Position')
        ax.set_ylabel('Query Position')
        plt.colorbar(im, ax=ax, fraction=0.046)
    
    plt.suptitle('Attention Patterns Across Transformer Layers', 
                fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Analyze attention entropy across layers
    print("\n📊 Layer-wise Attention Analysis:")
    print("-" * 50)
    
    layer_entropies = []
    for layer_idx, attn in enumerate(all_attention_weights):
        # Calculate average entropy across all heads and positions
        weights = attn[0].detach().numpy()  # First batch
        entropy = -np.sum(weights * np.log(weights + 1e-10), axis=-1).mean()
        layer_entropies.append(entropy)
        print(f"Layer {layer_idx + 1}: Average Entropy = {entropy:.3f}")
    
    # Plot entropy trend
    plt.figure(figsize=(10, 4))
    plt.plot(range(1, len(layer_entropies) + 1), layer_entropies, 
            'o-', linewidth=2, markersize=8)
    plt.xlabel('Layer')
    plt.ylabel('Average Attention Entropy')
    plt.title('Attention Entropy Across Layers')
    plt.grid(True, alpha=0.3)
    plt.show()

analyze_layerwise_attention(all_attention_weights)

## Part 4: Practical Applications

### Exercise 6: Text Classification with Transformers
#### 📚 Concept
Apply Transformer encoder for sequence classification tasks.
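
One common way to turn per-token encoder outputs into a single sequence vector is masked mean pooling, which averages only over real tokens and ignores padding. Here is a minimal sketch on plain lists. The name `masked_mean_pool` and the toy data are assumptions for illustration.

```python
def masked_mean_pool(token_vectors, mask):
    """Average token vectors where mask == 1, ignoring padded positions.

    token_vectors: [seq_len][d_model], mask: [seq_len] of 0/1.
    """
    dim = len(token_vectors[0])
    n_real = max(sum(mask), 1e-9)  # guard against an all-padding sequence
    return [
        sum(vec[c] * m for vec, m in zip(token_vectors, mask)) / n_real
        for c in range(dim)
    ]

encoded = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
pad_mask = [1, 1, 0]
pooled = masked_mean_pool(encoded, pad_mask)
print(pooled)
```

Without the mask, the padding row would dominate the average, so the classifier would see a representation that depends on how much padding a sequence happens to have.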

In [None]:
class TransformerClassifier(nn.Module):
    """
    Transformer-based text classifier
    """
    def __init__(self, vocab_size, n_classes, d_model=512, n_layers=6, 
                 n_heads=8, d_ff=2048, max_seq_len=512, dropout=0.1):
        super().__init__()
        
        # Token embeddings
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)
        
        # Transformer encoder
        self.encoder = TransformerEncoder(
            n_layers, d_model, n_heads, d_ff, max_seq_len, dropout
        )
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, n_classes)
        )
        
    def forward(self, x, mask=None):
        # Embed tokens
        x = self.embedding(x) * self.scale
        
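        # Encode with transformer. The padding mask is [batch_size, seq_len];
        # reshape it to [batch_size, 1, 1, seq_len] so it broadcasts over
        # heads and query positions inside the attention layers.
        attn_mask = mask.unsqueeze(1).unsqueeze(2) if mask is not None else None
        encoded, attn_weights = self.encoder(x, attn_mask)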
        
        # Use [CLS] token or mean pooling for classification
        # Here we use mean pooling
        if mask is not None:
            mask_expanded = mask.unsqueeze(-1).expand(encoded.size()).float()
            sum_embeddings = torch.sum(encoded * mask_expanded, dim=1)
            sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
            pooled = sum_embeddings / sum_mask
        else:
            pooled = encoded.mean(dim=1)
        
        # Classify
        logits = self.classifier(pooled)
        
        return logits, attn_weights

# Create synthetic classification dataset
def create_synthetic_dataset():
    """
    Create a synthetic text classification dataset
    """
    vocab_size = 1000
    seq_len = 50
    n_samples = 100
    n_classes = 3
    
    # Generate random sequences
    X = torch.randint(0, vocab_size, (n_samples, seq_len))
    # Generate random labels
    y = torch.randint(0, n_classes, (n_samples,))
    
    # Create attention mask (simulate variable length sequences)
    mask = torch.ones(n_samples, seq_len)
    for i in range(n_samples):
        actual_len = torch.randint(20, seq_len, (1,)).item()
        mask[i, actual_len:] = 0
    
    return X, y, mask

# Train the classifier
def train_classifier():
    # Create dataset
    X, y, mask = create_synthetic_dataset()
    
    # Model parameters
    vocab_size = 1000
    n_classes = 3
    d_model = 256  # Smaller model for demo
    
    # Create model
    model = TransformerClassifier(
        vocab_size, n_classes, d_model=d_model, 
        n_layers=3, n_heads=4
    )
    
    # Training setup
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    
    # Training loop (simplified)
    model.train()
    n_epochs = 10
    batch_size = 16
    
    print("Training Transformer Classifier...")
    print("-" * 50)
    
    losses = []
    for epoch in range(n_epochs):
        epoch_loss = 0
        n_batches = len(X) // batch_size
        
        for i in range(n_batches):
            # Get batch
            batch_X = X[i*batch_size:(i+1)*batch_size]
            batch_y = y[i*batch_size:(i+1)*batch_size]
            batch_mask = mask[i*batch_size:(i+1)*batch_size]
            
            # Forward pass
            logits, _ = model(batch_X, batch_mask)
            loss = criterion(logits, batch_y)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        
        avg_loss = epoch_loss / n_batches
        losses.append(avg_loss)
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {avg_loss:.4f}")
    
    # Plot training curve
    plt.figure(figsize=(10, 4))
    plt.plot(losses, 'o-', linewidth=2, markersize=6)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training Loss Curve')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return model, losses

trained_model, training_losses = train_classifier()

### Exercise 7: Sequence-to-Sequence with Transformers
#### 📚 Concept
Implement a simple sequence-to-sequence model using Transformer architecture.

In [None]:
class TransformerSeq2Seq(nn.Module):
    """
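    Simple Transformer for Sequence-to-Sequence tasks
    (Encoder-only for simplicity: each input position is mapped directly to an
    output token, so `tgt` and `output_embedding` are currently unused)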
    """
    def __init__(self, input_vocab_size, output_vocab_size, 
                 d_model=512, n_layers=6, n_heads=8, d_ff=2048, 
                 max_seq_len=100, dropout=0.1):
        super().__init__()
        
        # Embeddings
        self.input_embedding = nn.Embedding(input_vocab_size, d_model)
        self.output_embedding = nn.Embedding(output_vocab_size, d_model)
        self.scale = math.sqrt(d_model)
        
        # Transformer encoder (adds positional encoding internally)
        self.encoder = TransformerEncoder(
            n_layers, d_model, n_heads, d_ff, max_seq_len, dropout
        )
        
        # Output projection
        self.output_projection = nn.Linear(d_model, output_vocab_size)
        
    def forward(self, src, tgt=None):
        # Embed source tokens. Positional encoding is added inside the encoder,
        # so it must not be applied here as well.
        src_emb = self.input_embedding(src) * self.scale
        
        # Get encoder output
        encoder_output, _ = self.encoder(src_emb)
        
        # Project to output vocabulary
        output = self.output_projection(encoder_output)
        
        return output

# Demonstrate sequence generation
def demonstrate_seq2seq():
    """
    Simple demonstration of sequence-to-sequence translation
    """
    # Model parameters
    input_vocab_size = 100
    output_vocab_size = 100
    seq_len = 20
    d_model = 256
    
    # Create model
    model = TransformerSeq2Seq(
        input_vocab_size, output_vocab_size,
        d_model=d_model, n_layers=2, n_heads=4
    )
    
    # Create sample input
    batch_size = 2
    src = torch.randint(0, input_vocab_size, (batch_size, seq_len))
    
    # Forward pass
    output = model(src)
    
    print(f"Sequence-to-Sequence Model:")
    print(f"Input shape: {src.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Output vocab size: {output_vocab_size}")
    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
    
    # Get predictions
    predictions = output.argmax(dim=-1)
    print(f"\nSample predictions shape: {predictions.shape}")
    
    return model, output

seq2seq_model, seq2seq_output = demonstrate_seq2seq()

### 🎯 Your Turn: Exercise 7
The demo above uses greedy decoding (argmax at every position). Implement beam search with `beam_size=3` for better sequence generation.
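
As a starting point, here is a model-agnostic sketch of beam search on a toy problem. It does not use the notebook's `TransformerSeq2Seq`. Instead it assumes a hypothetical scoring function `next_log_probs(prefix)` that returns log-probabilities for the next token, which is what an autoregressive decoder would provide. `TABLE`, `toy_model`, and `toy_beam_search` are all illustrative names, and length normalization is omitted for brevity.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far
TABLE = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.5, "y": 0.5},
    ("<s>", "b"): {"z": 0.9, "x": 0.1},
}

def toy_model(prefix):
    """Return {token: log_prob}; any unknown prefix ends the sequence."""
    probs = TABLE.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def toy_beam_search(next_log_probs, bos, eos, beam_size=3, max_length=10):
    beams = [([bos], 0.0)]   # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        # Expand every live hypothesis by every possible next token
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep the beam_size best candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)   # hypotheses cut off by max_length
    return max(finished, key=lambda c: c[1])

greedy_seq, greedy_score = toy_beam_search(toy_model, "<s>", "</s>", beam_size=1)
beam_seq, beam_score = toy_beam_search(toy_model, "<s>", "</s>", beam_size=3)
print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam  :", beam_seq, round(math.exp(beam_score), 2))
```

In this toy table the greedy search commits to `a` because it looks best at the first step, while the wider beam keeps `b` alive and finds the more probable complete sequence. Adapting the idea to a real model requires one that scores the next token given a prefix, which the encoder-only demo in this notebook does not do.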

In [None]:
def beam_search(model, src, beam_size=3, max_length=50):
    """
    Implement beam search for sequence generation
    TODO: Complete this implementation
    """
    # Your code here
    # Hints:
    # 1. Maintain beam_size hypotheses
    # 2. At each step, expand each hypothesis
    # 3. Keep top beam_size candidates based on cumulative probability
    # 4. Stop when EOS token or max_length reached
    
    # Placeholder return
    return None

# Test your implementation
# src_sequence = torch.randint(0, 100, (1, 20))
# best_sequence = beam_search(seq2seq_model, src_sequence, beam_size=3)
# print(f"Best sequence: {best_sequence}")

## Part 5: Advanced Topics

### Exercise 8: Attention Visualization and Interpretability
#### 📚 Concept
Visualize what the model is "looking at" when making predictions.

In [None]:
def create_attention_visualization(text_tokens, attention_weights):
    """
    Create an interactive attention visualization
    """
    # Convert attention weights to numpy if needed
    if torch.is_tensor(attention_weights):
        attention_weights = attention_weights.detach().cpu().numpy()
    
    # Reduce to a single 2D (query x key) map by taking the first
    # batch element / head until only two dimensions remain
    while attention_weights.ndim > 2:
        attention_weights = attention_weights[0]
    
    # Create interactive heatmap with Plotly
    fig = go.Figure(data=go.Heatmap(
        z=attention_weights,
        x=text_tokens,
        y=text_tokens,
        colorscale='Reds',
        text=np.round(attention_weights, 3),
        texttemplate="%{text}",
        textfont={"size": 8},
        showscale=True,
        hoverongaps=False,
        hovertemplate="From: %{y}<br>To: %{x}<br>Weight: %{z:.3f}<extra></extra>"
    ))
    
    fig.update_layout(
        title="Attention Weight Visualization",
        xaxis_title="Keys (To)",
        yaxis_title="Queries (From)",
        height=600,
        width=700,
        xaxis={'side': 'bottom'},
        yaxis={'autorange': 'reversed'}
    )
    
    fig.show()
    
    # Also create a flow diagram showing top attention connections
    threshold = 0.1  # Only show connections above this threshold
    
    plt.figure(figsize=(12, 8))
    
    # Create node positions in a circle
    n_tokens = len(text_tokens)
    angles = np.linspace(0, 2*np.pi, n_tokens, endpoint=False)
    x_pos = np.cos(angles)
    y_pos = np.sin(angles)
    
    # Draw tokens as nodes
    for i, (x, y, token) in enumerate(zip(x_pos, y_pos, text_tokens)):
        plt.scatter(x, y, s=1000, c='lightblue', edgecolors='black', linewidth=2, zorder=3)
        plt.text(x, y, token, ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Draw attention connections as edges
    for i in range(n_tokens):
        for j in range(n_tokens):
            if attention_weights[i, j] > threshold:
                # Draw arrow from i to j with weight as thickness
                plt.arrow(x_pos[i], y_pos[i], 
                         0.8*(x_pos[j] - x_pos[i]), 0.8*(y_pos[j] - y_pos[i]),
                         head_width=0.05, head_length=0.05,
                         alpha=attention_weights[i, j],
                         color='red', zorder=1,
                         length_includes_head=True,
                         width=attention_weights[i, j] * 0.02)
    
    plt.title(f"Attention Flow Diagram (Threshold > {threshold})", fontsize=14, fontweight='bold')
    plt.axis('equal')
    plt.axis('off')
    plt.tight_layout()
    plt.show()

# Example usage with sample tokens
sample_tokens = ["The", "cat", "sat", "on", "the", "mat", ".", "<PAD>"]
sample_attention = torch.softmax(torch.randn(8, 8) * 2, dim=-1)

create_attention_visualization(sample_tokens, sample_attention)
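
Alongside the plots, a plain-text summary of the strongest connections can be useful. The sketch below lists, for each query token, the keys that receive the most attention. The helper name `top_attended` and the small hand-written matrix are illustrative assumptions, not part of the notebook.

```python
import torch

def top_attended(tokens, attention, k=2):
    """Return, for each query token, the k key tokens with the largest weights."""
    weights, indices = attention.topk(k, dim=-1)
    summary = []
    for i, token in enumerate(tokens):
        top = [(tokens[j], round(w, 3))
               for j, w in zip(indices[i].tolist(), weights[i].tolist())]
        summary.append((token, top))
    return summary

tokens = ["The", "cat", "sat"]
# Each row is one query's attention distribution and sums to 1
attention = torch.tensor([[0.70, 0.20, 0.10],
                          [0.10, 0.60, 0.30],
                          [0.25, 0.30, 0.45]])

for query, top in top_attended(tokens, attention):
    print(f"{query:>4} -> {top}")
```

The same helper can be applied to `sample_attention` and `sample_tokens` from the cell above, since it only assumes a square 2D weight tensor.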

### Exercise 9: Efficient Attention Mechanisms
#### 📚 Concept
Explore efficient attention variants to handle longer sequences.

In [None]:
class LinearAttention(nn.Module):
    """
    Linear Attention - O(n) complexity in sequence length instead of O(n²)
    Replaces the softmax with a kernel feature map, so it approximates
    (and is not identical to) standard softmax attention
    """
    def __init__(self, d_model, n_heads=8, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(dropout)
        
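    def forward(self, q, k, v, mask=None):
        # NOTE: `mask` is accepted for interface compatibility but is not applied here
        batch_size, seq_len, _ = q.size()
        
        # Linear projections, split into heads: (batch, seq, heads, d_k)
        # -1 lets keys/values have a different length than the queries
        Q = self.W_q(q).view(batch_size, seq_len, self.n_heads, self.d_k)
        K = self.W_k(k).view(batch_size, -1, self.n_heads, self.d_k)
        V = self.W_v(v).view(batch_size, -1, self.n_heads, self.d_k)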
        
        # Apply kernel feature map (simplified - using ELU + 1)
        Q = F.elu(Q) + 1
        K = F.elu(K) + 1
        
        # Compute attention in linear time
        # Instead of (QK^T)V, compute Q(K^TV)
        K_T_V = torch.einsum('bshd,bshf->bhdf', K, V)
        QK_T_V = torch.einsum('bshd,bhdf->bshf', Q, K_T_V)
        
        # Normalization
        K_sum = K.sum(dim=1, keepdim=True)
        Q_K_sum = torch.einsum('bshd,bhd->bsh', Q, K_sum.squeeze(1))
        output = QK_T_V / (Q_K_sum.unsqueeze(-1) + 1e-6)
        
        # Reshape and project
        output = output.reshape(batch_size, seq_len, self.d_model)
        output = self.W_o(output)
        
        return output, None  # No attention weights in linear attention

def compare_attention_mechanisms():
    """
    Compare standard attention vs linear attention
    """
    import time
    
    d_model = 256
    n_heads = 4
    
    # Test different sequence lengths
    seq_lengths = [50, 100, 200, 500, 1000]
    
    standard_times = []
    linear_times = []
    
    print("Comparing Attention Mechanisms:")
    print("-" * 50)
    
    for seq_len in seq_lengths:
        batch_size = 4
        x = torch.randn(batch_size, seq_len, d_model)
        
        # Standard attention (eval mode disables dropout; the warm-up call
        # keeps one-time setup cost out of the measurement)
        standard_attn = MultiHeadAttention(d_model, n_heads).eval()
        with torch.no_grad():
            standard_attn(x, x, x)  # warm-up
            start = time.perf_counter()
            _, _ = standard_attn(x, x, x)
        standard_time = time.perf_counter() - start
        standard_times.append(standard_time)
        
        # Linear attention
        linear_attn = LinearAttention(d_model, n_heads).eval()
        with torch.no_grad():
            linear_attn(x, x, x)  # warm-up
            start = time.perf_counter()
            _, _ = linear_attn(x, x, x)
        linear_time = time.perf_counter() - start
        linear_times.append(linear_time)
        
        print(f"Seq Length {seq_len:4d}: Standard={standard_time:.4f}s, "
              f"Linear={linear_time:.4f}s, Speedup={standard_time/linear_time:.2f}x")
    
    # Plot comparison
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(seq_lengths, standard_times, 'o-', label='Standard Attention', linewidth=2)
    plt.plot(seq_lengths, linear_times, 's-', label='Linear Attention', linewidth=2)
    plt.xlabel('Sequence Length')
    plt.ylabel('Time (seconds)')
    plt.title('Computation Time Comparison')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 2, 2)
    speedup = [s/l for s, l in zip(standard_times, linear_times)]
    plt.bar(range(len(seq_lengths)), speedup, color='green', alpha=0.7)
    plt.xticks(range(len(seq_lengths)), seq_lengths)
    plt.xlabel('Sequence Length')
    plt.ylabel('Speedup Factor')
    plt.title('Linear Attention Speedup')
    plt.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()

compare_attention_mechanisms()
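
The speedup rests on matrix associativity: once the softmax is replaced by a feature map, `(φ(Q) φ(K)ᵀ) V` can be regrouped as `φ(Q) (φ(K)ᵀ V)`. The following single-head check is a minimal sketch of that identity using the same `ELU + 1` feature map. The tensor sizes and the local generator are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # local generator, leaves the global seed untouched
n, d = 6, 4                           # sequence length, head dimension
Q = torch.randn(n, d, generator=g)
K = torch.randn(n, d, generator=g)
V = torch.randn(n, d, generator=g)

phi_Q = F.elu(Q) + 1                  # strictly positive features
phi_K = F.elu(K) + 1

# Quadratic form: build the full n x n similarity matrix, then normalize each row
scores = phi_Q @ phi_K.T                                   # (n, n)
quadratic = (scores / scores.sum(dim=-1, keepdim=True)) @ V

# Linear form: summarize keys and values once; no n x n matrix is created
kv = phi_K.T @ V                                           # (d, d)
z = phi_K.sum(dim=0)                                       # (d,)
linear = (phi_Q @ kv) / (phi_Q @ z).unsqueeze(-1)

print("max abs difference:", (quadratic - linear).abs().max().item())
```

The two results should agree up to floating-point error. Note that both differ from softmax attention, because the kernel `φ(q)·φ(k)` only stands in for `exp(q·k/√d_k)`.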

### Exercise 10: Training Dynamics Analysis
#### 📚 Concept
Analyze and visualize the training dynamics of Transformers.

In [None]:
class TrainingMonitor:
    """
    Monitor and visualize transformer training dynamics
    """
    def __init__(self):
        self.metrics = {
            'loss': [],
            'gradient_norm': [],
            'learning_rate': [],
            'attention_entropy': []
        }
        
    def update(self, loss, model, lr, attention_weights=None):
        # Record loss
        self.metrics['loss'].append(loss)
        
        # Calculate the global L2 norm over all parameter gradients
        total_norm = 0.0
        for p in model.parameters():
            if p.grad is not None:
                param_norm = p.grad.detach().norm(2)
                total_norm += param_norm.item() ** 2
        total_norm = total_norm ** 0.5
        self.metrics['gradient_norm'].append(total_norm)
        
        # Record learning rate
        self.metrics['learning_rate'].append(lr)
        
        # Calculate attention entropy if provided
        if attention_weights is not None:
            # Mean Shannon entropy (in nats) of each query's attention distribution
            weights = attention_weights.detach().cpu().numpy()
            entropy = -np.sum(weights * np.log(weights + 1e-10), axis=-1).mean()
            self.metrics['attention_entropy'].append(float(entropy))
        
    def plot_metrics(self):
        """
        Create comprehensive training visualization
        """
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=('Training Loss', 'Gradient Norm', 
                          'Learning Rate Schedule', 'Attention Entropy')
        )
        
        # Training loss
        fig.add_trace(
            go.Scatter(y=self.metrics['loss'], mode='lines', name='Loss'),
            row=1, col=1
        )
        
        # Gradient norm
        fig.add_trace(
            go.Scatter(y=self.metrics['gradient_norm'], mode='lines', 
                      name='Grad Norm', line=dict(color='red')),
            row=1, col=2
        )
        
        # Learning rate
        fig.add_trace(
            go.Scatter(y=self.metrics['learning_rate'], mode='lines', 
                      name='LR', line=dict(color='green')),
            row=2, col=1
        )
        
        # Attention entropy
        if self.metrics['attention_entropy']:
            fig.add_trace(
                go.Scatter(y=self.metrics['attention_entropy'], mode='lines', 
                          name='Entropy', line=dict(color='purple')),
                row=2, col=2
            )
        
        fig.update_layout(height=600, showlegend=False, 
                         title_text="Training Dynamics Analysis")
        fig.update_xaxes(title_text="Step")
        fig.update_yaxes(title_text="Value")
        
        fig.show()

# Simulate training with monitoring
def simulate_training_with_monitoring():
    """
    Demonstrate training monitoring
    """
    # Create simple model
    model = TransformerEncoder(n_layers=2, d_model=128, n_heads=4)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    monitor = TrainingMonitor()
    
    # Simulate training steps
    print("Simulating training with monitoring...")
    for step in range(50):
        # Fake data
        x = torch.randn(4, 20, 128)
        
        # Forward pass
        output, attn_weights = model(x)
        loss = output.mean()  # Dummy loss: only produces gradients, not a meaningful objective
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Update monitor
        current_lr = optimizer.param_groups[0]['lr']
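        # `if attn_weights` would raise for a multi-element tensor, so test explicitly
        attn_sample = None
        if attn_weights is not None and len(attn_weights) > 0:
            attn_sample = attn_weights[0]
        monitor.update(loss.item(), model, current_lr, attn_sample)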
        
        if (step + 1) % 10 == 0:
            print(f"Step {step+1}: Loss={loss.item():.4f}")
    
    # Plot training dynamics
    monitor.plot_metrics()
    
    return monitor

training_monitor = simulate_training_with_monitoring()
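
To read the attention entropy curve, it helps to know its range. The sketch below applies the same formula as `TrainingMonitor.update` to two extreme cases: uniform attention over `n` keys (entropy close to `log n`) and one-hot attention (entropy close to 0). The standalone function `attention_entropy` is a hypothetical helper written for this illustration.

```python
import math
import numpy as np

def attention_entropy(weights, eps=1e-10):
    """Mean Shannon entropy (in nats) over the rows of an attention matrix."""
    weights = np.asarray(weights, dtype=np.float64)
    return float(-np.sum(weights * np.log(weights + eps), axis=-1).mean())

n = 20
uniform = np.full((n, n), 1.0 / n)  # every query spreads attention evenly
peaked = np.eye(n)                  # every query attends to exactly one key

print("uniform entropy:", attention_entropy(uniform), "| log n:", math.log(n))
print("peaked entropy: ", attention_entropy(peaked))
```

Values near `log n` suggest heads that have not yet specialized, while values near 0 indicate sharply focused attention. Since the simulation above uses sequences of length 20, `log 20` is the upper bound for its entropy plot.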

### 🎯 Your Turn: Exercise 10
Implement the learning rate warmup schedule used in the original Transformer paper.
The formula, defined for `step >= 1`, is: `lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))`

In [None]:
class WarmupScheduler:
    """
    Learning rate scheduler with warmup
    TODO: Implement the warmup schedule
    """
    def __init__(self, d_model, warmup_steps=4000):
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        
    def get_lr(self, step):
        """
        Calculate learning rate for given step (step must be >= 1,
        because step^(-0.5) is undefined at step 0)
        """
        # TODO: Implement the formula
        # lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
        
        # Your code here
        lr = 0.001  # Replace with proper implementation
        
        # Solution (uncomment to see):
        # arg1 = step ** (-0.5)
        # arg2 = step * (self.warmup_steps ** (-1.5))
        # lr = (self.d_model ** (-0.5)) * min(arg1, arg2)
        
        return lr

# Test and visualize the schedule
def visualize_lr_schedule():
    scheduler = WarmupScheduler(d_model=512, warmup_steps=4000)
    
    steps = np.arange(1, 10000)
    lrs = [scheduler.get_lr(step) for step in steps]
    
    plt.figure(figsize=(10, 5))
    plt.plot(steps, lrs, linewidth=2)
    plt.axvline(x=scheduler.warmup_steps, color='red', linestyle='--', label='Warmup ends')
    plt.xlabel('Training Step')
    plt.ylabel('Learning Rate')
    plt.title('Learning Rate Schedule with Warmup')
    plt.grid(True, alpha=0.3)
    plt.legend()
    plt.show()

# visualize_lr_schedule()
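
As a quick sanity check on the formula, the two arguments of `min` are equal exactly at `step = warmup_steps`, so the peak learning rate is `(d_model * warmup_steps)^(-0.5)`. The standalone function below is a sketch for verifying that numerically; the name `transformer_lr` is an assumption and mirrors the commented solution above.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from the warmup formula; step must be >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)
closed_form = (512 * 4000) ** -0.5

print(f"peak lr at step 4000: {peak:.6f}")
print(f"closed form:          {closed_form:.6f}")
print("rises before, decays after:",
      transformer_lr(2000) < peak > transformer_lr(8000))
```

During warmup the rate grows linearly with the step, and afterwards it decays proportionally to the inverse square root of the step.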

## Part 6: Summary and Final Exercises

### 🎓 Key Takeaways

1. **Self-Attention**: Enables parallel processing and captures global dependencies
2. **Multi-Head Attention**: Learns different types of relationships simultaneously
3. **Positional Encoding**: Injects position information into the model
4. **Architecture**: Encoder-decoder structure with residual connections and normalization
5. **Applications**: Versatile architecture for NLP, vision, and multimodal tasks

### 📊 Performance Comparison

In [None]:
# Create a comprehensive comparison chart
def create_architecture_comparison():
    """
    Compare Transformer with RNN/LSTM/GRU
    """
    architectures = ['RNN', 'LSTM', 'GRU', 'Transformer']
    
    # Illustrative, subjective scores on a 1-10 scale (not measured benchmarks)
    metrics = {
        'Parallelization': [1, 1, 1, 10],
        'Long-range Dependencies': [2, 5, 5, 9],
        'Training Speed': [3, 2, 3, 8],
        'Memory Efficiency': [8, 6, 7, 4],
        'Parameter Efficiency': [9, 5, 6, 7]
    }
    
    # Create radar chart
    fig = go.Figure()
    
    for i, arch in enumerate(architectures):
        values = [metrics[metric][i] for metric in metrics]
        values.append(values[0])  # Close the polygon
        
        fig.add_trace(go.Scatterpolar(
            r=values,
            theta=list(metrics.keys()) + [list(metrics.keys())[0]],
            fill='toself',
            name=arch
        ))
    
    fig.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                range=[0, 10]
            )),
        showlegend=True,
        title="Architecture Comparison: RNN/LSTM/GRU vs Transformer (illustrative scores)",
        height=500
    )
    
    fig.show()
    
    # Also create bar chart comparison
    plt.figure(figsize=(12, 6))
    
    x = np.arange(len(metrics))
    width = 0.2
    
    for i, arch in enumerate(architectures):
        values = [metrics[metric][i] for metric in metrics.keys()]
        plt.bar(x + i*width, values, width, label=arch)
    
    plt.xlabel('Metrics')
    plt.ylabel('Score (1-10)')
    plt.title('Architecture Performance Comparison')
    plt.xticks(x + width*1.5, list(metrics.keys()), rotation=45)
    plt.legend()
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

create_architecture_comparison()

### 🚀 Next Steps and Resources

#### Recommended Implementations to Try:
1. **BERT-style Pretraining**: Implement masked language modeling
2. **GPT-style Generation**: Build an autoregressive language model
3. **Vision Transformer**: Apply Transformers to image patches
4. **Cross-modal Attention**: Combine text and image understanding

#### Key Papers to Read:
- **"Attention Is All You Need"** (Vaswani et al., 2017) - Original Transformer
- **"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"** (Devlin et al., 2018)
- **"Language Models are Few-Shot Learners"** (Brown et al., 2020) - GPT-3
- **"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"** (Dosovitskiy et al., 2020) - Vision Transformer

#### Useful Libraries:
- **Hugging Face Transformers**: Production-ready implementations
- **PyTorch**: Flexible deep learning framework
- **Einops**: Elegant tensor operations
- **Weights & Biases**: Experiment tracking

---

### 📝 Final Challenge Exercise

Implement a mini-BERT model for masked language modeling:

In [None]:
# Final Challenge: Mini-BERT Implementation
class MiniBERT(nn.Module):
    """
    Simplified BERT for Masked Language Modeling
    Your task: Complete the implementation
    """
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4, 
                 max_seq_len=512, mask_token_id=103):
        super().__init__()
        self.mask_token_id = mask_token_id
        
        # TODO: Add components
        # 1. Token embeddings
        # 2. Segment embeddings (for sentence A/B)
        # 3. Positional encoding
        # 4. Transformer encoder
        # 5. MLM head (prediction layer)
        
        pass
    
    def forward(self, input_ids, mask_positions):
        """
        Forward pass for masked language modeling
        
        Args:
            input_ids: Token IDs with [MASK] tokens
            mask_positions: Positions of masked tokens
        
        Returns:
            predictions: Vocabulary predictions for masked positions
        """
        # TODO: Implement forward pass
        pass
    
    def create_mlm_data(self, text_ids, mask_prob=0.15):
        """
        Create masked language modeling training data
        """
        # TODO: Randomly mask tokens
        pass

# Placeholder for testing
print("Challenge: Implement MiniBERT for masked language modeling!")
print("This combines all concepts learned in this notebook.")
print("Good luck! 🚀")
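
For the data-creation step, the sketch below shows one simplified way to mask tokens. It always substitutes the mask id, whereas the original BERT recipe replaces a selected token with `[MASK]` 80% of the time, a random token 10% of the time, and leaves it unchanged 10% of the time. The function name `mask_tokens`, the `ignore_index=-100` convention, and the CPU-only random draw are assumptions made for this illustration.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15,
                ignore_index=-100, generator=None):
    """Replace a random subset of tokens with mask_token_id.

    Returns (masked_ids, labels). labels keep the original id at masked
    positions and ignore_index elsewhere, so a loss configured with the
    same ignore_index only scores the masked tokens.
    """
    is_masked = torch.rand(input_ids.shape, generator=generator) < mask_prob
    masked_ids = input_ids.clone()
    masked_ids[is_masked] = mask_token_id
    labels = torch.full_like(input_ids, ignore_index)
    labels[is_masked] = input_ids[is_masked]
    return masked_ids, labels

g = torch.Generator().manual_seed(0)
ids = torch.randint(0, 100, (2, 50), generator=g)  # toy ids; 103 is outside this range
masked_ids, labels = mask_tokens(ids, mask_token_id=103, generator=g)

print("masked positions per sequence:", (labels != -100).sum(dim=1).tolist())
```

The returned `labels` tensor can be passed to `F.cross_entropy(..., ignore_index=-100)` together with the logits of the MLM head.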

---

## 🎉 Congratulations!

You've completed the comprehensive Transformer Architecture notebook! You've learned:

✅ Self-attention mechanism and its implementation  
✅ Multi-head attention for learning diverse patterns  
✅ Positional encoding techniques  
✅ Complete Transformer architecture  
✅ Practical applications and optimizations  
✅ Advanced topics like efficient attention and training dynamics  

### 📚 Additional Exercises:
1. Experiment with different positional encoding methods
2. Implement cross-attention for encoder-decoder models
3. Try different attention patterns (local, sparse, etc.)
4. Build a small-scale language model
5. Apply Transformers to your own dataset

### 🤝 Connect and Share:
Share your implementations and insights with the community!

---
**Remember**: The best way to understand Transformers is to implement them from scratch. Keep experimenting and building! 💪