# 🎓 Transformer Training - Harvard NLP Style (v2.0)
## Train on WikiText-2 with Complete Monitoring & Evaluation

**Based on Harvard NLP's Annotated Transformer**

### 📚 What's New in v2.0?
- ✅ **Harvard NLP Batch Class**: Uses `Batch` class for automatic masking
- ✅ **Extended Training**: 20 epochs (vs 3 epochs)
- ✅ **Generation Monitoring**: Sample outputs after each epoch
- ✅ **Quality Metrics**: Track output length, repetition rate, and vocabulary diversity
- ✅ **Early Stopping**: Stops automatically when validation loss plateaus
- ✅ **Better Prompts**: Wikipedia-style prompts matching training data
- ✅ **LR Visualization**: See the Harvard NLP learning rate schedule
- ✅ **Module Reload**: No more caching issues!
- ✅ **Clean Structure**: Professional organization
- ✅ **Proper Loss Calculation**: Follows Harvard NLP pattern exactly

### 🔑 Key Harvard NLP Patterns Used:
1. **`make_model()`** - Factory function for model creation
2. **`Batch` class** - Automatic src/tgt splitting and masking
3. **`rate()` scheduler** - Learning rate warmup from the paper
4. **`EncoderDecoder`** - Main architecture wrapper
5. **`Generator`** - Log softmax projection layer

### 📋 Prerequisites
1. Upload `data_processed.zip` to Colab
2. Extract it: `!unzip data_processed.zip`
3. Mount Google Drive for checkpoints

### ⏱️ Expected Training Time
- ~2.5 hours for 20 epochs on an A100 GPU (varies with the dataset subset size set in `config`)
- Final perplexity: ~180-220 (vs 350+ with 3 epochs)

---

## 📦 Setup & Installation

### 1️⃣ Check GPU & Device

In [None]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected - training will be VERY slow!")

print(f"\n✓ Using device: {device}")

### 2️⃣ Mount Google Drive

In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')

# Create checkpoint directory
CHECKPOINT_DIR = '/content/drive/MyDrive/transformer_checkpoints_v2'
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
print(f"✓ Checkpoints will be saved to: {CHECKPOINT_DIR}")

### 3️⃣ Clone Repository & Install Package

In [None]:
# Clean up any existing installation
!rm -rf LLM-Journey

# Clone repository
!git clone https://github.com/mohamedAtoui/LLM-Journey
%cd LLM-Journey

# Install dependencies
!pip install -q datasets transformers tqdm matplotlib seaborn

# Install mha package in editable mode
!pip install -q -e .

print("\n✓ Installation complete!")

### 4️⃣ Import Everything

In [None]:
# Standard libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer
from datasets import load_from_disk
from tqdm import tqdm
import math
import numpy as np
import matplotlib.pyplot as plt
import time

# Harvard NLP components
from mha import make_model
from mha.utils import rate, Batch  # Import Batch class!
from mha.inference import TextGenerator
from mha.attention import subsequent_mask

print("✓ All imports successful!")
print("✓ Using Harvard NLP's Annotated Transformer patterns")
print("✓ Training will use the Batch class for proper masking")

## ⚙️ Configuration

In [None]:
# Model architecture (Harvard NLP parameter names)
config = {
    # Model
    'src_vocab': 50257,
    'tgt_vocab': 50257,
    'N': 6,
    'd_model': 512,
    'd_ff': 2048,
    'h': 8,
    'dropout': 0.1,
    'max_seq_length': 512,
    
    # Training
    'batch_size': 8,
    'num_epochs': 20,  # Extended from 3!
    'warmup_steps': 4000,  # Increased from 2000
    'gradient_clip': 1.0,
    
    # Early stopping
    'early_stop_patience': 3,
    'early_stop_min_delta': 0.01,
    
    # Dataset size (for experimental training)
    # Set to None to use full dataset, or specify number of samples
    'train_subset_size': 362,  # e.g., 1000 for 1k samples
    'val_subset_size': 22,    # e.g., 200 for 200 samples
}

print("Configuration:")
print("=" * 60)
for key, value in config.items():
    print(f"  {key:20s}: {value}")
print("=" * 60)

## 📊 Load Data

### 🧪 Experimental Training Options

You can reduce dataset size for faster experimental training by modifying `config`:

| Setup | train_subset_size | val_subset_size | Time/Epoch | Total Time (20 epochs) | Use Case |
|-------|-------------------|-----------------|------------|------------------------|----------|
| **Full Dataset** | `None` | `None` | ~15 min | ~5 hours | Final training |
| **50% Dataset** | `1810` | `109` | ~7-8 min | ~2.5 hours | Quick experiment |
| **25% Dataset** | `905` | `55` | ~4 min | ~1.3 hours | Fast iteration |
| **10% Dataset** | `362` | `22` | ~2 min | ~40 minutes | Very quick test |
| **5% Dataset** | `181` | `11` | ~1 min | ~20 minutes | Rapid prototyping |

**Original dataset sizes:**
- Train: 3,620 samples
- Validation: 218 samples

**Example for 10% dataset:**
```python
config = {
    ...
    'train_subset_size': 362,  # 10% of 3,620
    'val_subset_size': 22,     # 10% of 218
}
```
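
The subset sizes in the table can be derived from the full split sizes rather than typed by hand. A minimal sketch, assuming a hypothetical helper `subset_sizes` (not part of the notebook or the `mha` package) that takes a whole-number percentage and rounds up so that a small validation split never drops to zero:

```python
# Hypothetical helper: derive subset sizes from a whole-number percentage.
FULL_TRAIN = 3620  # full training split size quoted above
FULL_VAL = 218     # full validation split size quoted above

def subset_sizes(percent, full_train=FULL_TRAIN, full_val=FULL_VAL):
    """Return (train_subset_size, val_subset_size), rounded up."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer ceiling division avoids floating-point rounding surprises.
    train = (full_train * percent + 99) // 100
    val = (full_val * percent + 99) // 100
    return train, val

for pct in (50, 25, 10, 5):
    print(pct, subset_sizes(pct))
```

The two returned values can be assigned to `config['train_subset_size']` and `config['val_subset_size']`.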

**Note:** Smaller datasets will result in:
- ✅ Much faster training
- ✅ Quick iteration for testing
- ⚠️ Lower final quality (less data to learn from)
- ⚠️ Higher validation perplexity
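
One more consequence of small subsets is worth checking: the number of optimizer steps may be smaller than `warmup_steps`, in which case the learning rate is still ramping up when training ends. A rough sketch of the arithmetic, assuming one optimizer step per batch and a `DataLoader` that keeps the last partial batch (the PyTorch default); `total_steps` is a hypothetical helper name:

```python
import math

def total_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps over a full run, one step per batch."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

warmup_steps = 4000  # value from the config above
for label, n in (("full", 3620), ("10%", 362)):
    steps = total_steps(n, batch_size=8, num_epochs=20)
    status = "completes" if steps > warmup_steps else "does not complete"
    print(f"{label}: {steps} steps, warmup {status}")
```

With the 10% subset and batch size 8 this works out to 920 steps, well short of the 4,000 warmup steps, so a smaller `warmup_steps` may be worth trying for subset runs.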

In [None]:
print("Loading WikiText-2 dataset...\n")

DATA_PATH = './data/wikitext2_processed'

# Load dataset
dataset = load_from_disk(DATA_PATH)
train_dataset = dataset['train']
val_dataset = dataset['validation']

# Apply subset if specified (for experimental training)
if config['train_subset_size'] is not None:
    train_dataset = train_dataset.select(range(min(config['train_subset_size'], len(train_dataset))))
    print(f"⚠️ Using subset of training data: {len(train_dataset)} samples")

if config['val_subset_size'] is not None:
    val_dataset = val_dataset.select(range(min(config['val_subset_size'], len(val_dataset))))
    print(f"⚠️ Using subset of validation data: {len(val_dataset)} samples")

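# Initialize tokenizer
# GPT-2 has no dedicated pad token, so EOS is reused as padding. As a result,
# real EOS tokens share the pad id and are also masked out and ignored in the loss.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token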

# Set PyTorch format
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

print(f"✓ Dataset loaded!")
print(f"  Vocab size: {len(tokenizer):,}")
print(f"  Train samples: {len(train_dataset):,}")
print(f"  Val samples: {len(val_dataset):,}")
print(f"  Sequence length: {len(train_dataset[0]['input_ids'])} tokens")

### Create DataLoaders

In [None]:
def collate_fn(batch):
    input_ids = torch.stack([item['input_ids'] for item in batch])
    attention_mask = torch.stack([item['attention_mask'] for item in batch])
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': input_ids.clone()
    }

train_loader = DataLoader(
    train_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=collate_fn
)

val_loader = DataLoader(
    val_dataset,
    batch_size=config['batch_size'],
    shuffle=False,
    collate_fn=collate_fn
)

print(f"✓ DataLoaders created")
print(f"  Train batches: {len(train_loader):,}")
print(f"  Val batches: {len(val_loader):,}")

## 🏗️ Create Model

In [None]:
# Create model using Harvard NLP's make_model()
model = make_model(
    src_vocab=config['src_vocab'],
    tgt_vocab=config['tgt_vocab'],
    N=config['N'],
    d_model=config['d_model'],
    d_ff=config['d_ff'],
    h=config['h'],
    dropout=config['dropout']
).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"✓ Model created!")
print(f"  Architecture: EncoderDecoder (Harvard NLP)")
print(f"  Layers: {config['N']} encoder + {config['N']} decoder")
print(f"  Parameters: {total_params:,} ({total_params/1e6:.1f}M)")
print(f"  Model size: ~{total_params * 4 / 1e6:.1f} MB")

## 📈 Visualize Learning Rate Schedule
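
For reference, the schedule plotted in this section can be written in a few lines. This is a sketch modeled on the formula in the plot title, not the `mha.utils.rate` source; the name `rate_sketch` is hypothetical, and the step-0 guard is an assumption borrowed from the Annotated Transformer:

```python
def rate_sketch(step, model_size, factor, warmup):
    """lrate = factor * model_size^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    if step == 0:
        step = 1  # avoid 0 ** -0.5 on the scheduler's first call
    return factor * (model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5))

# The two branches of min() meet at step == warmup, which is where the rate peaks.
peak = rate_sketch(4000, model_size=512, factor=1.0, warmup=4000)
print(f"peak LR: {peak:.6f}")
```

Before the warmup step the second branch grows linearly with `step`; afterwards the first branch decays as the inverse square root of `step`.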

In [None]:
# Visualize the Harvard NLP learning rate schedule
steps = np.arange(1, 15000)
lrs = [rate(s, config['d_model'], 1.0, config['warmup_steps']) for s in steps]

plt.figure(figsize=(12, 4))
plt.plot(steps, lrs, linewidth=2)
plt.axvline(config['warmup_steps'], color='r', linestyle='--', 
            label=f'Warmup end ({config["warmup_steps"]} steps)', linewidth=2)
plt.xlabel('Training Step', fontsize=12)
plt.ylabel('Learning Rate', fontsize=12)
plt.title('Harvard NLP Learning Rate Schedule: lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))', 
          fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"LR at step 1: {rate(1, config['d_model'], 1.0, config['warmup_steps']):.8f}")
print(f"LR at warmup: {rate(config['warmup_steps'], config['d_model'], 1.0, config['warmup_steps']):.6f}")
print(f"LR at 2x warmup: {rate(config['warmup_steps']*2, config['d_model'], 1.0, config['warmup_steps']):.6f}")

## 🎯 Setup Optimizer & Loss

In [None]:
# Adam optimizer (parameters from the paper)
optimizer = optim.Adam(
    model.parameters(),
    lr=1.0,
    betas=(0.9, 0.98),
    eps=1e-9
)

# Learning rate scheduler
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda step: rate(
        step + 1,
        model_size=config['d_model'],
        factor=1.0,
        warmup=config['warmup_steps']
    )
)

# Loss function
criterion = nn.NLLLoss(ignore_index=tokenizer.pad_token_id, reduction='sum')

print("✓ Optimizer & loss configured")
print(f"  Optimizer: Adam (betas=(0.9, 0.98), eps=1e-9)")
print(f"  Scheduler: Harvard NLP rate() with {config['warmup_steps']} warmup steps")
print(f"  Loss: NLLLoss (reduction='sum')")

## 🛠️ Training Utilities
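
The early-stopping rule defined in this section counts an epoch as an improvement only when validation loss drops by more than `min_delta` below the best value so far. A small trace of that rule, using a hypothetical `stop_epoch` function and made-up loss values purely for illustration:

```python
def stop_epoch(val_losses, patience=3, min_delta=0.01):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best is None:
            best = loss
            continue
        if loss > best - min_delta:
            # Not enough improvement: count towards patience.
            counter += 1
            if counter >= patience:
                return epoch
        else:
            best = loss
            counter = 0
    return None

# Illustrative (made-up) validation losses.
losses = [5.00, 4.80, 4.795, 4.798, 4.81, 4.70]
print(stop_epoch(losses))
```

The 0.005 drop at epoch 3 does not reset the counter because it is smaller than `min_delta`, so the run stops before the larger improvement at the end of the list is ever seen.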

In [None]:
class EarlyStopping:
    """Early stopping to prevent overfitting"""
    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
    
    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            return False
        
        if val_loss > self.best_loss - self.min_delta:
            self.counter += 1
            print(f"  ⚠️ EarlyStopping counter: {self.counter}/{self.patience}")
            if self.counter >= self.patience:
                return True
        else:
            self.best_loss = val_loss
            self.counter = 0
        return False

def format_time(seconds):
    """Format seconds into readable time"""
    mins = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{mins}m {secs}s"

print("✓ Utility functions defined")

### Training Function

**Harvard NLP Pattern Explanation:**

The training loop follows the Annotated Transformer's approach using the `Batch` class:

```python
# 1. Create Batch object (automatic masking)
batch = Batch(src=input_ids, tgt=input_ids, pad=pad_token)
# This automatically creates:
#   - batch.src: Source sequence
#   - batch.src_mask: Masks padding in source
#   - batch.tgt: Target input (excludes last token)
#   - batch.tgt_y: Target labels (excludes first token)
#   - batch.tgt_mask: Masks padding + future tokens
#   - batch.ntokens: Count of valid tokens

# 2. Forward pass
output = model.forward(batch.src, batch.tgt, batch.src_mask, batch.tgt_mask)

# 3. Generate log probabilities
log_probs = model.generator(output)

# 4. Compute loss (sum reduction)
loss = criterion(log_probs.reshape(-1, vocab), batch.tgt_y.reshape(-1))

# 5. Normalize by token count
loss = loss / batch.ntokens
```

This is the **clean, modular approach** from Harvard NLP - no manual mask creation!
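
To make the shift and the masks concrete, here is a plain-Python illustration on a single toy sequence. It uses lists instead of tensors and hypothetical helper names, so it only mirrors what the `Batch` class is described as doing above; it is not the `mha.utils` implementation:

```python
PAD = 0  # toy pad id for this illustration only

def split_target(tokens):
    """Decoder input drops the last token; labels drop the first."""
    return tokens[:-1], tokens[1:]

def target_mask(tgt_in, pad=PAD):
    """mask[i][j] is True when position i may attend to position j:
    j must not be in the future (j <= i) and must not be padding."""
    n = len(tgt_in)
    return [[j <= i and tgt_in[j] != pad for j in range(n)] for i in range(n)]

tokens = [11, 12, 13, 14, PAD, PAD]
tgt_in, tgt_y = split_target(tokens)
mask = target_mask(tgt_in)
ntokens = sum(1 for t in tgt_y if t != PAD)  # labels that count toward the loss

print("tgt    :", tgt_in)
print("tgt_y  :", tgt_y)
print("ntokens:", ntokens)
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each printed row shows which earlier positions a given decoder position is allowed to attend to; the padded column is never attended to.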

In [None]:
def train_epoch(model, train_loader, optimizer, scheduler, criterion, device, epoch):
    """
    Train for one epoch using Harvard NLP's Batch class pattern
    
    This follows the Annotated Transformer approach:
    1. Create Batch object with automatic masking
    2. Forward pass through model
    3. Generate log probabilities
    4. Compute loss (sum reduction)
    5. Normalize by number of tokens
    """
    model.train()
    total_loss = 0
    total_tokens = 0
    grad_norms = []
    
    pbar = tqdm(train_loader, desc=f"Epoch {epoch}")
    for batch_idx, data in enumerate(pbar):
        input_ids = data['input_ids'].to(device)
        
        # Harvard NLP pattern: Use Batch class for automatic masking
        batch = Batch(
            src=input_ids,
            tgt=input_ids,
            pad=tokenizer.pad_token_id
        )
        
        # Forward pass (Harvard NLP style)
        decoder_output = model.forward(
            batch.src,      # Source sequence
            batch.tgt,      # Target input (excludes last token)
            batch.src_mask, # Source mask (hides padding)
            batch.tgt_mask  # Target mask (hides padding + future)
        )
        
        # Generate log probabilities
        log_probs = model.generator(decoder_output)
        
        # Compute loss (sum over batch)
        loss_sum = criterion(
            log_probs.reshape(-1, config['tgt_vocab']),
            batch.tgt_y.reshape(-1)  # Use batch.tgt_y (excludes first token)
        )
        
        # Normalize by number of tokens (Harvard NLP way)
        num_tokens = batch.ntokens.item()  # Use batch.ntokens
        loss = loss_sum / num_tokens if num_tokens > 0 else loss_sum
        
        # NaN detection
        if not torch.isfinite(loss):
            print(f"\n⚠️ Non-finite loss at batch {batch_idx}")
            continue
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping
        grad_norm = torch.nn.utils.clip_grad_norm_(
            model.parameters(), 
            config['gradient_clip']
        )
        grad_norms.append(grad_norm.item())
        
        # Update weights
        optimizer.step()
        scheduler.step()
        
        # Accumulate for epoch average
        total_loss += loss_sum.item()
        total_tokens += num_tokens
        
        # Display progress
        current_lr = scheduler.get_last_lr()[0]
        pbar.set_postfix({
            'loss': f"{loss.item():.4f}",
            'lr': f"{current_lr:.2e}",
            'grad': f"{grad_norm:.2f}"
        })
    
    # Compute epoch averages
    avg_loss = total_loss / total_tokens if total_tokens > 0 else float('inf')
    perplexity = math.exp(min(avg_loss, 100))
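    # Cast to a plain float so the value stored in the checkpoint history is not a NumPy scalar
    avg_grad_norm = float(np.mean(grad_norms)) if grad_norms else 0.0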
    
    return avg_loss, perplexity, avg_grad_norm

print("✓ Training function defined (Harvard NLP Batch class pattern)")

### Validation Function

In [None]:
@torch.no_grad()
def validate(model, val_loader, criterion, device):
    """
    Validate the model using Harvard NLP's Batch class pattern
    
    Same approach as training but without gradients
    """
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    for data in tqdm(val_loader, desc="Validation"):
        input_ids = data['input_ids'].to(device)
        
        # Harvard NLP pattern: Use Batch class for automatic masking
        batch = Batch(
            src=input_ids,
            tgt=input_ids,
            pad=tokenizer.pad_token_id
        )
        
        # Forward pass
        decoder_output = model.forward(
            batch.src,
            batch.tgt,
            batch.src_mask,
            batch.tgt_mask
        )
        
        # Generate log probabilities
        log_probs = model.generator(decoder_output)
        
        # Compute loss
        loss_sum = criterion(
            log_probs.reshape(-1, config['tgt_vocab']),
            batch.tgt_y.reshape(-1)
        )
        
        # Accumulate
        num_tokens = batch.ntokens.item()
        total_loss += loss_sum.item()
        total_tokens += num_tokens
    
    # Compute averages
    avg_loss = total_loss / total_tokens if total_tokens > 0 else float('inf')
    perplexity = math.exp(min(avg_loss, 100))
    return avg_loss, perplexity

print("✓ Validation function defined (Harvard NLP Batch class pattern)")

### Generation Evaluation Function

In [None]:
@torch.no_grad()
def evaluate_generation(model, tokenizer, device, epoch):
    """Generate samples to monitor progress"""
    model.eval()
    generator = TextGenerator(model, tokenizer, device=device)
    
    # Wikipedia-style prompts (matching WikiText-2)
    prompts = [
        "The Roman Empire",
        "Albert Einstein",
        "World War II"
    ]
    
    print(f"\n{'='*70}")
    print(f"Sample Generations (Epoch {epoch})")
    print(f"{'='*70}")
    
    for prompt in prompts:
        try:
            text = generator.generate_with_temperature(
                prompt, temperature=0.8, max_length=40
            )
            print(f"\nPrompt: \"{prompt}\"")
            print(f"Output: {text}")
        except Exception as e:
            print(f"\nPrompt: \"{prompt}\"")
            print(f"Error: {str(e)}")
    
    print(f"{'='*70}")
    model.train()

print("✓ Generation evaluation function defined")

## 🚀 Main Training Loop

In [None]:
# Training history
history = {
    'train_loss': [],
    'train_ppl': [],
    'val_loss': [],
    'val_ppl': [],
    'grad_norm': [],
    'epoch_time': []
}

# Early stopping
early_stopping = EarlyStopping(
    patience=config['early_stop_patience'],
    min_delta=config['early_stop_min_delta']
)

best_val_loss = float('inf')
total_start_time = time.time()

print("\n" + "="*70)
print("🚀 STARTING TRAINING")
print("="*70)
print(f"Total epochs: {config['num_epochs']}")
print(f"Expected time: ~{config['num_epochs'] * 8.5:.0f} minutes")
print("="*70 + "\n")

for epoch in range(1, config['num_epochs'] + 1):
    epoch_start = time.time()
    
    print(f"\n{'='*70}")
    print(f"📅 Epoch {epoch}/{config['num_epochs']}")
    print(f"{'='*70}")
    
    # Train
    train_loss, train_ppl, grad_norm = train_epoch(
        model, train_loader, optimizer, scheduler, criterion, device, epoch
    )
    
    # Validate
    val_loss, val_ppl = validate(model, val_loader, criterion, device)
    
    epoch_time = time.time() - epoch_start
    
    # Save history
    history['train_loss'].append(train_loss)
    history['train_ppl'].append(train_ppl)
    history['val_loss'].append(val_loss)
    history['val_ppl'].append(val_ppl)
    history['grad_norm'].append(grad_norm)
    history['epoch_time'].append(epoch_time)
    
    # Print metrics
    print(f"\n📊 Results:")
    print(f"  Train Loss: {train_loss:.4f} | PPL: {train_ppl:.2f}")
    print(f"  Val Loss:   {val_loss:.4f} | PPL: {val_ppl:.2f}")
    print(f"  Grad Norm:  {grad_norm:.4f}")
    print(f"  Time: {format_time(epoch_time)}")
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        checkpoint_path = f"{CHECKPOINT_DIR}/best_model_epoch{epoch}.pt"
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'train_loss': train_loss,
            'val_loss': val_loss,
            'val_ppl': val_ppl,
            'config': config,
            'history': history
        }, checkpoint_path)
        print(f"  ✅ Best model saved! (Val Loss: {val_loss:.4f})")
    
    # Sample generation every epoch
    if epoch % 1 == 0:
        evaluate_generation(model, tokenizer, device, epoch)
    
    # Early stopping check
    if early_stopping(val_loss):
        print(f"\n⚠️ Early stopping triggered at epoch {epoch}")
        break

total_time = time.time() - total_start_time

print("\n" + "="*70)
print("✅ TRAINING COMPLETE!")
print("="*70)
print(f"Total time: {format_time(total_time)}")
print(f"Best val loss: {best_val_loss:.4f}")
print(f"Final train PPL: {history['train_ppl'][-1]:.2f}")
print(f"Final val PPL: {history['val_ppl'][-1]:.2f}")
print(f"\nCheckpoints saved at: {CHECKPOINT_DIR}")
print("="*70)

## 📊 Visualize Training Results

In [None]:
epochs = range(1, len(history['train_loss']) + 1)

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Loss
axes[0, 0].plot(epochs, history['train_loss'], 'b-', label='Train', linewidth=2)
axes[0, 0].plot(epochs, history['val_loss'], 'r-', label='Validation', linewidth=2)
axes[0, 0].set_xlabel('Epoch', fontsize=12)
axes[0, 0].set_ylabel('Loss', fontsize=12)
axes[0, 0].set_title('Training and Validation Loss', fontsize=13, fontweight='bold')
axes[0, 0].legend(fontsize=11)
axes[0, 0].grid(True, alpha=0.3)

# Perplexity
axes[0, 1].plot(epochs, history['train_ppl'], 'b-', label='Train', linewidth=2)
axes[0, 1].plot(epochs, history['val_ppl'], 'r-', label='Validation', linewidth=2)
axes[0, 1].set_xlabel('Epoch', fontsize=12)
axes[0, 1].set_ylabel('Perplexity', fontsize=12)
axes[0, 1].set_title('Training and Validation Perplexity', fontsize=13, fontweight='bold')
axes[0, 1].legend(fontsize=11)
axes[0, 1].grid(True, alpha=0.3)

# Gradient norm
axes[1, 0].plot(epochs, history['grad_norm'], 'g-', linewidth=2)
axes[1, 0].set_xlabel('Epoch', fontsize=12)
axes[1, 0].set_ylabel('Gradient Norm', fontsize=12)
axes[1, 0].set_title('Average Gradient Norm per Epoch', fontsize=13, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Epoch time
axes[1, 1].bar(epochs, history['epoch_time'], color='purple', alpha=0.7)
axes[1, 1].set_xlabel('Epoch', fontsize=12)
axes[1, 1].set_ylabel('Time (seconds)', fontsize=12)
axes[1, 1].set_title('Training Time per Epoch', fontsize=13, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(f'{CHECKPOINT_DIR}/training_curves.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✓ Training curves saved to {CHECKPOINT_DIR}/training_curves.png")

## 🔄 Reload Module (Fix Caching Issues)

In [None]:
# Force reload mha modules to get latest code
import sys
import importlib

print("Reloading mha modules...")

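# Remove the mha package and its submodules from the import cache
# (match by package name so unrelated modules containing "mha" are left alone)
modules_to_remove = [key for key in sys.modules if key == 'mha' or key.startswith('mha.')]
for module in modules_to_remove:
    del sys.modules[module]
    print(f"  Removed: {module}")

# Re-import (names imported earlier, such as Batch or make_model, still point to the old code)
from mha.inference import TextGenerator

print("\n✓ Modules reloaded! Text generation will use latest code.")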

## 🎯 Final Evaluation & Text Generation

### Load Best Model
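
The cell below picks the most recently created checkpoint file. As an alternative sketch that does not depend on file timestamps, the epoch number can be parsed from the filename pattern used when saving (`best_model_epoch{epoch}.pt`). `newest_best_checkpoint` is a hypothetical helper; it relies on every saved file being a new best, and, like the timestamp approach, it assumes the directory holds checkpoints from a single run:

```python
import os
import re

_EPOCH_RE = re.compile(r"best_model_epoch(\d+)\.pt$")

def newest_best_checkpoint(paths):
    """Return the path with the highest epoch number, or None if none match."""
    best_path, best_epoch = None, -1
    for path in paths:
        match = _EPOCH_RE.search(os.path.basename(path))
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path

paths = ["ckpt/best_model_epoch3.pt", "ckpt/best_model_epoch12.pt", "ckpt/notes.txt"]
print(newest_best_checkpoint(paths))
```

Comparing the parsed integers matters here: a plain string sort would rank `epoch3` above `epoch12`.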

In [None]:
import glob

# Find best checkpoint
checkpoint_files = glob.glob(f"{CHECKPOINT_DIR}/best_model_epoch*.pt")

if checkpoint_files:
    latest_checkpoint = max(checkpoint_files, key=os.path.getctime)
    print(f"Loading best model: {latest_checkpoint}")
    
    checkpoint = torch.load(latest_checkpoint, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
    print(f"\n✅ Best model loaded!")
    print(f"  Epoch: {checkpoint['epoch']}")
    print(f"  Train Loss: {checkpoint['train_loss']:.4f}")
    print(f"  Val Loss: {checkpoint['val_loss']:.4f}")
    print(f"  Val Perplexity: {checkpoint['val_ppl']:.2f}")
else:
    print("❌ No checkpoint found!")

### Comprehensive Text Generation Test
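
The `temperature` argument used below controls how sharply the next-token distribution is peaked. The sketch assumes the common definition (divide the logits by the temperature before the softmax); whether `TextGenerator.generate_with_temperature` does exactly this is an assumption, and `softmax_with_temperature` is a hypothetical name:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # toy values
for t in (1.0, 0.8, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, so sampling at a very low temperature behaves much like greedy decoding.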

In [None]:
# Create generator with reloaded module
generator = TextGenerator(model, tokenizer, device=device)

# Wikipedia-style prompts (matching WikiText-2 training data)
prompts = [
    "The Roman Empire",
    "Albert Einstein was a",
    "World War II began when",
    "The human brain",
    "Isaac Newton discovered"
]

print("\n" + "="*70)
print("🎯 FINAL TEXT GENERATION EVALUATION")
print("="*70)

for i, prompt in enumerate(prompts, 1):
    print(f"\n{i}. Prompt: \"{prompt}\"")
    print("-" * 70)
    
    try:
        # Greedy
        greedy_text = generator.generate_greedy(prompt, max_length=50)
        print(f"  Greedy:      {greedy_text}")
        
        # Temperature
        temp_text = generator.generate_with_temperature(
            prompt, temperature=0.8, max_length=50
        )
        print(f"  Temperature: {temp_text}")
        
    except Exception as e:
        print(f"  Error: {str(e)}")

print("\n" + "="*70)
print("✓ Generation test complete!")
print("="*70)

### Generation Quality Metrics
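
The metrics in the next cell operate on whitespace-separated words rather than tokenizer tokens. A worked example of the two ratios on a toy string (the function name `repetition_stats` is hypothetical):

```python
def repetition_stats(text):
    """Return (repetition_rate, vocab_diversity) over whitespace-split words."""
    words = text.split()
    if not words:
        return 0.0, 0.0
    diversity = len(set(words)) / len(words)
    return 1 - diversity, diversity

rep, div = repetition_stats("the cat sat on the mat the end")
print(f"repetition={rep:.3f} diversity={div:.3f}")
```

The toy string has 8 words, 6 of them distinct. Since the repetition rate is defined as one minus the diversity, the two numbers always sum to 1 and carry the same information.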

In [None]:
def calculate_generation_metrics(generator, prompts, num_samples=5):
    """Calculate quality metrics for generated text"""
    metrics = {
        'length': [],
        'unique_tokens': [],
        'repetition_rate': [],
        'vocab_diversity': []
    }
    
    print("Calculating generation quality metrics...")
    
    for prompt in tqdm(prompts):
        for _ in range(num_samples):
            try:
                text = generator.generate_with_temperature(
                    prompt, temperature=0.8, max_length=50
                )
                tokens = text.split()
                
                metrics['length'].append(len(tokens))
                metrics['unique_tokens'].append(len(set(tokens)))
                
                if len(tokens) > 0:
                    repetition = 1 - (len(set(tokens)) / len(tokens))
                    metrics['repetition_rate'].append(repetition)
                    metrics['vocab_diversity'].append(len(set(tokens)) / len(tokens))
            except Exception:  # skip failed generations, but let KeyboardInterrupt through
                continue
    
    print("\n" + "="*70)
    print("📊 GENERATION QUALITY METRICS")
    print("="*70)
    print(f"  Average length:      {np.mean(metrics['length']):.1f} tokens")
    print(f"  Average unique:      {np.mean(metrics['unique_tokens']):.1f} tokens")
    print(f"  Repetition rate:     {np.mean(metrics['repetition_rate']):.2%}")
    print(f"  Vocabulary diversity: {np.mean(metrics['vocab_diversity']):.2%}")
    print("="*70)
    
    return metrics

# Calculate metrics
metrics = calculate_generation_metrics(generator, prompts[:3], num_samples=3)

## 📝 Summary Report

In [None]:
print("\n" + "="*70)
print("📝 TRAINING SUMMARY REPORT")
print("="*70)

print("\n🏗️ Model Architecture:")
print(f"  Type: Harvard NLP EncoderDecoder")
print(f"  Layers: {config['N']} × (Encoder + Decoder)")
print(f"  Parameters: {total_params:,} ({total_params/1e6:.1f}M)")
print(f"  Model dimension: {config['d_model']}")
print(f"  Attention heads: {config['h']}")

print("\n📚 Training Data:")
print(f"  Dataset: WikiText-2")
print(f"  Train samples: {len(train_dataset):,}")
print(f"  Val samples: {len(val_dataset):,}")
print(f"  Sequence length: {len(train_dataset[0]['input_ids'])} tokens")

print("\n⚙️ Training Configuration:")
print(f"  Epochs completed: {len(history['train_loss'])}")
print(f"  Batch size: {config['batch_size']}")
print(f"  Warmup steps: {config['warmup_steps']}")
print(f"  Gradient clipping: {config['gradient_clip']}")

print("\n📊 Final Results:")
print(f"  Train Loss: {history['train_loss'][-1]:.4f}")
print(f"  Train PPL: {history['train_ppl'][-1]:.2f}")
print(f"  Val Loss: {history['val_loss'][-1]:.4f}")
print(f"  Val PPL: {history['val_ppl'][-1]:.2f}")
print(f"  Best Val Loss: {best_val_loss:.4f}")

improvement_loss = ((history['train_loss'][0] - history['train_loss'][-1]) / history['train_loss'][0]) * 100
improvement_ppl = ((history['train_ppl'][0] - history['train_ppl'][-1]) / history['train_ppl'][0]) * 100

print("\n📈 Improvement:")
print(f"  Loss reduction: {improvement_loss:.1f}%")
print(f"  PPL reduction: {improvement_ppl:.1f}%")

print("\n⏱️ Training Time:")
print(f"  Total: {format_time(sum(history['epoch_time']))}")
print(f"  Avg per epoch: {format_time(np.mean(history['epoch_time']))}")

print("\n💾 Checkpoints:")
print(f"  Location: {CHECKPOINT_DIR}")
print(f"  Best model: epoch {checkpoint['epoch']}")

print("\n" + "="*70)
print("✅ Training complete! Model ready for inference.")
print("="*70)

## 🎉 Conclusion

### What We Achieved

1. ✅ **Trained a Transformer** using Harvard NLP's Annotated Transformer patterns
2. ✅ **Used Batch Class** for automatic masking (proper Harvard NLP way)
3. ✅ **Extended training** from 3 to 20 epochs for better quality
4. ✅ **Monitored progress** with generation samples after each epoch
5. ✅ **Tracked metrics**: Loss, perplexity, gradient norms, generation quality
6. ✅ **Early stopping** to prevent overfitting
7. ✅ **Saved checkpoints** to Google Drive

### Harvard NLP Patterns Used

This implementation follows these Annotated Transformer patterns:

| Pattern | What It Does |
|---------|-------------|
| `make_model()` | Factory function for model creation with proper initialization |
| `Batch` class | Automatic src/tgt splitting, masking, token counting |
| `rate()` | Learning rate schedule: d_model^(-0.5) * min(step^(-0.5), step*warmup^(-1.5)) |
| `EncoderDecoder` | Main architecture wrapper |
| `Generator` | Log softmax projection layer |
| Loss calculation | Sum reduction + normalize by token count |
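
The last row is what makes the reported perplexity a per-token quantity: summed losses and token counts are accumulated separately and divided once at the end. A sketch with made-up batch numbers (`corpus_perplexity` is a hypothetical name; the cap of 100 mirrors the `min(avg_loss, 100)` guard in the training code):

```python
import math

def corpus_perplexity(batches):
    """batches: list of (summed_nll, num_tokens) pairs."""
    total_loss = sum(loss for loss, _ in batches)
    total_tokens = sum(n for _, n in batches)
    if total_tokens == 0:
        return float("inf"), float("inf")
    avg_loss = total_loss / total_tokens
    return avg_loss, math.exp(min(avg_loss, 100))

# Made-up numbers: a long batch and a shorter, harder batch.
batches = [(500.0, 100), (360.0, 60)]
avg_loss, ppl = corpus_perplexity(batches)
print(f"avg loss {avg_loss:.3f}, perplexity {ppl:.1f}")
```

Averaging the two per-batch means instead (5.0 and 6.0) would give 5.5 rather than the token-weighted 5.375, which is why the sums are kept separate until the end.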

### Next Steps

- 📈 **Train longer** (30-50 epochs) for even better results
- 🔧 **Tune hyperparameters** (learning rate, warmup, batch size)
- 📊 **Try larger models** (8-12 layers, 768 dimensions)
- 📚 **Use more data** (WikiText-103 for better generalization)
- 🎯 **Fine-tune** on specific domains (news, scientific text, etc.)
- 🔬 **Add label smoothing** (use `LabelSmoothing` class from `mha.utils`)

### Resources

- **Harvard NLP Annotated Transformer**: https://nlp.seas.harvard.edu/annotated-transformer/
- **Original Paper**: "Attention is All You Need" (Vaswani et al., 2017)
- **Your Implementation**: https://github.com/mohamedAtoui/LLM-Journey

---

**Built with ❤️ following Harvard NLP's educational materials**

**This notebook demonstrates the clean, modular approach of the Annotated Transformer!**