# Lab 10: MLOps, Advanced Fine-tuning & Evaluation

## Learning Objectives

This lab is a **comprehensive reference** covering three distinct areas. By the end, you will be able to:

### Part 1: MLOps & Deployment
1. Explain the ML lifecycle and where MLOps fits
2. Track experiments with MLflow
3. Deploy a model with Gradio for interactive inference
4. Understand model monitoring concepts (data drift, concept drift)

### Part 2: Advanced Fine-tuning
5. Derive LoRA and calculate parameter savings
6. Implement a minimal LoRA layer from scratch
7. Understand quantization (INT8, INT4, NF4)
8. Explain when to use full fine-tuning vs PEFT methods

### Part 3: Evaluation Deep-dive
9. Derive ROUGE-N and ROUGE-L from first principles
10. Implement ROUGE from scratch
11. Explain BLEU and its brevity penalty
12. Understand semantic evaluation (BERTScore)

## Prerequisites

- **Lab 9 (Transformers)**: Understand transformer architecture and basic fine-tuning
- **Lab 9 Part 7 (Optional)**: HuggingFace Trainer basics - recommended for Part 2

## Lab Structure

Each part is self-contained. You can:
- Work through all three sequentially
- Focus on one part based on your interests
- Use as reference material for projects

In [None]:
# ==== Environment Setup ====
import os
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("✓ Running on Google Colab")
else:
    print("✓ Running locally")

def download_file(url: str, filename: str) -> str:
    """Download file if it doesn't exist. Works on both Colab and local."""
    if os.path.exists(filename):
        print(f"✓ {filename} already exists")
        return filename
    
    print(f"Downloading {filename}...")
    if IN_COLAB:
        import subprocess
        subprocess.run(['wget', '-q', url, '-O', filename], check=True)
    else:
        import urllib.request
        urllib.request.urlretrieve(url, filename)
    print(f"✓ Downloaded {filename}")
    return filename

In [None]:
# ==== Device Setup ====
import torch

def get_device():
    """Get best available device: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f"✓ Using CUDA GPU: {torch.cuda.get_device_name(0)}")
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        device = torch.device('mps')
        print("✓ Using Apple MPS (Metal)")
    else:
        device = torch.device('cpu')
        print("✓ Using CPU")
    return device

DEVICE = get_device()

---

# Part 1: MLOps & Deployment

> **Goal:** Understand the production ML lifecycle beyond model training.

---

## 1.1 The ML Lifecycle

Training a model is only ~10-20% of a production ML system. The full lifecycle includes:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           ML LIFECYCLE                                      │
│                                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐               │
│  │   Data   │ -> │  Model   │ -> │  Deploy  │ -> │ Monitor  │ ──┐           │
│  │ Pipeline │    │ Training │    │ & Serve  │    │ & Retrain│   │           │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │           │ 
│       ▲                                                         │           │
│       └─────────────────────────────────────────────────────────┘           │
│                         Continuous Feedback Loop                            │
└─────────────────────────────────────────────────────────────────────────────┘
```

**MLOps** brings DevOps practices to ML:
- **Version control** for data, models, and experiments
- **CI/CD** for model training and deployment
- **Monitoring** for model performance in production
- **Automation** of the feedback loop

<details>
<summary><b>Q: Why is experiment tracking important even for personal projects?</b></summary>

**A:** Even solo, you'll forget:
- Which hyperparameters produced which results
- Why you made certain decisions
- What you already tried (avoiding redundant experiments)

Experiment tracking creates a "lab notebook" for ML - essential for reproducibility and iterating effectively.
</details>

## 1.2 Experiment Tracking

### Why Track Experiments?

Without tracking, you end up with:
- `model_final.pt`, `model_final_v2.pt`, `model_ACTUAL_final.pt`
- Forgotten hyperparameters
- Unreproducible results

### Popular Tools

| Tool | Pros | Cons |
|------|------|------|
| **MLflow** | Open source, self-hosted, simple | Fewer collaboration features |
| **Weights & Biases** | Great UI, collaboration, free tier | Cloud-based, data leaves your machine |
| **TensorBoard** | Built into PyTorch/TF | Limited to metrics/graphs |
| **Neptune** | Great for teams | Paid for advanced features |

In [None]:
# Install MLflow (uncomment if needed)
# !pip install mlflow --quiet

import mlflow
import mlflow.pytorch
import numpy as np

# Set up local experiment tracking
mlflow.set_tracking_uri("file:./mlruns")  # Local storage
mlflow.set_experiment("lab10_demo")

# Simulate a training run
def train_dummy_model(learning_rate: float, epochs: int):
    """Simulated training with MLflow logging."""
    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_param("learning_rate", learning_rate)
        mlflow.log_param("epochs", epochs)
        mlflow.log_param("model_type", "transformer")
        
        # Simulate training metrics
        for epoch in range(epochs):
            train_loss = 1.0 / (epoch + 1) + np.random.normal(0, 0.1)
            val_loss = 1.2 / (epoch + 1) + np.random.normal(0, 0.1)
            
            mlflow.log_metric("train_loss", train_loss, step=epoch)
            mlflow.log_metric("val_loss", val_loss, step=epoch)
        
        # Log final metrics
        mlflow.log_metric("final_train_loss", train_loss)
        mlflow.log_metric("final_val_loss", val_loss)
        
        print(f"Run logged with LR={learning_rate}, epochs={epochs}")
        print(f"Final loss: {val_loss:.4f}")

# Run a few experiments
train_dummy_model(learning_rate=1e-3, epochs=5)
train_dummy_model(learning_rate=1e-4, epochs=5)
train_dummy_model(learning_rate=1e-5, epochs=5)

print("\n✓ Check ./mlruns directory for logged experiments")
print("Run 'mlflow ui' in terminal to see the dashboard")

### MLflow Key Concepts

- **`mlflow.log_param()`**: Log hyperparameters (logged once per run)
- **`mlflow.log_metric()`**: Log metrics (can log multiple times with `step`)
- **`mlflow.log_artifact()`**: Log files (models, plots, configs)
- **`mlflow.start_run()`**: Context manager for a single experiment run

<details>
<summary><b>Q: What's the difference between a parameter and a metric?</b></summary>

**A:** 
- **Parameters** are inputs you set before training (learning rate, batch size, architecture)
- **Metrics** are outputs measured during/after training (loss, accuracy, F1)

Parameters are logged once; metrics can be logged at each step/epoch.
</details>

## 1.3 Model Versioning

### HuggingFace Hub

HuggingFace Hub provides:
- Model hosting with version control
- Model cards (documentation)
- Easy sharing and collaboration

```python
# Push model to Hub (assumes `model` and `tokenizer` are already loaded)
from huggingface_hub import login
from transformers import AutoModel

login()  # Authenticate

model.push_to_hub("your-username/model-name")
tokenizer.push_to_hub("your-username/model-name")

# Load from Hub
model = AutoModel.from_pretrained("your-username/model-name")
```

### Semantic Versioning for Models

Adopt version numbering like software:
- **v1.0.0**: Initial release
- **v1.1.0**: New feature (new task capability)
- **v1.0.1**: Bug fix (training data fix, preprocessing fix; weights stay compatible)
- **v2.0.0**: Breaking change (different tokenizer, incompatible weights)
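
The bump rules above can be sketched as a small helper. This is only an illustration: `bump_version` and its `"major"` / `"minor"` / `"patch"` labels are hypothetical names, not part of any library.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next semantic version for a given kind of change.

    `change` is "major", "minor", or "patch" (labels assumed for this sketch).
    """
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if change == "major":   # breaking change: new tokenizer, incompatible weights
        return f"v{major + 1}.0.0"
    if change == "minor":   # new, backwards-compatible capability
        return f"v{major}.{minor + 1}.0"
    if change == "patch":   # bug fix
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")

print(bump_version("v1.0.0", "minor"))
print(bump_version("v1.1.0", "major"))
```

Note that a minor or major bump resets the lower-order numbers to zero.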

## 1.4 Model Serving

Once trained, models need to be accessible. Options:

| Approach | Use Case | Complexity |
|----------|----------|------------|
| **Gradio** | Quick demos, prototypes | Low |
| **Streamlit** | Data apps, dashboards | Low |
| **FastAPI** | Production APIs | Medium |
| **TorchServe** | Scalable PyTorch serving | High |
| **Triton** | Multi-framework, GPU optimized | High |

In [None]:
# Install Gradio (uncomment if needed)
# !pip install gradio --quiet

import gradio as gr
from transformers import pipeline

# Load a small summarization model
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=-1)

def summarize_text(text: str, max_length: int = 100) -> str:
    """Summarize input text using DistilBART."""
    if len(text.strip()) < 50:
        return "Please provide more text (at least 50 characters)."
    
    result = summarizer(text, max_length=max_length, min_length=30, do_sample=False)
    return result[0]['summary_text']

# Create Gradio interface
demo = gr.Interface(
    fn=summarize_text,
    inputs=[
        gr.Textbox(lines=10, placeholder="Enter text to summarize...", label="Input Text"),
        gr.Slider(50, 200, value=100, step=10, label="Max Summary Length")
    ],
    outputs=gr.Textbox(label="Summary"),
    title="Text Summarizer",
    description="Summarize long text using DistilBART (fine-tuned on CNN/DailyMail)",
    examples=[
        ["The transformer architecture has revolutionized natural language processing. " * 10, 100]
    ]
)

# Launch (in Colab, this creates a public URL)
# demo.launch(share=True)  # Uncomment to run
print("✓ Gradio interface defined. Uncomment demo.launch() to run.")

### FastAPI for Production APIs

For production, you need more control than Gradio provides:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

class SummarizeRequest(BaseModel):
    text: str
    max_length: int = 100

class SummarizeResponse(BaseModel):
    summary: str

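# Plain `def` (not `async def`): FastAPI runs it in a threadpool, so the
# blocking model call does not stall the event loop
@app.post("/summarize", response_model=SummarizeResponse)
def summarize(request: SummarizeRequest):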
    result = summarizer(request.text, max_length=request.max_length)
    return SummarizeResponse(summary=result[0]['summary_text'])

# Run with: uvicorn app:app --reload
```

**Key differences from Gradio:**
- Type-safe request/response validation
- OpenAPI documentation auto-generated
- Better for microservices architecture
- More control over authentication, rate limiting, etc.

<details>
<summary><b>Q: When would you use Gradio vs FastAPI?</b></summary>

**A:** 
- **Gradio**: Demos, prototypes, internal tools, non-technical stakeholders
- **FastAPI**: Production APIs, integration with other services, high traffic, need for authentication/monitoring
</details>

## 1.5 Model Monitoring

Deployed models degrade over time. Key concepts:

### Data Drift
The input distribution changes from training data.

**Example:** A sentiment model trained on product reviews starts receiving social media posts with different vocabulary and style.

### Concept Drift  
The relationship between inputs and outputs changes.

**Example:** A fraud detection model's patterns become outdated as fraudsters adapt their techniques.

### Monitoring Strategies

1. **Statistical tests** on input features (KL divergence, KS test)
2. **Performance monitoring** on labeled samples
3. **Prediction distribution** tracking
4. **Human-in-the-loop** feedback
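
As a rough illustration of the first strategy, the two-sample KS statistic (the largest gap between two empirical CDFs) can be computed with the standard library alone. The synthetic Gaussian features and the `DRIFT_THRESHOLD` value are assumptions made for this sketch; a real monitor would use a proper significance test (for example `scipy.stats.ks_2samp`).

```python
import random
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:  # the maximum gap is attained at one of the sample points
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

random.seed(0)
train_feature = [random.gauss(0.0, 1.0) for _ in range(1000)]  # training-time inputs
live_same = [random.gauss(0.0, 1.0) for _ in range(1000)]      # same distribution
live_shifted = [random.gauss(1.0, 1.0) for _ in range(1000)]   # mean shifted by 1

DRIFT_THRESHOLD = 0.1  # assumed alert threshold, chosen only for this demo
for name, live in [("same", live_same), ("shifted", live_shifted)]:
    stat = ks_statistic(train_feature, live)
    print(f"{name}: KS={stat:.3f}, drift={stat > DRIFT_THRESHOLD}")
```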

<details>
<summary><b>Q: What's the difference between data drift and concept drift?</b></summary>

**A:**
- **Data drift**: P(X) changes - the inputs look different
- **Concept drift**: P(Y|X) changes - the correct answer for similar inputs changes

Data drift is easier to detect (monitor input statistics). Concept drift requires ground truth labels to detect.
</details>
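
Because concept drift only shows up once labels arrive, one simple way to watch for it is a rolling accuracy over the most recent labeled predictions. The sketch below is a minimal illustration; the `RollingAccuracy` class, the window size, and the toy stream are all hypothetical.

```python
from collections import deque

class RollingAccuracy:
    """Track accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window: int = 100):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically

    def update(self, prediction, label) -> float:
        self.outcomes.append(prediction == label)
        return sum(self.outcomes) / len(self.outcomes)

monitor = RollingAccuracy(window=4)
# First four predictions are correct, then the true labels flip
# (simulated concept drift: same inputs, different correct answers)
stream = [(1, 1), (0, 0), (1, 1), (0, 0), (1, 0), (0, 1), (1, 0), (0, 1)]
history = [monitor.update(pred, label) for pred, label in stream]
print(history)
```

A sustained drop in the rolling value is a signal to investigate or retrain; the right window size and alert level depend on how quickly labels arrive.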

## Part 1: Exercise

### Exercise: Track a Real Training Run with MLflow

Implement experiment tracking for a simple model:

```python
import torch
import torch.nn as nn
import mlflow

class SimpleMLP(nn.Module):
    def __init__(self, input_size=784, hidden_size=128, output_size=10):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

def train_with_tracking(learning_rate: float, epochs: int, hidden_size: int):
    with mlflow.start_run():
        # TODO: Log hyperparameters (learning_rate, epochs, hidden_size)
        # TODO: Create model, optimizer, loss function
        # TODO: Training loop with metric logging each epoch
        # TODO: Log final accuracy
        pass

# Compare: train_with_tracking(1e-3, 10, 128) vs train_with_tracking(1e-4, 10, 256)
```

**Bonus:** Use `mlflow ui` to visualize and compare your experiments.

## Part 1: Key Takeaways

1. **MLOps** extends DevOps to ML - versioning, CI/CD, monitoring
2. **Experiment tracking** (MLflow, W&B) is essential for reproducibility
3. **Model serving** ranges from simple (Gradio) to production-grade (FastAPI, TorchServe)
4. **Monitor** for data drift and concept drift in production
5. Model training is only ~10-20% of a production ML system

---

<details>
<summary><b>Q: Why does MLflow use a local file store by default?</b></summary>

**A:** Local file storage (`file:./mlruns`) requires zero setup and no server management. For teams, you'd use a tracking server with a database backend for concurrent access.
</details>

<details>
<summary><b>Q: What's the difference between logging an artifact vs a metric?</b></summary>

**A:**
- **Metrics**: Numeric values that change (loss, accuracy) - shown in charts
- **Artifacts**: Files (models, plots, configs) - downloadable attachments
</details>

---

# Part 2: Advanced Fine-tuning

> **Prerequisite:** This section builds on Lab 9 Part 7 (Fine-tuning Pre-trained Transformers). 
> Review HuggingFace Trainer basics before continuing.

> **Goal:** Understand parameter-efficient fine-tuning methods that reduce memory and compute.

---

## 2.1 Why Parameter-Efficient Fine-tuning?

### The Problem: Full Fine-tuning is Expensive

Consider fine-tuning LLaMA-7B:

| Component | Memory |
|-----------|--------|
| Model weights (fp32) | 28 GB |
| Gradients | 28 GB |
| Optimizer states (Adam) | 56 GB |
| **Total** | **~112 GB** |

This far exceeds any consumer GPU, and does not even fit on a single A100-80GB.
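
The arithmetic behind the table can be reproduced with a short helper. This is a back-of-the-envelope sketch: `full_finetune_memory_gb` is a hypothetical name, 1 GB is taken as 10⁹ bytes, and activations are ignored.

```python
def full_finetune_memory_gb(num_params: float, bytes_per_param: int = 4) -> dict:
    """Estimate memory for full fine-tuning with Adam (activations excluded)."""
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param          # one gradient per weight
    adam_states = 2 * num_params * bytes_per_param    # first and second moments
    total = weights + gradients + adam_states
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "adam_states": adam_states / 1e9,
        "total": total / 1e9,
    }

estimate = full_finetune_memory_gb(7e9)  # 7B parameters in fp32
for component, gb in estimate.items():
    print(f"{component:>12}: {gb:6.1f} GB")
```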

### The Solution: PEFT (Parameter-Efficient Fine-Tuning)

Instead of updating all parameters, update only a small subset:
- **LoRA**: Add low-rank matrices to attention layers
- **Prefix Tuning**: Prepend learnable tokens
- **Adapters**: Insert small bottleneck layers
- **QLoRA**: LoRA + quantization for even lower memory

## 2.2 LoRA: Low-Rank Adaptation

### The Key Insight

The weight **updates** learned during fine-tuning have **low intrinsic rank**: even when the pre-trained matrix itself is full rank, the change $\Delta W$ can be approximated well by a low-rank decomposition.

### Mathematical Formulation

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$:

**Full fine-tuning:** $W = W_0 + \Delta W$ where $\Delta W \in \mathbb{R}^{d \times k}$

**LoRA:** $W = W_0 + BA$ where:
- $B \in \mathbb{R}^{d \times r}$
- $A \in \mathbb{R}^{r \times k}$
- $r \ll \min(d, k)$ (typically r = 4, 8, or 16)

### Parameter Savings

**Full fine-tuning parameters:** $d \times k$

**LoRA parameters:** $d \times r + r \times k = r(d + k)$

**Parameter ratio (LoRA / full):** $\frac{r(d + k)}{dk} = \frac{2r}{d}$ when $d = k$

<details>
<summary><b>Q: If d=4096, k=4096, r=8, how many parameters does LoRA add vs full fine-tuning?</b></summary>

**A:**
- Full fine-tuning: $4096 \times 4096 = 16.8M$ parameters
- LoRA: $8 \times (4096 + 4096) = 65.5K$ parameters
- **Ratio:** $65.5K / 16.8M = 0.39\%$ - over 250x fewer parameters!
</details>
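
The same count can be checked in a few lines of Python. The helper name `lora_param_counts` is made up for this sketch; it simply evaluates the formulas from this section.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Trainable parameters for full fine-tuning vs. LoRA on one d x k matrix."""
    full = d * k          # every entry of the weight update is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return {
        "full": full,
        "lora": lora,
        "fraction": lora / full,    # LoRA parameters as a share of full
        "reduction": full / lora,   # how many times fewer parameters
    }

counts = lora_param_counts(d=4096, k=4096, r=8)
print(f"Full: {counts['full']:,}  LoRA: {counts['lora']:,}")
print(f"LoRA trains {counts['fraction']:.2%} of the parameters "
      f"({counts['reduction']:.0f}x fewer)")
```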

### Initialization

- $A$ is initialized with random Gaussian
- $B$ is initialized to zero

This ensures $BA = 0$ at the start, so the model begins identical to the pre-trained weights.

### Scaling Factor

The final update is scaled: $W = W_0 + \frac{\alpha}{r} BA$

where $\alpha$ is a hyperparameter controlling the magnitude of updates.
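
To make the initialization and scaling concrete, here is a tiny pure-Python sketch with hand-picked 2×2 numbers (all values are illustrative assumptions). It shows that $B = 0$ leaves the weights untouched and that the update is multiplied by $\alpha / r$.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W0, B, A, alpha, r):
    """Compute W = W0 + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W0, BA)]

W0 = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight, d = k = 2
A = [[0.5, -0.5]]               # r x k with r = 1 (random in practice)
B_init = [[0.0], [0.0]]         # d x r, zero at initialization
B_trained = [[1.0], [2.0]]      # made-up values standing in for a trained B

print(lora_effective_weight(W0, B_init, A, alpha=2, r=1))     # equals W0
print(lora_effective_weight(W0, B_trained, A, alpha=2, r=1))  # W0 plus scaled BA
```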

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class LoRALinear(nn.Module):
    """
    Linear layer with LoRA (Low-Rank Adaptation).
    
    During training, computes: output = x @ W_0.T + (x @ A.T @ B.T) * (alpha/r)
    W_0 is frozen; only A and B are trained.
    """
    def __init__(
        self, 
        in_features: int, 
        out_features: int, 
        r: int = 8,           # LoRA rank
        alpha: float = 16,    # Scaling factor
        dropout: float = 0.0,
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.r = r
        self.alpha = alpha
        self.scaling = alpha / r
        
        # Pre-trained weight: randomly initialized here for the demo, then frozen
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.weight.requires_grad = False  # Freeze!
        
        # LoRA matrices
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        
        # Initialize A with random Gaussian, B with zeros
        nn.init.normal_(self.lora_A, std=1/r)
        # B is already zeros - ensures BA=0 at start
        
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        self.merged = False
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        
        if self.merged:
            # During inference with merged weights
            return F.linear(x, self.weight)
        
        # Frozen base computation
        base_output = F.linear(x, self.weight)  # (batch, seq_len, out_features)
        
        # LoRA computation: x @ A.T @ B.T * scaling
        lora_output = self.dropout(x)
        lora_output = F.linear(lora_output, self.lora_A)  # (batch, seq_len, r)
        lora_output = F.linear(lora_output, self.lora_B)  # (batch, seq_len, out_features)
        lora_output = lora_output * self.scaling
        
        return base_output + lora_output
    
    def merge_weights(self):
        """Merge LoRA weights into base weight for efficient inference."""
        if not self.merged:
            # W_merged = W_0 + (B @ A) * scaling
            self.weight.data += (self.lora_B @ self.lora_A) * self.scaling
            self.merged = True
    
    def unmerge_weights(self):
        """Unmerge for continued training."""
        if self.merged:
            self.weight.data -= (self.lora_B @ self.lora_A) * self.scaling
            self.merged = False

# Test our implementation
print("Testing LoRALinear...")
layer = LoRALinear(in_features=512, out_features=512, r=8, alpha=16)

# Count parameters
frozen_params = layer.weight.numel()
trainable_params = layer.lora_A.numel() + layer.lora_B.numel()

print(f"Frozen parameters: {frozen_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Ratio: {trainable_params/frozen_params*100:.2f}%")

# Test forward pass
x = torch.randn(2, 10, 512)
y = layer(x)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {y.shape}")

# Test merging
layer.merge_weights()
y_merged = layer(x)
print(f"\nOutputs equal after merge: {torch.allclose(y, y_merged, atol=1e-5)}")

In [None]:
# Visualize LoRA Parameter Savings
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Parameter savings vs rank
d = 4096  # Typical hidden dimension
ranks = [1, 2, 4, 8, 16, 32, 64, 128]
params_full = d * d
params_lora = [r * (d + d) for r in ranks]
savings_pct = [(1 - p/params_full) * 100 for p in params_lora]

axes[0].bar(range(len(ranks)), savings_pct, color='steelblue')
axes[0].set_xticks(range(len(ranks)))
axes[0].set_xticklabels(ranks)
axes[0].set_xlabel('LoRA Rank (r)')
axes[0].set_ylabel('Parameter Savings (%)')
axes[0].set_title(f'LoRA Savings for d={d}')
axes[0].set_ylim(90, 100)

# Plot 2: Trainable parameters comparison
methods = ['Full FT', 'LoRA r=64', 'LoRA r=16', 'LoRA r=8', 'LoRA r=4']
params = [params_full, 64*(d+d), 16*(d+d), 8*(d+d), 4*(d+d)]
colors = ['firebrick'] + ['steelblue']*4

axes[1].barh(methods, [p/1e6 for p in params], color=colors)
axes[1].set_xlabel('Trainable Parameters (Millions)')
axes[1].set_title('Parameter Count Comparison')

plt.tight_layout()
plt.show()

print(f'For a {d}x{d} weight matrix:')
print(f'  Full fine-tuning: {params_full:,} parameters')
print(f'  LoRA r=8: {8*(d+d):,} parameters ({8*(d+d)/params_full*100:.3f}%)')

<details>
<summary><b>Q: Why does LoRA work despite using such low rank?</b></summary>

**A:** The key insight from the LoRA paper is that fine-tuning updates have **low intrinsic dimensionality**. Even though pre-trained weights are full rank, the *changes* needed for a specific task lie in a low-dimensional subspace. LoRA exploits this by only learning updates in that subspace.

Empirically, the LoRA paper reports that small ranks (such as r=4 or r=8) often match or come close to full fine-tuning quality on the tasks it evaluates.
</details>

<details>
<summary><b>Q: Why initialize B to zero?</b></summary>

**A:** With B=0, the initial LoRA update is BA=0, so the model starts exactly at the pre-trained weights. This is crucial for stability - we want to make small adjustments from a good starting point, not random perturbations.
</details>

In [None]:
# Install PEFT (uncomment if needed)
# !pip install peft --quiet

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load base model
model_name = "google-t5/t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                    # Rank
    lora_alpha=16,          # Scaling
    lora_dropout=0.1,
    target_modules=["q", "v"],  # Apply to query and value projections
)

# Compare parameter counts
def count_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Count the base model BEFORE applying LoRA: get_peft_model modifies the
# model in place and freezes its original weights
base_trainable, base_total = count_parameters(model)

# Apply LoRA
peft_model = get_peft_model(model, lora_config)
peft_trainable, peft_total = count_parameters(peft_model)

print("Base model:")
print(f"  Total parameters: {base_total:,}")
print(f"  Trainable: {base_trainable:,} ({base_trainable/base_total*100:.0f}%)")
print("\nPEFT model (LoRA r=8):")
print(f"  Total parameters: {peft_total:,}")
print(f"  Trainable: {peft_trainable:,} ({peft_trainable/peft_total*100:.2f}%)")
print(f"\nReduction in trainable parameters: {(1 - peft_trainable/base_trainable)*100:.1f}%")

## 2.3 Quantization Fundamentals

Quantization reduces memory by using lower precision numbers.

### Precision Comparison

| Format | Bits | Range | Memory (7B params) |
|--------|------|-------|-------------------|
| FP32 | 32 | ±3.4×10³⁸ | 28 GB |
| FP16 | 16 | ±6.5×10⁴ | 14 GB |
| BF16 | 16 | ±3.4×10³⁸ (FP32 range, fewer mantissa bits) | 14 GB |
| INT8 | 8 | -128 to 127 | 7 GB |
| INT4 | 4 | -8 to 7 | 3.5 GB |
| NF4 | 4 | 16 levels spaced by normal-distribution quantiles | 3.5 GB |

### Quantization Approaches

**Post-Training Quantization (PTQ)**
- Quantize after training
- Fast, no retraining needed
- Some accuracy loss

**Quantization-Aware Training (QAT)**
- Simulate quantization during training
- Model learns to handle low precision
- Better accuracy, slower training
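
A minimal sketch of post-training quantization, assuming the simplest scheme (symmetric "absmax" INT8 with one scale for the whole tensor). Real libraries usually quantize per channel or per block; the function names here are illustrative.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("int8 values:", q)
print("max round-trip error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

The round-trip error is bounded by half the scale, which is the "some accuracy loss" that PTQ accepts in exchange for 4x smaller weights than fp32.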

### NF4 (Normal Float 4)

NF4 is optimized for normally distributed weights (common in neural networks):
- Places its 16 representable values at quantiles of a normal distribution, so levels are densest near zero, where most weights lie
- Better precision than uniform INT4 for typical weight distributions
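
The idea can be illustrated by placing 16 levels at evenly spaced quantiles of a standard normal and comparing them with a uniform grid. This is a simplified sketch, not the exact NF4 codebook: the real one is constructed somewhat differently (for example, it includes an exact zero). The function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(bits: int = 4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    max_abs = max(abs(q) for q in quantiles)
    return [q / max_abs for q in quantiles]

def uniform_levels(bits: int = 4):
    """Evenly spaced levels on [-1, 1], as in uniform INT4."""
    n = 2 ** bits
    return [-1 + 2 * i / (n - 1) for i in range(n)]

def nearest_level(x: float, levels):
    """Quantize x to the closest representable level."""
    return min(levels, key=lambda level: abs(level - x))

nf_levels = normal_quantile_levels()
uni_levels = uniform_levels()
print("gap near zero  (normal vs uniform):",
      round(nf_levels[8] - nf_levels[7], 3), round(uni_levels[8] - uni_levels[7], 3))
print("gap at the edge (normal vs uniform):",
      round(nf_levels[15] - nf_levels[14], 3), round(uni_levels[15] - uni_levels[14], 3))
```

The normal-quantile grid spends its resolution near zero and is coarser at the extremes, which suits bell-shaped weight distributions.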

## 2.4 QLoRA: 4-bit Fine-tuning

QLoRA combines quantization with LoRA for extreme memory efficiency:

1. **Quantize base model** to 4-bit (NF4)
2. **Add LoRA adapters** in higher precision (fp16/bf16)
3. **Train only the adapters**

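### Memory Comparison (LLaMA-7B)

The figures below are rough estimates; actual memory also depends on batch size, sequence length, and activation storage, and relative performance varies by task.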

| Method | Memory | Performance |
|--------|--------|-------------|
| Full Fine-tuning (fp32) | ~112 GB | 100% (baseline) |
| Full Fine-tuning (fp16) | ~56 GB | ~99.5% |
| LoRA (fp16) | ~14 GB | ~98% |
| QLoRA (4-bit + LoRA) | ~6 GB | ~95-97% |

### QLoRA Code (Conceptual - needs bitsandbytes + GPU)

```python
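import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # Prepare the quantized model for training

# LoRA config for a causal LM; LLaMA names its projections q_proj / v_proj
# (the T5 config earlier used "q" / "v")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

# Apply LoRA on top
peft_model = get_peft_model(model, lora_config)
# Now you can fine-tune a 7B model on a single GPU!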
```

> **Note:** This requires `bitsandbytes` and a CUDA GPU. The code is shown for reference.

<details>
<summary><b>Q: When would you choose QLoRA over LoRA?</b></summary>

**A:** Use QLoRA when:
- You have limited GPU memory (< 16GB)
- Working with very large models (7B+)
- Willing to accept ~2-5% accuracy trade-off
- Need to iterate quickly on large models

Use standard LoRA when:
- You have sufficient memory
- Maximum accuracy is critical
- Working with smaller models where quantization overhead isn't worth it
</details>

## 2.5 PEFT Methods Comparison

| Method | Trainable % | Memory | Performance | Best For |
|--------|-------------|--------|-------------|----------|
| **Full Fine-tuning** | 100% | High | Baseline | Unlimited compute |
| **LoRA** | 0.1-1% | Low | ~98% | Most tasks |
| **QLoRA** | 0.1-1% | Very Low | ~95% | Large models, limited GPU |
| **Prefix Tuning** | <0.1% | Very Low | ~90-95% | Simple tasks |
| **Adapters** | 1-5% | Medium | ~97% | Multi-task learning |
| **(IA)³** | <0.01% | Very Low | ~90% | Extreme efficiency |

### When to Use What

- **Have plenty of compute?** → Full fine-tuning
- **Standard fine-tuning?** → LoRA (best balance)
- **Limited GPU memory?** → QLoRA
- **Many tasks, one model?** → Adapters
- **Extreme parameter efficiency?** → (IA)³ or prefix tuning

## 2.6 Memory-Efficient Training Strategies

Beyond PEFT, several techniques reduce training memory:

### Gradient Checkpointing

**Problem:** Storing activations for backprop uses lots of memory.

**Solution:** Don't store all activations; recompute them during backward pass.

```python
model.gradient_checkpointing_enable()
```

**Trade-off:** ~30% slower training, ~60% less memory.

<details>
<summary><b>Q: What's the trade-off of gradient checkpointing?</b></summary>

**A:** You trade compute for memory. Activations are recomputed during the backward pass instead of stored, so:
- **Memory:** Reduced by ~60% (only store checkpoints, not all activations)
- **Speed:** ~30% slower (recomputation cost)

Use when memory-constrained but have time budget.
</details>

### Mixed Precision Training (AMP)

Use fp16 for most operations, fp32 for sensitive ones:

```python
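from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()          # Clear gradients from the previous step
    with autocast():               # Run the forward pass in fp16 where safe
        loss = model(batch)
    scaler.scale(loss).backward()  # Scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # Unscale gradients, then take the optimizer step
    scaler.update()                # Adjust the scale factor for the next iteration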

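**Benefit:** roughly half the activation memory and often a ~1.5x or greater speedup on GPUs with tensor cores. Weights and optimizer states typically stay in fp32.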
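
Why does the loop scale the loss? Small gradients can underflow to zero in fp16. The sketch below imitates half-precision rounding with the standard library's `struct` module (format `"e"`); the gradient value and the scale factor of 65536 are assumptions chosen for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8        # smaller than anything fp16 can represent
loss_scale = 65536.0    # assumed scale factor (2**16)

unscaled = to_fp16(tiny_grad)              # underflows to zero
scaled = to_fp16(tiny_grad * loss_scale)   # fits comfortably in fp16
recovered = scaled / loss_scale            # unscale in higher precision

print(f"fp16 without scaling: {unscaled}")
print(f"fp16 with scaling, then unscaled: {recovered:.3e}")
```

Scaling the loss multiplies every gradient by the same factor, so dividing by it before the optimizer step recovers the true gradient up to fp16 rounding error.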

### Gradient Accumulation

Simulate larger batch sizes without memory increase:

```python
accumulation_steps = 4
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

**Effective batch size:** `actual_batch * accumulation_steps`
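
A small numeric check, using a hand-rolled 1-D linear model instead of PyTorch, shows why dividing each micro-batch loss by `accumulation_steps` reproduces the full-batch gradient. The data and the `mse_grad` helper are made up for this sketch, and the equality assumes equally sized micro-batches.

```python
def mse_grad(w: float, batch) -> float:
    """Gradient of mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # 8 (x, y) examples
w = 0.5

# One big batch of 8
full_grad = mse_grad(w, data)

# Four micro-batches of 2, each scaled by 1 / accumulation_steps
accumulation_steps = 4
micro_size = len(data) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    micro_batch = data[step * micro_size:(step + 1) * micro_size]
    accumulated_grad += mse_grad(w, micro_batch) / accumulation_steps

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated_grad:.6f}")
```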

## Part 2: Exercise

### Exercise: Implement and Compare LoRA

1. Use the `LoRALinear` class we implemented earlier
2. Create a small transformer model (2 layers, hidden_size=256)
3. Add LoRA to the attention query and value projections
4. Compare parameter counts and verify the math

```python
# Starter code
class TinyTransformerWithLoRA(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=256, num_layers=2, r=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        # TODO: Create attention layers with LoRALinear for Q and V
        # TODO: Create feed-forward layers (regular Linear)
        pass
    
    def forward(self, x):
        # TODO: Implement forward pass
        pass

# TODO: Count parameters and verify LoRA savings
```

## Part 2: Key Takeaways

1. **LoRA** adds low-rank matrices to frozen weights: $W = W_0 + BA$ with $r \ll d$
2. **Parameter savings:** ~99% fewer trainable parameters with ~98% performance
3. **QLoRA** combines 4-bit quantization with LoRA for extreme efficiency
4. **Memory strategies:** Gradient checkpointing, mixed precision, gradient accumulation
5. Choose PEFT method based on your memory/accuracy trade-off requirements

---

# Part 3: Evaluation Deep-dive

> **Goal:** Understand how to measure the quality of generated text.

---

## 3.1 Why Evaluation Matters

Generated text is hard to evaluate:
- No single "correct" answer
- Quality is subjective
- Different tasks need different metrics

### Automatic vs Human Evaluation

| Type | Pros | Cons |
|------|------|------|
| **Automatic** | Fast, reproducible, cheap | May not correlate with quality |
| **Human** | Captures true quality | Slow, expensive, subjective |

**Best practice:** Use automatic metrics for development, human evaluation for final assessment.

## 3.2 ROUGE: Recall-Oriented Understudy for Gisting Evaluation

ROUGE measures n-gram overlap between generated and reference text.

### ROUGE-N Formula

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{Ref}} \sum_{\text{gram}_n \in S} \text{Count}_{match}(\text{gram}_n)}{\sum_{S \in \text{Ref}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$$

**In plain English:** 
$$\text{ROUGE-N} = \frac{\text{Number of n-grams in reference that appear in candidate}}{\text{Total n-grams in reference}}$$

This is **recall** - what fraction of reference n-grams were captured?

### ROUGE Variants

| Variant | Measures | Use Case |
|---------|----------|----------|
| **ROUGE-1** | Unigram overlap | Content coverage |
| **ROUGE-2** | Bigram overlap | Fluency + content |
| **ROUGE-L** | Longest common subsequence | Sentence structure |

### Example Calculation

**Reference:** "the cat sat on the mat"

**Candidate:** "the cat was on the mat"

**ROUGE-1:**
- Reference unigrams: {the, cat, sat, on, the, mat} = {the: 2, cat: 1, sat: 1, on: 1, mat: 1}
- Candidate unigrams: {the: 2, cat: 1, was: 1, on: 1, mat: 1}
- Matches: the(2) + cat(1) + on(1) + mat(1) = 5
- Total reference: 6
- **ROUGE-1 = 5/6 ≈ 0.83**

<details>
<summary><b>Q: Why is ROUGE recall-oriented while BLEU is precision-oriented?</b></summary>

**A:** They serve different purposes:
- **ROUGE** (summarization): We care that important content is covered → recall
- **BLEU** (translation): We care that generated words are correct → precision

A summary should capture key points (recall). A translation should be accurate (precision).
</details>
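
The difference between the two views is easiest to see on extreme candidates. In this sketch (the helper name `unigram_overlap` and the example sentences are assumptions), a very short candidate gets perfect precision but poor recall, while a padded one gets perfect recall but poor precision.

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str):
    """Return (precision, recall) of clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((cand & ref).values())  # Counter intersection takes min counts
    precision = matches / sum(cand.values()) if cand else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    return precision, recall

reference = "the cat sat on the mat"
short = "the cat"
padded = "the cat sat on the mat while the dog barked loudly outside"

for name, cand in [("short", short), ("padded", padded)]:
    p, r = unigram_overlap(cand, reference)
    print(f"{name:>6}: precision={p:.2f}  recall={r:.2f}")
```

This is why recall-oriented and precision-oriented metrics are often reported together, or combined into an F1 score.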

In [None]:
from collections import Counter
from typing import List

def get_ngrams(tokens: List[str], n: int) -> Counter:
    """Extract n-grams from token list."""
    ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """
    Calculate ROUGE-N score.
    
    Args:
        candidate: Generated text
        reference: Reference text
        n: N-gram size (1 for ROUGE-1, 2 for ROUGE-2)
    
    Returns:
        Dict with precision, recall, and F1
    """
    # Tokenize (simple whitespace split)
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    
    # Get n-grams
    cand_ngrams = get_ngrams(cand_tokens, n)
    ref_ngrams = get_ngrams(ref_tokens, n)
    
    # Count matches (minimum of counts)
    matches = 0
    for ngram, count in ref_ngrams.items():
        matches += min(count, cand_ngrams.get(ngram, 0))
    
    # Calculate scores
    total_ref = sum(ref_ngrams.values())
    total_cand = sum(cand_ngrams.values())
    
    recall = matches / total_ref if total_ref > 0 else 0
    precision = matches / total_cand if total_cand > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    return {"precision": precision, "recall": recall, "f1": f1}

def lcs_length(a: List[str], b: List[str]) -> int:
    """Compute length of Longest Common Subsequence using DP."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i-1] == b[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    return dp[m][n]

def rouge_l(candidate: str, reference: str) -> dict:
    """Calculate ROUGE-L score based on LCS."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    
    lcs = lcs_length(cand_tokens, ref_tokens)
    
    recall = lcs / len(ref_tokens) if ref_tokens else 0
    precision = lcs / len(cand_tokens) if cand_tokens else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    return {"precision": precision, "recall": recall, "f1": f1}

# Test our implementation
reference = "the cat sat on the mat"
candidate = "the cat was sitting on the mat"

print("Reference:", reference)
print("Candidate:", candidate)
print()
print(f"ROUGE-1: {rouge_n(candidate, reference, n=1)}")
print(f"ROUGE-2: {rouge_n(candidate, reference, n=2)}")
print(f"ROUGE-L: {rouge_l(candidate, reference)}")

# Compare with library (its ROUGE values are F1 scores, so compare against our 'f1' entries)
print("\n--- Comparing with 'evaluate' library ---")
import evaluate
rouge = evaluate.load("rouge")
result = rouge.compute(predictions=[candidate], references=[reference])
print(f"Library ROUGE-1: {result['rouge1']:.4f}")
print(f"Library ROUGE-2: {result['rouge2']:.4f}")
print(f"Library ROUGE-L: {result['rougeL']:.4f}")

<details>
<summary><b>Q: When would ROUGE-L beat ROUGE-2?</b></summary>

**A:** ROUGE-L captures long-range structure that ROUGE-2 misses.

Example:
- Reference: "The quick brown fox jumps"
- Candidate: "The quick red fox leaps"

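ROUGE-2 matches only the bigram "the quick" (1 of 4 reference bigrams); it misses "quick brown", "brown fox", and "fox jumps".
ROUGE-L finds the LCS "The quick fox" (length 3 of 5 reference tokens), capturing the overall structure.

ROUGE-L is more forgiving when the candidate keeps the reference's word order but swaps individual words.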
</details>
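
The numbers behind this example can be reproduced with a short script. It is a sketch that assumes lowercase whitespace tokens and reports recall only; `lcs_len` is a compact illustrative variant of the LCS routine, not the notebook's `lcs_length`.

```python
from collections import Counter

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (two-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

ref = "the quick brown fox jumps".split()
cand = "the quick red fox leaps".split()

ref_bigrams = Counter(zip(ref, ref[1:]))
cand_bigrams = Counter(zip(cand, cand[1:]))
rouge2_recall = sum((ref_bigrams & cand_bigrams).values()) / sum(ref_bigrams.values())
rougeL_recall = lcs_len(cand, ref) / len(ref)

print(f"ROUGE-2 recall: {rouge2_recall:.2f}")  # 1 of 4 bigrams
print(f"ROUGE-L recall: {rougeL_recall:.2f}")  # LCS of 3 over 5 tokens
```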

### Understanding LCS Dynamic Programming

The Longest Common Subsequence algorithm uses this recurrence:

```
dp[i][j] = length of LCS for first i chars of A and first j chars of B

dp[i][j] = dp[i-1][j-1] + 1              if A[i-1] == B[j-1]  (match!)
         = max(dp[i-1][j], dp[i][j-1])   otherwise           (skip one)
```

**Example:** A = "CAT", B = "HAT" (compared character by character here; ROUGE-L applies the same recurrence to word tokens)

```
      ""  H  A  T
 ""    0  0  0  0
 C     0  0  0  0
 A     0  0  1  1
 T     0  0  1  2  <- LCS length = 2 ("AT")
```
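
To see the recurrence in action, the sketch below fills the table for the example above and then walks backwards through it to recover the subsequence itself. The function names `lcs_table` and `backtrack` are illustrative; tie-breaking during backtracking is an arbitrary choice, so other valid subsequences of the same length may exist for other inputs.

```python
def lcs_table(a: str, b: str):
    """Fill the DP table where dp[i][j] is the LCS length of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp

def backtrack(dp, a: str, b: str) -> str:
    """Walk from dp[m][n] towards dp[0][0], collecting matched characters."""
    i, j, out = len(a), len(b), []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1  # the value came from the row above
        else:
            j -= 1  # the value came from the column to the left
    return "".join(reversed(out))

dp = lcs_table("CAT", "HAT")
for row in dp:
    print(row)
print("LCS:", backtrack(dp, "CAT", "HAT"))
```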

In [None]:
import math
from collections import Counter

def modified_precision(candidate: str, reference: str, n: int) -> float:
    """
    Calculate clipped n-gram precision for BLEU.
    Clips counts to avoid gaming with repeated words.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    
    # Get n-grams
    cand_ngrams = Counter(tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens) - n + 1))
    
    # Clip counts to reference maximum
    clipped = sum(min(count, ref_ngrams.get(ng, 0)) for ng, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    
    return clipped / total if total > 0 else 0

def brevity_penalty(cand_len: int, ref_len: int) -> float:
    """BP = 1 if c > r, else exp(1 - r/c)"""
    if cand_len > ref_len:
        return 1.0
    elif cand_len == 0:
        return 0.0
    return math.exp(1 - ref_len / cand_len)

def bleu_score(candidate: str, reference: str, max_n: int = 4) -> dict:
    """
    Calculate BLEU score: BP * exp(avg(log(precisions)))
    """
    cand_len = len(candidate.split())
    ref_len = len(reference.split())
    
    # Modified precisions for n=1 to max_n
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    
    # Geometric mean (avoid log(0))
    if any(p == 0 for p in precisions):
        bleu = 0.0
    else:
        bleu = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    
    bp = brevity_penalty(cand_len, ref_len)
    return {'bleu': bp * bleu, 'brevity_penalty': bp, 'precisions': precisions}

# Test BLEU
print('=== BLEU Score Examples ===')
tests = [
    ('the cat sat on the mat', 'the cat sat on the mat'),
    ('the cat was on the mat', 'the cat sat on the mat'),
    ('the', 'the cat sat on the mat'),  # Short: BP = exp(-5); BLEU-2 is 0 because there are no bigrams
]

for cand, ref in tests:
    r = bleu_score(cand, ref, max_n=2)
    print(f'Cand: "{cand}"')
    print(f'  BLEU-2: {r["bleu"]:.4f}, BP: {r["brevity_penalty"]:.4f}')
    print()

## 3.3 BLEU: Bilingual Evaluation Understudy

BLEU is precision-based (unlike ROUGE's recall focus).

### Modified N-gram Precision

To avoid gaming with repeated words, BLEU clips counts:

$$p_n = \frac{\sum_{\text{gram}_n} \min(\text{Count}_{cand}, \text{Max}_{ref})}{\sum_{\text{gram}_n} \text{Count}_{cand}}$$
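
A degenerate candidate makes the effect of clipping visible. The sketch below compares unclipped and clipped unigram precision for a candidate that just repeats one word; the sentences are illustrative and tokenization is a plain whitespace split.

```python
from collections import Counter

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Unclipped: every candidate token that occurs anywhere in the reference counts
unclipped = sum(c for w, c in cand_counts.items() if w in ref_counts) / len(candidate)
# Clipped: each word counts at most as often as it occurs in the reference
clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items()) / len(candidate)

print(f"Unclipped precision: {unclipped:.3f}")  # all 7 tokens count
print(f"Clipped precision:   {clipped:.3f}")    # 'the' is capped at 2
```

With a single reference, $\text{Max}_{ref}$ is simply the n-gram's count in that reference; with several references it is the maximum count over them.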

### Brevity Penalty

Short translations can achieve high precision trivially. BLEU adds a penalty:

$$BP = \begin{cases} 
1 & \text{if } c > r \\
e^{1 - r/c} & \text{if } c \leq r
\end{cases}$$

where $c$ = candidate length, $r$ = reference length.
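
Evaluating the penalty for a few candidate lengths shows how quickly it decays. This is a small sketch of the formula above; treating an empty candidate as BP = 0 is an added convention to avoid division by zero.

```python
import math

def bp(c: int, r: int) -> float:
    """Brevity penalty for candidate length c and reference length r."""
    if c == 0:
        return 0.0  # convention for empty candidates (not part of the formula)
    return 1.0 if c > r else math.exp(1 - r / c)

r = 6
for c in (1, 3, 5, 6, 8):
    print(f"c={c}, r={r}: BP={bp(c, r):.4f}")
```

A candidate half the reference length already loses about 63% of its score ($e^{-1} \approx 0.37$), while candidates at or above the reference length are not penalized.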

### Final BLEU Score

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Typically $N=4$ and $w_n = 1/4$ (geometric mean of 1-4 gram precisions).
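
The combination step can be sketched on its own, taking the precisions and the penalty as inputs. The helper `combine_bleu` is a hypothetical name; returning 0 when any precision is 0 is one simple way to avoid $\log 0$, and real implementations often apply smoothing instead.

```python
import math

def combine_bleu(precisions, bp, weights=None):
    """BP * exp(sum_n w_n * log p_n); returns 0 if any precision is 0."""
    if weights is None:
        weights = [1 / len(precisions)] * len(precisions)  # uniform w_n = 1/N
    if any(p == 0 for p in precisions):
        return 0.0
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# "the cat was on the mat" vs "the cat sat on the mat":
# p_1 = 5/6 (unigrams), p_2 = 3/5 (bigrams), equal lengths so BP = 1
bleu2 = combine_bleu([5 / 6, 3 / 5], bp=1.0)
print(f"BLEU-2: {bleu2:.4f}")
```

With uniform weights the exponent is the mean log precision, so the result is the geometric mean of the precisions scaled by BP.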

<details>
<summary><b>Q: Why does BLEU need a brevity penalty?</b></summary>

**A:** Without it, a one-word translation like "the" could score 100% precision if that word appears in the reference. The brevity penalty ensures candidates are similar length to references, preventing this exploit.

Example: Reference "The cat sat on the mat"
- Candidate "the" → unigram precision = 1.0, but BP = $e^{1 - 6/1} = e^{-5} \approx 0.007$, so BLEU-1 ≈ 0.007
</details>

## 3.4 BERTScore: Semantic Similarity

ROUGE and BLEU measure **lexical** overlap. BERTScore measures **semantic** similarity.

### How It Works

1. Get BERT embeddings for each token in candidate and reference
2. Compute cosine similarity between all token pairs
3. Use greedy matching to align tokens
4. Calculate precision, recall, F1 from matched similarities
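
The matching in steps 2 to 4 can be sketched without loading a model. The example below uses hand-picked 2-D vectors as stand-ins for contextual embeddings (hypothetical values, not BERT outputs) and omits optional refinements such as idf weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(cand_emb, ref_emb):
    """Return (precision, recall, f1) from greedy cosine matching."""
    n_c, n_r = len(cand_emb), len(ref_emb)
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: each candidate token takes its best-matching reference token
    precision = sum(max(row) for row in sim) / n_c
    # Recall: each reference token takes its best-matching candidate token
    recall = sum(max(sim[i][j] for i in range(n_c)) for j in range(n_r)) / n_r
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy vectors standing in for token embeddings (hypothetical values)
ref_emb = [[1.0, 0.0], [0.0, 1.0]]
cand_emb = [[1.0, 0.0], [1.0, 1.0]]
p, r, f1 = greedy_match_scores(cand_emb, ref_emb)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Note that the matching is not one-to-one: several candidate tokens may pick the same reference token, which is what "greedy" means here.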

### When to Use BERTScore

| Scenario | ROUGE | BERTScore |
|----------|-------|-----------|
| "happy" vs "joyful" | 0 (different words) | High (similar meaning) |
| "bank" (river) vs "bank" (money) | 1 (same word) | Lower (different context) |

**BERTScore captures paraphrasing that ROUGE misses.**

<details>
<summary><b>Q: Why do we need BERTScore if we have ROUGE?</b></summary>

**A:** ROUGE only measures surface-level word overlap. Two summaries with identical meaning but different wording would score low on ROUGE.

Example:
- Reference: "The automobile collided with the barrier"
- Candidate: "The car crashed into the wall"

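ROUGE-1 ≈ 0.33 (only the two occurrences of "the" match, 2/6)
BERTScore ≈ 0.85 (semantically similar; an illustrative value that depends on the model and on baseline rescaling)

BERTScore correlates better with human judgment on paraphrases.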
</details>

In [None]:
# Install bert-score (uncomment if needed)
# !pip install bert-score --quiet

from bert_score import score

# Example comparison
references = ["The cat sat on the mat"]
candidates = [
    "The cat sat on the mat",      # Exact match
    "A cat was sitting on a mat",  # Paraphrase
    "The dog ran in the park",     # Different meaning
]

# Score all candidates in one call so the model is loaded only once
P, R, F1 = score(candidates, references * len(candidates), lang="en", verbose=False)

for cand, f1 in zip(candidates, F1):
    r1 = rouge_n(cand, references[0], n=1)
    print(f"Candidate: '{cand}'")
    print(f"  ROUGE-1 F1: {r1['f1']:.3f}")
    print(f"  BERTScore F1: {f1.item():.3f}")
    print()

In [None]:
# Visualize Evaluation Metrics Comparison
import matplotlib.pyplot as plt
import numpy as np

reference = 'The cat sat on the mat'
candidates = [
    'The cat sat on the mat',       # Exact
    'A cat was sitting on a mat',   # Paraphrase
    'The dog ran in the park',      # Different
]
labels = ['Exact Match', 'Paraphrase', 'Different']

# Calculate scores (using functions defined earlier)
rouge1 = [rouge_n(c, reference, 1)['f1'] for c in candidates]
rouge2 = [rouge_n(c, reference, 2)['f1'] for c in candidates]
rougel = [rouge_l(c, reference)['f1'] for c in candidates]
bleu = [bleu_score(c, reference, 2)['bleu'] for c in candidates]

# Illustrative BERTScore values, hard-coded because computing them is slow.
# Replace with bert_score output for a real comparison.
bert = [0.99, 0.87, 0.62]

# Plot
fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(labels))
width = 0.15

ax.bar(x - 2*width, rouge1, width, label='ROUGE-1', color='#2ecc71')
ax.bar(x - width, rouge2, width, label='ROUGE-2', color='#3498db')
ax.bar(x, rougel, width, label='ROUGE-L', color='#9b59b6')
ax.bar(x + width, bleu, width, label='BLEU-2', color='#e74c3c')
ax.bar(x + 2*width, bert, width, label='BERTScore', color='#f39c12')

ax.set_ylabel('Score')
ax.set_title('Evaluation Metrics Comparison')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

print('Key insight: BERTScore remains high for paraphrases where n-gram metrics drop!')

## 3.5 Human Evaluation

For final assessment, human evaluation is the gold standard.

### Common Approaches

**Likert Scale Ratings**
- Rate quality on 1-5 scale
- Fast but coarse-grained

**Pairwise Comparisons**
- "Which summary is better: A or B?"
- More reliable than absolute ratings
- Requires more comparisons

**Best-Worst Scaling**
- Show N options, pick best and worst
- Efficient use of annotator time
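
One simple way to turn best-worst judgments into per-item scores is counting: the number of times an item was picked as best, minus the times it was picked as worst, divided by how often it was shown. The sketch below assumes that counting scheme; the trial data and system names are hypothetical.

```python
from collections import Counter

# Each trial: (items shown, item picked as best, item picked as worst)
trials = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "B", "D"),
    (("A", "B", "C", "D"), "A", "C"),
]

shown, best, worst = Counter(), Counter(), Counter()
for items, b, w in trials:
    shown.update(items)
    best[b] += 1
    worst[w] += 1

# Score in [-1, 1]: +1 means always best, -1 means always worst
scores = {item: (best[item] - worst[item]) / shown[item] for item in shown}
for item, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {s:+.2f}")
```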

### Inter-Annotator Agreement

Measure how much two annotators agree using **Cohen's Kappa** (for more than two annotators, Fleiss' kappa is a common generalization):

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ = observed agreement, $p_e$ = expected agreement by chance.

| Kappa | Interpretation |
|-------|----------------|
| < 0 | Less than chance |
| 0.00-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost perfect |
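
The kappa formula is short enough to compute by hand. The sketch below does so for two annotators with categorical labels; the ratings are hypothetical, and returning 1.0 when $p_e = 1$ (both annotators used one identical label throughout) is an added convention, since kappa is undefined there.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

ann_a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad", "good", "good"]
ann_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
kappa = cohens_kappa(ann_a, ann_b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 items ($p_o = 0.8$), but chance alone would produce $p_e = 0.52$, so kappa lands in the "moderate" band rather than looking like 80% agreement.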

<details>
<summary><b>Q: What's a good inter-annotator agreement (Kappa) for NLG evaluation?</b></summary>

**A:** For NLG tasks, κ > 0.6 (substantial agreement) is considered good. NLG evaluation is inherently subjective, so perfect agreement is rare. If κ < 0.4, your annotation guidelines may need clarification.
</details>

## 3.6 Evaluation Strategy Guide

### Which Metrics for Which Task?

| Task | Primary | Secondary | Human Eval |
|------|---------|-----------|------------|
| **Summarization** | ROUGE-L | BERTScore | Sample-based |
| **Translation** | BLEU | COMET, BERTScore | Critical |
| **Dialogue** | Perplexity | BERTScore, BLEU | Yes |
| **Story Generation** | Perplexity | Diversity metrics | Essential |
| **QA** | Exact Match | F1, BERTScore | Task-dependent |

### Evaluation Checklist

1. **Development:** Use automatic metrics (fast iteration)
2. **Validation:** Add BERTScore (semantic check)
3. **Final:** Human evaluation on sample
4. **Report:** Multiple metrics + confidence intervals
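
For the confidence intervals in step 4, a percentile bootstrap over per-example scores is one common option. The sketch below assumes that method; the scores are hypothetical, and `bootstrap_ci` with its defaults (2000 resamples, 95% interval, fixed seed) is an illustrative helper.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the examples with replacement and record each resample's mean
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, lo, hi

# Hypothetical per-example ROUGE-L F1 scores from a test set
scores = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.35, 0.58, 0.49, 0.44]
mean, lo, hi = bootstrap_ci(scores)
print(f"ROUGE-L F1: {mean:.3f} (95% CI: {lo:.3f} to {hi:.3f})")
```

When comparing two systems, overlapping intervals are a hint that the difference may not be meaningful on a test set of this size.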

## Part 3: Exercise

### Exercise: Build a Complete Evaluation Pipeline

1. Implement ROUGE-1, ROUGE-2, and ROUGE-L from scratch (use code above as reference)
2. Add BLEU with brevity penalty
3. Compare your scores with the `evaluate` library
4. Analyze where the metrics disagree

```python
def evaluate_summary(candidate: str, reference: str) -> dict:
    """
    Comprehensive evaluation of a summary.
    
    Returns dict with:
    - rouge1, rouge2, rougeL (your implementation)
    - rouge1_lib, rouge2_lib, rougeL_lib (library)
    - bertscore_f1
    """
    # TODO: Implement
    pass

# Test cases
test_cases = [
    ("The cat sat on the mat", "The cat sat on the mat"),  # Exact
    ("A feline rested on the rug", "The cat sat on the mat"),  # Paraphrase
    ("The dog ran in the park", "The cat sat on the mat"),  # Different
]

for cand, ref in test_cases:
    print(f"Candidate: {cand}")
    print(f"Reference: {ref}")
    # print(evaluate_summary(cand, ref))
    print()
```

## Part 3: Key Takeaways

1. **ROUGE** measures recall-based n-gram overlap - good for summarization
2. **BLEU** measures precision with brevity penalty - good for translation
3. **BERTScore** captures semantic similarity - detects paraphrases
4. **Human evaluation** is the gold standard but expensive
5. Use **multiple metrics** and report confidence intervals

---

## Summary: What You've Learned

### Part 1: MLOps
- ML lifecycle: data → train → deploy → monitor → repeat
- Experiment tracking with MLflow
- Model serving with Gradio
- Monitoring for drift

### Part 2: Advanced Fine-tuning
- LoRA: $W = W_0 + BA$ with massive parameter savings
- QLoRA: 4-bit quantization + LoRA
- Memory strategies: checkpointing, mixed precision

### Part 3: Evaluation
- ROUGE (recall), BLEU (precision), BERTScore (semantic)
- When to use which metric
- Human evaluation best practices

---

## References

1. Hu et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models" [arXiv:2106.09685](https://arxiv.org/abs/2106.09685)
2. Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs" [arXiv:2305.14314](https://arxiv.org/abs/2305.14314)
3. Lin (2004). "ROUGE: A Package for Automatic Evaluation of Summaries". In *Text Summarization Branches Out* (ACL Workshop).
4. Papineni et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation". In *Proceedings of ACL 2002*.
5. Zhang et al. (2020). "BERTScore: Evaluating Text Generation with BERT" [arXiv:1904.09675](https://arxiv.org/abs/1904.09675)
6. [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
7. [PEFT Library](https://huggingface.co/docs/peft)