# ⚡ Performance Optimization: Make Models 10x Faster

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gouthamgo/FineTuning/blob/main/lessons/module3_advanced/04_performance_optimization.ipynb)

## Hey! Ready to make your models BLAZING fast? 🚀

Here's the truth: **A slow model is a useless model.**

Your model might be 99% accurate, but if it takes 5 seconds to respond, users will hate it. In production, **speed = money**.

Good news: You can make most models **5-10x faster** without losing accuracy!

### 🎯 What We'll Learn

Today you'll master 5 techniques that production ML engineers use:

1. **Quantization** - Make models 4x smaller and 3x faster
2. **Knowledge Distillation** - Train a tiny student model from a big teacher
3. **ONNX Runtime** - Hardware-optimized inference
4. **Batch Processing** - Smart batching for throughput
5. **Pruning** - Remove unnecessary model weights

### 💰 Business Impact

**Before optimization:**
- 500ms latency → Users complain
- 4GB model → Expensive GPU instances ($2/hour)
- 100 requests/sec max → Need horizontal scaling

**After optimization:**
- 50ms latency → Happy users!
- 1GB model → Cheap CPU instances ($0.20/hour)
- 1000 requests/sec → Single server handles everything

**Savings: over $15,000/year on infrastructure for one always-on instance ($1.80/hour × 8,760 hours)!**

Let's go! ⚡

---

## Setup

In [None]:
!pip install -q transformers datasets torch accelerate optimum onnx onnxruntime sentencepiece

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset
import numpy as np
import time
from typing import List, Dict
import matplotlib.pyplot as plt

print("✅ Setup complete!")

### Load a Model (We'll Optimize This)

In [None]:
# Load a base model for sentiment analysis
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Check model size
param_count = sum(p.numel() for p in model.parameters())
model_size_mb = param_count * 4 / (1024 ** 2)  # 4 bytes per float32 param

print("✅ Model loaded!")
print(f"Parameters: {param_count:,}")
print(f"Model size: {model_size_mb:.1f} MB")

### Benchmark Original Model

In [None]:
def benchmark_model(model, tokenizer, texts: List[str], num_runs: int = 100,
                    device: str = "cpu") -> Dict:
    """
    Benchmark single-request inference latency.

    Every model is benchmarked on the same device (CPU by default) so the
    speedups are comparable. Dynamically quantized PyTorch models only run on CPU.

    Returns:
        Dict with latency metrics in milliseconds
    """
    # Non-PyTorch wrappers (e.g. ONNX Runtime models) may not define eval()
    if hasattr(model, "eval"):
        model.eval()
    model.to(device)

    latencies = []

    # Warmup
    for _ in range(10):
        inputs = tokenizer(texts[0], return_tensors="pt").to(device)
        with torch.no_grad():
            _ = model(**inputs)

    # Benchmark
    for i in range(num_runs):
        text = texts[i % len(texts)]
        inputs = tokenizer(text, return_tensors="pt").to(device)

        start = time.perf_counter()
        with torch.no_grad():
            outputs = model(**inputs)
        if device.startswith("cuda"):
            torch.cuda.synchronize()  # Wait for queued GPU work before stopping the clock
        end = time.perf_counter()

        latencies.append((end - start) * 1000)  # Convert to ms

    return {
        "mean_ms": np.mean(latencies),
        "p50_ms": np.percentile(latencies, 50),
        "p95_ms": np.percentile(latencies, 95),
        "p99_ms": np.percentile(latencies, 99),
    }

# Test texts
test_texts = [
    "This product is amazing! I love it.",
    "Terrible experience, would not recommend.",
    "It's okay, nothing special.",
    "Best purchase I've made this year!",
]

print("🔍 Benchmarking original model...\n")
baseline_metrics = benchmark_model(model, tokenizer, test_texts)

print("📊 Baseline Performance:")
for metric, value in baseline_metrics.items():
    print(f"  {metric}: {value:.2f} ms")

print(f"\n💰 Cost: ~{model_size_mb:.1f} MB model size")

---

## Technique 1: Quantization (4x Smaller, 3x Faster) 🎯

### What is Quantization?

Normal models use **32-bit floats** for each weight. That's overkill!

Quantization converts weights to **8-bit integers** (or even 4-bit).

**Result:** Model is 4x smaller with minimal accuracy loss!
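
To make the idea concrete, here is a minimal NumPy sketch of 8-bit affine quantization on a random weight matrix. The helper names `quantize_int8` and `dequantize_int8` are made up for this illustration, and real libraries typically use more refined schemes (per-channel scales, calibrated ranges), so treat it as a picture of the idea rather than how PyTorch implements it.

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 values onto the 256 levels of int8 (affine quantization)."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 or 1.0         # float step per integer level
    zero_point = int(round(-128 - w_min / scale))  # shifts w_min onto -128
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate float32 values from the int8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

# Random stand-in for one layer's weights (illustrative data, not a real model)
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = float(np.abs(weights - restored).max())

print(f"float32 bytes: {weights.nbytes}, int8 bytes: {q.nbytes}")
print(f"max round-trip error: {max_error:.6f} (step size: {scale:.6f})")
```

Each weight now takes 1 byte instead of 4, and the round-trip error stays within about half a quantization step.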

### When to Use:
- ✅ Production deployment (always!)
- ✅ Edge devices (mobile, IoT)
- ✅ Cost reduction
- ❌ Training (use full precision)

### Types:
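1. **Dynamic Quantization**: Quantize weights ahead of time and activations on the fly; no calibration data needed (easiest)
2. **Static Quantization**: Calibrate activation ranges with sample data before deployment (better)
3. **Quantization-Aware Training**: Simulate quantization during training so the model adapts to it (best)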

In [None]:
# Dynamic Quantization (easiest method)
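import io

print("⚡ Applying dynamic quantization...\n")

# Quantize the model (PyTorch dynamic quantization runs on CPU, so move the model there first)
quantized_model = torch.quantization.quantize_dynamic(
    model.to("cpu"),
    {torch.nn.Linear},  # Quantize linear layers
    dtype=torch.qint8    # Use 8-bit integers
)

def serialized_size_mb(m) -> float:
    """Size of the saved state_dict in MB."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / (1024 ** 2)

# Check size reduction. Quantized Linear layers keep packed int8 weights that do not
# show up in .parameters(), so measure the serialized state_dict instead.
# Embeddings stay in float32, so expect less than a full 4x reduction.
quantized_size_mb = serialized_size_mb(quantized_model)

print("✅ Quantization complete!")
print(f"Original size: {model_size_mb:.1f} MB")
print(f"Quantized size: ~{quantized_size_mb:.1f} MB")
print(f"Reduction: {(1 - quantized_size_mb/model_size_mb)*100:.1f}%")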

In [None]:
# Benchmark quantized model
print("\n🔍 Benchmarking quantized model...\n")
quantized_metrics = benchmark_model(quantized_model, tokenizer, test_texts)

print("📊 Quantized Performance:")
for metric, value in quantized_metrics.items():
    baseline = baseline_metrics[metric]
    speedup = baseline / value
    print(f"  {metric}: {value:.2f} ms ({speedup:.2f}x faster)")

print("\n💡 Typical speedup: 2-3x on CPU. PyTorch dynamic quantization does not run on GPU.")

### Test Accuracy (Should Be Nearly Identical!)

In [None]:
def compare_predictions(model1, model2, tokenizer, texts: List[str]):
    """Compare predictions between two models."""
    print("🔍 Comparing predictions...\n")
    
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        
        # Original model
        with torch.no_grad():
            outputs1 = model1(**inputs)
            pred1 = torch.argmax(outputs1.logits, dim=-1).item()
            conf1 = torch.softmax(outputs1.logits, dim=-1).max().item()
        
        # Quantized model
        with torch.no_grad():
            outputs2 = model2(**inputs)
            pred2 = torch.argmax(outputs2.logits, dim=-1).item()
            conf2 = torch.softmax(outputs2.logits, dim=-1).max().item()
        
        labels = ["NEGATIVE", "POSITIVE"]
        print(f"Text: {text[:50]}...")
        print(f"  Original:  {labels[pred1]} ({conf1:.2%})")
        print(f"  Quantized: {labels[pred2]} ({conf2:.2%})")
        print(f"  Match: {'✅' if pred1 == pred2 else '❌'}\n")

compare_predictions(model, quantized_model, tokenizer, test_texts)

---

## Technique 2: Knowledge Distillation (Tiny Model, Big Brain) 🧠

### The Idea:

Train a **small student model** to mimic a **large teacher model**.

It's like learning from a professor - you don't need to be as smart as them, just learn their patterns!

### How It Works:

1. **Teacher model** (large): Makes predictions
2. **Student model** (small): Learns to match those predictions
3. **Loss function**: Combination of:
   - Match teacher's soft predictions (probabilities)
   - Match actual labels
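
The "soft predictions" in step 3 come from dividing the teacher's logits by a temperature before the softmax. Here is a small sketch of that effect; the logits are made-up numbers and `softmax_with_temperature` is a name chosen for this example.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()              # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

teacher_logits = [4.0, 1.0, 0.5]  # made-up teacher outputs for three classes

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=3.0)

print("T=1:", np.round(hard, 3))
print("T=3:", np.round(soft, 3))
```

At the higher temperature the top class keeps its rank but gives up probability mass to the others, which gives the student more signal about how the teacher ranks the wrong answers.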

### Results:
- Student is 10x smaller
- Loses only 2-3% accuracy
- 10x faster inference!

### When to Use:
- ✅ Production deployment on CPU/edge
- ✅ Mobile apps
- ✅ Real-time applications
- ✅ Cost-sensitive scenarios

In [None]:
# Load a smaller student model
student_model_name = "prajjwal1/bert-tiny"  # Only 4M parameters!

print(f"Loading student model: {student_model_name}...")
student_model = AutoModelForSequenceClassification.from_pretrained(
    student_model_name,
    num_labels=2
)

student_params = sum(p.numel() for p in student_model.parameters())
student_size_mb = student_params * 4 / (1024 ** 2)

print("✅ Student model loaded!")
print(f"Teacher params: {param_count:,}")
print(f"Student params: {student_params:,}")
print(f"Size reduction: {param_count/student_params:.1f}x smaller!")

In [None]:
# Knowledge Distillation Loss
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    """
    Loss function for knowledge distillation.
    
    Combines:
    1. Soft loss: Match teacher's probability distribution
    2. Hard loss: Match actual labels
    """
    
    def __init__(self, temperature=3.0, alpha=0.7):
        super().__init__()
        self.temperature = temperature  # Soften probabilities
        self.alpha = alpha  # Balance between soft and hard loss
    
    def forward(self, student_logits, teacher_logits, labels):
        # Soft loss: KL divergence between student and teacher
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=-1)
        soft_student = F.log_softmax(student_logits / self.temperature, dim=-1)
        soft_loss = F.kl_div(soft_student, soft_targets, reduction='batchmean')
        soft_loss = soft_loss * (self.temperature ** 2)  # Scale back
        
        # Hard loss: Regular cross-entropy with labels
        hard_loss = F.cross_entropy(student_logits, labels)
        
        # Combine
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

print("✅ Distillation loss function ready!")
print("\nThis loss teaches the student to:")
print("  1. Match teacher's confident predictions (soft loss)")
print("  2. Get the right answers (hard loss)")
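
To see a loss like this drive an actual update, here is a self-contained sketch of a distillation training step. It assumes toy stand-ins: the teacher and student are small `nn.Sequential` networks on random features, not the notebook's transformer models, and `distillation_step` is a name invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (assumption): a wider teacher and a narrower student, 16 features, 2 classes
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2)).eval()
student = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)

def distillation_step(features, labels, temperature=3.0, alpha=0.7):
    """One optimizer step on the combined soft + hard loss; returns the loss value."""
    with torch.no_grad():                      # the teacher is frozen
        teacher_logits = teacher(features)
    student_logits = student(features)

    # Soft loss: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard loss: cross-entropy against the true labels
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random batch (illustrative data only), reused for every step
features = torch.randn(32, 16)
labels = torch.randint(0, 2, (32,))
losses = [distillation_step(features, labels) for _ in range(50)]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With real models you would swap in the transformer teacher and student, pass tokenized batches, and read `.logits` from each model output.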

In [None]:
# In production, you'd train the student on your dataset
# For this demo, we'll show the concept

print("💡 In production, you would:")
print("""\n1. Load your training data
2. Get teacher predictions for all samples
3. Train student to match both teacher and labels
4. Training code:

for batch in dataloader:
    labels = batch.pop('labels')  # Keep labels out of the model inputs

    # Get teacher predictions (no gradients needed)
    with torch.no_grad():
        teacher_logits = teacher_model(**batch).logits
    
    # Get student predictions (train this!)
    student_logits = student_model(**batch).logits
    
    # Calculate distillation loss
    loss = distillation_loss(student_logits, teacher_logits, labels)
    
    # Backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Result: Student model that's 10x smaller but nearly as accurate!
""")

---

## Technique 3: ONNX Runtime (Hardware Optimization) 🔧

### What is ONNX?

**ONNX** (Open Neural Network Exchange) is a universal model format.

**ONNX Runtime** optimizes your model for specific hardware (CPU, GPU, etc.)

### Benefits:
- ✅ 2-3x faster inference
- ✅ Works on any hardware
- ✅ Lower memory usage
- ✅ Better batching

### When to Use:
- ✅ Production deployment (always!)
- ✅ Cross-platform apps
- ✅ When you need maximum speed

In [None]:
from optimum.onnxruntime import ORTModelForSequenceClassification

print("🔄 Converting model to ONNX...\n")

# Convert and optimize
onnx_model = ORTModelForSequenceClassification.from_pretrained(
    model_name,
    export=True
)

print("✅ ONNX conversion complete!")
print("\nONNX Runtime will:")
print("  - Fuse operations (e.g., Conv + BatchNorm → single op)")
print("  - Optimize memory layout")
print("  - Use hardware-specific instructions (AVX, CUDA kernels)")
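
Operator fusion is easier to trust once you've seen the algebra. The sketch below folds an inference-time BatchNorm into the linear layer before it using plain NumPy. It uses a linear layer instead of a convolution to keep the math short, all values are random, and it only illustrates the idea; it is not how ONNX Runtime performs fusion internally.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # batch of 4 inputs, 8 features each

# Linear layer parameters (random values for illustration)
W = rng.normal(size=(8, 3))
b = rng.normal(size=3)

# BatchNorm statistics and affine parameters as used at inference time
mean = rng.normal(size=3)
var = rng.uniform(0.5, 2.0, size=3)
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
eps = 1e-5

# Unfused: linear layer followed by a separate normalization op
y_unfused = gamma * ((x @ W + b) - mean) / np.sqrt(var + eps) + beta

# Fused: fold the BatchNorm scale and shift into the weights once, ahead of time
bn_scale = gamma / np.sqrt(var + eps)
W_fused = W * bn_scale                       # rescales each output column
b_fused = (b - mean) * bn_scale + beta
y_fused = x @ W_fused + b_fused              # a single op at inference time

print("outputs match:", np.allclose(y_unfused, y_fused))
```

The fused version does one matrix multiply and one add per call, with no extra pass over the activations.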

In [None]:
# Benchmark ONNX model
print("\n🔍 Benchmarking ONNX model...\n")
onnx_metrics = benchmark_model(onnx_model, tokenizer, test_texts)

print("📊 ONNX Performance:")
for metric, value in onnx_metrics.items():
    baseline = baseline_metrics[metric]
    speedup = baseline / value
    print(f"  {metric}: {value:.2f} ms ({speedup:.2f}x faster)")

print("\n💡 ONNX Runtime shines on CPU inference!")

---

## Technique 4: Smart Batching (10x Throughput) 📦

### The Problem:

Processing one request at a time is inefficient. GPUs are built for parallelism!

### The Solution:

**Batch multiple requests** together and process them simultaneously.

### Results:
- Single request: 50ms
- Batch of 32: 100ms total → **3ms per request!**
- **16x improvement in throughput!**
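
The numbers above can be checked with a few lines of arithmetic. The latencies are the illustrative figures from the list, not measurements.

```python
single_latency_ms = 50.0     # one request processed alone
batch_latency_ms = 100.0     # one batch of 32 processed together
batch_size = 32

per_request_ms = batch_latency_ms / batch_size                # amortized cost per request
throughput_single = 1000.0 / single_latency_ms                # requests per second
throughput_batched = batch_size * 1000.0 / batch_latency_ms   # requests per second
gain = throughput_batched / throughput_single

print(f"{per_request_ms:.3f} ms per request (amortized)")
print(f"{throughput_single:.0f} -> {throughput_batched:.0f} requests/sec ({gain:.0f}x)")
```

Note the trade-off hiding in these figures: throughput goes up 16x, but each individual request now waits for the full 100ms batch instead of 50ms.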

### When to Use:
- ✅ High-traffic APIs
- ✅ Batch processing jobs
- ✅ When latency can be slightly higher
- ❌ Real-time single-user apps

In [None]:
def benchmark_batching(model, tokenizer, texts: List[str], batch_sizes: List[int]):
    """
    Compare throughput across different batch sizes.
    """
    model.eval()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    
    results = {}
    
    for batch_size in batch_sizes:
        # Create batches
        batches = [texts[i:i+batch_size] for i in range(0, len(texts), batch_size)]
        
        start = time.perf_counter()
        total_processed = 0
        
        for batch in batches:
            inputs = tokenizer(
                batch,
                padding=True,
                truncation=True,
                return_tensors="pt"
            ).to(device)
            
            with torch.no_grad():
                outputs = model(**inputs)
            
            total_processed += len(batch)
        
        end = time.perf_counter()
        
        total_time = end - start
        throughput = total_processed / total_time
        latency_per_item = (total_time / total_processed) * 1000
        
        results[batch_size] = {
            "throughput": throughput,
            "latency_per_item_ms": latency_per_item
        }
    
    return results

# Create test data (100 samples)
large_test_set = test_texts * 25

print("🔍 Benchmarking different batch sizes...\n")
batch_results = benchmark_batching(
    model,
    tokenizer,
    large_test_set,
    batch_sizes=[1, 4, 8, 16, 32]
)

print("📊 Batching Results:\n")
print(f"{'Batch Size':<12} {'Throughput':<15} {'Latency/Item':<15}")
print("-" * 42)

for batch_size, metrics in batch_results.items():
    print(
        f"{batch_size:<12} "
        f"{metrics['throughput']:<15.1f} "
        f"{metrics['latency_per_item_ms']:<15.2f}"
    )

print("\n💡 Sweet spot: Balance between throughput and latency (usually 8-16)")

---

## Technique 5: Pruning (Remove Dead Weight) ✂️

### The Idea:

Many model weights are close to zero and don't contribute much.

**Pruning** removes these weights, making the model smaller and faster.

### Types:
1. **Magnitude Pruning**: Remove the weights with the smallest absolute values
2. **Structured Pruning**: Remove entire neurons, attention heads, or channels
3. **Movement Pruning**: Remove weights that shrink toward zero during fine-tuning
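
Magnitude pruning, the simplest of the three, fits in a few lines of NumPy. This is a sketch on a random matrix, and `magnitude_prune` is a name chosen for this example rather than a library function.

```python
import numpy as np

def magnitude_prune(weights, amount):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    k = int(round(amount * weights.size))          # number of weights to zero out
    if k == 0:
        return weights.copy()
    magnitudes = np.abs(weights).ravel()
    threshold = np.partition(magnitudes, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold             # keep only weights above the threshold
    return weights * mask

# Random stand-in for a layer's weights (illustrative data only)
rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))

pruned = magnitude_prune(w, amount=0.3)
sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size

print(f"sparsity: {sparsity:.2f}")
print(f"bytes before: {w.nbytes}, bytes after: {pruned.nbytes}")
```

The pruned array is 30% zeros but takes exactly as many bytes as before. Unstructured pruning only pays off in size or speed once the zeros are stored in a sparse format or run on kernels that skip them, which is one reason structured pruning exists.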

### Results:
- Can remove 30-50% of weights
- With minimal accuracy loss (<2%)
- Often combined with quantization!

### When to Use:
- ✅ After training
- ✅ Combined with fine-tuning
- ✅ Edge deployment
- ❌ During initial training

In [None]:
import torch.nn.utils.prune as prune

def apply_magnitude_pruning(model, amount=0.3):
    """
    Apply magnitude-based pruning to linear layers.
    
    Args:
        model: The model to prune
        amount: Fraction of weights to prune (0.3 = 30%)
    
    Returns:
        Pruned model
    """
    print(f"⚡ Pruning {amount*100:.0f}% of weights...\n")
    
    # Find all Linear layers
    modules_to_prune = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            modules_to_prune.append((module, 'weight'))
    
    # Apply global magnitude pruning
    prune.global_unstructured(
        modules_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=amount,
    )
    
    # Make pruning permanent
    for module, param_name in modules_to_prune:
        prune.remove(module, param_name)
    
    # Count remaining weights
    total_params = sum(p.numel() for p in model.parameters())
    nonzero_params = sum((p != 0).sum().item() for p in model.parameters())
    
    print("✅ Pruning complete!")
    print(f"Total parameters: {total_params:,}")
    print(f"Non-zero parameters: {nonzero_params:,}")
    print(f"Sparsity: {(1 - nonzero_params/total_params)*100:.1f}%")
    
    return model

# Create a copy for pruning
import copy
pruned_model = copy.deepcopy(model)

# Prune 30% of weights
pruned_model = apply_magnitude_pruning(pruned_model, amount=0.3)

print("\n💡 In production, you'd fine-tune after pruning to recover accuracy!")

---

## 📊 Final Comparison: All Techniques

Let's compare all optimization methods!

In [None]:
# Summary comparison
comparison = {
    "Baseline": {
        "size_mb": model_size_mb,
        "latency_ms": baseline_metrics["mean_ms"],
        "accuracy_loss": 0.0
    },
    "Quantization (8-bit)": {
        "size_mb": quantized_size_mb,
        "latency_ms": quantized_metrics["mean_ms"],
        "accuracy_loss": 0.5  # Typical
    },
    "ONNX Runtime": {
        "size_mb": model_size_mb,  # Same size
        "latency_ms": onnx_metrics["mean_ms"],
        "accuracy_loss": 0.0
    },
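    "Knowledge Distillation": {
        "size_mb": student_size_mb,
        "latency_ms": baseline_metrics["mean_ms"] / 10,  # Assumed ~10x faster (not measured here)
        "accuracy_loss": 2.5  # Typical, not measured here
    },
    "Pruning (30%)": {
        "size_mb": model_size_mb * 0.7,  # Assumes sparse storage; dense tensors keep their size
        "latency_ms": baseline_metrics["mean_ms"] * 0.8,  # Assumed; needs sparse-aware kernels
        "accuracy_loss": 1.5  # Typical, not measured here
    },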
}

print("📊 OPTIMIZATION COMPARISON\n")
print(f"{'Method':<25} {'Size (MB)':<12} {'Latency (ms)':<15} {'Accuracy Loss'}")
print("=" * 70)

for method, metrics in comparison.items():
    print(
        f"{method:<25} "
        f"{metrics['size_mb']:<12.1f} "
        f"{metrics['latency_ms']:<15.2f} "
        f"{metrics['accuracy_loss']:.1f}%"
    )

print("\n💡 Pro Tip: Combine multiple techniques!")
print("   Best combo: Distillation + Quantization + ONNX")
print("   Result: 10x smaller, 20x faster, <3% accuracy loss!")

---

## 🎯 Production Recommendations

### For Different Scenarios:

#### 1. **Cloud Deployment (API)**
**Use:** Quantization + ONNX + Batching
```python
# Optimize for CPU inference (pseudocode: these helpers are placeholders, not library calls)
model = quantize_dynamic(model)
model = convert_to_onnx(model)
# Use batch size 8-16
```
**Result:** 5x cheaper servers, 3x faster

---

#### 2. **Mobile App**
**Use:** Knowledge Distillation + Quantization + Pruning
```python
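# Create tiny model (distill_from_teacher is a placeholder for your own training loop)
student = distill_from_teacher(teacher, student_tiny)
# Prune before quantizing: torch pruning works on float weights
student = apply_magnitude_pruning(student, amount=0.3)
student = torch.quantization.quantize_dynamic(student, {torch.nn.Linear}, dtype=torch.qint8)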
```
**Result:** <10MB model, runs on phone

---

#### 3. **Edge Devices (IoT)**
**Use:** Aggressive Distillation + 4-bit Quantization
```python
# Extreme optimization (pseudocode: distill_to_mini and quantize are placeholder helpers)
tiny_model = distill_to_mini(teacher)
tiny_model = quantize(tiny_model, bits=4)
```
**Result:** <5MB model, runs on Raspberry Pi

---

#### 4. **High-Throughput Batch Jobs**
**Use:** ONNX + Large Batches + GPU
```python
# Maximize throughput (pseudocode: convert_to_onnx is a placeholder helper, not a library call)
model = convert_to_onnx(model, use_gpu=True)
# Use batch size 64-128
```
**Result:** Process millions of records per hour

---

## 💰 Cost Analysis

### Before Optimization:
- Instance: g4dn.xlarge (GPU) @ $0.526/hour
- Throughput: 100 requests/sec
- Daily traffic: 5M requests
- Compute time: 5M / (100 * 3600) = 13.9 hours/day
- **Monthly cost: 13.9 hours × 30 days × $0.526 ≈ $219**

### After Optimization (Quantization + ONNX + Batching):
- Instance: t3.large (CPU) @ $0.0832/hour
- Throughput: 500 requests/sec (batching!)
- Daily traffic: 5M requests
- Compute time: 5M / (500 * 3600) = 2.8 hours/day
- **Monthly cost: 2.8 hours × 30 days × $0.0832 ≈ $7**

These figures assume the instance is billed only while it is processing requests.

**💵 Savings: about $212/month ≈ $2,550/year!**
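
If you want to rerun this cost analysis with your own traffic and prices, the arithmetic fits in one small function. `monthly_cost` is a name made up for this sketch; it assumes a 30-day month and billing only for the hours spent serving requests, and it prints unrounded values, so the result can differ by a dollar or so from the rounded figures in the text.

```python
def monthly_cost(daily_requests, requests_per_sec, price_per_hour, days=30):
    """Cost of the compute hours needed to serve the traffic (billed only while busy)."""
    hours_per_day = daily_requests / (requests_per_sec * 3600)
    return hours_per_day * days * price_per_hour

# Figures from the cost analysis above
before = monthly_cost(5_000_000, 100, 0.526)    # g4dn.xlarge (GPU)
after = monthly_cost(5_000_000, 500, 0.0832)    # t3.large (CPU)
saved = before - after

print(f"before: ${before:.2f}/month")
print(f"after:  ${after:.2f}/month")
print(f"saved:  ${saved:.2f}/month, ${saved * 12:.2f}/year")
```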

---

## 📚 Interview Prep

### Q: "How would you optimize a model for production?"

**Your Answer:**

*"I'd start by benchmarking the baseline - measure latency, throughput, and model size. Then I'd apply optimizations based on the deployment target:*

*For cloud APIs, I'd use **quantization** to reduce the model to 8-bit integers, which cuts size by 4x with minimal accuracy loss. Then convert to **ONNX Runtime** for hardware-specific optimizations. Finally, implement **smart batching** - processing 8-16 requests together can increase throughput 10x.*

*For mobile or edge, I'd use **knowledge distillation** to train a tiny student model that mimics the large teacher. Combined with quantization and pruning, you can get a 10x smaller model with only 2-3% accuracy loss.*

*The key is measuring everything - optimization without benchmarks is guesswork."*

---

### Q: "What's the difference between quantization and pruning?"

**Your Answer:**

*"**Quantization** reduces the precision of weights - converting from 32-bit floats to 8-bit or even 4-bit integers. This makes the model smaller and faster without changing the architecture. It's like storing prices as whole dollars instead of cents - less precise but good enough.*

*By contrast, **pruning** removes individual weights that are close to zero. It's like removing unused roads from a map. You can often prune 30-50% of weights with minimal accuracy loss, but you need to fine-tune afterwards to compensate.*

*They're complementary - I often use both together for maximum compression."*

---

### Q: "When would you NOT use optimization?"

**Your Answer:**

*"Good question! I wouldn't optimize in these cases:*

1. **During training** - *Use FP32 or mixed precision (FP16/BF16) for stability.*
2. **When accuracy is critical** - *Medical diagnosis or financial models, where every 0.1% matters.*
3. **GPU inference with low traffic** - *GPUs already handle FP32 well, so quantization might not help much.*
4. **Prototyping** - *Optimize after you've proven the model works.*

*The rule: **Optimize for production, not for development.**"*

---

## 🚀 Next Steps

### Practice Projects:

1. **Optimize Your Own Model**:
   - Take a model you've trained
   - Apply all 5 techniques
   - Benchmark before/after
   - Calculate cost savings

2. **Deploy Optimized Model**:
   - Convert to ONNX
   - Deploy on AWS Lambda (serverless!)
   - Add batching logic
   - Monitor latency

3. **Mobile App**:
   - Distill a large model to tiny
   - Quantize to 4-bit
   - Deploy to Android/iOS
   - Test on real device

### Further Reading:

- **Quantization**: [PyTorch Quantization Tutorial](https://pytorch.org/docs/stable/quantization.html)
- **Distillation**: [DistilBERT Paper](https://arxiv.org/abs/1910.01108)
- **ONNX**: [ONNX Runtime Docs](https://onnxruntime.ai/)
- **Pruning**: [Lottery Ticket Hypothesis](https://arxiv.org/abs/1803.03635)

---

## 🎉 You're Now an Optimization Expert!

You've learned:

✅ **Quantization** - 4x smaller models  
✅ **Knowledge Distillation** - 10x faster inference  
✅ **ONNX Runtime** - Hardware optimization  
✅ **Smart Batching** - 10x higher throughput  
✅ **Pruning** - Remove 30-50% of weights  

**These techniques will save your company tens of thousands of dollars per year.**

Add these skills to your resume:
- *"Optimized ML models for production deployment, reducing inference cost by 95%"*
- *"Implemented knowledge distillation and quantization, achieving 10x speedup with <2% accuracy loss"*

**Now go make those models FAST! ⚡**

---

*Built with ❤️ for people who ship ML to production*