## Summary

This notebook demonstrated four model optimization techniques and a side-by-side comparison of their trade-offs:

1. **Quantization**: Reduces weight precision from 32-bit floats to int8, shrinking the model and often speeding up CPU inference with minimal accuracy loss
2. **Pruning**: Zeroes out less important weights, creating sparse models that can be further optimized
3. **Knowledge Distillation**: Trains smaller student models to mimic larger teacher models
4. **ONNX Export**: Enables optimized inference across different platforms and runtimes
5. **Comprehensive Comparison**: Tabulates and plots the size and speed trade-offs of the approaches above

### Best Practices:
- Start with quantization for quick wins
- Use ONNX for production deployments
- Combine techniques for maximum optimization
- Always validate that accuracy remains acceptable
- Profile your specific use case to choose the best approach
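
As a rough illustration of the profiling advice above, the sketch below times a callable with warm-up runs and reports the median rather than the mean, which is less sensitive to one-off stalls. The `benchmark` helper and `dummy_workload` are hypothetical names introduced here for illustration; in practice the callable would wrap a model's forward pass.

```python
import statistics
import time

def benchmark(fn, *, warmup=3, runs=10):
    """Time fn() and return (median_seconds, stdev_seconds)."""
    # Warm-up calls are discarded so one-time setup costs do not skew results
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), statistics.stdev(timings)

call_count = 0

def dummy_workload():
    """Stand-in for a model forward pass (assumption for this sketch)."""
    global call_count
    call_count += 1
    return sum(i * i for i in range(10_000))

median_s, stdev_s = benchmark(dummy_workload)
print(f"median: {median_s:.6f}s, stdev: {stdev_s:.6f}s")
```

Running the same helper on the original and the optimized model, with the same inputs, keeps the comparison like-for-like.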

### Next Steps:
- Experiment with different quantization methods (static quantization, QAT)
- Explore structured pruning techniques
- Try progressive knowledge distillation
- Implement model compression pipelines for your specific models

### Production Recommendations:
1. **For CPU inference**: Quantization + ONNX Runtime
2. **For edge devices**: Knowledge Distillation + Quantization
3. **For cloud deployment**: ONNX Runtime with optimized providers
4. **For research**: Combine multiple techniques based on specific requirements

In [None]:
# Comprehensive comparison of all optimization techniques
import pandas as pd
import matplotlib.pyplot as plt

# Collect all performance metrics
optimization_summary = {
    'Technique': ['Original', 'Quantization', 'Pruning (20%)', 'Knowledge Distillation'],
    # Unstructured pruning only zeroes weights; the dense tensors keep their size,
    # so the pruned model is reported at the original size with 0% size reduction
    'Model Size (MB)': [original_size, quantized_size, original_size, student_size],
    'Inference Time (s)': [original_time, quantized_time, results[1]['time'], student_time],
    'Size Reduction (%)': [0, (1-quantized_size/original_size)*100, 0, (1-student_size/original_size)*100],
    'Speed Improvement (%)': [0, (original_time/quantized_time-1)*100, (original_time/results[1]['time']-1)*100, (original_time/student_time-1)*100],
    # Qualitative labels; accuracy is not measured on a labeled test set in this notebook
    'Accuracy Impact': ['Baseline', 'Minimal', 'Low', 'Moderate']
}

# Create comparison DataFrame
df = pd.DataFrame(optimization_summary)
print("Model Optimization Comparison Summary:")
print("=" * 70)
print(df.to_string(index=False, float_format='%.2f'))

# Create visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Model size comparison
techniques = df['Technique']
sizes = df['Model Size (MB)']
colors = ['blue', 'orange', 'green', 'red']
ax1.bar(techniques, sizes, color=colors)
ax1.set_title('Model Size Comparison')
ax1.set_ylabel('Size (MB)')
ax1.tick_params(axis='x', rotation=45)

# Inference time comparison
times = df['Inference Time (s)']
ax2.bar(techniques, times, color=colors)
ax2.set_title('Inference Time Comparison')
ax2.set_ylabel('Time (seconds)')
ax2.tick_params(axis='x', rotation=45)

# Size reduction percentage
size_reductions = df['Size Reduction (%)']
ax3.bar(techniques, size_reductions, color=colors)
ax3.set_title('Size Reduction Percentage')
ax3.set_ylabel('Reduction (%)')
ax3.tick_params(axis='x', rotation=45)

# Speed improvement percentage
speed_improvements = df['Speed Improvement (%)']
ax4.bar(techniques, speed_improvements, color=colors)
ax4.set_title('Speed Improvement Percentage')
ax4.set_ylabel('Improvement (%)')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("Key Optimization Insights:")
print("-" * 30)
print("• Quantization: Best balance of size and speed with minimal accuracy loss")
print("• Pruning: Reduces model complexity, good for specialized hardware")
print("• Knowledge Distillation: Significant size reduction but requires retraining")
print("• ONNX Export: Great for production deployment with runtime optimizations")
print("\nRecommendation: Combine techniques (e.g., quantized ONNX) for maximum benefit!")
print("="*70)

## 5. Comprehensive Performance Summary

Let's summarize all the optimization techniques and their trade-offs.
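
The two percentages used throughout this notebook can be written as small helpers. This is a minimal sketch; the function names are hypothetical and the numbers in the example are made up purely to show the arithmetic, not measured results.

```python
def size_reduction_pct(original_mb, optimized_mb):
    """Percentage of the original size that was saved."""
    return (1 - optimized_mb / original_mb) * 100

def speed_improvement_pct(original_s, optimized_s):
    """Percentage speed gain; negative when the optimized model is slower."""
    return (original_s / optimized_s - 1) * 100

# Illustrative values only (not results from this notebook)
print(size_reduction_pct(256.0, 64.0))    # a 256 MB model shrunk to 64 MB
print(speed_improvement_pct(0.5, 0.25))   # latency halved
print(speed_improvement_pct(0.25, 0.5))   # latency doubled
```

Note that halving the latency counts as a 100% speed improvement under this definition, while doubling it counts as -50%.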

In [None]:
from optimum.onnxruntime import ORTModelForSequenceClassification

# Convert model to ONNX format
onnx_model_path = "./onnx_model"

# Export to ONNX (this will create the ONNX model)
print("Converting model to ONNX format...")
try:
    ort_model = ORTModelForSequenceClassification.from_pretrained(
        model_name,
        export=True,
        use_cache=False
    )
    
    # Save ONNX model
    ort_model.save_pretrained(onnx_model_path)
    tokenizer.save_pretrained(onnx_model_path)
    
    print(f"ONNX model saved to {onnx_model_path}")
    
    # Load the ONNX model for inference
    ort_model = ORTModelForSequenceClassification.from_pretrained(onnx_model_path)
    ort_tokenizer = AutoTokenizer.from_pretrained(onnx_model_path)
    
    # Create pipelines for comparison
    pytorch_pipeline = pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        device=-1  # CPU
    )
    
    onnx_pipeline = pipeline(
        "text-classification",
        model=ort_model,
        tokenizer=ort_tokenizer,
        device=-1  # CPU
    )
    
    # Test with sample texts
    test_texts = [
        "This product exceeded my expectations!",
        "The service was disappointing.",
        "Average quality, nothing special."
    ]
    
    print("\nBenchmarking inference performance...")
    
    # Benchmark PyTorch pipeline
    start_time = time.time()
    pytorch_results = pytorch_pipeline(test_texts)
    pytorch_time = time.time() - start_time
    
    # Benchmark ONNX pipeline
    start_time = time.time()
    onnx_results = onnx_pipeline(test_texts)
    onnx_time = time.time() - start_time
    
    print(f"PyTorch pipeline: {pytorch_time:.4f} seconds")
    print(f"ONNX pipeline:    {onnx_time:.4f} seconds")
    
    # Calculate improvement
    # Report the PyTorch/ONNX time ratio (values below 1.0 mean ONNX was slower)
    if onnx_time > 0:
        speedup = pytorch_time / onnx_time
        print(f"ONNX speedup: {speedup:.2f}x")
    
    # Compare predictions
    print("\nPrediction comparison:")
    print("-" * 50)
    
    for i, (text, pt_result, onnx_result) in enumerate(zip(test_texts, pytorch_results, onnx_results)):
        print(f"Text {i+1}: {text}")
        print(f"  PyTorch: {pt_result['label']} ({pt_result['score']:.4f})")
        print(f"  ONNX:    {onnx_result['label']} ({onnx_result['score']:.4f})")
        print()

except Exception as e:
    print(f"ONNX conversion failed: {e}")
    print("This is common in some environments. The concept remains valid for production use.")

## 4. ONNX Export for Optimized Inference

ONNX (Open Neural Network Exchange) allows models to run on optimized runtimes, providing faster inference across different platforms.

In [None]:
# Create a smaller student model
# DistilBertConfig names these fields dim, n_layers, n_heads, and hidden_dim
student_config = DistilBertConfig(
    vocab_size=tokenizer.vocab_size,
    dim=384,  # Hidden size, smaller than the teacher's 768
    n_layers=4,  # Fewer layers than the teacher's 6
    n_heads=6,  # Fewer heads than the teacher's 12
    hidden_dim=1536,  # Feed-forward size, smaller than the teacher's 3072
    num_labels=2
)

student_model = DistilBertForSequenceClassification(student_config)

# Compare model sizes
teacher_size = get_model_size(model)
student_size = get_model_size(student_model)

print(f"Teacher model size: {teacher_size:.2f} MB")
print(f"Student model size: {student_size:.2f} MB")
print(f"Size reduction: {(1 - student_size/teacher_size)*100:.1f}%")

# Simple knowledge distillation loss function
def distillation_loss(student_logits, teacher_logits, true_labels, temperature=4.0, alpha=0.7):
    """Compute knowledge distillation loss"""
    # Soft targets from teacher
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_prob = F.log_softmax(student_logits / temperature, dim=-1)
    
    # KL divergence loss
    soft_loss = F.kl_div(soft_prob, soft_targets, reduction='batchmean') * (temperature ** 2)
    
    # Hard targets loss
    hard_loss = F.cross_entropy(student_logits, true_labels)
    
    # Combined loss
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Simple training loop for knowledge distillation
def train_student_model(student, teacher, train_data, num_epochs=2, batch_size=8):
    """Train student model with knowledge distillation"""
    teacher.eval()
    student.train()
    
    optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
    
    for epoch in range(num_epochs):
        total_loss = 0
        num_batches = 0
        
        # Simple batching: slicing a torch-formatted Dataset returns a dict of
        # column tensors that are already stacked along the batch dimension
        for i in range(0, len(train_data), batch_size):
            batch = train_data[i:i+batch_size]
            
            # Prepare batch data
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            labels = batch['label']
            
            # Get teacher predictions
            with torch.no_grad():
                teacher_outputs = teacher(input_ids=input_ids, attention_mask=attention_mask)
                teacher_logits = teacher_outputs.logits
            
            # Get student predictions
            student_outputs = student(input_ids=input_ids, attention_mask=attention_mask)
            student_logits = student_outputs.logits
            
            # Calculate distillation loss
            loss = distillation_loss(student_logits, teacher_logits, labels)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            num_batches += 1
        
        avg_loss = total_loss / num_batches
        print(f"Epoch {epoch + 1}/{num_epochs}, Average Loss: {avg_loss:.4f}")
    
    return student

print("\nTraining student model with knowledge distillation...")
student_model = train_student_model(student_model, model, train_dataset, num_epochs=2)

# Benchmark the student model
student_time, _ = benchmark_inference(student_model, tokenizer, sample_texts, num_runs=5)
print(f"\nStudent model inference time: {student_time:.4f} seconds")
print(f"Speed improvement over teacher: {(original_time/student_time - 1)*100:.1f}%")

In [None]:
from torch.nn import functional as F

# Create synthetic training data for demonstration
def create_synthetic_dataset(size=500):
    """Create synthetic text classification data"""
    positive_templates = [
        "This is amazing!", "Great work!", "Fantastic product!",
        "I love this!", "Excellent quality!", "Outstanding service!"
    ]
    negative_templates = [
        "This is terrible!", "Poor quality!", "I hate this!",
        "Awful experience!", "Complete waste!", "Very disappointing!"
    ]
    
    texts = []
    labels = []
    
    for i in range(size):
        if i % 2 == 0:
            text = np.random.choice(positive_templates)
            label = 1  # Positive
        else:
            text = np.random.choice(negative_templates)
            label = 0  # Negative
        
        # Add some variation
        text += f" Item {i}"
        texts.append(text)
        labels.append(label)
    
    return Dataset.from_dict({'text': texts, 'label': labels})

# Create smaller datasets for quick training
train_dataset = create_synthetic_dataset(200)
eval_dataset = create_synthetic_dataset(50)

print(f"Created {len(train_dataset)} training samples and {len(eval_dataset)} evaluation samples")
print(f"Sample data: {train_dataset[0]}")

# Tokenize datasets
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
eval_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

## 3. Knowledge Distillation

Knowledge distillation trains a smaller "student" model to mimic the behavior of a larger "teacher" model, achieving similar performance with reduced computational requirements.
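
To make the idea of "mimicking" concrete, the sketch below shows in plain Python how a temperature softens the teacher's output distribution and how the student is scored against it with a KL divergence. The logits are made-up values for illustration, and the helper names are assumptions of this sketch rather than a library API.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative logits only (not taken from the models in this notebook)
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]
T = 4.0

hard = softmax(teacher_logits)        # sharply peaked on the top class
soft = softmax(teacher_logits, T)     # spreads mass onto the other classes

# Soft-target loss, scaled by T^2 to keep gradient magnitudes comparable
soft_loss = kl_divergence(soft, softmax(student_logits, T)) * T ** 2

print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])
print("soft loss:", round(soft_loss, 4))
```

The softened distribution exposes how the teacher ranks the non-top classes, which is information a one-hot label does not carry.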

In [None]:
import torch.nn.utils.prune as prune
import copy

# Create a copy of the original model for pruning
model_to_prune = copy.deepcopy(model)

def apply_global_pruning(model, pruning_ratio=0.2):
    """Apply global magnitude-based pruning to the model"""
    # Collect all linear layers for pruning
    parameters_to_prune = []
    
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            parameters_to_prune.append((module, 'weight'))
    
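    # Apply global magnitude pruning
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=pruning_ratio,
    )
    
    # Make the pruning permanent: this folds weight_orig * weight_mask back into
    # a plain `weight` parameter, so the zeros are visible to calculate_sparsity
    for module, param_name in parameters_to_prune:
        prune.remove(module, param_name)
    
    return model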

def calculate_sparsity(model):
    """Calculate the sparsity of the model"""
    total_params = 0
    zero_params = 0
    
    for name, param in model.named_parameters():
        if 'weight' in name:
            total_params += param.numel()
            zero_params += (param == 0).sum().item()
    
    return zero_params / total_params * 100

# Apply different pruning ratios and compare
pruning_ratios = [0.1, 0.2, 0.3, 0.5]
results = []

print("Pruning Results:")
print("-" * 60)

for ratio in pruning_ratios:
    # Create a fresh copy of the model
    pruned_model = copy.deepcopy(model)
    
    # Apply pruning
    pruned_model = apply_global_pruning(pruned_model, ratio)
    
    # Calculate sparsity
    sparsity = calculate_sparsity(pruned_model)
    
    # Benchmark performance
    pruned_time, _ = benchmark_inference(pruned_model, tokenizer, sample_texts, num_runs=5)
    
    # Compare output probabilities against the original model
    pruned_preds = test_model_outputs(pruned_model, tokenizer, sample_texts)
    
    # Calculate average prediction difference
    avg_diff = 0
    for orig, pruned in zip(original_preds, pruned_preds):
        avg_diff += np.abs(orig - pruned).mean()
    avg_diff /= len(original_preds)
    
    results.append({
        'ratio': ratio,
        'sparsity': sparsity,
        'time': pruned_time,
        'accuracy_diff': avg_diff
    })
    
    print(f"Pruning Ratio: {ratio:4.1f} | Sparsity: {sparsity:5.1f}% | "
          f"Time: {pruned_time:6.4f}s | Acc Diff: {avg_diff:.6f}")

print("-" * 60)
print(f"Original Time: {original_time:.4f}s")

## 2. Model Pruning

Pruning removes less important weights from the model, reducing its size while maintaining most of its performance.
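
A minimal sketch of magnitude-based pruning on a plain list of weights may help before applying it to a real model. "Less important" is taken here to mean "smallest absolute value", which is the criterion L1 unstructured pruning uses; the helper names and the weight values are illustrative assumptions.

```python
def magnitude_prune(weights, amount):
    """Zero out the `amount` fraction of weights with the smallest magnitude."""
    n_prune = int(round(amount * len(weights)))
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

# Illustrative weights only
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.45, -0.1, 0.9, -0.2]
pruned = magnitude_prune(weights, amount=0.3)

print(pruned)
print(f"sparsity: {sparsity(pruned):.0%}")
```

The pruned list has the same length as the original: zeroing weights does not by itself shrink storage or speed up dense kernels, which is why sparse-aware runtimes or hardware are needed to realize the gains.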

In [None]:
# Apply dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},  # Quantize linear layers
    dtype=torch.qint8
)

# Benchmark quantized model
quantized_size = get_model_size(quantized_model)
quantized_time, quantized_std = benchmark_inference(quantized_model, tokenizer, sample_texts)

print("\nQuantized Model:")
print(f"  Size: {quantized_size:.2f} MB")
print(f"  Inference time: {quantized_time:.4f} ± {quantized_std:.4f} seconds")

# Calculate improvements
size_reduction = (1 - quantized_size / original_size) * 100
speed_improvement = (original_time / quantized_time - 1) * 100

print("\nImprovements:")
print(f"  Size reduction: {size_reduction:.1f}%")
print(f"  Speed improvement: {speed_improvement:.1f}%")

# Test accuracy comparison
def test_model_outputs(model, tokenizer, texts):
    """Get model predictions"""
    model.eval()
    predictions = []
    
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
            outputs = model(**inputs)
            pred = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predictions.append(pred.cpu().numpy())
    
    return predictions

original_preds = test_model_outputs(model, tokenizer, sample_texts)
quantized_preds = test_model_outputs(quantized_model, tokenizer, sample_texts)

# Calculate prediction differences
print("\nPrediction Accuracy Comparison:")
for i, (orig, quant) in enumerate(zip(original_preds, quantized_preds)):
    diff = np.abs(orig - quant).max()
    print(f"  Text {i+1} max difference: {diff:.6f}")

In [None]:
# Load a pre-trained model for sentiment analysis
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create sample text for testing
sample_texts = [
    "This movie is absolutely fantastic!",
    "I really disliked this film.",
    "The plot was okay, nothing special.",
    "Outstanding performance by the actors!"
]

import io

def get_model_size(model):
    """Calculate serialized model size in MB"""
    # Serialize the state dict in memory: unlike summing model.parameters(),
    # this also counts the packed int8 weights of dynamically quantized layers
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1024 / 1024

def benchmark_inference(model, tokenizer, texts, num_runs=10):
    """Benchmark inference time"""
    model.eval()
    times = []
    
    with torch.no_grad():
        for _ in range(num_runs):
            start_time = time.time()
            for text in texts:
                inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
                _ = model(**inputs)
            times.append(time.time() - start_time)
    
    return np.mean(times), np.std(times)

# Benchmark original model
original_size = get_model_size(model)
original_time, original_std = benchmark_inference(model, tokenizer, sample_texts)

print("Original Model:")
print(f"  Size: {original_size:.2f} MB")
print(f"  Inference time: {original_time:.4f} ± {original_std:.4f} seconds")

## 1. Model Quantization

Quantization reduces the precision of model weights from 32-bit floats to 8-bit integers, significantly reducing model size and improving inference speed with minimal accuracy loss.
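
The arithmetic behind this can be sketched in a few lines of plain Python. The example uses a simple symmetric per-tensor scheme, which is one of several possible schemes and is not a reproduction of PyTorch's internal kernels; the function names and weight values are assumptions made for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to [-127, 127]
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float values."""
    return [x * scale for x in q]

# Illustrative weights only
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print(f"scale: {scale:.6f}, max round-trip error: {max_error:.6f}")
```

Each weight now needs 1 byte instead of 4, which is where the roughly 75% reduction for quantized layers comes from, and the round-trip error is bounded by half a quantization step.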

In [None]:
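import torch
import torch.nn as nn
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DistilBertConfig,
    DistilBertForSequenceClassification,
    pipeline,
)
import numpy as np
import time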

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# 13. Model Optimization with Hugging Face

This notebook covers various techniques for optimizing Hugging Face models for better performance, reduced memory usage, and faster inference.

## Topics Covered:
1. **Model Quantization** - Reducing weight precision with dynamic (runtime) quantization
2. **Model Pruning** - Removing less important weights
3. **Knowledge Distillation** - Training smaller models from larger ones
4. **ONNX Export** - Converting models for optimized inference
5. **Performance Summary** - Comparing the techniques and their trade-offs

Let's start by installing the required packages and importing necessary libraries.

In [None]:
# Install required packages
!pip install transformers torch
!pip install "optimum[onnxruntime]"
!pip install datasets accelerate pandas matplotlib