# Workshop 3.1: Quantization for Efficient Inference - Hands-On Practice

## Learning Objectives
Upon completion of this notebook, you will be able to:
- Articulate the theoretical foundations and practical motivations for quantization in deep neural networks, particularly in the context of resource-constrained edge AI deployments.
- Analyze the principles and implications of FP16 quantization, including the conversion from 32-bit (FP32) to 16-bit (FP16) floating-point representations.
- Implement post-training quantization workflows using PyTorch, and critically assess their impact on model performance.
- Quantitatively evaluate the trade-offs between model size, computational efficiency, and predictive accuracy resulting from quantization.
- Employ visualization techniques to systematically interpret the effects of quantization on neural network behavior and deployment characteristics.



## What is Quantization?
Quantization is the process of mapping the parameters and activations of a neural network to a lower-precision numerical format. In deep learning, this typically means converting 32-bit floating-point (FP32) values to reduced-precision formats such as 16-bit floating-point (FP16) or 8-bit integer (INT8). The primary objectives are to decrease memory footprint, accelerate inference, and enable deployment on hardware with limited computational resources, while keeping accuracy within acceptable bounds.



This workshop focuses on FP16 quantization, which offers:
- Substantial reduction in model storage requirements and memory bandwidth
- Negligible degradation in predictive accuracy for most modern architectures
- Broad compatibility with PyTorch and contemporary hardware accelerators
- Straightforward integration into existing model development pipelines
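
The effect of the FP32 to FP16 conversion can be seen on a single tensor. The sketch below is illustrative only: the tensor values are arbitrary and not taken from the workshop. It casts an FP32 tensor to FP16 with `.half()`, compares the bytes used per element, and measures the rounding error introduced by the cast.

```python
import torch

# Arbitrary example values (hypothetical, chosen only for illustration)
weights_fp32 = torch.tensor([0.1, 1.0, 3.14159265, 1234.5678], dtype=torch.float32)

# Cast to half precision; each value is rounded to a representable FP16 number
weights_fp16 = weights_fp32.half()

# Storage cost per element: 4 bytes for FP32, 2 bytes for FP16
bytes_fp32 = weights_fp32.element_size()
bytes_fp16 = weights_fp16.element_size()

# Rounding error introduced by the cast, measured after converting back to FP32
max_abs_error = (weights_fp32 - weights_fp16.float()).abs().max().item()

print(f"FP32: {bytes_fp32} bytes/element, FP16: {bytes_fp16} bytes/element")
print(f"Largest absolute rounding error: {max_abs_error:.6f}")
print(f"Largest finite FP16 value: {torch.finfo(torch.float16).max}")
```

Larger magnitudes lose more absolute precision because FP16 has fewer mantissa bits, and values beyond the largest finite FP16 value overflow to infinity.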


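Paper: Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference": https://arxiv.org/pdf/1712.05877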


---
**🔥 HANDS-ON PRACTICE**: This notebook contains code completion exercises marked with `# TODO:` comments. Fill in the missing code to complete the quantization workflow!

In [None]:
# Import required libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torch.quantization
from torchvision import models
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
import copy
import time
import os

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## Step 1: Load and Prepare CIFAR-10 Dataset

### Understanding CIFAR-10: Why This Dataset?

CIFAR-10 is perfect for learning quantization because:
- **Small images (32×32)** - Fast training and testing
- **Realistic challenge** - 10 different object classes
- **Edge AI relevant** - Similar to mobile camera applications
- **Resource-friendly** - Doesn't require powerful hardware

The small image size makes it ideal for:
- Mobile device deployment
- Edge computing scenarios
- Real-time inference applications
- Learning optimization techniques like quantization

In [None]:
# TODO: Define transforms for training and testing
# HINT: Use transforms.Compose() with RandomCrop, RandomHorizontalFlip, ToTensor, and Normalize
# The normalization values for CIFAR-10 are: mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.2010)

transform_train = transforms.Compose([
    # TODO: Add RandomCrop with size 32 and padding 4
    # TODO: Add RandomHorizontalFlip
    # TODO: Add ToTensor
    # TODO: Add Normalize with the values mentioned above
])

transform_test = transforms.Compose([
    # TODO: Add ToTensor
    # TODO: Add Normalize with the same values as above
])

# Load CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
trainloader = DataLoader(trainset, batch_size=256, shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
testloader = DataLoader(testset, batch_size=200, shuffle=False, num_workers=2)

# CIFAR-10 classes
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

print(f"Training samples: {len(trainset)}")
print(f"Test samples: {len(testset)}")
print(f"Classes: {classes}")

In [None]:
# Visualize sample images from CIFAR-10 dataset
def visualize_cifar10_samples(dataset, classes, num_samples=12):
    """
    Display a grid of sample images from CIFAR-10 with their class labels
    """
    fig, axes = plt.subplots(3, 4, figsize=(12, 9))
    fig.suptitle('CIFAR-10 Dataset Samples', fontsize=16, fontweight='bold')
    
    # Get random samples
    indices = np.random.choice(len(dataset), num_samples, replace=False)
    
    for i, idx in enumerate(indices):
        row = i // 4
        col = i % 4
        
        # Get image and label
        image, label = dataset[idx]
        
        # Convert tensor to numpy and denormalize for display
        if isinstance(image, torch.Tensor):
            # Denormalize the image
            mean = np.array([0.4914, 0.4822, 0.4465])
            std = np.array([0.2023, 0.1994, 0.2010])
            image = image.numpy().transpose(1, 2, 0)
            image = image * std + mean
            image = np.clip(image, 0, 1)
        
        # Display image
        axes[row, col].imshow(image)
        axes[row, col].set_title(f'Class: {classes[label]}', fontsize=12, fontweight='bold')
        axes[row, col].axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print(f"CIFAR-10 Dataset Details:")
    print(f"• Image size: 32×32 pixels (small images)")
    print(f"• Color channels: 3 (RGB)")
    print(f"• Number of classes: {len(classes)}")
    print(f"• Classes: {', '.join(classes)}")
    print(f"• Samples in this dataset: {len(dataset)}")
    print(f"• MobileNetV2 is a lightweight architecture, which suits small images and edge deployment")

# Display sample images and dataset information
print("CIFAR-10 Dataset Overview:")
visualize_cifar10_samples(trainset, classes)

## Step 2: Load Pre-trained MobileNetV2 Model

In [None]:
# TODO: Complete the MobileNetV2 adaptation function
def create_mobilenetv2_cifar10(num_classes=10, pretrained=True):
    """
    Create MobileNetV2 adapted for CIFAR-10
    CIFAR-10 images are 32x32, smaller than ImageNet's 224x224
    """
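    # TODO: Load pre-trained MobileNetV2 using models.mobilenet_v2()
    # HINT: recent torchvision versions take weights=models.MobileNet_V2_Weights.DEFAULT;
    #       older versions take pretrained=True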
    model = # Your code here
    
    # TODO: Modify the first convolution layer for smaller input size
    # HINT: use nn.Conv2d()
    model.features[0][0] = # Your code here
    
    # TODO: Modify classifier for CIFAR-10 (10 classes instead of 1000)
    # HINT: model.classifier[1] should be a Linear layer with model.last_channel input features
    model.classifier[1] = # Your code here
    
    return model

# Create model instance
model = create_mobilenetv2_cifar10(num_classes=10, pretrained=True).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"MobileNetV2 loaded with {total_params:,} total parameters")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model adapted for CIFAR-10 (32x32 images, 10 classes)")

In [None]:
# Model Architecture and Summary
def show_model_info(model, input_shape=(1, 3, 32, 32)):
    """
    Display detailed information about the MobileNetV2 model
    """
    print("="*70)
    print("MOBILENETV2 MODEL ARCHITECTURE OVERVIEW")
    print("="*70)
    
    # Model summary information
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    model_size_mb = total_params * 4 / (1024 * 1024)  # 4 bytes per FP32 parameter
    
    print(f"Model Statistics:")
    print(f"   • Total Parameters: {total_params:,}")
    print(f"   • Trainable Parameters: {trainable_params:,}")
    print(f"   • Model Size (FP32): {model_size_mb:.2f} MB")
    print(f"   • Input Shape: {input_shape}")
    print(f"   • Output Classes: 10 (CIFAR-10)")
    
    # Show key architectural components
    print(f"\nKey Architecture Components:")
    print(f"   • Features: {len(model.features)} layers (conv blocks)")
    print(f"   • Classifier: {len(model.classifier)} layers")
    print(f"   • Activation: ReLU6 (mobile-friendly)")
    print(f"   • Normalization: Batch Normalization")
    
    # Create a visual representation of the model flow
    print(f"\nModel Flow Overview:")
    print(f"   Input (3×32×32) → Features Extraction → Global Pooling → Classifier → Output (10)")
    
    # Test with a dummy input to show output shapes
    model.eval()
    with torch.no_grad():
        dummy_input = torch.randn(input_shape).to(device)
        dummy_output = model(dummy_input)
        print(f"\nModel Test:")
        print(f"   • Input shape: {dummy_input.shape}")
        print(f"   • Output shape: {dummy_output.shape}")
        print(f"   • Output holds raw scores (logits) for {dummy_output.shape[1]} classes")
    
    return total_params, model_size_mb

# Analyze our MobileNetV2 model
print("Examining MobileNetV2 model in detail:")
total_params, model_size_mb = show_model_info(model)

# Show the actual model structure (first few layers)
print(f"\nModel Structure Preview (First Few Layers):")
print("="*50)
for i, (name, layer) in enumerate(model.named_children()):
    print(f"{i+1}. {name}: {layer.__class__.__name__}")
    if i >= 2:  # Show first 3 main components
        break

print(f"\nKey Insight:")
print(f"   This model has {total_params:,} parameters taking {model_size_mb:.1f} MB.")
print(f"   With FP16 quantization, we'll reduce this to ~{model_size_mb/2:.1f} MB!")
print(f"   Perfect for deployment on mobile devices and edge computing!")

## Step 3: Train the Original Model (FP32)

In [None]:
# TODO: Complete the training function
def train_model(model, trainloader, testloader, epochs=10, learning_rate=0.001):
    # TODO: Define criterion (loss function) - use CrossEntropyLoss
    criterion = # Your code here
    
    # TODO: Define optimizer - use Adam with the given learning rate
    optimizer = # Your code here
    
    train_losses = []
    train_accuracies = []
    
    model.train()
    
    for epoch in range(epochs):
        running_loss = 0.0
        correct = 0
        total = 0
        
        # Training loop with progress bar
        pbar = tqdm(trainloader, desc=f'Epoch {epoch+1}/{epochs}')
        for batch_idx, (inputs, labels) in enumerate(pbar):
            inputs, labels = inputs.to(device), labels.to(device)
            
            # TODO: Zero gradients (call a method on the optimizer; do not reassign it)
            # Your code here
            
            # TODO: Forward pass
            outputs = # Your code here
            
            # TODO: Calculate loss
            loss = # Your code here
            
            # TODO: Backward pass
            # Your code here
            
            # TODO: Update weights with .step() (call it on the optimizer; do not reassign it)
            # Your code here
            
            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
            
            # Update progress bar
            pbar.set_postfix({
                'Loss': f'{running_loss/(batch_idx+1):.3f}',
                'Acc': f'{100.*correct/total:.2f}%'
            })
        
        epoch_loss = running_loss / len(trainloader)
        epoch_acc = 100. * correct / total
        
        train_losses.append(epoch_loss)
        train_accuracies.append(epoch_acc)
        
        print(f'Epoch {epoch+1}: Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.2f}%')
    
    return train_losses, train_accuracies

# Train the model
print("Training original FP32 model...")
train_losses, train_accuracies = train_model(model, trainloader, testloader, epochs=5)

## Step 4: Evaluate Original Model Performance

In [None]:
# TODO: Complete the evaluation function
def evaluate_model(model, testloader, model_name="Model"):
    model.eval()
    correct = 0
    total = 0
    
    # TODO: Determine if model is quantized (on CPU) or regular (on GPU)
    # HINT: Use next(model.parameters()).device to get model device
    model_device = # Your code here
    
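    # Measure wall-clock time for the whole evaluation loop
    # (this includes data loading and host-to-device transfers, not just the forward pass)
    start_time = time.time()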
    
    # TODO: Use torch.no_grad() context manager for evaluation
    with # Your code here:
        for inputs, labels in tqdm(testloader, desc=f'Evaluating {model_name}'):
            # TODO: Move inputs to the same device as the model
            inputs = # Your code here
            labels = labels.to(device)  # Keep labels on original device for comparison
            
            # TODO: Forward pass
            outputs = # Your code here
            
            # TODO: Get predictions (use outputs.max(1) to get the class with the highest score)
            _, predicted = # Your code here
            
            # TODO: Move predictions back to same device as labels for comparison
            predicted = # Your code here
            
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    end_time = time.time()
    inference_time = end_time - start_time
    
    accuracy = 100. * correct / total
    print(f'{model_name} - Accuracy: {accuracy:.2f}%, Inference Time: {inference_time:.2f}s')
    
    return accuracy, inference_time

def get_model_size(model, model_name="Model"):
    # Save model temporarily to measure size
    temp_path = f'temp_{model_name.lower().replace(" ", "_")}.pth'
    torch.save(model.state_dict(), temp_path)
    size_mb = os.path.getsize(temp_path) / (1024 * 1024)
    os.remove(temp_path)
    
    print(f'{model_name} - Size: {size_mb:.2f} MB')
    return size_mb

# Evaluate original model
original_accuracy, original_time = evaluate_model(model, testloader, "Original FP32")
original_size = get_model_size(model, "Original FP32")

## Step 5: Apply FP16 Quantization (Half Precision)

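FP16 quantization converts a model's floating-point weights from a 32-bit to a 16-bit representation. This approach offers:
- **Simple and reliable** - a single `.half()` call, best supported on CUDA GPUs (some CPU operators are slow or unavailable in FP16, depending on the PyTorch version)
- **Clear benefits** - roughly 50% model size reduction, since every floating-point parameter shrinks from 4 bytes to 2
- **Minimal accuracy loss** - accuracy typically stays close to the FP32 baseline
- **Good starting point** - before considering more aggressive quantization techniques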
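
To see where the size reduction comes from, the following sketch measures it on a small stand-in network. The toy model and the `tensor_bytes` helper are hypothetical and exist only for this illustration; they are not the workshop's MobileNetV2 or part of PyTorch. Floating-point parameters are halved, while integer buffers such as BatchNorm's step counter keep their original type.

```python
import copy
import torch
import torch.nn as nn

def tensor_bytes(tensors):
    """Sum the storage size in bytes of an iterable of tensors (hypothetical helper)."""
    return sum(t.numel() * t.element_size() for t in tensors)

# Hypothetical toy network used only for illustration
toy_fp32 = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),
)

# .half() converts floating-point parameters and buffers to FP16
toy_fp16 = copy.deepcopy(toy_fp32).half()

# Ratio over learnable parameters only
param_ratio = tensor_bytes(toy_fp16.parameters()) / tensor_bytes(toy_fp32.parameters())

# Ratio over everything that torch.save(model.state_dict()) would store
state_ratio = (tensor_bytes(toy_fp16.state_dict().values())
               / tensor_bytes(toy_fp32.state_dict().values()))

# BatchNorm's step counter is an integer tensor, so .half() leaves it unchanged
counter_dtype = toy_fp16.state_dict()["1.num_batches_tracked"].dtype

print(f"Parameter bytes, FP16 / FP32: {param_ratio:.4f}")
print(f"State dict bytes, FP16 / FP32: {state_ratio:.4f}")
print(f"BatchNorm counter dtype after .half(): {counter_dtype}")
```

The on-disk file is slightly larger than half the original because of the unconverted integer buffers and the serialization overhead of the checkpoint format.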

In [None]:
# TODO: Complete the FP16 quantization function
def apply_fp16_quantization(model):
    """
    Convert model to FP16 (half precision)
    This reduces model size by roughly 50% with minimal accuracy loss
    """
    print("Converting model to FP16 (half precision)...")
    
    # TODO: Create a deep copy of the model to avoid modifying the original with deepcopy()
    model_fp16 = # Your code here
    
    # TODO: Move the copy to the CPU before conversion with .to('cpu')
    model_fp16 = # Your code here
    
    # TODO: Convert to half precision (FP16) with .half() method
    model_fp16 = # Your code here
    
    model_fp16.eval()
    
    # TODO: Move back to device for evaluation
    model_fp16 = # Your code here
    
    print("FP16 quantization successful!")
    print("   All floating-point weights now use 16 bits instead of 32")
    
    return model_fp16

# Apply FP16 quantization
model_fp16 = apply_fp16_quantization(model)

print(f"\nModel Comparison:")
print(f"Original model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"FP16 model parameters: {sum(p.numel() for p in model_fp16.parameters()):,}")
print(f"Parameter count unchanged (same architecture)")
print(f"Memory usage per parameter: FP32 = 4 bytes, FP16 = 2 bytes")

# Show the actual data types
print(f"\nData type verification:")
print(f"Original model dtype: {next(model.parameters()).dtype}")
print(f"FP16 model dtype: {next(model_fp16.parameters()).dtype}")

## Step 5B: Apply INT8 Quantization (Dynamic Quantization)

INT8 quantization maps model weights and/or activations to 8-bit integer representations, providing even greater compression and potential speedup compared to FP16. In PyTorch, dynamic quantization is a practical approach for post-training quantization, especially for models with mostly linear layers (e.g., fully connected, LSTM, Transformer).

**Key characteristics:**
- Converts weights to INT8, activations quantized dynamically at runtime
- **Only quantizes specific layer types** (e.g., nn.Linear, nn.LSTM, nn.GRU)
- **Convolutional layers are NOT quantized** by dynamic quantization
- Size reduction depends on the proportion of quantizable layers in your model
- May introduce more accuracy loss than FP16, but often acceptable for many applications

**Important for MobileNetV2:** Since MobileNetV2 consists mostly of convolutional layers with only a small final classifier (nn.Linear), dynamic INT8 quantization will show minimal size reduction. This is expected behavior - the size reduction depends on how many Linear layers your model contains.

Below, we apply dynamic INT8 quantization to the MobileNetV2 classifier and evaluate its performance.
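
The layer-type restriction described above can be checked directly. The sketch below uses a hypothetical toy model (`TinyConvNet` is not part of the workshop) with one convolution and one linear layer, applies `torch.quantization.quantize_dynamic` to it, and inspects which modules were replaced. It assumes a CPU build of PyTorch with a quantized backend available.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Hypothetical toy model: one conv layer followed by one linear classifier."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(4, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(torch.flatten(x, 1))

float_model = TinyConvNet().eval()

# Only the layer types listed in the set are replaced; a quantized copy is returned
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

# The conv layer is still a regular nn.Conv2d; the linear layer is now a quantized module
conv_unchanged = type(quantized_model.conv) is nn.Conv2d
fc_module_path = type(quantized_model.fc).__module__

print(f"Conv layer left in FP32: {conv_unchanged}")
print(f"Linear layer now comes from: {fc_module_path}")
```

Because only `fc` changes, the saved size of a convolution-heavy model barely moves, which matches the behavior expected for MobileNetV2.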

In [None]:
# TODO: Complete the INT8 dynamic quantization function
def apply_int8_dynamic_quantization(model):
    """
    Apply dynamic INT8 quantization to supported layers (e.g., nn.Linear) in the model.
    Returns a quantized model ready for evaluation.
    """
    print("Applying dynamic INT8 quantization...")
    
    # TODO: Create a deep copy of the model and move to CPU
    # HINT: Use copy.deepcopy(model).cpu().eval()
    model_int8 = # Your code here
    
    # TODO: Apply dynamic quantization to Linear layers
    # HINT: Use torch.quantization.quantize_dynamic()
    # Parameters: model, {torch.nn.Linear}, dtype=torch.qint8
    model_int8_quantized = # Your code here
    
    print("INT8 dynamic quantization successful!")
    return model_int8_quantized

# Apply INT8 quantization
dynamic_int8_model = apply_int8_dynamic_quantization(model)

# TODO: Evaluate INT8 quantized model
# HINT: Use the evaluate_model function you completed earlier
int8_accuracy, int8_time = # Your code here
int8_size = get_model_size(dynamic_int8_model, "INT8 Quantized")

print(f"\nINT8 Quantized Model:\n  • Size: {int8_size:.2f} MB\n  • Accuracy: {int8_accuracy:.2f}%\n  • Inference Time: {int8_time:.2f}s")

## Step 6: Evaluate Quantized Models

### Why Batch Size Matters for FP16 Inference Speed

Larger batch sizes can significantly improve the inference speed of FP16 quantized models, especially on CUDA-enabled GPUs with Tensor Cores. This is because:
- Tensor Cores are most efficient when processing large matrix operations, which occur with larger batches.
- Small batches may not fully utilize the GPU's parallelism or Tensor Core hardware, so the speedup from FP16 is less noticeable.
- With larger batches, the overhead of data transfer and kernel launch is amortized, and the GPU can process more data in parallel, making FP16 inference faster compared to FP32.

**Key takeaway:** If your hardware supports it, increasing the batch size can help you realize the full speed benefits of FP16 quantization.
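
One way to observe this effect is to time the forward pass alone at several batch sizes. The following is a minimal sketch, not the workshop's evaluation code: `time_inference` and the toy model are hypothetical, and the FP16 measurement only runs when a CUDA device is available. GPU kernels run asynchronously, so the helper calls `torch.cuda.synchronize()` before reading the clock.

```python
import copy
import time
import torch
import torch.nn as nn

def time_inference(model, batch, n_iters=10):
    """Return the average seconds per forward pass for one fixed batch (hypothetical helper)."""
    model.eval()
    use_cuda = batch.is_cuda
    with torch.no_grad():
        model(batch)  # warm-up pass, excluded from the measurement
        if use_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before starting the clock
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the timed work to finish
        elapsed = time.perf_counter() - start
    return elapsed / n_iters

# Hypothetical stand-in model used only for illustration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)

# FP16 is only timed on CUDA, where it is best supported
precisions = [("FP32", torch.float32)]
if device.type == "cuda":
    precisions.append(("FP16", torch.float16))

timings = {}
for name, dtype in precisions:
    model_variant = copy.deepcopy(toy_model).to(dtype)
    for batch_size in (8, 64, 512):
        batch = torch.randn(batch_size, 512, device=device, dtype=dtype)
        timings[(name, batch_size)] = time_inference(model_variant, batch)
        per_sample_us = 1e6 * timings[(name, batch_size)] / batch_size
        print(f"{name} batch={batch_size}: {per_sample_us:.2f} microseconds per sample")
```

Comparing time per sample rather than time per batch makes the amortization of fixed overheads visible as the batch size grows.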

In [None]:
# TODO: Complete the FP16 evaluation function
def evaluate_fp16_model(model, testloader, model_name="FP16 Model"):
    """
    Evaluate FP16 model performance
    """
    model.eval()
    correct = 0
    total = 0
    
    # Measure inference time
    start_time = time.time()
    
    with torch.no_grad():
        for inputs, labels in tqdm(testloader, desc=f'Evaluating {model_name}'):
            # TODO: Move inputs to the model's device and convert them to FP16 to match model precision
            # HINT: chain .to(device) and .half() on inputs
            inputs = # Your code here
            labels = labels.to(device)
            
            # TODO: Forward pass
            outputs = # Your code here
            
            # TODO: Get predictions
            _, predicted = # Your code here
            
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    end_time = time.time()
    inference_time = end_time - start_time
    
    accuracy = 100. * correct / total
    print(f'{model_name} - Accuracy: {accuracy:.2f}%, Inference Time: {inference_time:.2f}s')
    
    return accuracy, inference_time

def get_model_size_precise(model, model_name="Model"):
    """
    Calculate precise model size based on parameter types
    """
    total_size = 0
    
    for param in model.parameters():
        # Calculate size based on actual data type
        param_size = param.numel() * param.element_size()
        total_size += param_size
    
    # Convert to MB
    size_mb = total_size / (1024 * 1024)
    
    print(f'{model_name} - Size: {size_mb:.2f} MB')
    return size_mb

# Evaluate FP16 model
print("Evaluating FP16 Quantized Model...")
fp16_accuracy, fp16_time = evaluate_fp16_model(model_fp16, testloader, "FP16 Quantized")
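# Use the same on-disk measurement as original_size and int8_size so the comparison is consistent
fp16_size = get_model_size(model_fp16, "FP16 Quantized")
# Parameter-only size (excludes buffers such as BatchNorm running statistics)
fp16_param_size = get_model_size_precise(model_fp16, "FP16 Quantized (parameters only)")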

# Compare with original and INT8
print(f"\nQuantization Results:")
print(f"Original FP32 Model:")
print(f"  • Size: {original_size:.2f} MB")
print(f"  • Accuracy: {original_accuracy:.2f}%")
print(f"  • Inference Time: {original_time:.2f}s")

print(f"\nFP16 Quantized Model:")
print(f"  • Size: {fp16_size:.2f} MB")
print(f"  • Accuracy: {fp16_accuracy:.2f}%")
print(f"  • Inference Time: {fp16_time:.2f}s")

print(f"\nINT8 Quantized Model:")
print(f"  • Size: {int8_size:.2f} MB")
print(f"  • Accuracy: {int8_accuracy:.2f}%")
print(f"  • Inference Time: {int8_time:.2f}s")

# TODO: Calculate improvements for both FP16 and INT8
fp16_size_reduction = # Your code here - calculate FP16 size reduction percentage
fp16_accuracy_change = # Your code here - calculate FP16 accuracy difference
fp16_speed_change = # Your code here - calculate FP16 speed change percentage

int8_size_reduction = # Your code here - calculate INT8 size reduction percentage
int8_accuracy_change = # Your code here - calculate INT8 accuracy difference
int8_speed_change = # Your code here - calculate INT8 speed change percentage

print(f"\nImprovements:")
print(f"FP16 Quantization:")
print(f"  Size Reduction: {fp16_size_reduction:.1f}%")
print(f"  Accuracy Change: {fp16_accuracy_change:+.2f}%")
print(f"  Speed Change: {fp16_speed_change:+.1f}%")

print(f"INT8 Quantization:")
print(f"  Size Reduction: {int8_size_reduction:.1f}%")
print(f"  Accuracy Change: {int8_accuracy_change:+.2f}%")
print(f"  Speed Change: {int8_speed_change:+.1f}%")

print(f"\nSuccess! FP16 achieved {fp16_size_reduction:.0f}% size reduction, INT8 achieved {int8_size_reduction:.0f}% size reduction!")