# SmoothQuant Implementation: Accurate and Efficient Post-Training Quantization - Hands-On Practice

## Introduction

SmoothQuant is a post-training quantization technique that addresses the challenge of quantizing large language models (LLMs) by smoothing the activation outliers. This workshop implements the SmoothQuant method as described in the paper "SmoothQuant: Accurate and Efficient Post-training Quantization for Large Language Models" (Xiao et al., 2022).

### Key Concepts:
- **Problem**: Activation outliers in transformer models make quantization difficult
- **Solution**: Migrate difficulty from activations to weights through mathematical equivalence
- **Method**: Apply per-channel scaling to balance activation and weight quantization difficulties

### Workshop Objectives:
1. Load and analyze OPT-125M model weights and activations
2. Visualize the distribution of weights and activations before SmoothQuant
3. Implement the SmoothQuant algorithm
4. Visualize the transformed distributions
5. Evaluate model performance before and after quantization

## 1. Environment Setup and Dependencies

First, we install and import the necessary libraries for our implementation.

In [None]:
# Install required packages
!pip install torch transformers datasets matplotlib numpy scipy tqdm

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from transformers import OPTForCausalLM, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

print("🔥 HANDS-ON WORKSHOP: SmoothQuant Implementation")
print("📚 Complete the TODO exercises to learn SmoothQuant!")
print("💡 Look for TODO comments and fill in the missing code")

## 2. Model and Data Loading

We load the OPT-125M model and prepare calibration data for analyzing activation patterns.

In [None]:
# Load OPT-125M model and tokenizer
model_name = "facebook/opt-125m"  # OPT-125M checkpoint on the Hugging Face Hub
print(f"Loading model: {model_name}")

model = OPTForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model.to(device)
model.eval()

print(f"Model loaded successfully. Total parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Prepare calibration dataset
def prepare_calibration_data(tokenizer, num_samples=100, max_length=512):
    """Prepare calibration data from WikiText-2 dataset."""
    
    # Load WikiText-2 dataset
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    
    calibration_texts = []
    for example in dataset:
        # Stop once enough usable texts are collected (not after num_samples raw rows)
        if len(calibration_texts) >= num_samples:
            break
        text = example['text'].strip()
        if len(text) > 50:  # Filter out very short texts
            calibration_texts.append(text)
    
    # Tokenize the texts
    inputs = tokenizer(
        calibration_texts[:num_samples],
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )
    
    return inputs

# Prepare calibration data
calibration_data = prepare_calibration_data(tokenizer, num_samples=50, max_length=256)
print(f"Calibration data shape: {calibration_data['input_ids'].shape}")

## 3. Activation Collection and Analysis

Before implementing SmoothQuant, we need to understand the distribution of activations in the model. We'll register forward hooks on the linear layers and record the input activations each layer receives during the calibration forward passes.
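Storing every activation tensor, as the collector below does, is convenient for plotting but memory-hungry. If only the per-channel maxima are needed for the scaling factors, they can be accumulated batch by batch instead. The following is a minimal NumPy sketch of that idea, not part of the workshop code; `update_channel_absmax` and the toy batches are hypothetical names and data used only for illustration.

```python
import numpy as np

def update_channel_absmax(running_max, batch_activations):
    """Fold one batch of activations into a running per-channel |x| maximum."""
    # Collapse [batch, seq_len, channels] to [tokens, channels].
    flat = batch_activations.reshape(-1, batch_activations.shape[-1])
    batch_max = np.abs(flat).max(axis=0)
    if running_max is None:
        return batch_max
    return np.maximum(running_max, batch_max)

rng = np.random.default_rng(0)
batches = [rng.normal(size=(2, 4, 6)) for _ in range(3)]  # toy activations

running = None
for batch in batches:
    running = update_channel_absmax(running, batch)

# Reference: concatenate everything first, then take the per-channel maximum.
full = np.concatenate(batches, axis=0).reshape(-1, 6)
print(np.allclose(running, np.abs(full).max(axis=0)))
```

The running maximum gives the same per-channel statistic as concatenating all batches, while keeping only one vector of length `channels` in memory.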

In [None]:
class ActivationCollector:
    """Collects activations from specified layers during forward pass."""
    
    def __init__(self):
        self.activations = {}
        self.hooks = []
    
    def hook_fn(self, name):
        def hook(module, input, output):
            # Store input activations (not output)
            if isinstance(input, tuple):
                activation = input[0].detach().cpu()
            else:
                activation = input.detach().cpu()
            
            if name not in self.activations:
                self.activations[name] = []
            self.activations[name].append(activation)
        return hook
    
    def register_hooks(self, model, target_layers):
        """Register hooks on target layers."""
        for name, module in model.named_modules():
            if any(target in name for target in target_layers):
                if isinstance(module, nn.Linear):
                    hook = module.register_forward_hook(self.hook_fn(name))
                    self.hooks.append(hook)
                    print(f"Registered hook on: {name}")
    
    def remove_hooks(self):
        """Remove all registered hooks."""
        for hook in self.hooks:
            hook.remove()
        self.hooks = []
    
    def get_aggregated_activations(self):
        """Aggregate collected activations."""
        aggregated = {}
        for name, acts in self.activations.items():
            # Concatenate all collected activations
            concatenated = torch.cat(acts, dim=0)
            aggregated[name] = concatenated
        return aggregated

# Initialize activation collector
collector = ActivationCollector()

# Target specific layers for analysis (focus on attention and MLP layers)
target_layers = ['q_proj', 'k_proj', 'v_proj', 'out_proj', 'fc1', 'fc2']
collector.register_hooks(model, target_layers)

In [None]:
# Collect activations by running forward passes
print("Collecting activations...")

with torch.no_grad():
    # Process calibration data in smaller batches
    batch_size = 8
    # Ceiling division so the final partial batch is not dropped
    num_batches = (len(calibration_data['input_ids']) + batch_size - 1) // batch_size
    
    for i in tqdm(range(num_batches)):
        start_idx = i * batch_size
        end_idx = min((i + 1) * batch_size, len(calibration_data['input_ids']))
        
        batch_input_ids = calibration_data['input_ids'][start_idx:end_idx].to(device)
        batch_attention_mask = calibration_data['attention_mask'][start_idx:end_idx].to(device)
        
        # Forward pass
        outputs = model(
            input_ids=batch_input_ids,
            attention_mask=batch_attention_mask
        )

# Get aggregated activations
activations_original = collector.get_aggregated_activations()
collector.remove_hooks()

print(f"Collected activations from {len(activations_original)} layers")
for name, acts in activations_original.items():
    print(f"  {name}: {acts.shape}")

## 4. Weight and Activation Distribution Analysis

Now we analyze the distribution characteristics of weights and activations to understand the quantization challenges.
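To see why per-channel outliers matter, it can help to compare the per-channel maxima with the single range that per-tensor quantization has to cover. The sketch below uses synthetic data (the outlier in channel 3 and its factor of 50 are arbitrary assumptions, not measurements from OPT) and estimates how many of the 127 positive int8 levels a typical channel would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 16))   # toy [tokens, channels] activations
acts[:, 3] *= 50.0                  # make channel 3 an outlier channel (illustrative)

channel_absmax = np.abs(acts).max(axis=0)
global_absmax = channel_absmax.max()
median_channel_absmax = np.median(channel_absmax)

# With one per-tensor scale, every channel shares the range set by the outlier.
levels = 127                        # positive levels of symmetric int8
step = global_absmax / levels
typical_levels_used = median_channel_absmax / step

print("outlier channel:", int(channel_absmax.argmax()))
print("a typical channel uses about", round(float(typical_levels_used), 1), "of", levels, "levels")
```

A typical channel ends up using only a handful of quantization levels, which is the effect the statistics in the next cell are meant to expose on the real model.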

In [None]:
def analyze_distribution_stats(tensor, name):
    """Analyze statistical properties of tensor distributions."""
    
    # Flatten tensor for analysis
    flat_tensor = tensor.flatten()
    
    stats = {
        'mean': float(torch.mean(flat_tensor)),
        'std': float(torch.std(flat_tensor)),
        'min': float(torch.min(flat_tensor)),
        'max': float(torch.max(flat_tensor)),
        'median': float(torch.median(flat_tensor)),
        'q95': float(torch.quantile(flat_tensor, 0.95)),
        'q99': float(torch.quantile(flat_tensor, 0.99)),
        'abs_max': float(torch.max(torch.abs(flat_tensor)))
    }
    
    print(f"\n{name} Distribution Statistics:")
    print(f"  Mean: {stats['mean']:.6f}")
    print(f"  Std:  {stats['std']:.6f}")
    print(f"  Range: [{stats['min']:.6f}, {stats['max']:.6f}]")
    print(f"  95th percentile: {stats['q95']:.6f}")
    print(f"  99th percentile: {stats['q99']:.6f}")
    print(f"  Absolute max: {stats['abs_max']:.6f}")
    
    return stats

# Analyze activations
activation_stats = {}
for name, acts in activations_original.items():
    if 'fc1' in name:  # Focus on one representative layer
        activation_stats[name] = analyze_distribution_stats(acts, f"Activations - {name}")
        break

# Analyze weights from the same layer
weight_stats = {}
for name, module in model.named_modules():
    if 'fc1' in name and isinstance(module, nn.Linear):
        weight_stats[name] = analyze_distribution_stats(module.weight.data.cpu(), f"Weights - {name}")
        break

## 5. Pre-SmoothQuant Visualization

We create 3D visualizations similar to those in the SmoothQuant paper, showing the distribution of weights and activations across channels and tokens.

In [None]:
def create_3d_distribution_plot_matplotlib(tensor, title, max_channels=64, max_tokens=128):
    """Create 3D plot showing Channel x Token x Absolute Value distribution using matplotlib."""
    
    # TODO: Reshape tensor to [batch*seq_len, channels] if needed
    # HINT: Check if tensor has 3 dimensions, then reshape to 2D
    if len(tensor.shape) == 3:
        tensor = # Your code here - reshape to 2D
    
    # Sample data for visualization
    num_tokens = min(tensor.shape[0], max_tokens)
    num_channels = min(tensor.shape[1], max_channels)
    
    # TODO: Sample random indices for tokens and channels
    # HINT: Use torch.randperm() to get random permutation, then slice [:num_tokens]
    token_indices = # Your code here - random token indices
    channel_indices = # Your code here - random channel indices
    
    # Extract sampled data
    sampled_data = tensor[token_indices][:, channel_indices]
    
    # TODO: Compute absolute values
    # HINT: Use torch.abs()
    abs_values = # Your code here
    
    # Create meshgrid for 3D plot
    channels = np.arange(num_channels)
    tokens = np.arange(num_tokens)
    C, T = np.meshgrid(channels, tokens)
    
    # Create 3D surface plot using matplotlib
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    
    # TODO: Create surface plot
    # HINT: Use ax.plot_surface() with C, T, abs_values.numpy(), and cmap='viridis'
    surf = # Your code here - create surface plot
    
    # Customize plot
    ax.set_xlabel('Channel Index')
    ax.set_ylabel('Token Index') 
    ax.set_zlabel('Absolute Value')
    ax.set_title(title)
    
    # Add colorbar
    fig.colorbar(surf, ax=ax, shrink=0.5, aspect=5)
    
    plt.tight_layout()
    plt.show()
    
    return fig

# Get representative layer data
fc1_layer_name = None
fc1_activations = None
fc1_weights = None

# TODO: Find FC1 layer in collected activations
# HINT: Loop through activations_original.items() and check if 'fc1' in name
for name, acts in activations_original.items():
    if 'fc1' in name:
        fc1_layer_name = name
        fc1_activations = acts
        break

# TODO: Get corresponding weights from the model
# HINT: Loop through model.named_modules() and find the layer with same name
for name, module in model.named_modules():
    if name == fc1_layer_name:
        fc1_weights = # Your code here - get weight data and move to CPU
        break

print(f"🎯 Analyzing layer: {fc1_layer_name}")
print(f"📊 Activation shape: {fc1_activations.shape}")
print(f"📊 Weight shape: {fc1_weights.shape}")
print("✅ Complete the TODOs above to prepare data for visualization!")

In [None]:
# TODO: Create 3D plots for original distributions
print("Creating 3D visualizations for original distributions...")

# TODO: Plot original activations
# HINT: Call create_3d_distribution_plot_matplotlib() with fc1_activations and appropriate title
print("📊 Plotting original activations...")
fig_act_orig = # Your code here - create 3D plot for activations

# TODO: Plot original weights (transpose to match channel dimension)
# HINT: Use fc1_weights.T to transpose, then create 3D plot
print("📊 Plotting original weights...")
fig_weight_orig = # Your code here - create 3D plot for weights (transposed)

print("✅ Complete the TODOs above to visualize original distributions!")
print("🎯 Key observations to look for:")
print("  • Activation outliers (high peaks in certain channels)")
print("  • Weight distribution patterns")
print("  • These visualizations show WHY quantization is challenging")

## 6. SmoothQuant Algorithm Implementation - HANDS-ON PRACTICE

Now we implement the core SmoothQuant algorithm. The key insight is to mathematically migrate quantization difficulty from activations to weights through per-channel scaling.

**🔥 HANDS-ON PRACTICE**: Complete the missing parts of the SmoothQuant implementation below! Look for `# TODO:` comments and fill in the code.

### Key Mathematical Concepts:
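- **Migration Formula**: For a linear layer Y = XW^T, we apply a per-input-channel scaling s to get Y = (X ⊘ s)(W ⊙ s)^T, which leaves the output unchanged
- **Scaling Factor**: This workshop uses the simplified form s_j = (max|x_j|)^α, where α controls migration strength. The paper's full formula also accounts for the weights: s_j = (max|x_j|)^α / (max|w_j|)^(1-α)
- **Balance**: With the simplified form, α = 0 gives s = 1 (no smoothing) and α = 1 moves the entire activation range into the weights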
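The migration formula can be checked numerically before writing the class. The following NumPy sketch uses toy tensors (shapes, the outlier channel, and the factor of 40 are assumptions for illustration) and computes both the simplified scaling factor used in this workshop and the full form from the paper, then confirms that the layer output is unchanged in either case.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))        # [tokens, in_features]
X[:, 2] *= 40.0                     # one outlier activation channel
W = rng.normal(size=(5, 8)) * 0.1   # [out_features, in_features], as in nn.Linear

alpha = 0.5
act_max = np.abs(X).max(axis=0)     # per input channel
w_max = np.abs(W).max(axis=0)       # per input channel (column of W)

s_simple = act_max ** alpha                          # form used in this workshop
s_paper = act_max ** alpha / w_max ** (1.0 - alpha)  # form given in the paper

Y = X @ W.T
for s in (s_simple, s_paper):
    Y_smooth = (X / s) @ (W * s).T  # divide activations, multiply weight columns
    assert np.allclose(Y, Y_smooth)

print("max |X| before:", np.abs(X).max())
print("max |X| after (simplified s):", np.abs(X / s_simple).max())
```

Because `s` has one entry per input channel, it broadcasts over the last axis of both `X` and `W`, which is the same axis the TODOs below scale.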

In [None]:
class SmoothQuant:
    """Implementation of SmoothQuant algorithm."""
    
    def __init__(self, alpha=0.5):
        """
        Initialize SmoothQuant.
        
        Args:
            alpha (float): Migration strength parameter. 
                          0 = no smoothing, 1 = full migration to weights
        """
        self.alpha = alpha
        self.scaling_factors = {}
    
    def compute_scaling_factors(self, activations):
        """
        Compute per-channel scaling factors based on activation statistics.
        
        Args:
            activations (torch.Tensor): Input activations [batch, seq_len, channels]
        
        Returns:
            torch.Tensor: Scaling factors for each channel
        """
        # TODO: Reshape activations to [batch*seq_len, channels] if needed
        # HINT: Check if len(activations.shape) == 3, then reshape to 2D
        if len(activations.shape) == 3:
            activations = # Your code here - reshape to 2D
        
        # TODO: Compute per-channel maximum absolute values
        # HINT: Use torch.max with dim=0 to get max along batch dimension
        # HINT: torch.max returns (values, indices), so use [0] to get values
        channel_max = # Your code here - compute max absolute values per channel
        
        # TODO: Compute scaling factors: s_j = (max|x_j|)^α
        # HINT: Use torch.pow(channel_max, self.alpha)
        scaling_factors = # Your code here - apply power of alpha
        
        # TODO: Avoid division by zero by clamping minimum value
        # HINT: Use torch.clamp(scaling_factors, min=1e-8)
        scaling_factors = # Your code here - clamp minimum value
        
        return scaling_factors
    
    def apply_smoothing(self, activations, weights, layer_name=None):
        """
        Apply SmoothQuant transformation to activations and weights.
        
        For linear layer Y = XW^T:
        - Scale activations: X' = X * diag(s^{-1})  [divide by scaling factors]
        - Scale weights: W' = W * diag(s)           [multiply by scaling factors]
        
        Args:
            activations (torch.Tensor): Input activations
            weights (torch.Tensor): Layer weights
            layer_name (str): Optional layer name for tracking
        
        Returns:
            tuple: (smoothed_activations, smoothed_weights, scaling_factors)
        """
        # TODO: Compute scaling factors using the method above
        scaling_factors = # Your code here - call compute_scaling_factors
        
        # Store scaling factors for tracking
        if layer_name:
            self.scaling_factors[layer_name] = scaling_factors
        
        # TODO: Apply scaling to activations (DIVIDE by scaling factors)
        # HINT: Store original shape, reshape if needed, apply scaling, reshape back
        original_shape = activations.shape
        if len(original_shape) == 3:
            activations_2d = activations.reshape(-1, original_shape[-1])
        else:
            activations_2d = activations
        
        # TODO: Scale activations by dividing by scaling factors
        # HINT: Use scaling_factors.unsqueeze(0) to broadcast correctly
        smoothed_activations_2d = # Your code here - divide activations by scaling factors
        
        # Reshape back to original shape
        if len(original_shape) == 3:
            smoothed_activations = smoothed_activations_2d.reshape(original_shape)
        else:
            smoothed_activations = smoothed_activations_2d
        
        # TODO: Apply scaling to weights (MULTIPLY by scaling factors)
        # HINT: For linear layer weights [out_features, in_features], scale along input dimension
        # HINT: Use scaling_factors.unsqueeze(0) to broadcast along output dimension
        smoothed_weights = # Your code here - multiply weights by scaling factors
        
        return smoothed_activations, smoothed_weights, scaling_factors
    
    def get_migration_statistics(self, original_activations, original_weights, 
                               smoothed_activations, smoothed_weights):
        """
        Compute statistics showing the effect of smoothing.
        """
        stats = {}
        
        # TODO: Compute maximum absolute values for original and smoothed tensors
        # HINT: Use torch.max(torch.abs(tensor.flatten())) for each tensor
        orig_act_max = # Your code here - max abs value of original activations
        smooth_act_max = # Your code here - max abs value of smoothed activations
        
        orig_weight_max = # Your code here - max abs value of original weights  
        smooth_weight_max = # Your code here - max abs value of smoothed weights
        
        # TODO: Calculate reduction/increase ratios
        stats['activation_max_reduction'] = # Your code here - orig_act_max / smooth_act_max
        stats['weight_max_increase'] = # Your code here - smooth_weight_max / orig_weight_max
        
        stats['original_act_max'] = float(orig_act_max)
        stats['smoothed_act_max'] = float(smooth_act_max)
        stats['original_weight_max'] = float(orig_weight_max)
        stats['smoothed_weight_max'] = float(smooth_weight_max)
        
        return stats

# TODO: Initialize SmoothQuant with α=0.5 (balanced smoothing)
# HINT: Create an instance of SmoothQuant class with alpha=0.5
smooth_quant = # Your code here - create SmoothQuant instance

print(f"Initialized SmoothQuant with α={smooth_quant.alpha}")
print("✅ Complete the TODOs above to implement the SmoothQuant algorithm!")

## 7. Apply SmoothQuant Transformation - HANDS-ON PRACTICE

We now apply the SmoothQuant transformation to our representative layer and analyze the results.

**🔥 HANDS-ON PRACTICE**: Complete the transformation application and analysis below!

In [None]:
# TODO: Apply SmoothQuant to the representative layer
print(f"Applying SmoothQuant to layer: {fc1_layer_name}")

# TODO: Apply smoothing using the smooth_quant instance
# HINT: Call smooth_quant.apply_smoothing() with fc1_activations, fc1_weights, and fc1_layer_name
smoothed_activations, smoothed_weights, scaling_factors = # Your code here

print(f"Original activations shape: {fc1_activations.shape}")
print(f"Smoothed activations shape: {smoothed_activations.shape}")
print(f"Original weights shape: {fc1_weights.shape}")
print(f"Smoothed weights shape: {smoothed_weights.shape}")
print(f"Scaling factors shape: {scaling_factors.shape}")

# TODO: Get migration statistics
# HINT: Call smooth_quant.get_migration_statistics() with the four tensors
migration_stats = # Your code here

print("\n🎯 SmoothQuant Migration Results:")
print(f"  Activation outliers reduced by: {migration_stats['activation_max_reduction']:.3f}x")
print(f"  Weight outliers increased by: {migration_stats['weight_max_increase']:.3f}x")
print(f"  Original activation max: {migration_stats['original_act_max']:.6f}")
print(f"  Smoothed activation max: {migration_stats['smoothed_act_max']:.6f}")
print(f"  Original weight max: {migration_stats['original_weight_max']:.6f}")
print(f"  Smoothed weight max: {migration_stats['smoothed_weight_max']:.6f}")

# TODO: Verify the mathematical equivalence (advanced challenge!)
# HINT: Check that the matrix multiplication results are approximately equal
# For Y = X @ W.T vs Y' = X' @ W'.T where X' and W' are smoothed versions
print("\n🔬 Mathematical Verification:")
print("Checking if Y = XW^T ≈ Y' = X'W'^T (mathematical equivalence)")

# Sample small batch for verification
sample_size = min(10, fc1_activations.shape[0])
sample_act = fc1_activations[:sample_size]
sample_smooth_act = smoothed_activations[:sample_size]

# TODO: Compute original output: Y = X @ W.T
# HINT: Use torch.matmul() or @ operator
original_output = # Your code here

# TODO: Compute smoothed output: Y' = X' @ W'.T  
# HINT: Use the smoothed activations and smoothed weights
smoothed_output = # Your code here

# TODO: Compute the difference between outputs
# HINT: Use torch.mean(torch.abs(original_output - smoothed_output))
output_difference = # Your code here

print(f"  Mean absolute difference between outputs: {float(output_difference):.8f}")
print(f"  ✅ Outputs should be nearly identical (difference ≈ 0)")

if float(output_difference) < 1e-5:
    print("  🎉 Mathematical equivalence verified!")
else:
    print("  ⚠️  Check your implementation - outputs should be nearly identical")

### 💡 Hints and Solution Guidelines

**For the SmoothQuant Implementation:**

1. **Reshaping tensors**: When you have 3D activations [batch, seq_len, channels], reshape to 2D [batch*seq_len, channels] using `.reshape(-1, tensor.shape[-1])`

2. **Computing channel maximums**: Use `torch.max(torch.abs(activations), dim=0)[0]` to get max absolute values per channel

3. **Scaling factor formula**: The core SmoothQuant formula is `s_j = (max|x_j|)^α`

4. **Applying scaling**: 
   - Activations: **divide** by scaling factors (reduce outliers)
   - Weights: **multiply** by scaling factors (transfer difficulty)

5. **Broadcasting**: Use `.unsqueeze(0)` to add batch dimension for proper broadcasting

6. **Mathematical verification**: The key insight is that `X @ W.T` should equal `(X/s) @ (W*s).T` due to mathematical equivalence

**Expected Results:**
- Activation outliers should be reduced (factor > 1)
- Weight outliers should increase (factor > 1) 
- Mathematical equivalence error should be < 1e-5
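The effect of α on the two "Expected Results" ratios can be explored on synthetic data. This is a hedged sketch using the simplified scaling factor from this workshop; the tensors, the outlier channel, and the α grid are illustrative assumptions, and real layers may behave differently.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))        # toy [tokens, in_features] activations
X[:, 5] *= 30.0                     # outlier channel
W = rng.normal(size=(4, 8)) * 0.05  # toy [out_features, in_features] weights

act_max = np.abs(X).max(axis=0)
results = []
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    s = np.clip(act_max ** alpha, 1e-8, None)
    results.append((alpha, np.abs(X / s).max(), np.abs(W * s).max()))

for alpha, x_peak, w_peak in results:
    print(f"alpha={alpha:.2f}  max|X/s|={x_peak:8.3f}  max|W*s|={w_peak:8.3f}")
```

As α grows, the activation peak shrinks while the weight peak grows; at α = 1 every activation channel is normalized to a maximum of 1 and the weights carry the full range.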

## 8. Post-SmoothQuant Visualization

We create visualizations of the smoothed distributions to compare with the original ones.

In [None]:
# Analyze smoothed distributions
print("Analyzing smoothed distributions...")

smoothed_activation_stats = analyze_distribution_stats(
    smoothed_activations, 
    f"Smoothed Activations - {fc1_layer_name}"
)

smoothed_weight_stats = analyze_distribution_stats(
    smoothed_weights, 
    f"Smoothed Weights - {fc1_layer_name}"
)

In [None]:
# Create 3D plots for smoothed distributions
print("Creating 3D visualizations for smoothed distributions...")

# Plot smoothed activations
# (the helper defined in Section 5 already calls plt.show())
fig_act_smooth = create_3d_distribution_plot_matplotlib(
    smoothed_activations, 
    "Smoothed Activations Distribution (Channel × Token × Absolute Value)"
)

# Plot smoothed weights
fig_weight_smooth = create_3d_distribution_plot_matplotlib(
    smoothed_weights.T,
    "Smoothed Weights Distribution (Input Channel × Output Channel × Absolute Value)"
)

## 9. Quantization Analysis - HANDS-ON PRACTICE

Now we demonstrate the quantization benefits of SmoothQuant by comparing quantization errors before and after smoothing.
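As background for the exercise, the sketch below shows one common way to simulate symmetric per-tensor quantization in NumPy and how a single outlier inflates the rounding error for all other values. `fake_quantize_symmetric` is a hypothetical helper written for this illustration; the data and the outlier value of 100 are arbitrary.

```python
import numpy as np

def fake_quantize_symmetric(x, bits=8):
    """Round x onto a symmetric integer grid and map it back to floats."""
    qmax = 2 ** (bits - 1) - 1                         # 127 for 8 bits
    scale = max(float(np.abs(x).max()), 1e-8) / qmax   # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
x_outlier = x.copy()
x_outlier[0] = 100.0                                   # a single large outlier

err_plain = np.mean(np.abs(x - fake_quantize_symmetric(x)))
err_outlier = np.mean(np.abs(x_outlier - fake_quantize_symmetric(x_outlier)))
print(f"mean abs error without outlier: {err_plain:.5f}")
print(f"mean abs error with outlier:    {err_outlier:.5f}")
```

The outlier stretches the quantization step, so the ordinary values are rounded far more coarsely. Reducing the activation maximum, as SmoothQuant does, shrinks that step.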

**🔥 HANDS-ON PRACTICE**: Complete the quantization simulation and analysis!

In [None]:
def quantize_tensor(tensor, bits=8, symmetric=True):
    """
    Simulate quantization of a tensor.
    
    Args:
        tensor (torch.Tensor): Input tensor
        bits (int): Number of quantization bits
        symmetric (bool): Whether to use symmetric quantization
    
    Returns:
        tuple: (quantized_tensor, quantization_error)
    """
    if symmetric:
        # TODO: Implement symmetric quantization
        # HINT: Find max absolute value, compute scale, quantize and dequantize
        max_val = # Your code here - find maximum absolute value
        scale = # Your code here - compute scale factor using 2^(bits-1) - 1
        quantized = # Your code here - quantize: round(tensor/scale) * scale
    else:
        # Asymmetric quantization (provided for reference)
        min_val = torch.min(tensor)
        max_val = torch.max(tensor)
        qmax = 2**bits - 1
        scale = torch.clamp((max_val - min_val) / qmax, min=1e-8)
        zero_point = torch.round(-min_val / scale)
        # Quantize to the integer grid [0, qmax], then dequantize
        q = torch.clamp(torch.round(tensor / scale) + zero_point, 0, qmax)
        quantized = (q - zero_point) * scale
    
    # TODO: Compute quantization error
    # HINT: Use torch.mean(torch.abs(tensor - quantized))
    error = # Your code here - compute mean absolute error
    
    return quantized, error

def analyze_quantization_impact(original_tensor, smoothed_tensor, name, bits=8):
    """
    Compare quantization impact on original vs smoothed tensors.
    """
    # TODO: Quantize original tensor
    # HINT: Call quantize_tensor() function
    quant_orig, error_orig = # Your code here
    
    # TODO: Quantize smoothed tensor  
    # HINT: Call quantize_tensor() function
    quant_smooth, error_smooth = # Your code here
    
    # TODO: Calculate improvement (error reduction ratio)
    # HINT: Divide original error by smoothed error
    error_reduction = # Your code here
    
    print(f"\n📊 {name} Quantization Analysis ({bits}-bit):")
    print(f"  Original quantization error: {float(error_orig):.6f}")
    print(f"  Smoothed quantization error: {float(error_smooth):.6f}")
    print(f"  Error reduction: {float(error_reduction):.3f}x")
    
    # Feedback on the error ratio (a ratio below 1 is expected for weights)
    if error_reduction > 1.1:
        print(f"  ✅ Quantization error decreased after smoothing")
    elif error_reduction > 0.9:
        print(f"  ⚠️  Quantization error is roughly unchanged")
    else:
        print(f"  ℹ️  Quantization error increased (expected for weights; for activations, check your implementation)")
    
    return {
        'original_error': float(error_orig),
        'smoothed_error': float(error_smooth),
        'error_reduction': float(error_reduction)
    }

# TODO: Analyze quantization impact on activations
# HINT: Call analyze_quantization_impact() with fc1_activations and smoothed_activations
activation_quant_results = # Your code here

# TODO: Analyze quantization impact on weights
# HINT: Call analyze_quantization_impact() with fc1_weights and smoothed_weights  
weight_quant_results = # Your code here

print("\n🎯 Key Insights:")
print("• Activations should show significant error reduction (SmoothQuant's main benefit)")
print("• Weights may show increased error (this is expected - difficulty migrated here)")
print("• Overall system benefits from easier activation quantization")

## 10. Model Performance Evaluation

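Finally, we measure the perplexity of the original model as a baseline and outline how the same evaluation would be repeated once SmoothQuant has been applied to the whole model. The model-wide transformation in this section is a simplified placeholder, so no post-smoothing perplexity is reported here.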
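Perplexity over several texts is the exponential of the token-weighted average loss. A causal language model's loss on a sequence of n tokens is typically averaged over the n - 1 positions that have a next-token target, so that is the weight each text should get. The pure-Python sketch below illustrates the arithmetic; `corpus_perplexity` and the loss values are hypothetical and not produced by the model.

```python
import math

def corpus_perplexity(mean_losses, sequence_lengths):
    """Combine per-sequence mean losses into one token-weighted perplexity."""
    total_nll = 0.0
    total_predictions = 0
    for loss, length in zip(mean_losses, sequence_lengths):
        predictions = length - 1        # the first token has no prediction target
        if predictions <= 0:
            continue                    # skip sequences too short to score
        total_nll += loss * predictions
        total_predictions += predictions
    if total_predictions == 0:
        raise ValueError("no scorable tokens")
    return math.exp(total_nll / total_predictions)

# Hypothetical per-text mean losses (nats per token) and token counts.
ppl = corpus_perplexity([3.2, 2.9, 3.5], [120, 45, 201])
print(f"perplexity: {ppl:.2f}")
```

Weighting by predicted tokens keeps long texts from being under-counted relative to short ones.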

In [None]:
def evaluate_perplexity(model, tokenizer, eval_texts, max_length=256):
    """
    Evaluate model perplexity on given texts.
    """
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for text in eval_texts:
            # Tokenize
            inputs = tokenizer(
                text, 
                return_tensors="pt", 
                max_length=max_length, 
                truncation=True
            )
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            # Forward pass
            outputs = model(**inputs, labels=inputs['input_ids'])
            
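            # Accumulate loss weighted by predicted tokens: the model's loss is
            # the mean over the n - 1 shifted positions, not all n tokens
            num_predicted = inputs['input_ids'].numel() - 1
            total_loss += outputs.loss.item() * num_predicted
            total_tokens += num_predicted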
    
    # Calculate perplexity
    avg_loss = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_loss))
    
    return float(perplexity)

# Prepare evaluation texts
eval_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
eval_texts = [example['text'] for example in eval_dataset if len(example['text'].strip()) > 50][:20]

print(f"Evaluating on {len(eval_texts)} validation texts...")

# Evaluate original model
original_perplexity = evaluate_perplexity(model, tokenizer, eval_texts)
print(f"Original model perplexity: {original_perplexity:.3f}")

In [None]:
# Create a modified model with SmoothQuant applied
def apply_smoothquant_to_model(model, smooth_quant_instance):
    """
    Apply SmoothQuant transformations to the entire model.
    Note: This is a simplified version for demonstration.
    """
    modified_model = model  # In practice, you'd create a copy
    
    # For demonstration, we'll just show that the scaling factors have been computed
    print("SmoothQuant scaling factors computed for layers:")
    for layer_name, factors in smooth_quant_instance.scaling_factors.items():
        print(f"  {layer_name}: {factors.shape}")
    
    return modified_model

# Apply SmoothQuant (simplified for demonstration)
smoothed_model = apply_smoothquant_to_model(model, smooth_quant)

print("\nScaling factors computed; the model weights are unchanged in this simplified demo.")
print("In a full implementation, you would:")
print("1. Compute scaling factors for the inputs of all target linear layers")
print("2. Fold 1/s into the preceding LayerNorm (where one exists) and s into the linear weights")
print("3. Quantize the smoothed model and re-evaluate perplexity")
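One way to apply the scaling without adding a runtime division is to fold 1/s into the parameters of the LayerNorm that feeds the linear layer, which is the approach the SmoothQuant paper describes. The NumPy sketch below checks this on toy parameters; the `layer_norm` helper, shapes, and random values are assumptions for illustration and do not touch the OPT model.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over the last axis with learnable scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 8))        # toy input to the LayerNorm
gamma = rng.uniform(0.5, 2.0, size=8)    # LayerNorm weight
beta = rng.normal(size=8) * 0.1          # LayerNorm bias
W = rng.normal(size=(12, 8)) * 0.1       # following linear layer [out, in]
b = rng.normal(size=12) * 0.1

X = layer_norm(hidden, gamma, beta)      # activations entering the linear layer
s = np.clip(np.abs(X).max(axis=0) ** 0.5, 1e-8, None)

# Fold 1/s into the LayerNorm parameters and s into the weight columns.
gamma_s, beta_s = gamma / s, beta / s
W_s = W * s

Y = X @ W.T + b
Y_folded = layer_norm(hidden, gamma_s, beta_s) @ W_s.T + b
print(np.allclose(Y, Y_folded))
```

Since LayerNorm ends with an elementwise scale and shift, dividing both by `s` yields exactly `X / s`, so the smoothed activations come out of the LayerNorm at no extra cost.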

## 11. Results Summary and Comparison

Let's summarize our findings and create comparison visualizations.

In [None]:
# Create summary comparison plots using matplotlib
def create_comparison_plot_matplotlib():
    """Create side-by-side comparison plots using matplotlib."""
    import matplotlib.pyplot as plt
    import numpy as np
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle("SmoothQuant Results Summary", fontsize=16)
    
    # Activation histograms
    orig_act_flat = fc1_activations.flatten().numpy()
    smooth_act_flat = smoothed_activations.flatten().numpy()
    
    axes[0, 0].hist(orig_act_flat, bins=50, alpha=0.7, label="Original")
    axes[0, 0].hist(smooth_act_flat, bins=50, alpha=0.7, label="Smoothed")
    axes[0, 0].set_title("Activation Distribution (Before vs After)")
    axes[0, 0].set_xlabel("Value")
    axes[0, 0].set_ylabel("Frequency")
    axes[0, 0].legend()
    
    # Weight histograms
    orig_weight_flat = fc1_weights.flatten().numpy()
    smooth_weight_flat = smoothed_weights.flatten().numpy()
    
    axes[0, 1].hist(orig_weight_flat, bins=50, alpha=0.7, label="Original")
    axes[0, 1].hist(smooth_weight_flat, bins=50, alpha=0.7, label="Smoothed")
    axes[0, 1].set_title("Weight Distribution (Before vs After)")
    axes[0, 1].set_xlabel("Value")
    axes[0, 1].set_ylabel("Frequency")
    axes[0, 1].legend()
    
    # Quantization error comparison
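    # Grouped bars so original and smoothed errors do not overlap
    x_pos = np.arange(2)
    width = 0.35
    axes[1, 0].bar(x_pos - width / 2, [activation_quant_results['original_error'], weight_quant_results['original_error']], width, alpha=0.7, label="Original Error")
    axes[1, 0].bar(x_pos + width / 2, [activation_quant_results['smoothed_error'], weight_quant_results['smoothed_error']], width, alpha=0.7, label="Smoothed Error")
    axes[1, 0].set_xticks(x_pos)
    axes[1, 0].set_xticklabels(["Activations", "Weights"])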
    axes[1, 0].set_title("Quantization Error Comparison")
    axes[1, 0].set_ylabel("Error")
    axes[1, 0].legend()
    
    # Migration statistics
    axes[1, 1].bar(["Act. Max Reduction", "Weight Max Increase"], [migration_stats['activation_max_reduction'], migration_stats['weight_max_increase']], alpha=0.7, label="Migration Effect")
    axes[1, 1].set_title("Migration Statistics")
    axes[1, 1].set_ylabel("Effect")
    axes[1, 1].legend()
    
    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()
    
# Create and show comparison plot using matplotlib
create_comparison_plot_matplotlib()

In [None]:
# Print final summary
print("\n" + "="*60)
print("SMOOTHQUANT IMPLEMENTATION SUMMARY")
print("="*60)

print(f"\nModel Analyzed: {model_name}")
print(f"Target Layer: {fc1_layer_name}")
print(f"Smoothing Parameter (α): {smooth_quant.alpha}")

print("\nMigration Results:")
print(f"   • Activation outliers reduced by: {migration_stats['activation_max_reduction']:.2f}x")
print(f"   • Weight outliers increased by: {migration_stats['weight_max_increase']:.2f}x")

print("\nQuantization Improvements:")
print(f"   • Activation quantization error reduced by: {activation_quant_results['error_reduction']:.2f}x")
print(f"   • Weight quantization error changed by: {weight_quant_results['error_reduction']:.2f}x")

print("\nKey Insights:")
print("   • SmoothQuant successfully migrates quantization difficulty from activations to weights")
print("   • Activation outliers are smoothed, making them easier to quantize")
print("   • The mathematical equivalence preserves model functionality")
print("   • 3D visualizations clearly show the distribution changes")

print("\n" + "="*60)
print("Workshop completed successfully! 🎉")
print("="*60)