# Energy-Aware Quantization Measurement Harness
## ESE 5390 Final Project: LLM Quantization Energy Measurement

This notebook implements a harness that measures energy consumption, latency, throughput, and memory footprint of transformer models at different precision levels (FP32, FP16, INT8). Labels are loaded alongside the inputs, but the cells below do not compute accuracy.

In [1]:
!git clone https://github.com/krishkc5/energy_aware_quantization.git

Cloning into 'energy_aware_quantization'...
remote: Enumerating objects: 118, done.
remote: Counting objects: 100% (118/118), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 118 (delta 48), reused 77 (delta 21), pack-reused 0 (from 0)
Receiving objects: 100% (118/118), 314.03 KiB | 4.30 MiB/s, done.
Resolving deltas: 100% (48/48), done.


## 0. Setup and Imports

In [2]:
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd
import json
import time
import subprocess
import threading
from pathlib import Path
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.6.0+cu124
CUDA available: True
CUDA device: Tesla P100-PCIE-16GB


## 1. Configuration Management

In [None]:
def create_default_config() -> Dict:
    """
    Create default experiment configuration.
    Modify these values as needed for your experiments.
    """
    config = {
        "model_name": "distilbert-base-uncased-finetuned-sst-2-english",
        "precision": "fp32",  # Will be overridden per experiment
        "batch_size": 8,
        "seq_len": 128,
        "num_loops": 300,  # Number of measurement iterations
        "warmup_loops": 50,  # Number of warmup iterations
        "dataset_path": "./data/pre_tokenized.pt",  # Path to pre-tokenized dataset
        "device": "cuda" if torch.cuda.is_available() else "cpu",
        "num_trials": 5,  # Number of trials per precision level
        "power_poll_interval_ms": 100,  # Power sampling interval
        "results_dir": "./results"
    }
    return config

def load_config(config_path: Optional[str] = None) -> Dict:
    """
    Load configuration from JSON file or use defaults.
    """
    if config_path and Path(config_path).exists():
        with open(config_path, 'r') as f:
            config = json.load(f)
        print(f"Loaded config from {config_path}")
    else:
        config = create_default_config()
        print("Using default configuration")
    
    # Create results directory if it doesn't exist
    Path(config["results_dir"]).mkdir(parents=True, exist_ok=True)
    
    return config

def save_config(config: Dict, path: str):
    """
    Save configuration to JSON file.
    """
    with open(path, 'w') as f:
        json.dump(config, f, indent=2)
    print(f"Config saved to {path}")

# Load configuration
config = load_config()
print("\nConfiguration:")
print(json.dumps(config, indent=2))
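
`load_config` uses a JSON file as-is when one is found, so a file that omits a key such as `results_dir` would raise a `KeyError` on the `mkdir` line. One possible variation is to overlay the file on top of the defaults. The sketch below is an assumption about how that could look, not part of the harness: `load_config_merged` and the `DEFAULTS` dictionary are hypothetical names, and `DEFAULTS` stands in for the output of `create_default_config()`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; in the notebook these would come from create_default_config().
DEFAULTS = {"batch_size": 8, "num_loops": 300, "results_dir": "./results"}

def load_config_merged(config_path=None, defaults=DEFAULTS):
    """Return the defaults overlaid with any keys found in the JSON file."""
    config = dict(defaults)  # copy so the defaults stay untouched
    if config_path and Path(config_path).exists():
        with open(config_path, "r") as f:
            config.update(json.load(f))
    return config

# Usage: a file that overrides only one key still yields a complete config.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "partial.json"
    path.write_text(json.dumps({"batch_size": 32}))
    merged = load_config_merged(str(path))

print(merged)
```

Keys present in the file win and every other key keeps its default value.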

## 2. Dataset Loading (Zero-IO During Measurement)

In [None]:
def load_pre_tokenized_dataset(dataset_path: str, device: str) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Load pre-tokenized dataset and move to GPU.
    This ensures zero IO during measurement loops.
    
    Expected format: dict with keys 'input_ids', 'attention_mask', 'labels'
    """
    print(f"Loading dataset from {dataset_path}...")
    data = torch.load(dataset_path)
    
    input_ids = data["input_ids"].to(device)
    attention_mask = data["attention_mask"].to(device)
    labels = data["labels"].to(device)
    
    print(f"Dataset loaded: {input_ids.shape[0]} samples")
    print(f"Sequence length: {input_ids.shape[1]}")
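    # Report the footprint of all three tensors, not only input_ids
    total_bytes = sum(t.element_size() * t.nelement() for t in (input_ids, attention_mask, labels))
    print(f"Memory on device: {total_bytes / 1024**2:.2f} MB")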
    
    return input_ids, attention_mask, labels

def batched_iterator(input_ids: torch.Tensor, attention_mask: torch.Tensor, batch_size: int):
    """
    Create a cycling batch iterator with no IO.
    Always yields full batches; a trailing partial batch is skipped and
    iteration wraps around to the start of the dataset.
    """
    N = input_ids.size(0)
    if N < batch_size:
        # Without this check the loop below could never produce a full batch
        raise ValueError(f"Dataset has {N} samples, fewer than batch_size={batch_size}")
    
    idx = 0
    while True:
        # Wrap around when the remaining samples cannot fill a batch
        if idx + batch_size > N:
            idx = 0
        
        yield input_ids[idx:idx + batch_size], attention_mask[idx:idx + batch_size]
        idx += batch_size

## 3. Model Loading Per Precision

In [None]:
def load_model(model_name: str, precision: str, device: str):
    """
    Load model with specified precision level.
    
    Args:
        model_name: HuggingFace model identifier
        precision: 'fp32', 'fp16', or 'int8'
        device: 'cuda' or 'cpu'
    
    Returns:
        model: Loaded model in eval mode
    """
    print(f"\nLoading model: {model_name} with precision: {precision}")
    
    # Load base model
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        torch_dtype=torch.float32  # Start with FP32
    )
    
    model.to(device)
    model.eval()
    
    # Apply precision conversion
    if precision == "fp16":
        print("Converting to FP16...")
        model = model.half()
    
    elif precision == "int8":
        print("Applying INT8 quantization...")
        # Post-training dynamic quantization
        if device == "cpu":
            model = torch.quantization.quantize_dynamic(
                model,
                {nn.Linear},  # Quantize linear layers
                dtype=torch.qint8
            )
        else:
            # For GPU, you might use different quantization methods
            # This is a placeholder - adjust based on your quantization approach
            print("Note: INT8 quantization on GPU requires specific libraries (e.g., bitsandbytes)")
            print("Using FP16 as fallback for GPU INT8...")
            model = model.half()
    
    elif precision != "fp32":
        raise ValueError(f"Unknown precision: {precision}")
    
    # Calculate model size
    param_size = sum(p.nelement() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.nelement() * b.element_size() for b in model.buffers())
    model_size_mb = (param_size + buffer_size) / 1024**2
    
    print(f"Model size: {model_size_mb:.2f} MB")
    print(f"Number of parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M")
    
    return model

def get_model_info(model) -> Dict:
    """
    Get detailed model information.
    """
    param_size = sum(p.nelement() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.nelement() * b.element_size() for b in model.buffers())
    
    return {
        "model_size_mb": (param_size + buffer_size) / 1024**2,
        "num_parameters": sum(p.numel() for p in model.parameters()),
        "num_trainable_params": sum(p.numel() for p in model.parameters() if p.requires_grad)
    }

## 4. Warmup Phase

In [None]:
def warmup(model, batch_iter, num_iters: int, device: str):
    """
    Warmup phase to stabilize GPU clocks and compile kernels.
    No timing or power logging during this phase.
    """
    print(f"Running warmup for {num_iters} iterations...")
    
    with torch.no_grad():
        for i in range(num_iters):
            input_ids, attention_mask = next(batch_iter)
            _ = model(input_ids, attention_mask=attention_mask)
            
            if i % 10 == 0:
                print(f"  Warmup iteration {i}/{num_iters}", end='\r')
    
    if device == "cuda":
        torch.cuda.synchronize()
    
    print(f"\nWarmup complete.")

## 5. Power Logger

In [None]:
class PowerLogger:
    """
    Power logger using nvidia-smi to sample GPU power consumption.
    """
    
    def __init__(self, poll_interval_ms: int = 100, device_id: int = 0):
        """
        Args:
            poll_interval_ms: Sampling interval in milliseconds
            device_id: GPU device ID to monitor
        """
        self.poll_interval_ms = poll_interval_ms
        self.device_id = device_id
        self.proc = None
        self.samples = []
        self.reader_thread = None
        self.stop_flag = threading.Event()
    
    def start(self):
        """
        Start power logging subprocess.
        """
        print("Starting power logger...")
        self.samples = []
        self.stop_flag.clear()
        
        # nvidia-smi command for power monitoring
        # Adjust this command based on your environment (especially Kaggle)
        cmd = [
            "nvidia-smi",
            "--query-gpu=power.draw",
            "--format=csv,noheader,nounits",
            f"--loop-ms={self.poll_interval_ms}",
            f"--id={self.device_id}"
        ]
        
        try:
            self.proc = subprocess.Popen(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
                bufsize=1
            )
            
            # Start reader thread
            self.reader_thread = threading.Thread(target=self._read_output)
            self.reader_thread.start()
            
            # Give it a moment to start
            time.sleep(0.2)
            
        except Exception as e:
            print(f"Warning: Could not start nvidia-smi: {e}")
            print("Power measurements will not be available.")
            self.proc = None
    
    def _read_output(self):
        """
        Read power samples from subprocess in background thread.
        """
        if self.proc is None:
            return
        
        while not self.stop_flag.is_set():
            line = self.proc.stdout.readline()
            if not line:
                break
            
            try:
                power = float(line.strip())
                self.samples.append(power)
            except ValueError:
                continue
    
    def stop(self):
        """
        Stop power logging and collect samples.
        """
        print("Stopping power logger...")
        self.stop_flag.set()
        
        if self.proc:
            self.proc.terminate()
            try:
                self.proc.wait(timeout=2)
            except subprocess.TimeoutExpired:
                self.proc.kill()
        
        if self.reader_thread:
            self.reader_thread.join(timeout=2)
        
        print(f"Collected {len(self.samples)} power samples")
    
    def get_samples(self) -> List[float]:
        """
        Return collected power samples in watts.
        """
        return self.samples.copy()
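
`PowerLogger` stores bare watt values, so the samples cannot later be aligned with the start and end of the timed loop. A minimal sketch of one way to keep a host-side timestamp with each reading follows. It is an illustration under assumptions rather than a change to the class: `parse_power_line` and `collect_timestamped` are hypothetical helpers, and the input lines are made up to show one non-numeric line being skipped.

```python
import time
from typing import Iterable, List, Optional, Tuple

def parse_power_line(line: str) -> Optional[float]:
    """Return the reading in watts, or None if the line is not numeric."""
    try:
        return float(line.strip())
    except ValueError:
        return None

def collect_timestamped(lines: Iterable[str], clock=time.perf_counter) -> List[Tuple[float, float]]:
    """Pair each valid reading with the host time at which it was read."""
    samples = []
    for line in lines:
        power = parse_power_line(line)
        if power is not None:
            samples.append((clock(), power))
    return samples

# Illustrative input: two numeric readings and one non-numeric line.
fake_output = ["41.25\n", "[N/A]\n", " 43.10 \n"]
samples = collect_timestamped(fake_output)
print([p for _, p in samples])
```

Using `time.perf_counter` keeps the timestamps on the same clock as the timed inference loop, which makes it possible to select only the samples that fall inside the measurement window.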

## 6. Timed Inference Loop

In [None]:
def run_inference_loop(model, batch_iter, num_loops: int, device: str) -> float:
    """
    Run timed inference loop without power logging.
    
    Args:
        model: Model to run inference on
        batch_iter: Batch iterator
        num_loops: Number of inference iterations
        device: Device type
    
    Returns:
        total_time: Total time in seconds
    """
    print(f"Running {num_loops} timed inference iterations...")
    
    # Synchronize before starting timer
    if device == "cuda":
        torch.cuda.synchronize()
    
    start = time.perf_counter()
    
    with torch.no_grad():
        for i in range(num_loops):
            input_ids, attention_mask = next(batch_iter)
            _ = model(input_ids, attention_mask=attention_mask)
            
            if i % 50 == 0:
                print(f"  Iteration {i}/{num_loops}", end='\r')
    
    # Synchronize before stopping timer
    if device == "cuda":
        torch.cuda.synchronize()
    
    end = time.perf_counter()
    total_time = end - start
    
    print(f"\nInference complete. Total time: {total_time:.3f}s")
    
    return total_time

## 7. Combined Measurement (Time + Power)

In [None]:
def measure_with_power(model, batch_iter, num_loops: int, device: str, 
                       poll_interval_ms: int = 100) -> Tuple[float, List[float]]:
    """
    Run timed inference with simultaneous power logging.
    
    Returns:
        total_time: Total inference time in seconds
        power_samples: List of power measurements in watts
    """
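    logger = PowerLogger(poll_interval_ms=poll_interval_ms)
    
    # Start power logging
    logger.start()
    
    # Small delay to ensure logger is running
    time.sleep(0.5)
    
    # Readings collected so far were taken while the GPU was idle.
    # Remember how many there are so they can be excluded from the average.
    num_idle_samples = len(logger.get_samples())
    
    # Run timed inference
    total_time = run_inference_loop(model, batch_iter, num_loops, device)
    
    # Stop power logging
    logger.stop()
    
    # Keep only the samples read after the inference loop started
    power_samples = logger.get_samples()[num_idle_samples:]
    
    return total_time, power_samples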

## 8. Energy and Latency Computation

In [None]:
def compute_energy_metrics(power_samples: List[float], total_time: float, 
                          num_inferences: int, batch_size: int) -> Dict:
    """
    Compute energy and latency metrics from power samples and timing.
    
    Args:
        power_samples: List of power measurements in watts
        total_time: Total inference time in seconds
        num_inferences: Total number of inference iterations
        batch_size: Batch size used
    
    Returns:
        Dictionary with computed metrics
    """
    metrics = {}
    
    # Latency metrics
    total_samples = num_inferences * batch_size
    metrics["latency_per_sample_s"] = total_time / total_samples
    metrics["latency_per_sample_ms"] = metrics["latency_per_sample_s"] * 1000
    metrics["latency_per_batch_s"] = total_time / num_inferences
    metrics["latency_per_batch_ms"] = metrics["latency_per_batch_s"] * 1000
    metrics["throughput_samples_per_sec"] = total_samples / total_time
    
    # Power and energy metrics
    if len(power_samples) > 0:
        metrics["avg_power_w"] = float(np.mean(power_samples))
        metrics["std_power_w"] = float(np.std(power_samples))
        metrics["min_power_w"] = float(np.min(power_samples))
        metrics["max_power_w"] = float(np.max(power_samples))
        metrics["num_power_samples"] = len(power_samples)
        
        # Energy computation: P * t
        metrics["energy_total_j"] = metrics["avg_power_w"] * total_time
        metrics["energy_per_sample_j"] = metrics["energy_total_j"] / total_samples
        metrics["energy_per_sample_mj"] = metrics["energy_per_sample_j"] * 1000
        metrics["energy_per_batch_j"] = metrics["energy_total_j"] / num_inferences
    else:
        print("Warning: No power samples collected")
        metrics["avg_power_w"] = None
        metrics["energy_total_j"] = None
        metrics["energy_per_sample_j"] = None
    
    # GPU memory metrics (if available)
    if torch.cuda.is_available():
        metrics["peak_memory_mb"] = torch.cuda.max_memory_allocated() / 1024**2
        metrics["current_memory_mb"] = torch.cuda.memory_allocated() / 1024**2
    
    return metrics

def print_metrics(metrics: Dict, precision: str, trial: int):
    """
    Pretty print metrics.
    """
    print(f"\n{'='*60}")
    print(f"Metrics for {precision.upper()} - Trial {trial}")
    print(f"{'='*60}")
    print(f"Latency per sample: {metrics['latency_per_sample_ms']:.3f} ms")
    print(f"Throughput: {metrics['throughput_samples_per_sec']:.2f} samples/sec")
    
    if metrics['avg_power_w'] is not None:
        print(f"Average power: {metrics['avg_power_w']:.2f} W")
        print(f"Energy per sample: {metrics['energy_per_sample_mj']:.3f} mJ")
        print(f"Total energy: {metrics['energy_total_j']:.3f} J")
    
    if 'peak_memory_mb' in metrics:
        print(f"Peak GPU memory: {metrics['peak_memory_mb']:.2f} MB")
    print(f"{'='*60}\n")
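
`compute_energy_metrics` multiplies the mean of the power samples by the elapsed time, so any sample taken outside the timed loop shifts the result. If each sample carried a timestamp, the energy could instead be integrated over the measurement window only. The following is a sketch of that idea using the trapezoidal rule. `integrate_energy` is a hypothetical helper and the trace values are invented for illustration; the harness itself does not record timestamps.

```python
from typing import List, Tuple

def integrate_energy(samples: List[Tuple[float, float]], t_start: float, t_end: float) -> float:
    """Trapezoidal integral of power (W) over time (s) for samples in [t_start, t_end]."""
    window = [(t, p) for t, p in samples if t_start <= t <= t_end]
    if len(window) < 2:
        raise ValueError("need at least two samples inside the window")
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(window, window[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy

# Hypothetical trace: idle at 30 W, then a load at 100 W from t=1.0 s to t=2.0 s.
trace = [(0.0, 30.0), (0.5, 30.0), (1.0, 100.0), (1.5, 100.0), (2.0, 100.0)]
energy_window = integrate_energy(trace, t_start=1.0, t_end=2.0)

# Mean-times-duration over all samples, idle ones included, for comparison.
mean_power_all = sum(p for _, p in trace) / len(trace)
energy_mean = mean_power_all * 1.0  # 1.0 s of timed inference

print(f"windowed integral: {energy_window:.1f} J")
print(f"mean x time:       {energy_mean:.1f} J")
```

In this made-up trace the two idle samples pull the mean-based estimate well below the windowed integral, which is why the sampling window matters for short runs.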

## 9. Multi-Trial Runner

In [None]:
def run_experiments_for_precision(config: Dict, precision: str, 
                                 num_trials: int = 5) -> List[Dict]:
    """
    Run multiple trials for a given precision level.
    
    Args:
        config: Experiment configuration
        precision: Precision level ('fp32', 'fp16', 'int8')
        num_trials: Number of trials to run
    
    Returns:
        List of metric dictionaries, one per trial
    """
    print(f"\n{'#'*60}")
    print(f"# Running experiments for {precision.upper()}")
    print(f"# Number of trials: {num_trials}")
    print(f"{'#'*60}\n")
    
    device = config["device"]
    
    # Load dataset once (stays on GPU)
    print("Loading dataset...")
    input_ids, attention_mask, labels = load_pre_tokenized_dataset(
        config["dataset_path"], 
        device
    )
    
    results = []
    
    for trial in range(num_trials):
        print(f"\n{'*'*60}")
        print(f"* Trial {trial + 1}/{num_trials} for {precision.upper()}")
        print(f"{'*'*60}")
        
        # Reset GPU memory tracking
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.empty_cache()
        
        # Load model for this trial
        model = load_model(config["model_name"], precision, device)
        model_info = get_model_info(model)
        
        # Create batch iterator
        batch_iter = batched_iterator(input_ids, attention_mask, config["batch_size"])
        
        # Warmup
        warmup(model, batch_iter, config["warmup_loops"], device)
        
        # Measurement with power logging
        total_time, power_samples = measure_with_power(
            model, 
            batch_iter, 
            config["num_loops"], 
            device,
            config["power_poll_interval_ms"]
        )
        
        # Compute metrics
        metrics = compute_energy_metrics(
            power_samples,
            total_time,
            config["num_loops"],
            config["batch_size"]
        )
        
        # Add metadata
        metrics["precision"] = precision
        metrics["trial"] = trial
        metrics["batch_size"] = config["batch_size"]
        metrics["seq_len"] = config["seq_len"]
        metrics["num_loops"] = config["num_loops"]
        metrics["model_size_mb"] = model_info["model_size_mb"]
        metrics["num_parameters"] = model_info["num_parameters"]
        
        # Print and store (trial numbers are shown 1-based, matching the banner above)
        print_metrics(metrics, precision, trial + 1)
        results.append(metrics)
        
        # Clean up
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    
    return results

## 10. Aggregate Results Across Trials

In [None]:
def aggregate_trials(trial_results: List[Dict]) -> Dict:
    """
    Aggregate metrics across multiple trials.
    
    Returns:
        Dictionary with mean and std for each metric
    """
    if not trial_results:
        return {}
    
    # Extract numeric metrics
    numeric_keys = [
        "latency_per_sample_ms", "latency_per_batch_ms",
        "throughput_samples_per_sec",
        "avg_power_w", "energy_per_sample_j", "energy_per_sample_mj",
        "energy_total_j", "peak_memory_mb"
    ]
    
    aggregated = {}
    
    # Copy metadata from first trial
    aggregated["precision"] = trial_results[0]["precision"]
    aggregated["batch_size"] = trial_results[0]["batch_size"]
    aggregated["seq_len"] = trial_results[0]["seq_len"]
    aggregated["num_trials"] = len(trial_results)
    aggregated["model_size_mb"] = trial_results[0]["model_size_mb"]
    
    # Compute mean and std for each metric
    for key in numeric_keys:
        if key in trial_results[0] and trial_results[0][key] is not None:
            values = [r[key] for r in trial_results if key in r and r[key] is not None]
            if values:
                aggregated[f"{key}_mean"] = float(np.mean(values))
                aggregated[f"{key}_std"] = float(np.std(values))
                aggregated[f"{key}_min"] = float(np.min(values))
                aggregated[f"{key}_max"] = float(np.max(values))
    
    return aggregated

def print_aggregated_results(agg: Dict):
    """
    Print aggregated results in a readable format.
    """
    print(f"\n{'='*70}")
    print(f"AGGREGATED RESULTS: {agg['precision'].upper()}")
    print(f"Number of trials: {agg['num_trials']}")
    print(f"{'='*70}")
    
    print(f"\nLatency:")
    if 'latency_per_sample_ms_mean' in agg:
        print(f"  Per sample: {agg['latency_per_sample_ms_mean']:.3f} ± {agg['latency_per_sample_ms_std']:.3f} ms")
    
    print(f"\nThroughput:")
    if 'throughput_samples_per_sec_mean' in agg:
        print(f"  {agg['throughput_samples_per_sec_mean']:.2f} ± {agg['throughput_samples_per_sec_std']:.2f} samples/sec")
    
    print(f"\nPower:")
    if 'avg_power_w_mean' in agg:
        print(f"  Average: {agg['avg_power_w_mean']:.2f} ± {agg['avg_power_w_std']:.2f} W")
    
    print(f"\nEnergy:")
    if 'energy_per_sample_mj_mean' in agg:
        print(f"  Per sample: {agg['energy_per_sample_mj_mean']:.3f} ± {agg['energy_per_sample_mj_std']:.3f} mJ")
    
    print(f"\nMemory:")
    print(f"  Model size: {agg['model_size_mb']:.2f} MB")
    if 'peak_memory_mb_mean' in agg:
        print(f"  Peak GPU memory: {agg['peak_memory_mb_mean']:.2f} ± {agg['peak_memory_mb_std']:.2f} MB")
    
    print(f"{'='*70}\n")
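
`aggregate_trials` calls `np.std` with its default arguments, which gives the population standard deviation (division by n). With only a handful of trials the sample standard deviation (division by n - 1) is noticeably larger, and it may be the more appropriate value for error bars. The sketch below shows the difference with the standard library. The latency values are hypothetical.

```python
import statistics

# Hypothetical per-trial latencies in milliseconds (five trials).
latencies_ms = [2.0, 2.1, 1.9, 2.2, 1.8]

population_std = statistics.pstdev(latencies_ms)  # divides by n, like np.std's default
sample_std = statistics.stdev(latencies_ms)       # divides by n - 1

print(f"population std: {population_std:.4f} ms")
print(f"sample std:     {sample_std:.4f} ms")
```

In NumPy the sample version is available through `np.std(values, ddof=1)`.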

## 11. Results Writing

In [None]:
def write_results_to_csv(all_results: List[Dict], output_path: str):
    """
    Write aggregated results to CSV file.
    """
    df = pd.DataFrame(all_results)
    df.to_csv(output_path, index=False)
    print(f"Results written to {output_path}")

def write_detailed_results(trial_results: List[Dict], output_path: str):
    """
    Write detailed per-trial results to CSV.
    """
    df = pd.DataFrame(trial_results)
    df.to_csv(output_path, index=False)
    print(f"Detailed results written to {output_path}")

def save_results_json(results: Dict, output_path: str):
    """
    Save results as JSON for easy loading later.
    """
    with open(output_path, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"Results saved to {output_path}")

## 12. Main Experiment Runner

In [None]:
def run_all_experiments(config: Dict, precisions: List[str]):
    """
    Main entry point: run experiments for all precision levels.
    
    Args:
        config: Experiment configuration
        precisions: List of precision levels to test (e.g., ['fp32', 'fp16', 'int8'])
    """
    print("\n" + "#"*70)
    print("# STARTING ENERGY MEASUREMENT EXPERIMENTS")
    print("#"*70)
    print(f"\nConfiguration:")
    print(f"  Model: {config['model_name']}")
    print(f"  Batch size: {config['batch_size']}")
    print(f"  Sequence length: {config['seq_len']}")
    print(f"  Measurement loops: {config['num_loops']}")
    print(f"  Warmup loops: {config['warmup_loops']}")
    print(f"  Trials per precision: {config['num_trials']}")
    print(f"  Device: {config['device']}")
    print(f"  Precisions to test: {precisions}")
    
    all_aggregated = []
    all_detailed = []
    
    for precision in precisions:
        # Run trials for this precision
        trial_results = run_experiments_for_precision(
            config, 
            precision, 
            num_trials=config["num_trials"]
        )
        
        # Aggregate results
        aggregated = aggregate_trials(trial_results)
        print_aggregated_results(aggregated)
        
        # Store results
        all_aggregated.append(aggregated)
        all_detailed.extend(trial_results)
        
        # Save intermediate results
        results_dir = Path(config["results_dir"])
        save_results_json(
            {"aggregated": aggregated, "trials": trial_results},
            results_dir / f"{precision}_results.json"
        )
    
    # Write final results
    results_dir = Path(config["results_dir"])
    write_results_to_csv(all_aggregated, results_dir / "results_aggregated.csv")
    write_detailed_results(all_detailed, results_dir / "results_detailed.csv")
    
    print("\n" + "#"*70)
    print("# EXPERIMENTS COMPLETE")
    print("#"*70)
    print(f"\nResults saved to: {results_dir}")
    
    return all_aggregated, all_detailed

## 13. Example: Create Pre-tokenized Dataset

This cell shows how to prepare a pre-tokenized dataset that can be loaded directly to GPU.

In [None]:
def create_pre_tokenized_dataset(model_name: str, dataset_name: str = "sst2", 
                                 num_samples: int = 1000, output_path: str = "./data/pre_tokenized.pt"):
    """
    Create and save a pre-tokenized dataset.
    This is done ONCE before running experiments.
    
    Example for SST-2 dataset.
    """
    from datasets import load_dataset
    
    print(f"Creating pre-tokenized dataset...")
    print(f"  Model: {model_name}")
    print(f"  Dataset: {dataset_name}")
    print(f"  Samples: {num_samples}")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Load dataset
    if dataset_name == "sst2":
        dataset = load_dataset("glue", "sst2", split="validation")
    else:
        raise ValueError(f"Dataset {dataset_name} not implemented")
    
    # Limit to num_samples
    if len(dataset) > num_samples:
        dataset = dataset.select(range(num_samples))
    
    # Tokenize one sentence at a time; each call returns tensors of shape (1, 128)
    all_input_ids = []
    all_attention_mask = []
    all_labels = []
    
    for item in dataset:
        encoded = tokenizer(
            item["sentence"],
            padding="max_length",
            truncation=True,
            max_length=128,
            return_tensors="pt"
        )
        all_input_ids.append(encoded["input_ids"])
        all_attention_mask.append(encoded["attention_mask"])
        all_labels.append(item["label"])
    
    # Stack into tensors
    input_ids = torch.cat(all_input_ids, dim=0)
    attention_mask = torch.cat(all_attention_mask, dim=0)
    labels = torch.tensor(all_labels)
    
    # Save
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    torch.save({
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }, output_path)
    
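    # Compute the size first: a replacement field that spans several lines
    # inside a single-quoted f-string is a syntax error before Python 3.12
    size_mb = (input_ids.element_size() * input_ids.nelement() +
               attention_mask.element_size() * attention_mask.nelement() +
               labels.element_size() * labels.nelement()) / 1024**2
    
    print(f"\nDataset saved to: {output_path}")
    print(f"  Shape: {input_ids.shape}")
    print(f"  Size: {size_mb:.2f} MB")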

# Uncomment to create dataset
# create_pre_tokenized_dataset(
#     model_name="distilbert-base-uncased-finetuned-sst-2-english",
#     dataset_name="sst2",
#     num_samples=1000,
#     output_path="./data/pre_tokenized.pt"
# )

## 14. RUN EXPERIMENTS

This is the main execution cell. Run this after:
1. Creating your pre-tokenized dataset
2. Adjusting the config as needed

In [None]:
# Define precision levels to test
precisions_to_test = ["fp32", "fp16"]  # Add "int8" when ready

# Update config if needed
config["num_trials"] = 5
config["num_loops"] = 300
config["warmup_loops"] = 50
config["batch_size"] = 8

# Save config
save_config(config, Path(config["results_dir"]) / "experiment_config.json")

# Run all experiments
aggregated_results, detailed_results = run_all_experiments(config, precisions_to_test)

## 15. Results Analysis and Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

def plot_comparison(aggregated_results: List[Dict]):
    """
    Create comparison plots for different precision levels.
    """
    df = pd.DataFrame(aggregated_results)
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Latency comparison
    ax = axes[0, 0]
    if 'latency_per_sample_ms_mean' in df.columns:
        x = range(len(df))
        ax.bar(x, df['latency_per_sample_ms_mean'], 
               yerr=df['latency_per_sample_ms_std'],
               capsize=5, alpha=0.7)
        ax.set_xticks(x)
        ax.set_xticklabels(df['precision'].str.upper())
        ax.set_ylabel('Latency (ms)')
        ax.set_title('Latency per Sample')
        ax.grid(axis='y', alpha=0.3)
    
    # Energy comparison
    ax = axes[0, 1]
    if 'energy_per_sample_mj_mean' in df.columns:
        x = range(len(df))
        ax.bar(x, df['energy_per_sample_mj_mean'],
               yerr=df['energy_per_sample_mj_std'],
               capsize=5, alpha=0.7, color='orange')
        ax.set_xticks(x)
        ax.set_xticklabels(df['precision'].str.upper())
        ax.set_ylabel('Energy (mJ)')
        ax.set_title('Energy per Sample')
        ax.grid(axis='y', alpha=0.3)
    
    # Throughput comparison
    ax = axes[1, 0]
    if 'throughput_samples_per_sec_mean' in df.columns:
        x = range(len(df))
        ax.bar(x, df['throughput_samples_per_sec_mean'],
               yerr=df['throughput_samples_per_sec_std'],
               capsize=5, alpha=0.7, color='green')
        ax.set_xticks(x)
        ax.set_xticklabels(df['precision'].str.upper())
        ax.set_ylabel('Throughput (samples/sec)')
        ax.set_title('Inference Throughput')
        ax.grid(axis='y', alpha=0.3)
    
    # Model size comparison
    ax = axes[1, 1]
    if 'model_size_mb' in df.columns:
        x = range(len(df))
        ax.bar(x, df['model_size_mb'], alpha=0.7, color='purple')
        ax.set_xticks(x)
        ax.set_xticklabels(df['precision'].str.upper())
        ax.set_ylabel('Model Size (MB)')
        ax.set_title('Model Size in Memory (Parameters + Buffers)')
        ax.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    
    # Save figure
    results_dir = Path(config["results_dir"])
    plt.savefig(results_dir / "comparison_plots.png", dpi=300, bbox_inches='tight')
    print(f"\nPlots saved to {results_dir / 'comparison_plots.png'}")
    plt.show()

# Generate plots
if aggregated_results:
    plot_comparison(aggregated_results)

## 16. Summary Statistics Table

In [None]:
def create_summary_table(aggregated_results: List[Dict]):
    """
    Create a formatted summary table.
    """
    summary_data = []
    
    for result in aggregated_results:
        row = {
            "Precision": result["precision"].upper(),
            "Latency (ms)": f"{result.get('latency_per_sample_ms_mean', 0):.3f} ± {result.get('latency_per_sample_ms_std', 0):.3f}",
            "Throughput (samples/s)": f"{result.get('throughput_samples_per_sec_mean', 0):.2f} ± {result.get('throughput_samples_per_sec_std', 0):.2f}",
            "Energy (mJ)": f"{result.get('energy_per_sample_mj_mean', 0):.3f} ± {result.get('energy_per_sample_mj_std', 0):.3f}" if result.get('energy_per_sample_mj_mean') else "N/A",
            "Power (W)": f"{result.get('avg_power_w_mean', 0):.2f} ± {result.get('avg_power_w_std', 0):.2f}" if result.get('avg_power_w_mean') else "N/A",
            "Model Size (MB)": f"{result.get('model_size_mb', 0):.2f}",
            "Peak Memory (MB)": f"{result.get('peak_memory_mb_mean', 0):.2f}" if result.get('peak_memory_mb_mean') else "N/A"
        }
        summary_data.append(row)
    
    df = pd.DataFrame(summary_data)
    
    print("\n" + "="*100)
    print("SUMMARY TABLE")
    print("="*100)
    print(df.to_string(index=False))
    print("="*100 + "\n")
    
    # Save to CSV
    results_dir = Path(config["results_dir"])
    df.to_csv(results_dir / "summary_table.csv", index=False)
    print(f"Summary table saved to {results_dir / 'summary_table.csv'}")
    
    return df

if aggregated_results:
    summary_df = create_summary_table(aggregated_results)

## Notes and Next Steps

### Checklist for "Done" Measurement Harness:

- [ ] Dataset loads from .pt → tensors on GPU, no IO in loop
- [ ] Model loads once per trial, moved to GPU, correct precision
- [ ] Warmup runs N iterations, no timing/power
- [ ] Timed loop yields stable total_time (small variation across trials)
- [ ] PowerLogger returns non-empty list of watt values during loop
- [ ] Energy metrics compute without errors
- [ ] Per-precision results appear in CSV/JSON with all metrics
- [ ] No code inside measurement loop does: disk IO, tokenization, CPU↔GPU copies, model loading
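
The checklist item about a stable `total_time` can be turned into a number. A minimal sketch, assuming a coefficient of variation below 5% counts as stable: the threshold, the helper names `coefficient_of_variation` and `is_stable`, and the example timings are all assumptions chosen for illustration.

```python
import statistics
from typing import List

def coefficient_of_variation(values: List[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

def is_stable(total_times: List[float], max_cv: float = 0.05) -> bool:
    """True if trial-to-trial variation is below max_cv (an assumed 5% threshold)."""
    return coefficient_of_variation(total_times) < max_cv

# Hypothetical total_time values in seconds from five trials.
steady = [4.02, 3.98, 4.01, 4.00, 3.99]
noisy = [4.0, 5.5, 3.2, 4.8, 6.1]

print("steady:", is_stable(steady), f"(CV = {coefficient_of_variation(steady):.3%})")
print("noisy: ", is_stable(noisy), f"(CV = {coefficient_of_variation(noisy):.3%})")
```

The same check could be applied to the `latency_per_batch_ms` values in the detailed per-trial results.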

### Important Notes:

1. **Power Monitoring on Kaggle**: The nvidia-smi command may need adjustment for Kaggle environments. Test with a simple cell first.

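2. **INT8 Quantization**: The INT8 path applies PyTorch's dynamic quantization to `nn.Linear` layers, and only when the device is CPU. On GPU, `load_model` prints a note and falls back to FP16, so any "int8" results collected on GPU are really FP16 measurements. True INT8 on GPU needs a library such as `bitsandbytes` or a custom quantization scheme.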

3. **Dataset Creation**: Make sure to run the dataset creation cell once before running experiments.

4. **Memory Management**: The harness includes GPU memory cleanup between trials to ensure fair comparison.

5. **Results**: All results are saved in multiple formats (CSV, JSON) for easy analysis and sharing.
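
For the first note above, a single-shot query is a quick way to confirm that `nvidia-smi` reports power on the current machine before starting the looping logger. The sketch below reuses the query flags from `PowerLogger.start` without `--loop-ms`. `read_power_once` is a hypothetical helper, and it returns `None` whenever the tool is missing, fails, or prints something that is not a number.

```python
import shutil
import subprocess
from typing import Optional

def read_power_once(device_id: int = 0, timeout_s: float = 5.0) -> Optional[float]:
    """Return one power.draw reading in watts, or None if it cannot be obtained."""
    if shutil.which("nvidia-smi") is None:
        return None  # tool not installed or not on PATH
    cmd = [
        "nvidia-smi",
        "--query-gpu=power.draw",
        "--format=csv,noheader,nounits",
        f"--id={device_id}",
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=timeout_s, check=True)
        return float(out.stdout.strip().splitlines()[0])
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return None

reading = read_power_once()
print("power.draw:", "unavailable" if reading is None else f"{reading:.2f} W")
```

If this prints "unavailable", the energy columns in the results will be empty and only the latency and memory metrics will be meaningful.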