# Day 1: Pre-tokenized Dataset for Energy Measurement
## Optimized for Kaggle GPU P100

This notebook creates a pre-tokenized dataset so that energy measurements carry **ZERO I/O overhead**: no tokenization, disk reads, or CPU→GPU transfers happen inside the measured loop.

**⚠️ IMPORTANT: Enable GPU in Kaggle**
- Go to Settings (right panel) → Accelerator → GPU P100
- Click "Save" and wait for session to restart

## Step 1: Verify GPU Access

In [2]:
import torch

# Check GPU availability
print("="*60)
print("GPU CHECK")
print("="*60)
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print(f"Device count: {torch.cuda.device_count()}")
    print(f"Current device: {torch.cuda.current_device()}")
    
    # Check memory
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Total GPU memory: {total_memory:.2f} GB")
    print("\n✓ GPU is ready!")
else:
    print("\n⚠️ WARNING: GPU not available!")
    print("Please enable GPU in Kaggle settings (Accelerator → GPU P100)")

print("="*60)

GPU CHECK
PyTorch version: 2.6.0+cu124
CUDA available: True
CUDA version: 12.4
Device name: Tesla P100-PCIE-16GB
Device count: 1
Current device: 0
Total GPU memory: 17.06 GB

✓ GPU is ready!


## Step 2: Install Dependencies

Kaggle has most packages pre-installed; this cell makes sure the ones we need are present.

In [3]:
# Install required packages if missing (add -U to also upgrade them)
!pip install -q transformers datasets accelerate

print("✓ Dependencies installed/verified")


## Step 3: Define Dataset Preparation Functions

These functions will create our pre-tokenized dataset.

In [None]:
"""
Dataset Preparation Module
Pre-tokenize dataset to eliminate I/O overhead during energy measurement
"""

import torch
from transformers import DistilBertTokenizer
from datasets import load_dataset
from pathlib import Path
import json


def prepare_tokenized_dataset(
    num_samples: int = None,
    max_length: int = 128,
    dataset_name: str = "sst2",
    output_dir: str = "/kaggle/working/tokenized_data",
    seed: int = 42
):
    """
    Pre-tokenize dataset and save to disk.
    
    Args:
        num_samples: Number of examples to tokenize. If None, uses the entire split.
        max_length: Sequence length in tokens; inputs are truncated or padded to this length
        dataset_name: Which GLUE task to use ("sst2" or "mnli")
        output_dir: Directory to save tokenized data
        seed: Random seed for reproducibility
    """
    
    print("="*60)
    print("Pre-tokenizing Dataset for Energy Measurement")
    print("="*60)
    
    # Create output directory (and any missing parent directories)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Load tokenizer
    print("\n[1/5] Loading DistilBERT tokenizer...")
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    
    # Load dataset
    print(f"[2/5] Loading {dataset_name} validation set...")
    if dataset_name == "sst2":
        dataset = load_dataset("glue", "sst2", split="validation")
        text_key = "sentence"
        label_key = "label"
        num_labels = 2
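    elif dataset_name == "mnli":
        # NOTE: MNLI is a sentence-pair task; only the premise is tokenized
        # here, so the hypothesis is ignored.
        dataset = load_dataset("glue", "mnli", split="validation_matched")
        text_key = "premise"
        label_key = "label"
        num_labels = 3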
    else:
        raise ValueError(f"Dataset {dataset_name} not supported")
    
    # Get full dataset size
    full_dataset_size = len(dataset)
    print(f"  Full dataset size: {full_dataset_size} samples")
    
    # Sample examples (if num_samples is specified and less than full size)
    if num_samples is None:
        print(f"[3/5] Using entire dataset ({full_dataset_size} samples)...")
        dataset = dataset.shuffle(seed=seed)
        actual_num_samples = full_dataset_size
    elif num_samples >= full_dataset_size:
        print(f"[3/5] Requested {num_samples} samples, but dataset only has {full_dataset_size}. Using all {full_dataset_size} samples...")
        dataset = dataset.shuffle(seed=seed)
        actual_num_samples = full_dataset_size
    else:
        print(f"[3/5] Selecting {num_samples} examples from {full_dataset_size} (seed={seed})...")
        dataset = dataset.shuffle(seed=seed).select(range(num_samples))
        actual_num_samples = num_samples
    
    # Tokenize all examples
    print(f"[4/5] Tokenizing with max_length={max_length}...")
    texts = [example[text_key] for example in dataset]
    labels = [example[label_key] for example in dataset]
    
    # Tokenize in batch
    encodings = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )
    
    labels_tensor = torch.tensor(labels, dtype=torch.long)
    
    # Save tensors
    print(f"[5/5] Saving to {output_dir}...")
    torch.save(encodings['input_ids'], output_path / 'input_ids.pt')
    torch.save(encodings['attention_mask'], output_path / 'attention_mask.pt')
    torch.save(labels_tensor, output_path / 'labels.pt')
    
    # Save metadata
    metadata = {
        'num_samples': actual_num_samples,
        'max_length': max_length,
        'dataset_name': dataset_name,
        'num_labels': num_labels,
        'seed': seed,
        'tokenizer': 'distilbert-base-uncased',
    }
    
    with open(output_path / 'metadata.json', 'w') as f:
        json.dump(metadata, f, indent=2)
    
    # Print summary
    print("\n" + "="*60)
    print("Dataset Preparation Complete!")
    print("="*60)
    print(f"Number of samples:     {actual_num_samples}")
    print(f"Max sequence length:   {max_length}")
    print(f"Dataset:               {dataset_name}")
    print(f"Number of labels:      {num_labels}")
    print(f"\nSaved files:")
    print(f"  - input_ids.pt       {encodings['input_ids'].shape}")
    print(f"  - attention_mask.pt  {encodings['attention_mask'].shape}")
    print(f"  - labels.pt          {labels_tensor.shape}")
    print(f"  - metadata.json")
    print("="*60)
    
    # Show examples
    print("\nFirst 3 examples:")
    for i in range(min(3, actual_num_samples)):
        print(f"\n{i+1}. {texts[i][:70]}...")
        print(f"   Label: {labels[i]}")
    
    return metadata


class PreTokenizedDataset:
    """
    Efficient dataset class for pre-tokenized data.
    Zero I/O overhead during iteration.
    """
    
    def __init__(self, data_dir: str = "/kaggle/working/tokenized_data"):
        """Load pre-tokenized dataset from disk."""
        data_path = Path(data_dir)
        
        # Load all data into memory once
        self.input_ids = torch.load(data_path / 'input_ids.pt')
        self.attention_mask = torch.load(data_path / 'attention_mask.pt')
        self.labels = torch.load(data_path / 'labels.pt')
        
        with open(data_path / 'metadata.json', 'r') as f:
            self.metadata = json.load(f)
        
        self.num_samples = len(self.labels)
    
    def __len__(self):
        return self.num_samples
    
    def __getitem__(self, idx):
        """Get a single example."""
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }
    
    def get_batch(self, batch_size: int = 8):
        """Iterate over batches with zero I/O overhead."""
        for i in range(0, self.num_samples, batch_size):
            end_idx = min(i + batch_size, self.num_samples)
            yield {
                'input_ids': self.input_ids[i:end_idx],
                'attention_mask': self.attention_mask[i:end_idx],
                'labels': self.labels[i:end_idx]
            }
    
    def to_device(self, device):
        """Move all tensors to device (GPU) at once."""
        self.input_ids = self.input_ids.to(device)
        self.attention_mask = self.attention_mask.to(device)
        self.labels = self.labels.to(device)
        return self


print("✓ Dataset preparation functions defined")

✓ Dataset preparation functions defined
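
The tokenizer call in `prepare_tokenized_dataset` uses `padding='max_length'` and `truncation=True`, so every row comes out exactly `max_length` tokens long. The following is a minimal pure-Python sketch of that behavior for a single sentence, not the tokenizer's actual implementation. The helper name `pad_or_truncate` and the example ids are illustrative; the special-token ids are those of the `distilbert-base-uncased` vocabulary.

```python
from typing import List, Tuple

# Special-token ids in the distilbert-base-uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pad_or_truncate(token_ids: List[int], max_length: int = 128) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP], then truncate the body.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Right-pad with [PAD]; padded positions get attention mask 0.
    pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * pad, attention_mask + [0] * pad


# Three arbitrary word-piece ids, padded to a short length for readability.
ids, mask = pad_or_truncate([2023, 2003, 2307], max_length=8)
print(ids)   # [101, 2023, 2003, 2307, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Because every row has the same length, the whole dataset stacks into one `[num_samples, max_length]` tensor, which is what makes slicing batches in `get_batch` free of any per-batch collation work.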


## Step 4: Create Pre-tokenized Dataset

Create 50 pre-tokenized examples from the SST-2 validation set.

In [5]:
# Create the tokenized dataset
metadata = prepare_tokenized_dataset(
    num_samples=50,
    max_length=128,
    dataset_name='sst2',
    output_dir='/kaggle/working/tokenized_data',
    seed=42
)

Pre-tokenizing Dataset for Energy Measurement

[1/5] Loading DistilBERT tokenizer...


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

[2/5] Loading sst2 validation set...


README.md: 0.00B [00:00, ?B/s]

sst2/train-00000-of-00001.parquet:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

sst2/validation-00000-of-00001.parquet:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

sst2/test-00000-of-00001.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

[3/5] Selecting 50 examples (seed=42)...
[4/5] Tokenizing with max_length=128...
[5/5] Saving to /kaggle/working/tokenized_data...

Dataset Preparation Complete!
Number of samples:     50
Max sequence length:   128
Dataset:               sst2
Number of labels:      2

Saved files:
  - input_ids.pt       torch.Size([50, 128])
  - attention_mask.pt  torch.Size([50, 128])
  - labels.pt          torch.Size([50])
  - metadata.json

First 3 examples:

1. it gets onto the screen just about as much of the novella as one could...
   Label: 1

2. my big fat greek wedding uses stereotypes in a delightful blend of swe...
   Label: 1

3. for the most part , director anne-sophie birot 's first feature is a s...
   Label: 1
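
The shapes printed above determine how much raw data each saved tensor holds. As a rough check, assuming the tokenizer returns 64-bit integer tensors (the default for `return_tensors='pt'`, and the `torch.long` dtype used for the labels), the payload sizes can be worked out by hand. The helper `tensor_nbytes` is an illustrative name, not part of the notebook.

```python
from typing import Tuple


def tensor_nbytes(shape: Tuple[int, ...], bytes_per_element: int = 8) -> int:
    """Raw payload size of a dense tensor (int64 = 8 bytes per element)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element


sizes = {
    "input_ids.pt": tensor_nbytes((50, 128)),
    "attention_mask.pt": tensor_nbytes((50, 128)),
    "labels.pt": tensor_nbytes((50,)),
}
for name, nbytes in sizes.items():
    print(f"{name:18s} {nbytes:6d} bytes ({nbytes / 1024:.1f} KiB)")
```

Each `[50, 128]` tensor holds 51,200 bytes (50.0 KiB) and the labels hold 400 bytes. The files on disk are somewhat larger because `torch.save` adds serialization overhead; the `ls -lh` listing in Step 12 reports 52K for the two large tensors.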


## Step 5: Verify the Dataset

In [6]:
# Load and verify the dataset
dataset = PreTokenizedDataset('/kaggle/working/tokenized_data')

print("="*60)
print("Dataset Verification")
print("="*60)
print(f"Number of samples: {len(dataset)}")
print(f"Metadata: {dataset.metadata}")

# Check first example
example = dataset[0]
print(f"\nFirst example:")
print(f"  input_ids shape:      {example['input_ids'].shape}")
print(f"  attention_mask shape: {example['attention_mask'].shape}")
print(f"  label:                {example['labels'].item()}")

# Test batch iteration
print(f"\nBatch iteration test (batch_size=8):")
batch_count = 0
for batch in dataset.get_batch(8):
    batch_count += 1
    print(f"  Batch {batch_count}: {batch['input_ids'].shape}, labels: {batch['labels'].tolist()}")

print(f"\n✓ Dataset verified! Total batches: {batch_count}")
print("="*60)

Dataset Verification
Number of samples: 50
Metadata: {'num_samples': 50, 'max_length': 128, 'dataset_name': 'sst2', 'num_labels': 2, 'seed': 42, 'tokenizer': 'distilbert-base-uncased'}

First example:
  input_ids shape:      torch.Size([128])
  attention_mask shape: torch.Size([128])
  label:                1

Batch iteration test (batch_size=8):
  Batch 1: torch.Size([8, 128]), labels: [1, 1, 1, 1, 1, 0, 1, 0]
  Batch 2: torch.Size([8, 128]), labels: [1, 0, 0, 1, 1, 1, 1, 0]
  Batch 3: torch.Size([8, 128]), labels: [1, 0, 1, 0, 0, 1, 0, 0]
  Batch 4: torch.Size([8, 128]), labels: [0, 0, 0, 1, 0, 1, 1, 1]
  Batch 5: torch.Size([8, 128]), labels: [1, 0, 1, 0, 1, 1, 0, 1]
  Batch 6: torch.Size([8, 128]), labels: [1, 0, 0, 1, 1, 0, 1, 1]
  Batch 7: torch.Size([2, 128]), labels: [0, 1]

✓ Dataset verified! Total batches: 7
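
The seven batches above follow directly from the slicing in `get_batch`: 50 samples at `batch_size=8` give six full batches and one final batch of 2. A small sketch of the same arithmetic (the helper `batch_sizes` is illustrative):

```python
import math
from typing import List


def batch_sizes(num_samples: int, batch_size: int) -> List[int]:
    """Sizes of the batches produced by stepping through the data in order."""
    return [
        min(batch_size, num_samples - start)
        for start in range(0, num_samples, batch_size)
    ]


sizes = batch_sizes(50, 8)
print(len(sizes), sizes)  # 7 [8, 8, 8, 8, 8, 8, 2]
assert len(sizes) == math.ceil(50 / 8)
```

The short final batch matters for per-batch comparisons: it does less work than the others, so per-sample figures are safer to compare than per-batch ones.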


## Step 6: Load DistilBERT Model

In [14]:
from transformers import DistilBertForSequenceClassification
import torch

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load the DistilBERT checkpoint that is already fine-tuned on SST-2
print("\nLoading fine-tuned DistilBERT model for SST-2...")
model = DistilBertForSequenceClassification.from_pretrained(
    # Not the base 'distilbert-base-uncased', whose classification head is untrained
    'distilbert-base-uncased-finetuned-sst-2-english',
    num_labels=2  # Binary classification for SST-2
)

model = model.to(device)
model.eval()

print("✓ Fine-tuned model loaded and moved to", device)

# Show model size
param_count = sum(p.numel() for p in model.parameters())
print(f"\nModel parameters: {param_count:,} ({param_count/1e6:.1f}M)")

Using device: cuda

Loading fine-tuned DistilBERT model for SST-2...


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

✓ Fine-tuned model loaded and moved to cuda

Model parameters: 66,955,010 (67.0M)


## Step 7: Move Dataset to GPU (ONE TIME)

**KEY FEATURE:** Move all data to GPU once. During measurement, there will be ZERO CPU→GPU transfers!

In [15]:
# Reload dataset and move to GPU
dataset = PreTokenizedDataset('/kaggle/working/tokenized_data')

if torch.cuda.is_available():
    print("Moving dataset to GPU...")
    dataset.to_device(device)
    print("✓ Dataset on GPU")
    
    # Verify
    batch = next(dataset.get_batch(8))
    print(f"\nVerification:")
    print(f"  Batch input_ids device: {batch['input_ids'].device}")
    print(f"  Batch labels device:    {batch['labels'].device}")
    
    # Check GPU memory
    allocated = torch.cuda.memory_allocated(0) / 1e6
    print(f"\nGPU memory allocated: {allocated:.2f} MB")
else:
    print("CPU mode - dataset stays in CPU memory")

Moving dataset to GPU...
✓ Dataset on GPU

Verification:
  Batch input_ids device: cuda:0
  Batch labels device:    cuda:0

GPU memory allocated: 280.07 MB
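
To put the 280.07 MB figure in context, here is a back-of-the-envelope estimate of what the model weights and the dataset tensors should occupy. It assumes FP32 weights (4 bytes each) and int64 dataset tensors (8 bytes per element), and uses the parameter count printed in Step 6.

```python
BYTES_FP32 = 4
BYTES_INT64 = 8

param_count = 66_955_010           # printed in Step 6
num_samples, max_length = 50, 128  # dataset created in Step 4

weights_bytes = param_count * BYTES_FP32
# input_ids + attention_mask (two [50, 128] tensors) plus the labels
dataset_bytes = 2 * num_samples * max_length * BYTES_INT64 + num_samples * BYTES_INT64

print(f"Model weights (FP32): {weights_bytes / 1e6:.2f} MB")  # 267.82 MB
print(f"Dataset tensors:      {dataset_bytes / 1e6:.2f} MB")  # 0.10 MB
```

Almost all of the allocated memory is the model; the dataset accounts for about 0.1 MB. The remaining roughly 12 MB is not attributed by this estimate and may come from other tensors still alive in the session.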


## Step 8: Run Baseline Inference (FP32)

Test the complete pipeline with **ZERO I/O** during inference.

In [16]:
import time

print("="*60)
print("FP32 Baseline Inference (ZERO I/O!)")
print("="*60)

# Warmup
print("\nWarming up (2 batches)...")
with torch.no_grad():
    for i, batch in enumerate(dataset.get_batch(8)):
        if i >= 2:
            break
        _ = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask']
        )

if torch.cuda.is_available():
    torch.cuda.synchronize()

print("✓ Warmup complete")

# Actual inference measurement
print("\nRunning inference...")
correct = 0
total = 0
start_time = time.perf_counter()

with torch.no_grad():
    for batch in dataset.get_batch(batch_size=8):
        # Forward pass (NO I/O!)
        outputs = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask']
        )
        
        # Compute accuracy (note: .item() copies a scalar to the CPU and
        # synchronizes with the GPU once per batch)
        preds = outputs.logits.argmax(dim=-1)
        correct += (preds == batch['labels']).sum().item()
        total += len(batch['labels'])

if torch.cuda.is_available():
    torch.cuda.synchronize()

end_time = time.perf_counter()

# Results
latency = end_time - start_time
accuracy = 100.0 * correct / total
throughput = total / latency

print("\n" + "="*60)
print("Results")
print("="*60)
print(f"Accuracy:    {accuracy:.2f}% ({correct}/{total})")
print(f"Latency:     {latency:.3f} seconds")
print(f"Throughput:  {throughput:.2f} samples/second")
print(f"Per-sample:  {latency/total*1000:.2f} ms")
print("="*60)

# Sanity check
if accuracy > 75:
    print("\n✓ PASS: Accuracy looks good!")
else:
    print("\n⚠️ WARNING: Accuracy is low. Expected ~85-90% for fine-tuned DistilBERT on SST-2")

FP32 Baseline Inference (ZERO I/O!)

Warming up (2 batches)...
✓ Warmup complete

Running inference...

Results
Accuracy:    86.00% (43/50)
Latency:     0.108 seconds
Throughput:  461.90 samples/second
Per-sample:  2.16 ms

✓ PASS: Accuracy looks good!


## Step 9: Measure GPU Power (nvidia-smi)

Now let's measure GPU power consumption during inference.

In [17]:
# Check if nvidia-smi is available
import subprocess

try:
    result = subprocess.run(['nvidia-smi', '--query-gpu=name,power.draw', 
                           '--format=csv,noheader'], 
                          capture_output=True, text=True, timeout=2)
    
    if result.returncode == 0:
        print("="*60)
        print("GPU Power Monitoring Available")
        print("="*60)
        print(result.stdout.strip())
        print("\n✓ nvidia-smi is available for power monitoring")
    else:
        print("⚠️ nvidia-smi not responding properly")
except Exception as e:
    print(f"⚠️ nvidia-smi not available: {e}")
    print("This is expected in some Kaggle environments.")
    print("You may need to use your local machine or a cloud GPU with nvidia-smi access.")

GPU Power Monitoring Available
Tesla P100-PCIE-16GB, 37.27 W

✓ nvidia-smi is available for power monitoring
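
A single `nvidia-smi` reading gives instantaneous power in watts; energy is power integrated over time. The sketch below shows one way to parse a row in the format printed above and to turn timestamped readings into joules with the trapezoidal rule. Both helper names are illustrative, the sample readings are made up, and this is not necessarily how the team's harness computes energy.

```python
from typing import List, Tuple


def parse_power_line(line: str) -> Tuple[str, float]:
    """Parse one 'name, power.draw' CSV row, e.g. 'Tesla P100-PCIE-16GB, 37.27 W'."""
    name, power = [field.strip() for field in line.rsplit(",", 1)]
    return name, float(power.replace("W", "").strip())


def energy_joules(samples: List[Tuple[float, float]]) -> float:
    """Trapezoidal integral over (timestamp_seconds, power_watts) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


name, watts = parse_power_line("Tesla P100-PCIE-16GB, 37.27 W")
print(name, watts)  # Tesla P100-PCIE-16GB 37.27

# Made-up readings taken every 0.5 s, for illustration only.
samples = [(0.0, 40.0), (0.5, 60.0), (1.0, 60.0)]
print(f"{energy_joules(samples):.1f} J")  # 55.0 J
```

Fewer than two samples yield 0 J, so a measured run needs to span several sampling intervals before the integral means anything.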


## Step 10: Integrate with Energy Measurement Harness

Here's how to integrate with Krishna's energy measurement pipeline:

In [18]:
# Save the energy_utils.py content (from your uploaded file)
# This is Krishna's measurement harness

# For demonstration, here's the integration pattern:

print("="*60)
print("Energy Measurement Integration Pattern")
print("="*60)

integration_code = '''
# Krishna's harness integration:

from energy_utils import GPUPowerMonitor, benchmark_model_energy
from prepare_dataset import PreTokenizedDataset
import torch

# Setup
device = torch.device('cuda')
model = load_your_model().to(device).eval()

# Load dataset BEFORE measurement (ONE TIME)
dataset = PreTokenizedDataset('/kaggle/working/tokenized_data')
dataset.to_device(device)  # All data on GPU

# Warmup (no_grad avoids building autograd graphs)
with torch.no_grad():
    for i, batch in enumerate(dataset.get_batch(8)):
        if i >= 2:
            break
        _ = model(batch['input_ids'], batch['attention_mask'])

torch.cuda.synchronize()

# Measure energy
monitor = GPUPowerMonitor(gpu_id=0, sample_interval_ms=200)
monitor.start()

# Inference (NO I/O!)
with torch.no_grad():
    for batch in dataset.get_batch(8):
        outputs = model(batch['input_ids'], batch['attention_mask'])

torch.cuda.synchronize()
energy_stats = monitor.stop()

print(f"Energy: {energy_stats['energy_joules']:.3f} J")
print(f"Power:  {energy_stats['mean_power_watts']:.2f} W")
'''

print(integration_code)
print("="*60)

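
One practical consequence of the numbers in this notebook: a single pass over the 50 samples took 0.108 s in Step 8, which is shorter than the 200 ms sampling interval used in the pattern above, so one pass would be covered by very few power samples. A simple way around this is to repeat the inference loop until the run spans enough samples. The sketch below estimates how many passes that takes; the function name and the target of 50 samples are illustrative assumptions.

```python
import math


def passes_needed(pass_latency_s: float, sample_interval_s: float, min_samples: int) -> int:
    """Passes over the dataset needed for the run to span min_samples sampling intervals."""
    return math.ceil(min_samples * sample_interval_s / pass_latency_s)


# 0.108 s per pass (Step 8), 200 ms sampling interval, 50 power samples wanted.
print(passes_needed(0.108, 0.2, 50))  # 93
```

The total energy can then be divided by the number of passes (or by the total number of samples processed) to report a per-pass or per-sample figure.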

## Step 11: Create Different Dataset Sizes (for Ablations)

In [19]:
# Create datasets for ablation studies
print("Creating multiple dataset configurations...\n")

configs = [
    {'num_samples': 30, 'name': 'small'},
    {'num_samples': 50, 'name': 'standard'},
    {'num_samples': 100, 'name': 'large'},
]

for config in configs:
    output_dir = f"/kaggle/working/tokenized_data_{config['name']}"
    print(f"Creating {config['name']} dataset ({config['num_samples']} samples)...")
    
    prepare_tokenized_dataset(
        num_samples=config['num_samples'],
        max_length=128,
        dataset_name='sst2',
        output_dir=output_dir,
        seed=42
    )
    print()

print("✓ All dataset configurations created!")

Creating multiple dataset configurations...

Creating small dataset (30 samples)...
Pre-tokenizing Dataset for Energy Measurement

[1/5] Loading DistilBERT tokenizer...
[2/5] Loading sst2 validation set...
[3/5] Selecting 30 examples (seed=42)...
[4/5] Tokenizing with max_length=128...
[5/5] Saving to /kaggle/working/tokenized_data_small...

Dataset Preparation Complete!
Number of samples:     30
Max sequence length:   128
Dataset:               sst2
Number of labels:      2

Saved files:
  - input_ids.pt       torch.Size([30, 128])
  - attention_mask.pt  torch.Size([30, 128])
  - labels.pt          torch.Size([30])
  - metadata.json

First 3 examples:

1. it gets onto the screen just about as much of the novella as one could...
   Label: 1

2. my big fat greek wedding uses stereotypes in a delightful blend of swe...
   Label: 1

3. for the most part , director anne-sophie birot 's first feature is a s...
   Label: 1

Creating standard dataset (50 samples)...
[output truncated]
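
The small dataset starts with the same three examples as the 50-sample dataset from Step 4. That is expected: every configuration shuffles with `seed=42` and then keeps the first `num_samples` rows, so the smaller datasets are prefixes of the larger ones. The sketch below demonstrates the nesting property with Python's `random` module as a stand-in; the `datasets` library uses its own shuffling, so the actual index order differs, but the prefix relationship is the same.

```python
import random
from typing import List


def shuffled_prefix(n_total: int, n_select: int, seed: int) -> List[int]:
    """Shuffle indices 0..n_total-1 with a fixed seed and keep the first n_select."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_select]


# 872 is the SST-2 validation set size reported when the dataset was loaded.
small = shuffled_prefix(872, 30, seed=42)
standard = shuffled_prefix(872, 50, seed=42)
large = shuffled_prefix(872, 100, seed=42)

print(standard[:30] == small, large[:50] == standard)  # True True
```

For the ablations this means the three sizes are not independent samples: differences between them reflect the added examples only, which keeps comparisons controlled but does not average over different subsets.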

## Step 12: Summary and Next Steps

In [23]:
print("="*70)
print(" "*20 + "DAY 1 CHECKPOINT COMPLETE ✓")
print("="*70)

print("\n Files Created:")
!ls -lh /kaggle/working/tokenized_data/

print("\n Accomplished:")
print("  ✓ Created pre-tokenized dataset (50 samples)")
print("  ✓ Verified zero I/O during iteration")
print("  ✓ Tested with DistilBERT FP32 model")
print(f"  ✓ Baseline accuracy: {accuracy:.2f}%")
print("  ✓ Dataset on GPU (zero transfer cost during inference)")
print("  ✓ Ready for energy measurement")

print("\n Key Metrics (FP32 Baseline):")
print(f"  • Accuracy:    {accuracy:.2f}%")
print(f"  • Latency:     {latency:.3f} s")
print(f"  • Throughput:  {throughput:.2f} samples/s")
print(f"  • Device:      {device}")

print("\n Ready for Day 2:")
print("  1. Taara: working FP32 baseline with known accuracy")
print("  2. Krishna: Dataset integrates with your energy harness")
print("  3. Thomas: Ready to add FP16 and INT8 quantization")

print("\n Critical Achievement:")
print("  ZERO I/O during inference measurement!")

print("\n" + "="*70)
print("Next: Integrate with Krishna's energy measurement harness")
print("="*70)

                    DAY 1 CHECKPOINT COMPLETE ✓

 Files Created:
total 112K
-rw-r--r-- 1 root root  52K Dec  1 02:48 attention_mask.pt
-rw-r--r-- 1 root root  52K Dec  1 02:48 input_ids.pt
-rw-r--r-- 1 root root 1.6K Dec  1 02:48 labels.pt
-rw-r--r-- 1 root root  145 Dec  1 02:48 metadata.json

 Accomplished:
  ✓ Created pre-tokenized dataset (50 samples)
  ✓ Verified zero I/O during iteration
  ✓ Tested with DistilBERT FP32 model
  ✓ Baseline accuracy: 86.00%
  ✓ Dataset on GPU (zero transfer cost during inference)
  ✓ Ready for energy measurement

 Key Metrics (FP32 Baseline):
  • Accuracy:    86.00%
  • Latency:     0.108 s
  • Throughput:  461.90 samples/s
  • Device:      cuda

 Ready for Day 2:
  1. Taara: working FP32 baseline with known accuracy
  2. Krishna: Dataset integrates with your energy harness
  3. Thomas: Ready to add FP16 and INT8 quantization

 Critical Achievement:
  ZERO I/O during inference measurement!

Next: Integrate with Krishna's energy measurement harness


## Notes for Team

### For Taara:
- Dataset is ready and tested
- FP32 baseline accuracy confirmed (86.0% on the 50-sample subset)
- Can proceed to Day 2 tasks

### For Krishna:
- Zero I/O confirmed during iteration
- All data can be pre-loaded to GPU
- Ready to integrate with `energy_utils.py`
- Check if nvidia-smi works in your environment

### For Thomas:
- Same dataset works for all precision levels
- `num_labels=2` for SST-2
- Ready for FP16, INT8, mixed precision experiments

### Important Files:
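- `/kaggle/working/tokenized_data/` - Main dataset (50 samples)
- `/kaggle/working/tokenized_data_small/` - Small dataset (30 samples)
- `/kaggle/working/tokenized_data_standard/` - Standard dataset (50 samples, same settings as the main dataset)
- `/kaggle/working/tokenized_data_large/` - Large dataset (100 samples)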

### To Download:
You can download the tokenized data through Kaggle's output feature, or save it as a Kaggle dataset.

In [22]:
!zip -r tokenized_datasets.zip /kaggle/working/tokenized_data*

  adding: kaggle/working/tokenized_data/ (stored 0%)
  adding: kaggle/working/tokenized_data/metadata.json (deflated 23%)
  adding: kaggle/working/tokenized_data/input_ids.pt (deflated 92%)
  adding: kaggle/working/tokenized_data/labels.pt (deflated 67%)
  adding: kaggle/working/tokenized_data/attention_mask.pt (deflated 98%)
  adding: kaggle/working/tokenized_data_large/ (stored 0%)
  adding: kaggle/working/tokenized_data_large/metadata.json (deflated 23%)
  adding: kaggle/working/tokenized_data_large/input_ids.pt (deflated 93%)
  adding: kaggle/working/tokenized_data_large/labels.pt (deflated 73%)
  adding: kaggle/working/tokenized_data_large/attention_mask.pt (deflated 99%)
  adding: kaggle/working/tokenized_data_small/ (stored 0%)
  adding: kaggle/working/tokenized_data_small/metadata.json (deflated 23%)
  adding: kaggle/working/tokenized_data_small/input_ids.pt (deflated 92%)
  adding: kaggle/working/tokenized_data_small/labels.pt (deflated 63%)
  adding: kaggle/working/tokenized_