# GPT-2 Small: Pre-tokenized Dataset for Energy Measurement
## WikiText-2 with ZERO I/O Overhead

This notebook creates a pre-tokenized dataset for GPT-2 Small with **ZERO I/O overhead** during energy measurements.

**Dataset**: WikiText-2 (standard language modeling benchmark)

**Model**: GPT-2 Small (124M parameters)

**Task**: Next-token prediction (language modeling)

**Key Feature**: All data pre-tokenized and loaded to GPU once - no CPU→GPU transfers during inference!

## Step 1: Verify GPU Access

In [12]:
import torch

print("="*60)
print("GPU CHECK")
print("="*60)
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print(f"Device count: {torch.cuda.device_count()}")
    
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Total GPU memory: {total_memory:.2f} GB")
    print("\n✓ GPU is ready!")
else:
    print("\n⚠️ WARNING: GPU not available!")

print("="*60)

GPU CHECK
PyTorch version: 2.6.0+cu124
CUDA available: True
CUDA version: 12.4
Device name: Tesla T4
Device count: 2
Total GPU memory: 15.83 GB

✓ GPU is ready!


## Step 2: Install Dependencies

In [13]:
# Install required packages
!pip install -q transformers datasets accelerate

print("✓ Dependencies installed")

✓ Dependencies installed


## Step 3: Define Dataset Preparation Functions for GPT-2

In [14]:
"""
GPT-2 Dataset Preparation Module
Pre-tokenize WikiText-2 to eliminate I/O overhead during energy measurement
"""

import torch
from transformers import GPT2Tokenizer
from datasets import load_dataset
from pathlib import Path
import json


def prepare_gpt2_tokenized_dataset(
    num_samples: int = 10000,
    max_length: int = 128,
    output_dir: str = "/kaggle/working/gpt2_tokenized_data",
    seed: int = 42
):
    """
    Pre-tokenize WikiText-2 dataset for GPT-2 and save to disk.
    
    Args:
        num_samples: Number of sequences to tokenize (set high to use the full dataset; about 1,940 remain after filtering)
        max_length: Maximum sequence length (128 is good for GPT-2)
        output_dir: Directory to save tokenized data
        seed: Random seed for reproducibility
    """
    
    print("="*60)
    print("Pre-tokenizing WikiText-2 for GPT-2 Small")
    print("="*60)
    
    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Load tokenizer
    print("\n[1/5] Loading GPT-2 tokenizer...")
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    
    # GPT-2 tokenizer doesn't have a pad token by default
    tokenizer.pad_token = tokenizer.eos_token
    
    # Load WikiText-2 dataset
    print(f"[2/5] Loading WikiText-2 test set...")
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    
    # Filter out empty lines and very short texts
    print(f"[3/5] Filtering and selecting sequences (seed={seed})...")
    
    # Filter non-empty texts with sufficient length
    def is_valid_text(example):
        text = example['text'].strip()
        return len(text) > 50  # At least 50 characters
    
    dataset = dataset.filter(is_valid_text)
    
    print(f"   Total valid sequences available: {len(dataset)}")
    
    # Select samples; if num_samples exceeds what is available, use everything
    available_samples = len(dataset)
    actual_samples = min(num_samples, available_samples)
    dataset = dataset.shuffle(seed=seed).select(range(actual_samples))
    
    # Compare against the count taken BEFORE select(), otherwise this is always true
    if actual_samples == available_samples:
        print(f"   Using FULL dataset: {actual_samples} samples")
    else:
        print(f"   Selected: {actual_samples} samples")
    
    # Tokenize all examples
    print(f"[4/5] Tokenizing with max_length={max_length}...")
    texts = [example['text'] for example in dataset]
    
    # Tokenize in batch
    encodings = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )
    
    # For causal language modeling, the labels are a copy of input_ids;
    # GPT2LMHeadModel shifts them by one position internally.
    # Note: padded positions are not masked here, so they count toward the loss.
    labels = encodings['input_ids'].clone()
    
    # Save tensors
    print(f"[5/5] Saving to {output_dir}...")
    torch.save(encodings['input_ids'], output_path / 'input_ids.pt')
    torch.save(encodings['attention_mask'], output_path / 'attention_mask.pt')
    torch.save(labels, output_path / 'labels.pt')
    
    # Save metadata
    metadata = {
        'num_samples': actual_samples,
        'max_length': max_length,
        'dataset_name': 'wikitext-2',
        'model': 'gpt2',
        'task': 'language_modeling',
        'seed': seed,
        'tokenizer': 'gpt2',
    }
    
    with open(output_path / 'metadata.json', 'w') as f:
        json.dump(metadata, f, indent=2)
    
    # Print summary
    print("\n" + "="*60)
    print("Dataset Preparation Complete!")
    print("="*60)
    print(f"Number of samples:     {len(texts)}")
    print(f"Max sequence length:   {max_length}")
    print(f"Dataset:               WikiText-2")
    print(f"Task:                  Language Modeling")
    print(f"\nSaved files:")
    print(f"  - input_ids.pt       {encodings['input_ids'].shape}")
    print(f"  - attention_mask.pt  {encodings['attention_mask'].shape}")
    print(f"  - labels.pt          {labels.shape}")
    print(f"  - metadata.json")
    
    # Calculate approximate size
    total_size_mb = (encodings['input_ids'].element_size() * encodings['input_ids'].nelement() * 3) / (1024**2)
    print(f"\nTotal dataset size: ~{total_size_mb:.2f} MB")
    print("="*60)
    
    # Show examples
    print("\nFirst 3 text samples:")
    for i in range(min(3, len(texts))):
        preview = texts[i][:100].replace('\n', ' ')
        print(f"\n{i+1}. {preview}...")
    
    return metadata


class GPT2PreTokenizedDataset:
    """
    Efficient dataset class for pre-tokenized GPT-2 data.
    Zero I/O overhead during iteration.
    """
    
    def __init__(self, data_dir: str = "/kaggle/working/gpt2_tokenized_data"):
        """Load pre-tokenized dataset from disk."""
        data_path = Path(data_dir)
        
        # Load all data into memory once
        self.input_ids = torch.load(data_path / 'input_ids.pt')
        self.attention_mask = torch.load(data_path / 'attention_mask.pt')
        self.labels = torch.load(data_path / 'labels.pt')
        
        with open(data_path / 'metadata.json', 'r') as f:
            self.metadata = json.load(f)
        
        self.num_samples = len(self.labels)
    
    def __len__(self):
        return self.num_samples
    
    def __getitem__(self, idx):
        """Get a single example."""
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }
    
    def to_device(self, device):
        """Move all tensors to device (GPU) at once."""
        self.input_ids = self.input_ids.to(device)
        self.attention_mask = self.attention_mask.to(device)
        self.labels = self.labels.to(device)
        return self


print("✓ GPT-2 dataset preparation functions defined")

✓ GPT-2 dataset preparation functions defined
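
Because the pad token is set to the EOS token and the labels are a plain copy of `input_ids`, padded positions contribute to the loss. One common way to exclude them is to set those label positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below is not part of the notebook's pipeline; `mask_padding_labels` is a hypothetical helper, and the token ids are made up for illustration. It masks by `attention_mask` rather than by token id, since a genuine EOS token would otherwise be masked too.

```python
import torch


def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Return labels equal to input_ids, with padded positions set to ignore_index."""
    labels = input_ids.clone()                 # leave the original tensor untouched
    labels[attention_mask == 0] = ignore_index  # padded positions are skipped by the loss
    return labels


# Two toy sequences of length 4; the first has two padded positions (50256 = GPT-2 EOS id)
input_ids = torch.tensor([[464, 3290, 50256, 50256],
                          [1212, 318, 257, 1332]])
attention_mask = torch.tensor([[1, 1, 0, 0],
                               [1, 1, 1, 1]])

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Applying this before saving `labels.pt` would change the reported loss and perplexity, so results from masked and unmasked labels should not be compared directly.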


## Step 4: Create Pre-tokenized Dataset

Create the dataset from the **full WikiText-2 test set** (1,940 sequences after filtering).

In [15]:
# Create the tokenized dataset using the full WikiText-2 test set.
# num_samples is set above the number of available sequences, so all of them are used.
metadata = prepare_gpt2_tokenized_dataset(
    num_samples=10000,  # Uses all available sequences (1940 after filtering)
    max_length=128,
    output_dir='/kaggle/working/gpt2_tokenized_data',
    seed=42
)

Pre-tokenizing WikiText-2 for GPT-2 Small

[1/5] Loading GPT-2 tokenizer...
[2/5] Loading WikiText-2 test set...
[3/5] Filtering and selecting sequences (seed=42)...
   Total valid sequences available: 1940
   Using FULL dataset: 1940 samples
[4/5] Tokenizing with max_length=128...
[5/5] Saving to /kaggle/working/gpt2_tokenized_data...

Dataset Preparation Complete!
Number of samples:     1940
Max sequence length:   128
Dataset:               WikiText-2
Task:                  Language Modeling

Saved files:
  - input_ids.pt       torch.Size([1940, 128])
  - attention_mask.pt  torch.Size([1940, 128])
  - labels.pt          torch.Size([1940, 128])
  - metadata.json

Total dataset size: ~5.68 MB

First 3 text samples:

1.  San Lorenzo Colossal Head 7 ( also known as San Lorenzo Monument 53 ) measures 2 @.@ 7 metres ( 8 @...

2.  Pool champion Willie Mosconi has a cameo appearance as Willie , who holds the stakes for Eddie and ...

3.  † — A game postponed from Round 7 , held in Round 
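
The reported dataset size follows directly from the tensor shapes. A quick check of the arithmetic, assuming the tokenizer returned 64-bit integer tensors (8 bytes per element, which is consistent with the ~5.68 MB printed above):

```python
# Shape and dtype taken from the preparation output above
num_samples = 1940
max_length = 128
bytes_per_element = 8   # assumption: torch.int64 token ids
num_tensors = 3         # input_ids, attention_mask, labels

bytes_per_tensor = num_samples * max_length * bytes_per_element
total_mib = bytes_per_tensor * num_tensors / 1024**2
per_file_kib = bytes_per_tensor / 1024

print(f"Bytes per tensor: {bytes_per_tensor:,}")
print(f"Total size:       {total_mib:.2f} MiB")
print(f"Raw data per file: {per_file_kib:.1f} KiB")
```

The saved `.pt` files come out slightly larger than the raw data per tensor because the serialized file also carries a small header.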

## Step 5: Verify the Dataset

In [16]:
# Load and verify the dataset
dataset = GPT2PreTokenizedDataset('/kaggle/working/gpt2_tokenized_data')

print("="*60)
print("Dataset Verification")
print("="*60)
print(f"Number of samples: {len(dataset)}")
print(f"Metadata: {dataset.metadata}")

# Check first example
example = dataset[0]
print(f"\nFirst example:")
print(f"  input_ids shape:      {example['input_ids'].shape}")
print(f"  attention_mask shape: {example['attention_mask'].shape}")
print(f"  labels shape:         {example['labels'].shape}")

# Decode first example to verify
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
decoded_text = tokenizer.decode(example['input_ids'], skip_special_tokens=True)
print(f"\nDecoded text (first 100 chars): {decoded_text[:100]}...")

print(f"\n✓ Dataset verified!")
print("="*60)

Dataset Verification
Number of samples: 1940
Metadata: {'num_samples': 1940, 'max_length': 128, 'dataset_name': 'wikitext-2', 'model': 'gpt2', 'task': 'language_modeling', 'seed': 42, 'tokenizer': 'gpt2'}

First example:
  input_ids shape:      torch.Size([128])
  attention_mask shape: torch.Size([128])
  labels shape:         torch.Size([128])

Decoded text (first 100 chars):  San Lorenzo Colossal Head 7 ( also known as San Lorenzo Monument 53 ) measures 2 @.@ 7 metres ( 8 @...

✓ Dataset verified!


## Step 6: Load GPT-2 Model

In [17]:
from transformers import GPT2LMHeadModel
import torch

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load GPT-2 Small model
print("\nLoading GPT-2 Small model...")
model = GPT2LMHeadModel.from_pretrained('gpt2')

model = model.to(device)
model.eval()

print("✓ GPT-2 Small model loaded and moved to", device)

# Show model size
param_count = sum(p.numel() for p in model.parameters())
model_size_mb = sum(p.element_size() * p.numel() for p in model.parameters()) / (1024 ** 2)
print(f"\nModel parameters: {param_count:,} ({param_count/1e6:.1f}M)")
print(f"Model size (FP32): {model_size_mb:.2f} MB")

Using device: cuda

Loading GPT-2 Small model...
✓ GPT-2 Small model loaded and moved to cuda

Model parameters: 124,439,808 (124.4M)
Model size (FP32): 474.70 MB


## Step 7: Move Dataset to GPU (ONE TIME)

**KEY FEATURE:** Move all data to GPU once. During measurement, there will be ZERO CPU→GPU transfers!

In [18]:
# Reload dataset and move to GPU
dataset = GPT2PreTokenizedDataset('/kaggle/working/gpt2_tokenized_data')

if torch.cuda.is_available():
    print("Moving dataset to GPU...")
    dataset.to_device(device)
    print("✓ Dataset on GPU")
    
    # Verify
    sample = dataset[0]
    print(f"\nVerification:")
    print(f"  Input_ids device: {sample['input_ids'].device}")
    print(f"  Labels device:    {sample['labels'].device}")
    
    # Check GPU memory
    allocated = torch.cuda.memory_allocated(0) / 1e6
    print(f"\nGPU memory allocated: {allocated:.2f} MB")
else:
    print("CPU mode - dataset stays in CPU memory")

Moving dataset to GPU...
✓ Dataset on GPU

Verification:
  Input_ids device: cuda:0
  Labels device:    cuda:0

GPU memory allocated: 612.25 MB


## Step 8: Run Baseline Inference (FP32)

Test the complete pipeline with **ZERO I/O** during inference.

In [19]:
import time

print("="*60)
print("FP32 Baseline Inference (ZERO I/O!)")
print("="*60)

# Warmup
print("\nWarming up (10 iterations)...")
with torch.no_grad():
    for i in range(10):
        sample = dataset[i]
        _ = model(
            input_ids=sample['input_ids'].unsqueeze(0),
            attention_mask=sample['attention_mask'].unsqueeze(0)
        )

if torch.cuda.is_available():
    torch.cuda.synchronize()

print("✓ Warmup complete")

# Actual inference measurement
print("\nRunning inference on all samples...")
total_loss = 0
num_tokens = 0

start_time = time.perf_counter()

with torch.no_grad():
    for i in range(len(dataset)):
        sample = dataset[i]
        
        # Forward pass (NO I/O!)
        outputs = model(
            input_ids=sample['input_ids'].unsqueeze(0),
            attention_mask=sample['attention_mask'].unsqueeze(0),
            labels=sample['labels'].unsqueeze(0)
        )
        
        # .item() copies one scalar back to the CPU and synchronizes on every iteration
        total_loss += outputs.loss.item()
        num_tokens += sample['attention_mask'].sum().item()

if torch.cuda.is_available():
    torch.cuda.synchronize()

end_time = time.perf_counter()

# Results
latency = end_time - start_time
avg_loss = total_loss / len(dataset)
perplexity = torch.exp(torch.tensor(avg_loss)).item()
throughput = len(dataset) / latency
tokens_per_sec = num_tokens / latency

print("\n" + "="*60)
print("Results")
print("="*60)
print(f"Average Loss:    {avg_loss:.4f}")
print(f"Perplexity:      {perplexity:.2f}")
print(f"Latency:         {latency:.3f} seconds")
print(f"Throughput:      {throughput:.2f} samples/second")
print(f"Tokens/sec:      {tokens_per_sec:.2f}")
print(f"Per-sample:      {latency/len(dataset)*1000:.2f} ms")
print("="*60)

print("\n✓ Baseline inference complete!")

FP32 Baseline Inference (ZERO I/O!)

Warming up (10 iterations)...
✓ Warmup complete

Running inference on all samples...

Results
Average Loss:    5.3825
Perplexity:      217.57
Latency:         21.943 seconds
Throughput:      88.41 samples/second
Tokens/sec:      9019.46
Per-sample:      11.31 ms

✓ Baseline inference complete!
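
The perplexity above is the exponential of the mean of per-sequence losses, so a short sequence weighs as much as a long one. A token-weighted average is a common alternative. The sketch below contrasts the two on made-up numbers; both function names and the sample values are hypothetical, and it assumes each per-sequence loss is itself a mean over that sequence's scored tokens.

```python
import math


def sample_mean_perplexity(losses):
    """exp of the unweighted mean of per-sequence losses (what the loop above computes)."""
    return math.exp(sum(losses) / len(losses))


def token_weighted_perplexity(losses, token_counts):
    """exp of the mean loss per token, weighting each sequence by its token count."""
    total_nll = sum(loss * n for loss, n in zip(losses, token_counts))
    return math.exp(total_nll / sum(token_counts))


# Hypothetical values: a long, easy sequence and a short, hard one
losses = [2.0, 4.0]
token_counts = [30, 10]

ppl_sample = sample_mean_perplexity(losses)
ppl_token = token_weighted_perplexity(losses, token_counts)
print(f"Sample-mean perplexity:    {ppl_sample:.2f}")
print(f"Token-weighted perplexity: {ppl_token:.2f}")
```

The two figures diverge whenever sequence lengths and losses are correlated, so it helps to state which averaging was used when reporting perplexity alongside energy numbers.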


## Step 9: Test Power Monitoring

In [20]:
import subprocess

try:
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name,power.draw', '--format=csv,noheader'],
        capture_output=True,
        text=True,
        timeout=2
    )
    
    if result.returncode == 0:
        print("="*60)
        print("GPU Power Monitoring Available")
        print("="*60)
        print(result.stdout.strip())
        print("\n✓ nvidia-smi is available for power monitoring")
    else:
        print("⚠️ nvidia-smi not responding properly")
except Exception as e:
    print(f"⚠️ nvidia-smi not available: {e}")

GPU Power Monitoring Available
Tesla T4, 33.85 W
Tesla T4, 11.71 W

✓ nvidia-smi is available for power monitoring
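
The check above reads power once per GPU. To turn power readings into an energy figure, one option is to sample power repeatedly while inference runs and integrate over time. The following is a minimal sketch under stated assumptions, not the measurement method of the follow-up notebook: all four helper names are hypothetical, the parser assumes the `name, power W` line format shown above, and the sampler would need to run in a separate thread or process alongside inference.

```python
import subprocess
import time


def parse_power_watts(csv_line):
    """Extract watts from a line such as 'Tesla T4, 33.85 W'."""
    power_field = csv_line.rsplit(",", 1)[-1]  # e.g. ' 33.85 W'
    return float(power_field.split()[0])


def read_power_watts(gpu_index=0):
    """Query nvidia-smi once for the power draw of a single GPU."""
    result = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=name,power.draw", "--format=csv,noheader"],
        capture_output=True, text=True, timeout=2, check=True,
    )
    return parse_power_watts(result.stdout.strip())


def sample_power(duration_s, interval_s=0.1, gpu_index=0):
    """Poll power for duration_s seconds; returns (timestamps, watts)."""
    timestamps, watts = [], []
    start = time.perf_counter()
    while True:
        now = time.perf_counter()
        if now - start >= duration_s:
            break
        timestamps.append(now - start)
        watts.append(read_power_watts(gpu_index))
        time.sleep(interval_s)
    return timestamps, watts


def integrate_energy_joules(timestamps, watts):
    """Trapezoidal integration of power (W) over time (s) gives energy (J)."""
    energy = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        energy += 0.5 * (watts[i] + watts[i - 1]) * dt
    return energy


# Synthetic samples, so the example runs without a GPU
demo_energy = integrate_energy_joules([0.0, 1.0, 2.0], [30.0, 40.0, 50.0])
print(f"Energy: {demo_energy:.1f} J")
```

With two GPUs reported and the model on `cuda:0`, sampling only `gpu_index=0` keeps the idle second device out of the total. Each `nvidia-smi` call takes noticeable time, so the effective sampling interval will be longer than `interval_s`.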


## Step 10: Summary

In [21]:
print("="*70)
print(" "*15 + "GPT-2 DATASET PREPARATION COMPLETE ✓")
print("="*70)

print("\n📁 Files Created:")
import os
data_dir = '/kaggle/working/gpt2_tokenized_data'
if os.path.exists(data_dir):
    for f in os.listdir(data_dir):
        fpath = os.path.join(data_dir, f)
        size = os.path.getsize(fpath) / 1024
        print(f"  - {f:25s} {size:>8.1f} KB")

print("\n✓ Accomplished:")
print(f"  • Created pre-tokenized WikiText-2 dataset ({len(dataset)} samples)")
print("  • Verified zero I/O during iteration")
print("  • Tested with GPT-2 Small FP32 model")
print(f"  • Baseline perplexity: {perplexity:.2f}")
print("  • Dataset on GPU (zero transfer cost during inference)")
print("  • Ready for energy measurement")
print(f"  • MUCH MORE DATA than DistilBERT version ({len(dataset)} vs 50 samples!)")

print("\n📊 Key Metrics (FP32 Baseline):")
print(f"  • Perplexity:  {perplexity:.2f}")
print(f"  • Latency:     {latency:.3f} s")
print(f"  • Throughput:  {throughput:.2f} samples/s")
print(f"  • Tokens/sec:  {tokens_per_sec:.2f}")
print(f"  • Device:      {device}")

print("\n🎯 Next Steps:")
print("  1. Use this dataset in final_quantization_benchmark_GPT2.ipynb")
print("  2. Benchmark FP32, FP16, and Mixed Precision")
print("  3. Measure energy consumption for each format")

print("\n⚡ Critical Achievement:")
print("  ZERO I/O during inference measurement!")
print(f"  {len(dataset)} samples = statistically significant results!")

print("\n" + "="*70)

               GPT-2 DATASET PREPARATION COMPLETE ✓

📁 Files Created:
  - metadata.json                  0.2 KB
  - attention_mask.pt           1941.2 KB
  - labels.pt                   1941.1 KB
  - input_ids.pt                1941.2 KB

✓ Accomplished:
  • Created pre-tokenized WikiText-2 dataset (1940 samples)
  • Verified zero I/O during iteration
  • Tested with GPT-2 Small FP32 model
  • Baseline perplexity: 217.57
  • Dataset on GPU (zero transfer cost during inference)
  • Ready for energy measurement
  • MUCH MORE DATA than DistilBERT version (1940 vs 50 samples!)

📊 Key Metrics (FP32 Baseline):
  • Perplexity:  217.57
  • Latency:     21.943 s
  • Throughput:  88.41 samples/s
  • Tokens/sec:  9019.46
  • Device:      cuda

🎯 Next Steps:
  1. Use this dataset in final_quantization_benchmark_GPT2.ipynb
  2. Benchmark FP32, FP16, and Mixed Precision
  3. Measure energy consumption for each format

⚡ Critical Achievement:
  ZERO I/O during i

## Optional: Create Dataset Archive

In [22]:
# Create a zip file for easy download/sharing
!zip -r gpt2_tokenized_data.zip /kaggle/working/gpt2_tokenized_data/

print("\n✓ Dataset archived to gpt2_tokenized_data.zip")

updating: kaggle/working/gpt2_tokenized_data/ (stored 0%)
updating: kaggle/working/gpt2_tokenized_data/metadata.json (deflated 29%)
updating: kaggle/working/gpt2_tokenized_data/attention_mask.pt (deflated 100%)
updating: kaggle/working/gpt2_tokenized_data/labels.pt (deflated 80%)
updating: kaggle/working/gpt2_tokenized_data/input_ids.pt (deflated 80%)

✓ Dataset archived to gpt2_tokenized_data.zip
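
As the output shows, zipping an absolute path stores every entry under a `kaggle/working/` prefix, and `updating:` indicates that an earlier archive was modified rather than created fresh. If a flatter archive is preferred, the standard library can build one from Python. This is an optional sketch: `archive_dataset` is a hypothetical helper, and the demo uses a temporary directory with a stand-in `metadata.json` rather than the real dataset.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_dataset(data_dir, archive_stem):
    """Zip data_dir so entries start at the directory name, not the absolute path."""
    data_path = Path(data_dir)
    return shutil.make_archive(
        archive_stem,                     # output path without the .zip extension
        "zip",
        root_dir=str(data_path.parent),   # paths inside the archive are relative to this
        base_dir=data_path.name,          # only this directory is included
    )


# Demo on a throwaway directory so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "gpt2_tokenized_data"
    demo_dir.mkdir()
    (demo_dir / "metadata.json").write_text('{"num_samples": 2}')

    archive_path = archive_dataset(demo_dir, str(Path(tmp) / "gpt2_tokenized_data"))
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

print(names)
```

On Kaggle, the equivalent call would pass `/kaggle/working/gpt2_tokenized_data` as `data_dir`.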
