# 02 Tokenizer & Dataset Preparation

Build a truncated tokenizer with dead-token padding for Duckling_II.

**Goal:** Create a vocabulary of 8,192 tokens:
- **Live tokens (0-6143):** Top 6,144 GPT-2 tokens by frequency in TinyStories
- **Dead tokens (6144-8191):** 2,048 slots that never appear in training data

The dead tokens are our experimental subjects—we'll watch how they move through embedding space despite never receiving direct gradients.
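The ID layout above can be sketched as a small range check. This is only an illustration; `is_dead` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
# Vocabulary layout from the goal statement above.
LIVE_TOKENS = 6144
DEAD_TOKENS = 2048
TOTAL_VOCAB = LIVE_TOKENS + DEAD_TOKENS  # 8192


def is_dead(token_id: int) -> bool:
    """Hypothetical helper: True for IDs in the dead range [LIVE_TOKENS, TOTAL_VOCAB)."""
    if not 0 <= token_id < TOTAL_VOCAB:
        raise ValueError(f"token_id {token_id} is outside the vocabulary")
    return token_id >= LIVE_TOKENS


print(TOTAL_VOCAB)    # 8192
print(is_dead(6143))  # False (last live ID)
print(is_dead(6144))  # True (first dead ID)
```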

In [1]:
# === Parameters ===
DATASET_NAME = "roneneldan/TinyStories"
SPLIT = "train"
RANDOM_SEED = 42

# Vocabulary sizes (from notebook 01 analysis)
LIVE_TOKENS = 6144      # Top tokens by frequency
DEAD_TOKENS = 2048      # Padding rows, never used
TOTAL_VOCAB = 8192      # LIVE_TOKENS + DEAD_TOKENS; 8192 = 2**13, a power of two for alignment

# For building the frequency table
SAMPLE_SIZE = 100_000   # Same as notebook 01

# Context length
MAX_LENGTH = 512

# Output paths
OUTPUT_DIR = "."

In [2]:
import json
import numpy as np
from pathlib import Path
from collections import Counter
from datasets import load_dataset
from transformers import GPT2TokenizerFast
from tqdm.auto import tqdm

print("Imports complete.")

Imports complete.


## Step 1: Build Token Frequency Table

Reproduce the frequency analysis from notebook 01 to get our live token list.

In [3]:
# Load GPT-2 tokenizer
base_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(f"GPT-2 tokenizer loaded. Vocab size: {base_tokenizer.vocab_size:,}")

# Load dataset
print(f"\nLoading {DATASET_NAME}...")
dataset = load_dataset(DATASET_NAME, split=SPLIT)
print(f"Dataset loaded. Total stories: {len(dataset):,}")

# Sample for frequency analysis
if SAMPLE_SIZE and SAMPLE_SIZE < len(dataset):
    sample = dataset.shuffle(seed=RANDOM_SEED).select(range(SAMPLE_SIZE))
    print(f"Sampled {SAMPLE_SIZE:,} stories for frequency analysis.")
else:
    sample = dataset

# Count token frequencies
token_counts = Counter()
BATCH_SIZE = 1000
texts = sample["text"]
n_batches = (len(texts) + BATCH_SIZE - 1) // BATCH_SIZE

for i in tqdm(range(n_batches), desc="Counting tokens"):
    start = i * BATCH_SIZE
    end = min(start + BATCH_SIZE, len(texts))
    batch_encodings = base_tokenizer(texts[start:end], add_special_tokens=False)["input_ids"]
    for tokens in batch_encodings:
        token_counts.update(tokens)

print(f"\nUnique tokens found: {len(token_counts):,}")

GPT-2 tokenizer loaded. Vocab size: 50,257

Loading roneneldan/TinyStories...
Dataset loaded. Total stories: 2,119,719
Sampled 100,000 stories for frequency analysis.



Token indices sequence length is longer than the specified maximum sequence length for this model (1087 > 1024). Running this sequence through the model will result in indexing errors



Unique tokens found: 17,852
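As a toy illustration of the counting loop, the sketch below runs the same `Counter` pattern over hand-written ID lists instead of real tokenizer output. The IDs and batches are made up. Python documents that `most_common()` orders equal counts by first occurrence, so ties at the live/dead cutoff would be broken by encounter order.

```python
from collections import Counter

# Two hypothetical "batches" of already-tokenized stories (lists of token IDs).
toy_batches = [
    [[5, 7, 5], [7, 5]],
    [[9, 5, 2]],
]

counts = Counter()
for batch in toy_batches:
    for ids in batch:
        counts.update(ids)

# Tokens 9 and 2 tie at one occurrence; 9 was seen first, so it ranks first.
ranked = counts.most_common()
print(ranked)  # [(5, 4), (7, 2), (9, 1), (2, 1)]
```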


## Step 2: Build Token Mapping

Create bidirectional mappings between GPT-2 token IDs and our compact vocabulary.

In [4]:
# Get top tokens by frequency
sorted_tokens = token_counts.most_common()
live_token_ids = [token_id for token_id, _ in sorted_tokens[:LIVE_TOKENS]]

print(f"Selected {len(live_token_ids):,} live tokens")
print(f"Dead token slots: {DEAD_TOKENS:,} (indices {LIVE_TOKENS} to {TOTAL_VOCAB - 1})")

# Build mappings
# gpt2_to_compact: GPT-2 token ID -> compact ID (0 to LIVE_TOKENS-1)
# compact_to_gpt2: compact ID -> GPT-2 token ID
gpt2_to_compact = {gpt2_id: compact_id for compact_id, gpt2_id in enumerate(live_token_ids)}
compact_to_gpt2 = {compact_id: gpt2_id for compact_id, gpt2_id in enumerate(live_token_ids)}

# Special token handling: there is no dedicated UNK token. Out-of-vocabulary
# GPT-2 IDs fall back to compact ID 0, the most frequent token (likely '.').
UNK_ID = 0

print(f"\nMapping built:")
print(f"  GPT-2 ID {live_token_ids[0]} -> compact ID 0 ('{base_tokenizer.decode([live_token_ids[0]])}')")
print(f"  GPT-2 ID {live_token_ids[-1]} -> compact ID {LIVE_TOKENS-1} ('{base_tokenizer.decode([live_token_ids[-1]])}')")

Selected 6,144 live tokens
Dead token slots: 2,048 (indices 6144 to 8191)

Mapping built:
  GPT-2 ID 13 -> compact ID 0 ('.')
  GPT-2 ID 44377 -> compact ID 6143 ('cking')
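The mapping construction can be sketched on a five-token toy table. The counts and the `LIVE = 3` cutoff are invented for illustration. The point is that the two dictionaries are exact inverses on the live range.

```python
from collections import Counter

# Hypothetical frequency table: GPT-2 ID -> count.
toy_counts = Counter({13: 50, 11: 30, 262: 15, 995: 4, 40000: 1})
LIVE = 3  # tiny live vocabulary for the example

live_ids = [gpt2_id for gpt2_id, _ in toy_counts.most_common(LIVE)]
to_compact = {gpt2_id: compact_id for compact_id, gpt2_id in enumerate(live_ids)}
to_gpt2 = dict(enumerate(live_ids))

# Compact IDs are assigned in frequency order, and the maps invert each other.
print(to_compact)  # {13: 0, 11: 1, 262: 2}
print(all(to_gpt2[to_compact[g]] == g for g in live_ids))  # True
```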


In [5]:
# Verify coverage
total_tokens = sum(token_counts.values())
covered_tokens = sum(token_counts[tid] for tid in live_token_ids)
coverage = covered_tokens / total_tokens

print(f"Coverage with {LIVE_TOKENS:,} live tokens: {coverage:.4%}")
print(f"OOV rate: {1 - coverage:.4%}")

Coverage with 6,144 live tokens: 99.5481%
OOV rate: 0.4519%
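To see how coverage changes with the cutoff, one option is a cumulative sum over the sorted counts. The sketch below uses an invented five-token table rather than the real `token_counts`.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical frequency table summing to 100 for easy reading.
toy_counts = Counter({13: 50, 11: 30, 262: 15, 995: 4, 40000: 1})
total = sum(toy_counts.values())

sorted_counts = [count for _, count in toy_counts.most_common()]
cumulative = list(accumulate(sorted_counts))  # [50, 80, 95, 99, 100]

# Coverage of the top-k tokens is cumulative[k - 1] / total.
for k, covered in enumerate(cumulative, start=1):
    coverage = covered / total
    print(f"top {k}: coverage {coverage:.2%}, OOV {1 - coverage:.2%}")
```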


## Step 3: Create Tokenizer Wrapper

A wrapper class that uses GPT-2 for actual tokenization, then remaps to our compact vocabulary. GPT-2 IDs outside the live set are mapped to `UNK_ID`.

In [6]:
class CompactTokenizer:
    """
    Wrapper around GPT-2 tokenizer that remaps to a compact vocabulary.
    
    Live tokens: 0 to (live_size - 1)
    Dead tokens: live_size to (total_size - 1) -- never produced by encode()
    """
    
    def __init__(self, base_tokenizer, gpt2_to_compact, compact_to_gpt2, 
                 live_size, total_size, unk_id=0):
        self.base = base_tokenizer
        self.gpt2_to_compact = gpt2_to_compact
        self.compact_to_gpt2 = compact_to_gpt2
        self.live_size = live_size
        self.total_size = total_size
        self.unk_id = unk_id
        
        # For HuggingFace compatibility
        self.vocab_size = total_size
        self.pad_token_id = unk_id  # Use UNK as padding
        self.eos_token_id = unk_id  # Simplified: no special EOS
        self.bos_token_id = unk_id  # Simplified: no special BOS
    
    def encode(self, text, add_special_tokens=False):
        """Tokenize text and remap to compact vocabulary."""
        gpt2_ids = self.base.encode(text, add_special_tokens=add_special_tokens)
        return [self.gpt2_to_compact.get(tid, self.unk_id) for tid in gpt2_ids]
    
    def decode(self, compact_ids):
        """Decode compact IDs back to text."""
        # Only decode live tokens; dead tokens decode to empty string
        gpt2_ids = []
        for cid in compact_ids:
            if cid in self.compact_to_gpt2:
                gpt2_ids.append(self.compact_to_gpt2[cid])
            # Dead tokens (>= live_size) are silently skipped
        return self.base.decode(gpt2_ids)
    
    def __call__(self, texts, **kwargs):
        """Batch tokenization for HuggingFace compatibility."""
        if isinstance(texts, str):
            texts = [texts]
        
        add_special = kwargs.get('add_special_tokens', False)
        return {
            'input_ids': [self.encode(t, add_special_tokens=add_special) for t in texts]
        }

# Create our tokenizer
tokenizer = CompactTokenizer(
    base_tokenizer=base_tokenizer,
    gpt2_to_compact=gpt2_to_compact,
    compact_to_gpt2=compact_to_gpt2,
    live_size=LIVE_TOKENS,
    total_size=TOTAL_VOCAB,
    unk_id=UNK_ID
)

print(f"CompactTokenizer created:")
print(f"  vocab_size: {tokenizer.vocab_size:,}")
print(f"  live tokens: 0-{tokenizer.live_size - 1}")
print(f"  dead tokens: {tokenizer.live_size}-{tokenizer.total_size - 1}")

CompactTokenizer created:
  vocab_size: 8,192
  live tokens: 0-6143
  dead tokens: 6144-8191


## Step 4: Test the Tokenizer

In [7]:
# Test on a sample story
test_story = dataset[0]["text"]
print("Sample story:")
print("-" * 50)
print(test_story[:500] + "..." if len(test_story) > 500 else test_story)
print("-" * 50)

# Encode
compact_ids = tokenizer.encode(test_story)
print(f"\nEncoded to {len(compact_ids)} tokens")
print(f"First 20 token IDs: {compact_ids[:20]}")
print(f"Max token ID: {max(compact_ids)} (should be < {LIVE_TOKENS})")

# Decode
decoded = tokenizer.decode(compact_ids)
print(f"\nRound-trip decode matches: {decoded == test_story}")
if decoded != test_story:
    print(f"  (Differences due to OOV tokens mapped to UNK)")

Sample story:
--------------------------------------------------
One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b...
--------------------------------------------------

Encoded to 162 tokens
First 20 token IDs: [103, 20, 4, 6, 36, 60, 76, 27, 121, 6, 2007, 16, 9, 201, 0, 10, 166, 8, 7, 1486]
Max token ID: 4109 (should be < 6144)

Round-trip decode matches: True
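The round trip succeeds here because every token in this story is in the live set. A minimal sketch of the failure case follows, using an invented three-entry mapping and the same remapping logic as `encode()`/`decode()` without the GPT-2 tokenizer. `remap` and `unmap` are hypothetical names.

```python
# Hypothetical toy mapping: three live GPT-2 IDs, UNK falls back to compact ID 0.
to_compact = {13: 0, 11: 1, 262: 2}
to_gpt2 = {compact_id: gpt2_id for gpt2_id, compact_id in to_compact.items()}
UNK = 0


def remap(gpt2_ids):
    """GPT-2 IDs -> compact IDs, with OOV IDs collapsed to UNK."""
    return [to_compact.get(t, UNK) for t in gpt2_ids]


def unmap(compact_ids):
    """Compact IDs -> GPT-2 IDs, skipping IDs with no mapping (dead tokens)."""
    return [to_gpt2[c] for c in compact_ids if c in to_gpt2]


original = [262, 995, 11]  # 995 is out of vocabulary in this toy mapping
restored = unmap(remap(original))
print(restored)              # [262, 13, 11]
print(restored == original)  # False: the OOV token came back as ID 13
```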


In [8]:
# Verify dead tokens are truly dead
print("Verifying dead tokens never appear in encoded data...")

max_token_seen = 0
dead_token_appearances = 0

# Check the first 10,000 stories (a prefix of the dataset, not a random sample)
for i in tqdm(range(min(10000, len(dataset))), desc="Checking stories"):
    ids = tokenizer.encode(dataset[i]["text"])
    if ids:
        max_seen = max(ids)
        if max_seen > max_token_seen:
            max_token_seen = max_seen
        dead_token_appearances += sum(1 for tid in ids if tid >= LIVE_TOKENS)

print(f"\nMax token ID seen: {max_token_seen}")
print(f"Dead token appearances: {dead_token_appearances}")
print(f"Dead tokens are {'truly dead ✓' if dead_token_appearances == 0 else 'NOT dead ✗'}")

Verifying dead tokens never appear in encoded data...




Max token ID seen: 6143
Dead token appearances: 0
Dead tokens are truly dead ✓
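The scan above is empirical and covers only part of the dataset. A structural check is also possible: an encoder of this shape can emit only the values stored in the mapping plus the UNK fallback, so it is enough to confirm that set is disjoint from the dead range. A sketch with invented toy sizes:

```python
# Hypothetical toy sizes: 3 live IDs, 2 dead IDs.
LIVE, TOTAL, UNK = 3, 5, 0
to_compact = {13: 0, 11: 1, 262: 2}

# Every ID the encoder can emit: mapped values plus the UNK fallback.
reachable = set(to_compact.values()) | {UNK}
dead = set(range(LIVE, TOTAL))

print(sorted(reachable))           # [0, 1, 2]
print(sorted(dead))                # [3, 4]
print(reachable.isdisjoint(dead))  # True
```

Applied to the real `gpt2_to_compact` and `UNK_ID`, the same check would hold for any input text, not just the stories scanned.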


## Step 5: Create Dataset Pipeline

Build a tokenized dataset ready for training.

In [9]:
def tokenize_function(examples):
    """Tokenize a batch of examples."""
    return tokenizer(examples["text"], add_special_tokens=False)

# Tokenize the full dataset
print(f"Tokenizing full dataset ({len(dataset):,} stories)...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    remove_columns=dataset.column_names,
    desc="Tokenizing"
)
print(f"Tokenization complete. {len(tokenized_dataset):,} examples.")

Tokenizing full dataset (2,119,719 stories)...



Tokenization complete. 2,119,719 examples.


In [10]:
def chunk_sequences(examples):
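    """
    Concatenate all sequences in the batch and split into fixed-length chunks.
    This is the standard approach for causal LM training. Stories are joined
    with no separator token, and the remainder (< MAX_LENGTH tokens) of each
    map batch is dropped.
    """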
    # Concatenate all input_ids
    concatenated = []
    for ids in examples["input_ids"]:
        concatenated.extend(ids)
    
    # Calculate number of complete chunks
    total_length = len(concatenated)
    n_chunks = total_length // MAX_LENGTH
    
    # Split into chunks (drop remainder)
    chunks = [
        concatenated[i * MAX_LENGTH : (i + 1) * MAX_LENGTH]
        for i in range(n_chunks)
    ]
    
    return {"input_ids": chunks}

# Chunk the tokenized dataset
print(f"Chunking into sequences of {MAX_LENGTH} tokens...")
chunked_dataset = tokenized_dataset.map(
    chunk_sequences,
    batched=True,
    batch_size=10000,  # Process many stories at once for efficient chunking
    desc="Chunking"
)
print(f"Chunking complete. {len(chunked_dataset):,} sequences of {MAX_LENGTH} tokens.")

Chunking into sequences of 512 tokens...



Chunking complete. 921,519 sequences of 512 tokens.
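A toy run of the concatenate-and-chunk step shows what happens at story boundaries and to the tail. The function `chunk` and the chunk length of 4 are illustrative stand-ins for `chunk_sequences` and `MAX_LENGTH`.

```python
def chunk(batch_of_ids, max_length):
    """Flatten a batch of ID lists and cut it into full-length chunks."""
    flat = [token for ids in batch_of_ids for token in ids]
    n_chunks = len(flat) // max_length
    return [flat[i * max_length:(i + 1) * max_length] for i in range(n_chunks)]


# Three hypothetical "stories" with 3, 2, and 4 tokens (9 tokens total).
stories = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = chunk(stories, max_length=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Both chunks span a story boundary, and the final token (9) is dropped because it does not fill a chunk. In the real pipeline this drop happens once per `map` batch.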


In [11]:
# Dataset statistics
n_sequences = len(chunked_dataset)
total_tokens = n_sequences * MAX_LENGTH

print("=" * 50)
print("DATASET STATISTICS")
print("=" * 50)
print(f"\nSequences: {n_sequences:,}")
print(f"Tokens per sequence: {MAX_LENGTH}")
print(f"Total tokens: {total_tokens:,}")
print(f"\nAt batch_size=64:")
print(f"  Steps per epoch: {n_sequences // 64:,}")
print(f"  Tokens per step: {64 * MAX_LENGTH:,}")

DATASET STATISTICS

Sequences: 921,519
Tokens per sequence: 512
Total tokens: 471,817,728

At batch_size=64:
  Steps per epoch: 14,398
  Tokens per step: 32,768
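The figures above can be re-derived from the sequence count alone. The sketch below also reports the sequences left over after the last full batch, assuming incomplete batches are dropped. That assumption matches the floor division in the cell above but says nothing about how the training loop actually behaves.

```python
n_sequences = 921_519
max_length = 512
batch_size = 64

total_tokens = n_sequences * max_length
steps_per_epoch, leftover = divmod(n_sequences, batch_size)

print(total_tokens)             # 471817728
print(steps_per_epoch)          # 14398
print(leftover)                 # 47 sequences that do not fill a batch
print(batch_size * max_length)  # 32768 tokens per step
```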


## Step 6: Save Artifacts

In [12]:
output_dir = Path(OUTPUT_DIR)

# Save token mapping
mapping_data = {
    "live_tokens": LIVE_TOKENS,
    "dead_tokens": DEAD_TOKENS,
    "total_vocab": TOTAL_VOCAB,
    "unk_id": UNK_ID,
    "gpt2_to_compact": {str(k): v for k, v in gpt2_to_compact.items()},
    "compact_to_gpt2": {str(k): v for k, v in compact_to_gpt2.items()},
    "live_token_ids": live_token_ids,  # Ordered by frequency
}

mapping_path = output_dir / "token_mapping.json"
with open(mapping_path, "w") as f:
    json.dump(mapping_data, f, indent=2)
print(f"Saved: {mapping_path}")

# Save chunked dataset
dataset_path = output_dir / "tokenized_dataset"
chunked_dataset.save_to_disk(str(dataset_path))
print(f"Saved: {dataset_path}/")

Saved: token_mapping.json



Saved: tokenized_dataset/
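JSON object keys are always strings, which is why the mapping keys are written with `str(k)`. A later notebook that reloads the file has to convert them back to integers before doing lookups. A minimal sketch follows; `load_token_mapping` is a hypothetical helper, and the toy file stands in for the real `token_mapping.json`.

```python
import json
import tempfile
from pathlib import Path


def load_token_mapping(path):
    """Hypothetical loader: read the mapping JSON and restore integer keys."""
    with open(path) as f:
        data = json.load(f)
    for name in ("gpt2_to_compact", "compact_to_gpt2"):
        data[name] = {int(k): v for k, v in data[name].items()}
    return data


# Round-trip a toy mapping through a temporary file.
toy = {
    "unk_id": 0,
    "gpt2_to_compact": {"13": 0, "11": 1},
    "compact_to_gpt2": {"0": 13, "1": 11},
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "token_mapping.json"
    path.write_text(json.dumps(toy))
    loaded = load_token_mapping(path)

print(loaded["gpt2_to_compact"][13])  # 0
print(loaded["compact_to_gpt2"][1])   # 11
```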


## Summary

This notebook created:

1. **Token mapping** (`token_mapping.json`): Bidirectional mapping between GPT-2 IDs and compact vocabulary
2. **Tokenized dataset** (`tokenized_dataset/`): Pre-chunked sequences ready for training

**Key properties:**
- Vocabulary: 8,192 total (6,144 live + 2,048 dead)
- Dead tokens (IDs 6144-8191) never appear in training data
- Coverage: ~99.5% of tokens in the 100,000-story frequency sample; the remaining ~0.45% (OOV) are mapped to compact ID 0

**Next step:** Training with embedding recording (notebook 03).