# Tokenizer Explorer

Before an LLM reads a single word, it breaks text into **tokens**. This notebook makes that process visible.

**What you'll learn:**
1. Why tokenization exists
2. How to build a simple tokenizer from scratch
3. How GPT-style BPE tokenization works
4. Surprising edge cases that affect LLM behavior

In [None]:
!pip install tiktoken -q

In [None]:
import tiktoken
from collections import Counter

## Part 1: Why Not Just Use Characters or Words?

The naive approaches both have serious problems.

In [None]:
sentence = "The unbelievably fast tokenizer splits text efficiently."

# Approach 1: Character-level
char_tokens = list(sentence)
print("Character-level tokenization:")
print(char_tokens)
print(f"Token count: {len(char_tokens)}")
print("Problem: sequences become very long, hard to learn meaning from single chars")

print()

# Approach 2: Word-level
word_tokens = sentence.split()
print("Word-level tokenization:")
print(word_tokens)
print(f"Token count: {len(word_tokens)}")
print("Problem: 'unbelievable', 'unbelievably', 'believable' are all different tokens")
print("         vocabulary explodes, rare words become unknown")
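
The unknown-word problem can be made concrete with a small sketch. The training sentence, the `encode_words` helper, and the `<unk>` placeholder are illustrative assumptions, not part of any real tokenizer.

```python
# Hypothetical training text used to build a tiny word-level vocabulary
training_text = "the fast tokenizer splits text"
word_vocab = set(training_text.split())

def encode_words(sentence, vocab, unk="<unk>"):
    """Replace any word missing from the vocabulary with a placeholder."""
    return [word if word in vocab else unk for word in sentence.split()]

# 'unbelievably' never appeared in training, so its meaning is lost entirely
print(encode_words("the unbelievably fast tokenizer", word_vocab))
```

A word-level tokenizer has no way to represent the unseen word other than a catch-all placeholder, which is the gap subword methods are meant to close.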

## Part 2: Build BPE From Scratch

**Byte Pair Encoding (BPE)** solves this by starting with characters and merging the most frequent pairs repeatedly.

It finds a middle ground: common words become single tokens, rare words get split into meaningful pieces.

In [None]:
def get_pairs(vocab):
    """Count all adjacent pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

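def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` in the vocabulary."""
    new_vocab = {}
    merged = ''.join(pair)
    for word, freq in vocab.items():
        symbols = word.split()
        out = []
        i = 0
        while i < len(symbols):
            # Compare whole symbols so a merge never matches across symbol boundaries
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[' '.join(out)] = freq
    return new_vocab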

def train_bpe(text, num_merges=20):
    """Train a simple BPE tokenizer."""
    # Start: every word is split into characters + end-of-word marker
    words = text.lower().split()
    vocab = Counter()
    for word in words:
        vocab[' '.join(list(word)) + ' </w>'] += 1

    print("Initial vocabulary (top 5):")
    for word, freq in vocab.most_common(5):
        print(f"  '{word}' x{freq}")
    print()

    merges = []
    for i in range(num_merges):
        pairs = get_pairs(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_pair(best_pair, vocab)
        merges.append(best_pair)
        print(f"Merge {i+1:2d}: {best_pair[0]} + {best_pair[1]} → {''.join(best_pair)}  (appeared {pairs[best_pair]} times)")

    return vocab, merges

# Train on a simple corpus
corpus = """
the cat sat on the mat the cat is fat the cat sat the mat is flat
a cat a bat a hat a rat the rat sat on the mat
"""

vocab, merges = train_bpe(corpus, num_merges=15)

In [None]:
# Show final vocabulary
print("\nFinal vocabulary after merges:")
for word, freq in sorted(vocab.items(), key=lambda x: -x[1]):
    print(f"  '{word}' x{freq}")
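
Training produces an ordered list of merges, and tokenizing a new word means replaying those merges in the same order. The sketch below shows one simple way to do that. The `apply_merges` helper and the hard-coded `example_merges` list are hypothetical, chosen so the block runs on its own; they are not the output of the training run above.

```python
def apply_merges(word, merges):
    """Segment one word by replaying learned merges in training order."""
    symbols = list(word.lower()) + ["</w>"]
    for left, right in merges:
        merged = left + right
        out, i = [], 0
        while i < len(symbols):
            # Merge only when both whole symbols match the learned pair
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list for illustration only
example_merges = [("a", "t"), ("at", "</w>"), ("t", "h"), ("th", "e"), ("the", "</w>")]

print(apply_merges("the", example_merges))   # a frequent word collapses to one symbol
print(apply_merges("cat", example_merges))   # shares the 'at</w>' piece
print(apply_merges("that", example_merges))  # unseen word, built from known pieces
```

Replaying every merge over every word is slow for large vocabularies, but it keeps the correspondence with training easy to follow.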

## Part 3: GPT-4's Real Tokenizer

GPT-4 uses the `cl100k_base` encoding, available through the `tiktoken` library, with a vocabulary of roughly 100,000 tokens learned from a large corpus. Let's explore it.

In [None]:
# Load GPT-4's tokenizer (cl100k_base)
enc = tiktoken.get_encoding("cl100k_base")

print(f"Vocabulary size: {enc.n_vocab:,} tokens")

def show_tokens(text):
    """Display text with token boundaries visible."""
    token_ids = enc.encode(text)
    tokens = [enc.decode([t]) for t in token_ids]
    
    print(f"\nText: {repr(text)}")
    print(f"Token IDs: {token_ids}")
    print(f"Tokens:    {tokens}")
    print(f"Count: {len(token_ids)} tokens")
    return token_ids, tokens

# Simple example
show_tokens("Hello world!")

## Part 4: Surprising Edge Cases

These examples reveal important truths about how LLMs actually process text.

In [None]:
# --- Numbers are tricky ---
print("=" * 50)
print("NUMBERS")
print("=" * 50)
show_tokens("1")
show_tokens("100")
show_tokens("1000")
show_tokens("10000")
show_tokens("100000")
print("\n→ This is one reason LLMs struggle with arithmetic!"
      "\n  '10000' is not one token - the model has to reason across multiple pieces.")

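Part of this splitting happens before BPE runs. The pre-tokenization pattern that `tiktoken` ships for `cl100k_base` limits each run of digits to chunks of at most three (`\p{N}{1,3}`), and BPE merges cannot cross chunk boundaries. The sketch below mimics only that one rule with the standard `re` module, using `\d` as a stand-in for `\p{N}`. `digit_chunks` is a hypothetical helper, not a `tiktoken` function.

```python
import re

def digit_chunks(number_text):
    """Split a digit string into left-to-right groups of at most three digits."""
    return re.findall(r"\d{1,3}", number_text)

for n in ["1", "100", "1000", "10000", "1234567"]:
    print(f"{n:>8} -> {digit_chunks(n)}")
```

Because the grouping runs left to right, the chunks do not line up with thousands separators, which is one plausible reason place value is awkward for the model.
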
In [None]:
# --- Spaces matter ---
print("=" * 50)
print("SPACES CHANGE TOKENS")
print("=" * 50)
show_tokens("cat")
show_tokens(" cat")
show_tokens("  cat")
print("\n→ The leading space is part of the token!")
print("  'cat' and ' cat' are different tokens with different IDs.")
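
GPT-style tokenizers typically cut text into pieces with a regular expression before BPE runs, and in that pattern an optional leading space belongs to the word that follows it. The pattern below is a simplified, ASCII-only approximation written for illustration, not the actual pattern used by `cl100k_base`. `pre_tokenize` is a hypothetical helper.

```python
import re

# Simplified stand-in for a GPT-style pre-tokenization pattern (assumption):
# an optional single space attaches to the following word, number, or symbol run.
SIMPLE_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

def pre_tokenize(text):
    """Split text into pieces; BPE merges would then run inside each piece."""
    return SIMPLE_PATTERN.findall(text)

print(pre_tokenize("the cat sat."))
print(pre_tokenize("  cat"))
```

Under this scheme a word at the start of a string and the same word mid-sentence are different pieces, so they end up with different token IDs.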

In [None]:
# --- Capitalization ---
print("=" * 50)
print("CAPITALIZATION")
print("=" * 50)
show_tokens("python")
show_tokens("Python")
show_tokens("PYTHON")
print("\n→ Same word, different tokens. The model learns each separately.")

In [None]:
# --- Rare words get split ---
print("=" * 50)
print("RARE vs COMMON WORDS")
print("=" * 50)
show_tokens("dog")
show_tokens("serendipity")
show_tokens("antidisestablishmentarianism")
show_tokens("supercalifragilisticexpialidocious")
print("\n→ Common short words = 1 token")
print("  Rare/long words = many tokens (the model is less 'practiced' at these)")

In [None]:
# --- Code tokenization ---
print("=" * 50)
print("CODE")
print("=" * 50)
show_tokens("def hello_world():")
show_tokens("for i in range(10):")
show_tokens("import numpy as np")
print("\n→ Common Python keywords and patterns are often single tokens.")
print("  The tokenizer saw a lot of code during training, so code stays fairly compact.")

In [None]:
# --- Other languages ---
print("=" * 50)
print("NON-ENGLISH TEXT")
print("=" * 50)
show_tokens("Hello, how are you?")       # English
show_tokens("Hola, ¿cómo estás?")         # Spanish
show_tokens("Bonjour, comment allez-vous?")  # French
show_tokens("안녕하세요")                   # Korean
show_tokens("مرحبا")                      # Arabic
print("\n→ English is usually the most token-efficient.")
print("  Other languages often use more tokens for the same meaning.")
print("  More tokens means higher cost and less usable context for non-English text.")

In [None]:
# --- Emojis ---
print("=" * 50)
print("EMOJIS")
print("=" * 50)
show_tokens("I love 🍕")
show_tokens("😀😂🤣")
show_tokens("🏳️‍🌈")
print("\n→ Emojis can be multiple tokens, especially complex compound ones.")
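
`cl100k_base` is a byte-level BPE, so every string starts out as its UTF-8 bytes before any merges apply. The sketch below counts code points and UTF-8 bytes for a few sample strings to show how uneven that starting point is. `utf8_profile` is a hypothetical helper, and the emoji are written as escape sequences so the code points are explicit.

```python
def utf8_profile(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

samples = {
    "English": "Hello",
    "Korean": "안녕하세요",
    "Arabic": "مرحبا",
    "Pizza emoji": "\U0001F355",
    # White flag + variation selector + zero-width joiner + rainbow
    "Rainbow flag": "\U0001F3F3\uFE0F\u200D\U0001F308",
}

for label, text in samples.items():
    chars, n_bytes = utf8_profile(text)
    print(f"{label:12} {chars} code points, {n_bytes} UTF-8 bytes")
```

More starting bytes generally means more merges are needed to reach a single token, so scripts and symbols that were rarer in the tokenizer's training data tend to cost more tokens.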

## Part 5: Token Efficiency - How Dense Is Your Text?

LLMs have a **context window** limit measured in tokens, not words, so the number of tokens your text needs determines how much of it fits.

In [None]:
def token_efficiency(text):
    """Show tokens per word ratio."""
    tokens = enc.encode(text)
    words = text.split()
    ratio = len(tokens) / max(len(words), 1)
    print(f"{len(tokens):4d} tokens | {len(words):4d} words | ratio {ratio:.2f} | {repr(text[:50])}")

print("Token efficiency comparison (tokens per word):")
print("-" * 70)
token_efficiency("The quick brown fox jumps over the lazy dog.")
token_efficiency("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)")
token_efficiency("2 + 2 = 4, 100 + 200 = 300, 1234 + 5678 = 6912")
token_efficiency("http://www.example.com/path/to/resource?param1=value1&param2=value2")
token_efficiency("안녕하세요 저는 한국어를 배우고 있습니다")
token_efficiency("😀 🎉 🍕 🚀 💡 🔥 ⭐ 🌍")

print("\n→ Prose English ≈ 1.3 tokens/word")
print("  Code, URLs, numbers, non-English = much less efficient")
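
A tokens-per-word ratio translates directly into how much text fits in a context window. The arithmetic below is a rough sketch: the 8,192-token budget and the 3.0 ratio for denser text are assumed values for illustration, and `words_that_fit` is a hypothetical helper.

```python
import math

def words_that_fit(context_tokens, tokens_per_word):
    """Estimate how many words fit in a token budget at a given ratio."""
    return math.floor(context_tokens / tokens_per_word)

budget = 8192  # hypothetical context window size, in tokens

for label, ratio in [("English prose", 1.3), ("Denser text (assumed)", 3.0)]:
    print(f"{label:22} ~{words_that_fit(budget, ratio):,} words")
```

Measuring the ratio on a sample of your own text with `token_efficiency` gives a better estimate than any fixed rule of thumb.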

## Part 6: The Reversal Test

A famous LLM quirk: models struggle to reverse strings. Now you know why.

In [None]:
def reversal_test(word):
    """Show why reversing words is hard for LLMs."""
    tokens = enc.encode(word)
    token_strings = [enc.decode([t]) for t in tokens]
    
    print(f"Word: '{word}'")
    print(f"Tokens: {token_strings}")
    
    reversed_word = word[::-1]
    reversed_tokens = enc.encode(reversed_word)
    reversed_token_strings = [enc.decode([t]) for t in reversed_tokens]
    
    print(f"Reversed: '{reversed_word}'")
    print(f"Reversed tokens: {reversed_token_strings}")
    print()

print("Why is reversing a string hard for an LLM?")
print("=" * 50)
reversal_test("hello")
reversal_test("tokenizer")
reversal_test("python")

print("→ To reverse 'tokenizer', the model can't just reverse the token order.")
print("  'tokenizer' might be 1 token, but 'rezinekot' is a completely different set.")
print("  The model has to reason at the character level, which it wasn't designed for.")
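
The gap between token order and character order can be shown without a tokenizer at all. The two-piece split below is a hypothetical segmentation chosen for illustration; the real tokenizer may split the word differently.

```python
# Hypothetical token split, for illustration only
pieces = ["token", "izer"]
word = "".join(pieces)

# Reversing the sequence the model actually sees (tokens)
reversed_by_tokens = "".join(reversed(pieces))

# Reversing what the task actually asks for (characters)
reversed_by_chars = word[::-1]

print(reversed_by_tokens)  # izertoken
print(reversed_by_chars)   # rezinekot
```

The correct answer requires knowing the spelling inside each token, which the model never receives directly as input.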

## Part 7: Your Turn - Explore!

Try your own examples.

In [None]:
# Try anything you want
my_text = "Type something here and see how it gets tokenized!"
show_tokens(my_text)

In [None]:
# How many tokens is your name?
names = ["John", "Mohammed", "Xiaoling", "Anastasia", "Bob"]
print("Token count per name:")
for name in names:
    tokens = enc.encode(name)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"  {name:15} → {decoded}  ({len(tokens)} token{'s' if len(tokens) > 1 else ''})")

## Summary: What Tokenization Tells Us About LLMs

| Observation | What It Means |
|---|---|
| Numbers split into multiple tokens | Multi-digit arithmetic is harder for LLMs |
| Spaces are part of tokens | Formatting and whitespace affect model behavior |
| Non-English uses more tokens | English-centric training = English advantage |
| Common words = 1 token | Frequent patterns are well-learned |
| Rare words = many tokens | LLMs struggle more with unusual vocabulary |
| Characters aren't the unit | String manipulation tasks (reversal, counting letters) are hard |

**Key insight:** The tokenizer is a hidden layer between you and the model. Understanding it explains many seemingly strange LLM behaviors.