# Lesson 1 — Tokens & Tokenizers (Build a Tiny BPE)
**Goal:** Understand how raw text becomes machine-friendly tokens. You'll implement a *mini* Byte Pair Encoding (BPE) tokenizer and watch merge rules form in front of you.

**Why this matters**
- Every LLM reads text as a sequence of **tokens**, each stored as an integer ID, similar to how LEGO instructions list part numbers.
- A tokenizer translates words, punctuation, and even emojis into those IDs. Different tokenizers create different "vocabularies."
- BPE is the technique behind the tokenizers of many modern LLMs (for example, the GPT-2/3/4 family uses a byte-level variant) because it can break rare words into smaller pieces while keeping common chunks whole.
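
To make the "tokens are integer IDs" idea concrete, here is a minimal sketch of the lookup step. The vocabulary, the IDs, and the `<unk>` fallback piece are all invented for illustration; a real tokenizer learns its vocabulary from a corpus.

```python
# Hypothetical toy vocabulary: pieces and IDs are made up for this example.
toy_vocab = {"the": 0, "dog": 1, "s": 2, "bark": 3, "<unk>": 4}

def encode(pieces, vocab):
    """Map each token piece to its integer ID, falling back to <unk>."""
    return [vocab.get(piece, vocab["<unk>"]) for piece in pieces]

def decode(ids, vocab):
    """Invert the mapping to recover the pieces from their IDs."""
    id_to_piece = {idx: piece for piece, idx in vocab.items()}
    return [id_to_piece[idx] for idx in ids]

pieces = ["the", "dog", "s", "bark"]
ids = encode(pieces, toy_vocab)
print(ids)                     # [0, 1, 2, 3]
print(decode(ids, toy_vocab))  # ['the', 'dog', 's', 'bark']
```

The model only ever sees the list of integers; the text pieces exist purely on the tokenizer side.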

**Key idea map**
1. Start with characters (like spelling out a word letter-by-letter).
2. Find the most common pair of neighboring symbols.
3. Glue that pair together into a new symbol, repeat, and build a vocabulary of multi-character pieces.

**Vocabulary check**
- **Token:** The basic chunk of text an LLM sees (could be a character, word, or syllable-like piece).
- **Corpus:** The collection of text you train the tokenizer on. Change the corpus → change which chunks appear.
- **Merge rule:** An instruction that says "whenever you see `a` next to `n`, treat them as the single token `an`."
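
A merge rule can be pictured as a small function over a list of symbols. The sketch below is one possible way to write it, not the only one; the helper name `apply_merge` is made up for this example.

```python
def apply_merge(symbols, pair):
    """Return a new list where every adjacent occurrence of `pair` is fused."""
    a, b = pair
    out = []
    i = 0
    while i < len(symbols):
        # Fuse only when both neighbors match the rule exactly
        if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

symbols = list("banana")
merged = apply_merge(symbols, ("a", "n"))
print(merged)  # ['b', 'an', 'an', 'a']
```

Notice that the final `a` stays on its own because it has no `n` after it.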

> ⚙️ Tip: Edit the `data/*.txt` files (e.g., add your own space/dogs/Minecraft stories) and re-run! Changing the training text should change the learned tokens.

In [None]:

from pathlib import Path
import collections
import re

data_dir = Path("data")
corpus_files = ["space.txt", "animals.txt", "minecraft.txt"]
text = ""
for fname in corpus_files:
    text += (data_dir / fname).read_text(encoding="utf-8") + "\n"

print(f"Loaded {len(text)} characters from {corpus_files}")
print(text[:400])


## Step 1: Start from characters
We'll implement a very small BPE-like tokenizer:
1. **Split words into characters** and tack on a special end-of-word symbol `</w>`. This keeps track of where words stop so that merges don’t jump across spaces.
2. **Count the frequency of adjacent symbol pairs.** Think of scanning the text with a magnifying glass and tallying how often each pair shows up.
3. **Merge the most frequent pair into a new symbol.** For example, if `t` and `h` appear together constantly, create the new symbol `th`.
4. **Repeat the process for N merges.** Each merge teaches the tokenizer a new mini-word. After enough merges, common words become a single token, while rare words stay in smaller pieces.

> 📎 Analogy: Imagine compressing text by inventing shorthand. If you write "laugh out loud" many times, you eventually invent "lol". BPE does the same but automatically.
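
Before running this on the real corpus, it can help to trace steps 1 and 2 by hand on a tiny example. The sketch below uses a made-up four-word text (not one of the lesson's data files) and shows that pairs are only counted inside a word, never across the space between two words.

```python
from collections import Counter

toy_text = "low low lower lowest"  # invented text, small enough to check by hand

# Step 1: each word becomes space-separated characters plus an end-of-word marker
toy_vocab = Counter(" ".join(word) + " </w>" for word in toy_text.split())

# Step 2: tally adjacent symbol pairs, weighted by how often each word occurs
pair_counts = Counter()
for word, freq in toy_vocab.items():
    symbols = word.split()
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += freq

print(toy_vocab)
print(pair_counts[("l", "o")])     # 4: once per word, and "low" occurs twice
print(pair_counts[("w", "</w>")])  # 2: only the two plain "low" words end in w
print(pair_counts[("w", "l")])     # 0: pairs never cross a word boundary
```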

In [None]:

def get_vocab(text):
    # Split on whitespace; add </w> to mark word boundary
    words = re.findall(r"\S+", text.lower())
    vocab = collections.Counter([" ".join(list(w)) + " </w>" for w in words])
    return vocab

def get_stats(vocab):
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

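def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(" ".join(pair))
    # Match the pair only as two whole symbols. A plain str.replace would also
    # hit "h e" inside "th e" and wrongly glue it into "the".
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    replacement = "".join(pair)
    for word, freq in v_in.items():
        new_word = pattern.sub(lambda m: replacement, word)
        v_out[new_word] = v_out.get(new_word, 0) + freq
    return v_out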

vocab = get_vocab(text)
print(f"Initial vocab size: {len(vocab)} unique words (each stored as space-separated characters)")
# Show the five most frequent adjacent pairs before any merges
get_stats(vocab).most_common(5)


## Step 2: Learn merges
Now run a loop for a modest number of merges (try 50–200). After each merge:
- Print the pair you merged and its frequency to see *why* it was chosen.
- Keep a list of all merges. This is your tokenizer’s "recipe" for rebuilding tokens from characters.
- Notice how new merges create longer and longer chunks—letters → syllables → whole words.

**Reflection prompts**
- Which words quickly become single tokens? (Usually common words like "the" or names that show up repeatedly.)
- Which words stay split? (Often rare words or made-up ones like fantasy spells.)
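
One way to answer these prompts is to scan the vocabulary after training and separate entries that collapsed to a single symbol from those that did not. This is a sketch that assumes the vocabulary keeps the format used in this lesson (space-separated symbols as keys, counts as values); the helper name and the toy entries are invented.

```python
def single_token_words(vocab):
    """Return (word, frequency) for entries that collapsed to one symbol."""
    found = []
    for entry, freq in vocab.items():
        symbols = entry.split()
        if len(symbols) == 1:
            # Strip the end-of-word marker for display
            found.append((symbols[0].replace("</w>", ""), freq))
    return sorted(found, key=lambda item: item[1], reverse=True)

# Hypothetical snapshot of a vocabulary after some merges
toy_vocab = {"the</w>": 12, "d og s</w>": 3, "r e d st o n e </w>": 1, "a</w>": 20}
merged_words = single_token_words(toy_vocab)
print(merged_words)  # [('a', 20), ('the', 12)]
```

Calling `single_token_words(vocab)` on your trained `vocab` should list the words that became single tokens, most frequent first.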

In [None]:

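num_merges = 100  # try 50, 100, 200 and compare
merges = []
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break  # every word is already a single symbol
    best = max(pairs, key=pairs.get)
    # Show which pair won this round and how often it occurred
    print(f"{i + 1:3d}. merge {best} (count {pairs[best]})")
    merges.append(best)
    vocab = merge_vocab(best, vocab)

print(f"Learned {len(merges)} merges. First 10 merges:")
print(merges[:10])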


## Step 3: Tokenize new text with the learned merges
With your merge list in hand, you can tokenize any new sentence by applying the merges greedily from first to last.

**What to observe**
- Try tokenizing a sentence that was *not* in the training corpus. Do the tokens still make sense?
- Compare the number of tokens you get with and without the learned merges. Fewer tokens for the same sentence usually means your tokenizer captured useful chunks.
- Peek at the actual token pieces. You should see familiar letter combinations that reflect the stories you fed it.

> 🧪 Mini-experiment: Tokenize the same sentence using (1) only characters and (2) your trained merges. Count tokens both ways.
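
Here is one way the mini-experiment could be set up. To keep the sketch self-contained it defines its own small tokenizer and a hand-written merge list; both `bpe_tokenize` and `toy_merges` are stand-ins, so swap in your trained `merges` to run the real comparison. Unlike the lesson's `tokenize`, this version keeps the `</w>` marker in both counts so the comparison is like-for-like.

```python
def bpe_tokenize(word, merges):
    """Split a word into characters plus </w>, then replay merges in order."""
    symbols = list(word.lower()) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hand-written merge list, for illustration only
toy_merges = [("t", "h"), ("th", "e"), ("the", "</w>"), ("o", "g")]
sentence = "The dog chased the frog"

# (1) characters only: an empty merge list leaves every symbol separate
char_count = sum(len(bpe_tokenize(w, [])) for w in sentence.split())
# (2) with merges applied
bpe_count = sum(len(bpe_tokenize(w, toy_merges)) for w in sentence.split())

print(char_count, bpe_count)  # 24 16
```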

In [None]:

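def tokenize(word, merges):
    """Split one word into BPE tokens by replaying the learned merges in order."""
    # start from characters + </w>
    symbols = list(word.lower()) + ["</w>"]
    # apply merges in the order they were learned
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair, then re-check position i
            else:
                i += 1
    # drop a standalone end-of-word marker; a marker already merged into a
    # token (e.g. "s</w>") is kept so you can see word-final pieces
    if symbols and symbols[-1] == "</w>":
        symbols = symbols[:-1]
    return symbols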

samples = [
    "Dogonauts", "creeper", "astronaut", "dogs", "minecraft", "portal", "village",
    "retriever", "falcon", "rocketship"
]
for s in samples:
    print(s, "->", tokenize(s, merges))


### Challenge
- **Corpus remix:** Add lots of Minecraft vocabulary ("Redstone", "Creeper", "Netherite") to the corpus, retrain, and inspect the merge list. Do those words become single tokens?
- **Compare corpora:** Train once on a space-themed story and once on an animal-themed story. Tokenize the same sentence with both tokenizers and compare the outputs.
- **Stretch goal:** Write a function that visualizes how token counts shrink as you increase the number of merges.
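
If you want a starting point for the stretch goal, the sketch below is one possible approach. It retrains on a made-up toy text, records the total token count on that same text after each merge, and prints a text bar per step instead of a real plot. The function names are invented, and it stores words as tuples rather than the space-separated strings used earlier.

```python
from collections import Counter

def fuse(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if symbols[i:i + 2] == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def token_counts(text, max_merges):
    """Return total token counts on `text` after 0, 1, 2, ... merges."""
    words = Counter(tuple(w) + ("</w>",) for w in text.lower().split())
    counts = [sum(len(s) * f for s, f in words.items())]
    for _ in range(max_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merged = Counter()
        for symbols, freq in words.items():
            merged[fuse(symbols, best)] += freq
        words = merged
        counts.append(sum(len(s) * f for s, f in words.items()))
    return counts

counts = token_counts("low low lower lowest", max_merges=5)
for n, total in enumerate(counts):
    print(f"{n:2d} merges | {'#' * total} {total}")
```

To turn this into a real chart, you could pass `counts` to a plotting library of your choice.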