
# Lesson 1 — Tokens & Tokenizers (Build a Tiny BPE)
**Goal:** Understand how raw text becomes tokens. You'll implement a *mini* Byte Pair Encoding (BPE) tokenizer and see merges happen.

**What you'll learn**
- Why LLMs use *subword* tokens
- How frequency-based merges reduce token count
- How changing the training corpus changes merges and tokenization

> ⚙️ Tip: Edit the `data/*.txt` files (for example, add your own text about space, dogs, or Minecraft) and re-run the notebook!


In [None]:

from pathlib import Path
import collections
import re

data_dir = Path("../data")
corpus_files = ["space.txt", "animals.txt", "minecraft.txt"]
text = ""
for fname in corpus_files:
    text += (data_dir / fname).read_text(encoding="utf-8") + "\n"

print(f"Loaded {len(text)} characters from {corpus_files}")
print(text[:400])



## Step 1: Start from characters
We'll implement a very small BPE-like tokenizer:
1. Start by splitting words into characters (with a special end-of-word symbol `</w>`).
2. Count the frequency of adjacent symbol pairs.
3. Merge the most frequent pair into a new symbol.
4. Repeat for N merges.
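
To make steps 2 and 3 concrete, here is a minimal sketch of a single merge on a hand-made toy vocabulary. The names `toy_vocab`, `count_pairs`, and `merge_pair` are made up for this illustration and are not part of the notebook's code; words are stored as tuples of symbols here rather than as space-separated strings.

```python
from collections import Counter

# Toy vocabulary (hypothetical): word as a tuple of symbols -> frequency
toy_vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
}

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

toy_pairs = count_pairs(toy_vocab)
best_pair = max(toy_pairs, key=toy_pairs.get)
toy_after = merge_pair(best_pair, toy_vocab)

print(best_pair, toy_pairs[best_pair])  # ('w', 'e') 8
for symbols, freq in toy_after.items():
    print(freq, " ".join(symbols))
```

The pair `('w', 'e')` wins because it appears in both `lower` (2 times) and `newest` (6 times), so after the merge those words contain the new symbol `we` while `low` is unchanged.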


In [None]:

def get_vocab(text):
    """Count each lowercased word, stored as space-separated characters plus </w>."""
    # Split on whitespace; </w> marks the end of a word
    words = re.findall(r"\S+", text.lower())
    vocab = collections.Counter(" ".join(w) + " </w>" for w in words)
    return vocab

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

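def merge_vocab(pair, v_in):
    v_out = {}
    # Match the pair only where both parts are whole symbols. A plain
    # str.replace could also match across symbol boundaries, e.g. the pair
    # ("a", "b") inside the symbols "xa b".
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    for word, freq in v_in.items():
        # Use a function so backslashes in the replacement are taken literally
        new_word = pattern.sub(lambda m: replacement, word)
        v_out[new_word] = v_out.get(new_word, 0) + freq
    return v_out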

vocab = get_vocab(text)
print(f"Initial vocab size: {len(vocab)} unique words (each stored as space-separated characters)")
# Show the five most frequent adjacent symbol pairs before any merges
get_stats(vocab).most_common(5)



## Step 2: Learn merges
Run a small number of merges (e.g., 100). Inspect the most frequent pairs and the final merges list.
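
One way to see why merges matter is to track how many tokens the training text needs after each merge. The sketch below is self-contained and uses a tiny made-up corpus; `learn_merges`, `demo_text`, and the tuple-based word representation are assumptions for this illustration, not the notebook's own functions.

```python
from collections import Counter

def learn_merges(text, num_merges):
    """Learn up to `num_merges` merges and record the corpus token count after each."""
    words = Counter(tuple(w) + ("</w>",) for w in text.lower().split())
    merges, token_counts = [], []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair merged into one symbol
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
        token_counts.append(sum(len(s) * f for s, f in words.items()))
    return merges, token_counts

demo_text = "low low low lower lower newest newest newest"
demo_merges, demo_counts = learn_merges(demo_text, num_merges=50)
print("tokens after each merge:", demo_counts)
```

The corpus starts at 45 character-level tokens (including `</w>` markers), and every merge lowers the count. With enough merges on such a small corpus, each of the 8 words ends up as a single token, which is also why a very large `num_merges` stops early.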


In [None]:

num_merges = 100  # try 50, 100, 200 and compare
merges = []
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break  # nothing left to merge
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab = merge_vocab(best, vocab)

print(f"Learned {len(merges)} merges. First 10 (in the order learned):")
print(merges[:10])



## Step 3: Tokenize new text with the learned merges
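We'll apply the learned merges to each new word in the order they were learned, from first to last, merging every matching adjacent pair before moving on to the next rule.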
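
The order of the merge list matters: the same rules in a different order can split a word differently. Here is a small sketch with two hand-written rule lists (they are not learned from the corpus, and `apply_merges`, `rules_a`, and `rules_b` are names invented for this example). Unlike the notebook's tokenizer, it keeps the `</w>` marker in the output so you can see it.

```python
def apply_merges(word, merge_list):
    """Apply each merge rule in order across the whole word."""
    symbols = list(word) + ["</w>"]
    for left, right in merge_list:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge in place
            else:
                i += 1
    return symbols

# Hand-written merge lists for illustration only
rules_a = [("l", "o"), ("lo", "w"), ("e", "r")]
rules_b = [("o", "w"), ("l", "o"), ("e", "r")]

print(apply_merges("lower", rules_a))  # ['low', 'er', '</w>']
print(apply_merges("lower", rules_b))  # ['l', 'ow', 'er', '</w>']
```

In `rules_b`, merging `o w` first uses up the `o`, so the later `l o` rule never finds a match.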


In [None]:

def tokenize(word, merges):
    # start from characters + </w>
    symbols = list(word.lower()) + ["</w>"]
    # apply merges in order
    for (a,b) in merges:
        i = 0
        while i < len(symbols)-1:
            if symbols[i] == a and symbols[i+1] == b:
                symbols[i:i+2] = [a+b]
            else:
                i += 1
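    # Drop the end-of-word marker if it is still a separate symbol;
    # when it was merged into a token (e.g. "s</w>") it stays in the output
    if symbols and symbols[-1] == "</w>":
        symbols = symbols[:-1]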
    return symbols

samples = [
    "Dogonauts", "creeper", "astronaut", "dogs", "minecraft", "portal", "village",
    "retriever", "falcon", "rocketship"
]
for s in samples:
    print(s, "->", tokenize(s, merges))



### Challenge
- Edit the corpus files and add lots of *Minecraft* words, then re-run the merge training. Do you get different tokens?
- Compare token counts for the same sentence under different corpora.
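
If you want a starting point for the second challenge, the following self-contained sketch trains on two tiny made-up corpora and counts tokens for the same text under each. The corpora and the helper names (`merge_symbols`, `train_bpe`, `count_tokens`) are assumptions for illustration; swap in text from your own `data/*.txt` files. Token counts here include `</w>` markers.

```python
from collections import Counter

def merge_symbols(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(text, num_merges):
    """Return the list of merges learned from `text`, most frequent pair first."""
    words = Counter(tuple(w) + ("</w>",) for w in text.lower().split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_symbols(symbols, best)] += freq
        words = merged
    return merges

def count_tokens(sentence, merges):
    """Tokenize each word with `merges` and return the total number of tokens."""
    total = 0
    for word in sentence.lower().split():
        symbols = tuple(word) + ("</w>",)
        for pair in merges:
            symbols = merge_symbols(symbols, pair)
        total += len(symbols)
    return total

# Two tiny made-up corpora
corpus_space = "rocket rocket rocket orbit orbit launch"
corpus_game = "creeper creeper creeper portal portal village"
merges_space = train_bpe(corpus_space, 50)
merges_game = train_bpe(corpus_game, 50)

for word in ["rocket", "creeper"]:
    print(word,
          "| space corpus:", count_tokens(word, merges_space),
          "| game corpus:", count_tokens(word, merges_game))
```

A word that appears in the training corpus collapses to a single token once enough merges are learned, while the same word under the other corpus stays split into several pieces.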
