# Tokenization Demo Notebook

This notebook implements several tokenizers and compares their behavior using a few simple metrics.

**Tokenizers covered**:
- Word-level (regex-based)
- Character-level
- BPE (char-level with `</w>` marker)
- WordPiece (toy greedy longest-match with `##` continuation)
- SentencePiece-style **Unigram** (toy Viterbi segmentation with `▁` boundaries)
- **Byte-level BPE** (toy, UTF-8 bytes; fully reversible)

> These are **educational toy implementations**, intended to show the ideas. For production use, prefer libraries such as **Hugging Face `tokenizers`**, **SentencePiece**, or **tiktoken**.


In [1]:
# Imports
import re, time, math
from collections import Counter
from typing import List, Tuple, Dict, Any
import pandas as pd
import numpy as np
from IPython.display import display

print('Environment ready. pandas:', pd.__version__, '| numpy:', np.__version__)

Environment ready. pandas: 1.5.3 | numpy: 1.24.4


## Evaluation Sentences and Tiny Training Corpus
We evaluate on a few sentences (including multilingual and emoji) and train toy subword models on a tiny in-memory corpus.

In [2]:
SENT_FIXED = "I am thrilled to learn gen AI and build my own applications in @2025*"
sentences_eval = [
    "Hello world!",
    "State-of-the-art models work.",
    SENT_FIXED,
    "I love 🍕 and λ-calculus in 2025!",
    "नमस्ते दुनिया — Καλημέρα κόσμε — こんにちは世界"
]

corpus_train = [
    "We build generative AI systems in 2025.",
    "I am thrilled to learn gen AI and build my own applications in @2025*",
    "Tokenization balances vocabulary size and sequence length.",
    "Byte-level BPE helps with emojis like 🍕 and code like print(x).",
    "WordPiece and Unigram are popular subword methods.",
    "LLaMA and GPT use different tokenizers for efficiency.",
    "State-of-the-art models work well on multilingual data.",
]
print('Example & corpus loaded. Eval sentences:', len(sentences_eval), '| Train lines:', len(corpus_train))

Example & corpus loaded. Eval sentences: 5 | Train lines: 7


## Metrics Helper
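We compute simple, comparable metrics across tokenizers: `chars`, `bytes`, `tokens`, `chars/token`, `bytes/token`, `OOV`, and `detokenizes_exact`; `time_ms` is added later by the evaluation loop. Note that `chars/token` divides the input length by the token count, whereas `bytes/token` averages the UTF-8 length of the tokens themselves, so the two can diverge when a tokenizer drops whitespace or adds marker tokens.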

In [3]:
def metrics_from_tokens(text: str, tokens: List[Any], detokenize_fn=None, oov_count:int=None):
    text_chars = len(text)
    text_bytes = len(text.encode('utf-8', errors='strict'))
    n_tokens = len(tokens)
    avg_chars_per_token = (text_chars / n_tokens) if n_tokens else np.nan
    def token_bytes_len(tok):
        if isinstance(tok, (bytes, bytearray)):
            return len(tok)
        if isinstance(tok, tuple) and all(isinstance(b, int) for b in tok):
            return len(tok)
        return len(str(tok).encode('utf-8', errors='ignore'))
    bytes_per_token = np.mean([token_bytes_len(t) for t in tokens]) if tokens else np.nan
    chars_per_token = avg_chars_per_token
    oov = int(oov_count) if oov_count is not None else 0
    recon_ok = None
    if detokenize_fn is not None:
        try:
            recon = detokenize_fn(tokens)
            recon_ok = (recon == text)
        except Exception:
            recon_ok = False
    return {
        "chars": text_chars,
        "bytes": text_bytes,
        "tokens": n_tokens,
        "chars/token": round(chars_per_token, 3) if isinstance(chars_per_token, (int, float)) else np.nan,
        "bytes/token": round(bytes_per_token, 3) if isinstance(bytes_per_token, (int, float)) else np.nan,
        "OOV": oov,
        "detokenizes_exact": recon_ok
    }
print('Metrics helper ready.')

Metrics helper ready.
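
To illustrate how the two ratios are formed, the following standalone sketch computes them by hand for a short string. It does not call `metrics_from_tokens`, and the whitespace split is only an assumed stand-in for a real tokenizer.

```python
# Standalone sketch: the whitespace split is an assumed stand-in tokenizer.
text = "I love \U0001F355"            # ends with the pizza emoji (one code point)
tokens = text.split()                 # three tokens; the spaces are dropped

n_chars = len(text)                   # Unicode code points, spaces included
n_bytes = len(text.encode("utf-8"))   # the emoji alone takes 4 bytes

# chars/token uses the length of the *input*; bytes/token averages the
# encoded length of the *tokens*, so the dropped spaces only affect the latter.
chars_per_token = n_chars / len(tokens)
bytes_per_token = sum(len(t.encode("utf-8")) for t in tokens) / len(tokens)

print(n_chars, n_bytes, len(tokens))
print(round(chars_per_token, 3), round(bytes_per_token, 3))
```

Here the input has 8 characters and 11 bytes, while the three tokens together hold only 9 bytes because the two spaces are not part of any token.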


## 1) Word-level Tokenizer (Regex-based)
Simple whitespace/punctuation splitting. Easy to read, but the vocabulary grows with every new word form, and any word not seen in training is **OOV**.

In [4]:
class WordTokenizer:
    def __init__(self):
        self.pattern = re.compile(r"\w+(?:'\w+)?|[^\w\s]", flags=re.UNICODE)
    def tokenize(self, text: str) -> List[str]:
        return self.pattern.findall(text)
    def detokenize(self, tokens: List[str]) -> str:
        # Heuristic: put a space before every word token except the first and
        # never before punctuation. This is lossy: "State-of-the-art" comes
        # back as "State- of- the- art".
        out = []
        for t in tokens:
            is_punct = re.match(r"[^\w\s]", t) is not None
            is_word = not is_punct and not t.isspace()
            if out and is_word:
                out.append(" ")
            out.append(t)
        return "".join(out)
print('WordTokenizer ready.')

WordTokenizer ready.
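
The OOV problem can be made concrete with a small standalone sketch. It repeats the regex from `WordTokenizer` so it runs on its own; the second sentence and the idea of building a vocabulary from a single training line are assumptions made for illustration.

```python
import re

# Same pattern as WordTokenizer, repeated so this sketch is self-contained.
pattern = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

train = "We build generative AI systems in 2025."
test = "We build agentic AI systems in 2026!"   # hypothetical unseen sentence

vocab = set(pattern.findall(train))             # word-level vocabulary
test_tokens = pattern.findall(test)
oov = [t for t in test_tokens if t not in vocab]

print(test_tokens)
print("OOV:", oov)
```

Three of the eight test tokens (`agentic`, `2026`, `!`) are missing from the vocabulary even though the sentences differ only slightly, which is the behavior subword tokenizers are designed to avoid.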


## 2) Character-level Tokenizer
Every character is a token. No OOV; very long sequences; weak per-token semantics.

In [5]:
class CharTokenizer:
    def tokenize(self, text: str) -> List[str]:
        return list(text)
    def detokenize(self, tokens: List[str]) -> str:
        return "".join(tokens)
print('CharTokenizer ready.')

CharTokenizer ready.
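
One subtlety: "character" here means a Unicode code point, so scripts that use combining signs are split below the level a reader would call a letter. A minimal sketch using the first word of the Hindi evaluation sentence, written with escapes so the code points are explicit:

```python
# First word of the Hindi evaluation sentence, spelled out as code points.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

char_tokens = list(word)                  # what a character-level tokenizer emits
n_bytes = len(word.encode("utf-8"))       # each of these code points needs 3 bytes

print(len(char_tokens), n_bytes)
# The fourth token is the virama (U+094D), a combining sign rather than a letter.
print([f"U+{ord(c):04X}" for c in char_tokens])
```

The word yields 6 tokens and 18 bytes, which is why the `bytes/token` column exceeds 1.0 for the character tokenizer on non-ASCII text.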


## 3) BPE (char-level) – helpers and tokenizer
Frequency-based: repeatedly **merge the most frequent adjacent pair** of symbols. This toy version stops after a fixed number of merges (`num_merges`) rather than at a target vocabulary size.

In [6]:
def get_vocab_from_corpus(corpus: List[str]) -> List[List[str]]:
    vocab = []
    for line in corpus:
        for word in re.findall(r"\S+", line):
            vocab.append(list(word) + ["</w>"])
    return vocab

def count_pairs(vocab_seqs: List[List[str]]) -> Counter:
    pairs = Counter()
    for symbols in vocab_seqs:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(vocab_seqs: List[List[str]], pair: Tuple[str, str]) -> List[List[str]]:
    a, b = pair
    merged = a + b
    new_vocab = []
    for symbols in vocab_seqs:
        i = 0
        new_symbols = []
        while i < len(symbols):
            if i < len(symbols)-1 and symbols[i] == a and symbols[i+1] == b:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        new_vocab.append(new_symbols)
    return new_vocab

class BPETokenizer:
    def __init__(self, merges: List[Tuple[str, str]]):
        self.ranks = {tuple(m): i for i, m in enumerate(merges)}
    @staticmethod
    def train(corpus: List[str], num_merges: int = 200) -> "BPETokenizer":
        vocab = get_vocab_from_corpus(corpus)
        merges = []
        for _ in range(num_merges):
            pairs = count_pairs(vocab)
            if not pairs: break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            vocab = merge_pair(vocab, best)
        return BPETokenizer(merges)
    def bpe_word(self, word: str) -> List[str]:
        symbols = list(word) + ["</w>"]
        while True:
            pairs = [((symbols[i], symbols[i+1]), i) for i in range(len(symbols)-1)]
            ranked = [(self.ranks.get(p, 10**9), idx, p) for p, idx in pairs]
            if not ranked: break
            rank, idx, pair = min(ranked)
            if rank == 10**9: break
            a, b = pair
            symbols = symbols[:idx] + [a+b] + symbols[idx+2:]
        if symbols and symbols[-1] == "</w>":
            symbols = symbols[:-1]
        return symbols
    def tokenize(self, text: str) -> List[str]:
        words = re.findall(r"\S+", text)
        toks = []
        for w in words:
            toks.extend(self.bpe_word(w) + ["<space>"])
        if toks and toks[-1] == "<space>":
            toks = toks[:-1]
        return toks
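    def detokenize(self, tokens: List[str]) -> str:
        words = []
        cur = []
        for t in tokens:
            if t == "<space>":
                words.append("".join(cur)); cur = []
            else:
                cur.append(t)
        if cur: words.append("".join(cur))
        # Merged tokens keep the end-of-word marker (e.g. 'am</w>'); drop it
        # so the reconstructed text can match the input.
        words = [w[:-len("</w>")] if w.endswith("</w>") else w for w in words]
        return " ".join(words)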
print('BPE helpers & tokenizer ready.')

BPE helpers & tokenizer ready.
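
A single training step can be traced by hand. The sketch below is standalone and uses a hypothetical four-word corpus chosen so that one pair is clearly the most frequent; the `merge` helper is a simplified stand-in for `merge_pair`.

```python
from collections import Counter

# Hypothetical corpus; each word becomes a list of characters plus "</w>".
words = ["hug", "hug", "pug", "hugs"]
seqs = [list(w) + ["</w>"] for w in words]

# Count adjacent symbol pairs across all words.
pairs = Counter()
for s in seqs:
    pairs.update(zip(s, s[1:]))
best = max(pairs, key=pairs.get)      # the unique most frequent pair

def merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if symbols[i:i + 2] == list(pair):
            out.append(pair[0] + pair[1]); i += 2
        else:
            out.append(symbols[i]); i += 1
    return out

seqs = [merge(s, best) for s in seqs]
print(best, pairs[best])
print(seqs[0], seqs[3])
```

The pair `('u', 'g')` occurs four times, so it is merged first, turning `hug` into `['h', 'ug', '</w>']` and `hugs` into `['h', 'ug', 's', '</w>']`. Training repeats this count-and-merge step, recording each chosen pair in order.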


## 4) WordPiece (toy)
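Greedy longest-match over a subword vocabulary; uses `##` to mark continuation pieces. May emit `[UNK]` for unknowns. The toy trainer below simply keeps the most frequent substrings (up to 6 characters) rather than applying WordPiece's likelihood-based merge criterion.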

In [7]:
class WordPieceTokenizer:
    def __init__(self, vocab: set, unk_token="[UNK]"):
        self.vocab = vocab
        self.unk = unk_token
    @staticmethod
    def train(corpus: List[str], vocab_size: int = 300) -> "WordPieceTokenizer":
        words = []
        for line in corpus:
            words.extend(re.findall(r"\S+", line))
        base = set(ch for w in words for ch in w)
        vocab = set(list(base))
        substr_freq = Counter()
        for w in words:
            L = len(w)
            for i in range(L):
                for j in range(i+1, min(L, i+6)+1):
                    substr_freq[w[i:j]] += 1
        for piece, _ in substr_freq.most_common():
            if len(vocab) >= vocab_size: break
            if piece not in vocab: vocab.add(piece)
        for w in Counter(words).most_common(50):
            if len(vocab) >= vocab_size: break
            vocab.add(w[0])
        return WordPieceTokenizer(vocab)
    def tokenize_word(self, word: str) -> List[str]:
        tokens, i = [], 0
        while i < len(word):
            match = None
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in self.vocab:
                    match = piece; break
            if match is None:
                if word[i] in self.vocab: match = word[i]
                else: return [self.unk]
            tokens.append(match if i==0 else ("##"+match))
            i += len(match)
        return tokens
    def tokenize(self, text: str) -> List[str]:
        toks = []
        for w in re.findall(r"\S+", text):
            toks.extend(self.tokenize_word(w) + ["<space>"])
        if toks and toks[-1] == "<space>": toks = toks[:-1]
        return toks
    def detokenize(self, tokens: List[str]) -> str:
        words, cur = [], ""
        for t in tokens:
            if t == "<space>":
                words.append(cur); cur = ""
            elif t.startswith("##"):
                cur += t[2:]
            elif t == self.unk:
                cur += ""
            else:
                if cur: words.append(cur)
                cur = t
        if cur: words.append(cur)
        return " ".join(words)
print('WordPiece (toy) ready.')

WordPiece (toy) ready.
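
The greedy longest-match rule is easiest to see with a hand-written vocabulary. The function name `greedy_wordpiece` and the vocabulary below are hypothetical; as in the class above, pieces are stored without the `##` prefix and the prefix is added on output.

```python
def greedy_wordpiece(word, vocab, unk="[UNK]"):
    """At each position take the longest piece found in vocab, else give up."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j] if i == 0 else "##" + word[i:j])
                i = j
                break
        else:
            return [unk]                           # no piece matches at position i
    return tokens

vocab = {"thrill", "th", "ed", "t", "e", "d"}      # hypothetical toy vocabulary
print(greedy_wordpiece("thrilled", vocab))
print(greedy_wordpiece("xyz", vocab))
```

`thrilled` is split into `thrill` and `##ed` because `thrill` is the longest prefix in the vocabulary, even though the shorter `th` and `t` also match. `xyz` has no matching piece at its first character, so the whole word collapses to `[UNK]`.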


## 5) SentencePiece-style Unigram (toy)
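Probabilistic Unigram LM over candidate pieces; segmentation via Viterbi to maximize likelihood. The toy trainer estimates piece probabilities from raw substring counts (up to 8 characters) and skips the EM re-estimation and pruning that SentencePiece performs.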

In [8]:
class UnigramTokenizer:
    def __init__(self, pieces_probs: Dict[str, float]):
        self.pieces_probs = pieces_probs
        self.pieces = set(pieces_probs.keys())
    @staticmethod
    def train(corpus: List[str], vocab_size: int = 300) -> "UnigramTokenizer":
        words = []
        for line in corpus:
            for w in re.findall(r"\S+", line):
                words.append("▁"+w)
        cand = Counter()
        for w in words:
            L = len(w)
            for i in range(L):
                for j in range(i+1, min(L, i+8)+1):
                    cand[w[i:j]] += 1
        most = cand.most_common(vocab_size-1)
        total = sum(cnt for _, cnt in most) + 1e-9
        probs = {p: cnt/total for p, cnt in most}
        for ch in set("".join(words)):
            if ch not in probs and len(probs) < vocab_size:
                probs[ch] = 1.0 / (total * 10_000)
        Z = sum(probs.values())
        for k in list(probs.keys()):
            probs[k] /= Z
        return UnigramTokenizer(probs)
    def viterbi_segment(self, w_with_bar: str) -> List[str]:
        n = len(w_with_bar)
        dp = [(-math.inf, -1)] * (n+1)
        dp[0] = (0.0, -1)
        for i in range(n):
            if dp[i][0] == -math.inf: continue
            for j in range(i+1, min(n, i+12)+1):
                piece = w_with_bar[i:j]
                if piece in self.pieces:
                    logp = math.log(self.pieces_probs[piece] + 1e-12)
                    if dp[i][0] + logp > dp[j][0]:
                        dp[j] = (dp[i][0] + logp, i)
        if dp[n][0] == -math.inf:
            return list(w_with_bar)
        out = []
        cur = n
        while cur > 0:
            prev = dp[cur][1]
            out.append(w_with_bar[prev:cur])
            cur = prev
        out.reverse()
        return out
    def tokenize(self, text: str) -> List[str]:
        toks = []
        for w in re.findall(r"\S+", text):
            toks.extend(self.viterbi_segment("▁"+w))
        return toks
    def detokenize(self, tokens: List[str]) -> str:
        return "".join(tokens).replace("▁"," ").lstrip()
print('Unigram (toy) ready.')

Unigram (toy) ready.
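
To see what "maximize likelihood" means in practice, here is a standalone Viterbi sketch over a hand-written piece table. The probabilities are invented for illustration and are not produced by the trainer above; the function is a simplified variant of `viterbi_segment` with no cap on piece length.

```python
import math

def viterbi_segment(word, probs):
    """Return the piece sequence with the highest total log-probability."""
    n = len(word)
    best = [(-math.inf, -1)] * (n + 1)    # (best score ending here, start of last piece)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("no segmentation with the given pieces")
    pieces, cur = [], n
    while cur > 0:                        # follow back-pointers from the end
        start = best[cur][1]
        pieces.append(word[start:cur])
        cur = start
    return pieces[::-1], best[n][0]

# Invented probabilities (they sum to 1); "\u2581" is the word-boundary marker.
probs = {
    "\u2581un": 0.25, "do": 0.25, "\u2581": 0.10, "undo": 0.10,
    "u": 0.075, "n": 0.075, "d": 0.075, "o": 0.075,
}
pieces, logp = viterbi_segment("\u2581undo", probs)
print(pieces, round(math.exp(logp), 4))
```

The segmentation `▁un` + `do` has probability 0.25 × 0.25 = 0.0625, which beats `▁` + `undo` at 0.10 × 0.10 = 0.01 and every split into single characters, so Viterbi selects it.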


## 6) Byte-level BPE (toy)
Operate directly on UTF-8 bytes (0–255). Guarantees reversibility and no OOV; merges are learned on bytes.

In [9]:
def bytes_from_text(text: str) -> Tuple[int, ...]:
    return tuple(text.encode('utf-8', errors='strict'))
def text_from_bytes(bseq: List[int]) -> str:
    return bytes(bseq).decode('utf-8', errors='strict')
def byte_pairs(seq: Tuple[Any, ...]) -> Counter:
    return Counter(zip(seq, seq[1:]))
def merge_byte_pair(seq: Tuple[Any, ...], pair: Tuple[Any, Any]):
    merged, i = [], 0
    a, b = pair
    while i < len(seq):
        if i < len(seq)-1 and seq[i] == a and seq[i+1] == b:
            merged.append((a, b)); i += 2
        else:
            merged.append(seq[i]); i += 1
    return tuple(merged)

class ByteLevelBPETokenizer:
    def __init__(self, merges: List[Tuple[Any, Any]]):
        self.ranks = {m: i for i, m in enumerate(merges)}
    @staticmethod
    def train(corpus: List[str], num_merges: int = 300) -> "ByteLevelBPETokenizer":
        data = [bytes_from_text(line) for line in corpus]
        merges = []
        for _ in range(num_merges):
            total_pairs = Counter()
            for seq in data:
                total_pairs.update(byte_pairs(seq))
            if not total_pairs: break
            best = total_pairs.most_common(1)[0][0]
            merges.append(best)
            data = [merge_byte_pair(seq, best) for seq in data]
        return ByteLevelBPETokenizer(merges)
    def tokenize(self, text: str) -> List[Any]:
        seq = bytes_from_text(text)
        while True:
            pairs = list(zip(seq, seq[1:]))
            if not pairs: break
            # Pick the adjacent pair with the lowest (earliest-learned) merge rank.
            ranks = [(self.ranks.get(p, 10**9), idx, p) for idx, p in enumerate(pairs)]
            rmin, idx, pair = min(ranks)
            if rmin == 10**9: break
            seq = merge_byte_pair(seq, pair)
        def flatten(sym):
            if isinstance(sym, int): return (sym,)
            out = []
            for s in sym: out.extend(flatten(s))
            return tuple(out)
        tokens = [flatten(s) for s in seq]
        return tokens
    def detokenize(self, tokens: List[Tuple[int, ...]]) -> str:
        flat = []
        for tok in tokens: flat.extend(tok)
        return text_from_bytes(flat)

def pretty_byte_token(tok: Tuple[int, ...]) -> str:
    try:
        s = bytes(tok).decode('utf-8')
        if all(32 <= b < 127 for b in tok):
            return s
        else:
            return repr(s)
    except Exception:
        return "[" + " ".join(f"{b:02x}" for b in tok) + "]"
print('Byte-level BPE (toy) ready.')

Byte-level BPE (toy) ready.
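
Reversibility holds at the level of the whole byte sequence, not per token. The standalone sketch below assumes, purely for illustration, that merges grouped the four bytes of an emoji into two tokens; the helper `decodes_alone` is hypothetical.

```python
text = "\U0001F355"                        # pizza emoji, a single code point
byte_ids = tuple(text.encode("utf-8"))     # four UTF-8 bytes

# Assumed token boundary in the middle of the character.
tokens = [byte_ids[:2], byte_ids[2:]]

def decodes_alone(tok):
    """True if the token's bytes form valid UTF-8 on their own."""
    try:
        bytes(tok).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(byte_ids)
print([decodes_alone(t) for t in tokens])     # neither half is valid by itself
flat = [b for tok in tokens for b in tok]
print(bytes(flat).decode("utf-8") == text)    # concatenation round-trips
```

This is why `pretty_byte_token` falls back to a hex dump for tokens that cannot be decoded individually, while `detokenize` concatenates all bytes before decoding.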


## Train Tokenizers (toy models)
No training needed for word/char; train BPE/WordPiece/Unigram/Byte-BPE on the tiny corpus above.

In [10]:
word_tok = WordTokenizer()
char_tok = CharTokenizer()
bpe_tok = BPETokenizer.train(corpus_train, num_merges=200)
wp_tok = WordPieceTokenizer.train(corpus_train, vocab_size=300)
uni_tok = UnigramTokenizer.train(corpus_train, vocab_size=300)
byte_bpe_tok = ByteLevelBPETokenizer.train(corpus_train, num_merges=300)
print('All tokenizers instantiated.')

All tokenizers instantiated.


## Evaluate and Compare
Tokenize each evaluation sentence with each tokenizer and compute metrics.

In [11]:
def evaluate_tokenizer(name: str, tok_obj, sentences: List[str], is_byte_level=False):
    rows, samples = [], {}
    for s in sentences:
        t0 = time.perf_counter()
        toks = tok_obj.tokenize(s)
        dt = (time.perf_counter() - t0) * 1000.0
        oov = sum(1 for t in toks if isinstance(t, str) and t == "[UNK]") if isinstance(tok_obj, WordPieceTokenizer) else 0
        detok_fn = getattr(tok_obj, "detokenize", None)
        m = metrics_from_tokens(s, toks, detokenize_fn=detok_fn, oov_count=oov)
        m.update({"tokenizer": name, "text": s, "time_ms": round(dt, 2)})
        rows.append(m)
        samples[s] = [pretty_byte_token(t) for t in toks[:30]] if is_byte_level else toks[:30]
    df = pd.DataFrame(rows)[["tokenizer","text","chars","bytes","tokens","chars/token","bytes/token","OOV","detokenizes_exact","time_ms"]]
    return df, samples

dfs, samples_all = [], {}
df_w, samp_w = evaluate_tokenizer("Word", word_tok, sentences_eval)
dfs.append(df_w); samples_all["Word"] = samp_w
df_c, samp_c = evaluate_tokenizer("Char", char_tok, sentences_eval)
dfs.append(df_c); samples_all["Char"] = samp_c
df_bpe, samp_bpe = evaluate_tokenizer("BPE (char-level)", bpe_tok, sentences_eval)
dfs.append(df_bpe); samples_all["BPE"] = samp_bpe
df_wp, samp_wp = evaluate_tokenizer("WordPiece (toy)", wp_tok, sentences_eval)
dfs.append(df_wp); samples_all["WordPiece"] = samp_wp
df_uni, samp_uni = evaluate_tokenizer("Unigram (toy)", uni_tok, sentences_eval)
dfs.append(df_uni); samples_all["Unigram"] = samp_uni
df_bbpe, samp_bbpe = evaluate_tokenizer("Byte-level BPE (toy)", byte_bpe_tok, sentences_eval, is_byte_level=True)
dfs.append(df_bbpe); samples_all["Byte-BPE"] = samp_bbpe

df_all = pd.concat(dfs, ignore_index=True)
display(df_all)

| | tokenizer | text | chars | bytes | tokens | chars/token | bytes/token | OOV | detokenizes_exact | time_ms |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Word | Hello world! | 12 | 12 | 3 | 4.0 | 3.667 | 0 | True | 0.01 |
| 1 | Word | State-of-the-art models work. | 29 | 29 | 10 | 2.9 | 2.7 | 0 | False | 0.01 |
| 2 | Word | I am thrilled to learn gen AI and build my own... | 69 | 69 | 16 | 4.312 | 3.5 | 0 | False | 0.01 |
| 3 | Word | I love 🍕 and λ-calculus in 2025! | 32 | 36 | 10 | 3.2 | 3.0 | 0 | False | 0.01 |
| 4 | Word | नमस्ते दुनिया — Καλημέρα κόσμε — こんにちは世界 | 40 | 95 | 15 | 2.667 | 5.933 | 0 | False | 7.11 |
| 5 | Char | Hello world! | 12 | 12 | 12 | 1.0 | 1.0 | 0 | True | 0.0 |
| 6 | Char | State-of-the-art models work. | 29 | 29 | 29 | 1.0 | 1.0 | 0 | True | 0.0 |
| 7 | Char | I am thrilled to learn gen AI and build my own... | 69 | 69 | 69 | 1.0 | 1.0 | 0 | True | 0.0 |
| 8 | Char | I love 🍕 and λ-calculus in 2025! | 32 | 36 | 32 | 1.0 | 1.125 | 0 | True | 0.0 |
| 9 | Char | नमस्ते दुनिया — Καλημέρα κόσμε — こんにちは世界 | 40 | 95 | 40 | 1.0 | 2.375 | 0 | True | 0.0 |


## Sample Tokenizations for the Fixed Sentence
We print the first ~30 tokens from each tokenizer for:

> `I am thrilled to learn gen AI and build my own applications in @2025*`

In [12]:
print("Sentence:", SENT_FIXED, "\n")
for name, samples in samples_all.items():
    tok_list = samples[SENT_FIXED]
    print(f"[{name}] First ~30 tokens:")
    print(tok_list)
    print()

Sentence: I am thrilled to learn gen AI and build my own applications in @2025* 

[Word] First ~30 tokens:
['I', 'am', 'thrilled', 'to', 'learn', 'gen', 'AI', 'and', 'build', 'my', 'own', 'applications', 'in', '@', '2025', '*']

[Char] First ~30 tokens:
['I', ' ', 'a', 'm', ' ', 't', 'h', 'r', 'i', 'l', 'l', 'e', 'd', ' ', 't', 'o', ' ', 'l', 'e', 'a', 'r', 'n', ' ', 'g', 'e', 'n', ' ', 'A', 'I', ' ']

[BPE] First ~30 tokens:
['I</w>', '<space>', 'am</w>', '<space>', 'thrilled</w>', '<space>', 'to</w>', '<space>', 'learn</w>', '<space>', 'gen</w>', '<space>', 'AI</w>', '<space>', 'and</w>', '<space>', 'build</w>', '<space>', 'my</w>', '<space>', 'own</w>', '<space>', 'applications</w>', '<space>', 'in</w>', '<space>', '@2025*</w>']

[WordPiece] First ~30 tokens:
['I', '<space>', 'am', '<space>', 'thrill', '##ed', '<space>', 'to', '<space>', 'learn', '<space>', 'gen', '<space>', 'AI', '<space>', 'and', '<space>', 'build', '<space>', 'my', '<space>', 'own', '<space>', 'applic', '##ations

## Approximate Vocabulary Sizes
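Rough estimates for the toy implementations. For both BPE variants the estimate is the number of learned merges plus 256 base symbols; this is exact for byte-level BPE and only a stand-in for the character-level variant, whose base alphabet is the set of characters seen in training.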

In [13]:
def est_vocab_size(tok_obj):
    if isinstance(tok_obj, WordPieceTokenizer):
        return len(tok_obj.vocab)
    if isinstance(tok_obj, UnigramTokenizer):
        return len(tok_obj.pieces)
    if isinstance(tok_obj, BPETokenizer):
        return len(tok_obj.ranks) + 256
    if isinstance(tok_obj, ByteLevelBPETokenizer):
        return len(tok_obj.ranks) + 256
    if isinstance(tok_obj, WordTokenizer):
        return np.nan
    if isinstance(tok_obj, CharTokenizer):
        return 1114112  # number of Unicode code points (0x110000); a loose upper bound
    return np.nan

vocab_rows = []
for name, obj in [("Word", word_tok), ("Char", char_tok), ("BPE", bpe_tok),
                  ("WordPiece", wp_tok), ("Unigram", uni_tok), ("Byte-BPE", byte_bpe_tok)]:
    vocab_rows.append({"tokenizer": name, "approx_vocab_size": est_vocab_size(obj)})
df_vocab = pd.DataFrame(vocab_rows)
display(df_vocab)

| | tokenizer | approx_vocab_size |
|---|---|---|
| 0 | Word | NaN |
| 1 | Char | 1114112.0 |
| 2 | BPE | 456.0 |
| 3 | WordPiece | 300.0 |
| 4 | Unigram | 300.0 |
| 5 | Byte-BPE | 529.0 |


