# Exploring Tokenizers 🧐
*(contact: arjo@stanford.edu)*

This notebook walks through what a tokenizer is, which state-of-the-art tokenizers are in use today, and where they tend to struggle.

## 1. What is a tokenizer? 🤔
Tokenization is the fundamental first step in nearly all Natural Language Processing (NLP) tasks. It's the process of breaking down a continuous stream of text into smaller, meaningful units called tokens. Think of it as digitally dicing a sentence into its core ingredients. These tokens could be words, characters, or parts of words (subwords).

For example, the sentence:
```"NLP is fascinating!"```

Might be tokenized into:
```["NLP", "is", "fascinating", "!"]```

These tokens are then converted into numerical representations that machine learning models, like Large Language Models (LLMs), can understand and process.
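
As a minimal sketch of that last step, the mapping from tokens to numbers can be pictured as a dictionary lookup. The vocabulary and its `<unk>` entry below are made up for illustration; real tokenizers learn their vocabulary from data.

```python
# Toy vocabulary: each known token gets an integer ID (the values are arbitrary).
vocab = {"<unk>": 0, "NLP": 1, "is": 2, "fascinating": 3, "!": 4}

def encode(tokens, vocab):
    """Map each token to its ID, falling back to <unk> for unknown tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

tokens = ["NLP", "is", "fascinating", "!"]
ids = encode(tokens, vocab)
print(ids)                               # [1, 2, 3, 4]
print(encode(["NLP", "rocks"], vocab))   # [1, 0]  ("rocks" is not in the vocabulary)
```

The `<unk>` fallback is exactly what subword tokenizers are designed to avoid: an unknown word collapses to a single uninformative ID.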

---

### 1.1 A Quick Primer on UTF-8 🔡
To understand how modern tokenizers work (and where they fail), it's essential to know how text is represented digitally. Computers don't see characters like 'A', 'é', or '😂'; they see bytes.

**Unicode** is the universal standard that assigns a unique number, called a **code point**, to every character in every language. For example, the code point for the emoji '😂' is `U+1F602`.

**UTF-8** is the dominant **encoding** that translates these Unicode code points into a sequence of bytes. Its most important feature is that it's a **variable-width encoding**. This means different characters take up a different number of bytes:
* **1 byte:** For all standard English characters and symbols, i.e. ASCII (`A-Z`, `0-9`, `!`, etc.).
* **2 bytes:** For many accented letters and symbols from other alphabets (e.g., `é`, `ñ`).
* **3 bytes:** For most common Chinese, Japanese, and Korean (CJK) characters (e.g., `日`, `本`).
* **4 bytes:** For emojis and less common characters (e.g., `😂`, `🧠`).

The crucial takeaway is that a single character a human sees can be composed of multiple bytes. This has profound implications for tokenization algorithms that operate at the byte level, because a multi-byte symbol can be chopped into multiple tokens. The issue is most pronounced for non-English languages, whose characters usually need two or more bytes each.

#### A Hands-On Look at UTF-8 Bytes

As we discussed, UTF-8 is a variable-width encoding. Let's see what that actually means by encoding a few characters and looking at their raw bytes.

In [None]:
# A standard ASCII character (1 byte)
char_1 = 'H'
bytes_1 = char_1.encode('utf-8')
print(f"'{char_1}' -> {bytes_1}  (Length: {len(bytes_1)} byte)")

# An accented Latin character (2 bytes)
char_2 = 'é'
bytes_2 = char_2.encode('utf-8')
print(f"'{char_2}' -> {bytes_2}  (Length: {len(bytes_2)} bytes)")

# A Japanese character (3 bytes)
char_3 = '友'
bytes_3 = char_3.encode('utf-8')
print(f"'{char_3}' -> {bytes_3}  (Length: {len(bytes_3)} bytes)")

# An emoji (4 bytes)
char_4 = '😊'
bytes_4 = char_4.encode('utf-8')
print(f"'{char_4}' -> {bytes_4}  (Length: {len(bytes_4)} bytes)")

As you can see, the number of bytes needed to represent a single character varies. Now, let's see what happens when we combine two characters. The bytes are simply concatenated.

In [None]:
text = "H😊"
text_bytes = text.encode('utf-8')
print(f"'{text}' -> {text_bytes}")
print(f"Individual bytes: {bytes_1} + {bytes_4}")

This is the key insight: to the computer, `"H😊"` is just the sequence of five bytes `b'H\xf0\x9f\x98\x8a'`. A simple byte-level algorithm has no inherent knowledge that the last four bytes represent a single smiley face.
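
To make this concrete, here is a small sketch of what happens when a byte sequence is cut in the middle of a character. The cut position is chosen by hand for illustration, and the emoji is written as the escape `\U0001F60A` so the snippet does not depend on the file's encoding.

```python
text = "H\U0001F60A"              # 'H' followed by the smiling-face emoji
data = text.encode("utf-8")       # 5 bytes: 1 for 'H', 4 for the emoji

# Cut after the third byte, i.e. in the middle of the emoji.
left, right = data[:3], data[3:]

# Neither half is valid UTF-8 on its own, so strict decoding fails for both.
validity = []
for piece in (left, right):
    try:
        piece.decode("utf-8")
        validity.append(True)
    except UnicodeDecodeError:
        validity.append(False)
    print(piece, "valid UTF-8?", validity[-1])

# Concatenating the halves restores the original text.
restored = (left + right).decode("utf-8")
print(restored == text)  # True
```

A tokenizer that emits byte-level tokens can produce exactly this kind of split, which is why individual tokens sometimes cannot be decoded back to text by themselves.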

---

### 1.2 Modern Tokenization Techniques
While early methods simply split text by spaces and punctuation, this approach is too rigid. It can't handle punctuation within words (like `O'Malley`), hyphenated terms (`state-of-the-art`), or languages without clear word boundaries (like Japanese or Chinese).
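
The rigidity is easy to see with a naive splitter. The sketch below uses a simple regular expression (one of many possible rule-based splitters, not any particular library's tokenizer), and the Japanese sentence is an illustrative example meaning "I am a student".

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("O'Malley builds state-of-the-art models."))
# ['O', "'", 'Malley', 'builds', 'state', '-', 'of', '-', 'the', '-', 'art', 'models', '.']

# Without spaces there are no boundaries to find: the whole sentence is one "word".
print(naive_tokenize("私は学生です"))
# ['私は学生です']
```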

Today, the most successful and widely used techniques are based on subword tokenization. These methods break words into smaller, frequently occurring pieces. This approach cleverly balances the need for a manageable vocabulary size with the ability to represent any word, including rare, misspelled, or new ones.

The dominant subword algorithms include:
* **[Byte-Pair Encoding (BPE)](https://arxiv.org/abs/1508.07909):** This is a data-driven algorithm that starts with a vocabulary of individual characters. It iteratively counts the most frequent pair of adjacent tokens and merges them into a new, single token. This process is repeated for a set number of merges, resulting in a vocabulary of common subwords. For example, it might learn to merge `e` and `r` into `er`, then `er` and `s` into `ers`.
    * **Where it fails:** Because BPE often operates at the **byte level**, it has no inherent understanding of character boundaries. An emoji like `😂` is made of four bytes. If the last byte of `😂` and the first byte of the next character happen to be a frequent pair in the training data, BPE will happily merge them. This creates nonsensical "Frankenstein tokens" that **split a single character across two different tokens**, destroying its meaning. This is a common problem for emojis, CJK characters, and accented letters, leading to a less efficient and less meaningful vocabulary.
* **[WordPiece](https://huggingface.co/learn/llm-course/en/chapter6/5):** Used by Google's BERT model, WordPiece is very similar to BPE. However, instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data if it were added to the vocabulary. It essentially asks, "Which merge gives us the most bang for our buck in terms of explaining the text?"
* **[Unigram Language Model](https://huggingface.co/learn/llm-course/en/chapter6/7):** This approach, used by models like T5 and ALBERT, takes a different route. It starts with a very large set of possible subwords and iteratively removes the ones that contribute least to the overall probability of the corpus, gradually shrinking the vocabulary to the desired size. A key feature is that it's probabilistic, meaning a single word can have multiple valid tokenizations, adding flexibility.
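
The BPE training loop described above can be sketched in a few lines. This is a toy, character-level version on a made-up word-frequency table, not the byte-level implementation used by production tokenizers; the function names are illustrative.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules, starting from individual characters."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = {"lower": 5, "newer": 6, "wider": 3, "low": 2}   # hypothetical counts
merges, words = train_bpe(corpus, num_merges=3)
print(merges)  # [('e', 'r'), ('w', 'er'), ('l', 'o')]
```

Note that each merge is permanent: once `e` and `r` are fused, later iterations can only build on `er`, never undo it.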

BPE appears to be the favorite of many recent systems. Examples include [DeepSeek-v3](https://arxiv.org/pdf/2412.19437), which uses a modified BPE implementation, and Cohere Labs' [1M tokenizer](https://arxiv.org/pdf/2506.10766).

---

### 1.3 Where Tokenization Falters
Despite their sophistication, modern tokenizers have significant blind spots, especially when dealing with the complexity of global human language.
#### The Multilingual Challenge
The biggest failure point is handling multilingual text and code-switching (mixing languages in one sentence). Most large models are trained with a single, unified vocabulary. While massive, this vocabulary is inevitably biased towards the dominant language in the training data (usually English).

When a tokenizer built on an English-heavy vocabulary encounters a word from another language, it often fails to find known subwords. The only option is to fall back to the most basic units: individual characters or bytes.

Consider the German word `Lebensabschnittspartner` (a partner for a phase of your life).

* An ideal, German-aware tokenizer might split it meaningfully: `["Lebens", "abschnitts", "partner"]`.
* A typical English-centric tokenizer, however, might produce something nonsensical like: `["Leb", "ens", "abschnitt", "sp", "artner"]` or even worse, break it into individual letters if the subwords aren't in its vocabulary.

This "character-level" degradation destroys the word's semantic meaning before the model even sees it, making it incredibly difficult for the model to understand the text's intent.
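
The fallback behavior can be imitated with a toy greedy longest-match segmenter (the strategy WordPiece-style tokenizers use at inference time, swapped in here for simplicity instead of BPE merges). Both vocabularies are tiny, made-up stand-ins for a German-aware and an English-centric vocabulary.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first and shrink until something is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])   # character-level fallback
            i += 1
    return tokens

german_vocab = {"Lebens", "abschnitts", "partner"}     # hypothetical
english_vocab = {"partner", "life", "ab", "ens"}       # hypothetical

word = "Lebensabschnittspartner"
german_tokens = greedy_tokenize(word, german_vocab)
english_tokens = greedy_tokenize(word, english_vocab)
print(german_tokens)                  # ['Lebens', 'abschnitts', 'partner']
print(english_tokens)
print(len(german_tokens), "vs", len(english_tokens), "tokens")   # 3 vs 14 tokens
```

The same word costs far more tokens under the mismatched vocabulary, and most of those tokens are single letters that carry little meaning.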

#### Other Key Failure Points
* **Morphologically Rich Languages:** Languages like Turkish, Finnish, or Hungarian attach long strings of suffixes to a root word to convey meaning. Subword tokenizers can struggle to consistently identify the root word, often splitting these complex words in arbitrary ways.
* **Domain-Specific Jargon:** A tokenizer trained on general web text will perform poorly on specialized documents, like medical research papers or legal contracts. It will break down crucial terms like **immunosuppressant** or **mandamus** into less meaningful fragments.
* **Inconsistency:** Because the tokenization process is greedy and deterministic, a tiny change in a word or in the characters around it can lead to a completely different tokenization, making the model brittle. For example, **(hello)** and **\[hello\]** might be tokenized differently in a way that loses the core meaning of **hello**.

#### BPE in Action: The `gpt2` Tokenizer and Frankenstein tokens
Now, let's see how a classic byte-level BPE tokenizer, like the one used for `gpt2`, handles this. We'll load it directly from Hugging Face.

In [None]:
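# Install the Hugging Face transformers library, preferably inside its own environment.
# Run the two conda commands in a terminal before launching the notebook
# (activating an environment from a notebook cell does not affect the running kernel):
#   conda create -n myenv python
#   conda activate myenv
# Then uncomment the line below:
#!pip install transformers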

In [None]:
from transformers import AutoTokenizer

# Load the gpt2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "H😊"
tokens = tokenizer.tokenize(text)

print(f"Original text: '{text}'")
print(f"Tokenized output: {tokens}")

Notice the result: `['H', 'ðŁĺ', 'Ĭ']`.
The emoji `😊` was not kept as a single unit. It was split into two bizarre-looking tokens: `'ðŁĺ'` and `'Ĭ'`.

#### Why Did This Happen?

This is the "Frankenstein token" problem in action. The `gpt2` tokenizer's learned vocabulary doesn't contain a single token for `😊`. It can only represent the emoji through tokens that stand for pieces of its UTF-8 byte sequence. Let's prove it by inspecting the raw bytes of those strange tokens.

In [None]:
from transformers import AutoTokenizer
# We need this utility to manually create the byte-to-char mapping
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "😊"

# --- Step 1: Manually create the byte decoder ---
# The bytes_to_unicode() function gives us the mapping from an integer (0-255)
# to its string representation (e.g., 'ð'). We need to reverse it.
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

# --- Step 2: Get the Token IDs ---
token_ids = tokenizer.encode(text)
print(f"Original text: '{text}'")
print(f"Token IDs: {token_ids}\n")

# --- Step 3: Create a map from IDs back to their token string representations ---
id_to_token_string = {v: k for k, v in tokenizer.get_vocab().items()}

# --- Step 4: Convert the weird token strings to their raw bytes ---
print("Inspecting the bytes of each token ID:")
all_reconstructed_bytes = b''
for i, token_id in enumerate(token_ids):
    # Find the string label for the ID (e.g., 'ðŁĺ')
    token_string = id_to_token_string[token_id]
    
    # Use our manually created byte_decoder to get the raw byte values (integers)
    byte_values = [byte_decoder[char] for char in token_string]
    
    # Convert the integer values into a Python bytes object
    token_as_bytes = bytes(byte_values)
    
    print(f"  - ID {token_id} -> '{token_string}' -> represents bytes: {token_as_bytes}")
    all_reconstructed_bytes += token_as_bytes

# --- Step 5: Compare with the original ---
original_bytes = text.encode('utf-8')
print(f"\nReconstructed bytes: {all_reconstructed_bytes}")
print(f"Original emoji bytes:  {original_bytes}")

if all_reconstructed_bytes == original_bytes:
    print("\n✅ Success! The bytes from the broken tokens perfectly match the original.")

This demonstrates the "Frankenstein token" problem:
* The Problem: The `gpt2` tokenizer did not have a single token ID in its vocabulary for `😊`.
* The Fallback: Its fallback plan was to represent the emoji through tokens built from its UTF-8 bytes. The two IDs printed above are `gpt2`'s internal representation of the byte sequence `b'\xf0\x9f\x98\x8a'`: the first token covers the three bytes `b'\xf0\x9f\x98'` and the second covers the final byte `b'\x8a'`.
* The Result: Each of those IDs decodes to a strange-looking string fragment (`'ðŁĺ'` and `'Ĭ'`). These are just human-readable labels for the raw bytes. When we see `['H', 'ðŁĺ', 'Ĭ']` as the tokenization of `"H😊"`, we are seeing a single character (`😊`) being split across two different tokens, neither of which is valid UTF-8 on its own.

---

## Q: For you to try ⭐
Find modern tokenizers on HuggingFace. Find examples of "Other Key Failure Points".
* **Exemplify:** Try with a couple of hardcoded words
* **Quantify:** Download some multilingual data (hint: see `data.py`) and quantify how well your tokenizer of choice works compared to English.
* **Discuss:** What failure modes do you find? How do you think that might impact language understanding for computers? (e.g. LLMs)
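
One possible starting point for the "Quantify" step is the average number of tokens per word, often called fertility. The helper below is a sketch under simplifying assumptions: words are split on whitespace (which does not work for languages written without spaces), and a byte-per-token stand-in replaces a real tokenizer so the snippet runs offline. The sample sentences are made up.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words if n_words else float("nan")

def byte_tokenize(text):
    """Stand-in tokenizer for demonstration: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

english = ["the cat sat"]
german = ["die Katze saß"]
print(round(fertility(english, byte_tokenize), 2))  # 3.67
print(round(fertility(german, byte_tokenize), 2))   # 4.67
```

To measure a real tokenizer, pass its `tokenize` method (for example `tokenizer.tokenize` from the cells above) in place of `byte_tokenize`.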

## 2. Beyond BPE: A Smarter, Language-Aware Tokenizer 🧠
Standard tokenizers like BPE are powerful, but they're also **language-blind**. They treat text as a stream of data to be compressed, relying purely on character frequency. This works reasonably well for a single language like English, but when faced with multilingual text, they often resort to butchering unfamiliar words into meaningless letters.

This project introduces a different philosophy: a **linguistically-aware tokenizer** that acts less like a data compressor and more like a digital linguist. It learns from raw text, but its decisions are guided by pre-coded knowledge about the structure and patterns of human language. By combining data-driven statistics with linguistic rules, it learns a more robust and meaningful vocabulary, especially for multilingual corpora.

It frames tokenization not as a greedy merging task, but as an **optimization problem**: What is the best possible way to segment this sentence to minimize a total "cost"? This cost is a sophisticated blend of statistical likelihood and linguistic plausibility.
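
To illustrate the optimization view, here is a minimal sketch of minimum-cost segmentation solved with a Viterbi-style dynamic program. The cost table is hypothetical and stands in for the blended statistical and linguistic cost; this is one standard way to solve such a shortest-path problem, not necessarily how the project implements it.

```python
import math

def best_segmentation(text, cost):
    """Find the segmentation of `text` that minimizes the summed token cost."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = cheapest cost to segment text[:i]
    back = [0] * (n + 1)          # back[i] = start index of the last token in that solution
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in cost and best[start] + cost[piece] < best[end]:
                best[end] = best[start] + cost[piece]
                back[end] = start
    if math.isinf(best[n]):
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

# Hypothetical per-token costs (lower is better).
cost = {"un": 2.0, "unhappi": 6.0, "happi": 3.0, "happiness": 5.5,
        "ness": 2.0, "u": 4.0, "n": 4.0}
tokens, total = best_segmentation("unhappiness", cost)
print(tokens, total)  # ['un', 'happi', 'ness'] 7.0
```

A greedy longest-match pass over the same table would commit to `unhappi` first and end up with a total cost of 8.0, which is the kind of locally reasonable but globally worse choice the optimization framing is meant to avoid.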

---

#### What Makes It Different?

This tokenizer's methodology deviates significantly from traditional approaches. It incorporates several key features that allow it to understand text at a deeper, more structural level.

* **Morphological Awareness:** Instead of just seeing frequent letters, this tokenizer learns the "shape" of words. It understands that `run` + `-ing` is a common English pattern and `yap` + `-mak` is a common Turkish one.
    * **Methodology:** It uses a `MorphologyEncoder` to create vector representations of words based on their character n-grams and known affixes (e.g., `-ed`, `re-`, `-ung`). This allows it to calculate a "morphological fit" score, rewarding tokens that look structurally correct for a given language. BPE/WordPiece have no concept of morphology.
* **Cross-Lingual Grammatical Links:** The model can recognize that the `-s` in English "cats" and the `-ler` in Turkish "kediler" serve the same grammatical purpose (pluralization).
    * **Methodology:** It uses a predefined map (`CROSS_EQUIV`) of grammatically equivalent suffixes across languages. This encourages the model to learn and reward tokens that exhibit these fundamental cross-lingual patterns, a feature entirely absent in standard tokenizers.
* **Global Optimization over Greedy Merging:** Finding the best segmentation is treated as a "shortest path problem," not a series of one-off greedy decisions.
    * **Methodology:** The core of the training algorithm is **column generation**, an optimization technique. In each iteration, it decodes the entire corpus with its current vocabulary. Then, it searches for new tokens ("columns") that will provide the greatest overall reduction in the segmentation cost. This holistic approach avoids the irreversible, sometimes sub-optimal, merges that BPE makes.
* **Explicit Linguistic Priors:** The tokenizer is bootstrapped with a set of explicit rules and heuristics that guide the learning process from the start.
    * **Methodology:** It uses a flexible `LinguisticModels` system that provides direct rewards or penalties. This includes protecting atomic units like **URLs and emails** from being split, rewarding common token sequences (e.g., a capitalized word following a period), and penalizing nonsensical ones. This injects common-sense knowledge directly into the cost function.
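
As a rough illustration of the "atomic unit" idea in the last bullet, the sketch below marks URLs and email addresses before any further segmentation. The names `ATOMIC` and `split_protected` and the simplified patterns are assumptions made for this example; they are not the project's `LinguisticModels` API.

```python
import re

# Simplified, illustrative patterns; real URL and email grammars are more involved.
ATOMIC = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.-]+")

def split_protected(text):
    """Return (span, is_atomic) pieces; atomic spans must not be split further."""
    pieces, pos = [], 0
    for match in ATOMIC.finditer(text):
        if match.start() > pos:
            pieces.append((text[pos:match.start()], False))
        pieces.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces

pieces = split_protected("Mail me@example.com or see https://example.com/a now")
for span, is_atomic in pieces:
    print(repr(span), "atomic" if is_atomic else "free")
```

A downstream segmenter would then treat each atomic span as a single token and only search for splits inside the free spans.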