# Exploring the Multilingual Tokenizer 🧐
*(contact: arjo@stanford.edu)*

This notebook walks through some of the core elements of the scalable, linguistically-aware tokenizer. 

## 1. The Building Blocks: `utils.py` and `constants.py`

The tokenizer relies on several helper functions for text pre-processing, Unicode handling, and noise filtering. Let's see them in action.

In [1]:
# Import the modules
import utils
import numpy as np

# Let's start with "protected spans". The tokenizer identifies things like URLs,
# emails, and formatted numbers (e.g. 1,234.56) that should *never* be split.
test_text_1 = "Contact me at sasha@example.com, see [https://example.org](https://example.org), or pay $1,234.56."
protected = utils.find_protected_spans(test_text_1)

print(f"Original text: '{test_text_1}'")
print("Protected Spans (start, end):", protected)
for start, end in protected:
    print(f" -> Found atomic unit: '{test_text_1[start:end]}'")

Original text: 'Contact me at sasha@example.com, see [https://example.org](https://example.org), or pay $1,234.56.'
Protected Spans (start, end): [(14, 31), (38, 80), (89, 97)]
 -> Found atomic unit: 'sasha@example.com'
 -> Found atomic unit: 'https://example.org](https://example.org),'
 -> Found atomic unit: '1,234.56'
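
To make the idea of protected spans concrete, here is a minimal regex-based sketch. It is not the implementation of `utils.find_protected_spans`: the function name `sketch_protected_spans` and all three patterns are assumptions made for illustration. In the output above, the URL span runs through the Markdown link syntax and the trailing comma; the URL pattern in this sketch stops at whitespace, commas, and closing brackets, which is one possible way to get a tighter span.

```python
import re

# Hypothetical patterns for illustration; the real rules in utils.py may differ.
_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),             # email address
    re.compile(r"https?://[^\s,\])]+"),                      # URL, stops at space , ] )
    re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+\.\d+"),    # 1,234.56 or 3.14
]

def sketch_protected_spans(text):
    """Return sorted, non-overlapping (start, end) spans that should stay atomic."""
    spans = sorted(m.span() for pat in _PATTERNS for m in pat.finditer(text))
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

demo = "Mail sasha@example.com, see https://example.org, or pay $1,234.56."
demo_spans = sketch_protected_spans(demo)
demo_pieces = [demo[s:e] for s, e in demo_spans]
print(demo_pieces)
```

Merging overlapping matches keeps the result usable as a simple "do not split inside these ranges" list, whichever pattern happened to match first.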


---
The tokenizer is also aware of **Unicode grapheme clusters**: it does not split a base character from its combining accent, or an emoji from the modifiers and joiners attached to it.

In [2]:
# An emoji ZWJ sequence (👩 + ZWJ + 💻 = woman technologist)
# and an accented character ('é', stored here as a single precomposed code point)
test_text_2 = "👩‍💻 is an émoji."

# The `default_allowed_boundaries` function creates a boolean mask.
# `True` means a split is allowed BEFORE that character index.
boundaries = utils.default_allowed_boundaries(test_text_2)

print(f"Text: {test_text_2}")
print("Allowed Boundaries (|=OK, X=NO):")

# Let's visualize where it avoids splitting
viz = ""
for i, char in enumerate(test_text_2):
    split_marker = "|" if boundaries[i] else "X"
    viz += f"{split_marker}{char}"
viz += "|" if boundaries[len(test_text_2)] else "X"
print(viz)
print("\nNotice it correctly prevents splits (X) inside the emoji and the accented 'é'.")

Text: 👩‍💻 is an émoji.
Allowed Boundaries (|=OK, X=NO):
|👩X‍X💻| |i|s| |a|n| |é|m|o|j|i|.|

Notice it correctly prevents splits (X) inside the emoji and the accented 'é'.
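
A boundary mask like the one above can be approximated with the standard library. The sketch below is an assumption about how such a mask might be built, not the code behind `utils.default_allowed_boundaries`; the name `sketch_allowed_boundaries` is hypothetical, and full grapheme segmentation (Unicode UAX #29) covers more cases than these few rules.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, glues emoji into one visible glyph

def sketch_allowed_boundaries(text):
    """Return len(text) + 1 booleans; mask[i] is True if a split before index i is allowed."""
    mask = [True] * (len(text) + 1)
    for i in range(1, len(text)):
        ch, prev = text[i], text[i - 1]
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")  # combining marks
        is_skin_tone = 0x1F3FB <= ord(ch) <= 0x1F3FF              # emoji modifiers
        is_variation = 0xFE00 <= ord(ch) <= 0xFE0F                # variation selectors
        if is_mark or is_skin_tone or is_variation or ch == ZWJ or prev == ZWJ:
            mask[i] = False
    return mask

# Woman + ZWJ + laptop, a space, then 'e' followed by a combining acute accent.
sample = "\U0001F469\u200d\U0001F4BB e\u0301"
sample_mask = sketch_allowed_boundaries(sample)
print(sample_mask)
```

Unlike the precomposed 'é' in the notebook cell, the sample here uses 'e' plus a separate combining accent, so the mask has an actual split to forbid.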


---
Finally, it includes special post-processing for **CJK (Chinese, Japanese, Korean)** languages, where words are often not space-separated.

In [3]:
# Imagine the tokenizer initially splits a Japanese word into characters.
initial_split = ['これは', '日', '本', '語', 'です', '。']
merged_split = utils.merge_cjk_runs(initial_split)

print(f"Initial (over-eager) split: {initial_split}")
print(f"After CJK merging rule:    {merged_split}")

Initial (over-eager) split: ['これは', '日', '本', '語', 'です', '。']
After CJK merging rule:    ['これは', '日本語です', '。']
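
One rule that reproduces the merge shown above is: when a single-character CJK token appears, start a run and absorb every following token made only of CJK letters, stopping at punctuation. This is a guess chosen to match this one example, not the rule in `utils.merge_cjk_runs`; the helper names and the code point ranges below are assumptions.

```python
def _is_cjk_char(ch):
    cp = ord(ch)
    return (0x3040 <= cp <= 0x30FF      # Hiragana and Katakana
            or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul syllables

def _is_cjk_token(tok):
    return bool(tok) and all(_is_cjk_char(c) for c in tok)

def sketch_merge_cjk_runs(tokens):
    """Merge a run that starts at a single-character CJK token (hypothetical rule)."""
    merged, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if len(tok) == 1 and _is_cjk_token(tok):
            j = i + 1
            # Absorb following CJK-letter tokens; punctuation such as '。' ends the run.
            while j < len(tokens) and _is_cjk_token(tokens[j]):
                j += 1
            merged.append("".join(tokens[i:j]))
            i = j
        else:
            merged.append(tok)
            i += 1
    return merged

cjk_demo = sketch_merge_cjk_runs(['これは', '日', '本', '語', 'です', '。'])
print(cjk_demo)
```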


---

## Try it yourself ⭐
Explore the rest of the functionality in `utils.py`, in particular: `looks_like_redirect`, `clean_junk_runs`, `script_guess`, `_is_mixed_script`, and `ParagraphInfo`.

## 2. Linguistic Intelligence: `linguistic_features.py`

This tokenizer goes beyond statistics by incorporating linguistic knowledge. This is managed by the `LinguisticModels` and `MorphologyEncoder` classes.

**The Morphology Encoder**

This is the most sophisticated component. It learns vector representations of words based on their character patterns (n-grams) and known affixes (prefixes/suffixes). This helps it understand that "running" and "jumping" are similar, even if it has never seen "jumping" before.
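
To see why character n-grams make "running" and "jumping" look alike, consider this minimal sketch. It is not the `MorphologyEncoder` itself, which produces dense `k`-dimensional vectors; it compares raw n-gram counts, and the helper names `char_ngrams` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of a word, with '<' and '>' marking its start and end."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

vec = {w: Counter(char_ngrams(w)) for w in ("running", "jumping", "potato")}
sim_ing = cosine(vec["running"], vec["jumping"])
sim_other = cosine(vec["running"], vec["potato"])
print(f"running vs jumping: {sim_ing:.3f}")
print(f"running vs potato:  {sim_other:.3f}")
```

The two "-ing" words share the n-grams `in`, `ng`, `g>`, `ing`, and `ng>`, while "running" and "potato" share none, so only the first pair gets a non-zero similarity.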

In [4]:
from linguistic_features import MorphologyEncoder
from collections import defaultdict

# --- Create a mini-corpus to train the encoder ---
# A real corpus would have thousands of paragraphs.
mini_corpus_texts = [
    "She is running and jumping.",
    "He likes walking.",
    "They are runners."
]
mini_corpus_langs = ["en", "en", "en"]

# The encoder needs a map of all potential tokens and where they occur.
# We'll simulate this for a few key words.
tok_occurrences = defaultdict(list)
tok_occurrences["running"].append((0, 9))
tok_occurrences["jumping"].append((0, 21))
tok_occurrences["walking"].append((1, 9))
tok_occurrences["runners"].append((2, 12))
tok_occurrences["likes"].append((1, 3))

# A helper function to get the language of a paragraph
def para_lang(idx):
    return mini_corpus_langs[idx]

# --- Train the encoder ---
# We use k=4 for tiny 4-dimensional vectors for this demo.
encoder = MorphologyEncoder(k=4)
encoder.fit(mini_corpus_texts, tok_occurrences, para_lang)

print("✅ Morphology Encoder trained on mini-corpus.")

✅ Morphology Encoder trained on mini-corpus.


Now, let's inspect the results. The encoder has learned a **prototype vector** for English, representing the "average" morphological shape of an English word in our tiny corpus.

In [5]:
# Get the prototype vector for English
proto_en = encoder.lang_proto.get("en")
print(f"Learned 'en' prototype vector (shape {proto_en.shape}):\n{proto_en}\n")

# Let's check the similarity score of our words against this prototype.
# The score is the cosine similarity between the word's vector and the language prototype.
# A higher score means it's a more "typical" word for that language.
for word in ["potato", "running", "walking", "runners"]:
    score = encoder.score(word, "en")
    print(f"Morphological fit score for '{word}': {score:.4f}")

# Notice how words with common English suffixes ('ing', 'ers') score higher than 'potato'.

Learned 'en' prototype vector (shape (4,)):
[ 4.3481302e-01 -7.1796298e-01  5.0539441e-15 -5.4356873e-01]

Morphological fit score for 'potato': 0.0000
Morphological fit score for 'running': 0.8069
Morphological fit score for 'walking': 0.4156
Morphological fit score for 'runners': 0.7271
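
The scoring step can be sketched in a few lines, assuming the prototype is the normalized mean of unit-length word vectors and the score is a cosine similarity. The averaging scheme and the function names here are assumptions; only the cosine-against-prototype idea comes from the cell above.

```python
import math

def unit(v):
    """Scale a vector to length 1; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def prototype(vectors):
    """Normalized mean of the unit-length input vectors."""
    units = [unit(v) for v in vectors]
    dims = len(units[0])
    mean = [sum(u[d] for u in units) / len(units) for d in range(dims)]
    return unit(mean)

def fit_score(vec, proto):
    """Cosine similarity between a word vector and a unit-length prototype."""
    return sum(a * b for a, b in zip(unit(vec), proto))

toy_proto = prototype([[1.0, 0.0], [0.0, 1.0]])
typical = fit_score([2.0, 0.0], toy_proto)
unknown = fit_score([0.0, 0.0], toy_proto)
print(f"typical word: {typical:.4f}, zero vector: {unknown:.4f}")
```

In this sketch a word that maps to the zero vector scores exactly 0. That is one possible reading of the 0.0000 reported for 'potato', though the notebook does not say how the real encoder handles unseen words.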


**The Cost Model**

The `LinguisticModels` class aggregates all these hints into a single **cost**. The tokenizer's goal is to find the segmentation with the minimum total cost. A negative cost is a **reward**.
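
As a rough picture of what "aggregating hints into a cost" can mean, the sketch below sums three terms. Everything in it is an assumption for illustration: the function `sketch_additive_cost`, the `token_class` rules, and the sign convention that lexicon weights are subtracted as rewards. The real `LinguisticModels.additive_cost` may combine its terms differently.

```python
def token_class(tok):
    """Very coarse token class used as the bigram key (hypothetical rules)."""
    if tok[:1].isupper() and tok[1:].islower():
        return "InitCap"
    if tok.islower():
        return "lower"
    return "other"

def sketch_additive_cost(token, prev_class, lexicon, token_bigram,
                         morph_score=0.0, mu_morph=0.25):
    cost = 0.0
    cost -= lexicon.get(token, 0.0)                                   # assumed: lexicon weight is a reward
    cost += token_bigram.get((prev_class, token_class(token)), 0.0)  # class-bigram term
    cost -= mu_morph * morph_score                                    # good morphological fit lowers cost
    return cost

demo_lex = {"New York": 2.0}
demo_tb = {("<BOS>", "InitCap"): -0.2}
demo_cost = sketch_additive_cost("Running", "<BOS>", demo_lex, demo_tb)
print(f"{demo_cost:.4f}")
```

Because the terms are additive, each hint can be tuned on its own weight without touching the others.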

In [6]:
from linguistic_features import LinguisticModels

# Let's define some linguistic "hints"
lex = {"New York": 2.0} # Encourage "New York" to be one token
tb  = {("<BOS>", "InitCap"): -0.2} # Reward sentences that start with a capital letter

# We'll attach our trained morphology encoder
models = LinguisticModels(lexicon=lex, token_bigram=tb)
models.morph_encoder = encoder
models.paragraph_lang = para_lang

# --- Calculate the cost for a token in context ---
# Context: The token "Running" appears at the start of paragraph 0 (which is English).
# The previous token class is "<BOS>" (Begin Of Sentence).
cost = models.additive_cost(token="Running", prev_class="<BOS>", paragraph_idx=0)
print(f"Total additive cost for 'Running': {cost:.4f}\n")

print("Let's break down the cost:")
# 1. Bigram reward: Starts with InitCap after <BOS>
print(f" -> Bigram reward: {tb[('<BOS>', 'InitCap')]:.4f}")
# 2. Morphology term: -mu_morph * score, so a positive score gives a negative cost (a reward)
morph_cost = -models.mu_morph * encoder.score("Running", "en")
print(f" -> Morphology reward (cost): {morph_cost:.4f}")
# 3. Affix term: bias for the "-ing" suffix (a negative value would be a reward)
affix_cost = models._affix_bias("Running", "en")
print(f" -> Affix reward (cost): {affix_cost:.4f}")

Total additive cost for 'Running': -0.2000

Let's break down the cost:
 -> Bigram reward: -0.2000
 -> Morphology reward (cost): -0.0000
 -> Affix reward (cost): 0.0500


---

## 3. The Core Algorithm: `tokenizer.py`

Now we combine everything. The `ScalableTokenizer` class uses a **dynamic programming** algorithm (`_dp_decode`) to find the lowest-cost path through the text.
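
The lowest-cost path idea can be shown with a short dynamic program. This is a generic minimum-cost segmentation sketch, not the body of `_dp_decode`; the name `sketch_dp_decode`, the `cost_of` callback, and the toy costs are assumptions.

```python
import math

def sketch_dp_decode(text, cost_of, max_token_len=14):
    """Split text into tokens with minimum total cost.

    cost_of(piece) returns a float, or None if the piece is not a usable token.
    """
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i] = cheapest cost of segmenting text[:i]
    back = [0] * (n + 1)            # back[i] = start index of the last token in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_token_len), end):
            cost = cost_of(text[start:end])
            if cost is not None and best[start] + cost < best[end]:
                best[end], back[end] = best[start] + cost, start
    tokens, pos = [], n
    while pos > 0:                  # walk the back-pointers from the end of the text
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1], best[n]

toy_vocab = {"token": 1.5, "izer": 1.5}

def toy_cost(piece):
    # Single characters are always allowed at cost 1.0, so some path always exists.
    return toy_vocab.get(piece, 1.0 if len(piece) == 1 else None)

dp_tokens, dp_total = sketch_dp_decode("tokenizer", toy_cost)
print(dp_tokens, dp_total)
```

Capping the inner loop at `max_token_len` keeps the work proportional to text length times the maximum token length.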

**Training: The Iterative Process**

Training is an iterative loop:

1. **Decode:** Segment the entire corpus using the current vocabulary.
2. **Price:** Find new candidate tokens that would lower the total segmentation cost the most.
3. **Update:** Add the best new tokens to the vocabulary.
4. **Repeat:** Continue until no more "good" tokens can be found.
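
The loop above can be sketched with a much simpler pricing rule than the tokenizer uses. Here every token costs 1, and "pricing" just counts adjacent token pairs, a BPE-like stand-in for the reduced-cost (RC) scores in the training logs. All names and defaults in this sketch are assumptions.

```python
import math
from collections import Counter

def decode(text, vocab, max_len=8):
    """Min-cost segmentation where every token costs 1 (fewest tokens wins)."""
    n = len(text)
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            usable = len(piece) == 1 or piece in vocab
            if usable and best[start] + 1.0 < best[end]:
                best[end], back[end] = best[start] + 1.0, start
    tokens, pos = [], n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]

def sketch_train(corpus, vocab_budget=10, min_freq=2, max_len=8):
    vocab = set()
    while len(vocab) < vocab_budget:
        # 1. Decode: segment the corpus with the current vocabulary.
        segmented = [decode(text, vocab, max_len) for text in corpus]
        # 2. Price (stand-in): merging an adjacent pair saves roughly one token per use.
        savings = Counter()
        for tokens in segmented:
            for left, right in zip(tokens, tokens[1:]):
                if len(left + right) <= max_len:
                    savings[left + right] += 1
        if not savings:
            break
        candidate, freq = savings.most_common(1)[0]
        if freq < min_freq:
            break  # 4. Stop: no candidate is frequent enough.
        # 3. Update: add the best candidate to the vocabulary.
        vocab.add(candidate)
    return vocab

learned = sketch_train(["abab", "abab"])
print(sorted(learned))
```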

In [7]:
from tokenizer import ScalableTokenizer

# Let's use a slightly larger, multilingual sample corpus for training.
corpus_texts = [
    "This tokenizer is learning.",
    "Good tokenizers are useful.",
    "Das ist ein deutscher Satz.",
    "Die Tokenisierung wird gelernt."
]
corpus_langs = ["en", "en", "de", "de"]

# Initialize the tokenizer.
# For this demo, we'll use a very low min_freq and a small vocab_budget.
tokenizer = ScalableTokenizer(
    min_freq=1,
    top_k_add=1,
    vocab_budget=20 # Target ~20 multi-character tokens
)

# Cap training at 50 iterations to keep the demo short.
# In a real scenario, this would run for hundreds of iterations.
tokenizer.train(corpus_texts, corpus_langs, max_iterations=50)
print(tokenizer.vocab)

Step 1: Performing initial corpus analysis...
Analysis complete in 0.01s. Found 1029 potential tokens; seed vocab chars = 25.

Step 2: Starting training with batch pricing...
Iter 01: Added 1 tokens: ' tokenizer' (best summed RC=-69.7859)
Iter 02: Added 1 tokens: ' tokenize' (best summed RC=-64.1177)
Iter 03: Added 1 tokens: ' tokeniz' (best summed RC=-58.5165)
Iter 04: Added 1 tokens: 'deutscher Satz' (best summed RC=-50.8783)
Iter 05: Added 1 tokens: ' deutscher Sat' (best summed RC=-49.7266)
Iter 06: Added 1 tokens: 'n deutscher Sa' (best summed RC=-49.3962)
Iter 07: Added 1 tokens: 'in deutscher S' (best summed RC=-48.5323)
Iter 08: Added 1 tokens: 'okeni' (best summed RC=-48.4530)
Iter 09: Added 1 tokens: 'rung wird gele' (best summed RC=-48.3802)
Iter 10: Added 1 tokens: 'erung wird gel' (best summed RC=-48.2789)
Iter 11: Added 1 tokens: 'sierung wird g' (best summed RC=-47.9548)
Iter 12: Added 1 tokens: 'st ein deutsch' (best summed RC=-47.5049)
Iter 13: Added 1 tokens: 'ist ein

In [8]:
from tokenizer import ScalableTokenizer
from linguistic_features import LinguisticModels

# We'll use the same multilingual sample corpus.
corpus_texts = [
    "This tokenizer is learning.",
    "Good tokenizers are useful.",
    "Das ist ein deutscher Satz.",
    "Die Tokenisierung wird gelernt."
]
corpus_langs = ["en", "en", "de", "de"]

# --- 1. Initialize the Tokenizer ---
tokenizer = ScalableTokenizer(
    max_token_len=14,
    min_freq=1,
    top_k_add=5, # Let's add more tokens per iteration
    vocab_budget=30
)

# --- 2. ⭐ PRIME THE LINGUISTIC MODELS (The Missing Step) ---
print("Priming the tokenizer with linguistic hints...")

# Create some simple "hints" to guide the tokenizer
lexicon = {"tokenizer": 5.0, "tokenizers": 5.0, "Tokenisierung": 10.0}
token_bigrams = {
    ("<BOS>", "InitCap"): -0.3, # Reward sentences starting with a capital word
    ("lower", "EOS"): -0.2,     # Reward sentences ending in a lowercase word (before punct)
}

# Now, apply these models and tune their weights
# This is the crucial step that connects the brain to the engine.
tokenizer.set_feature_models(
    lexicon=lexicon,
    token_bigram=token_bigrams,
    mu_morph=0.25,      # Weight for morphology scores
)

# --- 3. Train the Tokenizer ---
# Now that it has linguistic guidance, lexicon entries such as 'tokenizer' should be favored.
# A real scenario would use max_iterations=300 or more.
print("\nStarting the training process...")
tokenizer.train(corpus_texts, corpus_langs, max_iterations=30)

# --- 4. Inspect the Results ---
print("\n--- Learned Vocabulary ---")
# Filter for multi-character tokens to see the good stuff
meaningful_tokens = [tok for tok in tokenizer.vocab if len(tok) > 1]
print(sorted(meaningful_tokens))
for corpus_text, corpus_lang in zip(corpus_texts, corpus_langs):
    tokens = tokenizer.tokenize(corpus_text, corpus_lang)
    print(tokens)

Priming the tokenizer with linguistic hints...

Starting the training process...
Step 1: Performing initial corpus analysis...
Analysis complete in 0.02s. Found 1029 potential tokens; seed vocab chars = 25.

Step 2: Starting training with batch pricing...
Iter 01: Added 5 tokens: 'tokenizer', ' tokenizer', ' tokenize', ' tokeniz', 'tokenize' (best summed RC=-74.9256)
Iter 02: Added 5 tokens: 'Tokenisierung', 'deutscher Satz', 'eutscher Satz.', ' deutscher Sat', 'n deutscher Sa' (best summed RC=-54.8142)
Iter 03: Added 5 tokens: 'in deutscher S', 'okeni', ' tokeni', ' wird gelernt.', 'st ein deutsch' (best summed RC=-48.5437)
Iter 04: Added 5 tokens: 'ist ein deutsc', ' ist ein deuts', ' wird gelernt', 's are useful.', 'st ein deutsc' (best summed RC=-46.3124)
Iter 05: Added 5 tokens: ' is learning.', ' ist ein deut', ' wird gelern', 's are useful', 'is learning.' (best summed RC=-43.0362)
Iter 06: Added 5 tokens: ' token', ' is learning', 'Das ist ein ', ' ist ein deu', ' wird geler' (