# Table of Contents


1. **Core Components & Heuristics**
   - Regexes & Constants
   - Data Cleaning & Filtering
   - Unicode & Script Handling

2. **Class Deep Dive**
   - `MorphologyEncoder`: The Linguistic Model
   - `LinguisticModels`: The Feature Engineering Hub
   - `ParagraphInfo`: The Efficiency Cache
   - `ScalableTokenizer`: The Main Orchestrator

3. **How to Tune and Use**
   - The Training Workflow
   - Key Hyperparameter Tuning Guide

---
## 1. Core Components & Heuristics
---
These functions and constants handle data preprocessing, cleaning, and Unicode normalization.

### Regexes & Constants (`URL_RE`, `EMAIL_RE`, `NUM_RE`)
- **What they do:** These regexes identify common patterns that should *never* be split. They define the "protected spans" of text.
- **How they work:** The `find_protected_spans` function uses these to find all occurrences and merges any overlapping matches. During DP decoding, these spans are treated as unbreakable, "atomic" tokens.
- **How to tune:** You can add new regexes to this section to protect other domain-specific patterns (e.g., phone numbers, chemical formulas).
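
The span-finding and merging step described above can be sketched as follows. The three patterns here are simplified stand-ins, not the tokenizer's actual `URL_RE`, `EMAIL_RE`, and `NUM_RE`, and the function body is an assumption about how `find_protected_spans` might merge overlaps.

```python
import re

# Illustrative patterns only; the real regexes in the tokenizer may differ.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def find_protected_spans(text, patterns=(URL_RE, EMAIL_RE, NUM_RE)):
    """Return sorted, non-overlapping (start, end) spans that must stay whole."""
    spans = sorted(m.span() for p in patterns for m in p.finditer(text))
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

text = "Mail alice@example.com or see https://example.org/docs for 12,345.67 dollars"
protected = [text[s:e] for s, e in find_protected_spans(text)]
print(protected)
```

To protect another pattern, such as phone numbers, one would add a compiled regex to the `patterns` tuple.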

### Data Cleaning & Filtering
- **`looks_like_redirect`:** Filters out boilerplate Wikipedia redirect pages. Crucial for cleaning web-scraped data. It handles multiple languages and even spaced-out text like "R E D I R E C T".
- **`WIKI_NOISE_RE`, `QUOTE_RUN_EDGE_RE`:** Filter out common Markdown and wiki syntax noise (e.g., `''`, `==`, `**`).
- **`clean_junk_runs`:** A post-processing step that collapses repetitive junk tokens found during decoding (e.g., `['.', '.']` -> `['.']`).
- **`merge_cjk_runs`:** A post-processing step that merges adjacent single- or two-character CJK (Chinese/Japanese/Korean) tokens into more meaningful words. This matters most for scripts such as Chinese and Japanese, where words are not separated by spaces.
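
A minimal sketch of the two post-processing steps, under stated assumptions: the junk set, the code point ranges used to detect CJK, and the `max_piece` parameter are all illustrative choices, and the real functions may cap the merged length or use different rules.

```python
def clean_junk_runs(tokens, junk=frozenset({".", ",", "'", '"', "-", "=", "*"})):
    """Collapse consecutive repeats of the same junk token into a single token."""
    out = []
    for tok in tokens:
        if out and tok in junk and out[-1] == tok:
            continue  # drop the repeated junk token
        out.append(tok)
    return out

def is_cjk(ch):
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul syllables

def merge_cjk_runs(tokens, max_piece=2):
    """Merge adjacent CJK tokens that are each at most `max_piece` characters long."""
    out = []
    prev_short = False
    for tok in tokens:
        short = 0 < len(tok) <= max_piece and all(is_cjk(c) for c in tok)
        if short and prev_short:
            out[-1] += tok  # glue onto the previous short CJK piece
        else:
            out.append(tok)
        prev_short = short
    return out

print(clean_junk_runs(["a", ".", ".", ".", "b", "b"]))
print(merge_cjk_runs(["東", "京", "went", "大", "学"]))
```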

### Unicode & Script Handling
- **`is_mark`:** Checks if a character is a non-spacing mark (e.g., an accent).
- **`default_allowed_boundaries`:** Prevents splits within complex graphemes (like emojis with skin tones) by disallowing splits next to characters like Zero-Width Joiners (ZWJ).
- **`script_guess`:** Provides a quick guess of the primary script/language of a token.
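
The boundary rules above can be illustrated with the standard `unicodedata` module. This is a sketch, not the tokenizer's `default_allowed_boundaries`: the function name `allowed_boundaries`, the choice to treat all `M*` categories as marks, and the explicit skin-tone range are assumptions.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me).
    The original may check only the non-spacing category Mn."""
    return unicodedata.category(ch).startswith("M")

def allowed_boundaries(text):
    """Return the indices i where a split between text[i-1] and text[i] is allowed."""
    allowed = set()
    for i in range(1, len(text)):
        prev, nxt = text[i - 1], text[i]
        if is_mark(nxt):
            continue  # never detach a combining mark from its base character
        if prev == ZWJ or nxt == ZWJ:
            continue  # keep ZWJ sequences such as composite emoji together
        if 0x1F3FB <= ord(nxt) <= 0x1F3FF:
            continue  # emoji skin-tone modifiers attach to the preceding emoji
        allowed.add(i)
    return allowed

print(allowed_boundaries("e\u0301a"))             # 'e' + combining acute + 'a'
print(allowed_boundaries("\U0001F469\u200d\U0001F4BBx"))  # woman + ZWJ + laptop + 'x'
```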

---
## 2. Class Deep Dive
---

### `MorphologyEncoder`
- **Purpose:** This class scores how "well-formed" a token is for a given language. It helps the tokenizer learn meaningful subwords (like "running", "establishment") instead of nonsensical fragments.

- **How it works:**
  1. **Featurization (`_featurize`):** For each token, it extracts character n-grams (e.g., "ing", "run", "nn") and affix features (e.g., starts with "re-", ends with "-tion").
  2. **Embedding (`fit`):** It builds a large matrix of tokens vs. features, computes the Positive Pointwise Mutual Information (PPMI), and then uses SVD (a dimensionality reduction technique) to create a dense vector embedding for every potential token. It also computes a "prototype" vector for each language.
  3. **Scoring (`score`):** The final score for a (token, language) pair is the cosine similarity between the token's vector and the language's prototype vector. A high score means the token's features are highly characteristic of that language.

- **How to tune:**
  - **`AFFIXES` / `CROSS_EQUIV`:** You can add prefixes, suffixes, and cross-lingual morphological equivalents for new languages to improve performance.
  - **`k`:** The dimensionality of the token embeddings. Higher `k` can capture more nuance but is computationally more expensive.
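
The featurize, PPMI, and prototype-scoring pipeline can be sketched in pure Python. This is a simplified illustration: it deliberately omits the SVD step (so cosine similarity is computed on sparse PPMI rows rather than on `k`-dimensional embeddings), it uses boundary-padded n-grams in place of the explicit `AFFIXES` features, and all function names and the toy word lists are hypothetical.

```python
import math
from collections import Counter

def featurize(token, n_values=(2, 3)):
    """Character n-grams over a boundary-padded token ('^' start, '$' end)."""
    padded = f"^{token}$"
    feats = Counter()
    for n in n_values:
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats

def ppmi_rows(token_feats):
    """Turn raw token-by-feature counts into sparse PPMI rows."""
    total = sum(sum(f.values()) for f in token_feats.values())
    feat_tot = Counter()
    for f in token_feats.values():
        feat_tot.update(f)
    rows = {}
    for tok, f in token_feats.items():
        tok_tot = sum(f.values())
        row = {}
        for feat, c in f.items():
            pmi = math.log(c * total / (tok_tot * feat_tot[feat]))
            if pmi > 0:  # keep only positive PMI
                row[feat] = pmi
        rows[tok] = row
    return rows

def prototype(rows, tokens):
    """Mean PPMI vector of the tokens that belong to one language."""
    proto = {}
    for t in tokens:
        for k, w in rows[t].items():
            proto[k] = proto.get(k, 0.0) + w / len(tokens)
    return proto

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

by_lang = {"en": ["running", "walking", "talking"],
           "tr": ["evler", "kitaplar", "arabalar"]}
held_out = "singing"  # scored against both prototypes, part of neither
all_tokens = [t for toks in by_lang.values() for t in toks] + [held_out]
rows = ppmi_rows({t: featurize(t) for t in all_tokens})
protos = {lang: prototype(rows, toks) for lang, toks in by_lang.items()}
scores = {lang: cosine(rows[held_out], p) for lang, p in protos.items()}
print(scores)
```

In this toy setup the held-out token shares its `-ing` n-grams with the English prototype and nothing with the Turkish one, so the English score comes out higher.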

### `LinguisticModels`
- **Purpose:** This class bundles all the non-statistical, feature-based costs that are applied during DP decoding. It's where you inject domain knowledge.

- **How it works:** The `additive_cost` method calculates a cost based on a variety of features. This cost is *added* to the base statistical cost of a token.

- **How to tune:**
  - **`lexicon`, `mwe`, `ne_gaz`:** Provide your own dictionaries. Add domain-specific terms, multi-word expressions ("New York"), or named entities to encourage the tokenizer to keep them whole.
  - **`token_bigram`:** This models the cost of transitioning between *classes* of tokens (e.g., from an "InitCap" word to a "lower" case word). You can add rules here to encourage or penalize certain grammatical patterns.
  - **`gamma_boundary` (float):** **Key parameter.** The penalty for changing token classes. A higher `gamma` encourages longer runs of the same token type (e.g., `['San', 'Jose']` instead of `['San', 'J', 'ose']`).
  - **`mu_morph` (float):** **Key parameter.** The weight of the `MorphologyEncoder` score. A higher `mu` makes the tokenizer prioritize morphologically sound tokens.
  - **`rho_group` (float):** **Key parameter.** A bias that makes tokens with common affixes slightly "cheaper," encouraging the model to learn them.
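
How these knobs might combine into one additive cost is sketched below. The sign conventions (lexicon weights and morphology scores lower the cost, `gamma_boundary` raises it on a class change), the `token_class` rules, and the `morph_score` callback are assumptions for illustration; `rho_group`, `mwe`, and `ne_gaz` are left out.

```python
def token_class(tok):
    """Very rough token classes, matching the labels used in `token_bigram`."""
    if tok.isdigit():
        return "NUM"
    if tok[:1].isupper() and tok[1:].islower():
        return "InitCap"
    if tok.islower():
        return "lower"
    return "other"

def additive_cost(prev_tok, tok, *, lexicon, token_bigram, morph_score,
                  gamma_boundary=0.06, mu_morph=0.25):
    """Feature-based cost that is added to the statistical base cost of `tok`."""
    prev_cls = "<BOS>" if prev_tok is None else token_class(prev_tok)
    cls = token_class(tok)
    cost = 0.0
    cost -= lexicon.get(tok, 0.0)                   # lexicon entries become cheaper
    cost += token_bigram.get((prev_cls, cls), 0.0)  # class-transition cost
    if prev_tok is not None and prev_cls != cls:
        cost += gamma_boundary                      # penalty for switching class
    cost -= mu_morph * morph_score(tok)             # reward well-formed tokens
    return cost

tb = {("InitCap", "InitCap"): -0.3}
no_morph = lambda tok: 0.0
same_class = additive_cost("San", "Jose", lexicon={}, token_bigram=tb, morph_score=no_morph)
class_switch = additive_cost("San", "ose", lexicon={}, token_bigram=tb, morph_score=no_morph)
print(same_class, class_switch)
```

With these toy settings, continuing an `InitCap` run is cheaper than switching to a lowercase fragment, which is the behavior a higher `gamma_boundary` is meant to encourage.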

### `ScalableTokenizer`
- **Purpose:** This is the main class that manages the entire training pipeline, from initial data analysis to final vocabulary pruning and tokenization.

- **Core Training Loop (`train`):**
  1. **`_initialize_stats_and_vocab`:** Scans the entire corpus to find all possible substrings (up to `max_token_len`) and calculates their initial statistical scores (`_nll`, `_pmi_pen`).
  2. **Iterative Token Addition:** The loop begins. In each iteration:
     a. **`_dp_decode`:** The entire corpus is segmented using the current vocabulary.
     b. **`_find_best_new_tokens_batch`:** It calculates the "reduced cost" for all *potential* tokens not yet in the vocabulary. A negative reduced cost means adding that token would improve the overall segmentation quality.
     c. **Add Tokens:** The `top_k_add` best tokens are added to the vocabulary.
  3. **Loop Termination:** The loop stops when no more tokens have a negative reduced cost or `max_iterations` is reached.
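
The loop can be sketched with a toy cost table. This is a heavily simplified stand-in, not the actual `train` method: here the "reduced cost" of a candidate is taken to be its own cost minus the cost of the cheapest current segmentation of the same string, weighted by how often it occurs. The helper names, the fixed `char_cost`, and the candidate costs are all hypothetical.

```python
def dp_cost(text, cost):
    """Cost of the cheapest segmentation of `text` using tokens in `cost`."""
    inf = float("inf")
    best = [0.0] + [inf] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in cost and best[start] + cost[piece] < best[end]:
                best[end] = best[start] + cost[piece]
    return best[-1]

def train(corpus, candidates, char_cost=1.0, top_k_add=1, max_iterations=10):
    # Start from single characters so that every text is decodable.
    vocab = {ch: char_cost for text in corpus for ch in text}
    for _ in range(max_iterations):
        reduced = {}
        for tok, c in candidates.items():
            if tok in vocab:
                continue
            occurrences = sum(text.count(tok) for text in corpus)
            if occurrences == 0:
                continue
            # Candidate cost minus what the same span costs with today's vocabulary.
            rc = (c - dp_cost(tok, vocab)) * occurrences
            if rc < 0:
                reduced[tok] = rc
        if not reduced:
            break  # no candidate has a negative reduced cost
        for tok in sorted(reduced, key=reduced.get)[:top_k_add]:
            vocab[tok] = candidates[tok]
    return vocab

corpus = ["running runner", "run"]
candidates = {"run": 1.5, "ning": 5.0, "runn": 2.0}
vocab = train(corpus, candidates)
print(sorted(t for t in vocab if len(t) > 1))
```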

- **Vocabulary Budgeting (`_enforce_vocab_budget_bisection`):**
  - **Purpose:** This is the key step to precisely hit the `vocab_budget`.
  - **How it works:** It performs a bisection search to find the Lagrange multiplier (`_lambda_global`) that meets the budget. This `λ` is a global cost added to every multi-character token. A higher `λ` makes multi-character tokens more "expensive," so the DP decoder uses fewer unique multi-character types, which reduces the vocabulary size.
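
The bisection idea can be sketched as follows. The `vocab_size_at` function is a toy stand-in for "decode the corpus with penalty `λ` and count the multi-character types in use"; the per-token `gain` values and both function names are hypothetical. The only property the search relies on is that the vocabulary size does not grow as `λ` increases.

```python
def vocab_size_at(lam, token_gain):
    """Toy model: a token stays in use while its gain exceeds the global penalty."""
    return sum(1 for g in token_gain.values() if g > lam)

def bisect_lambda(token_gain, budget, iters=50):
    """Find (approximately) the smallest lambda whose vocabulary fits the budget."""
    lo, hi = 0.0, max(token_gain.values())
    if vocab_size_at(lo, token_gain) <= budget:
        return lo  # already within budget, no penalty needed
    for _ in range(iters):
        mid = (lo + hi) / 2
        if vocab_size_at(mid, token_gain) > budget:
            lo = mid  # still too many types: raise the penalty
        else:
            hi = mid  # within budget: try a smaller penalty
    return hi

gains = {"ing": 3.0, "tion": 2.5, "re": 1.0, "qu": 0.4, "zz": 0.1}
lam = bisect_lambda(gains, budget=2)
print(lam, vocab_size_at(lam, gains))
```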

- **Tokenization (`tokenize`):**
  - **Purpose:** The final inference method to tokenize new text.
  - **How it works:** It runs the same `_dp_decode` algorithm on the input text using the final, trained vocabulary and cost models.
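
A minimal Viterbi-style decoder shows the shape of this step. It is a sketch under assumptions: the real `_dp_decode` also applies protected spans, allowed boundaries, and the additive linguistic costs, none of which appear here, and the `unk_cost` fallback is a hypothetical addition that keeps unknown characters decodable.

```python
def dp_decode(text, cost, max_token_len=12, unk_cost=10.0):
    """Segment `text` into the token sequence with the lowest total cost."""
    inf = float("inf")
    n = len(text)
    best = [0.0] + [inf] * n   # best[i]: cheapest cost of segmenting text[:i]
    back = [0] * (n + 1)       # back[i]: start index of the last token in that segmentation
    for end in range(1, n + 1):
        for start in range(max(0, end - max_token_len), end):
            piece = text[start:end]
            if piece in cost:
                c = cost[piece]
            elif end - start == 1:
                c = unk_cost   # unknown single characters stay decodable
            else:
                continue
            if best[start] + c < best[end]:
                best[end] = best[start] + c
                back[end] = start
    tokens, end = [], n
    while end > 0:             # follow back-pointers from the end of the text
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]

cost = {"run": 1.0, "ning": 1.2, "running": 3.0,
        "r": 1.0, "u": 1.0, "n": 1.0, "i": 1.0, "g": 1.0}
print(dp_decode("running", cost))
```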

---
## 3. How to Tune and Use
---

### The Training Workflow
1.  **Prepare Data:** Provide a list of strings (`paragraphs_texts`) and corresponding language codes (`paragraphs_langs`).
2.  **Set Linguistic Features (Optional):** Instantiate the tokenizer and call `set_feature_models()` with your custom lexicons, bigram costs, and weights (`mu_morph`, `gamma_boundary`, etc.).
3.  **Train:** Call the `tokenizer.train(...)` method.
4.  **Tokenize:** Use the trained `tokenizer.tokenize(text)` method.

### Key Hyperparameter Tuning Guide
- **To control the statistical base cost:**
  - **`alpha`:** Weight for Negative Log-Likelihood (frequency). Higher `alpha` favors more frequent tokens.
  - **`beta`:** Weight for PMI-like cohesion score. Higher `beta` favors tokens that are more cohesive than their characters would suggest (e.g., "ing" is cohesive, "qjx" is not).
  - **`tau`:** Per-character length penalty. Higher `tau` favors shorter tokens.

- **To control the linguistic intelligence:**
  - **`mu_morph`:** How much to trust the morphology model. Increase this if you get lots of nonsensical subwords.
  - **`gamma_boundary`:** How much to penalize switching token types. Increase this if you want to keep runs of capitalized words or numbers together.

- **To control the vocabulary:**
  - **`min_freq`:** The frequency gate for a substring to even be considered. Higher values lead to a smaller initial candidate pool and faster training.
  - **`vocab_budget`:** The final target size for your multi-character vocabulary.
  - **`top_k_add`:** The number of new tokens added per training iteration. A smaller value leads to slower but potentially more stable convergence.
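
One way to picture how `alpha`, `beta`, and `tau` interact is a base cost of the form `alpha * nll + beta * pmi_pen + tau * len(token)`. The sketch below assumes that form and a simple character-independence baseline for the PMI-like term; the tokenizer's actual `_nll` and `_pmi_pen` definitions may differ, and the counts and probabilities are made up for illustration.

```python
import math

def base_cost(token, token_count, total_tokens, char_prob, *,
              alpha=1.0, beta=0.5, tau=0.01):
    """Statistical base cost: frequency term + cohesion term + length penalty."""
    p_tok = token_count / total_tokens
    nll = -math.log(p_tok)
    # Cohesion: how much more likely the token is than its characters taken independently.
    pmi = math.log(p_tok) - sum(math.log(char_prob[c]) for c in token)
    pmi_pen = -pmi  # cohesive tokens get a negative penalty, i.e. a discount
    return alpha * nll + beta * pmi_pen + tau * len(token)

char_prob = {"i": 0.1, "n": 0.1, "g": 0.05, "q": 0.01, "j": 0.01, "x": 0.01}
frequent = base_cost("ing", 500, 10_000, char_prob)
rare = base_cost("qjx", 1, 10_000, char_prob)
print(frequent, rare)
```

Raising `tau` in this sketch increases the cost of every token in proportion to its length, which is why higher values favor shorter tokens.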

In [1]:
#!pip install -r requirements.txt

In [2]:
def main():
    lang_codes = {
        'en': 'English', 'da': 'Danish',
        #'de': 'German', 'fr': 'French',
        #'tr': 'Turkish', 'ru': 'Russian', 'ja': 'Japanese',
        #'ar': 'Arabic', 'ta': 'Tamil', 'xh': 'Xhosa',
        #'zu': 'Zulu', 'tk': 'Turkmen'
    }

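    # load_wikiann_corpus and ScalableTokenizer are not defined in this cell;
    # they are assumed to come from earlier cells of the notebook.
    corpus_texts, corpus_langs = load_wikiann_corpus(lang_codes, per_lang=700)
    if not corpus_texts:
        return  # nothing was loaded, so there is nothing to train on

    # Candidate substrings up to 12 characters, seen at least 7 times;
    # add 8 tokens per iteration and target 500 multi-character tokens.
    tokenizer = ScalableTokenizer(
        max_token_len=12, min_freq=7, top_k_add=8, vocab_budget=500
    )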

    # Morphology-aware + TV knobs
    lex = {"New York": 2.0, "San Jose": 1.0, "’s": 0.5, "'s": 0.5}
    ne  = {"LOC": {"New York", "Berlin", "東京"}}
    tb  = {
        ("<BOS>", "InitCap"): -0.2,
        ("InitCap", "InitCap"): -0.3,
        ("NUM", "NUM"): -0.15,
    }
    tokenizer.set_feature_models(
        lexicon=lex,
        ne_gaz=ne,
        token_bigram=tb,
        gamma_boundary=0.06,
        mu_morph=0.25,
        rho_group=0.06
    )

    tokenizer.train(corpus_texts, corpus_langs, max_iterations=300)

    print("\n--- Tokenization Examples ---")
    tests = [
        ("This is a final test of the representations.", "en"),
        ("Die endgültige Prüfung der Darstellungen.", "de"),
        ("Temsilleriň soňky synagy.", "tk"),
        ("表現の最終テストです。", "ja"),
        ("Email me at alice@example.com or visit https://example.org/docs.", "en"),
        ("The price was 12,345.67 dollars on 2024-09-04.", "en"),
        ("#REDIRECT United States", "en"),
        ("# Weiterleitung Berlin", "de"),
        ("# Yönlendirme Türkiye", "tr"),
    ]
    for sentence, lang in tests:
        tokens = tokenizer.tokenize(sentence, lang=lang)
        print(f"   '{sentence}'\n   -> {tokens}\n")

if __name__ == "__main__":
    main()

Loading corpus from Hugging Face datasets hub...
-> Loading 'English' (en)...
-> Loading 'German' (de)...
-> Loading 'French' (fr)...
-> Loading 'Turkish' (tr)...
-> Loading 'Russian' (ru)...
-> Loading 'Japanese' (ja)...
-> Loading 'Arabic' (ar)...
-> Loading 'Tamil' (ta)...
-> Loading 'Xhosa' (xh)...
Could not load data for xh: BuilderConfig 'xh' not found. Available: ['ace', 'af', 'als', 'am', 'an', 'ang', 'ar', 'arc', 'arz', 'as', 'ast', 'ay', 'az', 'ba', 'bar', 'bat-smg', 'be', 'be-x-old', 'bg', 'bh', 'bn', 'bo', 'br', 'bs', 'ca', 'cbk-zam', 'cdo', 'ce', 'ceb', 'ckb', 'co', 'crh', 'cs', 'csb', 'cv', 'cy', 'da', 'de', 'diq', 'dv', 'el', 'eml', 'en', 'eo', 'es', 'et', 'eu', 'ext', 'fa', 'fi', 'fiu-vro', 'fo', 'fr', 'frr', 'fur', 'fy', 'ga', 'gan', 'gd', 'gl', 'gn', 'gu', 'hak', 'he', 'hi', 'hr', 'hsb', 'hu', 'hy', 'ia', 'id', 'ig', 'ilo', 'io', 'is', 'it', 'ja', 'jbo', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ksh', 'ku', 'ky', 'la', 'lb', 'li', 'lij', 'lmo', 'ln', 'lt', 'lv', 'map-bms',

KeyboardInterrupt: 