## Cantonese TN (tn.cantonese) — Quick examples

- Ensure your venv is active and `opencc-python-reimplemented` is installed (used to convert Simplified→Traditional where needed).
- Run the Python cells below to instantiate the Normalizer and try sample phrases.

In [2]:
# (Optional) install OpenCC if not already available in the notebook environment
# !pip install --user opencc-python-reimplemented

from tn.cantonese.normalizer import Normalizer

# Create a Normalizer instance
# - overwrite_cache=True will rebuild FSTs (useful the first time or after changes)
# - simple_to_traditional=True enables automatic Simplified->Traditional conversion (OpenCC s2hk)
# - tag_oov=True will tag OOVs which helps debug unknown characters
normalizer = Normalizer(overwrite_cache=True, simple_to_traditional=True, tag_oov=True)

# Quick sanity check
print(normalizer.normalize("尾号1702"))          # -> 尾號一七零二
print(normalizer.normalize("这儿有只鸟儿"))    # -> 這有隻鳥 (after S->T)
print(normalizer.normalize("苹果宣布发布新ＩＰＨＯＮＥ"))  # -> 蘋果宣佈發佈新IPHONE
print(normalizer.normalize("价格是HKD13.5"))     # -> 價格是十三個半
print(normalizer.normalize("价格是HK$13.5"))     # -> 價格是HK十三個半 (the "HK" prefix is left in place)
print(normalizer.normalize("价格是$13.5"))     # -> 價格是十三個半

2026-01-18 12:08:59,411 WETEXT INFO building fst for yue_normalizer ...
2026-01-18 12:09:06,139 WETEXT INFO done
2026-01-18 12:09:06,140 WETEXT INFO fst path: /home/joseph/projects/WeTextProcessing/tn/yue_tn_tagger.fst
2026-01-18 12:09:06,141 WETEXT INFO           /home/joseph/projects/WeTextProcessing/tn/yue_tn_verbalizer.fst


尾號一七零二
這有隻鳥
蘋果宣佈發佈新IPHONE
價格是十三個半
價格是HK十三個半
價格是十三個半


In [None]:
# Explicit Cantonese example: assert expected Cantonese output and print confirmation
assert normalizer.normalize("价格是HKD13.5") == "價格是十三個半"
print("Cantonese example OK:", normalizer.normalize("价格是HKD13.5"))
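
The single assert above can be extended into a small table-driven check that reports every mismatch instead of stopping at the first one. This is a minimal sketch: `check_cases` and `CASES` are hypothetical names introduced for illustration and are not part of `tn.cantonese`. The helper only assumes a callable that takes a string and returns a string, such as `normalizer.normalize`. The expected values are copied from the outputs shown in this notebook.

```python
from typing import Callable, Iterable, List, Tuple


def check_cases(
    normalize: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case whose output differs."""
    mismatches = []
    for text, expected in cases:
        actual = normalize(text)
        if actual != expected:
            mismatches.append((text, expected, actual))
    return mismatches


# Hypothetical case table; expected strings come from the notebook outputs.
CASES = [
    ("价格是HKD13.5", "價格是十三個半"),
    ("价格是$13.5", "價格是十三個半"),
    ("尾号1702", "尾號一七零二"),
]

# In the notebook, pass the real method:
#   mismatches = check_cases(normalizer.normalize, CASES)
#   assert not mismatches, mismatches
```

Collecting mismatches in a list makes it easier to see at a glance which phrases changed after a cache rebuild or a data-file edit.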

In [2]:
# Batch normalization example
inputs = [
    "1:02:36am",
    "8:00 a.m.准时开会",
    "可以拨打12306来咨询",
    "重达25kg",
]
for s in inputs:
    print(s, "=>", normalizer.normalize(s))

1:02:36am => 上午一點零二分三十六秒
8:00 a.m.准时开会 => 上午八點準時開會
可以拨打12306来咨询 => 可以撥打一二三零六來諮詢
重达25kg => 重達二十五千克


In [3]:
# Use a rule-level Processor directly (e.g., cardinal)
from tn.cantonese.rules.cardinal import Cardinal
card = Cardinal()  # rule-level classes have simple->traditional enabled where appropriate

print(card.normalize("尾号为2349"))   # -> 尾號為二三四九  (depends on OpenCC mapping)
print(card.normalize("127.0.0.1"))    # -> 一二七點零點零點一

尾號為二三四九
一二七點零點零點一


Notes & tips:
- If you see unexpected OOV tags or variant characters, re-run once with `overwrite_cache=True` to rebuild the FSTs. For deterministic conversions, add explicit phrase mappings (e.g., "发布" -> "發佈") to `tn/cantonese/data/char/simple_to_traditional.tsv`.
- To run the unit tests from the notebook terminal: `./.venv/bin/python -m pytest -q tn/cantonese/test` ✅
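
To illustrate why an explicit phrase mapping gives deterministic results, the sketch below applies phrase-level replacements with greedy longest-match before any per-character conversion would run. It is only an illustration under stated assumptions: the helper names `load_phrase_map` and `apply_phrase_map` are hypothetical, the two-column tab-separated layout is assumed rather than taken from the actual `simple_to_traditional.tsv`, and this is not how the library itself consumes that file.

```python
import csv
import io
from typing import Dict


def load_phrase_map(tsv_text: str) -> Dict[str, str]:
    """Parse 'source<TAB>target' rows (assumed layout); skip blanks and # comments."""
    mapping = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2 and not row[0].startswith("#"):
            mapping[row[0]] = row[1]
    return mapping


def apply_phrase_map(text: str, mapping: Dict[str, str]) -> str:
    """Replace mapped phrases left to right, preferring the longest match."""
    if not mapping:
        return text
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so longer phrases win over shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += n
                break
        else:
            # No phrase starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical sample rows, using the mapping mentioned in the tip above.
sample_tsv = "发布\t發佈\n宣布\t宣佈\n"
phrase_map = load_phrase_map(sample_tsv)
print(apply_phrase_map("苹果宣布发布新手机", phrase_map))
```

Only the mapped phrases change here; the remaining Simplified characters are left for the general conversion step.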