<a href="https://colab.research.google.com/github/k2-fsa/colab/blob/master/sherpa-onnx/itn_zh_number.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

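This Colab notebook demonstrates how to generate a rule FST that converts Chinese numerals to Arabic numerals.
You can deploy the generated FST with [kaldifst](https://github.com/k2-fsa/kaldifst), which provides APIs for C++, Python, and other languages.

To use the generated FST in speech recognition, see [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx).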



# Install pynini

In [1]:
%%shell

pip install --only-binary :all: pynini

Collecting pynini
  Downloading pynini-2.1.6-cp310-cp310-manylinux_2_28_x86_64.whl (154.5 MB)
Installing collected packages: pynini
Successfully installed pynini-2.1.6




# Generate rule fst

In [2]:
import pynini
from pynini.lib import utf8
from pynini.lib.pynutil import add_weight, delete, insert


sigma = utf8.VALID_UTF8_CHAR.star


zero_map = [
    ("零", "0")
]
one_to_nine_map = [
    ("一", "1"),
    ("二", "2"),
    ("三", "3"),
    ("四", "4"),
    ("五", "5"),
    ("六", "6"),
    ("七", "7"),
    ("八", "8"),
    ("九", "9"),
]

digit_map = zero_map + one_to_nine_map

zero = pynini.string_map(zero_map)
one_to_nine = pynini.string_map(one_to_nine_map).optimize()

# Ones (个)
digit = pynini.string_map(digit_map).optimize()

# Tens (十): 十五 -> 15, 二十 -> 20
ten1 = pynini.cross('十', '1') + (digit | insert('0'))
ten2 = digit + delete('十') + (digit | insert('0'))
ten = ten1 | ten2

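# Hundreds (百)
# 一百 -> 100: the branch that appends "00" gets a positive weight -> low priority,
# so 一百零一 -> 101 and 一百一十 -> 110 are preferred when more input follows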
hundred = one_to_nine + delete('百') + (add_weight(insert('0')**2, 1) | (zero + one_to_nine) | ten)

# Thousands (千)
thousand = one_to_nine + delete('千') + (
    add_weight(insert('0')**3, 10)
    | (insert('0') + zero + one_to_nine)
    | (zero + ten)
    | hundred
)

# Ten-thousands (万)
wan = (one_to_nine | ten | hundred | thousand) + delete('万') + (
    add_weight(insert('0')**4, 10)
    | (insert('0')**2 + zero + one_to_nine)
    | (insert('0') + zero + ten)
    | add_weight(zero + hundred, -2)
    | thousand
)



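# Lower total weight means higher priority for shortestpath(), so analyses
# that cover larger units (万, 千) are preferred over smaller ones.
number = (
    add_weight(digit, 100)
    | add_weight(ten, 90)
    | add_weight(hundred, 80)
    | add_weight(thousand, -1)
    | add_weight(wan, -100)
)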
number = number.optimize()

rule = pynini.cdrewrite(number, "", "", sigma)

for d in ['零', '一', '十五', '十八岁', '十', '十九', '一十九', '九十九', '八十八', '六十', '一十',
          '九十', '一百', '一百零一', '一百一十', '一百一十一', '九百九十九', '九百', '九百零九',
          '一千', '一千零一', '一千零一十一', '九千零九', '九千零九十九',
          '九千九百', '九千九百零九', '九千九百九十九',
          '一万', '一万零一', '一万零九十九', '一万零九百九十九', '一万九千九百九十九',
          '十一万', '十万', '十万零一', '十万零十一', '一十万零十一', '一十万零一百', '十万一千',
          '一万一千', '九十九万一千', '九十九万一千零一', '九十九万一千零一十一',
          '一百万', '一百万零一', '一百万零一十', '一百万零一十一', '一百万零一百', '一百万零一百零一',
          '一百万零一百一十一','一百万一千', '一百万一千零一', '一百万一千零十一', '一百万一千一百',
          '一千万', '一千万零一', '一千万零一十一'
          ]:
  r = pynini.compose(d, rule)
  s = pynini.shortestpath(r, nshortest=1).paths()
  print(d, list(s.ostrings())[:3])

rule.write('itn_zh_number.fst')

! ls -lh itn_zh_number.fst

零 ['0']
一 ['1']
十五 ['15']
十八岁 ['18岁']
十 ['10']
十九 ['19']
一十九 ['19']
九十九 ['99']
八十八 ['88']
六十 ['60']
一十 ['10']
九十 ['90']
一百 ['100']
一百零一 ['101']
一百一十 ['110']
一百一十一 ['111']
九百九十九 ['999']
九百 ['900']
九百零九 ['909']
一千 ['1000']
一千零一 ['1001']
一千零一十一 ['1011']
九千零九 ['9009']
九千零九十九 ['9099']
九千九百 ['9900']
九千九百零九 ['9909']
九千九百九十九 ['9999']
一万 ['10000']
一万零一 ['10001']
一万零九十九 ['10099']
一万零九百九十九 ['10999']
一万九千九百九十九 ['19999']
十一万 ['110000']
十万 ['100000']
十万零一 ['100001']
十万零十一 ['100011']
一十万零十一 ['100011']
一十万零一百 ['100100']
十万一千 ['101000']
一万一千 ['11000']
九十九万一千 ['991000']
九十九万一千零一 ['991001']
九十九万一千零一十一 ['991011']
一百万 ['1000000']
一百万零一 ['1000001']
一百万零一十 ['1000010']
一百万零一十一 ['1000011']
一百万零一百 ['1000100']
一百万零一百零一 ['1000101']
一百万零一百一十一 ['1000111']
一百万一千 ['1001000']
一百万一千零一 ['1001001']
一百万一千零十一 ['1001011']
一百万一千一百 ['1001100']
一千万 ['10000000']
一千万零一 ['10000001']
一千万零一十一 ['10000011']
-rw-r--r-- 1 root root 26K Jun 17 03:46 itn_zh_number.fst
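
The weights in `number` decide which analysis `pynini.shortestpath` returns: weights add up along a path, and the path with the lowest total wins. The sketch below is a toy model of that selection in plain Python, not the FST itself. The three-entry `LEXICON` and the helpers `candidates` and `best` are hypothetical names introduced for illustration; only the weights 90 (`ten`) and 100 (`digit`) are taken from the cell above.

```python
# Toy lexicon: (input piece, output digits, weight).
# Assumption: weights mirror add_weight(ten, 90) and add_weight(digit, 100).
LEXICON = [
    ("十", "10", 90.0),    # ten with an inserted trailing 0
    ("十五", "15", 90.0),  # ten followed by a digit
    ("五", "5", 100.0),    # a single digit
]


def candidates(text):
    """Yield (output, total_weight) for every way to tile `text` with lexicon pieces."""
    if not text:
        yield "", 0.0
        return
    for piece, out, weight in LEXICON:
        if text.startswith(piece):
            for rest_out, rest_weight in candidates(text[len(piece):]):
                yield out + rest_out, weight + rest_weight


def best(text):
    """Return the candidate with the lowest total weight."""
    return min(candidates(text), key=lambda c: c[1])


print(sorted(candidates("十五"), key=lambda c: c[1]))
print(best("十五"))  # ('15', 90.0)
```

In this toy model, splitting 十五 into 十 and 五 would produce 105 at a total weight of 190, so the single analysis at weight 90 is preferred, which agrees with the `15` printed above.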


In [3]:

from google.colab import files
files.download('itn_zh_number.fst')


# Usage with kaldifst

In [4]:
%%shell

pip install kaldifst

Collecting kaldifst
  Downloading kaldifst-1.7.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Installing collected packages: kaldifst
Successfully installed kaldifst-1.7.11




In [1]:
import kaldifst

InverseTextNormalizer = kaldifst.TextNormalizer

rule = "./itn_zh_number.fst"
normalizer = InverseTextNormalizer(rule)
text = "一百二十三是多少"
out = normalizer(text)
print(out)

123是多少
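
To cross-check results such as the one above without pynini or kaldifst, a plain-Python reference converter can help. The following is a minimal sketch, assuming well-formed numerals below 10^8 with at most one 万; `section_to_int`, `zh_to_int`, and `normalize` are hypothetical helpers, not part of any library, and they do not reproduce every edge case of the FST.

```python
import re

DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}
NUMERAL_RUN = re.compile("[零一二三四五六七八九十百千万]+")


def section_to_int(s):
    """Convert a numeral below 10000 (no 万), e.g. 一千零一十一 -> 1011."""
    total, current = 0, 0
    for ch in s:
        if ch in DIGITS:
            current = DIGITS[ch]
        else:
            # A unit with no preceding digit counts once: 十五 -> 15.
            total += (current or 1) * UNITS[ch]
            current = 0
    return total + current


def zh_to_int(s):
    """Convert a numeral with at most one 万, e.g. 十万零一 -> 100001."""
    head, sep, tail = s.partition("万")
    if not sep:
        return section_to_int(head)
    return section_to_int(head) * 10000 + section_to_int(tail)


def normalize(text):
    """Replace each run of numeral characters with its Arabic form."""
    return NUMERAL_RUN.sub(lambda m: str(zh_to_int(m.group(0))), text)


print(normalize("一百二十三是多少"))  # 123是多少
print(zh_to_int("一千万零一十一"))    # 10000011
```

Running `zh_to_int` over the test inputs from the generation cell gives a quick way to spot regressions after changing a weight or a branch in the grammar.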
