This repository provides a sample toolkit for evaluating Japanese automatic speech recognition (ASR). Its focus is orthographic variants (表記ゆれ); by handling them, the performance of ASR models can be measured more accurately.
These examples and scripts can be used for academic research and education. Please use them carefully, as some bugs and errors may remain.
- Support text normalization for orthographic variants (表記ゆれ) using a solid dictionary, UniDic (with Fugashi, an interface of MeCab)
- Provide sample code for CER evaluation of several datasets, mainly ones available on the Web
OS: Ubuntu 22.04 or 24.04
Python: 3.10 or later
Python libraries
- pyyaml
- fugashi[unidic]
Other toolkits and software: these need to be installed with the sudo apt install command.
- sctk (SCTK)
- ffmpeg (for reading audio files in ASR processing)
The required Python libraries differ for each ASR model:
- espnet
- torchaudio
- torchcodec
- transformers
- soxr
- espnet_model_zoo
- numpy
- sentencepiece
- nue-asr
- Cython
- openai-whisper
- funasr
The reference text and the ASR result text usually suffer from orthographic variants (表記ゆれ, different spellings of the same word), which can prevent us from evaluating ASR performance correctly.
- Japanese writing system: Hiragana, Katakana, Kanji, Latin alphabet, symbols, numbers
- Reference (corpus/test set transcriptions)
  - strict rules and checks are required when humans transcribe a speech dataset
  - such rules also differ among corpora
- Hypothesis (ASR result)
  - recent end-to-end models are trained on various corpora and text datasets, resulting in inconsistent representations of words
  - word representations sometimes change depending on the context
For example, the CER of the following recognition result is poor even though the meaning of the sentence is almost the same.
Ref: 足立さん身長百八十五センチメートルなんだ物凄くおっきいね
Hyp: 安達さん身長185cmなんだものすごく大きいね
Even humans cannot determine whether '足立' or '安達' is correct from the audio signal alone.
The actual CER alignment for the example above looks like this:
Scores: (#C #S #D #I) 11 10 7 2
REF: 足 立 さ ん 身 長 百 八 十 五 セ ン チ メ ー ト ル な ん だ ** ** 物 凄 く お っ き い ね
HYP: 安 達 さ ん 身 長 ** ** ** ** ** ** 1 8 5 C M な ん だ も の す ご く ** 大 き い ね
Eval: S S D D D D D D S S S S S I I S S D S
Other examples of 表記ゆれ are as follows:
- personal pronoun: ワタシ - わたし - 私
- name: 渡辺 - 渡部 (ワタナベ)
- number: 一万 - 1万 - 10000
- symbol, unit: % - パーセント, g - グラム
- proper noun, English word: Twitter - ツイッター, Tower - タワー
- okurigana: 行なった - 行った (おこなった)
- adverb: ようやく - 漸く
- onomatopoeia: きらきら - キラキラ
Orthographic variants of major words are well maintained and summarized in dictionary databases. By analyzing Japanese sentences with such a dictionary, we can resolve the orthographic variants between reference and hypothesis to some extent (exceptions always exist).
The following lists show the [lemma, surface] pairs of words after applying a morphological analyzer (fugashi with UniDic).
Ref: [["アダチ", "足立"], ["さん", "さん"], ["身長", "身長"], ["百八十五", "185"], ["センチメートル", "センチメートル"], ["だ", "な"], ["の", "ん"], ["だ", "だ"], ["物凄い", "物凄く"], ["大きい", "おっきい"], ["ね", "ね"]]
Hyp: [["アダチ", "安達"], ["さん", "さん"], ["身長", "身長"], ["185", "185"], ["センチメートル", "cm"], ["だ", "な"], ["の", "ん"], ["だ", "だ"], ["物凄い", "ものすごく"], ["大きい", "大きい"], ["ね", "ね"]]
We can see that the lemma entry can be a hint for resolving orthographic variants. If we adjust the surface of the hypothesis words to that of the reference words, the character distance between reference and hypothesis is reduced.
Ref: 足立さん身長185センチメートルなんだ物凄くおっきいね
Hyp: 足立さん身長185センチメートルなんだ物凄くおっきいね
Here, numbers are also normalized by a separate processing step.
The actual CER alignment for the normalized example above looks like this:
Scores: (#C #S #D #I) 27 0 0 0
REF: 足 立 さ ん 身 長 1 8 5 セ ン チ メ ー ト ル な ん だ 物 凄 く お っ き い ね
HYP: 足 立 さ ん 身 長 1 8 5 セ ン チ メ ー ト ル な ん だ 物 凄 く お っ き い ね
Eval:
Of course, the normalization policy above (e.g., the handling of names) is not perfect and should be improved. However, by introducing this processing, ASR performance measurements and comparisons in CER or WER become more reliable.
Please also see the other tag information in the UniDic manual.
The actual procedure is listed below; a simplified code sketch follows the list.
- Character normalization: unicodedata.normalize with the NFKC option
- String replacement using Rules for exceptions (e.g., OOV words of UniDic)
- Apply a morphological analyzer (Fugashi: MeCab with UniDic)
  - Numbers are also normalized to the mixed number-Kanji representation, e.g., 1万, on a best-effort basis
- Align hypothesis words with reference words using the lemma entry of the POS tag by DP
- Replace the matched hypothesis words with the surface representations of the corresponding words in the reference text
- Reconstruct the hypothesis and reference text
- Run evaluation tools, e.g., for CER
  - sctk sclite (SCTK) is used for CER calculation in our toolkit
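The sketch below illustrates a simplified version of these steps under some assumptions: it uses difflib instead of the toolkit's DP alignment, and it omits the Rules and the number normalization, so its output will not match the toolkit exactly.

```python
# Simplified illustration of the procedure above (assumptions: difflib instead
# of the toolkit's DP alignment; Rules and number normalization are omitted).
import unicodedata
from difflib import SequenceMatcher
from fugashi import Tagger

tagger = Tagger()

def analyze(text):
    text = unicodedata.normalize("NFKC", text)          # character normalization
    return [(w.feature.lemma or w.surface, w.surface)   # (lemma, surface) pairs
            for w in tagger(text)]

def adjust_hypothesis(ref_text, hyp_text):
    ref, hyp = analyze(ref_text), analyze(hyp_text)
    adjusted = [surface for _, surface in hyp]
    # align hypothesis words with reference words by lemma,
    # then copy the reference surface for every matched word
    matcher = SequenceMatcher(a=[l for l, _ in ref], b=[l for l, _ in hyp])
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            adjusted[block.b + k] = ref[block.a + k][1]
    return "".join(adjusted)

ref = "足立さん身長百八十五センチメートルなんだ物凄くおっきいね"
hyp = "安達さん身長185cmなんだものすごく大きいね"
print(adjust_hypothesis(ref, hyp))
```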
The Rules are constructed manually after performance evaluation (by checking the alignment between reference and hypothesis text). This forced adjustment to the reference text may be applied to OOV words (words with no dictionary entry) or to cases that our text processing implementation does not cover. For example (an illustrative snippet of such forced replacements follows this list):
- English sentences
- OOV of dict: Netflix - ネットフリックス
- Some words: 易しい - やさしい
- Kanji numbers: 二十一二十二: 2, 11, 20, 2 or 21, 22, or 2, 10, 2, 12, etc...
- Numbers: 121314 - 12, 13, 14 or 1, 2, 1, 3, 1, 4 or 12万1314 or 12-1314
- Missing patterns caused by insufficient implementations
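The actual rule files consumed by pyscliteja.py (via the --pre_rulefile / --post_rulefile options) have their own format defined by the toolkit. The snippet below only illustrates, in plain Python, the kind of forced string replacement such rules perform; the word pairs are examples taken from the list above.

```python
# Illustration only: forced replacement of dictionary OOV words and exceptions.
# This is NOT the toolkit's rule-file format; see --pre_rulefile / --post_rulefile.
OOV_RULES = {
    "Netflix": "ネットフリックス",  # OOV of the dictionary
    "易しい": "やさしい",            # forced spelling adjustment
}

def apply_rules(text, rules=OOV_RULES):
    for src, dst in rules.items():
        text = text.replace(src, dst)
    return text

print(apply_rules("Netflixのドラマは易しい日本語だ"))
```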
You can calculate CER by preparing reference and hypothesis list files. Here, we show an example using the sample files.
Only sctk, fugashi and unidic are required to use pyscliteja.py.
The minimum installation is as follows:
sudo apt install sctk
python3 -m venv venv/fugashi
. venv/fugashi/bin/activate
python3 -m pip install 'fugashi[unidic]' pyyaml
python3 -m unidic download
You can use setup_for_tool.sh for automatic installation.
sh setup_for_tool.sh
Sample lists are in the sample directory.
asr-ja_evalkit$ ls sample
egs_hyplist.txt egs_reflist.txt numlist.txt
There are two key-value list files (TAB-delimited), corresponding to the reference and the hypothesis.
asr-ja_evalkit$ cat sample/egs_reflist.txt
spkr01-uttr01 足立さん身長百八十五センチメートルなんだ物凄くおっきいね
asr-ja_evalkit$ cat sample/egs_hyplist.txt
spkr01-uttr01 安達さん身長185cmなんだものすごく大きいね
. venv/fugashi/bin/activate
We can calculate CER from the two list files. ref_trn.txt and hyp_trn.txt are saved and used as input files for the sctk sclite command (an example of their format is shown after the log output below). The --scorefile option specifies the output filename for the CER score.
python3 pyscliteja.py sample/egs_reflist.txt sample/egs_hyplist.txt sample/ref_trn.txt sample/hyp_trn.txt --scorefile sample/score_with_normalize.txt
[LOG]: Namespace(ref_list='sample/egs_reflist.txt', hyp_list='sample/egs_hyplist.txt', ref_trn='sample/ref_trn.txt', hyp_trn='sample/hyp_trn.txt', scorefile='sample/score_with_normalize.txt', charnorm='charnorm-v1', pre_rulefile=None, tagger='fugashi-v1', disable_adjust=False, trnfmt='char', post_rulefile=None, ref_pre=None, hyp_pre=None, ref_adj=None, hyp_adj=None, ref_lemma=None, hyp_lemma=None)
[LOG]: load "CharNormalizerV1()" in PreProcessor
[LOG]: load "FugashiLemmaTaggerV1()"
[LOG]: load "CharFormatter()" in PostProcessor
We can confirm the alignment of each sentence pair and the total CER.
tail -n 6 sample/score_with_normalize.txt | head -n 4
Scores: (#C #S #D #I) 27 0 0 0
REF: 足 立 さ ん 身 長 1 8 5 セ ン チ メ ー ト ル な ん だ 物 凄 く お っ き い ね
HYP: 足 立 さ ん 身 長 1 8 5 セ ン チ メ ー ト ル な ん だ 物 凄 く お っ き い ね
Eval:
grep -e "Sum/" -e "Corr" sample/score_with_normalize.txt | head -n 2
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg| 1 27 |100.0 0.0 0.0 0.0 0.0 0.0 |
We can disable the normalization with the --disable_adjust option.
python3 pyscliteja.py sample/egs_reflist.txt sample/egs_hyplist.txt sample/ref_trn.txt sample/hyp_trn.txt --scorefile sample/score_without_normalize.txt --disable_adjust
[LOG]: Namespace(ref_list='sample/egs_reflist.txt', hyp_list='sample/egs_hyplist.txt', ref_trn='sample/ref_trn.txt', hyp_trn='sample/hyp_trn.txt', scorefile='sample/score_without_normalize.txt', charnorm='charnorm-v1', pre_rulefile=None, tagger='fugashi-v1', disable_adjust=True, trnfmt='char', post_rulefile=None, ref_pre=None, hyp_pre=None, ref_adj=None, hyp_adj=None, ref_lemma=None, hyp_lemma=None)
[LOG]: load "CharNormalizerV1()" in PreProcessor
[LOG]: load "CharFormatter()" in PostProcessor
We can again confirm the alignment of each sentence pair and the total CER.
tail -n 6 sample/score_without_normalize.txt | head -n 4
Scores: (#C #S #D #I) 11 10 7 2
REF: 足 立 さ ん 身 長 百 八 十 五 セ ン チ メ ー ト ル な ん だ *** *** 物 凄 く お っ き い ね
HYP: 安 達 さ ん 身 長 *** *** *** *** *** *** 1 8 5 C M な ん だ も の す ご く *** 大 き い ね
Eval: S S D D D D D D S S S S S I I S S D S
grep -e "Sum/" -e "Corr" sample/score_without_normalize.txt | head -n 2
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg| 1 28 | 39.3 35.7 25.0 7.1 67.9 100.0 |
Run setup.sh in the top directory "asr-ja_evalkit" to set up all models automatically.
sh setup.sh
If you want to select models for installation, please modify the options in the shell script. The following is a minimal setting:
###
stage=0
###
espnet=false
whisper=false
reazon=false
nue=false
funasr=false
fugashi=true
Move into the corpus directory.
asr-ja_evalkit$ cd spreds-d1
Run the corpus setup.sh to download the corpus and generate the lists of transcriptions and audio file paths.
asr-ja_evalkit/spreds-d1$ sh setup.sh
Run run_all.sh to start recognition by each ASR model and CER computation.
asr-ja_evalkit/spreds-d1$ sh run_all.sh
Run gentbl.sh at the top directory.
asr-ja_evalkit$ sh gentbl.sh
A summary of CER scores will be saved under the result directory.
We need to install sctk in the Ubuntu environment.
sudo apt install sctk
python3 -m venv venv
. venv/bin/activate
python3 -m pip install git+https://github.com/ouktlab/asr-ja_evalkit.git
We also need to install UniDic for fugashi.
python3 -m unidic download
If we do not download UniDic in advance, fugashi will output the following error message:
RuntimeError:
Failed initializing MeCab.
python3
>>> import pysctkja
>>> print(pysctkja.__version__)
0.1
>>>
After pip install, we can run the pyscliteja command directly instead of pyscliteja.py.
pyscliteja
usage: pyscliteja [-h] [--scorefile SCOREFILE] [--charnorm CHARNORM]
[--pre_rulefile PRE_RULEFILE] [--tagger TAGGER]
[--disable_adjust] [--trnfmt TRNFMT]
[--post_rulefile POST_RULEFILE] [--ref_pre REF_PRE]
[--hyp_pre HYP_PRE] [--ref_adj REF_ADJ] [--hyp_adj HYP_ADJ]
[--ref_lemma REF_LEMMA] [--hyp_lemma HYP_LEMMA]
ref_list hyp_list ref_trn hyp_trn
pyscliteja: error: the following arguments are required: ref_list, hyp_list, ref_trn, hyp_trn
For example, after preparing the "sample" directory, we can calculate CER with the following command:
pyscliteja sample/egs_reflist.txt sample/egs_hyplist.txt sample/ref_trn.txt sample/hyp_trn.txt --scorefile sample/score_with_normalize.txt
Our sample scripts output three types of CERs:
- CER of raw text
- CER of normalized text using fugashi (objective processing)
- CER of normalized text using rules + fugashi (rules may be subjective)
The datasets used for the evaluation examples are:
- SPREDS-D1 (NICT ASTREC. License - CC BY 4.0)
  - many fillers are included
  - segmented data are used
- SPREDS-D2 (NICT ASTREC. License - CC BY 4.0)
  - many fillers are included
  - segmented data are used
  - long audio files (over 30 sec.) are split in advance
- SPREDS-P1 (NICT ASTREC. License - CC BY 4.0)
  - segmented data are used
- SPREDS-U1 (NICT ASTREC. License - CC BY 4.0)
- JSUT (S. Takamichi. License of tags - CC-BY-SA 4.0; the audio data is only for research by academic institutions, non-commercial research, and personal use) (modified)
- CSJ (please purchase this corpus)
  - many fillers are included
  - assumes the eval1, eval2, and eval3 sets built by the ESPnet CSJ recipe
Some data sets are automatically downloaded by shell scripts.
Sample ASR models are:
- ESPnet models (character-based ASR) in our lab.
  - ESPnet(CSJ core) -- (core set)
  - ESPnet(CSJ full) -- (full except D*)
  - ESPnet(Corpus10)
    - JSUT is a semi-closed set for the Corpus10 model (jvs corpus used for training)
- ESPnet models (character-based ASR) for streaming in our lab.
  - ESPnet-st(Corpus10) (0.25 sec. segments in inference)
    - JSUT is a semi-closed set for the Corpus10 model (jvs corpus used for training)
- ESPnet models (character-based ASR)
  - ESPnet(Laborotv)
- Syllable-ASR & syllable-to-character translation in our lab.
  - SASR+SCT(Corpus10) with 1-best search = cascaded process
    - JSUT is a semi-closed set for the Corpus10 model (jvs corpus used for training)
- Reazon speech
- FunASR (SenseVoiceSmall)
- Whisper (large-v3)
- Nue
Rules were added by checking CER results of ASR models. The order of the check is Whisper, Nue, Reazon, ESPnet, and SASR-SCT.
The CER difference between raw text and fugashi-normalized text sometimes reaches 3 to 5 points.
Because the dialogue corpora include many fillers, deletion errors on fillers affect the CERs of some ASR models.
Note that these results may change slightly after updates of this toolkit.
Because the reference text is also modified (e.g., numbers and words covered by rules), the total number of characters also changes.
asr-ja_evalkit$ cat result/summary_score_charnorm-v1_rawtext.txt
| CER (%) | csj | jsut | spreds-d1 | spreds-d2 | spreds-p1 | spreds-u1 |
|---|---|---|---|---|---|---|
| 01:ESPnet(CSJ core) | 6.32 | 21.00 | 17.11 | 28.94 | 18.34 | 19.28 |
| 02:ESPnet(CSJ full) | 4.05 | 12.29 | 14.50 | 21.95 | 15.05 | 9.21 |
| 03:ESPnet(Corpus10) | 3.76 | 6.78 | 10.56 | 17.01 | 5.24 | 5.63 |
| 04:SASR+SCT(Corpus10) | 4.02 | 6.18 | 10.33 | 17.02 | 5.24 | 4.80 |
| 11:ESPnet-st(Corpus10) | 4.37 | 8.84 | 13.27 | 19.55 | 7.06 | 7.11 |
| 20:ESPnet(Laborotv) | 19.69 | 11.06 | 20.98 | 28.95 | 12.84 | 7.88 |
| 21:Whisper(large-v3) | 17.34 | 6.87 | 17.43 | 19.96 | 4.96 | 4.79 |
| 22:Nue | 29.05 | 8.76 | 25.50 | 31.66 | 7.95 | 5.43 |
| 23:Reazon | 18.47 | 6.85 | 15.87 | 20.07 | 3.25 | 4.77 |
| 24:FunASR | 13.71 | 7.89 | 13.58 | 20.36 | 3.52 | 5.43 |
| | csj | jsut | spreds-d1 | spreds-d2 | spreds-p1 | spreds-u1 |
|---|---|---|---|---|---|---|
| # of characters | 115744 | 205375 | 49406 | 23099 | 9747 | 26360 |
asr-ja_evalkit$ cat result/summary_score_charnorm-v1_fugashi-v1_rule-none.txt
| CER (%) | csj | jsut | spreds-d1 | spreds-d2 | spreds-p1 | spreds-u1 |
|---|---|---|---|---|---|---|
| 01:ESPnet(CSJ core) | 5.72 | 17.08 | 13.65 | 26.20 | 15.13 | 16.46 |
| 02:ESPnet(CSJ full) | 3.57 | 7.91 | 10.98 | 18.82 | 12.03 | 6.01 |
| 03:ESPnet(Corpus10) | 3.25 | 3.53 | 6.88 | 13.97 | 2.76 | 3.07 |
| 04:SASR+SCT(Corpus10) | 3.44 | 2.24 | 7.08 | 14.12 | 2.48 | 1.69 |
| 11:ESPnet-st(Corpus10) | 3.71 | 4.82 | 9.86 | 16.45 | 4.05 | 4.11 |
| 20:ESPnet(Laborotv) | 16.75 | 6.00 | 18.76 | 27.04 | 10.20 | 5.75 |
| 21:Whisper(large-v3) | 13.31 | 3.63 | 15.99 | 18.17 | 3.49 | 2.00 |
| 22:Nue | 25.50 | 4.43 | 24.03 | 29.72 | 6.17 | 2.38 |
| 23:Reazon | 14.21 | 3.09 | 13.99 | 17.38 | 1.30 | 1.52 |
| 24:FunASR | 10.95 | 5.07 | 11.93 | 18.65 | 2.20 | 2.86 |
| | csj | jsut | spreds-d1 | spreds-d2 | spreds-p1 | spreds-u1 |
|---|---|---|---|---|---|---|
| # of characters | 115672 | 205449 | 49408 | 23103 | 9747 | 26382 |
asr-ja_evalkit$ cat result/summary_score_charnorm-v1_fugashi-v1_rule-lax.txt
| CER (%) | csj | jsut | spreds-d1 | spreds-d2 | spreds-p1 | spreds-u1 |
|---|---|---|---|---|---|---|
| 01:ESPnet(CSJ core) | 5.64 | 17.04 | 12.16 | 26.03 | 15.16 | 16.22 |
| 02:ESPnet(CSJ full) | 3.51 | 7.86 | 9.39 | 18.62 | 12.07 | 5.68 |
| 03:ESPnet(Corpus10) | 3.20 | 3.47 | 5.31 | 13.75 | 2.70 | 2.63 |
| 04:SASR+SCT(Corpus10) | 3.36 | 2.17 | 5.39 | 13.90 | 2.51 | 1.32 |
| 11:ESPnet-st(Corpus10) | 3.68 | 4.74 | 8.43 | 16.29 | 4.02 | 3.70 |
| 20:ESPnet(Laborotv) | 16.43 | 5.92 | 18.34 | 26.96 | 9.98 | 5.33 |
| 21:Whisper(large-v3) | 12.75 | 3.52 | 14.74 | 18.02 | 3.18 | 1.43 |
| 22:Nue | 25.10 | 4.34 | 23.66 | 29.55 | 5.94 | 1.84 |
| 23:Reazon | 13.71 | 3.00 | 13.54 | 17.25 | 1.04 | 1.06 |
| 24:FunASR | 10.30 | 4.99 | 11.49 | 18.56 | 1.96 | 2.48 |
| | csj | jsut | spreds-d1 | spreds-d2 | spreds-p1 | spreds-u1 |
|---|---|---|---|---|---|---|
| # of characters | 115676 | 205483 | 49377 | 23119 | 9747 | 26378 |