Added phonetisaurus-based g2p scripts #2730

Merged
merged 5 commits into kaldi-asr:master from huangruizhe:g2p_phonetisaurus on Oct 9, 2018

Conversation

huangruizhe
Contributor

Phonetisaurus-based g2p was added, and corresponding changes were made to the multi_en recipe.

  1. The correctness of the language-modeling script was verified by confirming it generates the same *.arpa file as the one used in the multi_en recipe.

  2. The correctness of the g2p result was verified by running the multi_en recipe. Our new scripts generate the same lexicon as the old script, except for a few entries (5 out of 35,709), which may be due to tied scores or floating-point rounding:

$ diff s5_new/data/local/dict_nosp/lexicon.txt s5_old/data/local/dict_nosp/lexicon.txt
46990c46990
< consubstantial	k ah n s ah b s t ae n sh ah l
---
> consubstantial	k ah n s ah b s t ae n ch ah l
153708c153708
< necticut	n eh t ah k ah t
---
> necticut	n eh t ih k ah t
168558c168558
< pennyless	p eh n iy l ih s
---
> pennyless	p eh n iy l ah s
189949c189949
< rhamni	r ae m iy
---
> rhamni	r ae m ay
236164c236164
< unsubstantial	ah n s ah b s t ae n sh ah l
---
> unsubstantial	ah n s ah b s t ae n ch ah l

number of entries in the original lexicon: 217,594
number of missing words in the corpus: 35,709
number of newly generated pronunciations: 35,708
(WARNING:phonetisaurus-apply:2018-09-20 22:03:04: No pronunciation for word: 'eux')
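
As context, a minimal sketch (hypothetical paths, not from this PR) of how such a missing-word list is typically derived before applying g2p:

# Corpus words that have no entry in the current lexicon become the g2p
# input list (cf. missing_onlywords.txt later in this thread).
awk '{print $1}' data/local/dict_nosp/lexicon.txt | sort -u > words_in_lexicon.txt
cut -d' ' -f2- data/train/text | tr ' ' '\n' | awk 'NF' | sort -u > words_in_corpus.txt
comm -23 words_in_corpus.txt words_in_lexicon.txt > missing_onlywords.txt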

@xiaohui-zhang
Contributor

Can you remove my scripts multi_en/s5/local/g2p/{train, apply}_g2p.sh, and move make_kn_lm.py from steps/dict to utils/lang? @danpovey Can you add him as an authorized user?

@danpovey
Contributor

@xiaohui-zhang there isn't really a relevant concept of authorized user here.

@xiaohui-zhang
Contributor

@danpovey Never mind, something was wrong on my end.

@huangruizhe
Contributor Author

huangruizhe commented Sep 23, 2018

We made the following fixes:

  1. Added a column of probabilities to the lexicon generated by Phonetisaurus. Note that we have to specify the --nbest and --pmass options together to get probabilities that make sense: the reported value is the probability of a pronunciation within an event space of the nbest pronunciations, not of all possible pronunciations (see the example after this list). When nbest is specified alone, pmass is set to 1.0; when pmass is specified alone, nbest is implicitly set to 20.
  2. Other minor fixes.
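
As an illustration (hypothetical paths; flags as used later in this PR), the two options are passed together like this:

# Keep at most 3 variants per word, but only as many as cover 90% of the
# probability mass; --prob adds the probability column described above.
phonetisaurus-apply --model exp/g2p/model.fst --word_list oov_words.txt \
  --nbest 3 --pmass 0.9 --prob > lexicon_out.txt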


[ ! -z $nbest ] && [[ ! $nbest =~ ^[0-9]+$ ]] && echo "$0: nbest should be a positive integer." && exit 1
[ ! -z $pmass ] && ! { [[ $pmass =~ ^[0-9]+\.?[0-9]*$ ]] && [ $(bc <<< "$pmass >= 0") -eq 1 -a $(bc <<< "$pmass <= 1") -eq 1 ]; } \
&& echo "$0: pmass should be within [0, 1]." && exit 1
Contributor

We don't have to check pmass here, since Phonetisaurus checks it internally.

[ ! -z $nbest ] && [[ ! $nbest =~ ^[0-9]+$ ]] && echo "$0: nbest should be a positive integer." && exit 1
[ ! -z $pmass ] && ! { [[ $pmass =~ ^[0-9]+\.?[0-9]*$ ]] && [ $(bc <<< "$pmass >= 0") -eq 1 -a $(bc <<< "$pmass <= 1") -eq 1 ]; } \
&& echo "$0: pmass should be within [0, 1]." && exit 1
[ -z $pmass ] && [ -z $nbest ] && nbest=1
Contributor

Don't allow this case; if the user specified nothing, just throw an error.

echo "main options (for others, see top of script file)"
echo " --nbest <int> # Maximum number of hypotheses to produce. By default, nbest=1."
echo " --pmass <float> # Select the maximum number of hypotheses summing to a total mass of pmass amount, within [0, 1], for a word."
echo " --nbest <int> --pmass <float> # When specified together, we generate the intersection of these two options."
Contributor

Remove this line.

echo "e.g.: $0 exp/g2p/model.fst exp/g2p/oov_words.txt data/local/dict_nosp/lexicon.txt"
echo ""
echo "main options (for others, see top of script file)"
echo " --nbest <int> # Maximum number of hypotheses to produce. By default, nbest=1."
Contributor

The default value is 20, I think.

@huangruizhe changed the title from "add phonetisaurus-based g2p" to "Added phonetisaurus-based g2p scripts" on Sep 23, 2018
@xiaohui-zhang
Contributor

@danpovey Looks fine to me now.

@danpovey
Contributor

Great! @jtrmal, would you mind doing a quick pass of review and then merging it? Since you wrote the original g2p scripts, it would probably be good for you to review this.

@jtrmal
Contributor

jtrmal commented Sep 23, 2018 via email

@jtrmal left a comment (Contributor)

For now, just high-level comments. Overall the code looks good.

if args.lm is None:
    ngram_counts.print_as_arpa()
else:
    with open(args.lm, 'w', encoding="utf-8") as f:
Contributor

Not sure about this -- I think the encoding should be latin1, as in any other encoding-agnostic script (similarly throughout the whole file).

Contributor Author

Thanks. I added this because a warning or error popped up. I will look into this.

Contributor Author

Fixed.


def add_raw_counts_from_file(self, filename):
    lines_processed = 0
    with open(filename, encoding="utf-8") as fp:
Contributor

encoding=latin1?

Contributor

I agree, we should use latin-1, which will make it work with things like GBK; but you have to be very careful about your use of strip() and split() in that case, because there is a latin-1 whitespace character (NBSP) whose byte value also falls within the range UTF-8 uses for encoding. Please see other Kaldi scripts, e.g. in the rnnlm/ directory, for examples.
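
To illustrate the pitfall (a minimal Python 3 sketch, not from the PR):

# A 0xa0 byte decoded as latin-1 becomes U+00A0 (no-break space), which the
# no-argument split() treats as whitespace, silently breaking a token in two.
line = b"word\xa0pron".decode("latin-1")
print(line.split())     # ['word', 'pron']  -- token split on the NBSP
print(line.split(' '))  # ['word\xa0pron']  -- splitting on an ASCII space is safe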

Contributor Author

Many thanks, I will look into the encoding issue.

Contributor Author

Fixed.

awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i; if(!(s in a)) print $1" "s}' \
$silence_phones $lexicon | \
awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' | \
uconv -f utf-8 -t utf-8 -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt
Contributor

-f "$encoding" -t "$encoding"

I guess it's a matter of design decision whether we want to put NFC there -- IMO the user should be responsible for that. Also, I'm not sure how that would work for encodings other than the Unicode ones.

Contributor

I agree that the Unicode normalization should be the user's responsibility at the data preparation stage, before this script gets called.

Contributor Author

Fixed.

uconv -f utf-8 -t utf-8 -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt
else
awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' $lexicon | \
uconv -f utf-8 -t utf-8 -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt
Contributor

Ditto.

Contributor Author

Fixed.

[ ! -z $nbest ] && [[ ! $nbest =~ ^[1-9][0-9]*$ ]] && echo "$0: nbest should be a positive integer." && exit 1;
[ -z $pmass ] && [ -z $nbest ] && echo "$0: nbest or/and pmass should be specified." && exit 1;

if [ -z $pmass ]; then
Contributor

This part of the code would get simpler if you set nbest=20 at the beginning of the script.
Things to consider:
nbest=20 seems too much (I think Dan knows of some thesis showing that extra variants are actually harmful);
pmass=1.0 might be too big, causing too large a graph during generation of the variants -- perhaps 0.95 might be good enough?

I don't claim I know the right answers, just thinking aloud.

Contributor

Actually, there doesn't seem to be a default. I agree 20 is too much -- normally 3 would be a reasonable limit.

Contributor

Actually, the reason we used 20 by default is that when the user only sets pmass, we need a large enough nbest value to get correct pron-probs, since Phonetisaurus only computes pron-probs over the nbest list. When the user wants to rely on nbest, we always leave the responsibility for setting a proper value to the user, which is why we didn't set it at the beginning of the script. In summary, we want to allow all three ways of specifying constraints (pmass, nbest, or both), and let the user determine the proper values.

Contributor Author

Thanks! I will look into the Phonetisaurus code again and confirm why we chose these default values. I will also consider how to make the code simpler.

Contributor Author

The code has been made simpler, but we keep the "nbest=20, pmass=1.0" behavior. The justification is as follows (an elaboration of what Xiaohui said above).

Users have three options here:

  1. only set nbest, e.g. nbest=3
    In this case, pmass needs to be implicitly set to 1.0 (instead of Phonetisaurus's default of 0.0), which never affects the nbest=3 constraint.

  2. only set pmass, e.g. pmass=0.95
    In this case, nbest is implicitly set to 20: Phonetisaurus computes probabilities over the nbest list (PhonetisaurusScript.h, lines 166-186), so we need to specify a large enough nbest value here.

  3. set both nbest and pmass
    In this case, the intersection of the two constraints applies, and the user has full control over both.

What we called a "default" was a bit misleading: these are actually implicitly-set values, required for implementation reasons.
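
A minimal sketch of the resulting option-resolution logic (variable names as in the script, but not the verbatim final code):

# Resolve implicit values before calling phonetisaurus-apply.
if [ -z "$nbest" ] && [ -z "$pmass" ]; then
  echo "$0: nbest and/or pmass must be specified." && exit 1
elif [ -z "$pmass" ]; then
  options="--nbest $nbest --pmass 1.0"     # case 1: nbest only
elif [ -z "$nbest" ]; then
  options="--nbest 20 --pmass $pmass"      # case 2: pmass only; 20 is "large enough"
else
  options="--nbest $nbest --pmass $pmass"  # case 3: intersection of both constraints
fi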

phonetisaurus-apply $options --model $model --thresh 5 --accumulate --verbose --prob --word_list $word_list 1>$out_lexicon

if [ $(tr -d [:space:] < $out_lexicon | wc -c) -eq 0 ]; then
echo "$0: Did not generate the lexicon $out_lexicon for new words succesfully, which has no content." && exit 1;
Contributor

I don't really understand this message.

Anyway, it is not so uncommon for some words to fail to obtain a pronunciation; it can happen when a new character is seen in the new set of words. IMO the preferred solution would be a warning message plus a list of the words for which the process failed (for example in a single file, with a link to that file).

But again, perhaps it's a design decision, and the user should filter out the problematic words beforehand (by running this script several times).

Maybe a good compromise would be to generate the list of failed words and exit. That might be useful even when creating a new system/lexicon.

Contributor

That's definitely useful. @huangruizhe, can you look at whether such information can be found in the generated log file, and make sure it's accessible to the user? I remember seeing warnings like that.

Contributor Author

Thanks! I agree with the suggested design. Yes, Phonetisaurus outputs warnings for words for which the process failed.

I wrote this line because of the following situation:

When the user inputs --pmass 3.0, which is illegal, Phonetisaurus outputs an error and terminates -- and generates an empty lexicon file. However, we cannot capture Phonetisaurus's exit status, due to its imperfect implementation. As a result, the subsequent pipeline might be ignorant of the failure of lexicon generation.

I am not sure how to handle this... Perhaps we can check pmass ourselves, reducing any risk from Phonetisaurus's side.

Contributor Author

Fixed.
I have removed this message. I have also checked that there are warning messages for failed words. Thus, it is now possible that the extended lexicon is empty or that Phonetisaurus exits with an error -- but users will see the warnings or error messages, and they are responsible for making sure the commands are correct, rerunning this script as needed.


steps/dict/apply_g2p_phonetisaurus.sh --nbest 1 exp/g2p/model.fst $g2p_tmp_dir/missing_onlywords.txt $g2p_tmp_dir/missing_lexicon.txt || exit 1;

expanded_lexicon=$dict_dir/lexicon.txt
Contributor

Another design decision, I guess -- copy this lexicon handling into a separate local/ script and, instead of generating a single file, generate a new dict directory; I think that would make a nice and coherent interface.

BTW, "expanded lexicon" has a specific meaning for the Babel scripts, and we have even published a paper using that nomenclature, so maybe some other word would be more suitable, to prevent confusion?

Contributor

How about "extended"?

Contributor

Also, did you mean doing lines 114-116 inside steps/dict/apply_g2p_phonetisaurus.sh? That's indeed nice in most cases, but in some cases we just want to generate prons for a word list rather than produce a valid dict dir. What do you think? @jtrmal

Contributor

I'm leaning slightly towards "extended", but feel free to decide on your own.
Regarding the second question -- perhaps it's OK as it is.

Contributor Author

Naming: fixed.
Dict directory issue: kept as it is.

@huangruizhe
Contributor Author

huangruizhe commented Oct 1, 2018

My apologies for the delay. I have fixed the above issues and this PR is now ready for another pass of review or merge. Thanks for the patient suggestions! @jtrmal @danpovey @xiaohui-zhang

@jtrmal
Contributor

jtrmal commented Oct 1, 2018 via email

Cool, thanks. I'll check once more, but I probably won't get to it today -- please ping me in one or two days if I haven't gotten to it by then. y.

@huangruizhe
Contributor Author

Sure, thanks!

stage=0
nbest= # Generate up to $nbest variants
pmass= # Generate only as many variants as needed to cover a $pmass amount (e.g. 90%) of the prob mass
# End configuration section.
Contributor

@huangruizhe Can you add "thresh" as an option here? Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh. (Sorry, I just realized today that I already wrote a script like the current one two years ago...) Also, please explain a bit more about the nbest and pmass options, again by referring to the above script. Thanks!

Contributor Author

Fixed.

word_list=$2
out_lexicon=$3
out_lexicon_failed="${out_lexicon}.failed"

Contributor

Check whether Phonetisaurus is installed here. Please also refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh.

Contributor Author

Fixed.

uconv -f "$encoding" -t "$encoding" -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt
fi
fi

Contributor

Check whether Phonetisaurus is installed here as well. Please also refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/train_g2p.sh.

Contributor Author

Fixed.

1>$out_lexicon

echo "$0: Completed. Synthesized lexicon for new words is in $out_lexicon"

Contributor

@huangruizhe Can you address Yenda's earlier comment: generate a list of failed words in a file and point the user to it in the echo message? The warning messages from phonetisaurus are not consolidated into a file, so the user may miss them and want to find those words in a file. Actually, I noticed your "out_lexicon_failed" is not used at all.
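
One possible shape for that consolidation (hypothetical file names and log path; the actual fix may differ), keyed off the warning format quoted earlier in this thread:

# Warnings look like: WARNING:phonetisaurus-apply:...: No pronunciation for word: 'eux'
# Collect the failed words into a side file and point the user to it.
grep -o "No pronunciation for word: '[^']*'" "$wdir/g2p.log" \
  | sed "s/.*'\(.*\)'/\1/" > "$out_lexicon_failed"
[ -s "$out_lexicon_failed" ] && \
  echo "$0: some words got no pronunciation; see $out_lexicon_failed"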

Contributor Author

Fixed.

model=$1
word_list=$2
out_lexicon=$3
out_lexicon_failed="${out_lexicon}.failed"
Contributor

Also, to keep the convention, you should probably ask the user to specify $outdir and write the output lexicon to $outdir/lexicon.lex, as is done in the current apply_g2p.sh, and then put the list of failed words in $outdir/lexicon.failed.

Contributor Author

Fixed. Many thanks for all the above suggestions!

@huangruizhe
Contributor Author

Addressed all of Xiaohui's suggestions. Ready for another pass of review or merge. @jtrmal

@xiaohui-zhang
Contributor

Looks good. Once @jtrmal says OK, it can be merged. Thanks.

@xiaohui-zhang
Contributor

@danpovey Can this be merged soon? I'm preparing my lexicon-learning PR, which uses scripts from this PR.

@danpovey merged commit 735e2a5 into kaldi-asr:master on Oct 9, 2018
@huangruizhe deleted the g2p_phonetisaurus branch on October 9, 2018 at 21:39
chenzhehuai added a commit to chenzhehuai/kaldi that referenced this pull request Jun 3, 2019