-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update arpa2fst invocations in common (wsj/{utils,steps}) files #682
Conversation
fstisstochastic $dir/phone_graph/G_phones.fst || echo "[info]: G_phones not stochastic." | ||
gunzip -c $dir/phone_graph/arpa_noug.gz | \ | ||
arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \ | ||
fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed an issue in the arpa2fst usage:
Typo in neither:
--keep-symbols : Store symbol table with FST. Forced true if symbol tables are neiter read or written (bool, default = false)
I think the usage could also do with a little more text explaining the effect of the various symbol table options, and what the choices are. I don't think the behavior is very obvious or intuitive from the current usage message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long a usage line is acceptable?
Would this statement be more clear:
Store symbol table with FST. Symbols are always stored if symbol tables are neither read or written otherwise
This complication is intended to keep arpa2fst backward-compatible. The old behavior of just arpa2fst input | fstprint
is to print the FST with symbols. I could have set the default to true always, but that would be a wholesale inconvenience in the new invocation pattern that would have to carry --keep_symbols=false over every call. Hence I took the tradeoff of a more complex default. Arguably, this makes some intuitive sense: if an ARPA file is read, but no words.txt is provided and the symbols are not saved to a separate file, force save the symbols to FST, otherwise they will be lost and the FST will be useless. In another program I would just print an error message if an invalid combination of switches is provided, but in this case it does not cut it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Mon, Apr 11, 2016 at 2:19 PM, Kirill Katsnelson <notifications@github.com
wrote:
In egs/wsj/s5/steps/make_phone_graph.sh
#682 (comment):@@ -92,17 +92,14 @@ fi
if [ $stage -le 3 ]; then
echo "$0: creating G_phones.fst from ARPA"
- gunzip -c $dir/phone_graph/arpa_noug.gz | arpa2fst - - | fstprint | \
- utils/eps2disambig.pl | utils/s2eps.pl | \
- awk '{if (NF < 5 || $5 < 100.0) { print; }}' | \
- fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/phones.txt \
--keep_isymbols=false --keep_osymbols=false | \
- fstconnect | \
- fstrmepsilon > $dir/phone_graph/G_phones.fst
- fstisstochastic $dir/phone_graph/G_phones.fst || echo "[info]: G_phones not stochastic."
- gunzip -c $dir/phone_graph/arpa_noug.gz | \
- arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \
- fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \
How long a usage line is acceptable?
Would this statement be more clear:
Store symbol table with FST. Symbols are always stored if symbol tables
are neither read or written otherwiseI wasn't talking about the options-registering string, but about the usage
message (after "usage:") for the tool as a whole. And example command
lines (after the usage, after "e.g.:"-- you could have 'new' and 'old'
example command lines, and explain that in the 'new' one, the symbol tables
are not output, and explain the differences in the output. Since the
behavior of this tool is much more complex than most Kaldi tools, it
deserves more explanation in the usage message. E.g. would it be clear to
you yourself in 10 years what the program was doing?This complication is intended to keep arpa2fst backward-compatible. The
old behavior of just arpa2fst input | fstprint is to print the FST with
symbols. I could have set the default to true always, but that would be a
wholesale inconvenience in the new invocation pattern that would have to
carry --keep_symbols=false over every call. Hence I took the tradeoff of a
more complex default. Arguably, this makes some intuitive sense: if an ARPA
file is read, but no words.txt is provided and the symbols are not saved to
a separate file, force save the symbols to FST, otherwise they will be lost
and the FST will be useless. In another program I would just print an error
message if an invalid combination of switches is provided, but in this case
it does not cut it.—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
https://github.com/kaldi-asr/kaldi/pull/682/files/6e8dd7a3e849f08d04b4fb28cd3fbd772297d4bd#r59283535
gunzip -c $dir/phone_graph/arpa_noug.gz | \ | ||
arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \ | ||
fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \ | ||
fstconnect > $dir/phone_graph/G_phones.fst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fstrmepsilon was in the old command line. I think this may have been important but I forget the reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was in every pipeline because of utils/s2eps.pl. The new invocation of arpa2fst with --disambig-symbol produces inherently epsilon-free machines. fstconnect
does not add arcs, so fstrmepsilon
is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks.
On Wed, Apr 13, 2016 at 2:48 PM, Kirill Katsnelson <notifications@github.com
wrote:
In egs/wsj/s5/steps/make_phone_graph.sh
#682 (comment):@@ -92,17 +92,14 @@ fi
if [ $stage -le 3 ]; then
echo "$0: creating G_phones.fst from ARPA"
- gunzip -c $dir/phone_graph/arpa_noug.gz | arpa2fst - - | fstprint | \
- utils/eps2disambig.pl | utils/s2eps.pl | \
- awk '{if (NF < 5 || $5 < 100.0) { print; }}' | \
- fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/phones.txt \
--keep_isymbols=false --keep_osymbols=false | \
- fstconnect | \
- fstrmepsilon > $dir/phone_graph/G_phones.fst
- fstisstochastic $dir/phone_graph/G_phones.fst || echo "[info]: G_phones not stochastic."
- gunzip -c $dir/phone_graph/arpa_noug.gz | \
- arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \
- fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \
- fstconnect > $dir/phone_graph/G_phones.fst
It was in every pipeline because of utils/s2eps.pl. The new invocation of
arpa2fst with --disambig-symbol produces inherently epsilon-free machines.
fstconnect does not add arcs, so fstrmepsilon is not needed.—
You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub
https://github.com/kaldi-asr/kaldi/pull/682/files/6e8dd7a3e849f08d04b4fb28cd3fbd772297d4bd#r59632010
Just confirmed the
|
OK thanks. On Wed, Apr 13, 2016 at 3:19 PM, Kirill Katsnelson <notifications@github.com
|
Only 3 files, but changes are not quite as trivial as in egs scripts.
steps/make_phone_graph.sh
andutils/format_lm_sri.sh
are both invoked fromswbd/s5{b,c}
, could be tested by this recipe.egs/wsj/s5/utils/reverse_lm.sh
is called only from aurora4/s5 and sprakbanken/s5.