Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update arpa2fst invocations in common (wsj/{utils,steps}) files #682

Merged
merged 1 commit into from
Apr 13, 2016

Conversation

kkm000
Copy link
Contributor

@kkm000 kkm000 commented Apr 11, 2016

Only 3 files, but changes are not quite as trivial as in egs scripts.

steps/make_phone_graph.sh and utils/format_lm_sri.sh are both invoked from swbd/s5{b,c}, could be tested by this recipe.

egs/wsj/s5/utils/reverse_lm.sh is called only from aurora4/s5 and sprakbanken/s5.

fstisstochastic $dir/phone_graph/G_phones.fst || echo "[info]: G_phones not stochastic."
gunzip -c $dir/phone_graph/arpa_noug.gz | \
arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \
fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed an issue in the arpa2fst usage:
Typo in neither:
--keep-symbols : Store symbol table with FST. Forced true if symbol tables are neiter read or written (bool, default = false)

I think the usage could also do with a little more text explaining the effect of the various symbol table options, and what the choices are. I don't think the behavior is very obvious or intuitive from the current usage message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long a usage line is acceptable?

Would this statement be more clear:

Store symbol table with FST. Symbols are always stored if symbol tables are neither read or written otherwise

This complication is intended to keep arpa2fst backward-compatible. The old behavior of just arpa2fst input | fstprint is to print the FST with symbols. I could have set the default to true always, but that would be a wholesale inconvenience in the new invocation pattern that would have to carry --keep_symbols=false over every call. Hence I took the tradeoff of a more complex default. Arguably, this makes some intuitive sense: if an ARPA file is read, but no words.txt is provided and the symbols are not saved to a separate file, force save the symbols to FST, otherwise they will be lost and the FST will be useless. In another program I would just print an error message if an invalid combination of switches is provided, but in this case it does not cut it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Mon, Apr 11, 2016 at 2:19 PM, Kirill Katsnelson <notifications@github.com

wrote:

In egs/wsj/s5/steps/make_phone_graph.sh
#682 (comment):

@@ -92,17 +92,14 @@ fi

if [ $stage -le 3 ]; then
echo "$0: creating G_phones.fst from ARPA"

  • gunzip -c $dir/phone_graph/arpa_noug.gz | arpa2fst - - | fstprint | \
  • utils/eps2disambig.pl | utils/s2eps.pl | \
  • awk '{if (NF < 5 || $5 < 100.0) { print; }}' | \
  • fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/phones.txt \
  •   --keep_isymbols=false --keep_osymbols=false | \
    
  • fstconnect | \
  • fstrmepsilon > $dir/phone_graph/G_phones.fst
  • fstisstochastic $dir/phone_graph/G_phones.fst || echo "[info]: G_phones not stochastic."
  • gunzip -c $dir/phone_graph/arpa_noug.gz | \
  • arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \
  • fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \

How long a usage line is acceptable?

Would this statement be more clear:

Store symbol table with FST. Symbols are always stored if symbol tables
are neither read or written otherwise

I wasn't talking about the options-registering string, but about the usage
message (after "usage:") for the tool as a whole. And example command
lines (after the usage, after "e.g.:"-- you could have 'new' and 'old'
example command lines, and explain that in the 'new' one, the symbol tables
are not output, and explain the differences in the output. Since the
behavior of this tool is much more complex than most Kaldi tools, it
deserves more explanation in the usage message. E.g. would it be clear to
you yourself in 10 years what the program was doing?

This complication is intended to keep arpa2fst backward-compatible. The
old behavior of just arpa2fst input | fstprint is to print the FST with
symbols. I could have set the default to true always, but that would be a
wholesale inconvenience in the new invocation pattern that would have to
carry --keep_symbols=false over every call. Hence I took the tradeoff of a
more complex default. Arguably, this makes some intuitive sense: if an ARPA
file is read, but no words.txt is provided and the symbols are not saved to
a separate file, force save the symbols to FST, otherwise they will be lost
and the FST will be useless. In another program I would just print an error
message if an invalid combination of switches is provided, but in this case
it does not cut it.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
https://github.com/kaldi-asr/kaldi/pull/682/files/6e8dd7a3e849f08d04b4fb28cd3fbd772297d4bd#r59283535

kkm000 pushed a commit to kkm000/kaldi that referenced this pull request Apr 12, 2016
kkm000 pushed a commit to kkm000/kaldi that referenced this pull request Apr 12, 2016
gunzip -c $dir/phone_graph/arpa_noug.gz | \
arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \
fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \
fstconnect > $dir/phone_graph/G_phones.fst
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fstrmepsilon was in the old command line. I think this may have been important but I forget the reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was in every pipeline because of utils/s2eps.pl. The new invocation of arpa2fst with --disambig-symbol produces inherently epsilon-free machines. fstconnect does not add arcs, so fstrmepsilon is not needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks.

On Wed, Apr 13, 2016 at 2:48 PM, Kirill Katsnelson <notifications@github.com

wrote:

In egs/wsj/s5/steps/make_phone_graph.sh
#682 (comment):

@@ -92,17 +92,14 @@ fi

if [ $stage -le 3 ]; then
echo "$0: creating G_phones.fst from ARPA"

  • gunzip -c $dir/phone_graph/arpa_noug.gz | arpa2fst - - | fstprint | \
  • utils/eps2disambig.pl | utils/s2eps.pl | \
  • awk '{if (NF < 5 || $5 < 100.0) { print; }}' | \
  • fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/phones.txt \
  •   --keep_isymbols=false --keep_osymbols=false | \
    
  • fstconnect | \
  • fstrmepsilon > $dir/phone_graph/G_phones.fst
  • fstisstochastic $dir/phone_graph/G_phones.fst || echo "[info]: G_phones not stochastic."
  • gunzip -c $dir/phone_graph/arpa_noug.gz | \
  • arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \
  • fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \
  • fstconnect > $dir/phone_graph/G_phones.fst

It was in every pipeline because of utils/s2eps.pl. The new invocation of
arpa2fst with --disambig-symbol produces inherently epsilon-free machines.
fstconnect does not add arcs, so fstrmepsilon is not needed.


You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub
https://github.com/kaldi-asr/kaldi/pull/682/files/6e8dd7a3e849f08d04b4fb28cd3fbd772297d4bd#r59632010

@danpovey danpovey merged commit 502bf4a into kaldi-asr:master Apr 13, 2016
@kkm000
Copy link
Contributor Author

kkm000 commented Apr 13, 2016

Just confirmed the reverse_lm script produced an apparently identical G.fst with a relatively small ARPA model (the G has ~200K arcs).

$ fstequivalent --random --npath=2000 data/lang.reverse.orig/G.fst data/lang.reverse/G.fst; echo $?
0

@danpovey
Copy link
Contributor

OK thanks.

On Wed, Apr 13, 2016 at 3:19 PM, Kirill Katsnelson <notifications@github.com

wrote:

Just confirmed the reverse_lm script produced an apparently identical
G.fst with a relatively small ARPA model (the G has ~200K arcs).

$ fstequivalent --random --npath=2000 data/lang.reverse.orig/G.fst data/lang.reverse/G.fst; echo $?
0


You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub
#682 (comment)

@kkm000 kkm000 deleted the arpa-invoke-common branch April 13, 2016 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants