-LM language model: some comments #15

Closed
colbec opened this issue Dec 24, 2015 · 4 comments

colbec commented Dec 24, 2015

I recently used IRSTLM to successfully build a working LM for Julius, but it was a bit of a struggle. Here are some observations:

  1. What is a dictionary when dealing with a grammar and an LM? Coming from the grammar side I naively assumed that the references to the .dict were the same format for both, but that is not so. I think the docs are a bit thin on what format the dict should be in for an LM; the grammar side is already well covered with mkdfa.

  2. In particular, it is important to note that the text corpus used to build the model must contain the silence markers <s> and </s>. IRSTLM will go ahead and happily build an iARPA model and convert it to ARPA without them, but when mkbingram tries to make a binary from the result, the process fails with

    Stat: init_ngram: reading in ARPA forward n-gram from jdata2.ilm
    Stat: ngram_read_arpa: this is 3-gram file
    Stat: ngram_read_arpa: reading 1-gram part...
    Stat: ngram_read_arpa: read 10 1-gram entries
    Stat: ngram_read_arpa: reading 2-gram part...
    Stat: ngram_read_arpa: 2-gram read 0 (0%)
    Error: ngram_read_arpa: 2-gram test #1: "-1.07918 ONE -0.30103": "" not exist in 2-gram
    Stat: ngram_read_arpa: 2-gram read 0 (0%)
    Error: init_ngram: failed to read "jdata2.ilm"

I think at one point it was possible to omit the silence tags, but that no longer seems to be the case.
If the proper standard is for the silences to be in the data, then this becomes an issue for IRSTLM and I will post an issue there.
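
For anyone hitting the same error, this is roughly the shape of a setup that converts cleanly; the file names are placeholders, and the mkbingram call assumes the Julius-4 style -nlr option for a forward ARPA N-gram:

    # corpus.txt: every sentence wrapped in the silence markers
    <s> ONE </s>
    <s> TWO </s>
    <s> ONE TWO THREE </s>

    # build model.arpa from corpus.txt with your LM tool, then:
    mkbingram -nlr model.arpa model.bingram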

@LeeAkinobu
Member

Hi,

I've posted a doc about the dictionary format, see this:
#18

For the second point, it's true: in Julius the LM should contain explicit start and end silence symbols.

That's because the search algorithm restricts the start/end words to "<s>" and "</s>". The first pass starts with the single hypothesis "<s>", and the second, backward pass starts from the last found "</s>" hypothesis until a "<s>" is found.

You can change the strings of the head/tail silence words "<s>" and "</s>" with the -silhead and -siltail options.
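
For example, if your LM happens to use different boundary symbols, something like this should work (the jconf file name and the symbol strings here are only illustrative):

    julius -C main.jconf -silhead "<SENT_START>" -siltail "<SENT_END>"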


colbec commented Jan 8, 2016

@nitslp-ri Thanks for this, very helpful. In my exploration of various LM generators (IRSTLM, KenLM, MITLM, SRILM) I found some variability in how they treat corpora with and without the explicit silence markers. The best overall solution I found (SRILM excluded due to its restrictive license) was MITLM, which seems very forgiving about the presence or absence of the markers and also outputs the n-gram sections in an order that Julius will not complain about; the ordering of entries within the various sections of the LM seems to be important.
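
For reference, the MITLM invocation is along these lines (estimate-ngram flag names as I recall them; file names are placeholders):

    estimate-ngram -order 3 -text corpus.txt -write-lm model.arpa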

I wonder if, for completeness, it would be helpful to add a note to the new documentation about the arrangement of the lines in the dict file? The current explanation covers individual lines well, but the dict as a whole is a collection of lines, and Julius may expect them to be in some sort of order. From experimentation I found that an order with the silence markers at the top, followed by the words in alphabetical order, works, but other arrangements might be better.
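
To make that concrete, the arrangement that worked for me looks roughly like this; the phone names (silB, silE and the word pronunciations) depend entirely on the acoustic model in use, so treat them as placeholders:

    <s>    []       silB
    </s>   []       silE
    ONE    [ONE]    w ah n
    THREE  [THREE]  th r iy
    TWO    [TWO]    t uw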

@LeeAkinobu
Member

"Julius may expect them to be in some sort of order"

The word order should be sorted in the N-gram (.arpa), but not in the recognition dictionary file (.dict); any order should work in the .dict file. Did you find any order-dependent issue in the dictionary?


colbec commented Mar 4, 2016

@nitslp-ri The only issue for me was that, in trying a number of different model generators, each seemed to have a different idea of what the ideal output should be; some were recognized by Julius without complaint and others were not, requiring an extra parameter to the LM generator to fix. Since the format of the output generated by the LM tools is outside the control of the Julius team, I just thought it might be helpful if the docs had a note on what, if anything, the format should be.

Thanks for the additional note on the sorting of the .dict file; at least here it is in a retrievable format and should be helpful to users in the future.

colbec closed this as completed Mar 4, 2016