-LM language model: some comments #15

Closed
colbec opened this issue Dec 24, 2015 · 4 comments

colbec commented Dec 24, 2015

I recently used IRSTLM to successfully build a working LM for Julius, but it was a bit of a struggle. Here are some observations:

  1. What is a dictionary when dealing with a grammar and an LM? Coming from the grammar side I naively assumed that the references to the .dict were the same format for both, but that is not so. I think the docs are a bit thin on what format the dict should be in for an LM; the grammar side is already well covered with mkdfa.

  2. In particular, it is important to note that the text corpus used to build the model must contain the silence markers <s> and </s>. IRSTLM will go ahead and happily build an iARPA model and convert it to ARPA without them, but when mkbingram tries to make a binary from the result, the process fails with

    Stat: init_ngram: reading in ARPA forward n-gram from jdata2.ilm
    Stat: ngram_read_arpa: this is 3-gram file
    Stat: ngram_read_arpa: reading 1-gram part...
    Stat: ngram_read_arpa: read 10 1-gram entries
    Stat: ngram_read_arpa: reading 2-gram part...
    Stat: ngram_read_arpa: 2-gram read 0 (0%)
    Error: ngram_read_arpa: 2-gram test #1: "-1.07918 ONE -0.30103": "" not exist in 2-gram
    Stat: ngram_read_arpa: 2-gram read 0 (0%)
    Error: init_ngram: failed to read "jdata2.ilm"

I think at one point it was possible to omit the silence tags, but that no longer seems to be the case.
If the proper standard is for the silences to be in the data, then this becomes an issue for IRSTLM and I will post an issue there.
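
For anyone hitting the same error, this is roughly the shape of a setup that converts cleanly; the file names are placeholders, and the mkbingram call assumes the Julius-4 style -nlr option for a forward ARPA N-gram:

    # corpus.txt: every sentence wrapped in the silence markers
    <s> ONE </s>
    <s> TWO </s>
    <s> ONE TWO THREE </s>

    # build model.arpa from corpus.txt with your LM tool, then:
    mkbingram -nlr model.arpa model.bingram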

@LeeAkinobu
Member

Hi,

I've posted a doc about the dictionary format, see this:
#18

For the second point, it's true: in Julius the LM should contain explicit start and end silence symbols.

That's because the search algorithm restricts the start/end words to "<s>" and "</s>". The first pass starts with the single hypothesis "<s>", and the second, backward pass starts from the last found "</s>" hypothesis until a "<s>" is found.

You can change the strings of the head/tail silence words "<s>" and "</s>" with the -silhead and -siltail options.
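
For example, if your LM happens to use different boundary symbols, something like this should work (the jconf file name and the symbol strings here are only illustrative):

    julius -C main.jconf -silhead "<SENT_START>" -siltail "<SENT_END>"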


colbec commented Jan 8, 2016

@nitslp-ri Thanks for this, very helpful. In my exploration of various LM generators (IRSTLM, KenLM, MITLM, SRILM) I found some variability in how they treat corpora with and without the explicit silence markers. The best overall solution I found (SRILM excluded due to its restrictive license) was MITLM, which seems very forgiving about the presence or absence of the markers and also outputs the n-gram sections in an order that Julius will not complain about; the ordering of entries within the various sections of the LM seems to be important.
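
For reference, the MITLM invocation is along these lines (estimate-ngram flag names as I recall them; file names are placeholders):

    estimate-ngram -order 3 -text corpus.txt -write-lm model.arpa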

I wonder if, for completeness, it would be helpful to add a note to the new documentation about the arrangement of the lines in the dict file? The current explanation covers individual lines well, but the dict as a whole is a collection of lines, and Julius may expect them to be in some sort of order. From experimentation I found that an order with the silence markers at the top, followed by the words in alphabetical order, works, but other arrangements might be better.
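
To make that concrete, the arrangement that worked for me looks roughly like this; the phone names (silB, silE and the word pronunciations) depend entirely on the acoustic model in use, so treat them as placeholders:

    <s>    []       silB
    </s>   []       silE
    ONE    [ONE]    w ah n
    THREE  [THREE]  th r iy
    TWO    [TWO]    t uw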

@LeeAkinobu
Member

"Julius may expect them to be in some sort of order"

The word order should be sorted in the N-gram (.arpa), but not in the recognition dictionary file (.dict); any order should work in the .dict file. Did you find any order-dependent issue in the dictionary?


colbec commented Mar 4, 2016

@nitslp-ri The only issue for me was that, in trying a number of different model generators, each seemed to have a different idea of what the ideal output should be; some were recognized by Julius without complaint and others were not, requiring an extra parameter to the LM generator to fix. Since the format of the output generated by the LM tools is outside the control of the Julius team, I just thought it might be helpful if the docs had a note on what, if anything, the format should be.

Thanks for the additional note on the sorting of the .dict file; at least here it is in a retrievable format and should be helpful to users in the future.

colbec closed this as completed Mar 4, 2016