Building intermediate models with a predefined vocabulary leads to "poison" error #177
Comments
I just had the same issue; it appears when I add --limit_vocab_file.
Confirmed I can reproduce (abusing text files as corpora). Looking into it.
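For anyone else who wants to reproduce it, here is a minimal sketch that matches the report: the toy file contents are made up, and the flags are the ones from the example command in the issue body below.

# Toy corpus plus a vocabulary that deliberately omits some corpus tokens.
printf 'a b c d\nb c d e\n' > corpus.txt
printf 'a b c\n' > vocab.txt
# Same invocation shape as the report; the error appears only when
# --limit_vocab_file is combined with --intermediate output.
kenlm lmplz -o 2 --text corpus.txt --intermediate my_new_lm -T /tmp --limit_vocab_file vocab.txt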
I'm getting this error without using --limit_vocab_file.
I'm trying to train KenLM for the first time. I'm running on an AWS g3.4xlarge with Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-1066-aws x86_64).
I think @nmatthews-asapp has a different problem. Regarding this bug, apparently I hadn't thought about limit_vocab when permuting the vocabulary ids to put them in a consistent order. Fun. Planning to put the vocab words to keep in the [0, n) range and the words to discard in [n, |vocab|).
@kpu Any estimate of when the fix will be ready? Thanks!
@kpu We really need this fix, please!
@kpu I ran into a similar problem when using a 105 GB corpus with entirely default settings: "Last input should have been poison. The program should end soon with an error. If it doesn't, there's a bug." Is that just because the corpus is so big that the disk isn't large enough?
@R1ckShi This is not relevant to the issue at hand. You need more disk space. |
How much disk space is needed to train on a big corpus like 100 GB?
How did you deal with it in the end? I'm hitting the same problem.
@kpu I hit the same error and wonder how much disk space is needed to train on a big corpus like 100 GB?
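On the disk-space questions: lmplz spills its sort files to the directory given with -T (the same flag used in the example command below), and those temporaries can be a multiple of the corpus size. A hedged sketch of the usual mitigation; -T and the -S sorting-memory cap are real lmplz options, but the path and size here are illustrative.

df -h /path_to_500GB     # confirm free space on the temp volume first
kenlm lmplz -o 5 --text corpus.txt --arpa out.arpa -T /path_to_500GB -S 40G
# -T: put temp/sort files on the large disk; -S: cap sorting memory (defaults to a percentage of RAM)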
@ritwikmishra Please don't spam unrelated issues with duplicate posts. |
Has this problem officially been solved at some point? |
What does --limit_vocab_file do, and what format does the file take?
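As far as I understand it, --limit_vocab_file takes a plain text file of allowed tokens separated by whitespace; n-grams containing tokens outside that list are subject to pruning. A hypothetical example:

# Hypothetical allowed-token list; whitespace-separated, so one token per line works.
printf 'the\nquick\nbrown\nfox\n' > vocab.txt
kenlm lmplz -o 3 --text corpus.txt --arpa out.arpa --limit_vocab_file vocab.txt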
I’d like to build intermediate models with a predefined vocabulary and then interpolate them. However, I get the "poison" error from the title at the last step of building the intermediate models.
The models actually build fine if I do not use a vocabulary file and do not use the --prune option. A vocabulary file that contains all tokens in the corpus (and therefore has no effect) is also fine. ARPA output with any pruning/vocabulary combination works as expected.
I tried a lower LM order (down to 2), a smaller corpus (~10k words), and a different dataset, but none of these made a difference. The machine has ~500 GB of disk space (path set using -T) and ~50 GB of RAM.
Example command:
kenlm lmplz -o 2 --text corpus.txt --intermediate my_new_lm -T /path_to_500GB --limit_vocab_file vocab.txt
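For contrast, both of the following variants complete without the error (a sketch of the working configurations described above, with the same file names), which is why the failure looks specific to combining --limit_vocab_file with --intermediate:

# Intermediate output without a vocabulary file or pruning: works.
kenlm lmplz -o 2 --text corpus.txt --intermediate my_new_lm -T /path_to_500GB
# ARPA output with the vocabulary file: also works.
kenlm lmplz -o 2 --text corpus.txt --arpa my_new_lm.arpa -T /path_to_500GB --limit_vocab_file vocab.txt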
Any thoughts on how to address this?