
Building intermediate models with a predefined vocabulary leads to "poison" error #177

Open
geniki opened this issue Aug 29, 2018 · 14 comments


geniki commented Aug 29, 2018

I’d like to build intermediate models with a predefined vocabulary and then interpolate them. However, I’m getting the following error at the last step of building the intermediate models.


...
=== 4/4 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:8904 2:50640
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
----------------------------------------------------------------------------------------------------
Last input should have been poison.

The models actually build fine if I do not use a vocabulary file and do not use the --prune option. A vocabulary file that contains all tokens in the corpus (so it has no effect) also works. ARPA output with any pruning/vocabulary combination works as expected.

I tried a lower LM order (down to 2), a smaller corpus (~10k words), and a different dataset, but none of these makes a difference. The machine has ~500 GB of disk space (path set using -T) and ~50 GB of RAM.

Example command:
kenlm lmplz -o 2 --text corpus.txt --intermediate my_new_lm -T /path_to_500GB --limit_vocab_file vocab.txt
Any thoughts on how to address this?
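
For context, the "poison" in the message refers to the end-of-stream sentinel that KenLM's pipeline threads pass along their chains; the error means a stage saw its input end without receiving that sentinel, which usually indicates that an upstream worker stopped early. Below is a minimal, self-contained sketch of the general poison-pill pattern, assuming nothing about KenLM's actual classes (all names are illustrative):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

int main() {
  std::queue<std::optional<int>> channel;   // std::nullopt plays the role of the poison marker
  std::mutex m;
  std::condition_variable cv;

  std::thread producer([&] {
    for (int i = 0; i < 3; ++i) {
      { std::lock_guard<std::mutex> lock(m); channel.push(i); }
      cv.notify_one();
    }
    // A well-behaved producer always finishes by sending poison.  If it aborts
    // before this point, the consumer sees its input end without the sentinel,
    // which is what the "should have been poison" complaint is about.
    { std::lock_guard<std::mutex> lock(m); channel.push(std::nullopt); }
    cv.notify_one();
  });

  for (;;) {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return !channel.empty(); });
    std::optional<int> item = channel.front();
    channel.pop();
    lock.unlock();
    if (!item) break;                       // poison received: clean shutdown
    std::cout << "consumed " << *item << '\n';
  }
  producer.join();
}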

@e-matusov

I just had the same issue: if I add --limit_vocab_file, I get the "Last input should have been poison" message.

kpu (Owner) commented Sep 3, 2018

Confirmed I can reproduce with:

build/bin/lmplz -o 2 --text README.md --intermediate delme --limit_vocab_file LICENSE

(abusing text files as corpora). Looking into it.


nmatthews-asapp commented Sep 5, 2018

I'm getting this error without using --limit_vocab_file

=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:29104080 2:684897168 3:3846225240 4:9279470400 5:14419969256
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
----------------------------------------------------------------------------------------------------Last input should have been poison.
[1]    17259 abort (core dumped)  ~/kenlm/build/bin/lmplz -o 5 -S 80% -T ~/gramm/tmp < 1b.txt > 1b.arpa

I'm trying to train KenLM for the first time. I'm running on an AWS g3.4xlarge, with Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-1066-aws x86_64)

kpu (Owner) commented Sep 5, 2018

I think @nmatthews-asapp has a different problem.

Regarding this bug, apparently I hadn't thought about limit_vocab when permuting the vocabulary ids to put them in a consistent order. Fun. Planning to put the vocab words to keep in the [0,n) range and words to discard in [n,|vocab|).
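
For illustration only, here is a rough sketch of that renumbering idea (the function and variable names are hypothetical, not KenLM's actual code): kept words receive ids in [0, n) and discarded words follow in [n, |vocab|), each group keeping its original relative order.

#include <cstddef>
#include <cstdint>
#include <vector>

// keep[old_id] is true if that word appears in the --limit_vocab_file list.
// Returns new_id such that kept words map into [0, n) and discarded words
// into [n, |vocab|), each group preserving its original relative order.
std::vector<uint64_t> BuildPermutation(const std::vector<bool> &keep) {
  uint64_t n = 0;
  for (bool k : keep) n += k ? 1 : 0;                 // n = number of kept words
  std::vector<uint64_t> new_id(keep.size());
  uint64_t next_kept = 0, next_discarded = n;
  for (std::size_t old_id = 0; old_id < keep.size(); ++old_id)
    new_id[old_id] = keep[old_id] ? next_kept++ : next_discarded++;
  return new_id;                                      // new_id[old] is the permuted id
}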

@e-matusov

@kpu any estimate when the fix will be ready? Thanks!

@e-matusov

@kpu we really need this fix, please!


R1ckShi commented May 6, 2019

@kpu I ran into a similar problem when using a 105 GB corpus with otherwise default settings.
Command I used: lmplz --prune 0 5 30 -o 3 < corpus.txt > arpa.arpa
Error:
=== 1/5 Counting and sorting n-grams ===
Reading /home/user/corpus/corpus-lm/lm_total.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Last input should have been poison. The program should end soon with an error. If it doesn't, there's a bug.
/home/user/kenlm/util/file.cc:228 in void util::WriteOrThrow(int, const void*, std::size_t) threw FDException because `ret < 1'.
No space left on device in /tmp/GJTweJ (deleted) while writing 5044497740 bytes
______ (__________)

Is that just because the corpus is so big that the disk isn't large enough?
The size of the file to be written barely changes when I use different pruning thresholds.

kpu (Owner) commented May 6, 2019

@R1ckShi This is not relevant to the issue at hand. You need more disk space.


nonva commented Jul 26, 2019

How much disk space is needed to train on a big corpus of around 100 GB?


houxy12 commented Jul 26, 2019

How did you deal with it in the end? I'm running into the same problem.


wizardk commented Sep 4, 2020

@kpu I'm getting the same error and wonder how much disk space is needed to train on a big corpus of around 100 GB.

kpu (Owner) commented Nov 8, 2020

@ritwikmishra Please don't spam unrelated issues with duplicate posts.


fquirin commented Aug 5, 2021

Has this problem officially been solved at some point?
I'm seeing strange behavior: on my Raspberry Pi with a 64-bit Arm OS, --limit_vocab_file is not a problem, but on my 32-bit Arm OS the flag produces an empty ARPA model 😕. I'm using pre-built binaries and I'm not sure whether there is a version mismatch, but I believe they are built from the same source 🤔

fquirin referenced this issue in fquirin/kaldi-adapt-lm Aug 6, 2021

wwfcnu commented Aug 29, 2023

What does --limit_vocab_file do, and what format does the file need to be in?
