Building intermediate models with a predefined vocabulary leads to "poison" error #177
Comments
I just had the same issue; it appears when I add --limit_vocab_file.
Confirmed I can reproduce (abusing text files as corpora). Looking into it.
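For anyone else who wants to reproduce it, here is a minimal sketch that matches the report: the toy file contents are made up, and the flags are the ones from the example command in the issue body below.

# Toy corpus plus a vocabulary that deliberately omits some corpus tokens.
printf 'a b c d\nb c d e\n' > corpus.txt
printf 'a b c\n' > vocab.txt
# Same invocation shape as the report; the error appears only when
# --limit_vocab_file is combined with --intermediate output.
kenlm lmplz -o 2 --text corpus.txt --intermediate my_new_lm -T /tmp --limit_vocab_file vocab.txt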
I'm getting this error without using --limit_vocab_file.
I'm trying to train KenLM for the first time. I'm running on an AWS g3.4xlarge with Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-1066-aws x86_64).
I think @nmatthews-asapp has a different problem. Regarding this bug, apparently I hadn't thought about limit_vocab when permuting the vocabulary ids to put them in a consistent order. Fun. Planning to put the vocab words to keep in the [0, n) range and the words to discard in [n, |vocab|).
@kpu Any estimate of when the fix will be ready? Thanks!
@kpu We really need this fix, please!
@kpu I ran into a similar problem when using a 105 GB corpus with entirely default settings: "Last input should have been poison. The program should end soon with an error. If it doesn't, there's a bug." Is that just because the corpus is so big that the disk isn't large enough?
@R1ckShi This is not relevant to the issue at hand. You need more disk space. |
How much disk space is needed to train on a big corpus like 100 GB?
How did you deal with it in the end? I'm hitting the same problem.
@kpu I hit the same error and wonder how much disk space is needed to train on a big corpus like 100 GB?
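On the disk-space questions: lmplz spills its sort files to the directory given with -T (the same flag used in the example command below), and those temporaries can be a multiple of the corpus size. A hedged sketch of the usual mitigation; -T and the -S sorting-memory cap are real lmplz options, but the path and size here are illustrative.

df -h /path_to_500GB     # confirm free space on the temp volume first
kenlm lmplz -o 5 --text corpus.txt --arpa out.arpa -T /path_to_500GB -S 40G
# -T: put temp/sort files on the large disk; -S: cap sorting memory (defaults to a percentage of RAM)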
@ritwikmishra Please don't spam unrelated issues with duplicate posts. |
Has this problem officially been solved at some point? |
What does --limit_vocab_file do, and what format does the file take?
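As far as I understand it, --limit_vocab_file takes a plain text file of allowed tokens separated by whitespace; n-grams containing tokens outside that list are subject to pruning. A hypothetical example:

# Hypothetical allowed-token list; whitespace-separated, so one token per line works.
printf 'the\nquick\nbrown\nfox\n' > vocab.txt
kenlm lmplz -o 3 --text corpus.txt --arpa out.arpa --limit_vocab_file vocab.txt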
I’d like to build intermediate models with a predefined vocabulary and then interpolate them. However, I get the "poison" error from the title at the last step of building the intermediate models.
The models actually build fine if I do not use a vocabulary file and do not use the --prune option. A vocabulary file that contains all tokens in the corpus (and therefore has no effect) is also fine. ARPA output with any pruning/vocabulary combination works as expected.
I tried a lower LM order (down to 2), a smaller corpus (~10k words), and a different dataset, but none of these made a difference. The machine has ~500 GB of disk space (path set using -T) and ~50 GB of RAM.
Example command:
kenlm lmplz -o 2 --text corpus.txt --intermediate my_new_lm -T /path_to_500GB --limit_vocab_file vocab.txt
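For contrast, both of the following variants complete without the error (a sketch of the working configurations described above, with the same file names), which is why the failure looks specific to combining --limit_vocab_file with --intermediate:

# Intermediate output without a vocabulary file or pruning: works.
kenlm lmplz -o 2 --text corpus.txt --intermediate my_new_lm -T /path_to_500GB
# ARPA output with the vocabulary file: also works.
kenlm lmplz -o 2 --text corpus.txt --arpa my_new_lm.arpa -T /path_to_500GB --limit_vocab_file vocab.txt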
Any thoughts on how to address this?