
Number of sentences in corpus_file doesn't match trained sentences. #2693

Open
tshrjn opened this issue Dec 2, 2019 · 1 comment
tshrjn commented Dec 2, 2019

Problem description

I'm training a FastText model (CBOW) over a corpus, for instance enwik8.
The number of sentences trained on (example_count, as it's referred to in the logging methods) doesn't equal the number of sentences in the file (wc -l or len(f.readlines()), referred to as expected_count or total_examples).
Why is this happening? Also, in the method here, this warning has been suppressed for corpus_file mode.
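For context, the expected_count / total_examples side of the comparison is just the newline-delimited sentence count of the corpus file. A minimal sketch of computing it (the tiny synthetic corpus here is a hypothetical stand-in for a real file like enwik8):

```python
import os
import tempfile

def count_sentences(path):
    """Count newline-terminated sentences, matching `wc -l`."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(1 << 20), b""))

# Tiny synthetic stand-in for a one-sentence-per-line corpus file:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first sentence\nsecond sentence\nthird sentence\n")
    path = f.name

print(count_sentences(path))  # 3 -- this is what total_examples reflects
os.remove(path)
```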

Versions

Linux-4.4.0-1096-aws-x86_64-with-debian-stretch-sid
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0]
NumPy 1.17.2
SciPy 1.3.1
gensim 3.8.1
FAST_VERSION 1
gojomo (Collaborator) commented Dec 2, 2019

How much of a discrepancy between these numbers did you see in your case, and exactly which two outputs were you comparing?

You can find some discussion of why the individual threads' stopping conditions are approximate, and thus why exact word/text counts aren't necessarily expected to line up, in the #2127 PR that added the corpus_file feature:

#2127 (comment)

(You will have to click GitHub's "Load more..." link to reveal the hidden items for the leading/following context.) Since the expected behavior in corpus_file mode, with its approximate thread-range ends, is for the counts to vary (a little) from what a single thread's full run-through would see, the warning for a mismatch is suppressed.

The approximate nature of this approach seemed a bit fishy to me at the time, in that it might risk some (tiny?) ranges/contexts of the file being trained on multiple times, while other ranges get missed. But in largish corpora, perhaps such little discrepancies along the "seams" between shards don't matter much.
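To illustrate the seam effect, here is a pure-Python sketch (not gensim's actual implementation, which handles ranges in C and differently) of one naive sharding scheme: split the file's byte range evenly among workers, have each worker after the first skip forward past the next newline, and count only sentences that begin before the worker's end offset. A sentence that begins exactly at a seam is then counted by neither worker:

```python
def single_pass_count(data: bytes) -> int:
    """Sentence count a single reader would report (same as `wc -l`)."""
    return data.count(b"\n")

def naive_shard_count(data: bytes, n_workers: int) -> int:
    """Each worker gets an equal byte range [start, end). A worker with
    start > 0 skips ahead past the next newline before counting, and
    counts every line that begins before its end offset. A line that
    begins exactly at a seam is skipped by both adjacent workers."""
    size = len(data)
    total = 0
    for i in range(n_workers):
        start = i * size // n_workers
        end = (i + 1) * size // n_workers
        if start > 0:
            nl = data.find(b"\n", start)
            if nl == -1:
                continue
            start = nl + 1  # skip the (assumed partial) first line
        p = start
        while p < end:
            q = data.find(b"\n", p)
            if q == -1:
                break
            total += 1
            p = q + 1
    return total

corpus = b"aa\nbb\ncc\ndd\n"        # "cc" begins exactly at byte 6, the seam
print(single_pass_count(corpus))     # 4
print(naive_shard_count(corpus, 2))  # 3 -- "cc" falls through the seam
```

Real implementations avoid this by checking whether the byte just before a worker's start is already a newline, but any scheme with approximate range ends can disagree slightly with the single-pass count.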

enwik8 might be an especially challenging case, as, if I recall correctly, it has no newlines and all its millions of words of text are essentially on "one line".

@persiyanov may be able to comment further.
