
Number of sentences in corpus_file doesn't match trained sentences. #2693

Open
tshrjn opened this issue Dec 2, 2019 · 1 comment
tshrjn commented Dec 2, 2019

Problem description

I'm training a FastText model (CBOW) over a corpus, for instance enwik8.
The number of sentences trained on (example_count, as it's referred to in the logging methods) doesn't equal the number of sentences in the file (wc -l or len(f.readlines()), referred to as expected_count or total_examples).
Why is this happening? Also, in the method here, this warning has been suppressed for corpus_file mode.
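For context, the expected_count / total_examples side of the comparison is just the newline-delimited sentence count of the corpus file. A minimal sketch of computing it (the tiny synthetic corpus here is a hypothetical stand-in for a real file like enwik8):

```python
import os
import tempfile

def count_sentences(path):
    """Count newline-terminated sentences, matching `wc -l`."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(1 << 20), b""))

# Tiny synthetic stand-in for a one-sentence-per-line corpus file:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first sentence\nsecond sentence\nthird sentence\n")
    path = f.name

print(count_sentences(path))  # 3 -- this is what total_examples reflects
os.remove(path)
```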

Versions

Linux-4.4.0-1096-aws-x86_64-with-debian-stretch-sid
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0]
NumPy 1.17.2
SciPy 1.3.1
gensim 3.8.1
FAST_VERSION 1
gojomo (Collaborator) commented Dec 2, 2019

How much of a discrepancy between these numbers did you see in your case, and exactly which two outputs were you comparing?

You can find some discussion of why the individual threads' stopping conditions are approximate, and thus why exact word/text counts aren't necessarily expected to line up, in the #2127 PR that added the corpus_file feature:

#2127 (comment)

(You will have to click GitHub's "Load more..." link to reveal the hidden items for the leading/following context.) Since the expected behavior in corpus_file mode, with its approximate thread-range ends, is for the counts to vary (a little) from what a single thread's full run-through would see, the warning for a mismatch is suppressed.

The approximate nature of this approach seemed a bit fishy to me at the time, in that it might risk some (tiny?) ranges/contexts of the file being trained on multiple times, while other ranges get missed. But in largish corpora, perhaps such little discrepancies along the "seams" between shards don't matter much.
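To illustrate the seam effect, here is a pure-Python sketch (not gensim's actual implementation, which handles ranges in C and differently) of one naive sharding scheme: split the file's byte range evenly among workers, have each worker after the first skip forward past the next newline, and count only sentences that begin before the worker's end offset. A sentence that begins exactly at a seam is then counted by neither worker:

```python
def single_pass_count(data: bytes) -> int:
    """Sentence count a single reader would report (same as `wc -l`)."""
    return data.count(b"\n")

def naive_shard_count(data: bytes, n_workers: int) -> int:
    """Each worker gets an equal byte range [start, end). A worker with
    start > 0 skips ahead past the next newline before counting, and
    counts every line that begins before its end offset. A line that
    begins exactly at a seam is skipped by both adjacent workers."""
    size = len(data)
    total = 0
    for i in range(n_workers):
        start = i * size // n_workers
        end = (i + 1) * size // n_workers
        if start > 0:
            nl = data.find(b"\n", start)
            if nl == -1:
                continue
            start = nl + 1  # skip the (assumed partial) first line
        p = start
        while p < end:
            q = data.find(b"\n", p)
            if q == -1:
                break
            total += 1
            p = q + 1
    return total

corpus = b"aa\nbb\ncc\ndd\n"        # "cc" begins exactly at byte 6, the seam
print(single_pass_count(corpus))     # 4
print(naive_shard_count(corpus, 2))  # 3 -- "cc" falls through the seam
```

Real implementations avoid this by checking whether the byte just before a worker's start is already a newline, but any scheme with approximate range ends can disagree slightly with the single-pass count.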

enwik8 might be an especially challenging case, as, if I recall correctly, it has no newlines and all its millions of words of text are essentially on "one line".

@persiyanov may be able to comment further.
