Doc2vec corpus_file mode skips some documents during training #2757
Comments
Thanks for the compact test case! Are you sure the problem isn't limited to 1-word texts? As in the analogous […]
The document vector is always part of the context, so the context is not empty even for one-word documents. If I used an iterable corpus instead of a corpus file, all vectors would be trained.
output
I've also changed the corpus file slightly to get rid of one-word documents.
Output of the first script:
So I'm sure the problem arises from inexact file splitting. Both workers start from the beginning of the file and each processes half of all the words.
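To make that failure mode concrete, here is a toy model in plain Python. It is only a sketch of the behavior described in this comment, not gensim's actual partitioning code: the helper `lines_seen_by_worker`, the word budget, and the sample documents are all hypothetical.

```python
def lines_seen_by_worker(docs, start_line, word_budget):
    """Return the line indices a worker reads, starting at start_line,
    until it has consumed at least word_budget words (toy model)."""
    seen = []
    words = 0
    for i in range(start_line, len(docs)):
        if words >= word_budget:
            break
        seen.append(i)
        words += len(docs[i].split())
    return seen


# Hypothetical corpus: one document per line, three words each.
docs = ["a b c", "d e f", "g h i", "j k l"]
total_words = sum(len(d.split()) for d in docs)
budget = total_words // 2  # each of the two workers gets half of the words

# Assumption taken from the comment above: both workers start at line 0.
worker_0 = lines_seen_by_worker(docs, 0, budget)
worker_1 = lines_seen_by_worker(docs, 0, budget)

trained = set(worker_0) | set(worker_1)
untrained = [i for i in range(len(docs)) if i not in trained]
print("worker 0:", worker_0)
print("worker 1:", worker_1)
print("never trained:", untrained)
```

In this model the first half of the file is read twice and the second half never, so the document vectors for the last lines would keep their initial values.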
Thanks for updating your examples to show it's not just a problem with 1-word texts! It seems like #2693 is a report of similar discrepancies in […]. This is the same as, or very similar to, the sort of missed-range error with […]. So in addition to fixing the […]
Problem description
During Doc2Vec training in corpus_file mode, some documents are skipped. I think this is because of how the corpus file is partitioned among workers: some lines are processed by two or more workers, while others are not processed at all. This behavior may be acceptable for Word2Vec and FastText, since the same word usually occurs on several different lines. It is not acceptable for Doc2Vec, where each document corresponds to exactly one line: if that line is skipped, the corresponding document vector is never trained.
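The property Doc2Vec needs is that every line is assigned to exactly one worker. A minimal sketch of a line-aligned split with a coverage check follows. Both helpers (`split_lines_evenly`, `coverage_report`) are hypothetical illustrations and are not part of gensim.

```python
from collections import Counter


def split_lines_evenly(num_lines, num_workers):
    """Partition line indices into contiguous [start, end) ranges,
    one per worker, so that each line belongs to exactly one range."""
    base, extra = divmod(num_lines, num_workers)
    ranges = []
    start = 0
    for w in range(num_workers):
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def coverage_report(num_lines, ranges):
    """Return (skipped, duplicated) line indices for a set of ranges."""
    counts = Counter()
    for start, end in ranges:
        for i in range(start, end):
            counts[i] += 1
    skipped = [i for i in range(num_lines) if counts[i] == 0]
    duplicated = [i for i in range(num_lines) if counts[i] > 1]
    return skipped, duplicated


# An overlapping split like the one suspected in this report.
bad_ranges = [(0, 3), (0, 3)]
print(coverage_report(6, bad_ranges))

# A line-aligned split: nothing skipped, nothing duplicated.
good_ranges = split_lines_evenly(6, 2)
print(good_ranges, coverage_report(6, good_ranges))
```

A check like `coverage_report` could also serve as a regression test for any fix: for a given number of lines and workers, both returned lists should be empty.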
Steps to reproduce
documents.txt
script.py
output