
Doc2vec corpus_file mode skips some documents during training #2757

Open

pavellevap opened this issue Feb 20, 2020 · 3 comments
Labels: bug (Issue described a bug), impact MEDIUM (Big annoyance for affected users)


@pavellevap

Problem description

During training of Doc2Vec with corpus_file, some documents are skipped. I think it is because of the way the corpus file is partitioned among workers: some lines are processed by two or more workers, while others are not processed at all. This behavior could be acceptable for Word2Vec and FastText, since the same word occurs several times across different lines. But that is not the case for Doc2Vec, where each document corresponds to exactly one line; if that line is skipped, the corresponding document vector is never trained.
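To make the suspected mechanism concrete, here is a minimal sketch (my own reconstruction, not gensim's actual code) of how per-worker start positions that are snapped to line starts can collide when one early line holds most of the words:

# Sketch of the suspected failure mode (not gensim's actual implementation):
# each worker's start is the line containing word position total * i / workers,
# snapped back to that line's start. A single long line can absorb several of
# those positions, so multiple workers start at the same line.

def start_lines(line_word_counts, workers):
    total = sum(line_word_counts)
    starts, acc, line = [], 0, 0
    for i in range(workers):
        target = total * i // workers  # word position where worker i should start
        while line < len(line_word_counts) - 1 and acc + line_word_counts[line] <= target:
            acc += line_word_counts[line]
            line += 1
        starts.append(line)
    return starts

# documents.txt below: one 11-word line followed by three 1-word lines
print(start_lines([11, 1, 1, 1], workers=2))  # -> [0, 0]

Since each worker then processes only about total_words / workers words from its starting line, two workers that both start at line 0 never reach the trailing one-word documents.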

Steps to reproduce

documents.txt

a very long document with huge number of words in it
several
short
documents

script.py

import copy

from gensim.models import Doc2Vec

# Inspect how the corpus file will be split between the 2 workers.
offsets, start_lines = Doc2Vec._get_offsets_and_start_doctags_for_corpusfile('documents.txt', 2)
print("Offsets for workers: ", offsets)

model = Doc2Vec(sample=0, workers=2, min_count=1, vector_size=5, seed=1)
model.build_vocab(corpus_file='documents.txt')
old_vectors = copy.copy(model.docvecs.vectors_docs)
model.train(corpus_file='documents.txt', total_examples=model.corpus_count,
            total_words=model.corpus_total_words, epochs=10)
new_vectors = copy.copy(model.docvecs.vectors_docs)

# A vector that is identical before and after training was never touched.
for i in range(len(old_vectors)):
    if all(old_vectors[i] == new_vectors[i]):
        print("vector {} did not change".format(i))
    else:
        print("vector {} changed".format(i))

output

Offsets for workers:  [0, 0]
vector 0 changed
vector 1 did not change
vector 2 did not change
vector 3 did not change
@gojomo
Collaborator

gojomo commented Feb 25, 2020

Thanks for the compact test case!

Are you sure the problem isn't limited to 1-word texts? As in the analogous Word2Vec mode (CBOW, sg=0), in Doc2Vec's default PV-DM mode (dm=1), training texts with just a single word have no surrounding 'context', thus generate no (context->target_word) training pairs, thus are no-ops. For example, what if you supply those same texts via the traditional iterable-corpus parameter?
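One quick way to check that hypothesis in the analogous Word2Vec case (an illustrative sketch using the gensim 3.x API, not part of the original report):

# Illustrative sketch (gensim 3.x API): in CBOW (sg=0) a one-word sentence
# has no context words, so it generates no (context -> target) pairs and
# that word's input vector is never updated.
import copy

from gensim.models import Word2Vec

sentences = [['lonely'], ['a', 'b', 'c', 'a', 'b', 'c']]
model = Word2Vec(sentences, sg=0, min_count=1, size=5, seed=1)  # trains on construction
before = copy.copy(model.wv['lonely'])
model.train(sentences, total_examples=len(sentences), epochs=10)
print(all(before == model.wv['lonely']))  # True: 'lonely' is never trained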

@pavellevap
Author

The document vector is always part of the context, so the context is not empty even for one-word documents. And if I use an iterable corpus instead of a corpus file, all vectors get trained.
script.py

import copy

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument('a very long document with huge number of words in it'.split(), ['0']),
             TaggedDocument('several'.split(), ['1']),
             TaggedDocument('short'.split(), ['2']),
             TaggedDocument('documents'.split(), ['3'])]

model = Doc2Vec(sample=0, workers=2, min_count=1, vector_size=5, seed=1)
model.build_vocab(documents=documents)
old_vectors = copy.copy(model.docvecs.vectors_docs)
model.train(documents=documents, epochs=10, total_examples=model.corpus_count)
new_vectors = copy.copy(model.docvecs.vectors_docs)

# With the iterable corpus, every document vector moves.
for i in range(len(old_vectors)):
    if all(old_vectors[i] == new_vectors[i]):
        print("vector {} did not change".format(i))
    else:
        print("vector {} changed".format(i))

output

vector 0 changed
vector 1 changed
vector 2 changed
vector 3 changed

I've also changed the corpus file a bit to get rid of one-word documents.
documents.txt

a very long document with huge number of words in it
several short documents

Output of the first script:

Offsets for workers:  [0, 0]
vector 0 changed
vector 1 did not change

So I'm sure the problem arises from inexact file splitting. Both workers start from the beginning of the file and each processes half of all the words:
https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/doc2vec_corpusfile.pyx#L307
which means the second half of the file is skipped entirely.
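One possible direction for a fix (a sketch under my own assumptions, not a proposed patch) is to split by raw byte offsets and snap each worker's start forward to the next line boundary, so the worker ranges tile the file exactly:

# Sketch of boundary-aligned splitting (an illustration, not gensim code):
# divide the file by byte size, then snap each worker's start forward to
# the beginning of the next full line. Worker i reads lines from
# offsets[i] up to offsets[i + 1] (the last worker reads to EOF), so every
# line is processed exactly once, even if some workers get an empty range.
import os

def aligned_offsets(path, workers):
    size = os.path.getsize(path)
    offsets = []
    with open(path, 'rb') as f:
        for i in range(workers):
            pos = size * i // workers
            if pos > 0:
                f.seek(pos - 1)
                f.readline()    # consume up to and including the next newline
                pos = f.tell()  # now at the start of a full line
            offsets.append(pos)
    return offsets

The trade-off is that per-worker word counts become uneven (a very long line goes entirely to one worker), but for Doc2Vec correctness matters more than balance.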

@gojomo
Collaborator

gojomo commented Feb 26, 2020

Thanks for updating your examples to show it's not just a problem with 1-word texts!

It seems like #2693 is a report of similar discrepancies in corpus_file mode for FastText (though it's not clear from that report whether anything remains untrained in that scenario).

This is the same as, or very similar to, the sort of missed-range error with corpus_file that I was concerned about in my 2018 comment here. The followup discussion claimed that a precise set of starting-points was being found & used in Doc2Vec (though not the other algorithms) to prevent any problems – but this report/recipe suggests the necessary precise starting points aren't being chosen. Any thoughts, @persiyanov & @menshikh-iv?

So in addition to fixing the Doc2Vec error here, the other modes should get a deep consistency check – for example, by giving them large training data where every word appears only once, so that it's obvious after any training epoch whether any words were completely skipped, and where.
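A rough sketch of such a check (illustrative, using the gensim 3.x API; the file name and corpus sizes are arbitrary):

# Rough consistency check (illustrative, not gensim test code): build a
# corpus where every word appears exactly once, train via corpus_file, and
# report every word vector that never moved from its random initialization
# -- each one marks a span of the file that was skipped.
import copy

from gensim.models import Word2Vec

with open('unique_words.txt', 'w') as f:  # 10,000 lines x 10 unique words
    for i in range(0, 100000, 10):
        f.write(' '.join('w%d' % j for j in range(i, i + 10)) + '\n')

model = Word2Vec(sample=0, min_count=1, size=5, workers=4, seed=1)
model.build_vocab(corpus_file='unique_words.txt')
before = copy.copy(model.wv.vectors)
model.train(corpus_file='unique_words.txt', total_examples=model.corpus_count,
            total_words=model.corpus_total_words, epochs=1)
untrained = [model.wv.index2word[i] for i in range(len(before))
             if all(before[i] == model.wv.vectors[i])]
print('%d of %d word vectors untrained' % (len(untrained), len(before)))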
