Doc2vec corpus_file mode skips some documents during training #2757

pavellevap · 2020-02-20T13:57:10Z

Problem description

During Doc2Vec training with corpus_file, some documents are skipped. I think this is caused by the way the corpus file is partitioned: some lines are processed by two or more workers, while others are not processed at all. This behavior could be acceptable for Word2Vec and FastText, since the same word usually occurs on several different lines. But that is not the case with Doc2Vec, where each document corresponds to exactly one line; if that line is skipped, the corresponding document vector is never trained.

Steps to reproduce

documents.txt

a very long document with huge number of words in it
several
short
documents

script.py

from gensim.models import Doc2Vec
import copy
offsets, start_lines = Doc2Vec._get_offsets_and_start_doctags_for_corpusfile('documents.txt', 2)
print("Offsets for workers: ", offsets)
model = Doc2Vec(sample=0, workers=2, min_count=1, vector_size=5, seed=1)
model.build_vocab(corpus_file='documents.txt')
old_vectors = copy.copy(model.docvecs.vectors_docs)
model.train(corpus_file='documents.txt', total_examples=model.corpus_count, 
            total_words=model.corpus_total_words, epochs=10)
new_vectors = copy.copy(model.docvecs.vectors_docs)
for i in range(len(old_vectors)):
    if all(old_vectors[i] == new_vectors[i]):
        print("vector {} did not change".format(i))
    else:
        print("vector {} changed".format(i))

output

Offsets for workers:  [0, 0]
vector 0 changed
vector 1 did not change
vector 2 did not change
vector 3 did not change


gojomo · 2020-02-25T00:21:07Z

Thanks for the compact test case!

Are you sure the problem isn't limited to 1-word texts? As in the analogous Word2Vec mode (CBOW, sg=0), in Doc2Vec's default PV-DM mode (dm=1), training texts with just a single word have no surrounding 'context', thus generate no (context->target_word) training pairs, thus are no-ops. For example, what if you supply those same texts via the traditional iterable-corpus parameter?

pavellevap · 2020-02-25T09:19:44Z

The document vector is always part of the context, so the context is not empty even for one-word documents. If I use an iterable corpus instead of a corpus file, all vectors are trained.
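The two claims about one-word texts can be illustrated with a simplified sketch. This is not gensim's implementation: it ignores window shrinking, subsampling and negative sampling, and the helper names `cbow_pairs` and `pv_dm_pairs` are made up for illustration. It only shows that a one-word text yields no (context, target) pair when the context consists of neighbouring words alone, but does yield one when the document tag is added to the context.

```python
def cbow_pairs(words, window=2):
    """(context, target) pairs where the context is neighbouring words only."""
    pairs = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        if context:  # no neighbours means nothing to train on
            pairs.append((context, target))
    return pairs


def pv_dm_pairs(words, doctag, window=2):
    """Same windows, but the document tag is always part of the context."""
    pairs = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        pairs.append(([doctag] + context, target))
    return pairs


print(cbow_pairs(["several"]))        # []
print(pv_dm_pairs(["several"], "1"))  # [(['1'], 'several')]
```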
script.py

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import copy
documents = [TaggedDocument('a very long document with huge number of words in it'.split(), ['0']),
             TaggedDocument('several'.split(), ['1']),
             TaggedDocument('short'.split(), ['2']),
             TaggedDocument('documents'.split(), ['3'])]
model = Doc2Vec(sample=0, workers=2, min_count=1, vector_size=5, seed=1)
model.build_vocab(documents=documents)
old_vectors = copy.copy(model.docvecs.vectors_docs)
model.train(documents=documents, epochs=10, total_examples=model.corpus_count)
new_vectors = copy.copy(model.docvecs.vectors_docs)
for i in range(len(old_vectors)):
    if all(old_vectors[i] == new_vectors[i]):
        print("vector {} did not change".format(i))
    else:
        print("vector {} changed".format(i))

output

vector 0 changed
vector 1 changed
vector 2 changed
vector 3 changed

I've also changed the corpus file a bit to get rid of one-word documents.
documents.txt

a very long document with huge number of words in it
several short documents

Output of the first script:

Offsets for workers:  [0, 0]
vector 0 changed
vector 1 did not change

So I'm sure the problem arises because of inexact file splitting. Both workers start from the beginning of the file and process half of all words.
https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/doc2vec_corpusfile.pyx#L307
Which means the second half of the file is completely skipped.
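The behaviour described above can be reproduced with a small pure-Python simulation. It is a sketch under stated assumptions, not the actual Cython code: each worker is assumed to start at a given line index and to stop once it has consumed roughly total_words / workers words, and `simulate_workers` is a hypothetical helper name. The start lines [0, 0] mirror the reported offsets [0, 0].

```python
def simulate_workers(lines, start_lines, workers):
    """Return how many times each line is visited under the assumed scheme."""
    word_counts = [len(line.split()) for line in lines]
    budget = sum(word_counts) / workers  # assumed per-worker share of words
    visits = [0] * len(lines)
    for start in start_lines:
        processed = 0
        line_no = start
        while line_no < len(lines) and processed < budget:
            visits[line_no] += 1
            processed += word_counts[line_no]
            line_no += 1
    return visits


lines = ["a very long document with huge number of words in it",
         "several short documents"]
print(simulate_workers(lines, start_lines=[0, 0], workers=2))  # [2, 0]
print(simulate_workers(lines, start_lines=[0, 1], workers=2))  # [1, 1]
```

With both workers starting at line 0, the first line is visited twice and the second never, which matches the "vector 1 did not change" output. With distinct start lines, each line is visited exactly once.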

gojomo · 2020-02-26T22:37:40Z

Thanks for updating your examples to show it's not just a problem with 1-word texts!

It seems like #2693 is a report of similar discrepancies in corpus_file for FastText (though it's not clear from that report whether anything remains untrained in that scenario).

This is the same as, or very similar to, the sort of missed-range error with corpus_file that I was concerned about in my 2018 comment here. The followup discussion claimed that a precise set of starting points was being found & used in Doc2Vec (though not other algorithms) to prevent any problems, but this report/recipe suggests the necessary precise starting points aren't being chosen. Any thoughts, @persiyanov & @menshikh-iv?
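For reference, one way precise starting points could be computed is sketched below. This is an illustrative assumption, not what gensim does: the hypothetical helper `line_aligned_starts` picks, for each worker, the first line at which the running word count reaches that worker's share, so every start falls on a line boundary. If worker i then processes lines from its own start up to the next worker's start, each line is covered exactly once.

```python
def line_aligned_starts(data: bytes, workers: int):
    """Pick one (byte_offset, start_line) per worker, always at a line start."""
    lines = data.splitlines(keepends=True)
    offsets, word_counts, pos = [], [], 0
    for line in lines:
        offsets.append(pos)                    # byte offset where this line begins
        word_counts.append(len(line.split()))  # whitespace-separated tokens
        pos += len(line)
    total = sum(word_counts)

    starts = []
    words_before = 0  # words in all lines before line_no
    line_no = 0
    for w in range(workers):
        target = w * total / workers
        while line_no < len(lines) and words_before < target:
            words_before += word_counts[line_no]
            line_no += 1
        if line_no < len(lines):
            starts.append((offsets[line_no], line_no))
        else:
            starts.append((len(data), len(lines)))  # nothing left for this worker
    return starts


data = (b"a very long document with huge number of words in it\n"
        b"several short documents\n")
for offset, start_line in line_aligned_starts(data, 2):
    print(start_line, data[offset:].split(b"\n")[0])
```

For the two-line example file, the second worker starts at line 1 rather than at offset 0.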

So in addition to fixing the Doc2Vec error here, the other modes should get a deep consistency check, for example by giving them large training data where every word only appears once – so that it's obvious, after any training epoch, if any words are completely skipped, and where.
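A consistency check of this kind could be prepared along the following lines. The sketch only covers the parts that do not depend on gensim: writing a corpus in which every token occurs exactly once, and comparing two snapshots of the vectors. The helper names and the token naming scheme are hypothetical, and the snapshots are assumed to be plain token-to-vector mappings taken before and after a training epoch.

```python
import os
import tempfile


def write_unique_word_corpus(path, n_lines, words_per_line):
    """Write a corpus where every token occurs exactly once; return tokens per line."""
    layout = []
    with open(path, "w", encoding="utf-8") as f:
        for line_no in range(n_lines):
            tokens = ["w{}_{}".format(line_no, j) for j in range(words_per_line)]
            f.write(" ".join(tokens) + "\n")
            layout.append(tokens)
    return layout


def untouched_lines(layout, before, after):
    """Line numbers on which no token's vector differs between the snapshots."""
    skipped = []
    for line_no, tokens in enumerate(layout):
        if all(list(before[t]) == list(after[t]) for t in tokens):
            skipped.append(line_no)
    return skipped


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unique.txt")
    layout = write_unique_word_corpus(path, n_lines=3, words_per_line=2)

before = {t: [0.0, 0.0] for tokens in layout for t in tokens}
after = dict(before)
after["w0_0"] = [0.1, 0.0]  # pretend only line 0 was trained
print(untouched_lines(layout, before, after))  # [1, 2]
```

In a real check, the snapshots would come from the model's word vectors before and after training with corpus_file and min_count=1, and each line should contain at least two words so that CBOW-style modes have a context to train on.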

gojomo added the "bug" (Issue described a bug) and "impact MEDIUM" (Big annoyance for affected users) labels on Feb 26, 2020

gojomo mentioned this issue on Jul 7, 2020: KeyedVectors & *2Vec API streamlining, consistency #2698 (Merged)

gojomo mentioned this issue on Aug 21, 2020: misc ways to improve infer_vector #515 (Open)

gojomo mentioned this issue on Oct 20, 2021: Partial support of compressed corpora in FastText model #3246 (Closed)

gojomo mentioned this issue on Aug 11, 2022: word2vec doesn't scale linearly with multi-cpu configuration ? #3376 (Open)

gojomo mentioned this issue on Oct 31, 2023: Using corpus_file does not speed up while the CPU utilization seems full. #3089 (Open)

gojomo mentioned this issue on Nov 7, 2023: standardize 'corpus_iterable' (over 'sentences') everywhere #3152 (Open)

gojomo changed the title from "Doc2vec corpusfile mode skips some documents during training" to "Doc2vec corpus_file mode skips some documents during training" on Jan 8, 2024
