doc2vec/word2vec/fasttext models do not appear to improve if similarities checked mid-training epochs #2260

timbicker · 2018-11-06T10:43:21Z

Description

I am training a doc2vec model on a large corpus and need to monitor it during training so that I can report more detailed statistics to my supervisor.
The script below reproduces the problem with a slightly modified version of the Doc2Vec tutorial on the Lee dataset: the recommendations returned by most_similar() do not improve as training progresses.

Steps/Code/Corpus to Reproduce

import gensim
import os
import smart_open
import gensim.models.callbacks


# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')


def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])


train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

results_new = {i: None for i, doc in enumerate(train_corpus)}
results_old = results_new.copy()


class TrainProgressEvaluation(gensim.models.callbacks.CallbackAny2Vec):

    def __init__(self, test_set, results_new, results_old):
        self.test_set = test_set
        self.results_new = results_new
        self.results_old = results_old
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print(f"epoch {self.epoch} end")

    def on_batch_begin(self, model):
        for num, sample in enumerate(self.test_set):
            recs = model.docvecs.most_similar(num)
            # On the first call self.results_new[num] is None, so fall back to recs
            self.results_old[num] = self.results_new[num] or recs
            self.results_new[num] = recs
            for i in range(len(recs)):
                # Each entry is a (tag, similarity) tuple; report any difference
                if self.results_old[num][i] != self.results_new[num][i]:
                    print(f"Sample {num} has changed.")
                    print(f"Old tag {self.results_old[num][i][0]}. New tag {self.results_new[num][i][0]}")
                    print(f"Old distance {self.results_old[num][i][1]}. New distance {self.results_new[num][i][1]}")


model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, workers=4)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs,
            callbacks=(TrainProgressEvaluation(train_corpus, results_new, results_old),))

Expected Results

I expect to see many changes in either the recommended tags or their distances as training progresses.

Actual Results

Console output with four workers:
Surprisingly, only the first sample in train_corpus receives any updates, and I don't understand why.

Sample 0 has changed.
/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
Old tag 116. New tag 30
  if np.issubdtype(vec.dtype, np.int):
Old distance 0.5557072162628174. New distance 0.4648822546005249
Sample 0 has changed.
Old tag 42. New tag 224
Old distance 0.48946425318717957. New distance 0.3621359169483185
Sample 0 has changed.
Old tag 51. New tag 96
Old distance 0.4082771837711334. New distance 0.31921446323394775
Sample 0 has changed.
Old tag 90. New tag 77
Old distance 0.3731566369533539. New distance 0.3184990882873535
Sample 0 has changed.
Old tag 128. New tag 45
Old distance 0.34601616859436035. New distance 0.30474674701690674
Sample 0 has changed.
Old tag 30. New tag 116
Old distance 0.4648822546005249. New distance 0.5557072162628174
Sample 0 has changed.
Old tag 224. New tag 42
Old distance 0.3621359169483185. New distance 0.48946425318717957
Sample 0 has changed.
Old tag 96. New tag 51
Old distance 0.31921446323394775. New distance 0.4082771837711334
Sample 0 has changed.
Old tag 77. New tag 90
Sample 0 has changed.
Old tag 46. New tag 234
Old distance 0.3005654215812683. New distance 0.3441653251647949
Old distance 0.3184990882873535. New distance 0.3731566369533539
Sample 0 has changed.
Old tag 45. New tag 128
Old distance 0.30474674701690674. New distance 0.34601616859436035
Sample 0 has changed.
Old tag 46. New tag 234
Old distance 0.3005654215812683. New distance 0.3441653251647949
Sample 0 has changed.
Sample 0 has changed.
Old tag 111. New tag 76
Old tag 111. New tag 76
Old distance 0.280322402715683. New distance 0.32334667444229126
Old distance 0.280322402715683. New distance 0.32334667444229126
Sample 0 has changed.
Old tag 221. New tag 49

Old distance 0.2779023051261902. New distance 0.27320006489753723
Sample 0 has changed.
Old tag 52. New tag 4
Old distance 0.27472415566444397. New distance 0.27205419540405273
Sample 0 has changed.
Old tag 221. New tag 49
Old distance 0.2779023051261902. New distance 0.27320006489753723
Sample 0 has changed.
Sample 0 has changed.
Old tag 205. New tag 149
Old distance 0.26930660009384155. New distance 0.2699446976184845
Old tag 52. New tag 4
Old distance 0.27472415566444397. New distance 0.27205419540405273
Sample 0 has changed.
Old tag 205. New tag 149
Old distance 0.26930660009384155. New distance 0.2699446976184845
Sample 0 has changed.
Old tag 116. New tag 30
Old distance 0.5557072162628174. New distance 0.4648822546005249
Sample 0 has changed.
Old tag 42. New tag 224
Old distance 0.48946425318717957. New distance 0.3621359169483185
Sample 0 has changed.
Old tag 51. New tag 96
Old distance 0.4082771837711334. New distance 0.31921446323394775
Sample 0 has changed.
Old tag 90. New tag 77
Old distance 0.3731566369533539. New distance 0.3184990882873535
Sample 0 has changed.
Old tag 128. New tag 45
Old distance 0.34601616859436035. New distance 0.30474674701690674
Sample 0 has changed.
Old tag 234. New tag 46
Old distance 0.3441653251647949. New distance 0.3005654215812683
Sample 0 has changed.
Old tag 76. New tag 111
Old distance 0.32334667444229126. New distance 0.280322402715683
Sample 0 has changed.
Old tag 49. New tag 221
Old distance 0.27320006489753723. New distance 0.2779023051261902
Sample 0 has changed.
Old tag 4. New tag 52
Old distance 0.27205419540405273. New distance 0.27472415566444397
Sample 0 has changed.
Old tag 149. New tag 205
Old distance 0.2699446976184845. New distance 0.26930660009384155
epoch 1 end
epoch 1 end
epoch 2 end
epoch 3 end
epoch 4 end
....

When I run the model in the debugger, no changes are reported at all:

/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
epoch 1 end
epoch 2 end
....

I also tried it with only 1 worker:

/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
epoch 1 end
epoch 2 end
epoch 3 end
....

What is happening here, and how can I see how my doc2vec model improves during training? This matters because it is also not possible to see the training error for doc2vec (#999).
Further experimenting reveals that docvecs.vectors_docs is, of course, updated between calls of batch_end, yet most_similar() always returns the same suggestions.

Versions

Darwin-17.5.0-x86_64-i386-64bit
Python 3.7.0 (default, Jun 29 2018, 20:13:13)
[Clang 9.1.0 (clang-902.0.39.2)]
NumPy 1.15.0
SciPy 1.1.0
gensim 3.5.0
FAST_VERSION 0


timbicker · 2018-11-06T13:19:06Z

It turns out

model.docvecs.vectors_docs_norm = None
model.docvecs.init_sims()

has to be called before each call to model.docvecs.most_similar().
Then the program works as expected.
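The workaround above can be wrapped in a small helper so that every mid-training query rebuilds the normalized vectors first. This is only a sketch against the gensim 3.x attribute names quoted in this thread (`docvecs`, `vectors_docs_norm`, `init_sims`, `most_similar`); the helper name `most_similar_fresh` is made up for illustration and is not part of gensim.

```python
def most_similar_fresh(model, doc_tag, topn=10):
    """Query most_similar() after forcing the normed vectors to be rebuilt.

    Hypothetical helper, assuming the gensim 3.x Doc2Vec layout discussed
    in this issue, where model.docvecs caches vectors_docs_norm.
    """
    docvecs = model.docvecs
    # Drop the cached unit-length vectors so init_sims() recomputes them
    # from the current (possibly just-updated) raw vectors.
    docvecs.vectors_docs_norm = None
    docvecs.init_sims()
    return docvecs.most_similar(doc_tag, topn=topn)
```

In the reproduction script, the call `model.docvecs.most_similar(num)` inside the callback would become `most_similar_fresh(model, num)`. Rebuilding the norms touches every document vector, so on a large corpus it is probably better to do this once per epoch than once per batch.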

piskvorky · 2018-11-07T16:51:23Z

If that's the case, then that's definitely a bug!

Are you saying you have to call init_sims() before each call of most_similar? If that's so, please reopen this ticket.

timbicker · 2018-11-09T18:59:56Z

Well, yes and no.
I looked at it more thoroughly: most_similar() uses vectors_docs_norm, which is computed by init_sims(). most_similar() does call init_sims() itself, but vectors_docs_norm is only recalculated if it is None. So in order to use most_similar() on newly trained vectors, one has to manually set vectors_docs_norm to None. So yes, to me this looks like a bug or at least unexpected behavior. I would like to fix it, if that is fine with you.
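The caching behavior described here can be reproduced without gensim. The following is a toy model of the pattern, not gensim's actual code: the class `ToyDocVecs` and its attributes are invented for illustration, and only the "recompute the normed vectors when the cache is None" rule is taken from the description above.

```python
import math


class ToyDocVecs:
    """Toy stand-in for the caching pattern described above (not gensim code)."""

    def __init__(self, vectors):
        self.vectors = [list(v) for v in vectors]  # raw vectors, changed by training
        self.vectors_norm = None                   # cached unit-length copies

    def init_sims(self):
        # Recompute only when no cache exists, mirroring the reported behavior.
        if self.vectors_norm is None:
            self.vectors_norm = []
            for v in self.vectors:
                length = math.sqrt(sum(x * x for x in v))
                self.vectors_norm.append([x / length for x in v])

    def most_similar(self, index):
        # Return the index of the nearest other vector by cosine similarity.
        self.init_sims()
        query = self.vectors_norm[index]
        scores = [
            (i, sum(a * b for a, b in zip(query, v)))
            for i, v in enumerate(self.vectors_norm)
            if i != index
        ]
        return max(scores, key=lambda pair: pair[1])[0]


docs = ToyDocVecs([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
before = docs.most_similar(0)   # doc 1 is closest to doc 0

# Simulate a training update that moves doc 0 onto doc 2.
docs.vectors[0] = [0.0, 1.0]
stale = docs.most_similar(0)    # still answered from the old cache

docs.vectors_norm = None        # manual invalidation, as in the workaround
fresh = docs.most_similar(0)    # now reflects the updated raw vectors
```

Here `before` and `stale` both name document 1 even though the raw vector changed in between, while `fresh` names document 2 once the cache has been cleared.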

gojomo · 2018-11-12T01:39:40Z

train() could null the normed vectors, if any, so that they're recalculated to reflect the updated non-normed vectors. (The prior working assumption had been that most_similar() would only be run on a model that had finished training.)
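A minimal sketch of that idea, assuming a simplified model that keeps raw vectors plus a lazily built normalized cache. `ToyModel`, its `train(updates)` signature, and the `_normed` attribute are hypothetical and do not correspond to gensim's classes or to the code in any pull request.

```python
import math


class ToyModel:
    """Illustrative only: raw vectors plus a lazily built normalized cache."""

    def __init__(self, vectors):
        self.vectors = [list(v) for v in vectors]
        self._normed = None

    def train(self, updates):
        # 'updates' maps a vector index to its new raw value; a real train()
        # would run gradient steps here instead.
        for index, vector in updates.items():
            self.vectors[index] = list(vector)
        # Invalidate the cache so the next similarity query rebuilds it.
        self._normed = None

    def normed(self):
        # Build the unit-length copies on demand and reuse them until train().
        if self._normed is None:
            self._normed = [
                [x / math.sqrt(sum(y * y for y in v)) for x in v]
                for v in self.vectors
            ]
        return self._normed

    def similarity(self, i, j):
        a, b = self.normed()[i], self.normed()[j]
        return sum(x * y for x, y in zip(a, b))


model = ToyModel([[1.0, 0.0], [0.0, 2.0]])
sim_before = model.similarity(0, 1)   # orthogonal vectors
model.train({0: [0.0, 3.0]})          # cache is dropped inside train()
sim_after = model.similarity(0, 1)    # parallel vectors after the update
```

Because invalidation happens inside `train()`, callers never need to touch the cache themselves; the trade-off is that the first similarity query after each training call pays for a full renormalization.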

timbicker · 2018-11-21T13:01:32Z

> train() could null the normed vectors, if any, so that they're recalculated to reflect the updated non-normed vectors. (The prior working assumption had been that most_similar() would only be run on a model that had finished training.)

This is an excellent idea imo. I implemented it in this way.

dnabanita7 · 2019-01-19T23:04:35Z

Is it closed? I want to work on this @menshikh-iv

menshikh-iv · 2019-01-20T05:24:05Z

@naba7 see status on top

timbicker · 2019-01-20T11:59:22Z

Sorry for my recent absence. I pushed new changes to the branch of the PR, but it is still closed. I hope it is reopened in the next few days so we can finish working on it.
@naba7 feel free to participate in the PR, if there is anything left to do
Thanks for your support.

menshikh-iv · 2019-01-22T03:21:33Z

@timbicker done, see #2273

gojomo · 2019-11-08T14:13:39Z

Also an issue for FastText: #2260

gojomo · 2021-09-17T21:24:12Z

I believe this issue is moot given changes that eliminated so much normed-vector caching in Gensim-4.0.

timbicker closed this as completed Nov 6, 2018

timbicker reopened this Nov 9, 2018

timbicker mentioned this issue Nov 21, 2018

set normed vectors to None when model is trained #2273

Closed

menshikh-iv added the "bug" and "difficulty easy" labels Dec 13, 2018

gojomo changed the title ~~doc2vec example model does not improve~~ doc2vec/word2vec/fasttext models do not appear to improve if similarities checked mid-training epochs Nov 8, 2019

mpenkov added this to To do in Misha adhoc via automation Nov 10, 2019

piskvorky closed this as completed Sep 18, 2021

Misha adhoc automation moved this from To do to Done Sep 18, 2021

gojomo mentioned this issue Jan 17, 2023

Gensim Word2Vec produces different most_similar results through final epoch than end of training #3429

Open
