doc2vec/word2vec/fasttext models do not appear to improve if similarities checked mid-training epochs #2260

Closed
timbicker opened this issue Nov 6, 2018 · 11 comments
Labels: bug, difficulty easy

Comments

@timbicker

timbicker commented Nov 6, 2018

Description

I am training a doc2vec model on a large corpus and need to monitor it during training so that I can report more detailed statistics to my supervisor.
The example below, a slightly modified version of the Doc2Vec tutorial on the Lee dataset, reproduces the problem: the recommendations returned by most_similar do not improve as training progresses.

Steps/Code/Corpus to Reproduce

import gensim
import os
import smart_open
import gensim.models.callbacks


# Set file names for train and test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
lee_test_file = test_data_dir + os.sep + 'lee.cor'


def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])


train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

results_new = {i: None for i, doc in enumerate(train_corpus)}
results_old = results_new.copy()


class TrainProgressEvaluation(gensim.models.callbacks.CallbackAny2Vec):

    def __init__(self, test_set, results_new, results_old):
        self.test_set = test_set
        self.results_new = results_new
        self.results_old = results_old
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print(f"epoch {self.epoch} end")

    def on_batch_begin(self, model):
        # Compare the current neighbours of every document with the previous ones
        for num, sample in enumerate(self.test_set):
            recs = model.docvecs.most_similar(num)
            # On the first call results_new[num] is still None, so fall back to recs
            self.results_old[num] = self.results_new[num] or recs
            self.results_new[num] = recs
            for old, new in zip(self.results_old[num], self.results_new[num]):
                # Each entry is a (tag, similarity) tuple
                if old != new:
                    print(f"Sample {num} has changed.")
                    print(f"Old tag {old[0]}. New tag {new[0]}")
                    print(f"Old distance {old[1]}. New distance {new[1]}")


model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, workers=4)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs,
            callbacks=(TrainProgressEvaluation(train_corpus, results_new, results_old),))

Expected Results

I expect to see many changes in either the recommended documents or their distances as training progresses.

Actual Results

Console output with four workers:
It surprises me that only the first sample in train_corpus receives any updates, and I don't understand why.

Sample 0 has changed.
/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
Old tag 116. New tag 30
  if np.issubdtype(vec.dtype, np.int):
Old distance 0.5557072162628174. New distance 0.4648822546005249
Sample 0 has changed.
Old tag 42. New tag 224
Old distance 0.48946425318717957. New distance 0.3621359169483185
Sample 0 has changed.
Old tag 51. New tag 96
Old distance 0.4082771837711334. New distance 0.31921446323394775
Sample 0 has changed.
Old tag 90. New tag 77
Old distance 0.3731566369533539. New distance 0.3184990882873535
Sample 0 has changed.
Old tag 128. New tag 45
Old distance 0.34601616859436035. New distance 0.30474674701690674Sample 0 has changed.
Old tag 30. New tag 116
Old distance 0.4648822546005249. New distance 0.5557072162628174
Sample 0 has changed.
Old tag 224. New tag 42
Old distance 0.3621359169483185. New distance 0.48946425318717957
Sample 0 has changed.
Old tag 96. New tag 51
Old distance 0.31921446323394775. New distance 0.4082771837711334
Sample 0 has changed.
Old tag 77. New tag 90
Sample 0 has changed.
Old tag 46. New tag 234
Old distance 0.3005654215812683. New distance 0.3441653251647949
Old distance 0.3184990882873535. New distance 0.3731566369533539
Sample 0 has changed.
Old tag 45. New tag 128
Old distance 0.30474674701690674. New distance 0.34601616859436035
Sample 0 has changed.
Old tag 46. New tag 234
Old distance 0.3005654215812683. New distance 0.3441653251647949
Sample 0 has changed.
Sample 0 has changed.
Old tag 111. New tag 76
Old tag 111. New tag 76
Old distance 0.280322402715683. New distance 0.32334667444229126
Old distance 0.280322402715683. New distance 0.32334667444229126
Sample 0 has changed.
Old tag 221. New tag 49

Old distance 0.2779023051261902. New distance 0.27320006489753723
Sample 0 has changed.
Old tag 52. New tag 4
Old distance 0.27472415566444397. New distance 0.27205419540405273
Sample 0 has changed.
Old tag 221. New tag 49
Old distance 0.2779023051261902. New distance 0.27320006489753723
Sample 0 has changed.
Sample 0 has changed.
Old tag 205. New tag 149
Old distance 0.26930660009384155. New distance 0.2699446976184845
Old tag 52. New tag 4
Old distance 0.27472415566444397. New distance 0.27205419540405273
Sample 0 has changed.
Old tag 205. New tag 149
Old distance 0.26930660009384155. New distance 0.2699446976184845
Sample 0 has changed.
Old tag 116. New tag 30
Old distance 0.5557072162628174. New distance 0.4648822546005249
Sample 0 has changed.
Old tag 42. New tag 224
Old distance 0.48946425318717957. New distance 0.3621359169483185
Sample 0 has changed.
Old tag 51. New tag 96
Old distance 0.4082771837711334. New distance 0.31921446323394775
Sample 0 has changed.
Old tag 90. New tag 77
Old distance 0.3731566369533539. New distance 0.3184990882873535
Sample 0 has changed.
Old tag 128. New tag 45
Old distance 0.34601616859436035. New distance 0.30474674701690674
Sample 0 has changed.
Old tag 234. New tag 46
Old distance 0.3441653251647949. New distance 0.3005654215812683
Sample 0 has changed.
Old tag 76. New tag 111
Old distance 0.32334667444229126. New distance 0.280322402715683
Sample 0 has changed.
Old tag 49. New tag 221
Old distance 0.27320006489753723. New distance 0.2779023051261902
Sample 0 has changed.
Old tag 4. New tag 52
Old distance 0.27205419540405273. New distance 0.27472415566444397
Sample 0 has changed.
Old tag 149. New tag 205
Old distance 0.2699446976184845. New distance 0.26930660009384155
epoch 1 end
epoch 1 end
epoch 2 end
epoch 3 end
epoch 4 end
....

When I run the model in a debugger, no changes are reported at all:

/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
epoch 1 end
epoch 2 end
....

With only one worker:

/usr/local/lib/python3.7/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
epoch 1 end
epoch 2 end
epoch 3 end
....

What's happening here, and how can I see during training how my doc2vec model improves? It is also not possible to see the training error for doc2vec (#999).
Further experimenting reveals that docvecs.vectors_docs is, of course, updated between successive batch callbacks, but most_similar always returns the same suggestions.

Versions

Darwin-17.5.0-x86_64-i386-64bit
Python 3.7.0 (default, Jun 29 2018, 20:13:13)
[Clang 9.1.0 (clang-902.0.39.2)]
NumPy 1.15.0
SciPy 1.1.0
gensim 3.5.0
FAST_VERSION 0

@timbicker
Author

It turns out

model.docvecs.vectors_docs_norm = None
model.docvecs.init_sims()

has to be called before each call to model.docvecs.most_similar.
Then the program works as expected.
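
The workaround above could be wrapped in a small helper so that every mid-training query goes through it. This is only a sketch: `most_similar_fresh` is a hypothetical name, and the attribute names are the Gensim 3.x ones quoted in this thread.

```python
def most_similar_fresh(model, doc_id, topn=10):
    """Query neighbours after discarding the cached normalized vectors.

    Hypothetical helper, not part of Gensim. It assumes the Gensim 3.x
    names used in this thread: docvecs.vectors_docs_norm,
    docvecs.init_sims() and docvecs.most_similar().
    """
    model.docvecs.vectors_docs_norm = None  # drop the stale cache
    model.docvecs.init_sims()               # rebuild it from the current vectors
    return model.docvecs.most_similar(doc_id, topn=topn)
```

In the callback from the report, `recs = most_similar_fresh(model, num)` would take the place of the direct `model.docvecs.most_similar(num)` call.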

@piskvorky
Owner

piskvorky commented Nov 7, 2018

If that's the case, then that's definitely a bug!

Are you saying you have to call init_sims() before each call of most_similar? If that's so, please reopen this ticket.

@timbicker
Author

timbicker commented Nov 9, 2018

Well, yes and no.
I looked at it more thoroughly: most_similar() uses vectors_docs_norm, which is computed by init_sims(). most_similar() does call init_sims() itself, but vectors_docs_norm is only recalculated if it is None. So, to use most_similar() on newly trained vectors, one has to set vectors_docs_norm to None manually first. So yes, to me this looks like a bug, or at least unexpected behavior. I would like to fix it, if that is fine with you.
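
To make the caching behaviour described above concrete, here is a toy reproduction in plain Python. `ToyDocVectors` is a made-up stand-in and not Gensim code; it only assumes the pattern reported in this thread, namely a normalized copy that is built once and reused until it is reset to None.

```python
import math


class ToyDocVectors:
    """Toy stand-in, NOT Gensim code: keeps a lazily built normalized copy."""

    def __init__(self, vectors):
        self.vectors = [list(v) for v in vectors]
        self.vectors_norm = None  # built on first use, then reused

    def init_sims(self):
        # Recompute only when no cached copy exists (the reported behaviour)
        if self.vectors_norm is None:
            self.vectors_norm = []
            for v in self.vectors:
                length = math.sqrt(sum(x * x for x in v))
                self.vectors_norm.append([x / length for x in v])

    def most_similar(self, index):
        """Return the index of the nearest other vector by cosine similarity."""
        self.init_sims()
        query = self.vectors_norm[index]
        scores = [
            (sum(a * b for a, b in zip(query, v)), i)
            for i, v in enumerate(self.vectors_norm)
            if i != index
        ]
        return max(scores)[1]


docs = ToyDocVectors([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
before = docs.most_similar(0)   # first call builds the cache
docs.vectors[0] = [0.0, 1.0]    # simulated training update of the raw vector
stale = docs.most_similar(0)    # answered from the old cache
docs.vectors_norm = None        # manual invalidation
fresh = docs.most_similar(0)    # recomputed from the updated vectors
```

Here `before` and `stale` both point at document 1 even though vector 0 has moved, while `fresh` points at document 2 once the cache has been cleared.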

@timbicker timbicker reopened this Nov 9, 2018
@gojomo
Collaborator

gojomo commented Nov 12, 2018

train() could null the normed vectors, if any, so that they're recalculated to reflect the updated non-normed vectors. (The prior working assumption had been that most_similar() would only be run on a model that had finished training.)
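
A minimal sketch of that suggestion, using a made-up `ToyModel` class rather than Gensim's actual implementation: the only point it illustrates is that train() resets the cached normalized vectors, so the next lookup rebuilds them without any manual step.

```python
import math


class ToyModel:
    """Toy stand-in, NOT Gensim code: train() invalidates the normalized cache."""

    def __init__(self, vectors):
        self.vectors = [list(v) for v in vectors]
        self.vectors_norm = None  # lazily built unit-length copy

    def init_sims(self):
        # Build the normalized copy only if it is missing
        if self.vectors_norm is None:
            self.vectors_norm = []
            for v in self.vectors:
                length = math.sqrt(sum(x * x for x in v))
                self.vectors_norm.append([x / length for x in v])

    def train(self, updates):
        # Stand-in for a training step: overwrite some raw vectors ...
        for index, vector in updates.items():
            self.vectors[index] = list(vector)
        # ... then drop the cache so later queries see the new values
        self.vectors_norm = None


model = ToyModel([[3.0, 4.0], [1.0, 0.0]])
model.init_sims()
cached_before = model.vectors_norm[0]
model.train({0: [0.0, 2.0]})
cache_after_train = model.vectors_norm  # None: train() cleared it
model.init_sims()
cached_after = model.vectors_norm[0]
```

With this arrangement the caller never touches the cache directly, at the cost of one renormalization after each training call.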

@timbicker
Author

> train() could null the normed vectors, if any, so that they're recalculated to reflect the updated non-normed vectors. (The prior working assumption had been that most_similar() would only be run on a model that had finished training.)

This is an excellent idea, IMO. I implemented it this way.

@menshikh-iv menshikh-iv added the "bug" and "difficulty easy" labels Dec 13, 2018
@dnabanita7

Is it closed? I would like to work on this, @menshikh-iv.

@menshikh-iv
Contributor

@naba7 see status on top

@timbicker
Author

timbicker commented Jan 20, 2019

Sorry for my recent absence. I pushed new changes to the PR's branch, but the PR is still closed. I hope it will be reopened in the next few days so that we can finish working on it.
@naba7 feel free to participate in the PR, if there is anything left to do
Thanks for your support.

@menshikh-iv
Contributor

@timbicker done, see #2273

@gojomo
Collaborator

gojomo commented Nov 8, 2019

Also an issue for FastText: #2260

@gojomo gojomo changed the title from "doc2vec example model does not improve" to "doc2vec/word2vec/fasttext models do not appear to improve if similarities checked mid-training epochs" Nov 8, 2019
@mpenkov mpenkov added this to To do in Misha adhoc via automation Nov 10, 2019
@gojomo
Collaborator

gojomo commented Sep 17, 2021

I believe this issue is moot given changes that eliminated so much normed-vector caching in Gensim-4.0.

Projects: Misha adhoc (Done)
5 participants