
word2vec gives inconsistent results #641

Closed
samehif opened this issue Mar 28, 2016 · 1 comment


samehif commented Mar 28, 2016

I installed gensim with `pip install gensim`. The version I have is the latest, 0.12.4.

I created a word2vec model from the same data with the same parameters, but every time I build the model it gives different results. I tried passing a fixed int as the seed when creating the model, but it still behaves the same way.

Here is some example code:

```python
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, seed=128)
>>> model[sentences[0][0]]
array([ 0.04913874,  0.04574081, -0.07402877, -0.03270053,  0.06598952,
        0.04157289,  0.05075986,  0.01770534, -0.03796235,  0.04594197], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[sentences[0][0]]
array([ 0.04907205,  0.04569579, -0.07379777, -0.03273782,  0.06579078,
        0.04167712,  0.05083019,  0.01780009, -0.0378389 ,  0.04578455], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[sentences[0][0]]
array([ 0.04906179,  0.04569826, -0.07382379, -0.03274316,  0.06583244,
        0.04166647,  0.0508585 ,  0.01777468, -0.03784611,  0.04578935], dtype=float32)
```

gojomo (Collaborator) commented Mar 28, 2016

As soon as you use more than one worker thread, scheduling jitter from the OS means examples are trained in slightly different order. And, sources of randomness in the algorithm – like frequent-word downsampling – will be applied to different examples, meaning slightly different words chosen each run. (And even further, in Python3, PYTHONHASHSEED-controlled randomization on each interpreter-launch will affect the iteration order of keys in the discovered vocabulary dictionary, which can again affect their sampling or ordering inside the model.)
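The PYTHONHASHSEED effect can be demonstrated without gensim at all: the hash of a string (and hence dict key ordering) varies between interpreter launches unless the seed is pinned. A minimal stdlib sketch, launching fresh interpreters via subprocess (the word `'investment'` is just an arbitrary example, not from the issue):

```python
import os
import subprocess
import sys

def hash_of(word, seed):
    """Launch a fresh interpreter with a given PYTHONHASHSEED and hash a string.

    PYTHONHASHSEED only takes effect at interpreter startup, which is why a
    subprocess is needed to observe it.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env, capture_output=True, text=True,
    )
    return result.stdout.strip()

# Different nonzero seeds give different string hashes ...
h1 = hash_of("investment", "1")
h2 = hash_of("investment", "2")
# ... while a pinned seed (here 0, which disables randomization) is repeatable.
h0a = hash_of("investment", "0")
h0b = hash_of("investment", "0")

print("seeds differ:", h1 != h2)
print("pinned seed repeats:", h0a == h0b)
```

This is why, on Python 3, even a single-threaded run with a fixed `seed` argument can vary between interpreter launches unless `PYTHONHASHSEED` is also fixed in the environment.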

So: you can't expect identical results without taking extra steps, including limiting yourself to just a single worker thread. A pending PR (#642) will make this clearer in the doc-comment.
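Why the single-worker restriction matters can be illustrated with plain Python, independent of gensim: floating-point addition is not associative, so the same set of updates applied in a different order (which is what OS thread scheduling causes) yields a slightly different total. A hedged stdlib sketch of that effect, not gensim code:

```python
import random

# Generate a fixed set of pseudo-"updates".
rng = random.Random(128)
updates = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]

# Apply the identical updates in two different orders, as two differently
# scheduled worker threads effectively would.
forward = 0.0
for u in updates:
    forward += u

backward = 0.0
for u in reversed(updates):
    backward += u

# The totals agree only approximately, typically not bit-for-bit.
print("difference:", abs(forward - backward))
```

The divergence per operation is tiny, but over millions of training updates it compounds, which is why the vectors in the report above differ in the third or fourth decimal place rather than being identical.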

See also this thread on the discussion forum – https://groups.google.com/d/msg/gensim/7eiwqfhAbhs/qC0pmbw5HwAJ – the same considerations apply to Word2Vec. (If you have other questions that are not likely to be bugs, it's better to discuss at that forum than in this issue-tracker.)

@gojomo gojomo closed this as completed Mar 28, 2016