
word2vec gives inconsistent results #641

Closed
samehif opened this issue Mar 28, 2016 · 1 comment


samehif commented Mar 28, 2016

I installed gensim with `pip install gensim`. The version I have is the latest, 0.12.4.

I created a word2vec model from the same data with the same parameters, but every time I build the model it gives different results. I tried passing a fixed int as the seed when creating the model, but it still behaves the same way.

Here is some example code:

```python
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, seed=128)
>>> model[sentences[0][0]]
array([ 0.04913874,  0.04574081, -0.07402877, -0.03270053,  0.06598952,
        0.04157289,  0.05075986,  0.01770534, -0.03796235,  0.04594197], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[sentences[0][0]]
array([ 0.04907205,  0.04569579, -0.07379777, -0.03273782,  0.06579078,
        0.04167712,  0.05083019,  0.01780009, -0.0378389 ,  0.04578455], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[sentences[0][0]]
array([ 0.04906179,  0.04569826, -0.07382379, -0.03274316,  0.06583244,
        0.04166647,  0.0508585 ,  0.01777468, -0.03784611,  0.04578935], dtype=float32)
```

gojomo (Collaborator) commented Mar 28, 2016

As soon as you use more than one worker thread, scheduling jitter from the OS means examples are trained in slightly different order. And, sources of randomness in the algorithm – like frequent-word downsampling – will be applied to different examples, meaning slightly different words chosen each run. (And even further, in Python3, PYTHONHASHSEED-controlled randomization on each interpreter-launch will affect the iteration order of keys in the discovered vocabulary dictionary, which can again affect their sampling or ordering inside the model.)
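The PYTHONHASHSEED effect can be demonstrated without gensim at all: the hash of a string (and hence dict key ordering) varies between interpreter launches unless the seed is pinned. A minimal stdlib sketch, launching fresh interpreters via subprocess (the word `'investment'` is just an arbitrary example, not from the issue):

```python
import os
import subprocess
import sys

def hash_of(word, seed):
    """Launch a fresh interpreter with a given PYTHONHASHSEED and hash a string.

    PYTHONHASHSEED only takes effect at interpreter startup, which is why a
    subprocess is needed to observe it.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env, capture_output=True, text=True,
    )
    return result.stdout.strip()

# Different nonzero seeds give different string hashes ...
h1 = hash_of("investment", "1")
h2 = hash_of("investment", "2")
# ... while a pinned seed (here 0, which disables randomization) is repeatable.
h0a = hash_of("investment", "0")
h0b = hash_of("investment", "0")

print("seeds differ:", h1 != h2)
print("pinned seed repeats:", h0a == h0b)
```

This is why, on Python 3, even a single-threaded run with a fixed `seed` argument can vary between interpreter launches unless `PYTHONHASHSEED` is also fixed in the environment.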

So: you can't expect identical results without taking extra steps, including limiting yourself to just a single worker thread. A pending PR (#642) will make this clearer in the doc-comment.
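Why the single-worker restriction matters can be illustrated with plain Python, independent of gensim: floating-point addition is not associative, so the same set of updates applied in a different order (which is what OS thread scheduling causes) yields a slightly different total. A hedged stdlib sketch of that effect, not gensim code:

```python
import random

# Generate a fixed set of pseudo-"updates".
rng = random.Random(128)
updates = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]

# Apply the identical updates in two different orders, as two differently
# scheduled worker threads effectively would.
forward = 0.0
for u in updates:
    forward += u

backward = 0.0
for u in reversed(updates):
    backward += u

# The totals agree only approximately, typically not bit-for-bit.
print("difference:", abs(forward - backward))
```

The divergence per operation is tiny, but over millions of training updates it compounds, which is why the vectors in the report above differ in the third or fourth decimal place rather than being identical.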

See also this thread on the discussion forum – https://groups.google.com/d/msg/gensim/7eiwqfhAbhs/qC0pmbw5HwAJ – the same considerations apply to Word2Vec. (If you have other questions that are not likely to be bugs, it's better to discuss at that forum than in this issue-tracker.)

@gojomo gojomo closed this as completed Mar 28, 2016