Possible error calling train_sg_pair from train_sentence_sg in word2vec.py (context word vs target word) #300
Comments
Hello Parang, yes, I remember there was some discussion around this on the word2vec mailing list. IIRC Tomas Mikolov concluded it makes no difference; he swapped the order for cache efficiency. In any case, gensim's word2vec port tries to follow the C version faithfully, so it does whatever the C version does (or rather, did, at the time I ported the Python version -- I haven't checked whether the C version has changed since).
I looked up that discussion: https://groups.google.com/forum/#!searchin/word2vec-toolkit/word$20word2/word2vec-toolkit/-AUPLOHGymI/yD4gl0mSNNEJ Parang, if you plan to experiment with the swapped order, let me know; I'll leave this issue open. If you don't, feel free to close.
Thanks Radim for your reply. I will try training with both implementations and then compare the similarity of the generated word vectors. A quick question: will the initial random assignment of the word vectors have any effect on the final word vectors? For example: because there were two different random assignments initially, the two final sets of vectors differ, and we end up comparing apples with oranges.
No -- I think there's been a pull request to seed the vectors deterministically. But what I found during the early experiments is that this doesn't affect the results much: the spread (standard deviation) of accuracy scores across multiple runs is small. If you find your "new" version is consistently 2% better, without sacrificing much performance, we'll keep it.
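For what it's worth, the apples-vs-oranges concern can be neutralised independently of any gensim change: if both implementations draw their initial vectors from the same seeded RNG, they start from identical weights, and any difference in the final vectors is attributable to the training change alone. A minimal NumPy sketch (`init_vectors` is an illustrative helper, not gensim API; the uniform `[-0.5, 0.5)/size` scheme mirrors the word2vec-style initialisation):

```python
import numpy as np

def init_vectors(vocab_size, vector_size, seed=1):
    # Illustrative helper (not gensim API): seed the RNG so every run
    # starts from identical random vectors.
    rng = np.random.RandomState(seed)
    # word2vec-style uniform initialisation in [-0.5, 0.5) / vector_size
    return (rng.rand(vocab_size, vector_size) - 0.5) / vector_size

# Two "runs" with the same seed start from the same weights, so any
# difference in the trained vectors comes from the training procedure only.
a = init_vectors(1000, 100, seed=1)
b = init_vectors(1000, 100, seed=1)
assert np.allclose(a, b)
```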
@parangsaraf Do you have results of testing your version? |
Hi Radim et al,
To begin with, thank you so much for investing your time and painstaking effort in implementing the Python version of word2vec/doc2vec. Without a doubt you have greatly empowered fellow NLP researchers with this contribution.
During my process of understanding the code, I came across a possible bug. Please allow me to elaborate:
In the function `train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True)`, the first statement, `l1 = model.syn0[word2.index]`, corresponds to the hidden-layer output, which in turn is generated from the input word. This makes `word2` the input word of the model and `word` the "true" (predicted) word. If I stick with this assumption, the rest of that function's implementation falls perfectly into place according to the algorithm.

Now, as I see it, the problem is with line 127, `train_sg_pair(model, word, word2, alpha, labels)`, in the function `train_sentence_sg`. I believe this statement should be `train_sg_pair(model, word2, word, alpha, labels)` (just swap `word` and `word2`). In the `for` loop, `word` corresponds to the current word and `word2` to the context word, and in the skip-gram model we are trying to predict the context word given the current word. However, the current word, as per your implementation of `train_sg_pair`, is `word2`, not `word`.

For your reference, `train_sg_pair` is called correctly from `doc2vec.py`.

I hope I was able to explain the issue. If not, please let me know and I can try to explain it using an example.
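To make the role of each argument concrete, here is a heavily simplified, hypothetical sketch of the pair-training step (`TinyModel` and `train_sg_pair_sketch` are my own names; the real gensim code also handles negative sampling, hierarchical softmax, labels, and locking, all omitted here). The point it illustrates is that `l1` is read from `syn0` at `word2`'s index, so `word2` acts as the input word and the first word argument as the prediction target:

```python
import numpy as np

class TinyModel:
    """Hypothetical minimal model holding word2vec's two weight matrices:
    syn0 = input (word) embeddings, syn1neg = output embeddings."""
    def __init__(self, vocab_size, size, seed=1):
        rng = np.random.RandomState(seed)
        self.syn0 = (rng.rand(vocab_size, size) - 0.5) / size
        self.syn1neg = np.zeros((vocab_size, size))

def train_sg_pair_sketch(model, word_idx, word2_idx, alpha):
    # l1 is the hidden-layer activation: the *input* vector of word2 --
    # so word2 plays the role of the input word, while word_idx indexes
    # the word being predicted.
    l1 = model.syn0[word2_idx]
    l2 = model.syn1neg[word_idx]
    # Single positive example; negative samples are omitted in this sketch.
    f = 1.0 / (1.0 + np.exp(-np.dot(l1, l2)))  # sigmoid of the score
    g = (1.0 - f) * alpha                      # gradient * learning rate
    neu1e = g * l2.copy()        # error to propagate back to the input vector
    model.syn1neg[word_idx] += g * l1          # update the output vector
    model.syn0[word2_idx] += neu1e             # update the input vector
```

Under this reading, `train_sentence_sg` passing `(word, word2)` makes the context word the input and the current word the prediction target; swapping the arguments would give the textbook skip-gram direction (though, per the mailing-list discussion cited above, the two orderings are claimed to make no practical difference).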
Best Regards,
Parang