Possible error calling train_sg_pair from train_sentence_sg in word2vec.py (context word vs target word) #300

Closed
parangsaraf opened this issue Mar 6, 2015 · 5 comments

Comments

@parangsaraf

Hi Radim et al,

To begin with, thank you so much for investing your time and painstaking effort in implementing the Python version of word2vec / doc2vec. Without a doubt, this contribution has greatly empowered fellow NLP researchers.

During my process of understanding the code, I came across a possible bug. Please allow me to elaborate:

In train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True), the first statement, l1 = model.syn0[word2.index], computes the hidden layer output. Since the hidden layer is generated from the input word, this makes word2 the input word of the model and word the "true" (predicted) word. Under this assumption, the rest of the function falls into place in accordance with the algorithm.

The problem, as I see it, is with line 127 of the function train_sentence_sg: train_sg_pair(model, word, word2, alpha, labels). I believe this statement should be train_sg_pair(model, word2, word, alpha, labels), with word and word2 swapped. In the for loop, word is the current word and word2 is the context word, and in the skip-gram model we are trying to predict the context word given the current word. However, train_sg_pair as implemented treats word2, not word, as the input (current) word.
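
To make the two roles concrete, here is a minimal NumPy sketch of one training step on a single pair. It is not gensim's code: train_pair, touched_input_rows and the argument names are hypothetical, and it uses a simple negative-sampling update with made-up indices. The only point is that the word used to index syn0 acts as the model input, so swapping the two arguments changes which word's vector receives the input-side update.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(syn0, syn1neg, input_idx, predicted_idx, negative_idxs, alpha=0.025):
    """One SGD step on a single (input word, predicted word) pair.

    Hypothetical helper; assumes predicted_idx and negative_idxs are all distinct.
    """
    l1 = syn0[input_idx].copy()                  # hidden layer: the INPUT word's vector
    targets = np.array([predicted_idx, *negative_idxs])
    labels = np.zeros(len(targets))
    labels[0] = 1.0                              # only the predicted word is a positive
    l2 = syn1neg[targets].copy()                 # output-side vectors, one row per target
    g = (labels - sigmoid(l2 @ l1)) * alpha      # error gradient times learning rate
    syn1neg[targets] += np.outer(g, l1)          # update output-side weights
    syn0[input_idx] += g @ l2                    # update only the input word's vector

def touched_input_rows(input_idx, predicted_idx):
    """Return which rows of syn0 change after one step on the given pair."""
    rng = np.random.default_rng(0)
    syn0 = rng.uniform(-0.5, 0.5, size=(5, 8))
    syn1neg = rng.uniform(-0.5, 0.5, size=(5, 8))
    before = syn0.copy()
    train_pair(syn0, syn1neg, input_idx, predicted_idx, negative_idxs=[0, 4])
    return np.flatnonzero(np.any(syn0 != before, axis=1)).tolist()

current, context = 1, 3  # made-up vocabulary indices
print(touched_input_rows(input_idx=current, predicted_idx=context))  # current word as input
print(touched_input_rows(input_idx=context, predicted_idx=current))  # context word as input
```

In the first call only the current word's row of syn0 is updated; in the second, only the context word's row. Over a whole sentence every word appears in both roles, so the two orderings visit the same set of pairs, just in a different order.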

For your reference, the function train_sg_pair is being called correctly from doc2vec.py.

I hope I was able to explain the issue. If not, please let me know and I can try to explain it using an example.

Best Regards,
Parang

@piskvorky
Owner

Hello Parang, yes, I remember there was some discussion around this on the word2vec mailing list. IIRC, Tomas Mikolov concluded that it makes no difference; he swapped the order for cache efficiency.

In any case, gensim's word2vec port tries to follow the C version faithfully, so it does whatever the C version does (or rather, did, at the time I ported the Python version -- I haven't checked whether the C version has changed since).

@piskvorky
Owner

I looked up that discussion: https://groups.google.com/forum/#!searchin/word2vec-toolkit/word$20word2/word2vec-toolkit/-AUPLOHGymI/yD4gl0mSNNEJ

Parang, if you plan to experiment with the swapped order, let me know. I'll leave this issue open. If you don't, feel free to close.

@parangsaraf
Author

Thanks for your reply, Radim. I will train with both implementations and then compare the similarity of the resulting word vectors. A quick question: will the initial random assignment of the word vectors affect the final word vectors? For example, if the two runs start from different random assignments, the final vectors will differ and we end up comparing apples with oranges.
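
One way around the apples-to-oranges concern is to compare each word's nearest neighbours instead of raw coordinates, since cosine-based neighbour lists do not depend on how the vector space happens to be oriented. The following is only a sketch: top_k_neighbors and neighbor_overlap are hypothetical helpers, the "embeddings" are random stand-ins rather than trained vectors, and both matrices are assumed to list the same words in the same row order.

```python
import numpy as np

def top_k_neighbors(vectors, k):
    """Indices of the k most cosine-similar rows for every row (self excluded)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)              # a word is not its own neighbour
    return np.argsort(-sims, axis=1)[:, :k]

def neighbor_overlap(vectors_a, vectors_b, k=5):
    """Mean Jaccard overlap of top-k neighbour sets; rows must refer to the same words."""
    nn_a = top_k_neighbors(vectors_a, k)
    nn_b = top_k_neighbors(vectors_b, k)
    scores = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))                        # stand-in for one model's vectors
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
rotated = emb @ rotation                               # same geometry, different coordinates

print(np.allclose(emb, rotated))       # raw coordinates no longer match
print(neighbor_overlap(emb, rotated))  # neighbour structure is preserved
```

For two genuinely separate training runs the overlap will be below 1, and how far below gives a rough measure of how much the two variants disagree.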

@piskvorky
Owner

No, I think there's been a pull request to seed the vectors deterministically (see the hashfxn parameter). But the results will be different anyway, as long as you use multiple threads. The threading brings nondeterminism.

But what I found during the early experiments is that this doesn't affect the results much -- the spread (standard deviation) of accuracy scores across multiple runs is small.

If you find your "new" version is consistently 2% better, without sacrificing much performance, we'll keep that.
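
A rough sketch of how such a comparison could be set up: train each variant several times, record the accuracy score of each run, and look at the gap between the means next to the run-to-run spread. compare_variants is a hypothetical helper, and the numbers below are placeholders for illustration only, not measurements.

```python
from statistics import mean, stdev

def compare_variants(baseline_scores, swapped_scores):
    """Return (gap, spread): difference of mean accuracies and the larger standard deviation."""
    gap = mean(swapped_scores) - mean(baseline_scores)
    spread = max(stdev(baseline_scores), stdev(swapped_scores))
    return gap, spread

# Placeholder accuracy scores, NOT real results from either implementation.
baseline = [0.50, 0.52, 0.51]
swapped = [0.53, 0.54, 0.52]

gap, spread = compare_variants(baseline, swapped)
print(f"gap={gap:.3f} spread={spread:.3f}")
```

A gap that is not clearly larger than the spread is hard to tell apart from threading and initialization noise.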

@tmylk
Contributor

tmylk commented Jan 23, 2016

@parangsaraf Do you have results of testing your version?

@gojomo gojomo changed the title Possible code error when calling train_sg_pair from train_sentence_sg in word2vec.py Possible error calling train_sg_pair from train_sentence_sg in word2vec.py (context word vs target word) Oct 3, 2017