W2v negsam #162
Conversation
Great, thanks @cod3licious :) I'll review and merge asap. I'm kinda overwhelmed at the moment with life stuff, sorry. Reviewing the python3 port is also still in my "gensim queue".
Hey, I would love to see this in Gensim soon. I am right now using Word2Vec for my master thesis, and negative sampling definitely works better for me, but I want to stop using the original C version. Is there a way I could help (a particular test / verification)?
Sure! There are many things you can help with, @mfcabrera.
Hi all, a few months ago, using the Gensim implementation of word2vec as a starting point, I added negative sampling (https://github.com/sebastien-j/Word2vec). The modified files are not ready to be merged into gensim yet, but they might be useful. Here are some potential issues with my implementation. First, the different .py and .pyx files should be merged instead of having one for each kind of model. Second, I did not use the look-up table to compute the sigmoid function, but that can be changed quickly. Third, I was generating the vocabulary Huffman tree and using it during training to see whether a word was in the vocabulary or not. Moreover, the random number generation seems wasteful, but I didn't profile the code to see whether this is a bottleneck or not. There are also additional hyper-parameters that may not be useful in gensim. Finally, the code for "optimization 2" is not written. If you have any questions, feel free to ask. I'll try to help. Sébastien
I wouldn't worry about the sigmoid tables. That optimization doesn't bring much. But I worry about merging this without a cython version. People will complain it's too slow :)
Putting the sigmoid table back is a really easy task anyway. I can try to make a cleaner version of the cython version. What would be the best way to submit it? Just make a new pull request? @cod3licious, why are there both 'hs' and 'negative' parameters? I think one of these should be sufficient.
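For reference, the "sigmoid table" in question is the precomputed lookup table from the original C tool (EXP_TABLE_SIZE=1000 and MAX_EXP=6 are its defaults); a minimal NumPy sketch of the idea:

```python
import numpy as np

EXP_TABLE_SIZE = 1000  # table resolution (default in the C tool)
MAX_EXP = 6            # inputs outside [-MAX_EXP, MAX_EXP] saturate to 0/1

# precompute sigmoid(x) on a uniform grid over [-MAX_EXP, MAX_EXP]
_grid = (np.arange(EXP_TABLE_SIZE) / EXP_TABLE_SIZE * 2.0 - 1.0) * MAX_EXP
EXP_TABLE = 1.0 / (1.0 + np.exp(-_grid))

def fast_sigmoid(f):
    # clamp, then index into the table instead of calling exp() per update
    if f >= MAX_EXP:
        return 1.0
    if f <= -MAX_EXP:
        return 0.0
    return EXP_TABLE[int((f + MAX_EXP) * (EXP_TABLE_SIZE / (2 * MAX_EXP)))]
```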
Best to make a pull request to @cod3licious's branch.
@cod3licious, could you please update your w2v-negsam branch so that it includes the recent changes made in the develop branch of piskvorky/gensim? It would help me add Cython functionality. Thank you.
@sebastien-j concerning the 'hs' and 'negative' parameters: those are both present in the original C code, so I thought I'd add them as well. This makes it possible to train the model using both methods and not just one, which might be useful in some cases. I'm at work right now but I'll try to include the other changes asap. Thanks for helping out :)
@cod3licious, thanks. I guess having both 'hs' and 'negative' doesn't hurt and could in some cases be useful, although I still find it somewhat weird. @piskvorky, I sent a pull request to cod3licious:w2v-negsam (for the Cython version). As I mention in my message there, I am unsure about the "best" way to generate random numbers in order to sample from the vocabulary.
@cod3licious @sebastien-j OK, great. Let's finish this pull request :) Re. RNG: your approach seems fine, generating the random numbers directly. Certainly faster than calling external methods. But did you check the performance (speed, accuracy) with this RNG approach? No problems there?
@piskvorky, the performance seems ok. The speeds I report correspond to a single experiment on my laptop; there was generally some other light work running simultaneously. You may want to compare these results to those obtained with Mikolov's word2vec tool (I never got it to work properly on my Windows system). On fil9 (~123M words; http://mattmahoney.net/dc/textdata.html):

Hierarchical softmax:
Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=1):
Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0):

Negative sampling:
Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):

To see the impact of the RNG, I also tried using numpy's random.randint inside the fast_sentence functions (by using "with gil" to generate the random number, then releasing the GIL):
Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):

There is a clear speed penalty, but no obvious performance gain.
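For context, "generating the random numbers directly" mirrors the original C tool, which advances a 64-bit linear congruential generator inline rather than calling an external RNG; a rough Python sketch (the constants are the ones used in word2vec.c; `table` is the unigram noise table):

```python
MASK64 = (1 << 64) - 1

def lcg_next(state):
    # LCG step from word2vec.c: state = state * 25214903917 + 11 (mod 2**64)
    return (state * 25214903917 + 11) & MASK64

def draw_noise_word(table, state):
    # the high bits are better mixed, so the C code shifts before the modulo
    state = lcg_next(state)
    return table[(state >> 16) % len(table)], state
```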
@sebastien-j great, thanks again. Is this waiting only for a merge from @cod3licious now? Do we need anything else?
@piskvorky, the python version of cbow with negative sampling is not in the pull request yet. However, @cod3licious has done it in cod3licious/word2vec/trainmodel.py, so integrating it into gensim should be easy.
Wait, I'm confused -- we are integrating your changes into the w2v-negsam branch? How many pieces to integrate are there, in what branches? I'd suggest putting everything into a single branch. It will then be easier to reason about the changes, test them and ultimately merge into gensim.
We are indeed integrating my changes into the w2v-negsam branch. To recap (as far as I can tell), @cod3licious first implemented all four training methods (python only) in her word2vec repository. I then made a pull request for CBOW (h.softmax only), which was merged into gensim. Once that was done, @cod3licious updated her w2v-negsam branch. Thus, the only missing part in w2v-negsam is the python version of CBOW with negative sampling.
OK. I'll ping @cod3licious via email, I think she's not receiving github notifications. I've added you as gensim collaborator @sebastien-j, so you can merge/commit directly. Please use with care :-)
I made a mistake in test_word2vec.py. I sent a pull request to @cod3licious's branch to fix it.
How about we open a new pull request, one that you have full control over, @sebastien-j? May be easier and quicker. There's always new functionality being added into gensim, and the longer we wait with merging this PR, the more work it will be to resolve conflicts later & get the PR up-to-date.
I don't think that is necessary. @cod3licious gave me access to her gensim repository. I'll try to add the remaining functionality soon.
Ah, cool, I didn't know :) Big thanks to both of you for polishing & pushing this!
I added some more Python code to cover all the training algorithms. Most of it is taken directly from @cod3licious's word2vec repository.
Changes made to word2vec.pyx. word2vec.py has not been modified yet. Not tested.
For the Cython implementation of negative sampling
These modifications are mostly copied from @cod3licious's word2vec repository.
Index exclusion now matches code in @cod3licious 's word2vec repository. However, this differs from the cython version and from the original word2vec implementation (or at least I think it does). I don't know if we should exclude indices in word2_indices.
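To make the point concrete, here is a simplified sketch of the step under discussion (names follow gensim's word2vec.py; whether the center position `pos` should be excluded is exactly the open question):

```python
# illustrative only: `sentence` is a list of vocab objects (None = pruned word),
# `pos` is the predicted word's position, `window` the context size
start = max(0, pos - window)
word2_indices = [word2.index
                 for pos2, word2 in enumerate(sentence[start:pos + window + 1], start)
                 if word2 is not None and pos2 != pos]
```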
Re. differences and variation: I think it's best to aim at replicating the C code as closely as possible (rather than the original paper). The code paths seem pretty complex. We'll need to try some larger corpora (for example text9). Thanks for the work and testing as usual @sebastien-j! I have rebased this PR on top of the current develop branch.
Also reformat some comments
The mean can also be used.
I have updated the branch. Some comments on the points I raised in my previous post:
There is another difference between the Cython version and the C code. The logistic function approximation used with negative sampling is not the same. I tried doing as in the C code, but for reasons I don't understand, it didn't work well. Right now, I employ the approximation used in hierarchical softmax (see commit 9332b43). I agree that some additional testing is needed to make sure that everything is ok. @piskvorky, do you mind running some tests with the Google tool and posting the results here? Installing the C version on my Windows laptop is not straightforward. I could run the same tests for this pull request. @piskvorky, do you want me to push the changes I make into your branch?
OK. I'll run a few combinations on text9 (cbow on/off, some negative sampling; any other parameter suggestions?) using the C tool. I'll post the accuracy+time results here. I don't think pushing will work; my branch has been rebased. It's probably easier if you rebase the changes you've made since my rebase (=since yesterday) on top of my rebased branch. In any case, let's start using the rebased branch asap, or this will turn into a git nightmare :)
Sorry, scratch that. Now I see you've actually used the rebased branch already 👍
We should at least check all 4 basic combinations {skip-gram, CBOW} x {hierarchical softmax, negative sampling}. To check the Python version, using text8 may be useful.
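A sketch of such a test matrix (assuming the gensim API from this PR; `infile` pointing at a local text8 file and questions-words.txt, Google's analogy test set, are assumptions):

```python
from gensim.models.word2vec import Word2Vec, Text8Corpus

infile = 'text8'  # path assumed; any plain-text corpus works
for sg in (1, 0):                           # 1 = skip-gram, 0 = CBOW
    for hs, negative in ((1, 0), (0, 5)):   # h. softmax vs. 5 noise words
        model = Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5,
                         workers=1, sg=sg, hs=hs, negative=negative)
        model.accuracy('questions-words.txt')  # analogy-task accuracy
```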
(I never used negative sampling, so I'm not sure whether …)
With the same hyper-parameters (and 1 worker):

Skip-gram h. softmax: Cython 26.7%, 94.3k wps; Python 26.8%, 1016 wps
CBOW h. softmax: Cython 14.2%, 315.4k wps; Python 14.2%, 3363 wps
Skip-gram neg. sampling: Cython 13.8%, 70.4k wps; Python 14.1%, 996 wps
CBOW neg. sampling: Cython 13.0%, 293.4k wps; Python 12.6%, 4001 wps

I was able to run the C tool on a virtual machine. On fil9 (with the same hyper-parameters):

Skip-gram h. softmax: Cython 50.5%, C 48.9%
CBOW h. softmax: Cython 33.2%, C 35.5%
Skip-gram neg. sampling: Cython 45.4%, C 44.4%
CBOW neg. sampling: Cython 39.2%, C 42.3%
Sounds good! Are we ready to merge?
Yes, I think we are ready to merge (but you might want to review it just to be sure...).
```python
if model.hs:
    # work on the entire tree at once, to push as much work into numpy's C routines as possible (performance)
    l2a = deepcopy(model.syn1[word.point])  # 2d matrix, codelen x layer1_size
```
For numpy arrays, simply y = np.array(x) should be faster than y = deepcopy(x).
Actually, for fancy indexing (indexing by an array), NumPy should create a copy automatically. Is this explicit copy even necessary?
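A quick check of that claim (illustrative only):

```python
import numpy as np

syn1 = np.zeros((10, 4))
point = np.array([1, 3, 5])  # fancy index, like word.point
l2a = syn1[point]            # fancy indexing yields a fresh copy, not a view
l2a += 1.0
assert syn1.sum() == 0.0     # the original matrix is untouched
```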
I looked at the code, but only spotted some very minor things (style inconsistencies). I could fix those myself after merging. As for the code logic, it's hard to check in detail. That's why I suggested comparison with the existing implementation, on a real corpus. As a sort of high-level check. I see you added some unit tests as well, which is great. Are there any more things that could be tested automatically?
Merged. Thanks @sebastien-j and @cod3licious! Further changes and fixes can happen directly over develop. From your tests it seems the Cython CBOW is consistently worse than the C CBOW... any idea why?
added negative sampling to the python train_sentence function for the skipgram model, plus all additionally needed functions. in build_vocab, if negative sampling is used, a big table for the noise word distribution is created and saved in model.table. this takes quite a bit of RAM and should probably be deleted before saving the model (it's only needed for training anyway).
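A sketch of how such a noise table can be built (the count**0.75 weighting and the very large table follow the original C tool; make_noise_table and its signature are illustrative names, not gensim's actual API):

```python
import numpy as np

def make_noise_table(vocab_counts, table_size=int(1e8), power=0.75):
    """Table where word i occupies a share of slots proportional to
    count[i]**power; drawing a uniform random slot then samples the noise
    distribution. At 10**8 uint32 entries this is ~400 MB, hence the RAM
    concern above."""
    probs = np.asarray(vocab_counts, dtype=np.float64) ** power
    cdf = np.cumsum(probs / probs.sum())
    # vectorized equivalent of the C tool's sequential fill loop
    slots = (np.arange(table_size) + 0.5) / table_size
    return np.searchsorted(cdf, slots).astype(np.uint32)
```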