Code for "analogy task" evaluation #9
Conversation
* hacked out of gensim
* temporary solution until merged properly
I think it would be better to rename
Also I would remove the commented code in the main section and demo how to use it in the
I'll be cleaning this up a little today, should make a PR soon-ish.
This PR is not meant to be merged. The "real" functions reside & will be maintained in gensim; this code is for testing only. I could have sent the single file via email, but github is better for comments & formatting :-)
I finally managed to train on the entire Wikipedia, with the default parameters. This is roughly what I get:
So the next steps will be to:
Hmm, 25%, worse than my 100 dim with 1 epoch (33%)? Probably different corpus preprocessing. Or perhaps it's the OOV placeholders (I ignored OOV tokens in my run). Incidentally, have you seen this article? https://docs.google.com/document/d/1ydIujJ7ETSZ688RGfU5IMJJsbxAi-kRl8czSwpti15s/ In any case, the results there are significantly better (using the original GloVe implementation), so I'm guessing there must still be some nasty hidden bug/deviation from the paper here :(
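For reference, the "ignored OOV tokens" convention mentioned above amounts to something like this (a minimal sketch against the standard questions-words.txt format; function and variable names are illustrative, not the PR's code):

```python
def read_analogies(path, vocab):
    """Yield (a, b, c, expected) analogy questions, skipping any question
    that contains an out-of-vocabulary word -- rather than counting it as
    an error or substituting a placeholder token."""
    with open(path) as f:
        for line in f:
            if line.startswith(':'):   # section header, e.g. ": capital-common-countries"
                continue
            words = line.lower().split()
            if len(words) == 4 and all(w in vocab for w in words):
                yield words
```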
Btw, when porting word2vec, I started with a very faithful translation, so as to easily compare results / rule out conceptual bugs, before moving on with extensions. Is there a version of your GloVe implementation that you know performed equivalently to the original, or did you start differently right off the bat? I'm not at all suggesting you got the algo wrong, but maybe they had extra tricks in the original implementation that didn't "make it" into the paper, hence the difference. Happens all the time.
There are several deviations from the paper that I know of (and this is how I started, for simplicity). So I'll look into them in turn and see if they make a difference. This should be reasonably straightforward now that I have the testing code ready, and it should be interesting to find out which seemingly inconsequential changes make a large difference. Nice link, thanks for posting!
I feel this will make for an interesting article/blog post, make sure you note everything down :) I'll run your updated GloVe on En Wiki too, so we can compare/cross-check results.
Btw, have you tried float32? If not, I'll open a PR and run my tests only on that. I want to avoid the massive memory hit.
Will do :) I think the memory utilization was mostly due to the way the co-occurrence matrix was constructed; I managed to reduce that substantially. Training on the resulting matrix with 200-long vectors and a 30k vocabulary takes about 4.8 GB of memory. I'll shift to single precision at some point but want to do other things first.
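To put those numbers in perspective, a quick back-of-envelope (my own arithmetic, assuming separate word and context vectors; not measurements from the PR):

```python
# V = vocabulary size, d = vector dimensionality
V, d = 30_000, 200
f64 = 8   # bytes per float64 element; float32 would halve everything below

# The model parameters themselves are comparatively tiny:
print(2 * V * d * f64 / 2**20, "MiB")   # ~91.6 MiB for word + context vectors

# A *dense* V x V co-occurrence matrix would dominate by orders of magnitude,
# which is why the sparse construction (and its memory use) matters so much:
print(V * V * f64 / 2**30, "GiB")       # ~6.7 GiB dense float64
```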
I tried on the English wiki, comparing the current GloVe version & word2vec hierarchical softmax skipgram. Exact same data, window=10, workers=8, dims=1000, standard cosine. 10 training epochs for GloVe and 1 epoch for word2vec (=about 4h total training time in all cases):
Looks good 👍 I'll try negative sampling and direct PMI scores from Levy & Goldberg next. The end game here is to make sense of the strengths & weaknesses of these variants, and keep the ones that are easy to interpret, analyze for errors, and tune for specific purposes. I think that's the most annoying (missing) aspect of these methods right now, while being very important in practice.
Looks great! To be frank, I'm slightly lost as to what factors influence the accuracy, so it's great to be able to compare the OOV/ignore approaches (which in this case don't seem to make a large difference?).
Note that these all use 1k dimensions (you used 200, I think). There was no preprocessing to speak of; the Wiki corpus is taken directly from https://github.com/piskvorky/sim-shootout . In particular, there is no sentence splitting: each document is one large sentence, with the window sliding over its tokens.
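For concreteness, the "window sliding over tokens" setup can be sketched like this (a pure-Python illustration with the GloVe paper's 1/distance weighting; the library's actual co-occurrence code is optimized and may differ in details):

```python
from collections import defaultdict

def cooccurrences(tokens, window=10):
    """Count weighted co-occurrences over one long token stream -- no
    sentence splitting, symmetric window, nearer neighbours weighted more."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)            # 1/distance, as in the GloVe paper
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return counts
```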
@piskvorky can I just confirm that your experiments were done using the latest master (commit 4cd5ffd) -- both the co-occurrence matrix creation and the training?
Confirmed, 4cd5ffd it is. For word2vec, it is the latest
Thanks, there was a bug in an earlier version of the co-occurrence code, so I just wanted to confirm that.
Maciej, I re-ran the experiments, also comparing to the SPPMI and SPPMI-with-SVD methods of Levy & Goldberg: http://radimrehurek.com/2014/12/making-sense-of-word2vec/ . I tried GloVe with dim 1,000, but it keeps failing with that not-a-number exception. I tried lowering the learning rate, but then got rubbish results (< 1% accuracy on the word analogy task). Can you give some insight into what likely happens there? Why would it fail like that; is there an internal limit somewhere? Or is it due to the gradient descent method you use (adagrad)? I can send you the corpus matrix again if you like.
My suspicion is that this is due to exploding gradients; maybe clipping the gradients would help. Perhaps @ogrisel has some good advice?
In scikit-learn, I clipped the gradient in SGDClassifier / SGDRegressor to some high value. The same problem might indeed happen for matrix factorization with a least-squares reconstruction error, as is the case in GloVe. It's worth a try.
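Translated to the GloVe objective, the clipping would bound the per-pair error term before it multiplies into the gradients. A toy sketch (plain SGD rather than the AdaGrad used in the library, and the clip value is an assumption, not a value from either implementation):

```python
import numpy as np

def clipped_pair_update(wi, wj, bi, bj, x_ij,
                        lr=0.05, x_max=100.0, alpha=0.75, clip=10.0):
    """One SGD step for a single co-occurrence pair (i, j) under the GloVe
    least-squares loss, with the shared error term clipped so that a single
    extreme pair cannot blow up the parameters."""
    weight = min(1.0, (x_ij / x_max) ** alpha)      # f(X_ij) from the paper
    err = wi.dot(wj) + bi + bj - np.log(x_ij)       # reconstruction error
    err = max(min(err, clip), -clip)                # the clipping step
    grad_wi = weight * err * wj
    grad_wj = weight * err * wi
    wi -= lr * grad_wi                              # in-place vector updates
    wj -= lr * grad_wj
    return bi - lr * weight * err, bj - lr * weight * err
```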
I added two branches that might address the problem: one that clips the gradients, and one that changes how the learning rate is applied.
@piskvorky do you think you could try these out?
Looks good. Does the original C code clip the gradients?
I'll try for sure! Many thanks, Maciej. It's a pleasure to be using a lib with such a responsive developer.
No, the original code does no clipping as far as I can tell. No worries, let me know if any of this helps!
The clipping branch works. Final accuracy also seems fine (68.8%, vs. 67.1% for 600 dims in the blog post) 👍
Great! What value did you clip it to? And have you tried the learning rate branch? If it works and the accuracy is also OK, it sounds like a better solution, as it has one fewer parameter to worry about.
Even with a very low learning rate?
What is a very low learning rate? What should I try?
Hard to say. 0.01? 0.005? The hope is that with this change there will be some sufficiently low learning rate that doesn't give NaNs and still gives decent performance (given enough epochs). What did you try? If this doesn't work, I'll go with clipping.
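For intuition on why the initial rate matters most at the start (a toy illustration, not code from either branch): AdaGrad divides the learning rate by the root of the accumulated squared gradients, so the very first, largest updates are exactly where a too-high rate can blow up.

```python
import math

lr = 0.01                       # one of the "very low" rates suggested above
acc = 0.0
for step, grad in enumerate([5.0, 2.0, 1.0, 0.5, 0.5], start=1):
    acc += grad ** 2            # AdaGrad accumulator of squared gradients
    print(f"step {step}: effective step size = {lr / math.sqrt(acc):.5f}")
```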
Sounds like a little too much magic for my taste; surely we can't be asking users to go through such "tuning". The clipping sounds more reasonable to me.
Yes, it looks like clipping is a much better option. Thanks for your help.
Clipping with a default
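A hedged sketch of what a training run with the merged clipping default might look like (parameter names, in particular `max_loss` as the clipping bound, are my reading of the library and should be checked against the README):

```python
from glove import Corpus, Glove

# texts: an iterable of token lists, one per document (tiny example here)
texts = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]

corpus = Corpus()
corpus.fit(texts, window=10)

model = Glove(no_components=200, learning_rate=0.05, max_loss=10.0)
model.fit(corpus.matrix, epochs=10, no_threads=8)
model.add_dictionary(corpus.dictionary)
```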
Code for accuracy evaluations, as discussed.
When run on the 30k most common Wiki words (after `filter_extremes(keepn=30000)`), 100 dim, GloVe gets 33.3%, word2vec 45.9%.

The code is basically hacked out of gensim's word2vec, with an extra `fake_api` method, which monkey-patches trained word2vec / glove objects so that they share the same interface. The accuracy is then a single function, the same for both models, taking this patched model object as a parameter.
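For context, that vocabulary filter is gensim's `Dictionary.filter_extremes` (a sketch; note the keyword is `keep_n` in gensim, and explicit `no_below`/`no_above` are passed so only the frequency cut-off applies):

```python
from gensim.corpora import Dictionary

dictionary = Dictionary(texts)   # texts: iterable of token lists
dictionary.filter_extremes(no_below=1, no_above=1.0, keep_n=30000)
vocab = set(dictionary.token2id)  # the 30k retained token strings
```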
Use:
We'll get rid of this `fake_api` when GloVe is properly merged and the API between models is cleanly unified.

The code is still a bit rough, and I removed the `__main__` part completely (didn't have time to finish that properly), but hopefully the idea's clear.
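Putting the description together, a minimal sketch of what the `fake_api` shim and the shared `accuracy` function could look like (all attribute names here are illustrative, not the PR's actual code):

```python
import numpy as np

def fake_api(model):
    """Monkey-patch a trained model so that word2vec and GloVe objects both
    expose `word_vectors` (matrix) and `word_index` (token -> row)."""
    if hasattr(model, 'syn0'):                        # gensim word2vec of this era
        model.word_vectors = model.syn0
        model.word_index = {w: v.index for w, v in model.vocab.items()}
    else:                                             # glove-python style object
        model.word_index = model.dictionary
    return model

def accuracy(model, questions):
    """Shared evaluation: for each (a, b, c, expected) analogy question,
    predict the word closest (cosine) to b - a + c, excluding the inputs."""
    vecs = model.word_vectors
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    correct = total = 0
    for a, b, c, expected in questions:
        if any(w not in model.word_index for w in (a, b, c, expected)):
            continue                                  # ignore OOV questions
        ia, ib, ic = (model.word_index[w] for w in (a, b, c))
        sims = vecs.dot(vecs[ib] - vecs[ia] + vecs[ic])
        sims[[ia, ib, ic]] = -np.inf                  # never predict an input word
        correct += int(np.argmax(sims)) == model.word_index[expected]
        total += 1
    return correct / total if total else 0.0
```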