Code for "analogy task" evaluation #9


@piskvorky

Code for accuracy evaluations, as discussed.

When run on the 30k most common Wiki words (after filter_extremes(keepn=30000)), 100 dim, GloVe gets 33.3%, word2vec 45.9%.

The code is basically hacked out of gensim's word2vec, with an extra fake_api method, which monkey patches trained word2vec / glove objects, so that they share the same interface.

The accuracy is then a single function, the same for both models, taking this patched model object as parameter.

Use:

```python
logging.root.level = logging.INFO
model = Word2Vec.load('trainedmodel')  # or Glove.load('trainedmodel')
test_embed.fake_api(model)  # monkey patch it
ok_words = model.word2id  # or pick the top 30k most frequent words
sections = test_embed.accuracy(model, 'questions-words.txt', test_embed.most_similar, ok_words)
```

We'll get rid of this fake_api when GloVe is properly merged, and the API between models is cleanly unified.
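To illustrate the monkey-patching idea, here is a minimal sketch of what a `fake_api`-style shim could look like. The attribute names (`vocab`, `syn0`, `dictionary`, `word_vectors`, `word2id`, `vectors`) are assumptions based on the gensim and glove-python objects of the time, not the actual `test_embed` code:

```python
import numpy as np

def fake_api(model):
    """Hypothetical sketch: attach a uniform `word2id` mapping and
    `vectors` matrix to a trained model, whichever library it came
    from, so one evaluation function can serve both."""
    if hasattr(model, 'vocab'):  # gensim Word2Vec-style object
        model.word2id = {w: v.index for w, v in model.vocab.items()}
        model.vectors = model.syn0
    elif hasattr(model, 'dictionary'):  # glove-python-style object
        model.word2id = dict(model.dictionary)
        model.vectors = model.word_vectors
    else:
        raise ValueError("unknown model type")
    return model
```

After patching, the accuracy function only ever touches `word2id` and `vectors`, regardless of which model was loaded.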

The code is still a bit rough, and I removed the __main__ part completely (didn't have time to finish it properly), but hopefully the idea's clear.

@piskvorky piskvorky add code for "analogy task" eval
* hacked out of gensim
* temporary solution until merged properly
b60ec15
@ogrisel
Contributor
ogrisel commented Nov 6, 2014

I think it would be better to rename test_embed.py to analogy_accuracy.py. Otherwise test runners such as nosetests, or developers used to using nose like myself, will be misled into thinking that this file holds a unittest suite, which is not the case.

@ogrisel
Contributor
ogrisel commented Nov 6, 2014

Also, I would remove the commented code in the __main__ section and demo how to use it in the example/example.py script instead.

@maciejkula
Owner

I'll be cleaning this a little today, should make a PR soon-ish.

@piskvorky

This PR is not meant to be merged.

The "real" functions reside & will be maintained in gensim, this code is for testing only.

I could have sent the single file via email, but github is better for comments & formatting :-)

@maciejkula
Owner

I finally managed to train on the entire wikipedia, with the default filter_extreme settings and keepn=30000 (100 dimensions, 100 epochs).

This is roughly what I get:

Section gram3-comparative mean rank: 0.00132422422422, accuracy: 0.129129129129           
Section gram8-plural mean rank: 0.000447601010101, accuracy: 0.0890151515152              
Section capital-common-countries mean rank: 3.04347826087e-05, accuracy: 0.636363636364   
Section city-in-state mean rank: 0.00117315673289, accuracy: 0.121854304636               
Section family mean rank: 0.000349473684211, accuracy: 0.434210526316                     
Section gram9-plural-verbs mean rank: 0.0022291005291, accuracy: 0.117724867725           
Section gram2-opposite mean rank: 0.0447609649123, accuracy: 0.0210526315789              
Section currency mean rank: 0.115169135802, accuracy: 0.0                                 
Section gram4-superlative mean rank: 0.00134003623188, accuracy: 0.054347826087           
Section gram6-nationality-adjective mean rank: 4.42499392171e-06, accuracy: 0.909555069292
Section gram7-past-tense mean rank: 0.00622515410147, accuracy: 0.0376955903272           
Section gram5-present-participle mean rank: 0.0157334767025, accuracy: 0.0193548387097    
Section capital-world mean rank: 0.000255688910226, accuracy: 0.518482957273              
Section gram1-adjective-to-adverb mean rank: 0.020091543514, accuracy: 0.0221674876847

Overall rank: 0.00510933035607, accuracy: 0.257293092271                                  

So the next steps will be to:

  1. Train word2vec on the same dataset and evaluate using the same evaluation method.
  2. Look into the gradients: I use standard AdaGrad, whereas the GloVe code uses a modification where, instead of gradients[i] += gradient ** 2, they use gradients[i] += (initial_learning_rate * gradient) ** 2 (so the learning rate falls more slowly).
  3. Look into not using OOV placeholders in corpus construction (to expand the effective context window).
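The AdaGrad variant in point 2 can be sketched as follows. This is a simplified scalar-per-coordinate update with illustrative parameter names, not the actual glove-python or GloVe C code:

```python
import numpy as np

def adagrad_step(param, grad, acc, lr, glove_style=False, eps=1e-8):
    """One AdaGrad update on a parameter vector.

    Standard AdaGrad accumulates grad**2; the variant described above
    (used in the GloVe reference C code) accumulates (lr * grad)**2
    instead, so for a small learning rate the accumulator grows more
    slowly and the effective step size decays less rapidly."""
    if glove_style:
        acc += (lr * grad) ** 2
    else:
        acc += grad ** 2
    param -= lr * grad / np.sqrt(acc + eps)
    return param, acc
```

With lr < 1 the glove-style accumulator is smaller after each step, which is exactly the "learning rate falling slower" effect mentioned above.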
@piskvorky

Hmm, 25%, worse than my 100 dim with 1 epoch (33%)?

Probably different corpus preprocessing. Or perhaps it's the OOV placeholders (I ignored OOV tokens in my run).

Incidentally, have you seen this article? https://docs.google.com/document/d/1ydIujJ7ETSZ688RGfU5IMJJsbxAi-kRl8czSwpti15s/
Explores some of the experimental differences between word2vec and GloVe rather nicely.

In any case, the results there are significantly better (using the original GloVe implementation), so I'm guessing there must still be some nasty hidden bug/deviation from the paper here :(

@piskvorky

Btw, when porting word2vec, I started with a very faithful translation, so as to easily compare results / rule out conceptual bugs, before moving on with extensions.

Is there a version of your GloVe implementation that you know performed equivalently to the original, or did you start differently right off the bat?

I'm not at all suggesting you got the algo wrong, but maybe they had extra tricks in the original implementation that didn't "make it" into the paper, hence the difference. Happens all the time.

@maciejkula
Owner

There are several deviations from the paper that I know of (and this is how I started for simplicity).

So I'll look into them in turn and see if they make a difference. This should be reasonably straightforward now that I have the testing code ready, and it should be interesting to find out which seemingly inconsequential changes make a large difference.

Nice link, thanks for posting!

@piskvorky

I feel this will make for an interesting article/blog post, make sure you note everything down :)

I'll run your updated GloVe on En Wiki too, so we can compare/cross check results.

@piskvorky

Btw have you tried with float32? If not, I'll open a PR and run my tests only on that. Want to avoid the massive memory hit.

@maciejkula
Owner

Will do :)

I think the memory utilization was mostly due to the way the co-occurrence matrix was constructed, I managed to reduce that substantially. Training on the resulting matrix with 200-long vectors and 30k vocabulary takes about 4.8 GB of memory.

I'll shift to single precision at some point but want to do other things first.

@piskvorky

I tried on the English wiki, comparing the current GloVe version & word2vec hierarchical softmax skipgram. Exact same data, window=10, workers=8, dims=1000, standard cosine. 10 training epochs for GloVe and 1 epoch for word2vec (=about 4h total training time in all cases):

**GloVe, ignore:**
capital-common-countries: 91.7% (464/506)
capital-world: 93.3% (1569/1682)
currency: 1.9% (1/54)
city-in-state: 81.7% (1800/2203)
family: 85.0% (260/306)
gram1-adjective-to-adverb: 10.6% (86/812)
gram2-opposite: 32.4% (123/380)
gram3-comparative: 70.8% (794/1122)
gram4-superlative: 33.2% (183/552)
gram5-present-participle: 44.5% (387/870)
gram6-nationality-adjective: 97.7% (1269/1299)
gram7-past-tense: 44.9% (665/1482)
gram8-plural: 81.1% (856/1056)
gram9-plural-verbs: 37.8% (246/650)
total: 67.1% (8703/12974)

**GloVe, oov:**
capital-common-countries: 94.9% (480/506)
capital-world: 95.7% (1610/1682)
currency: 5.6% (3/54)
city-in-state: 83.0% (1829/2203)
family: 85.0% (260/306)
gram1-adjective-to-adverb: 9.7% (79/812)
gram2-opposite: 27.9% (106/380)
gram3-comparative: 75.1% (843/1122)
gram4-superlative: 32.1% (177/552)
gram5-present-participle: 43.6% (379/870)
gram6-nationality-adjective: 96.3% (1251/1299)
gram7-past-tense: 40.3% (597/1482)
gram8-plural: 73.0% (771/1056)
gram9-plural-verbs: 38.0% (247/650)
total: 66.5% (8632/12974)

**word2vec, ignore:**
capital-common-countries: 71.1% (360/506)
capital-world: 78.2% (1316/1682)
currency: 1.9% (1/54)
city-in-state: 66.6% (1467/2203)
family: 85.0% (260/306)
gram1-adjective-to-adverb: 21.8% (177/812)
gram2-opposite: 32.4% (123/380)
gram3-comparative: 52.0% (584/1122)
gram4-superlative: 12.7% (70/552)
gram5-present-participle: 34.6% (301/870)
gram6-nationality-adjective: 83.9% (1090/1299)
gram7-past-tense: 44.9% (666/1482)
gram8-plural: 63.6% (672/1056)
gram9-plural-verbs: 36.2% (235/650)
total: 56.4% (7322/12974)

**word2vec, oov:**
capital-common-countries: 81.0% (410/506)
capital-world: 82.8% (1393/1682)
currency: 0.0% (0/54)
city-in-state: 63.3% (1394/2203)
family: 74.8% (229/306)
gram1-adjective-to-adverb: 18.1% (147/812)
gram2-opposite: 23.7% (90/380)
gram3-comparative: 40.1% (450/1122)
gram4-superlative: 10.1% (56/552)
gram5-present-participle: 25.9% (225/870)
gram6-nationality-adjective: 91.8% (1192/1299)
gram7-past-tense: 39.9% (592/1482)
gram8-plural: 56.2% (593/1056)
gram9-plural-verbs: 32.3% (210/650)
total: 53.8% (6981/12974)

**word2vec, ignore, window=5**
capital-common-countries: 72.9% (369/506)
capital-world: 78.0% (1312/1682)
currency: 1.9% (1/54)
city-in-state: 65.1% (1435/2203)
family: 85.3% (261/306)
gram1-adjective-to-adverb: 17.6% (143/812)
gram2-opposite: 33.7% (128/380)
gram3-comparative: 61.2% (687/1122)
gram4-superlative: 18.1% (100/552)
gram5-present-participle: 34.9% (304/870)
gram6-nationality-adjective: 81.3% (1056/1299)
gram7-past-tense: 48.9% (724/1482)
gram8-plural: 62.9% (664/1056)
gram9-plural-verbs: 42.6% (277/650)
total: 57.5% (7461/12974)

**word2vec, ignore, negative sampling=10, no hierarchical softmax, trained on only first 0.5M documents:**
capital-common-countries: 89.1% (451/506)
capital-world: 87.0% (1463/1682)
currency: 0.0% (0/54)
city-in-state: 73.2% (1613/2203)
family: 82.7% (253/306)
gram1-adjective-to-adverb: 27.6% (224/812)
gram2-opposite: 41.3% (157/380)
gram3-comparative: 81.7% (917/1122)
gram4-superlative: 27.0% (149/552)
gram5-present-participle: 45.4% (395/870)
gram6-nationality-adjective: 98.0% (1273/1299)
gram7-past-tense: 55.0% (815/1482)
gram8-plural: 83.1% (878/1056)
gram9-plural-verbs: 52.3% (340/650)
total: 68.8% (8928/12974)

**word2vec, ignore, negative=10 (like above, but full wiki; training took 13h here)**
capital-common-countries: 87.5% (443/506)
capital-world: 88.2% (1483/1682)
currency: 3.7% (2/54)
city-in-state: 71.2% (1569/2203)
family: 82.4% (252/306)
gram1-adjective-to-adverb: 29.2% (237/812)
gram2-opposite: 35.0% (133/380)
gram3-comparative: 65.6% (736/1122)
gram4-superlative: 29.0% (160/552)
gram5-present-participle: 43.9% (382/870)
gram6-nationality-adjective: 97.5% (1266/1299)
gram7-past-tense: 53.0% (786/1482)
gram8-plural: 75.8% (800/1056)
gram9-plural-verbs: 48.0% (312/650)
total: 66.0% (8561/12974)

**word2vec on Google News, completely different corpus & settings, just for illustration. See http://code.google.com/p/word2vec/#Pre-trained_word_and_phrase_vectors:**
capital-common-countries: 24.7% (94/380)
capital-world: 15.0% (97/647)
currency: 12.2% (61/502)
city-in-state: 14.0% (227/1627)
family: 84.6% (428/506)
gram1-adjective-to-adverb: 28.5% (283/992)
gram2-opposite: 42.7% (347/812)
gram3-comparative: 90.8% (1210/1332)
gram4-superlative: 87.3% (980/1122)
gram5-present-participle: 78.1% (825/1056)
gram6-nationality-adjective: 21.9% (212/967)
gram7-past-tense: 66.0% (1029/1560)
gram8-plural: 89.9% (1197/1332)
gram9-plural-verbs: 67.9% (591/870)
total: 55.3% (7581/13705)

(ignore = any word outside this vocab was ignored; oov = out-of-vocab words were replaced by a single special 'OOV' token)
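The two strategies compared above can be sketched as a single preprocessing step. This is an illustrative helper, not code from either library:

```python
def restrict_vocab(tokens, vocab, mode='ignore', oov_token='OOV'):
    """Apply one of the two OOV strategies compared above.

    'ignore': drop out-of-vocabulary words entirely (shrinks the
              text, so in-vocab words move closer together).
    'oov':    replace every out-of-vocabulary word with a single
              placeholder token (preserves original token distances).
    """
    if mode == 'ignore':
        return [t for t in tokens if t in vocab]
    return [t if t in vocab else oov_token for t in tokens]
```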

Looks good 👍

I'll try negative sampling and direct PMI scores from Levy & Goldberg next.

The end game here is to make sense of the strengths & weaknesses of these variants. And keep the ones that are easy to interpret & analyze errors & tune for specific purposes. I think that's the most annoying (missing) aspect of these methods right now, while being very important in practice.

@maciejkula
Owner

Looks great! To be frank, I'm slightly lost as to what factors influence the accuracy, so it's great to be able to compare the OOV/ignore approaches (which in this case don't seem to make a large difference?).

@piskvorky

Note that these all use 1k dimensions (you used 200 I think).

There was no preprocessing to speak of, the Wiki corpus is taken directly from https://github.com/piskvorky/sim-shootout . In particular, there is no sentence splitting: each document is one large sentence, with the window sliding over its tokens.
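The "one large sentence, window sliding over its tokens" setup can be sketched as a simple co-occurrence count. This is a toy illustration (including the 1/distance weighting the GloVe paper uses), not glove-python's actual matrix construction:

```python
from collections import defaultdict

def cooccurrences(tokens, window=10):
    """Symmetric co-occurrence counts over one long token stream,
    with no sentence boundaries: each token co-occurs with every
    token up to `window` positions before it, weighted by
    1/distance as in the GloVe paper."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            c = tokens[j]
            counts[(w, c)] += 1.0 / (i - j)
            counts[(c, w)] += 1.0 / (i - j)
    return counts
```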

@maciejkula
Owner

@piskvorky can I just confirm that your experiments were done using the latest master (commit 4cd5ffd) -- both the co-occurrence matrix creation and the training?

@piskvorky

Confirmed, 4cd5ffd it is. For word2vec, it is the latest develop branch of gensim (commit 99c7d0be53d5cd4cb5528c707b9cacaa75013434).

@maciejkula
Owner

Thanks, there was a bug in earlier version of co-occurrence code, so I just wanted to confirm that.

@piskvorky

Maciej, I re-ran the experiments, also comparing to that SPPMI and SPPMI with SVD of Levy & Goldberg: http://radimrehurek.com/2014/12/making-sense-of-word2vec/ .

I tried GloVe with dim 1,000, but it keeps failing with that not-a-number exception. I tried lowering the learning rate, but then got rubbish results (< 1% accuracy on the word analogy task).

Can you give some insight on what likely happens there? Why would it fail like that, is there an internal limit somewhere? Or is it due to the gradient descent method you use (adagrad)?

I can send you the corpus matrix again if you like.

@maciejkula
Owner

My suspicion is that this is due to exploding gradients, maybe clipping
the loss at some maximum value will help.

Perhaps @ogrisel has some good advice?

@ogrisel
Contributor
ogrisel commented Dec 23, 2014

In scikit-learn, I clipped the gradient in SGDClassifier / SGDRegressor to some high value such as 1e12, and it rendered those methods much more stable for squared hinge loss classification and least squares regression.

The same problem might indeed happen for matrix factorization with least squares reconstruction error, as is the case in GloVe. It's worth a try.

@maciejkula
Owner

I added two branches that might address the problem:

  • https://github.com/maciejkula/glove-python/tree/clip_loss clips the loss at a configurable value (the max_loss parameter to the Glove constructor). Because we try to reproduce log counts, setting this to a very low value (say, 1 or 2) might not be a completely bad idea.
  • https://github.com/maciejkula/glove-python/tree/adagrad_learning_rate increments the Adagrad squared gradients by the square of initial_learning_rate * loss instead of just loss. This way, the learning rate goes down less rapidly if the initial learning rate is low, and this might address the poor performance of low learning rates. (This is, by the way, the implementation used in the C code accompanying the paper.)
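The clip_loss idea can be sketched on a single GloVe training pair. The function below is illustrative (the max_loss name follows the branch above, but the rest of the signature is assumed, not glove-python's API):

```python
import numpy as np

def glove_pair_loss(w_i, w_j, b_i, b_j, log_cooc, weight, max_loss=10.0):
    """Weighted per-pair GloVe error term with loss clipping:
    the prediction error that drives the gradient is clamped to
    [-max_loss, max_loss], so a single extreme co-occurrence pair
    cannot blow up the AdaGrad updates with a huge step."""
    err = weight * (np.dot(w_i, w_j) + b_i + b_j - log_cooc)
    return float(np.clip(err, -max_loss, max_loss))
```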

@piskvorky do you think you could try these out?

@ogrisel
Contributor
ogrisel commented Dec 24, 2014

Looks good. Does the original C code clip the gradients?

@piskvorky

I'll try for sure! Thanks a lot, Maciej. It's a pleasure to be using a lib with such a responsive developer.

@maciejkula
Owner

No, the original code does no clipping as far as I can tell.

No worries, let me know if any of this helps!

@piskvorky

The clip_loss branch finished successfully on 1,000 dims (where before it failed).

Final accuracy seems also fine (68.8%, vs. 67.1% for 600 dims in the blog post) 👍

@maciejkula
Owner

Great! What value did you clip it to?

And have you tried the learning rate branch? If it works and the accuracy is also OK it sounds like a better solution as it has one fewer parameter to worry about.

@piskvorky

max_loss=10. Will try adagrad_learning_rate now.

@piskvorky

Non-finite values in word vectors immediately after epoch 0, with adagrad_learning_rate.

@maciejkula
Owner

Even with a very low learning rate?

@piskvorky

What is a very low learning rate? What should I try?

@maciejkula
Owner

Hard to say. 0.01? 0.005? The hope is that with this change there will be some sufficiently low learning rate that doesn't give NaNs and still gives decent performance (given enough epochs). What did you try?

If this doesn't work I'll go with clipping.

@piskvorky

0.01 => 54% accuracy. Then I tried increasing the number of epochs to 50, but that failed with non-finites again (after epoch 40 = twelve hours of training).

Sounds like a little too much magic for my taste, surely we can't be asking users to go through such "tuning". The clipping sounds more reasonable to me.

@maciejkula
Owner

Yes, it looks like clipping is a much better option. Thanks for your help.

@maciejkula
Owner

Clipping with a default max_loss of 10 merged in #26.

@maciejkula maciejkula closed this Jan 10, 2016
@piskvorky piskvorky deleted the piskvorky:analogy_eval branch Jan 11, 2016