Wrong calculation for max_iter_dump #1216

Closed
ajkl opened this issue Mar 15, 2017 · 5 comments

@ajkl
Contributor

ajkl commented Mar 15, 2017

In
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/wordrank.py#L144
shouldn't the line max_iter_dump = iter / dump_period * dump_period - 1 just be
max_iter_dump = iter - dump_period?

To reproduce, try these parameters:
model = Wordrank.train(wr_path, data, out_dir, iter=100, dump_period=5)
It errors out with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ajkale/anaconda2/envs/wordrank/lib/python2.7/site-packages/gensim/models/wrappers/wordrank.py", line 146, in train
    copyfile('model_word_%d.txt' % max_iter_dump, 'wordrank.words')
  File "/home/ajkale/anaconda2/envs/wordrank/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'model_word_99.txt'

Mainly because max_iter_dump = iter / dump_period * dump_period - 1 works out to max_iter_dump=99 instead of 95.
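
A quick check of the arithmetic (a minimal sketch; assumes Python 2 integer division, as in the environment shown in the traceback above):

iter, dump_period = 100, 5

# current formula at wordrank.py#L144
print(iter / dump_period * dump_period - 1)  # 99 -> looks for 'model_word_99.txt', which does not exist

# suggested replacement from this comment
print(iter - dump_period)                    # 95 -> 'model_word_95.txt'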

ajkl added a commit to ajkl/gensim that referenced this issue Mar 15, 2017
@tmylk
Contributor

tmylk commented Mar 15, 2017

Cc @parulsethi

@parulsethi
Contributor

parulsethi commented Mar 15, 2017

Actually, WordRank dumps the current iteration's file at the start of the next iteration. So basically, we need to do 101 iterations to get the 100th-iteration file, hence the corresponding max_iter_dump. Though I just noticed that the dump_period division is redundant.

I've mentioned this under Train Model in the wordrank quickstart, but yeah, anyone could easily miss it.

I guess better behavior would be to simply add 1 more iter in the wrapper's train() itself (and update max_iter_dump=iter accordingly), rather than leaving it to the user. That way, with the parameters you mentioned, you'd get the embedding from model_word_100.txt, which is better than model_word_95.txt. WDYT @ajkl?

@ajkl
Contributor Author

ajkl commented Mar 16, 2017

Just saw your note in the quickstart notebook. That's implicit knowledge, and I think the code should handle it or warn, not crash.
About your add-1-to-iter comment: won't that mean you end up doing n+1 iterations instead of n (since iter is 0-indexed)? That's not ideal behavior either, although you could argue it's just 1 more iteration than what the user specified. I don't think it's accurate though.
On second thought, my earlier suggestion won't work either; it only works when iter mod dump_period = 0, and the same goes for your suggestion to add 1 to iter, e.g. iter=102, dump_period=5.

This is what I think it should be: max_iter_dump = (iter - 1) - (iter - 1) % dump_period, which basically finds the largest number less than iter that is divisible by dump_period.
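
A quick worked check of that formula (a minimal sketch, just illustrating the arithmetic for the two cases discussed above):

def max_iter_dump(iter, dump_period):
    # largest multiple of dump_period strictly less than iter
    return (iter - 1) - (iter - 1) % dump_period

print(max_iter_dump(100, 5))  # 95
print(max_iter_dump(102, 5))  # 100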

@parulsethi
Contributor

parulsethi commented Mar 16, 2017

Yep, I still think we can do n+1 iterations, because then if someone inputs, for example, 100 iterations, they would get the result of the 100th iter, not the 95th (am I missing any drawback of doing n+1 iterations? If it's time, then there would already be 5 redundant iterations in the case of 95).
And in case iter % dump_period is not zero, that extra iteration is not needed, as WordRank would already be doing those extra iterations itself (as many as the resulting mod).

So IMO the solution could be:

from warnings import warn

if iter % dump_period == 0:
    iter += 1
else:
    warn('Resultant embedding would be from %d iterations' % (iter - iter % dump_period))

and then doing max_iter_dump = iter - iter % dump_period, similar to your suggestion.
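
Put together, the proposed wrapper-side behavior would look roughly like this (a minimal sketch, not the actual gensim code; the helper name adjust_iterations is hypothetical):

from warnings import warn

def adjust_iterations(iter, dump_period):
    # run one extra iteration when iter divides evenly, so WordRank dumps the file for the last requested iteration
    if iter % dump_period == 0:
        iter += 1
    else:
        warn('Resultant embedding would be from %d iterations' % (iter - iter % dump_period))
    max_iter_dump = iter - iter % dump_period
    return iter, max_iter_dump

print(adjust_iterations(100, 5))  # (101, 100) -> reads model_word_100.txt
print(adjust_iterations(102, 5))  # (102, 100) -> reads model_word_100.txt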

@ajkl
Contributor Author

ajkl commented Mar 17, 2017

Sounds good, updated the PR.

tmylk closed this as completed in c5a053c Mar 20, 2017