Wrong calculation for max_iter_dump #1216

Closed
ajkl opened this issue Mar 15, 2017 · 5 comments

@ajkl
Contributor

ajkl commented Mar 15, 2017

In
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/wordrank.py#L144
shouldn't the line max_iter_dump = iter / dump_period * dump_period - 1 just be
max_iter_dump = iter - dump_period?

To reproduce, try these parameters:
model = Wordrank.train(wr_path, data, out_dir, iter=100, dump_period=5)
It errors out with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ajkale/anaconda2/envs/wordrank/lib/python2.7/site-packages/gensim/models/wrappers/wordrank.py", line 146, in train
    copyfile('model_word_%d.txt' % max_iter_dump, 'wordrank.words')
  File "/home/ajkale/anaconda2/envs/wordrank/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'model_word_99.txt'

Mainly because max_iter_dump = iter / dump_period * dump_period - 1 works out to max_iter_dump=99 instead of 95.
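
A quick check of the arithmetic (a minimal sketch; assumes Python 2 integer division, as in the environment shown in the traceback above):

iter, dump_period = 100, 5

# current formula at wordrank.py#L144
print(iter / dump_period * dump_period - 1)  # 99 -> looks for 'model_word_99.txt', which does not exist

# suggested replacement from this comment
print(iter - dump_period)                    # 95 -> 'model_word_95.txt'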

ajkl added a commit to ajkl/gensim that referenced this issue Mar 15, 2017
@tmylk
Contributor

tmylk commented Mar 15, 2017

Cc @parulsethi

@parulsethi
Contributor

parulsethi commented Mar 15, 2017

Actually, WordRank dumps the current iteration's file at the start of the next iteration. So basically, we need to do 101 iterations to get the 100th-iteration file, hence the corresponding max_iter_dump. Though I just noticed that the dump_period division is redundant.

I've mentioned this under Train Model in the wordrank quickstart, but yeah, anyone could easily miss it.

I guess better behavior would be to simply add 1 more iter in the wrapper's train() itself (and update max_iter_dump=iter accordingly), rather than leaving it to the user. That way, with the parameters you mentioned, you'd get the embedding from model_word_100.txt, which is better than model_word_95.txt. WDYT @ajkl?

@ajkl
Contributor Author

ajkl commented Mar 16, 2017

Just saw your note in the quickstart notebook. That's implicit knowledge, and I think the code should handle it or warn, not crash.
About your add-1-to-iter comment: won't that mean you end up doing n+1 iterations instead of n (since iter is 0-indexed)? That's not ideal behavior either, although you could argue it's just 1 more iteration than what the user specified. I don't think it's accurate though.
On second thought, my earlier suggestion won't work either; it only works when iter mod dump_period = 0, and the same goes for your suggestion to add 1 to iter, e.g. iter=102, dump_period=5.

This is what I think it should be: max_iter_dump = (iter - 1) - (iter - 1) % dump_period, which basically finds the largest number less than iter that is divisible by dump_period.
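
A quick worked check of that formula (a minimal sketch, just illustrating the arithmetic for the two cases discussed above):

def max_iter_dump(iter, dump_period):
    # largest multiple of dump_period strictly less than iter
    return (iter - 1) - (iter - 1) % dump_period

print(max_iter_dump(100, 5))  # 95
print(max_iter_dump(102, 5))  # 100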

@parulsethi
Contributor

parulsethi commented Mar 16, 2017

Yep, I still think we can do n+1 iterations, because then if someone inputs, for example, 100 iterations, they would get the result of the 100th iter, not the 95th (am I missing any drawback of doing n+1 iterations? If it's time, then there would already be 5 redundant iterations in the case of 95).
And in case iter % dump_period is not zero, that extra iteration is not needed, as WordRank would already be doing those extra iterations itself (as many as the resulting mod).

So IMO the solution could be:

from warnings import warn

if iter % dump_period == 0:
    iter += 1
else:
    warn('Resultant embedding would be from %d iterations' % (iter - iter % dump_period))

and then doing max_iter_dump = iter - iter % dump_period, similar to your suggestion.
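
Put together, the proposed wrapper-side behavior would look roughly like this (a minimal sketch, not the actual gensim code; the helper name adjust_iterations is hypothetical):

from warnings import warn

def adjust_iterations(iter, dump_period):
    # run one extra iteration when iter divides evenly, so WordRank dumps the file for the last requested iteration
    if iter % dump_period == 0:
        iter += 1
    else:
        warn('Resultant embedding would be from %d iterations' % (iter - iter % dump_period))
    max_iter_dump = iter - iter % dump_period
    return iter, max_iter_dump

print(adjust_iterations(100, 5))  # (101, 100) -> reads model_word_100.txt
print(adjust_iterations(102, 5))  # (102, 100) -> reads model_word_100.txt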

@ajkl
Contributor Author

ajkl commented Mar 17, 2017

Sounds good, updated the PR.

tmylk closed this as completed in c5a053c Mar 20, 2017