
error in sample.lua: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109) #28

Open
enicon opened this issue Jun 5, 2015 · 10 comments

Comments

@enicon

enicon commented Jun 5, 2015

Hello,
thanks for sharing this! I was trying to run some experiments, but I keep getting this error from sample.lua and would appreciate a hint about how to fix it:

[root@sushi char-rnn-master]# th sample.lua -gpuid -1 cv/lm_lstm_epoch9.57_nan.t7
creating an LSTM...
seeding with
/root/torch/install/bin/luajit: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109)
stack traceback:
[C]: at 0x7f8149182b20
[C]: in function 'multinomial'
sample.lua:102: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00405800
[root@sushi char-rnn-master]#

I suspect that something is wrong with the training. I'm using all defaults and the data .txt file is about 550KB. I'm using -gpuid -1 for both training and sampling (no GPU).

Thanks!

(I know I should probably not be running as root...)

@Taschi120

Same problem over here, I can provide the snapshot if it helps at all.

Edit: Also, my training and validation losses all come up as NaN during training. Could that have something to do with the bug?

@karpathy
Owner

It seems your dataset is very small.
Can you try a smaller batch size? E.g. -batch_size 10 or 20, and maybe -seq_length 50?
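
For reference, with char-rnn's train.lua those suggestions would look something like this (`data/my_corpus` is a placeholder for your own dataset directory):

```sh
th train.lua -data_dir data/my_corpus -batch_size 10 -seq_length 50 -gpuid -1
```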

@Taschi120

I have a 4.9MB input file and tried to run it with many different parameters - including the ones you just suggested. The error persists.

The same thing happens when I try to train based on the tinyshakespeare dataset with default parameters and -gpuid -1.

@tjrileywisc

I'm seeing this too when testing the tinyshakespeare dataset with defaults. Training was broken at first, but on another run the losses did update (no NaNs); sampling still wasn't working, though. I tried restarting training from scratch just to see if I could get sampling working, but now training is broken again.

If it helps in any way, I'm on the most recent version of Torch, Lua is 5.1.4 and OS is CentOS 6.2.

@hughperkins
Contributor

The error means your data are NaN'd. Two possible causes: the weights became NaN during training, or the cv snapshot file was somehow corrupted.
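
As a rough illustration of why NaN'd data trips this particular error (a Python sketch, not the actual C code; `check_probs` is a name I made up): the guard in THTensorRandom.c rejects any distribution whose probability sum is not strictly positive, and a single NaN poisons the whole sum.

```python
import math

def check_probs(probs):
    # Loose Python mimic of the guard in THTensorRandom.c: the sum of the
    # probabilities must be finite and strictly positive before sampling.
    total = sum(probs)
    if math.isnan(total) or total <= 0:
        raise ValueError("invalid multinomial distribution "
                         "(sum of probabilities <= 0)")
    return total

# One NaN weight (e.g. from NaN'd network outputs) poisons the whole sum,
# because NaN propagates through addition and fails every comparison:
try:
    check_probs([0.2, float("nan"), 0.3])
except ValueError as e:
    print("rejected:", e)
```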

@tjrileywisc

Might have been the weights getting NaN'd, since I saw this happen during training on the same machine. I switched to an Ubuntu container (instead of CentOS) and the training and sampling worked there on several datasets without any further problems.

@afoland

afoland commented Mar 14, 2016

This is nearly for certain due to: torch/torch7#453

I found it crashed when the probability for one output went to 1.0000. (This explains the observed temperature dependence.) I verified that none of the inputs were NaN'd or negative.

In sample.lua I simply subtracted 0.001 from any probability greater than 0.999 before passing it to multinomial, and that cured the crashes.

I have not tried the recommended step of changing multinomial to be double.
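
A minimal sketch of that workaround (in Python for illustration; the function name is mine, and the real change lives in sample.lua's Lua code just before the multinomial call):

```python
def clamp_top_prob(probs, hi=0.999, eps=0.001):
    # Workaround in the spirit of the comment above: pull any probability
    # above `hi` down by `eps`, so that single-precision round-off cannot
    # leave the distribution looking degenerate (one entry at exactly 1.0).
    return [p - eps if p > hi else p for p in probs]

probs = [0.99995, 0.00003, 0.00002]   # one output saturated near 1.0
safe = clamp_top_prob(probs)           # largest entry is now <= 0.999
```

This only sidesteps the edge case; the recommended fix mentioned above (running multinomial in double precision, per torch/torch7#453) addresses the round-off at the source.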

@soumith

soumith commented Mar 14, 2016

I'll try to get this resolved in the torch repo in the next week.

@nkoumchatzky

@afoland @enicon Would you have a self-contained example I could use to test if the proposed solution to torch/torch7#453 would work in your case?
Thanks

@afoland

afoland commented Mar 15, 2016

Not very self-contained, I'm afraid, I'm using eidnes' word-rnn (derived heavily from char-rnn), run on one of my datasets.

I'm a complete newcomer to GitHub; if there's a more sensible way to contact you than posting here, I'm happy to exchange more info to see if we can hash out some way to test.
