error in sample.lua: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109) #28
Comments
Same problem over here, I can provide the snapshot if it helps at all. Edit: Also, my training and validation losses all come up as NaN during training. Could that have something to do with the bug?
It seems your dataset is very small.
I have a 4.9MB input file and tried to run it with many different parameters, including the ones you just suggested. The error persists. The same thing happens when I try to train on the tinyshakespeare dataset with default parameters and -gpuid -1.
I'm seeing this too when testing the tinyshakespeare dataset with defaults. Training was broken, but then somehow on another run the losses were updating (not showing NaNs), only sampling wasn't working. I tried to restart training from scratch just to see if I could get the sampling working, but now training is broken again. If it helps in any way, I'm on the most recent version of Torch, Lua is 5.1.4, and the OS is CentOS 6.2.
The error means your data are NaN'd. Two possible causes: the weights became NaN'd during training, or the cv snapshot file was somehow corrupted.
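If it helps to rule out the first cause, here is a minimal sketch of a NaN check on a saved checkpoint. It assumes a CPU-trained checkpoint whose trained modules sit under checkpoint.protos (the layout char-rnn's train.lua appears to use); adjust the field name if your fork stores them differently.

require 'torch'
require 'nn'
require 'nngraph'

-- load a checkpoint produced by train.lua, e.g. cv/lm_lstm_epoch9.57_nan.t7
local checkpoint = torch.load(arg[1])
for name, proto in pairs(checkpoint.protos) do
  if proto.getParameters then  -- skip entries without weights, e.g. the criterion
    local params = proto:getParameters()
    -- NaN is the only value not equal to itself, so ne() flags exactly the NaNs
    local nan_count = params:ne(params):sum()
    print(string.format('%s: %d NaN values out of %d parameters', name, nan_count, params:nElement()))
  end
end

If every count is zero, the weights are fine and a corrupted snapshot file is the more likely suspect.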
Might have been the weights getting NaN'd, since I saw this happen during training on the same machine. I switched to an Ubuntu container (instead of CentOS) and the training and sampling worked there on several datasets without any further problems.
This is almost certainly due to torch/torch7#453. I found it crashed when the probability for one output went to 1.0000 (this explains the observed temperature dependence). I verified that none of the inputs were NaN'd or negative. In sample.lua I simply subtracted 0.001 from any probability greater than 0.999 before passing it to multinomial, and that cured the crashes. I have not tried the recommended step of changing multinomial to use double precision.
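For reference, here is a minimal sketch of that workaround as it could look just before the multinomial call in sample.lua. The variable names (probs, prev_char) follow the convention used in that file but are assumptions about your local copy, not a verbatim patch.

-- probs holds the model's probability distribution over the next character
probs:div(torch.sum(probs))                      -- renormalise so the entries sum to 1
local near_one = probs:gt(0.999):typeAs(probs)   -- 1 where p > 0.999, 0 elsewhere
probs:add(-0.001, near_one)                      -- subtract 0.001 from those entries
prev_char = torch.multinomial(probs:float(), 1):resize(1):float()

This only papers over the float-precision issue; the change to a double-precision multinomial discussed in torch/torch7#453 would be the proper fix.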
I'll try to get this resolved in the torch repo in the next week.
@afoland @enicon Would you have a self-contained example I could use to test if the proposed solution to torch/torch7#453 would work in your case? |
Not very self-contained, I'm afraid; I'm using eidnes' word-rnn (derived heavily from char-rnn), run on one of my datasets. I'm a complete newcomer to GitHub, so if there's a more sensible way to contact you than posting here, I'm happy to exchange more info to see if we can hash out some way to test.
Hello,
thanks for sharing this! I was trying to run some experiments: I keep getting this error from sample.lua and I would appreciate a hint about how to fix it:
[root@sushi char-rnn-master]# th sample.lua -gpuid -1 cv/lm_lstm_epoch9.57_nan.t7
creating an LSTM...
seeding with
/root/torch/install/bin/luajit: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109)
stack traceback:
[C]: at 0x7f8149182b20
[C]: in function 'multinomial'
sample.lua:102: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00405800
[root@sushi char-rnn-master]#
I suspect that something is wrong with the training. I'm using all defaults, and the data .txt file is about 550KB. I'm using -gpuid -1 for both training and sampling (no GPU).
Thanks!
(I know I should probably not be running as root...)
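A side note on the checkpoint name: char-rnn appears to embed the validation loss in the filename, so the _nan suffix in lm_lstm_epoch9.57_nan.t7 suggests the loss was already NaN when the snapshot was written, which matches the NaN'd-weights diagnosis earlier in the thread. torch.multinomial rejects any distribution whose sum is not strictly positive, and a sum containing NaN fails that check, which is why a NaN'd model produces exactly this message. A tiny illustration (the values are made up, not taken from the checkpoint):

require 'torch'
-- a probability vector full of NaNs, the kind a NaN'd softmax output would produce
local probs = torch.FloatTensor(5):fill(0/0)   -- 0/0 evaluates to NaN in Lua
local ok, err = pcall(torch.multinomial, probs, 1)
print(ok)    -- false
print(err)   -- ...invalid multinomial distribution (sum of probabilities <= 0)...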