
error in sample.lua: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109) #28

Open
enicon opened this issue Jun 5, 2015 · 10 comments

Comments

@enicon

enicon commented Jun 5, 2015

Hello,
thanks for sharing this! I was trying to run some experiments, but I keep getting this error from sample.lua and would appreciate a hint about how to fix it:

[root@sushi char-rnn-master]# th sample.lua -gpuid -1 cv/lm_lstm_epoch9.57_nan.t7
creating an LSTM...
seeding with
/root/torch/install/bin/luajit: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109)
stack traceback:
[C]: at 0x7f8149182b20
[C]: in function 'multinomial'
sample.lua:102: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00405800
[root@sushi char-rnn-master]#

I suspect that something is wrong with the training. I'm using all defaults and the data .txt file is about 550KB. I'm using -gpuid -1 for both training and sampling (no GPU).

Thanks!

(I know I should probably not be running as root...)

@Taschi120

Same problem over here, I can provide the snapshot if it helps at all.

Edit: Also, my training and validation losses all come up as NaN during training. Could that have something to do with the bug?

@karpathy
Owner

It seems your dataset is very small.
Can you try a smaller batch size? E.g. -batch_size 10 or 20, and maybe -seq_length 50?
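
For reference, with char-rnn's train.lua those suggestions would look something like this (`data/my_corpus` is a placeholder for your own dataset directory):

```sh
th train.lua -data_dir data/my_corpus -batch_size 10 -seq_length 50 -gpuid -1
```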

@Taschi120

I have a 4.9MB input file and tried to run it with many different parameters - including the ones you just suggested. The error persists.

The same thing happens when I try to train based on the tinyshakespeare dataset with default parameters and -gpuid -1.

@tjrileywisc

I'm seeing this too when testing the tinyshakespeare dataset with defaults. Training was broken at first, but on another run the losses did update (no NaNs); sampling still wasn't working, though. I tried restarting training from scratch just to see if I could get sampling working, but now training is broken again.

If it helps in any way, I'm on the most recent version of Torch, Lua is 5.1.4 and OS is CentOS 6.2.

@hughperkins
Contributor

The error means your data are NaN'd. Two possible causes: the weights became NaN during training, or the cv snapshot file was somehow corrupted.
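
As a rough illustration of why NaN'd data trips this particular error (a Python sketch, not the actual C code; `check_probs` is a name I made up): the guard in THTensorRandom.c rejects any distribution whose probability sum is not strictly positive, and a single NaN poisons the whole sum.

```python
import math

def check_probs(probs):
    # Loose Python mimic of the guard in THTensorRandom.c: the sum of the
    # probabilities must be finite and strictly positive before sampling.
    total = sum(probs)
    if math.isnan(total) or total <= 0:
        raise ValueError("invalid multinomial distribution "
                         "(sum of probabilities <= 0)")
    return total

# One NaN weight (e.g. from NaN'd network outputs) poisons the whole sum,
# because NaN propagates through addition and fails every comparison:
try:
    check_probs([0.2, float("nan"), 0.3])
except ValueError as e:
    print("rejected:", e)
```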

@tjrileywisc

Might have been the weights getting NaN'd, since I saw this happen during training on the same machine. I switched to an Ubuntu container (instead of CentOS) and the training and sampling worked there on several datasets without any further problems.

@afoland

afoland commented Mar 14, 2016

This is nearly for certain due to: torch/torch7#453

I found it crashed when the probability for one output went to 1.0000. (This explains the observed temperature dependence.) I verified that none of the inputs were NaN'd or negative.

In sample.lua I simply subtracted 0.001 from any probability greater than 0.999 before passing it to multinomial, and that cured the crashes.

I have not tried the recommended step of changing multinomial to be double.
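
A minimal sketch of that workaround (in Python for illustration; the function name is mine, and the real change lives in sample.lua's Lua code just before the multinomial call):

```python
def clamp_top_prob(probs, hi=0.999, eps=0.001):
    # Workaround in the spirit of the comment above: pull any probability
    # above `hi` down by `eps`, so that single-precision round-off cannot
    # leave the distribution looking degenerate (one entry at exactly 1.0).
    return [p - eps if p > hi else p for p in probs]

probs = [0.99995, 0.00003, 0.00002]   # one output saturated near 1.0
safe = clamp_top_prob(probs)           # largest entry is now <= 0.999
```

This only sidesteps the edge case; the recommended fix mentioned above (running multinomial in double precision, per torch/torch7#453) addresses the round-off at the source.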

@soumith

soumith commented Mar 14, 2016

I'll try to get this resolved in the torch repo in the next week.

@nkoumchatzky

@afoland @enicon Would you have a self-contained example I could use to test if the proposed solution to torch/torch7#453 would work in your case?
Thanks

@afoland

afoland commented Mar 15, 2016

Not very self-contained, I'm afraid, I'm using eidnes' word-rnn (derived heavily from char-rnn), run on one of my datasets.

I'm a complete newcomer to GitHub; if there's a more sensible way to contact you than posting here, I'm happy to exchange more info to see if we can hash out some way to test.
