sample.lua fails to run: error in function addmm() #21

Open
swisspol opened this issue May 30, 2015 · 15 comments

@swisspol

Just trying out the default data set:

$ th train.lua -data_dir data/tinyshakespeare

And then using a checkpoint file:

$ th sample.lua cv/lm_lstm_epoch4.74_1.7332.t7 
using CUDA on GPU 0...  
creating an LSTM... 
seeding with    
/Users/pol/torch/install/bin/luajit: /Users/pol/torch/install/share/lua/5.1/nn/Linear.lua:46: expected arguments: *DoubleTensor~2D* [DoubleTensor~2D] [double] DoubleTensor~2D DoubleTensor~2D | *DoubleTensor~2D* double [DoubleTensor~2D] double DoubleTensor~2D DoubleTensor~2D
stack traceback:
    [C]: in function 'addmm'
    /Users/pol/torch/install/share/lua/5.1/nn/Linear.lua:46: in function 'func'
    /Users/pol/torch/install/share/lua/5.1/nngraph/gmodule.lua:214: in function 'neteval'
    /Users/pol/torch/install/share/lua/5.1/nngraph/gmodule.lua:244: in function 'forward'
    sample.lua:88: in main chunk
    [C]: in function 'dofile'
    .../pol/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x0105f02340
@swisspol
Author

This other tool works fine, however:

$ th inspect_checkpoint.lua cv/lm_lstm_epoch18.96_1.4228.t7 
using CUDA on GPU 0...  
opt:    
{
  max_epochs : 30
  seed : 123
  batch_size : 100
  gpuid : 0
  decay_rate : 0.95
  savefile : "lstm"
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare"
  seq_length : 50
  num_layers : 2
  rnn_size : 100
  train_frac : 0.95
  learning_rate : 0.002
  dropout : 0
  eval_val_every : 1000
  val_frac : 0.05
  checkpoint_dir : "cv"
}
val losses: 
{
  2000 : 1.5233611306277
  3000 : 1.4519253438169
  4000 : 1.4227915313027
  1000 : 1.7233323420178
}

@karpathy
Owner

Did you train one with GPU and the other with CPU? Check the "gpuid" flag. Is it "0" on both models?

@swisspol
Author

Yes, GPU on both. What's interesting is that checkpoints created later in the process do work, as does the very last one.

@antonmil

@swisspol, I'm running into the same issue inside an iTorch notebook environment, but it works fine on the standard command line. I'm very new to Lua / Torch, but it would be good to figure out what's causing this.

@PaulSchnau

I had this error on OSX 10.10. Using -opencl 1 on both train.lua and sample.lua made it work.
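
For reference, that workaround amounts to something like the following (the checkpoint name is a placeholder, and -opencl 1 assumes the OpenCL Torch packages are installed):

$ th train.lua -data_dir data/tinyshakespeare -opencl 1
$ th sample.lua cv/some_checkpoint.t7 -opencl 1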

@svickers

svickers commented Aug 8, 2015

Had this problem on a c2.2xl instance. Tried -opencl 1 but no luck.

@hughperkins
Contributor

@svickers can you post the fourth line of your output, the one that has 'Linear.lua:46' in it? Edit: and also the results of inspect_checkpoint.

@svickers

svickers commented Aug 8, 2015

@hughperkins Sorry Hugh! I killed that VM and went to a g2.2xl, and everything worked straight away.

@hughperkins
Contributor

@svickers oh, nice! Hmmm, c2s don't actually have a GPU, right? g2 sounds like a GPU instance?

@quematech

Hello.

I ran into the same problem and am dealing with a lot of frustration in my holy quest of running tests on a Monty Python's Flying Circus corpus :). The error also appears when trying the tiny Shakespeare data set.

I run CPU-only (no GPU). The training goes well, with no NaN values:

th train.lua -data_dir data/tinyshakespeare/ -gpuid -1

loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an lstm with 2 layers
number of parameters in the model: 240321
cloning rnn
cloning criterion
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 0.34s
2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 0.28s
3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 0.28s

Sampling raises an error whatever checkpoint file is used:

th sample.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
creating an lstm...
missing seed text, using uniform probability over first character
--------------------------
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Linear.lua:46: invalid arguments: DoubleTensor number DoubleTensor number FloatTensor DoubleTensor
expected arguments: *DoubleTensor~2D* [DoubleTensor~2D] [double] DoubleTensor~2D DoubleTensor~2D | *DoubleTensor~2D* double [DoubleTensor~2D] double DoubleTensor~2D DoubleTensor~2D
stack traceback:
        [C]: in function 'addmm'
        /usr/local/share/lua/5.1/nn/Linear.lua:46: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:253: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:288: in function 'forward'
        sample.lua:151: in main chunk
        [C]: in function 'dofile'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
        [C]: at 0x00406720
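
Note the "invalid arguments" line above: a FloatTensor appears among the DoubleTensors, which suggests the model loaded from the checkpoint and the tensors built by sample.lua ended up with different types. A quick way to check from the th REPL (assuming the checkpoint stores its modules under protos.rnn, as char-rnn's training script appears to):

th> require 'nngraph'
th> checkpoint = torch.load('cv/lm_lstm_epoch26.00_1.3900.t7')
th> params = checkpoint.protos.rnn:parameters()
th> print(params[1]:type())   -- e.g. torch.DoubleTensor; this must match the input tensors' type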

And the inspect script looks OK:

th inspect_checkpoint.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
opt:
{
  max_epochs : 50
  seed : 123
  batch_size : 50
  gpuid : -1
  decay_rate : 0.95
  learning_rate_decay : 0.97
  opencl : 0
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare/"
  seq_length : 50
  num_layers : 2
  learning_rate_decay_after : 10
  rnn_size : 128
  train_frac : 0.95
  dropout : 0
  init_from : ""
  learning_rate : 0.002
  eval_val_every : 1000
  val_frac : 0.05
  savefile : "lstm"
  checkpoint_dir : "cv"
}
val losses:
{
  3000 : 1.4450460764536
  4000 : 1.4213234041304
  5000 : 1.4060113392715
  6000 : 1.389498488439
  8000 : 1.3909428322715
  10000 : 1.4003497627469
  7000 : 1.3937299336865
  9000 : 1.3940925438403
  1000 : 1.7136267190726
  2000 : 1.5211800115534
  11000 : 1.389983844627
}

@karpathy
Owner

karpathy commented Aug 8, 2015

This is a silly bug I think I introduced only a few days ago, unfortunately. Fixing...

@karpathy
Owner

karpathy commented Aug 8, 2015

OK, I think I patched this issue with this commit:
0fb9a77

See if things work properly now with the new sampling script. The issue is that CPU models use doubles, but when I was converting GPU models I converted them to float() and then changed the sampling script to use float(), which broke the previous CPU-only functionality. Sorry about the mess; when I originally designed this code I always used GPUs, so I didn't anticipate the conversion issues, or that training on CPU or converting GPU->CPU would be a common use case.
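
A rough sketch of the idea, not the literal patch in 0fb9a77: pick one tensor type based on how sampling will run, and cast the loaded model to it before building any input tensors (field names follow the char-rnn checkpoint format):

-- hypothetical simplification of the type handling the sampling script needs
require 'torch'
require 'nn'
require 'nngraph'

local gpuid = -1                                   -- CPU-only sampling in this sketch
local checkpoint = torch.load('cv/lm_lstm_epoch26.00_1.3900.t7')
local rnn = checkpoint.protos.rnn

if gpuid >= 0 then
    require 'cutorch'; require 'cunn'
    rnn:cuda()                                     -- GPU path: CudaTensors throughout
else
    rnn:double()                                   -- CPU path: DoubleTensors throughout,
end                                                -- even if the checkpoint was saved as float

-- every tensor later fed to rnn:forward() must then be created with that same type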

@quematech

Looks like it works. Thank you and may my cat bless you, m'lord!

@nielmclaren

Fixed it for me, too. Thanks @karpathy!

@ghost

ghost commented Aug 10, 2015

Great job @karpathy! Had the same issue and now it works perfectly.
