sample.lua fails to run: error in function addmm() #21

Open
swisspol opened this issue May 30, 2015 · 15 comments

@swisspol

Just trying out the default data set:

$ th train.lua -data_dir data/tinyshakespeare

And then using a checkpoint file:

$ th sample.lua cv/lm_lstm_epoch4.74_1.7332.t7 
using CUDA on GPU 0...  
creating an LSTM... 
seeding with    
/Users/pol/torch/install/bin/luajit: /Users/pol/torch/install/share/lua/5.1/nn/Linear.lua:46: expected arguments: *DoubleTensor~2D* [DoubleTensor~2D] [double] DoubleTensor~2D DoubleTensor~2D | *DoubleTensor~2D* double [DoubleTensor~2D] double DoubleTensor~2D DoubleTensor~2D
stack traceback:
    [C]: in function 'addmm'
    /Users/pol/torch/install/share/lua/5.1/nn/Linear.lua:46: in function 'func'
    /Users/pol/torch/install/share/lua/5.1/nngraph/gmodule.lua:214: in function 'neteval'
    /Users/pol/torch/install/share/lua/5.1/nngraph/gmodule.lua:244: in function 'forward'
    sample.lua:88: in main chunk
    [C]: in function 'dofile'
    .../pol/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x0105f02340
@swisspol
Author

This other tool works fine, however:

$ th inspect_checkpoint.lua cv/lm_lstm_epoch18.96_1.4228.t7 
using CUDA on GPU 0...  
opt:    
{
  max_epochs : 30
  seed : 123
  batch_size : 100
  gpuid : 0
  decay_rate : 0.95
  savefile : "lstm"
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare"
  seq_length : 50
  num_layers : 2
  rnn_size : 100
  train_frac : 0.95
  learning_rate : 0.002
  dropout : 0
  eval_val_every : 1000
  val_frac : 0.05
  checkpoint_dir : "cv"
}
val losses: 
{
  2000 : 1.5233611306277
  3000 : 1.4519253438169
  4000 : 1.4227915313027
  1000 : 1.7233323420178
}

@karpathy
Owner

Did you train one with GPU and the other with CPU? Check the "gpuid" flag. Is it "0" on both models?

@swisspol
Author

Yes, GPU on both. What's interesting is that checkpoints created later in the process do work, as does the very last one.

@antonmil

@swisspol, I'm running into the same issue inside an iTorch notebook environment, but it works fine on the standard command line. I'm very new to Lua / Torch, but it would be good to figure out what's causing this.

@PaulSchnau

I had this error on OSX 10.10. Using -opencl 1 on both train.lua and sample.lua made it work.
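
For reference, that workaround amounts to something like the following (the checkpoint name is a placeholder, and -opencl 1 assumes the OpenCL Torch packages are installed):

$ th train.lua -data_dir data/tinyshakespeare -opencl 1
$ th sample.lua cv/some_checkpoint.t7 -opencl 1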

@svickers

svickers commented Aug 8, 2015

Had this problem on a c2.2xl instance. Tried -opencl 1 but no luck.

@hughperkins
Contributor

@svickers can you post the fourth line of your output, the one that has 'Linear.lua:46' in it? Edit: and also the results of inspect_checkpoint.

@svickers

svickers commented Aug 8, 2015

@hughperkins Sorry Hugh! I killed that VM and went to a g2.2xl, and everything worked straight away.

@hughperkins
Contributor

@svickers oh, nice! Hmmm, c2s don't actually have a GPU, right? g2 sounds like a GPU instance?

@quematech

Hello.

I ran into the same problem and am dealing with a lot of frustration in my holy quest of running tests on a Monty Python's Flying Circus corpus :). The error also appears when trying the tiny Shakespeare data set.

I run CPU-only (no GPU). The training goes well, with no NaN values:

th train.lua -data_dir data/tinyshakespeare/ -gpuid -1

loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an lstm with 2 layers
number of parameters in the model: 240321
cloning rnn
cloning criterion
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 0.34s
2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 0.28s
3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 0.28s

Sampling raises an error whatever checkpoint file is used:

th sample.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
creating an lstm...
missing seed text, using uniform probability over first character
--------------------------
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Linear.lua:46: invalid arguments: DoubleTensor number DoubleTensor number FloatTensor DoubleTensor
expected arguments: *DoubleTensor~2D* [DoubleTensor~2D] [double] DoubleTensor~2D DoubleTensor~2D | *DoubleTensor~2D* double [DoubleTensor~2D] double DoubleTensor~2D DoubleTensor~2D
stack traceback:
        [C]: in function 'addmm'
        /usr/local/share/lua/5.1/nn/Linear.lua:46: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:253: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:288: in function 'forward'
        sample.lua:151: in main chunk
        [C]: in function 'dofile'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
        [C]: at 0x00406720
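
Note the "invalid arguments" line above: a FloatTensor appears among the DoubleTensors, which suggests the model loaded from the checkpoint and the tensors built by sample.lua ended up with different types. A quick way to check from the th REPL (assuming the checkpoint stores its modules under protos.rnn, as char-rnn's training script appears to):

th> require 'nngraph'
th> checkpoint = torch.load('cv/lm_lstm_epoch26.00_1.3900.t7')
th> params = checkpoint.protos.rnn:parameters()
th> print(params[1]:type())   -- e.g. torch.DoubleTensor; this must match the input tensors' type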

And the inspect script looks OK:

th inspect_checkpoint.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
opt:
{
  max_epochs : 50
  seed : 123
  batch_size : 50
  gpuid : -1
  decay_rate : 0.95
  learning_rate_decay : 0.97
  opencl : 0
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare/"
  seq_length : 50
  num_layers : 2
  learning_rate_decay_after : 10
  rnn_size : 128
  train_frac : 0.95
  dropout : 0
  init_from : ""
  learning_rate : 0.002
  eval_val_every : 1000
  val_frac : 0.05
  savefile : "lstm"
  checkpoint_dir : "cv"
}
val losses:
{
  3000 : 1.4450460764536
  4000 : 1.4213234041304
  5000 : 1.4060113392715
  6000 : 1.389498488439
  8000 : 1.3909428322715
  10000 : 1.4003497627469
  7000 : 1.3937299336865
  9000 : 1.3940925438403
  1000 : 1.7136267190726
  2000 : 1.5211800115534
  11000 : 1.389983844627
}

@karpathy
Owner

karpathy commented Aug 8, 2015

This is a silly bug I think I introduced only a few days ago, unfortunately. Fixing...

@karpathy
Owner

karpathy commented Aug 8, 2015

OK, I think I patched this issue with this commit:
0fb9a77

See if things work properly now with the new sampling script. The issue is that CPU models use doubles, but when I was converting GPU models I converted them to float() and then changed the sampling script to use float(), which broke the previous CPU-only functionality. Sorry about the mess; when I originally designed this code I always used GPUs, so I didn't anticipate the conversion issues, or that training on CPU or converting GPU->CPU would be a common use case.
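
A rough sketch of the idea, not the literal patch in 0fb9a77: pick one tensor type based on how sampling will run, and cast the loaded model to it before building any input tensors (field names follow the char-rnn checkpoint format):

-- hypothetical simplification of the type handling the sampling script needs
require 'torch'
require 'nn'
require 'nngraph'

local gpuid = -1                                   -- CPU-only sampling in this sketch
local checkpoint = torch.load('cv/lm_lstm_epoch26.00_1.3900.t7')
local rnn = checkpoint.protos.rnn

if gpuid >= 0 then
    require 'cutorch'; require 'cunn'
    rnn:cuda()                                     -- GPU path: CudaTensors throughout
else
    rnn:double()                                   -- CPU path: DoubleTensors throughout,
end                                                -- even if the checkpoint was saved as float

-- every tensor later fed to rnn:forward() must then be created with that same type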

@quematech

Looks like it works. Thank you and may my cat bless you, m'lord!

@nielmclaren

Fixed it for me, too. Thanks @karpathy!

@ghost

ghost commented Aug 10, 2015

Great job @karpathy! Had the same issue and now it works perfectly.
