Adding notes on memory requirements? #3

kylemcdonald · 2015-11-20T23:57:17Z

I'm going to test this on a real computer tomorrow, but testing today on the 2GB GPU on my laptop I get an out of memory error with the 600MB pre-trained model.

I tried shutting everything else down in the hope that 2GB was almost enough to run the model, but it doesn't seem to help (or even change the error message).

I tried running on the CPU using combinations of -gpuid -1 and -backend nn, but I get different errors. Here are all the errors, in order:

kyle@kyle ~/D/L/neuraltalk2 (master)> th eval.lua -model models/checkpoint_v1.t7 -image_folder images/
DataLoaderRaw loading images from folder:   images/ 
listing all images in directory images/ 
DataLoaderRaw found 8 images    
/Users/kyle/torch/install/bin/luajit: ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:99: cuda runtime error (2) : out of memory at /Users/kyle/torch/extra/cutorch/lib/THC/THCStorage.cu:44
stack traceback:
    [C]: in function 'resizeAs'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:99: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:339: in function 'updateOutput'
    /Users/kyle/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    eval.lua:115: in function 'eval_split'
    eval.lua:163: in main chunk
    [C]: in function 'dofile'
    ...kyle/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010b4892d0
kyle@kyle ~/D/L/neuraltalk2 (master) [1]> th eval.lua -backend nn -model models/checkpoint_v1.t7 -image_folder images/
/Users/kyle/torch/install/bin/luajit: /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: unknown Torch class <cudnn.SpatialConvolution>
stack traceback:
    [C]: in function 'error'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:272: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:319: in function 'load'
    eval.lua:64: in main chunk
    [C]: in function 'dofile'
    ...kyle/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010f3862d0
kyle@kyle ~/D/L/neuraltalk2 (master) [1]> th eval.lua -gpuid -1 -model models/checkpoint_v1.t7 -image_folder images/
/Users/kyle/torch/install/bin/luajit: /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: unknown Torch class <torch.CudaTensor>
stack traceback:
    [C]: in function 'error'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:272: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:319: in function 'load'
    eval.lua:64: in main chunk
    [C]: in function 'dofile'
    ...kyle/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010e5a42d0


karpathy · 2015-11-21T00:00:42Z

Can you try decreasing the batch size to 1, and then decreasing the RNN hidden size to, say, 128? 2GB is really very little.

karpathy · 2015-11-21T00:02:06Z

The CPU error is because you're using my pretrained checkpoint, which was trained on GPU. Try running with CPU from scratch, should work ok.
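The two "unknown Torch class" errors above are consistent with this: the checkpoint refers to `cudnn.SpatialConvolution` and `torch.CudaTensor`, which are only known to Torch once the GPU packages are loaded. A minimal sketch of a quick check follows, assuming (this is not confirmed in the thread) that class names are stored as plain ASCII inside the `.t7` file; `has_gpu_classes` is a hypothetical helper name.

```shell
# Report whether a serialized checkpoint mentions GPU-only Torch classes.
# Assumption: class names appear as plain ASCII strings inside the .t7 file.
# Returns 0 (true) if a GPU class name is found, non-zero otherwise.
has_gpu_classes() {
  # -a: treat the binary file as text, -F: fixed strings, -q: no output
  grep -a -q -F -e 'torch.CudaTensor' -e 'cudnn.' "$1"
}
```

Something like `has_gpu_classes models/checkpoint_v1.t7 && echo "needs the GPU packages to load"` would then flag a checkpoint that is unlikely to load in CPU-only mode.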

kylemcdonald · 2015-11-21T00:03:19Z

adding -batch_size 1 fixed it! thanks a bunch!

yes, i just noticed the CPU/GPU thing. i tried moving the require out of the if but that just created another problem, apparently it's not that simple :)

karpathy · 2015-11-21T00:03:59Z

ok, by the way, you also want to make the batch size as large as you can fit, and you should probably expect to decrease the learning rate (if training).
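One way to find the largest batch size that fits is to retry the same command with progressively smaller values until it stops failing. The following is a rough sketch, not part of neuraltalk2: `largest_fitting_batch` is a hypothetical helper, the candidate sizes are an arbitrary assumption, and it relies only on the `-batch_size` flag used in this thread.

```shell
# Try progressively smaller batch sizes until the given command succeeds.
# Usage: largest_fitting_batch CMD [ARGS...]
# Prints the first (largest) batch size for which CMD exits with status 0.
# The candidate list below is an illustrative assumption.
largest_fitting_batch() {
  for bs in 32 16 8 4 2 1; do
    # Append -batch_size to the caller's command and discard its output.
    if "$@" -batch_size "$bs" >/dev/null 2>&1; then
      echo "$bs"
      return 0
    fi
  done
  # No candidate worked (for example, even batch size 1 ran out of memory).
  return 1
}
```

For the command in this thread, that would be `largest_fitting_batch th eval.lua -model models/checkpoint_v1.t7 -image_folder images/`. Each attempt reloads the model, so the search is slow, and any failure (not only out of memory) counts as "does not fit".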

kylemcdonald · 2015-11-21T00:05:50Z

i don't understand how you have time to be so helpful & encouraging, and work on a phd at the same time. thank you so much! :)

edit: for the record, -batch_size 8 is the highest i can go with the net you posted.

susemeee · 2015-11-22T04:23:29Z

I was trying to train a CPU-based model and encountered the same OOM error on Ubuntu. I tried adjusting batch_size and it works perfectly now. Thanks for the detailed explanation!

kylemcdonald closed this as completed Nov 21, 2015

Heavy02011 mentioned this issue Nov 22, 2015

another "out of memory issue" when reading the pretained model #5

Open
