Adding notes on memory requirements? #3

Closed
kylemcdonald opened this issue Nov 20, 2015 · 6 comments
Comments

@kylemcdonald

I'm going to test this on a real computer tomorrow, but testing today on the 2GB GPU in my laptop I get an out-of-memory error with the 600MB pre-trained model.

I tried shutting everything else down in hope that 2GB was almost enough to run the model, but it doesn't seem to help (or even change the error message).

I tried running on the CPU using combinations of -gpuid -1 and -backend nn, but I get different errors. Here are all the errors, in order:

kyle@kyle ~/D/L/neuraltalk2 (master)> th eval.lua -model models/checkpoint_v1.t7 -image_folder images/
DataLoaderRaw loading images from folder:   images/ 
listing all images in directory images/ 
DataLoaderRaw found 8 images    
/Users/kyle/torch/install/bin/luajit: ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:99: cuda runtime error (2) : out of memory at /Users/kyle/torch/extra/cutorch/lib/THC/THCStorage.cu:44
stack traceback:
    [C]: in function 'resizeAs'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:99: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:339: in function 'updateOutput'
    /Users/kyle/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    eval.lua:115: in function 'eval_split'
    eval.lua:163: in main chunk
    [C]: in function 'dofile'
    ...kyle/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010b4892d0
kyle@kyle ~/D/L/neuraltalk2 (master) [1]> th eval.lua -backend nn -model models/checkpoint_v1.t7 -image_folder images/
/Users/kyle/torch/install/bin/luajit: /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: unknown Torch class <cudnn.SpatialConvolution>
stack traceback:
    [C]: in function 'error'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:272: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:319: in function 'load'
    eval.lua:64: in main chunk
    [C]: in function 'dofile'
    ...kyle/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010f3862d0
kyle@kyle ~/D/L/neuraltalk2 (master) [1]> th eval.lua -gpuid -1 -model models/checkpoint_v1.t7 -image_folder images/
/Users/kyle/torch/install/bin/luajit: /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: unknown Torch class <torch.CudaTensor>
stack traceback:
    [C]: in function 'error'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:262: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:272: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:288: in function 'readObject'
    /Users/kyle/torch/install/share/lua/5.1/torch/File.lua:319: in function 'load'
    eval.lua:64: in main chunk
    [C]: in function 'dofile'
    ...kyle/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010e5a42d0
@karpathy
Owner

Can you try decreasing the batch size to 1, and then decreasing the RNN hidden size to, say, 128? 2GB is really very little.

@karpathy
Owner

The CPU error is because you're using my pretrained checkpoint, which was trained on GPU, so loading it expects the cudnn and CUDA tensor classes named in your error messages. Try running on CPU from scratch; that should work OK.

@kylemcdonald
Author

kylemcdonald commented Nov 21, 2015

adding -batch_size 1 fixed it! thanks a bunch!

yes, i just noticed the CPU/GPU thing. i tried moving the require out of the if but that just created another problem, apparently it's not that simple :)

@karpathy
Owner

ok, you also want to make batch size as large as you can fit, by the way, and you should expect to probably decrease the learning rate (if training).

@kylemcdonald
Author

i don't understand how you have time to be so helpful & encouraging, and work on a phd at the same time. thank you so much! :)

edit: for the record, -batch_size 8 is the highest i can go with the net you posted.
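
For anyone repeating this search for the largest batch size that fits, here is a minimal shell sketch. It assumes that a failed run exits with a non-zero status (the `[1]` in the prompts above suggests it does) and that `-batch_size` is accepted as in the comment above. The helper name `find_max_batch` is hypothetical and not part of neuraltalk2. It halves the batch size after each failure, so it finds the first size in the sequence START, START/2, ... that works, not necessarily the exact maximum, and it treats any failure as "does not fit", not only out-of-memory errors.

```shell
#!/bin/sh
# find_max_batch START CMD [ARGS...]   (hypothetical helper, not part of neuraltalk2)
# Runs "CMD ARGS... -batch_size N" for N = START, START/2, ..., 1 and prints
# the first N for which the command exits with status 0.
# Returns 1 if no attempted batch size succeeds.
find_max_batch() {
    n=$1
    shift
    while [ "$n" -ge 1 ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" -batch_size "$n" >/dev/null 2>&1; then
            echo "$n"
            return 0
        fi
        n=$((n / 2))
    done
    return 1
}
```

A possible invocation, reusing the command from the original report: `find_max_batch 16 th eval.lua -model models/checkpoint_v1.t7 -image_folder images/`. Each attempt reloads the model, so a coarse halving search keeps the number of runs small; a finer search between the last failing and the first working size could follow if the exact limit matters.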

@susemeee

I was trying to train a CPU-based model and encountered the same OOM error on Ubuntu. I tried adjusting batch_size and it works perfectly now. Thanks for the detailed explanation!


3 participants