CUDA out of memory when running on single GPU #14

Closed
JoseGuilhermeCR opened this issue May 26, 2022 · 3 comments

@JoseGuilhermeCR

Hello there.

I've been trying to run the training for a few hours. My specs are:
Nvidia RTX 2070, 32GB of RAM and a Ryzen 3700X.
Fedora 36 with the proprietary Nvidia driver (510.68.02, CUDA version 11.6).

When I run the command python main.py train --conf conf/wikisql.conf --gpu 0, it stops shortly after printing "start training" and reports that it failed to allocate memory on the GPU.

If I don't specify --gpu 0, it gives the same output, which leads me to think that it's using the GPU either way.

This is the stack trace produced:

start training
Traceback (most recent call last):
  File "/home/djouze/dev/HydraNet-WikiSQL/main.py", line 70, in <module>
    cur_loss = model.train_on_batch(batch)
  File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 41, in train_on_batch
    batch_loss = torch.mean(self.model(**batch)["loss"])
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 114, in forward
    bert_output, pooled_output = self.base_model(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 815, in forward
    encoder_outputs = self.encoder(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 508, in forward
    layer_outputs = layer_module(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 395, in forward
    self_attention_outputs = self.attention(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 323, in forward
    self_outputs = self.self(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 253, in forward
    attention_probs = self.dropout(attention_probs)
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 1279, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 7.79 GiB total capacity; 6.05 GiB already allocated; 158.19 MiB free; 6.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
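
The last line of the trace points at max_split_size_mb and PYTORCH_CUDA_ALLOC_CONF. In this trace, reserved memory (6.08 GiB) is barely above allocated memory (6.05 GiB), so fragmentation is not the real problem and reducing the batch size (as discussed below) is the actual fix; still, for reference, the allocator option can be set before PyTorch touches the GPU. A minimal sketch (the 128 MiB split size is only an example value, not something recommended in this thread):

    import os

    # Allocator hint named in the error message; it must be set before the CUDA
    # caching allocator is initialized, so do it before importing torch.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # imported after setting the env var on purpose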

My last attempt was with the Docker image: after training started, memory consumption reached 31GB, but the container seems to have crashed without producing any output.

Could you help me figure out why this is happening? I assume it should be able to run on a single-GPU setup, although in that case it would likely take more time (correct me if I'm wrong).

Thank you for your attention,
José

@lyuqin
Owner

lyuqin commented May 26, 2022

Hi José,
Yes, this is an OOM error. Please change batch_size in wikisql.conf to a smaller value. The current 256 setting was used on my GPU cluster with 4x48GB of memory; for 1x32GB I recommend trying batch_size=32 first.
Note that a smaller batch size could result in lower accuracy; please refer to my reply here: #9
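
A smaller batch will fit in memory, but as noted above it can cost accuracy. A common general workaround (plain PyTorch, not something this thread confirms HydraNet supports) is gradient accumulation: run several small forward/backward passes and take one optimizer step, so the effective batch size stays large while peak GPU memory follows the micro-batch. A generic sketch, with model, optimizer and train_loader as placeholder arguments:

    import torch

    def train_with_accumulation(model, optimizer, train_loader, accum_steps=32):
        """Generic gradient-accumulation loop; the arguments are placeholders,
        not HydraNet's actual API."""
        model.train()
        optimizer.zero_grad()
        for step, batch in enumerate(train_loader):
            # Same call shape as train_on_batch in modeling/torch_model.py.
            loss = torch.mean(model(**batch)["loss"])
            (loss / accum_steps).backward()  # scale so accumulated grads average out
            if (step + 1) % accum_steps == 0:
                optimizer.step()             # one update per accum_steps micro-batches
                optimizer.zero_grad()

With micro-batches of 8 and accum_steps=32, the effective batch is 256, matching the original setting.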

@lyuqin
Owner

lyuqin commented May 26, 2022

Sorry, one correction:

  1. If you specify --gpu 0, then it uses the Nvidia RTX 2070 (8GB of GPU memory?), so batch_size should be ~8 (a quick check of the card's memory is sketched below).
  2. If you don't specify --gpu, then it uses RAM (32GB), so batch_size should be 32.

If there is still an OOM issue, please keep decreasing batch_size until there is no error.
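
For reference, a quick way to confirm how much memory PyTorch actually sees on the card (generic PyTorch, not part of this repo):

    import torch

    # Report the device PyTorch would use as GPU 0, if any.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"{props.name}: {props.total_memory / 1024**3:.2f} GiB total")
    else:
        print("No CUDA device visible to PyTorch.")

The traceback above reports 7.79 GiB total capacity on GPU 0, which matches an 8GB RTX 2070.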

@JoseGuilhermeCR
Author

Thank you for answering so promptly!

I'm actually not too familiar with all of this; I'm just helping someone run it on my computer because they thought it would work out of the box.

I'll try running with a smaller batch size. I'd also like to note that not specifying --gpu was not enough to make it use the CPU and RAM, at least not as far as I could tell.
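
If forcing CPU is actually the goal, a general way to do it regardless of how the script handles --gpu is to hide the GPU from the process before PyTorch initializes CUDA, e.g. via the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch (independent of this repo's command-line options):

    import os

    # Hide all CUDA devices from this process; this must happen before torch
    # initializes CUDA, so set it before importing torch.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    import torch

    print(torch.cuda.is_available())  # expected: False once the GPU is hidden

Note that this only makes the GPU invisible; whether the training script then runs correctly on CPU depends on the code itself.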

lyuqin closed this as completed on Jun 22, 2022