CUDA out of memory when running on single GPU #14

Closed
JoseGuilhermeCR opened this issue May 26, 2022 · 3 comments

@JoseGuilhermeCR

Hello there.

I've been trying to run the training for a few hours. My specs are:
Nvidia RTX 2070, 32GB of RAM and a Ryzen 3700X.
Fedora 36 with the proprietary Nvidia driver (510.68.02, CUDA version 11.6).

When I run the command python main.py train --conf conf/wikisql.conf --gpu 0, it stops shortly after printing "start training" and reports that it failed to allocate memory on the GPU.

If I don't specify --gpu 0, it gives the same output, which leads me to think that it's using the GPU either way.

This is the stack trace produced:

start training
Traceback (most recent call last):
  File "/home/djouze/dev/HydraNet-WikiSQL/main.py", line 70, in <module>
    cur_loss = model.train_on_batch(batch)
  File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 41, in train_on_batch
    batch_loss = torch.mean(self.model(**batch)["loss"])
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 114, in forward
    bert_output, pooled_output = self.base_model(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 815, in forward
    encoder_outputs = self.encoder(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 508, in forward
    layer_outputs = layer_module(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 395, in forward
    self_attention_outputs = self.attention(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 323, in forward
    self_outputs = self.self(
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 253, in forward
    attention_probs = self.dropout(attention_probs)
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 1279, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 7.79 GiB total capacity; 6.05 GiB already allocated; 158.19 MiB free; 6.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
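
The last line of the trace points at max_split_size_mb and PYTORCH_CUDA_ALLOC_CONF. In this trace, reserved memory (6.08 GiB) is barely above allocated memory (6.05 GiB), so fragmentation is not the real problem and reducing the batch size (as discussed below) is the actual fix; still, for reference, the allocator option can be set before PyTorch touches the GPU. A minimal sketch (the 128 MiB split size is only an example value, not something recommended in this thread):

    import os

    # Allocator hint named in the error message; it must be set before the CUDA
    # caching allocator is initialized, so do it before importing torch.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # imported after setting the env var on purpose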

My last attempt was with the Docker image: after training started, memory consumption reached 31GB, but the container seems to have crashed without producing any output.

Could you help me figure out why this is happening? I assume it should be able to run on a single-GPU setup, although in that case it would likely take more time (correct me if I'm wrong).

Thank you for your attention,
José

@lyuqin
Owner

lyuqin commented May 26, 2022

Hi José,
Yes, this is an OOM error. Please change batch_size in wikisql.conf to a smaller value. The current 256 setting was used on my GPU cluster with 4x48GB of memory; for 1x32GB I recommend trying batch_size=32 first.
Note that a smaller batch size could result in lower accuracy; please refer to my reply here: #9
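
A smaller batch will fit in memory, but as noted above it can cost accuracy. A common general workaround (plain PyTorch, not something this thread confirms HydraNet supports) is gradient accumulation: run several small forward/backward passes and take one optimizer step, so the effective batch size stays large while peak GPU memory follows the micro-batch. A generic sketch, with model, optimizer and train_loader as placeholder arguments:

    import torch

    def train_with_accumulation(model, optimizer, train_loader, accum_steps=32):
        """Generic gradient-accumulation loop; the arguments are placeholders,
        not HydraNet's actual API."""
        model.train()
        optimizer.zero_grad()
        for step, batch in enumerate(train_loader):
            # Same call shape as train_on_batch in modeling/torch_model.py.
            loss = torch.mean(model(**batch)["loss"])
            (loss / accum_steps).backward()  # scale so accumulated grads average out
            if (step + 1) % accum_steps == 0:
                optimizer.step()             # one update per accum_steps micro-batches
                optimizer.zero_grad()

With micro-batches of 8 and accum_steps=32, the effective batch is 256, matching the original setting.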

@lyuqin
Owner

lyuqin commented May 26, 2022

Sorry, one correction:

  1. If you specify --gpu 0, then it uses the Nvidia RTX 2070 (8GB of GPU memory?), so batch_size should be ~8 (a quick check of the card's memory is sketched below).
  2. If you don't specify --gpu, then it uses RAM (32GB), so batch_size should be 32.

If there is still an OOM issue, please keep decreasing batch_size until there is no error.
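
For reference, a quick way to confirm how much memory PyTorch actually sees on the card (generic PyTorch, not part of this repo):

    import torch

    # Report the device PyTorch would use as GPU 0, if any.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"{props.name}: {props.total_memory / 1024**3:.2f} GiB total")
    else:
        print("No CUDA device visible to PyTorch.")

The traceback above reports 7.79 GiB total capacity on GPU 0, which matches an 8GB RTX 2070.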

@JoseGuilhermeCR
Author

Thank you for answering so promptly!

I'm actually not too familiar with all of this; I'm just helping someone run it on my computer because they thought it would work out of the box.

I'll try running with a smaller batch size. I'd also like to note that not specifying --gpu was not enough to make it use the CPU and RAM, at least not as far as I could tell.
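
If forcing CPU is actually the goal, a general way to do it regardless of how the script handles --gpu is to hide the GPU from the process before PyTorch initializes CUDA, e.g. via the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch (independent of this repo's command-line options):

    import os

    # Hide all CUDA devices from this process; this must happen before torch
    # initializes CUDA, so set it before importing torch.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    import torch

    print(torch.cuda.is_available())  # expected: False once the GPU is hidden

Note that this only makes the GPU invisible; whether the training script then runs correctly on CPU depends on the code itself.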

lyuqin closed this as completed on Jun 22, 2022