Hello there.
I've been trying to run the training for a few hours. My specs are:
Nvidia RTX 2070, 32GB of RAM, and a Ryzen 3700X.
Fedora 36 with proprietary Nvidia drivers (510.68.02, CUDA version 11.6).
When I run the command `python main.py train --conf conf/wikisql.conf --gpu 0`, it stops shortly after printing `start training` and says it failed to allocate memory on the GPU.
If I don't specify `--gpu 0`, I get the same output, which leads me to think it's using the GPU either way.
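For reference, a quick way to check which device is actually being used (a minimal sketch; run `nvidia-smi` in another terminal while training is running):

```
nvidia-smi    # lists processes and their GPU memory usage
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```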
This is the stack trace produced:
```
start training
Traceback (most recent call last):
File "/home/djouze/dev/HydraNet-WikiSQL/main.py", line 70, in <module>
cur_loss = model.train_on_batch(batch)
File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 41, in train_on_batch
batch_loss = torch.mean(self.model(**batch)["loss"])
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 114, in forward
bert_output, pooled_output = self.base_model(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 815, in forward
encoder_outputs = self.encoder(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 508, in forward
layer_outputs = layer_module(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 395, in forward
self_attention_outputs = self.attention(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 323, in forward
self_outputs = self.self(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 253, in forward
attention_probs = self.dropout(attention_probs)
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/dropout.py", line 58, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 1279, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 7.79 GiB total capacity; 6.05 GiB already allocated; 158.19 MiB free; 6.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
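The last line of the error suggests tuning PyTorch's caching allocator, which as I understand it is done via the `PYTORCH_CUDA_ALLOC_CONF` environment variable. A minimal sketch (the 128 MiB split size is an arbitrary starting point), though this only mitigates fragmentation and doesn't add capacity:

```
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python main.py train --conf conf/wikisql.conf --gpu 0
```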
My last attempt used the Docker image; after training started, my memory consumption reached 31GB, but the container seems to have crashed without producing any output.
Could you help me find out why that's happening? I assume it would be able to run on a single-GPU setup, although it would likely take more time in that case (correct me if I'm wrong).
Thank you for your attention,
José
Hi José,
Yes, this is an OOM (out-of-memory) error. Please change `batch_size` in `wikisql.conf` to a smaller value. The current setting of 256 was used on my GPU cluster with 4x 48GB of memory; for 1x 32GB I recommend trying `batch_size=32` first.
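The exact syntax depends on the conf file, but the change would look roughly like this (hypothetical excerpt; keep whatever delimiter `conf/wikisql.conf` already uses):

```
batch_size = 32    # was 256; halve again (16, 8, ...) if OOM persists
```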
Note that a smaller batch size could result in lower accuracy; please refer to my reply in #9.
If you specify `--gpu 0`, it uses the Nvidia RTX 2070 (8GB of GPU memory?), so `batch_size` should be around 8.
If you don't specify `--gpu`, it uses RAM (32GB), so `batch_size` should be 32.
If there is still an OOM issue, please keep decreasing `batch_size` until the error goes away.
I'm actually not entirely sure about all of this; I'm just helping someone run it on my computer because they thought it would work out of the box.
I'll try running with a smaller batch size. I'd also like to note that not specifying `--gpu` was not enough to make it use the CPU and RAM, at least not as far as I could tell.
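In case it's useful, a minimal way to force the CPU path, assuming the code only uses CUDA when a device is visible, is to hide the GPUs from PyTorch entirely:

```
CUDA_VISIBLE_DEVICES="" python main.py train --conf conf/wikisql.conf
```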