RuntimeError: CUDA out of memory. #8

Open
loretoparisi opened this issue Nov 22, 2021 · 6 comments

@loretoparisi

Train command

%cd /home/ec2-user/SageMaker/SDR
!python sdr_main.py --dataset_name wines

Stacktrace:

Traceback (most recent call last):
  File "sdr_main.py", line 80, in <module>
    main()
  File "sdr_main.py", line 28, in main
    main_train(model_class_pointer, hyperparams,parser)
  File "sdr_main.py", line 72, in main_train
    trainer.fit(model)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 550, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 692, in run_training_batch
    self.trainer.hiddens)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 806, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 319, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/dp_accelerator.py", line 117, in training_step
    return self._step(args)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/dp_accelerator.py", line 113, in _step
    output = self.trainer.model(*args)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 93, in forward
    return self.module.training_step(*inputs[0], **kwargs[0])
  File "/home/ec2-user/SageMaker/SDR/models/doc_similarity_pl_template.py", line 49, in training_step
    batch = self(batch)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/SageMaker/SDR/models/SDR/SDR.py", line 78, in forward
    eval(f"self.forward_{self.hparams.mode}")(batch)
  File "/home/ec2-user/SageMaker/SDR/models/SDR/SDR.py", line 48, in forward_train
    run_mlm=True,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/SageMaker/SDR/models/SDR/similarity_modeling.py", line 129, in forward
    return_dict=return_dict,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 835, in forward
    return_dict=return_dict,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 490, in forward
    output_attentions,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 433, in forward
    self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1597, in apply_chunking_to_forward
    return forward_fn(*input_tensors)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in feed_forward_chunk
    intermediate_output = self.intermediate(attention_output)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 367, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/functional.py", line 1556, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 14.76 GiB total capacity; 11.17 GiB already allocated; 14.75 MiB free; 11.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
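
The allocator hint at the end of the trace can also be acted on directly. A minimal sketch, assuming the environment variable is set before the first CUDA allocation; the 128 MiB value is only an illustrative choice, not a tested setting for SDR:

import os

# Hypothetical mitigation for fragmentation, following the hint in the error message above.
# Must be set before torch initializes its CUDA caching allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set
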
@weicheng113

Late reply, but in case anyone needs this in the future: I set train_batch_size=4 instead of 32 in utils/argparse_init.py. With that change it trains without CUDA out-of-memory errors on a 16 GB GPU machine.

parser.add_argument(
    "--train_batch_size",
    default={"document_similarity": 4}[task_name],
    type=int,
    help="Number of samples in batch",
)

I think 16 GB or 12 GB GPUs (single or multiple) are quite common. If the author could provide instructions or an option for training on these machines, that would be very helpful. Thanks.

@weicheng113

weicheng113 commented Mar 16, 2022

I got it to train on multiple GPUs (so that I can train on AWS more easily; a single GPU with 32 GB or more is only available on p3dn.24xlarge and p4d.24xlarge, which is expensive and wastes resources). I made the following modifications in case anyone needs this in the future.

# in sdr_main.py
trainer = pytorch_lightning.Trainer(
    ...
    distributed_backend="dp",
    ...
)

# in argparse_init.py
parser.add_argument("--gpus", default=2, type=str, help="gpu count")  # specify the GPU count you have

# in SDR.py, add the following method; the reason is that we get one loss per GPU under dp.
def training_step_end(self, training_step_outputs):
    return {'loss': training_step_outputs['loss'].sum()}
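
For readers unfamiliar with why this method is needed: under the dp backend, Lightning runs training_step once per GPU and passes the collected outputs to training_step_end, so the 'loss' entry arrives with one value per GPU and must be reduced to a scalar before backward. A minimal self-contained sketch of the pattern (the toy module below is illustrative, not the actual SDR model; .sum() matches the snippet above, .mean() is also common):

import torch
import pytorch_lightning as pl

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        # Under dp, 'loss' is a tensor with one entry per GPU; reduce it to a scalar.
        return {"loss": training_step_outputs["loss"].sum()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)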

@hassiahk

@weicheng113 what is the configuration you used for multiple GPUs? 16GB each?

And did you change any other params like train_batch_size etc while training on multiple GPUs?

@weicheng113

@hassiahk 16GB each. Please check the last two comments I made above.

@hassiahk

hassiahk commented Apr 1, 2022

> @hassiahk 16GB each. Please check the last two comments I made above.

So you could only run with train_batch_size=4 and not 32 using multiple GPUs?

@gauraviiita

gauraviiita commented Jul 11, 2022

I faced the same problem and resolved it by downgrading PyTorch from 1.10.1 to 1.8.1 with CUDA 11.3.
In my case I am using an RTX 3060 GPU, which works only with CUDA 11.3 or above, and when I installed CUDA 11.3 it came with PyTorch 1.10.1. So I downgraded the PyTorch version, and now it is working fine.

$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
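
To confirm the downgrade took effect, a quick check from Python (the expected version string assumes the pip command above):

import torch

print(torch.__version__)              # expected: 1.8.1+cu111 after the install above
print(torch.version.cuda)             # CUDA build the wheel was compiled against
print(torch.cuda.is_available())      # should be True once the RTX 3060 is visible
print(torch.cuda.get_device_name(0))  # e.g. the installed GPU's name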
