
RuntimeError: CUDA out of memory. #53

Closed
wsnoble opened this issue Aug 2, 2022 · 3 comments · Fixed by #61
Labels
enhancement New feature or request

Comments

@wsnoble
Contributor

wsnoble commented Aug 2, 2022

I tried to use Casanovo to make predictions on an MGF file containing 31,078 spectra, and it ran out of memory. Is there anything I can do to mitigate this problem, other than breaking up the input file into small pieces or switching to a different machine?

casanovo --mode=denovo --model_path=/net/noble/vol1/home/noble/proj/2022_varun_ls-casanovo/data/22-07-02_weights/pretrained_excl_mouse.ckpt --test_data_path=20190227_231_15%_1 --output_path=20190227_231_15%_1 --config_path=config.yaml
INFO: De novo sequencing with Casanovo...
INFO: Created a temporary directory at /tmp/tmpzqps6s6h
INFO: Writing /tmp/tmpzqps6s6h/_remote_module_non_scriptable.py
INFO: Reading 1 files...
20190227_231_15%_1/20190227_231_15%_1.mgf: 31078spectra [00:08, 3647.09spectra/s]
/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.
  f"Passing `Trainer(accelerator={self.distributed_backend!r})` has been deprecated"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:55938 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:55938 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:55938 (errno: 97 - Address family not supported by protocol).
INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing:  35% 11/31 [02:46<03:05,  9.26s/it]Traceback (most recent call last):
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/casanovo.py", line 83, in main
    denovo(test_data_path, model_path, config, output_path)  
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/train_test.py", line 246, in denovo
    trainer.test(model_trained, loaders.test_dataloader())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_evaluating(self)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 206, in start_evaluating
    self._results = trainer.run_stage()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1286, in run_stage
    return self._run_evaluate()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1334, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 213, in _evaluation_step
    output = self.trainer.accelerator.test_step(step_kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 247, in test_step
    return self.training_type_plugin.test_step(*step_kwargs.values())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 450, in test_step
    return self.lightning_module.test_step(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 403, in test_step
    pred_seqs, scores = self.predict_step(batch)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 188, in predict_step
    return self(batch[0], batch[1])
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 163, in forward
    scores, tokens = self.greedy_decode(spectra, precursors)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 212, in greedy_decode
    memories, mem_masks = self.encoder(spectra)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/depthcharge/components/transformers.py", line 105, in forward
    return self.transformer_encoder(peaks, src_key_padding_mask=mask), mask
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 238, in forward
    output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 456, in forward
    src_mask if src_mask is not None else src_key_padding_mask,
RuntimeError: CUDA out of memory. Tried to allocate 714.00 MiB (GPU 0; 7.79 GiB total capacity; 2.46 GiB already allocated; 632.94 MiB free; 3.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
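
For the record, breaking up the input file can be scripted. Below is a minimal sketch in plain Python (a hypothetical helper, not part of Casanovo), assuming each spectrum in the MGF file is delimited by BEGIN IONS / END IONS lines:

# Hypothetical helper: split a large MGF file into chunks of N spectra so each
# chunk can be sequenced separately when the full file exhausts GPU memory.
def split_mgf(path, spectra_per_chunk=5000):
    chunk_idx, n_spectra = 0, 0
    out = open(f"{path}.part{chunk_idx}.mgf", "w")
    with open(path) as src:
        for line in src:
            out.write(line)
            # Each spectrum in an MGF file ends with an "END IONS" line.
            if line.strip() == "END IONS":
                n_spectra += 1
                if n_spectra == spectra_per_chunk:
                    out.close()
                    chunk_idx, n_spectra = chunk_idx + 1, 0
                    out = open(f"{path}.part{chunk_idx}.mgf", "w")
    out.close()

split_mgf("20190227_231_15%_1/20190227_231_15%_1.mgf")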

@wfondrie
Collaborator

wfondrie commented Aug 2, 2022

You can try decreasing the following parameters:

val_batch_size: 1024
test_batch_size: 1024
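
For example, halving both values in config.yaml might look like the lines below; the right values depend on how much memory your GPU has, so treat these numbers as a starting point only.

val_batch_size: 512
test_batch_size: 512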

However, the error message is a little weird here:

 7.79 GiB total capacity; 2.46 GiB already allocated; 632.94 MiB free; 3.65 GiB reserved in total by PyTorch

This implies to me that there may be another process running on your GPU.
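
A quick way to check whether something else is holding GPU memory is to list the processes on the card, e.g.:

nvidia-smi

The "Processes" section at the bottom of its output shows each process using the GPU and how much memory it has allocated.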

@wsnoble
Contributor Author

wsnoble commented Aug 2, 2022

Thanks. Would it be possible to catch this error and issue a more informative message such as this?

RuntimeError: CUDA out of memory. Tried to allocate 714.00 MiB (GPU 0; 7.79 GiB total capacity; 2.46 GiB already allocated; 632.94 MiB free; 3.65 GiB reserved in total by PyTorch) 
Consider reducing the value of val_batch_size and test_batch_size.

@bittremieux
Collaborator

It's not a super nice design, because the error would need to be caught at a very high level, basically in the outermost main function. Additionally, when catching a RuntimeError we'd have to string match the error message to check whether it's actually a GPU OOM, which is a bit iffy.

I think the message is reasonably informative, and I'd just add a note about it in the FAQ section.
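
For reference, the kind of top-level catch being discussed would look roughly like the sketch below (illustrative only, not how Casanovo handles this; denovo and its arguments are the ones called from main() in casanovo.py, as shown in the traceback above).

import logging

logger = logging.getLogger("casanovo")

try:
    denovo(test_data_path, model_path, config, output_path)
except RuntimeError as err:
    # String matching is the iffy part: in this PyTorch version a GPU OOM is
    # reported as a plain RuntimeError, so the message text is all there is.
    if "CUDA out of memory" in str(err):
        logger.error(
            "Ran out of GPU memory. Consider reducing val_batch_size and "
            "test_batch_size in the configuration file."
        )
    raise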

bittremieux added the enhancement label Aug 5, 2022
bittremieux added a commit that referenced this issue Aug 6, 2022