
RuntimeError: CUDA out of memory. #53

Closed
wsnoble opened this issue Aug 2, 2022 · 3 comments · Fixed by #61
Labels
enhancement New feature or request

Comments

@wsnoble
Contributor

wsnoble commented Aug 2, 2022

I tried to use Casanovo to make predictions on an MGF file containing 31,078 spectra, and it ran out of memory. Is there anything I can do to mitigate this problem, other than breaking up the input file into small pieces or switching to a different machine?

casanovo --mode=denovo --model_path=/net/noble/vol1/home/noble/proj/2022_varun_ls-casanovo/data/22-07-02_weights/pretrained_excl_mouse.ckpt --test_data_path=20190227_231_15%_1 --output_path=20190227_231_15%_1 --config_path=config.yaml
INFO: De novo sequencing with Casanovo...
INFO: Created a temporary directory at /tmp/tmpzqps6s6h
INFO: Writing /tmp/tmpzqps6s6h/_remote_module_non_scriptable.py
INFO: Reading 1 files...
20190227_231_15%_1/20190227_231_15%_1.mgf: 31078spectra [00:08, 3647.09spectra/s]
/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.
  f"Passing `Trainer(accelerator={self.distributed_backend!r})` has been deprecated"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:55938 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:55938 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:55938 (errno: 97 - Address family not supported by protocol).
INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing:  35% 11/31 [02:46<03:05,  9.26s/it]Traceback (most recent call last):
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/casanovo.py", line 83, in main
    denovo(test_data_path, model_path, config, output_path)  
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/train_test.py", line 246, in denovo
    trainer.test(model_trained, loaders.test_dataloader())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_evaluating(self)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 206, in start_evaluating
    self._results = trainer.run_stage()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1286, in run_stage
    return self._run_evaluate()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1334, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 213, in _evaluation_step
    output = self.trainer.accelerator.test_step(step_kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 247, in test_step
    return self.training_type_plugin.test_step(*step_kwargs.values())
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 450, in test_step
    return self.lightning_module.test_step(*args, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 403, in test_step
    pred_seqs, scores = self.predict_step(batch)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 188, in predict_step
    return self(batch[0], batch[1])
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 163, in forward
    scores, tokens = self.greedy_decode(spectra, precursors)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/casanovo/denovo/model.py", line 212, in greedy_decode
    memories, mem_masks = self.encoder(spectra)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/depthcharge/components/transformers.py", line 105, in forward
    return self.transformer_encoder(peaks, src_key_padding_mask=mask), mask
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 238, in forward
    output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/noble/vol1/home/noble/miniconda3/envs/casanovo_env/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 456, in forward
    src_mask if src_mask is not None else src_key_padding_mask,
RuntimeError: CUDA out of memory. Tried to allocate 714.00 MiB (GPU 0; 7.79 GiB total capacity; 2.46 GiB already allocated; 632.94 MiB free; 3.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
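
For the record, breaking up the input file can be scripted. Below is a minimal sketch in plain Python (a hypothetical helper, not part of Casanovo), assuming each spectrum in the MGF file is delimited by BEGIN IONS / END IONS lines:

# Hypothetical helper: split a large MGF file into chunks of N spectra so each
# chunk can be sequenced separately when the full file exhausts GPU memory.
def split_mgf(path, spectra_per_chunk=5000):
    chunk_idx, n_spectra = 0, 0
    out = open(f"{path}.part{chunk_idx}.mgf", "w")
    with open(path) as src:
        for line in src:
            out.write(line)
            # Each spectrum in an MGF file ends with an "END IONS" line.
            if line.strip() == "END IONS":
                n_spectra += 1
                if n_spectra == spectra_per_chunk:
                    out.close()
                    chunk_idx, n_spectra = chunk_idx + 1, 0
                    out = open(f"{path}.part{chunk_idx}.mgf", "w")
    out.close()

split_mgf("20190227_231_15%_1/20190227_231_15%_1.mgf")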

@wfondrie
Collaborator

wfondrie commented Aug 2, 2022

You can try decreasing the following parameters:

val_batch_size: 1024
test_batch_size: 1024
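
For example, halving both values in config.yaml might look like the lines below; the right values depend on how much memory your GPU has, so treat these numbers as a starting point only.

val_batch_size: 512
test_batch_size: 512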

However, the error message is a little weird here:

 7.79 GiB total capacity; 2.46 GiB already allocated; 632.94 MiB free; 3.65 GiB reserved in total by PyTorch

This implies to me that there may be another process running on your GPU.
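
A quick way to check whether something else is holding GPU memory is to list the processes on the card, e.g.:

nvidia-smi

The "Processes" section at the bottom of its output shows each process using the GPU and how much memory it has allocated.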

@wsnoble
Contributor Author

wsnoble commented Aug 2, 2022

Thanks. Would it be possible to catch this error and issue a more informative message such as this?

RuntimeError: CUDA out of memory. Tried to allocate 714.00 MiB (GPU 0; 7.79 GiB total capacity; 2.46 GiB already allocated; 632.94 MiB free; 3.65 GiB reserved in total by PyTorch) 
Consider reducing the value of val_batch_size and test_batch_size.

@bittremieux
Collaborator

It's not a super nice design, because the error would need to be caught at a very high level, basically in the outermost main function. Additionally, when catching a RuntimeError we'd have to string match the error message to check whether it's actually a GPU OOM, which is a bit iffy.

I think the message is reasonably informative, and I'd just add a note about it in the FAQ section.
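
For reference, the kind of top-level catch being discussed would look roughly like the sketch below (illustrative only, not how Casanovo handles this; denovo and its arguments are the ones called from main() in casanovo.py, as shown in the traceback above).

import logging

logger = logging.getLogger("casanovo")

try:
    denovo(test_data_path, model_path, config, output_path)
except RuntimeError as err:
    # String matching is the iffy part: in this PyTorch version a GPU OOM is
    # reported as a plain RuntimeError, so the message text is all there is.
    if "CUDA out of memory" in str(err):
        logger.error(
            "Ran out of GPU memory. Consider reducing val_batch_size and "
            "test_batch_size in the configuration file."
        )
    raise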

bittremieux added the enhancement label Aug 5, 2022
bittremieux added a commit that referenced this issue Aug 6, 2022