
Possible bottleneck? #13

Open
Vadim2S opened this issue Jul 12, 2021 · 3 comments

Vadim2S commented Jul 12, 2021

I got this warning:

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: Dataloader(num_workers>0) and ddp_spawn do not mix well! Your performance might suffer dramatically. Please consider setting distributed_backend=ddp to use num_workers > 0 (this is a bottleneck of Python .spawn() and PyTorch

Is this OK?
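
(For context, here is a minimal sketch of the setup this warning refers to, assuming the PyTorch Lightning 1.0-era API used in this repo; the dataset, batch size, and GPU count below are made-up placeholders, not values taken from the project.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import Trainer

# Placeholder data, just to show a DataLoader with num_workers > 0.
dataset = TensorDataset(torch.randn(128, 80))
train_loader = DataLoader(dataset, batch_size=16, num_workers=4)

# The combination the warning complains about: 'ddp_spawn' (the default for
# multi-GPU runs here) plus num_workers > 0 is slow because of Python's .spawn().
trainer = Trainer(gpus=2, distributed_backend='ddp_spawn')

# What the warning recommends instead: one persistent process per GPU.
trainer = Trainer(gpus=2, distributed_backend='ddp')
```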

wookladin (Contributor) commented:

Hi. In my case, I tried using distributed_backend='ddp', as that warning recommends.
However, a multi-GPU training error occurs in the following situations:

  • when the first GPU (i.e. ID 0) is not included in the GPUs list. For example:
    python synthesizer_trainer.py -g 1,2,3
  • when the GPUs list is not sequential. For example:
    python synthesizer_trainer.py -g 0,2,3

For more on the issue mentioned above, see Lightning-AI/pytorch-lightning#4171.

This error is caused by pytorch-lightning itself and can be resolved by upgrading it.

As the warning says, using DDP together with num_workers > 0 makes initialization and training faster.
If you want that speed-up with the current code:

  1. Change accelerator=None to accelerator='ddp' in synthesizer_trainer.py and cotatron_trainer.py (see the sketch below).
  2. After that, if you want to use GPUs 1, 2, and 4,
    run CUDA_VISIBLE_DEVICES=1,2,4 python3 synthesizer_trainer.py instead of passing -g 1,2,4.
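
A minimal sketch of what steps 1 and 2 could look like, assuming the Trainer in synthesizer_trainer.py is built roughly like this (every argument other than accelerator is a placeholder, not the repo's actual setting):

```python
from pytorch_lightning import Trainer

# Step 1: switch the accelerator from None (which falls back to ddp_spawn on
# multiple GPUs) to 'ddp', so each GPU runs in its own persistent process.
trainer = Trainer(
    gpus=-1,            # use every GPU visible to the process
    accelerator='ddp',  # was: accelerator=None
)

# Step 2: pick GPUs with CUDA_VISIBLE_DEVICES instead of the -g option, e.g.
#   CUDA_VISIBLE_DEVICES=1,2,4 python3 synthesizer_trainer.py
# Inside the process those three devices are renumbered 0, 1, 2, which avoids
# the non-sequential / missing-GPU-0 problems described above.
```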

wookladin (Contributor) commented:

To completely solve this problem, we need to upgrade the PyTorch Lightning dependency. However, there are conflicts between PL versions, so we plan to check them carefully.
Thank you for sharing the issue!

wookladin pinned this issue Jul 15, 2021
Vadim2S (Author) commented Jul 26, 2021

Unfortunately, accelerator='ddp' is not stable; accelerator=None is OK.

File "/home/assem-vc/synthesizer_trainer.py", line 85, in
main(args)
File "/home/assem-vc/synthesizer_trainer.py", line 64, in main
trainer.fit(model)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
results = self.accelerator_backend.train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train
results = self.train_or_test()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 482, in train
self.train_loop.run_training_epoch()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
self.accumulated_loss.append(opt_closure_result.loss)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py", line 64, in append
x = x.to(self.memory)
RuntimeError: CUDA error: the launch timed out and was terminated
Exception in thread Thread-22:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, *self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3f13358193 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17f66 (0x7f3f13595f66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x19cbd (0x7f3f13597cbd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f3f1334863d in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: c10d::Reducer::~Reducer() + 0x449 (0x7f3eff7e9b89 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f3eff7cb592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f3eff034e56 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x9e813b (0x7f3eff7cc13b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x293f30 (0x7f3eff077f30 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2951ce (0x7f3eff0791ce in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python3() [0x5d1ca7]
frame #11: /usr/bin/python3() [0x5a605d]
frame #12: /usr/bin/python3() [0x5d1ca7]
frame #13: /usr/bin/python3() [0x5a3132]
frame #14: /usr/bin/python3() [0x4ef828]
frame #15: _PyGC_CollectNoFail + 0x2f (0x6715cf in /usr/bin/python3)
frame #16: PyImport_Cleanup + 0x244 (0x683bf4 in /usr/bin/python3)
frame #17: Py_FinalizeEx + 0x7f (0x67eaef in /usr/bin/python3)
frame #18: Py_RunMain + 0x32d (0x6b624d in /usr/bin/python3)
frame #19: Py_BytesMain + 0x2d (0x6b64bd in /usr/bin/python3)
frame #20: __libc_start_main + 0xf3 (0x7f3f1f2e30b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: _start + 0x2e (0x5f927e in /usr/bin/python3)
