
Possible bottleneck? #13

Open
Vadim2S opened this issue Jul 12, 2021 · 3 comments

Vadim2S commented Jul 12, 2021

I got this warning:

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: Dataloader(num_workers>0) and ddp_spawn do not mix well! Your performance might suffer dramatically. Please consider setting distributed_backend=ddp to use num_workers > 0 (this is a bottleneck of Python .spawn() and PyTorch

Is this OK?
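
(For context, here is a minimal sketch of the setup this warning refers to, assuming the PyTorch Lightning 1.0-era API used in this repo; the dataset, batch size, and GPU count below are made-up placeholders, not values taken from the project.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import Trainer

# Placeholder data, just to show a DataLoader with num_workers > 0.
dataset = TensorDataset(torch.randn(128, 80))
train_loader = DataLoader(dataset, batch_size=16, num_workers=4)

# The combination the warning complains about: 'ddp_spawn' (the default for
# multi-GPU runs here) plus num_workers > 0 is slow because of Python's .spawn().
trainer = Trainer(gpus=2, distributed_backend='ddp_spawn')

# What the warning recommends instead: one persistent process per GPU.
trainer = Trainer(gpus=2, distributed_backend='ddp')
```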

wookladin (Contributor) commented:

Hi. In my case, I tried using distributed_backend='ddp', as that warning recommends.
However, a multi-GPU training error occurs in the following situations:

  • when the first GPU (i.e. ID 0) is not included in the GPUs list. For example:
    python synthesizer_trainer.py -g 1,2,3
  • when the GPUs list is not sequential. For example:
    python synthesizer_trainer.py -g 0,2,3

For more on the issue mentioned above, see Lightning-AI/pytorch-lightning#4171.

This error is caused by pytorch-lightning itself and can be resolved by upgrading it.

As the warning says, using DDP together with num_workers > 0 makes initialization and training faster.
If you want that speed-up with the current code:

  1. Change accelerator=None to accelerator='ddp' in synthesizer_trainer.py and cotatron_trainer.py (see the sketch below).
  2. After that, if you want to use GPUs 1, 2, and 4,
    run CUDA_VISIBLE_DEVICES=1,2,4 python3 synthesizer_trainer.py instead of passing -g 1,2,4.
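
A minimal sketch of what steps 1 and 2 could look like, assuming the Trainer in synthesizer_trainer.py is built roughly like this (every argument other than accelerator is a placeholder, not the repo's actual setting):

```python
from pytorch_lightning import Trainer

# Step 1: switch the accelerator from None (which falls back to ddp_spawn on
# multiple GPUs) to 'ddp', so each GPU runs in its own persistent process.
trainer = Trainer(
    gpus=-1,            # use every GPU visible to the process
    accelerator='ddp',  # was: accelerator=None
)

# Step 2: pick GPUs with CUDA_VISIBLE_DEVICES instead of the -g option, e.g.
#   CUDA_VISIBLE_DEVICES=1,2,4 python3 synthesizer_trainer.py
# Inside the process those three devices are renumbered 0, 1, 2, which avoids
# the non-sequential / missing-GPU-0 problems described above.
```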

wookladin (Contributor) commented:

To completely solve this problem, we need to upgrade the PyTorch Lightning dependency. However, there are conflicts between PL versions, so we plan to check them carefully.
Thank you for sharing the issue!

wookladin pinned this issue Jul 15, 2021
Vadim2S (Author) commented Jul 26, 2021

Unfortunately, accelerator='ddp' is not stable; accelerator=None is OK.

File "/home/assem-vc/synthesizer_trainer.py", line 85, in
main(args)
File "/home/assem-vc/synthesizer_trainer.py", line 64, in main
trainer.fit(model)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
results = self.accelerator_backend.train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train
results = self.train_or_test()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 482, in train
self.train_loop.run_training_epoch()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
self.accumulated_loss.append(opt_closure_result.loss)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py", line 64, in append
x = x.to(self.memory)
RuntimeError: CUDA error: the launch timed out and was terminated
Exception in thread Thread-22:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, *self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3f13358193 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17f66 (0x7f3f13595f66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x19cbd (0x7f3f13597cbd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f3f1334863d in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: c10d::Reducer::~Reducer() + 0x449 (0x7f3eff7e9b89 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f3eff7cb592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f3eff034e56 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x9e813b (0x7f3eff7cc13b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x293f30 (0x7f3eff077f30 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2951ce (0x7f3eff0791ce in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python3() [0x5d1ca7]
frame #11: /usr/bin/python3() [0x5a605d]
frame #12: /usr/bin/python3() [0x5d1ca7]
frame #13: /usr/bin/python3() [0x5a3132]
frame #14: /usr/bin/python3() [0x4ef828]
frame #15: _PyGC_CollectNoFail + 0x2f (0x6715cf in /usr/bin/python3)
frame #16: PyImport_Cleanup + 0x244 (0x683bf4 in /usr/bin/python3)
frame #17: Py_FinalizeEx + 0x7f (0x67eaef in /usr/bin/python3)
frame #18: Py_RunMain + 0x32d (0x6b624d in /usr/bin/python3)
frame #19: Py_BytesMain + 0x2d (0x6b64bd in /usr/bin/python3)
frame #20: __libc_start_main + 0xf3 (0x7f3f1f2e30b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: _start + 0x2e (0x5f927e in /usr/bin/python3)
