Bug with num_workers=8 #30
Closed

skol101 opened this issue Jan 11, 2023 · 10 comments
skol101 commented Jan 11, 2023

Executing the command CUDA_VISIBLE_DEVICES="0" python train.py -c configs/freevc.json -m freevc gives the error below, but the training process continues:

INFO:freevc:Train Epoch: 1 [0%]
INFO:freevc:[6.033028602600098, 4.603592395782471, 0.2524043619632721, 88.27883911132812, 37.568702697753906, 0, 0.0002]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f73359e7430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 14261) is killed by signal: Aborted.
skol101 changed the title from Bug with num_workers >0 to Bug with num_workers=0 Jan 11, 2023
skol101 closed this as completed Jan 11, 2023
skol101 changed the title from Bug with num_workers=0 to Bug with num_workers=8 Jan 11, 2023
skol101 commented Jan 11, 2023

When resuming from your pretrained G and D, I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.71 GiB (GPU 0; 23.69 GiB total capacity; 5.86 GiB already allocated; 2.71 GiB free; 19.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
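
For reference, the allocator hint in that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before PyTorch makes its first CUDA allocation. A minimal sketch, assuming it is placed at the very top of train.py; the 512 value is only an example to tune, not something taken from this thread:

# Sketch: cap the allocator's split block size to reduce CUDA memory fragmentation.
# Must run before the first CUDA allocation; 512 (MiB) is an example value, not a recommendation.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

The variable can equally be prefixed to the training command, the same way CUDA_VISIBLE_DEVICES is above.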

skol101 reopened this Jan 11, 2023
skol101 commented Jan 11, 2023

Works fine with num_workers=4. It's a minor issue, but this could be useful for somebody.
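
A minimal sketch of the kind of change this implies; the dataset and values below are placeholders, since the exact spot where FreeVC's train.py builds its DataLoader is not quoted in this thread:

# Illustrative only: lowering the worker count on a PyTorch DataLoader.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(64, 10))  # stand-in for the real training dataset

train_loader = DataLoader(
    train_dataset,
    batch_size=8,        # example value
    shuffle=True,
    num_workers=4,       # 8 crashed for the reporter; 4 (and 0) worked
    pin_memory=True,
)

for (batch,) in train_loader:  # smoke test: iterate once to spawn and tear down the workers
    pass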

skol101 commented Jan 11, 2023

Hmm, it also crashes during the eval step with num_workers=4:

INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs/freevc/G_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/G_195000.pth' (iteration 2053)
./logs/freevc/D_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/D_195000.pth' (iteration 2053)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:freevc:Train Epoch: 2053 [97%]
INFO:freevc:[2.2948668003082275, 2.467487096786499, 9.334207534790039, 16.157079696655273, 1.7016690969467163, 379800, 0.00015467115812058983]
INFO:freevc:====> Epoch: 2053
INFO:freevc:====> Epoch: 2054
INFO:freevc:Train Epoch: 2055 [5%]
INFO:freevc:[2.3948028087615967, 2.7103328704833984, 10.981183052062988, 18.12336540222168, 1.8913426399230957, 380000, 0.0001546324927477965]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fd1f94cc430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 30152) is killed by signal: Aborted. 
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/G_380000.pth
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/D_380000.pth


OlaWod (Owner) commented Jan 11, 2023

I have not encountered this problem, so currently I tend to think it is due to the machine.

skol101 commented Jan 11, 2023

Yes, maybe it's some local misconfiguration of the environment.

skol101 closed this as completed Jan 11, 2023
skol101 commented Jan 11, 2023

What PyTorch/CUDA versions are you running, please?

skol101 reopened this Jan 11, 2023
OlaWod (Owner) commented Jan 11, 2023

torch 1.10.0
cudatoolkit 11.1.1

skol101 commented Jan 11, 2023

Cheers, mine has:
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0

skol101 closed this as completed Jan 11, 2023
yt605155624 commented

Setting num_workers=0 works well for me.

yt605155624 commented May 24, 2023

Setting persistent_workers=True in the train and eval DataLoader works well for me when I set num_workers>1 (check link).
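
A minimal sketch of that suggestion; the dataset is again a stand-in, and only the persistent_workers flag is the point:

# Illustrative only: persistent_workers keeps the worker processes alive across epochs
# instead of shutting them down after each one (requires num_workers > 0).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 10))  # stand-in dataset

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
)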
