Bug with num_workers=8 #30
Closed

skol101 opened this issue Jan 11, 2023 · 10 comments
skol101 commented Jan 11, 2023

Executing the command CUDA_VISIBLE_DEVICES="0" python train.py -c configs/freevc.json -m freevc gives the error below, but the training process continues:

INFO:freevc:Train Epoch: 1 [0%]
INFO:freevc:[6.033028602600098, 4.603592395782471, 0.2524043619632721, 88.27883911132812, 37.568702697753906, 0, 0.0002]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f73359e7430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 14261) is killed by signal: Aborted.
skol101 changed the title from Bug with num_workers >0 to Bug with num_workers=0 Jan 11, 2023
skol101 closed this as completed Jan 11, 2023
skol101 changed the title from Bug with num_workers=0 to Bug with num_workers=8 Jan 11, 2023
skol101 commented Jan 11, 2023

When resuming from your pretrained G and D, I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.71 GiB (GPU 0; 23.69 GiB total capacity; 5.86 GiB already allocated; 2.71 GiB free; 19.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
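
For reference, the allocator hint in that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before PyTorch makes its first CUDA allocation. A minimal sketch, assuming it is placed at the very top of train.py; the 512 value is only an example to tune, not something taken from this thread:

# Sketch: cap the allocator's split block size to reduce CUDA memory fragmentation.
# Must run before the first CUDA allocation; 512 (MiB) is an example value, not a recommendation.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

The variable can equally be prefixed to the training command, the same way CUDA_VISIBLE_DEVICES is above.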

skol101 reopened this Jan 11, 2023
skol101 commented Jan 11, 2023

Works fine with num_workers=4. It's a minor issue, but this could be useful for somebody.
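
A minimal sketch of the kind of change this implies; the dataset and values below are placeholders, since the exact spot where FreeVC's train.py builds its DataLoader is not quoted in this thread:

# Illustrative only: lowering the worker count on a PyTorch DataLoader.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(64, 10))  # stand-in for the real training dataset

train_loader = DataLoader(
    train_dataset,
    batch_size=8,        # example value
    shuffle=True,
    num_workers=4,       # 8 crashed for the reporter; 4 (and 0) worked
    pin_memory=True,
)

for (batch,) in train_loader:  # smoke test: iterate once to spawn and tear down the workers
    pass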

skol101 commented Jan 11, 2023

Hmm, it also crashes during the eval step with num_workers=4:

INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs/freevc/G_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/G_195000.pth' (iteration 2053)
./logs/freevc/D_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/D_195000.pth' (iteration 2053)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:freevc:Train Epoch: 2053 [97%]
INFO:freevc:[2.2948668003082275, 2.467487096786499, 9.334207534790039, 16.157079696655273, 1.7016690969467163, 379800, 0.00015467115812058983]
INFO:freevc:====> Epoch: 2053
INFO:freevc:====> Epoch: 2054
INFO:freevc:Train Epoch: 2055 [5%]
INFO:freevc:[2.3948028087615967, 2.7103328704833984, 10.981183052062988, 18.12336540222168, 1.8913426399230957, 380000, 0.0001546324927477965]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fd1f94cc430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 30152) is killed by signal: Aborted. 
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/G_380000.pth
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/D_380000.pth


OlaWod (Owner) commented Jan 11, 2023

I have not encountered this problem, so currently I tend to think it is due to the machine.

skol101 commented Jan 11, 2023

Yes, maybe it's some local misconfiguration of the environment.

skol101 closed this as completed Jan 11, 2023
skol101 commented Jan 11, 2023

What PyTorch/CUDA versions are you running, please?

skol101 reopened this Jan 11, 2023
OlaWod (Owner) commented Jan 11, 2023

torch 1.10.0
cudatoolkit 11.1.1

skol101 commented Jan 11, 2023

Cheers, mine has:
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0

skol101 closed this as completed Jan 11, 2023
yt605155624 commented

Setting num_workers=0 works well for me.

yt605155624 commented May 24, 2023

Setting persistent_workers=True in the train and eval DataLoader works well for me when I set num_workers>1 (check link).
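
A minimal sketch of that suggestion; the dataset is again a stand-in, and only the persistent_workers flag is the point:

# Illustrative only: persistent_workers keeps the worker processes alive across epochs
# instead of shutting them down after each one (requires num_workers > 0).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 10))  # stand-in dataset

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
)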
