
[Bug] RealBasicVSR training Error : torch.distributed.elastic.multiprocessing.api:failed #1478

Closed · 3 tasks done
gihwan-kim opened this issue Nov 27, 2022 · 12 comments
Labels: kind/bug (something isn't working)

Comments

@gihwan-kim

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmediting

Environment

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.3
OpenCV: 4.5.4
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMEditing: 0.16.0+7b3a8bd

Reproduces the problem - code sample

I just ran the training again.

Reproduces the problem - command or script

./tools/dist_train.sh ./configs/restorers/real_basicvsr/realbasicvsr_wogan_c64b20_2x30x8_lr1e-4_300k_reds.py 1

Reproduces the problem - error message

  File "./tools/train.py", line 169, in <module>
    main()
  File "./tools/train.py", line 165, in main
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 104, in train_model
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 241, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
av.codec.codec.UnknownCodecError: Caught UnknownCodecError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/dataset_wrappers.py", line 31, in __getitem__
    return self.dataset[idx % self._ori_len]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 547, in __call__
    results = degradation(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 465, in __call__
    results[key] = self._apply_random_compression(results[key])
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 434, in _apply_random_compression
    stream = container.add_stream(codec, rate=1)
  File "av/container/output.pyx", line 64, in av.container.output.OutputContainer.add_stream
  File "av/codec/codec.pyx", line 184, in av.codec.codec.Codec.__cinit__
  File "av/codec/codec.pyx", line 193, in av.codec.codec.Codec._init
av.codec.codec.UnknownCodecError: libx264

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10741) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Additional information

I'm trying to train RealBasicVSR to check whether it trains in my environment.
I have a similar problem to an earlier issue, but that issue hasn't been resolved yet.

gihwan-kim added the kind/bug label on Nov 27, 2022
gihwan-kim changed the title to [Bug] RealBasicVSR training Error : torch.distributed.elastic.multiprocessing.api:failed on Nov 27, 2022
@LeoXing1996
Collaborator

Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised.
Which PyAV version are you using?
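
A minimal probe, assuming only PyAV's public API (av.codecs_available and av.Codec), to check whether the installed build actually ships the libx264 encoder that the video-compression degradation tries to open:

    # Sketch only: confirm whether this PyAV build was compiled with libx264.
    import av

    print(av.__version__)                    # e.g. '8.0.2'
    print('libx264' in av.codecs_available)  # False would explain the error above

    try:
        av.Codec('libx264', 'w')             # 'w' requests the encoder, as add_stream() does
        print('libx264 encoder is available')
    except av.codec.codec.UnknownCodecError:
        print('this PyAV build lacks libx264; install a build linked against it')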

@gihwan-kim
Author

> Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised. Which PyAV version are you using?

>>> import av
>>> av.__version__
'8.0.2'

It's version 8.0.2.

@LeoXing1996
Collaborator

@gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?

@gihwan-kim
Author

gihwan-kim commented Nov 29, 2022

> @gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?

Thank you for the kind reply!
After changing the PyAV version to 8.0.3, the UnknownCodecError no longer appears, but a new issue occurred.

My ffmpeg version is 4.3.

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.75 GiB total capacity; 9.28 GiB already allocated; 18.75 MiB free; 9.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12491) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I think this many iterations need a large amount of memory. Should I change the training configuration?

This is the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 35%   29C    P8     1W / 260W |    102MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1116      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A      1243      G   /usr/bin/gnome-shell               60MiB |

@LeoXing1996
Copy link
Collaborator

@gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the log.

I think you can try changing crop_size in the training pipeline to a smaller value to save memory.
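
A hedged sketch of what such overrides could look like in a 0.x config; the keys (workers_per_gpu, samples_per_gpu, num_input_frames, crop_size) are the ones named in this thread, but their exact nesting and the values below are illustrative, not the project's defaults:

    # Illustrative only: exact locations depend on the config file in use.
    data = dict(
        workers_per_gpu=4,                               # fewer dataloader workers
        train_dataloader=dict(samples_per_gpu=1),        # smaller per-GPU batch
        train=dict(dataset=dict(num_input_frames=10)))   # shorter clips per sample

    # crop_size in the training pipeline can likewise be reduced (e.g. 256 -> 128)
    # to shrink activation memory, at the cost of training on smaller patches.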

@gihwan-kim
Author

> @gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the log.
>
> I think you can try changing crop_size in the training pipeline to a smaller value to save memory.

I could solve it by changing the workers_per_gpu, samples_per_gpu, and num_input_frames values in the config file. Thank you!
But while training, I ran into a "file not found" error:

FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/loading.py", line 176, in __call__
    img_bytes = self.file_client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/UDM10/BIx4/archpeople/00000000.png'

My UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png'.
Is there any naming rule or guide for preprocessing UDM10 into BIx4 for RealBasicVSR?
I found guides for REDS and VideoLQ in the paper and the official documentation, but I couldn't find where to download the UDM10 data or how to preprocess it.
I only found a download location from this link: udm10

@Z-Fran
Collaborator

Z-Fran commented Dec 5, 2022

@gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to do it, like this:

    import os

    data_root = 'dataset/data/udm10/'   # original layout: <clip>/truth, <clip>/blur4
    save_root = 'dataset/data/UDM10/'   # target layout: GT/<index>, BDx4/<index>

    os.makedirs(save_root + 'GT/', exist_ok=True)
    os.makedirs(save_root + 'BDx4/', exist_ok=True)

    # Copy each clip into zero-padded, numbered GT/ and BDx4/ folders.
    dirs = sorted(os.listdir(data_root), key=str.lower)
    for num, _dir in enumerate(dirs):
        sub_root1 = save_root + 'GT/' + str(num).zfill(8)
        sub_root2 = save_root + 'BDx4/' + str(num).zfill(8)
        os.system('cp -r ' + data_root + _dir + '/truth/ ' + sub_root1)
        os.system('cp -r ' + data_root + _dir + '/blur4/ ' + sub_root2)

For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. You can refer to https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204
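
For reference, a quick illustration of why the template matters: the loader was looking for eight-digit names (see the FileNotFoundError for 00000000.png above), while '{:03d}.png' produces the three-digit names used by this UDM10 copy:

    >>> '{:08d}.png'.format(0)
    '00000000.png'
    >>> '{:03d}.png'.format(0)
    '000.png'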

@gihwan-kim
Author

> @gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to do it, like this: [...]
>
> For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. You can refer to https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204

Thank you for your kind help!
As I mentioned, I have a question about the UDM10 validation dataset for RealBasicVSR.
I downloaded the udm10 dataset from this link: udm10 download site.
That site's udm10 directory structure is:

./udm10
├── archpeople
│   ├── blur4
│   └── truth
├── archwall
│   ├── blur4
│   └── truth
├── auditorium
│   ├── blur4
│   └── truth
├── band
│   ├── blur4
│   └── truth
├── caffe
│   ├── blur4
│   └── truth
├── camera
│   ├── blur4
│   └── truth
├── clap
│   ├── blur4
│   └── truth
├── lake
│   ├── blur4
│   └── truth
├── photography
│   ├── blur4
│   └── truth
└── polyflow
    ├── blur4
    └── truth

Is the blur4 data the same as BIx4?
Or do I have to preprocess it like in this link?

And does BIx4 mean bicubic interpolation x4 downsampling?

@Z-Fran
Collaborator

Z-Fran commented Dec 6, 2022

Blur4 is not BIx4 or BDx4. BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m . For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of imresize such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py . I can provide my data if you need it.
And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim
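
A hedged sketch of generating BIx4 frames with the linked Python port. It assumes imresize.py from fatheral/matlab_imresize is importable and that its scalar_scale argument takes the scaling factor; the directory names are illustrative and follow the layout mentioned earlier in this thread:

    # Sketch only: bicubic x4 downsampling of GT frames into a BIx4 folder.
    import os
    import cv2
    from imresize import imresize  # fatheral/matlab_imresize/imresize.py

    gt_dir = 'data/UDM10/GT/archpeople'     # example clip folder
    lq_dir = 'data/UDM10/BIx4/archpeople'
    os.makedirs(lq_dir, exist_ok=True)

    for name in sorted(os.listdir(gt_dir)):
        gt = cv2.imread(os.path.join(gt_dir, name)).astype('float64')
        lq = imresize(gt, scalar_scale=0.25)                 # MATLAB-compatible bicubic
        cv2.imwrite(os.path.join(lq_dir, name),
                    lq.clip(0, 255).round().astype('uint8'))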

@gihwan-kim
Author

> Blur4 is not BIx4 or BDx4. BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m . For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of imresize such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py . I can provide my data if you need it. And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim

I will check the imresize code you mentioned, thank you! :)
If it's okay with you, I would be grateful if you could send me your data.

@Z-Fran
Collaborator

Z-Fran commented Dec 7, 2022

@gihwan-kim
Author
