
[Bug] RealBasicVSR training Error : torch.distributed.elastic.multiprocessing.api:failed #1478

Closed · 3 tasks done
gihwan-kim opened this issue Nov 27, 2022 · 12 comments
Labels: kind/bug (something isn't working)

Comments

@gihwan-kim

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmediting

Environment

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.3
OpenCV: 4.5.4
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMEditing: 0.16.0+7b3a8bd

Reproduces the problem - code sample

I just ran the training again.

Reproduces the problem - command or script

./tools/dist_train.sh ./configs/restorers/real_basicvsr/realbasicvsr_wogan_c64b20_2x30x8_lr1e-4_300k_reds.py 1

Reproduces the problem - error message

  File "./tools/train.py", line 169, in <module>
    main()
  File "./tools/train.py", line 165, in main
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 104, in train_model
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 241, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
av.codec.codec.UnknownCodecError: Caught UnknownCodecError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/dataset_wrappers.py", line 31, in __getitem__
    return self.dataset[idx % self._ori_len]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 547, in __call__
    results = degradation(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 465, in __call__
    results[key] = self._apply_random_compression(results[key])
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 434, in _apply_random_compression
    stream = container.add_stream(codec, rate=1)
  File "av/container/output.pyx", line 64, in av.container.output.OutputContainer.add_stream
  File "av/codec/codec.pyx", line 184, in av.codec.codec.Codec.__cinit__
  File "av/codec/codec.pyx", line 193, in av.codec.codec.Codec._init
av.codec.codec.UnknownCodecError: libx264

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10741) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Additional information

I'm trying to train RealBasicVSR to check whether it trains in my environment.
I have a similar problem to an earlier issue, but that issue hasn't been resolved yet.

gihwan-kim added the kind/bug label on Nov 27, 2022
gihwan-kim changed the title to [Bug] RealBasicVSR training Error : torch.distributed.elastic.multiprocessing.api:failed on Nov 27, 2022
@LeoXing1996
Collaborator

Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised.
Which PyAV version are you using?
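
A minimal probe, assuming only PyAV's public API (av.codecs_available and av.Codec), to check whether the installed build actually ships the libx264 encoder that the video-compression degradation tries to open:

    # Sketch only: confirm whether this PyAV build was compiled with libx264.
    import av

    print(av.__version__)                    # e.g. '8.0.2'
    print('libx264' in av.codecs_available)  # False would explain the error above

    try:
        av.Codec('libx264', 'w')             # 'w' requests the encoder, as add_stream() does
        print('libx264 encoder is available')
    except av.codec.codec.UnknownCodecError:
        print('this PyAV build lacks libx264; install a build linked against it')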

@gihwan-kim
Author

> Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised. Which PyAV version are you using?

>>> import av
>>> av.__version__
'8.0.2'

It's version 8.0.2.

@LeoXing1996
Collaborator

@gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?

@gihwan-kim
Author

gihwan-kim commented Nov 29, 2022

> @gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?

Thank you for the kind reply!
After changing the PyAV version to 8.0.3, the UnknownCodecError no longer appears, but a new issue occurred.

My ffmpeg version is 4.3.

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.75 GiB total capacity; 9.28 GiB already allocated; 18.75 MiB free; 9.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12491) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I think this many iterations need a large amount of memory. Should I change the training configuration?

This is the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 35%   29C    P8     1W / 260W |    102MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1116      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A      1243      G   /usr/bin/gnome-shell               60MiB |

@LeoXing1996
Copy link
Collaborator

@gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the log.

I think you can try changing crop_size in the training pipeline to a smaller value to save memory.
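
A hedged sketch of what such overrides could look like in a 0.x config; the keys (workers_per_gpu, samples_per_gpu, num_input_frames, crop_size) are the ones named in this thread, but their exact nesting and the values below are illustrative, not the project's defaults:

    # Illustrative only: exact locations depend on the config file in use.
    data = dict(
        workers_per_gpu=4,                               # fewer dataloader workers
        train_dataloader=dict(samples_per_gpu=1),        # smaller per-GPU batch
        train=dict(dataset=dict(num_input_frames=10)))   # shorter clips per sample

    # crop_size in the training pipeline can likewise be reduced (e.g. 256 -> 128)
    # to shrink activation memory, at the cost of training on smaller patches.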

@gihwan-kim
Author

> @gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the log.
>
> I think you can try changing crop_size in the training pipeline to a smaller value to save memory.

I could solve it by changing the workers_per_gpu, samples_per_gpu, and num_input_frames values in the config file. Thank you!
But while training, I ran into a "file not found" error:

FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/loading.py", line 176, in __call__
    img_bytes = self.file_client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/UDM10/BIx4/archpeople/00000000.png'

My UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png'.
Is there any naming rule or guide for preprocessing UDM10 into BIx4 for RealBasicVSR?
I found guides for REDS and VideoLQ in the paper and the official documentation, but I couldn't find where to download the UDM10 data or how to preprocess it.
I only found a download location from this link: udm10

@Z-Fran
Collaborator

Z-Fran commented Dec 5, 2022

@gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to do it, like this:

    import os

    data_root = 'dataset/data/udm10/'   # original layout: <clip>/truth, <clip>/blur4
    save_root = 'dataset/data/UDM10/'   # target layout: GT/<index>, BDx4/<index>

    os.makedirs(save_root + 'GT/', exist_ok=True)
    os.makedirs(save_root + 'BDx4/', exist_ok=True)

    # Copy each clip into zero-padded, numbered GT/ and BDx4/ folders.
    dirs = sorted(os.listdir(data_root), key=str.lower)
    for num, _dir in enumerate(dirs):
        sub_root1 = save_root + 'GT/' + str(num).zfill(8)
        sub_root2 = save_root + 'BDx4/' + str(num).zfill(8)
        os.system('cp -r ' + data_root + _dir + '/truth/ ' + sub_root1)
        os.system('cp -r ' + data_root + _dir + '/blur4/ ' + sub_root2)

For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. You can refer to https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204
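
For reference, a quick illustration of why the template matters: the loader was looking for eight-digit names (see the FileNotFoundError for 00000000.png above), while '{:03d}.png' produces the three-digit names used by this UDM10 copy:

    >>> '{:08d}.png'.format(0)
    '00000000.png'
    >>> '{:03d}.png'.format(0)
    '000.png'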

@gihwan-kim
Author

> @gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to do it, like this: [...]
>
> For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. You can refer to https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204

Thank you for your kind help!
As I mentioned, I have a question about the UDM10 validation dataset for RealBasicVSR.
I downloaded the udm10 dataset from this link: udm10 download site.
That site's udm10 directory structure is:

./udm10
├── archpeople
│   ├── blur4
│   └── truth
├── archwall
│   ├── blur4
│   └── truth
├── auditorium
│   ├── blur4
│   └── truth
├── band
│   ├── blur4
│   └── truth
├── caffe
│   ├── blur4
│   └── truth
├── camera
│   ├── blur4
│   └── truth
├── clap
│   ├── blur4
│   └── truth
├── lake
│   ├── blur4
│   └── truth
├── photography
│   ├── blur4
│   └── truth
└── polyflow
    ├── blur4
    └── truth

Is the blur4 data the same as BIx4?
Or do I have to preprocess it like in this link?

And does BIx4 mean bicubic interpolation x4 downsampling?

@Z-Fran
Collaborator

Z-Fran commented Dec 6, 2022

Blur4 is not BIx4 or BDx4. BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m . For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of imresize such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py . I can provide my data if you need it.
And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim
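
A hedged sketch of generating BIx4 frames with the linked Python port. It assumes imresize.py from fatheral/matlab_imresize is importable and that its scalar_scale argument takes the scaling factor; the directory names are illustrative and follow the layout mentioned earlier in this thread:

    # Sketch only: bicubic x4 downsampling of GT frames into a BIx4 folder.
    import os
    import cv2
    from imresize import imresize  # fatheral/matlab_imresize/imresize.py

    gt_dir = 'data/UDM10/GT/archpeople'     # example clip folder
    lq_dir = 'data/UDM10/BIx4/archpeople'
    os.makedirs(lq_dir, exist_ok=True)

    for name in sorted(os.listdir(gt_dir)):
        gt = cv2.imread(os.path.join(gt_dir, name)).astype('float64')
        lq = imresize(gt, scalar_scale=0.25)                 # MATLAB-compatible bicubic
        cv2.imwrite(os.path.join(lq_dir, name),
                    lq.clip(0, 255).round().astype('uint8'))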

@gihwan-kim
Author

> Blur4 is not BIx4 or BDx4. BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m . For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of imresize such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py . I can provide my data if you need it. And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim

I will check the imresize code you mentioned, thank you! :)
If it's okay with you, I would be grateful if you could send me your data.

@Z-Fran
Collaborator

Z-Fran commented Dec 7, 2022

@gihwan-kim
Author
