Setting data_parallel: true results in an error #90

Closed
ghost opened this issue Mar 24, 2022 · 4 comments · Fixed by #91
Labels
bug Something isn't working

Comments

ghost commented Mar 24, 2022

When I set data_parallel to true in the config, I get the error below.
I'm running this in a Jupyter notebook instance based on the ENUNU-Training-Kit, using the dev2 branch of NNSVS.

Error executing job with overrides: ['model=acoustic_custom', 'train=myconfig', 'data=myconfig', 'data.train_no_dev.in_dir=dump/multi-gpu-test/norm/train_no_dev/in_acoustic/', 'data.train_no_dev.out_dir=dump/multi-gpu-test/norm/train_no_dev/out_acoustic/', 'data.dev.in_dir=dump/multi-gpu-test/norm/dev/in_acoustic/', 'data.dev.out_dir=dump/multi-gpu-test/norm/dev/out_acoustic/', 'train.out_dir=exp/multi-gpu-test_dynamivox_notebook/acoustic', 'train.resume.checkpoint=']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 158, in <module>
    my_app()
  File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 141, in my_app
    train_loop(
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 100, in train_loop
    loss = train_step(
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 32, in train_step
    out_feats = model.preprocess_target(out_feats)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'preprocess_target'
++ set +x
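For context, the AttributeError happens because nn.DataParallel does not forward arbitrary attribute lookups to the wrapped model, so custom methods such as preprocess_target are only reachable through the wrapper's .module attribute. A minimal, generic sketch of that pattern (the model class below is a stand-in, not the actual nnsvs code; the real fix is in #91):

```python
import torch
import torch.nn as nn


class ToyAcousticModel(nn.Module):
    """Stand-in for an acoustic model that defines a custom helper method."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)

    def preprocess_target(self, out_feats):
        # Placeholder for whatever target preprocessing the real model does.
        return out_feats


model = ToyAcousticModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

out_feats = torch.randn(4, 8)

# Calling model.preprocess_target(out_feats) raises AttributeError once the
# model is wrapped, because DataParallel only exposes its own attributes.
# Unwrapping it first works with and without data parallelism:
unwrapped = model.module if isinstance(model, nn.DataParallel) else model
out_feats = unwrapped.preprocess_target(out_feats)
```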

r9y9 commented Mar 26, 2022

Thanks for reporting this. This is my bad. I will fix it soon.

r9y9 added a commit that referenced this issue Mar 26, 2022
r9y9 added a commit that referenced this issue Mar 26, 2022
r9y9 added the bug label Mar 26, 2022
r9y9 closed this as completed in #91 Mar 26, 2022
r9y9 added a commit that referenced this issue Mar 26, 2022: Fix data parallel issues #90

r9y9 commented Mar 26, 2022

This should be fixed on master now. Please let me know if the problem persists.


ghost commented Mar 27, 2022

The problem appears to be fixed, thank you.
Performance doesn't seem to scale well with the number of GPUs, though: 4 GPUs are only about 1.5x as fast as 1 GPU, even though all of them are being utilized.
It could just be a CPU bottleneck on my side, since even with a high number of workers, only a few cores appear to be doing most of the work.
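If data loading really is the limit, the usual knobs on the plain PyTorch side are the DataLoader's num_workers and pin_memory; a generic sketch of the underlying API (not the nnsvs config keys, which may be named differently):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the dumped acoustic features.
dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 8))

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,            # more worker processes for CPU-side loading
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)
```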


r9y9 commented Mar 27, 2022

That's somewhat expected; only single-process parallelism (using nn.DataParallel) is supported, and that approach is a bit slow (but easy to implement). We could use multi-process distributed training to speed up multi-GPU training further. Using distributed training doesn't mean we'd see a 4x speedup with 4 GPUs, but it would at least be faster.

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#comparison-between-dataparallel-and-distributeddataparallel

I'll work on distributed training in the future, but it's not a priority in my opinion. A single GPU is enough for most cases.
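For reference, a minimal single-node sketch of what multi-process training with DistributedDataParallel looks like, following the tutorial linked above (a toy model and loop, not the nnsvs training code):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size):
    # One process per GPU; requires at least one CUDA device (NCCL backend).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = DDP(nn.Linear(8, 8).to(rank), device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    for _ in range(10):
        x = torch.randn(16, 8, device=rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```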
