Setting data_parallel: true results in an error #90

Closed
ghost opened this issue Mar 24, 2022 · 4 comments · Fixed by #91
Labels
bug Something isn't working

Comments

ghost commented Mar 24, 2022

When I set data_parallel to true in the config, I get the error below.
I'm running this in a Jupyter notebook instance based on the ENUNU-Training-Kit, using the dev2 branch of NNSVS.

Error executing job with overrides: ['model=acoustic_custom', 'train=myconfig', 'data=myconfig', 'data.train_no_dev.in_dir=dump/multi-gpu-test/norm/train_no_dev/in_acoustic/', 'data.train_no_dev.out_dir=dump/multi-gpu-test/norm/train_no_dev/out_acoustic/', 'data.dev.in_dir=dump/multi-gpu-test/norm/dev/in_acoustic/', 'data.dev.out_dir=dump/multi-gpu-test/norm/dev/out_acoustic/', 'train.out_dir=exp/multi-gpu-test_dynamivox_notebook/acoustic', 'train.resume.checkpoint=']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 158, in <module>
    my_app()
  File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 141, in my_app
    train_loop(
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 100, in train_loop
    loss = train_step(
  File "/opt/conda/lib/python3.8/site-packages/nnsvs/bin/train.py", line 32, in train_step
    out_feats = model.preprocess_target(out_feats)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'preprocess_target'
++ set +x
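For context, the AttributeError happens because nn.DataParallel does not forward arbitrary attribute lookups to the wrapped model, so custom methods such as preprocess_target are only reachable through the wrapper's .module attribute. A minimal, generic sketch of that pattern (the model class below is a stand-in, not the actual nnsvs code; the real fix is in #91):

```python
import torch
import torch.nn as nn


class ToyAcousticModel(nn.Module):
    """Stand-in for an acoustic model that defines a custom helper method."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)

    def preprocess_target(self, out_feats):
        # Placeholder for whatever target preprocessing the real model does.
        return out_feats


model = ToyAcousticModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

out_feats = torch.randn(4, 8)

# Calling model.preprocess_target(out_feats) raises AttributeError once the
# model is wrapped, because DataParallel only exposes its own attributes.
# Unwrapping it first works with and without data parallelism:
unwrapped = model.module if isinstance(model, nn.DataParallel) else model
out_feats = unwrapped.preprocess_target(out_feats)
```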

r9y9 commented Mar 26, 2022

Thanks for reporting this. This is my bad. I will fix it soon.

r9y9 added a commit that referenced this issue Mar 26, 2022
r9y9 added a commit that referenced this issue Mar 26, 2022
r9y9 added the bug label Mar 26, 2022
r9y9 closed this as completed in #91 Mar 26, 2022
r9y9 added a commit that referenced this issue Mar 26, 2022: Fix data parallel issues #90

r9y9 commented Mar 26, 2022

This should be fixed on master now. Please let me know if the problem persists.


ghost commented Mar 27, 2022

The problem appears to be fixed, thank you.
Performance doesn't seem to scale well with the number of GPUs, though: 4 GPUs are only about 1.5x as fast as 1 GPU, even though all of them are being utilized.
It could just be a CPU bottleneck on my side, since even with a high number of workers, only a few cores appear to be doing most of the work.
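If data loading really is the limit, the usual knobs on the plain PyTorch side are the DataLoader's num_workers and pin_memory; a generic sketch of the underlying API (not the nnsvs config keys, which may be named differently):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the dumped acoustic features.
dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 8))

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,            # more worker processes for CPU-side loading
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)
```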


r9y9 commented Mar 27, 2022

That's somewhat expected; only single-process parallelism (using nn.DataParallel) is supported, and that approach is a bit slow (but easy to implement). We could use multi-process distributed training to speed up multi-GPU training further. Using distributed training doesn't mean we'd see a 4x speedup with 4 GPUs, but it would at least be faster.

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#comparison-between-dataparallel-and-distributeddataparallel

I'll work on distributed training in the future, but it's not a priority in my opinion. A single GPU is enough for most cases.
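For reference, a minimal single-node sketch of what multi-process training with DistributedDataParallel looks like, following the tutorial linked above (a toy model and loop, not the nnsvs training code):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size):
    # One process per GPU; requires at least one CUDA device (NCCL backend).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = DDP(nn.Linear(8, 8).to(rank), device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    for _ in range(10):
        x = torch.randn(16, 8, device=rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```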
