
Distributed and bug fixes #13

Merged
merged 6 commits into from Mar 4, 2022

Conversation

vreshniak
Collaborator

@vreshniak vreshniak commented Jan 24, 2022

This PR is mostly for discussion at this point. Please don't merge it yet.

  • Critical changes:

    • added @torch.no_grad() decorators for the accelerated optimization steps in accelerate.py. This is absolutely necessary and had been missing
    • fixed the size of gamma in anderson_acceleration.py. This bug caused incorrect broadcasting of vectors in extr = X[:,-2] + DX[:,-1] - (DX[:,:-1]+DR)@gamma
  • Additions:

    • def distributed_accelerated_step in accelerate.py and a corresponding modification to def accelerate. The def averaged_* routines have not been changed but must be updated later
    • new CIFAR10_distributed example inspired by ImageNet1k. Lines to pay attention to: 32-40, 120-134, 197-209, 258-267
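The gamma shape fix can be illustrated with a minimal NumPy sketch (sizes here are hypothetical, chosen only to show the shapes involved): with gamma of shape [n,1] the subtraction in the extrapolation formula silently broadcasts a vector against a column, producing a matrix instead of the intended vector.

```python
import numpy as np

# Hypothetical sizes for illustration: n parameters, a history of m+1 iterates.
n, m = 5, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m + 1))   # parameter history
DX = np.diff(X, axis=1)               # shape (n, m)
DR = rng.standard_normal((n, m - 1))  # residual differences, shape (n, m-1)

gamma_fixed = rng.standard_normal(m - 1)  # shape (m-1,)  -- correct
gamma_buggy = gamma_fixed.reshape(-1, 1)  # shape (m-1, 1) -- the bug

extr       = X[:, -2] + DX[:, -1] - (DX[:, :-1] + DR) @ gamma_fixed
extr_buggy = X[:, -2] + DX[:, -1] - (DX[:, :-1] + DR) @ gamma_buggy

print(extr.shape)        # (5,)   -- a vector, as intended
print(extr_buggy.shape)  # (5, 5) -- (n,) minus (n,1) broadcasts to a matrix
```

No error is raised in the buggy case, which is why the wrong extrapolation went unnoticed.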

To run new example locally:
torchrun --standalone --nnodes=1 --nproc_per_node=10 main.py
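The @torch.no_grad() point can be seen on a toy update; accelerated_step below is a hypothetical stand-in for the steps in accelerate.py, not the actual code:

```python
import torch

@torch.no_grad()
def accelerated_step(param, extrapolated):
    # In-place write of the extrapolated iterate. Without the no_grad
    # decorator, copy_ on a leaf tensor with requires_grad=True raises
    # "a leaf Variable that requires grad is being used in an in-place
    # operation", and the extrapolation would otherwise be tracked by autograd.
    param.copy_(extrapolated)

p = torch.nn.Parameter(torch.zeros(3))
accelerated_step(p, torch.ones(3))
print(p.detach())  # tensor([1., 1., 1.])
```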

'gamma' should have shape [n,] instead of [n,1]. This caused incorrect evaluation of 'extr' due to shape broadcasting
Added distributed_accelerate and remove_acceleration
@allaffa allaffa added the bug Something isn't working label Jan 24, 2022
@allaffa
Collaborator

allaffa commented Jan 24, 2022

I tried to run the code on a DGX server and I got the following error

  File "/opt/conda/lib/python3.8/site-packages/AADL/accelerate.py", line 117, in distributed_accelerated_step
    torch.distributed.gather(group_hist[-1], gather_list=history_list,   dst=0)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2117, in gather
    work = default_pg.gather(output_tensors, input_tensors, opts)
RuntimeError: ProcessGroupNCCL does not support gather

Apparently, this is a known issue with the NCCL backend that has not been addressed:
pytorch/pytorch#55893
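One workaround, since NCCL does implement all_gather even though it lacks gather, is to gather onto every rank and let rank 0 use the result. The sketch below runs single-process on the gloo backend so it works without GPUs; local_hist is a hypothetical stand-in for group_hist[-1], and on the DGX the backend would be nccl with the same all_gather call:

```python
import os
import torch
import torch.distributed as dist

# Single-process gloo group so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

local_hist = torch.arange(4.0)  # stand-in for this rank's group_hist[-1]

# NCCL implements all_gather but not gather (pytorch/pytorch#55893):
# collect every rank's tensor on every rank instead of only on dst=0.
history_list = [torch.empty_like(local_hist) for _ in range(dist.get_world_size())]
dist.all_gather(history_list, local_hist)

dist.destroy_process_group()
```

The cost is extra communication (every rank receives the full history), which may be acceptable given that gather is simply unavailable on NCCL.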

@vreshniak
Collaborator Author

> I tried to run the code on a DGX server and I got the following error
>
>       File "/opt/conda/lib/python3.8/site-packages/AADL/accelerate.py", line 117, in distributed_accelerated_step
>         torch.distributed.gather(group_hist[-1], gather_list=history_list, dst=0)
>       File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2117, in gather
>         work = default_pg.gather(output_tensors, input_tensors, opts)
>     RuntimeError: ProcessGroupNCCL does not support gather
>
> Apparently, this is a known issue with the NCCL backend that has not been addressed: pytorch/pytorch#55893

Just set _debug = False on line 13 of accelerate.py; sync_frequency=1 in main.py, so this shouldn't be an issue.

Also, when running it on a DGX server, what is the def setup_ddp() function that you use?
The current one runs everything locally on a single node.
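For reference, a minimal single-node setup_ddp compatible with the torchrun command above might look like the sketch below. This is an assumption about the shape of the function, not the actual code in the example:

```python
import os
import torch
import torch.distributed as dist

def setup_ddp():
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and
    # MASTER_PORT into the environment, so the default env:// init
    # method of init_process_group can read them directly.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    return int(os.environ.get("LOCAL_RANK", 0))
```

A multi-node launch would change only the torchrun arguments (--nnodes, --rdzv_endpoint), not this function.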

@allaffa
Collaborator

allaffa commented Jan 24, 2022

> > I tried to run the code on a DGX server and I got the following error
> >
> >       File "/opt/conda/lib/python3.8/site-packages/AADL/accelerate.py", line 117, in distributed_accelerated_step
> >         torch.distributed.gather(group_hist[-1], gather_list=history_list, dst=0)
> >       File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2117, in gather
> >         work = default_pg.gather(output_tensors, input_tensors, opts)
> >     RuntimeError: ProcessGroupNCCL does not support gather
> >
> > Apparently, this is a known issue with the NCCL backend that has not been addressed: pytorch/pytorch#55893
>
> Just set _debug = False on line 13 of accelerate.py; sync_frequency=1 in main.py, so this shouldn't be an issue.
>
> Also, when running it on a DGX server, what is the def setup_ddp() function that you use? The current one runs everything locally on a single node.

To run it on the DGX machine, I replaced the setup_ddp() of your file with the one provided in the main file of the ImageNet1k example.

@allaffa allaffa merged commit 8e7d184 into master Mar 4, 2022