
Distributed and bug fixes #13

Merged
merged 6 commits into from Mar 4, 2022

Conversation

vreshniak
Collaborator

@vreshniak vreshniak commented Jan 24, 2022

This PR is mostly for discussion at this point. Please don't merge it yet.

  • Critical changes:

    • added @torch.no_grad() decorators for the accelerated optimization steps in accelerate.py. This is absolutely necessary and had been missing
    • fixed the size of gamma in anderson_acceleration.py. This bug caused incorrect broadcasting of vectors in extr = X[:,-2] + DX[:,-1] - (DX[:,:-1]+DR)@gamma
  • Additions:

    • def distributed_accelerated_step in accelerate.py and a corresponding modification to def accelerate. The def averaged_* routines have not been changed but must be updated later
    • new CIFAR10_distributed example inspired by ImageNet1k. Lines to pay attention to: 32-40, 120-134, 197-209, 258-267
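The gamma shape fix can be illustrated with a minimal NumPy sketch (sizes here are hypothetical, chosen only to show the shapes involved): with gamma of shape [n,1] the subtraction in the extrapolation formula silently broadcasts a vector against a column, producing a matrix instead of the intended vector.

```python
import numpy as np

# Hypothetical sizes for illustration: n parameters, a history of m+1 iterates.
n, m = 5, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m + 1))   # parameter history
DX = np.diff(X, axis=1)               # shape (n, m)
DR = rng.standard_normal((n, m - 1))  # residual differences, shape (n, m-1)

gamma_fixed = rng.standard_normal(m - 1)  # shape (m-1,)  -- correct
gamma_buggy = gamma_fixed.reshape(-1, 1)  # shape (m-1, 1) -- the bug

extr       = X[:, -2] + DX[:, -1] - (DX[:, :-1] + DR) @ gamma_fixed
extr_buggy = X[:, -2] + DX[:, -1] - (DX[:, :-1] + DR) @ gamma_buggy

print(extr.shape)        # (5,)   -- a vector, as intended
print(extr_buggy.shape)  # (5, 5) -- (n,) minus (n,1) broadcasts to a matrix
```

No error is raised in the buggy case, which is why the wrong extrapolation went unnoticed.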

To run new example locally:
torchrun --standalone --nnodes=1 --nproc_per_node=10 main.py
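The @torch.no_grad() point can be seen on a toy update; accelerated_step below is a hypothetical stand-in for the steps in accelerate.py, not the actual code:

```python
import torch

@torch.no_grad()
def accelerated_step(param, extrapolated):
    # In-place write of the extrapolated iterate. Without the no_grad
    # decorator, copy_ on a leaf tensor with requires_grad=True raises
    # "a leaf Variable that requires grad is being used in an in-place
    # operation", and the extrapolation would otherwise be tracked by autograd.
    param.copy_(extrapolated)

p = torch.nn.Parameter(torch.zeros(3))
accelerated_step(p, torch.ones(3))
print(p.detach())  # tensor([1., 1., 1.])
```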

'gamma' should have shape [n,] instead of [n,1]. This caused incorrect evaluation of 'extr' due to shape broadcasting
Added distributed_accelerate and remove_acceleration
@allaffa allaffa added the bug Something isn't working label Jan 24, 2022
@allaffa
Collaborator

allaffa commented Jan 24, 2022

I tried to run the code on a DGX server and I got the following error

  File "/opt/conda/lib/python3.8/site-packages/AADL/accelerate.py", line 117, in distributed_accelerated_step
    torch.distributed.gather(group_hist[-1], gather_list=history_list,   dst=0)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2117, in gather
    work = default_pg.gather(output_tensors, input_tensors, opts)
RuntimeError: ProcessGroupNCCL does not support gather

Apparently, this is a known issue with the NCCL backend that has not been addressed:
pytorch/pytorch#55893
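One workaround, since NCCL does implement all_gather even though it lacks gather, is to gather onto every rank and let rank 0 use the result. The sketch below runs single-process on the gloo backend so it works without GPUs; local_hist is a hypothetical stand-in for group_hist[-1], and on the DGX the backend would be nccl with the same all_gather call:

```python
import os
import torch
import torch.distributed as dist

# Single-process gloo group so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

local_hist = torch.arange(4.0)  # stand-in for this rank's group_hist[-1]

# NCCL implements all_gather but not gather (pytorch/pytorch#55893):
# collect every rank's tensor on every rank instead of only on dst=0.
history_list = [torch.empty_like(local_hist) for _ in range(dist.get_world_size())]
dist.all_gather(history_list, local_hist)

dist.destroy_process_group()
```

The cost is extra communication (every rank receives the full history), which may be acceptable given that gather is simply unavailable on NCCL.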

@vreshniak
Collaborator Author

> I tried to run the code on a DGX server and I got the following error
>
>       File "/opt/conda/lib/python3.8/site-packages/AADL/accelerate.py", line 117, in distributed_accelerated_step
>         torch.distributed.gather(group_hist[-1], gather_list=history_list, dst=0)
>       File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2117, in gather
>         work = default_pg.gather(output_tensors, input_tensors, opts)
>     RuntimeError: ProcessGroupNCCL does not support gather
>
> Apparently, this is a known issue with the NCCL backend that has not been addressed: pytorch/pytorch#55893

Just set _debug = False on line 13 of accelerate.py; sync_frequency=1 in main.py, so this shouldn't be an issue.

Also, when running it on a DGX server, what is the def setup_ddp() function that you use?
The current one runs everything locally on a single node.
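For reference, a minimal single-node setup_ddp compatible with the torchrun command above might look like the sketch below. This is an assumption about the shape of the function, not the actual code in the example:

```python
import os
import torch
import torch.distributed as dist

def setup_ddp():
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and
    # MASTER_PORT into the environment, so the default env:// init
    # method of init_process_group can read them directly.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    return int(os.environ.get("LOCAL_RANK", 0))
```

A multi-node launch would change only the torchrun arguments (--nnodes, --rdzv_endpoint), not this function.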

@allaffa
Collaborator

allaffa commented Jan 24, 2022

> > I tried to run the code on a DGX server and I got the following error
> >
> >       File "/opt/conda/lib/python3.8/site-packages/AADL/accelerate.py", line 117, in distributed_accelerated_step
> >         torch.distributed.gather(group_hist[-1], gather_list=history_list, dst=0)
> >       File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2117, in gather
> >         work = default_pg.gather(output_tensors, input_tensors, opts)
> >     RuntimeError: ProcessGroupNCCL does not support gather
> >
> > Apparently, this is a known issue with the NCCL backend that has not been addressed: pytorch/pytorch#55893
>
> Just set _debug = False on line 13 of accelerate.py; sync_frequency=1 in main.py, so this shouldn't be an issue.
>
> Also, when running it on a DGX server, what is the def setup_ddp() function that you use? The current one runs everything locally on a single node.

To run it on the DGX machine, I replaced the setup_ddp() of your file with the one provided in the main file of the ImageNet1k example.

@allaffa allaffa merged commit 8e7d184 into master Mar 4, 2022