Distributed and bug fixes #13
Conversation
`gamma` should have shape `[n,]` instead of `[n,1]`. This caused incorrect evaluation of `extr` due to shape broadcasting.
Added `distributed_accelerate` and `remove_acceleration`.
I tried to run the code on a DGX server and I got the following error
Apparently, this is a known issue with the NCCL backend that has not been addressed.
Just set
Also, when running it on the DGX server, what is the
To run it on the DGX machine, I replaced the `setup_ddp()` in your file with the one provided in the main file of the ImageNet1k example.
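For context, a torchrun-style `setup_ddp()` along the lines of the ImageNet1k example usually reduces to a few lines. This is a sketch under that assumption, not the actual code from either file:

```python
import os
import torch
import torch.distributed as dist

def setup_ddp():
    # torchrun exports RANK, WORLD_SIZE and LOCAL_RANK, so the default
    # env:// rendezvous needs no explicit master address here.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), dist.get_world_size(), local_rank
```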
This PR is mostly for discussion at this point. Please don't merge now
Critical changes:
- Added `@torch.no_grad()` decorators for accelerated optimization steps in `accelerate.py`. This is absolutely necessary and has been missing.
- Fixed the shape of `gamma` in `anderson_acceleration.py`. This bug caused incorrect broadcasting of vectors in `extr = X[:,-2] + DX[:,-1] - (DX[:,:-1]+DR)@gamma` (a minimal repro follows this list).
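A minimal repro of that broadcasting bug, with shapes assumed for illustration (the actual buffers in `anderson_acceleration.py` are not shown in this thread): when `gamma` is a `[k, 1]` column instead of a flat `[k]` vector, the matmul yields `[d, 1]`, and the final subtraction silently broadcasts `extr` into a `[d, d]` matrix rather than the intended `[d]` vector.

```python
import torch

d, k = 5, 3
X  = torch.randn(d, k + 2)   # past iterates, one per column (shapes assumed)
DX = torch.randn(d, k + 1)   # differences of iterates
DR = torch.randn(d, k)       # differences of residuals

gamma_bad = torch.randn(k, 1)  # shape [k, 1]: the bug
extr = X[:, -2] + DX[:, -1] - (DX[:, :-1] + DR) @ gamma_bad
print(extr.shape)  # torch.Size([5, 5]) -- broadcast to a matrix, silently wrong

gamma = gamma_bad.squeeze(-1)  # shape [k]: the fix
extr = X[:, -2] + DX[:, -1] - (DX[:, :-1] + DR) @ gamma
print(extr.shape)  # torch.Size([5]) -- the intended extrapolated point
```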
Additions:
- `def distributed_accelerated_step` in `accelerate.py` and a corresponding modification to `def accelerate` (see the sketch after the run command below).
- `def averaged_*` have not been changed but must be updated later.

To run the new example locally:

```
torchrun --standalone --nnodes=1 --nproc_per_node=10 main.py
```
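The `--standalone --nnodes=1 --nproc_per_node=10` flags launch ten worker processes on a single machine with an automatic rendezvous. As for the new step itself, here is a hypothetical sketch of one way a `distributed_accelerated_step` could keep ranks consistent; the actual body in `accelerate.py` is not shown in this thread, so `local_step` and the averaging strategy are assumptions:

```python
import torch
import torch.distributed as dist

@torch.no_grad()  # the decorator this PR adds to accelerated steps
def distributed_accelerated_step(params, local_step):
    # Hypothetical: apply the single-process accelerated update, then
    # all-reduce so every rank ends up with the same averaged parameters.
    local_step(params)
    world_size = dist.get_world_size()
    for p in params:
        dist.all_reduce(p, op=dist.ReduceOp.SUM)
        p /= world_size
```

Averaging after the local update is the simplest way to keep every rank on the same iterate; whatever the real implementation does, it must run under `@torch.no_grad()` per the fix above.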