Horovod Gradient Compression #2025
Hey @vineeths96, the order is:
1. Compute the gradients locally on each worker.
2. Compress the gradients.
3. Allreduce (average) the compressed gradients across workers.
4. Decompress the result.
5. Update the parameters.
So I think your intuition was correct, but the only difference is that the communication step is also when we perform the allreduce on the gradients (in compressed form to reduce network overhead). In case you're worried about overflow/underflow, there is a PR in review that addresses this (#1949).
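For readers following along, here is a minimal sketch of that per-gradient flow, loosely modeled on Horovod's PyTorch `DistributedOptimizer` hook and using the built-in FP16 compressor; it is not the verbatim Horovod source (internals differ between versions), only an illustration of the compress → allreduce → decompress → update order:

```python
# Simplified sketch of the per-gradient flow described above.
# Not Horovod's actual implementation; internals vary across versions.
import horovod.torch as hvd
from horovod.torch.compression import Compression

compression = Compression.fp16  # built-in FP16 compressor


def allreduce_grad(p, name):
    # Steps 1-2: the local gradient was already computed by backward();
    # compress it before communication.
    tensor_compressed, ctx = compression.compress(p.grad)
    # Step 3: allreduce averages the compressed gradients across workers.
    handle = hvd.allreduce_async_(tensor_compressed, name=name, op=hvd.Average)
    output = hvd.synchronize(handle)
    # Step 4: decompress the averaged result back into the gradient buffer.
    p.grad.copy_(compression.decompress(output, ctx))
    # Step 5: the optimizer's step() then updates the parameters as usual.
```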
Hello @tgaddair. Thank you for your reply. I am interested in developing compressors such as QSGD. There are already a few implementations, like the one here, where the compressed gradient at each worker is a tuple. But as per Step 3 in your reply above, Horovod averages the compressed gradients. I am not able to understand how you would average the compressed gradients in this case, where we have a list of tuples.
Hi @vineeths96, in the original paper (line 4 of Algorithm 1), the compressed gradient is broadcast to all peers. In this case, I think you may need to use the allgather operator. I have an implementation of 8-bit QSGD with Horovod here, which may be helpful.
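A rough sketch of that allgather approach is below. The `quantize`/`dequantize` helpers are hypothetical placeholders, not the linked 8-bit QSGD code (which uses stochastic rounding and more careful encoding); the point is that a (values, scale) tuple cannot be averaged by allreduce directly, so each worker gathers every peer's compressed gradient, decompresses locally, and averages:

```python
# Sketch: exchange QSGD-style compressed gradients with allgather.
# quantize/dequantize are simplified stand-ins for a real QSGD quantizer.
import torch
import horovod.torch as hvd


def quantize(tensor):
    # Scale by the max magnitude and round into int8 buckets.
    scale = tensor.abs().max().clamp(min=1e-12)
    q = torch.round(tensor / scale * 127).to(torch.int8)
    return q, scale


def dequantize(q, scale):
    return q.float() / 127.0 * scale


def allgather_compressed_grad(grad, name):
    q, scale = quantize(grad.flatten())
    # Gather every worker's quantized values and scale (concatenated along dim 0).
    all_q = hvd.allgather(q.unsqueeze(0), name=name + '.q')            # (world_size, n)
    all_scale = hvd.allgather(scale.reshape(1), name=name + '.scale')  # (world_size,)
    # Decompress each peer's contribution locally and average.
    decompressed = [dequantize(all_q[i], all_scale[i]) for i in range(hvd.size())]
    return torch.stack(decompressed).mean(dim=0).view_as(grad)
```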
Hello @xinyandai, thank you for your reply. I finally found enough time to go through your code. What I could understand is that you made three major changes in the Horovod Torch DistributedOptimizer:
Correct me if I am wrong.
I have a question about how gradient compression is performed within Horovod. As mentioned in the source code here, we need to inherit the base Compressor class and override the `compress` and `decompress` functions. My doubt is about the order in which these operations are performed. More specifically, is the order as shown below: compress the gradients, then `all_reduce` of the gradients and update the parameters?
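For reference, a minimal sketch of what such an override can look like, modeled on Horovod's built-in FP16 compressor (the class name `MyFP16Compressor` is just an illustrative placeholder):

```python
# Minimal custom compressor sketch: compress() returns the compressed tensor
# plus any context needed to undo the compression; decompress() restores it.
import torch
from horovod.torch.compression import Compressor


class MyFP16Compressor(Compressor):
    @staticmethod
    def compress(tensor):
        ctx = tensor.dtype
        if tensor.dtype.is_floating_point:
            tensor = tensor.to(torch.float16)
        return tensor, ctx

    @staticmethod
    def decompress(tensor, ctx):
        dtype = ctx
        if dtype.is_floating_point:
            tensor = tensor.to(dtype)
        return tensor


# It is then passed to the optimizer wrapper, e.g.:
# optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=...,
#                                      compression=MyFP16Compressor)
```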