
Horovod Gradient Compression #2025

Closed
vineeths96 opened this issue Jun 15, 2020 · 4 comments

@vineeths96

I have a question about how gradient compression is performed within Horovod. As mentioned in the source code here, we need to inherit the base Compressor class and override the compress and decompress functions.
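For reference, a minimal sketch of such a subclass (assuming Horovod's `Compressor` interface in `horovod.torch.compression`; it essentially mirrors the built-in fp16 compressor and is only meant to show where `compress`/`decompress` plug in):

```python
import torch
from horovod.torch.compression import Compressor

class MyFP16Compressor(Compressor):
    """Illustrative compressor: cast floating-point gradients to FP16."""

    @staticmethod
    def compress(tensor):
        # Return the compressed tensor plus the context needed to decompress it.
        if tensor.dtype.is_floating_point:
            return tensor.type(torch.float16), tensor.dtype
        return tensor, tensor.dtype

    @staticmethod
    def decompress(tensor, ctx):
        # Restore the original dtype recorded in the compression context.
        return tensor.type(ctx)

# Passed to the optimizer wrapper, e.g.:
# optimizer = hvd.DistributedOptimizer(optimizer,
#                                      named_parameters=model.named_parameters(),
#                                      compression=MyFP16Compressor)
```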

My doubt is about the order in which the operations are performed. More specifically, is the order as shown below?

  1. Calculate gradient on each worker.
  2. Compress the gradients at each worker.
  3. Communicate the gradients (compressed) to all workers.
  4. Decompress the collection of gradients at each worker.
  5. Perform an all_reduce of gradients and update the parameters.
@tgaddair
Collaborator

Hey @vineeths96, the order is:

  1. Calculate gradient on each worker.
  2. Compress the gradient on each worker.
  3. Allreduce and average the compressed gradients across all workers.
  4. Decompress the allreduced gradients.
  5. Update the parameters.

So I think your intuition was correct, but the only difference is that the communication step is also when we perform the allreduce on the gradients (in compressed form to reduce network overhead).
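To make the sequence concrete, here is a rough, hand-written sketch of what happens per gradient inside the wrapped optimizer (an illustration only, using the built-in fp16 compressor, a plain SGD update, and stand-in tensors):

```python
import torch
import horovod.torch as hvd
from horovod.torch.compression import Compression

hvd.init()
compression = Compression.fp16
lr = 0.01
param = torch.randn(10)
grad = torch.randn(10)  # stand-in for param.grad after backward() (step 1)

# Step 2: compress the locally computed gradient.
tensor_compressed, ctx = compression.compress(grad)

# Step 3: allreduce and average the compressed gradients across workers.
# (average=True follows the API of the time; newer releases use op=hvd.Average.)
averaged_compressed = hvd.allreduce(tensor_compressed, average=True)

# Step 4: decompress the allreduced result.
grad_avg = compression.decompress(averaged_compressed, ctx)

# Step 5: apply the update (plain SGD shown for illustration).
param.add_(grad_avg, alpha=-lr)
```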

In case you're worried about overflow / underflow, there is a PR in review that addresses this (#1949).

@vineeths96
Author

Hello @tgaddair.

Thank you for your reply.

I am interested in developing compressors such as QSGD. There are already a few implementations, like the one here, where the compressed gradient at each worker is a tuple.

But as per Step 3 in your reply above, Horovod averages the compressed gradients. I am not able to understand how you would average the compressed gradients in this case, where we have a list of tuples.

@xinyandai

xinyandai commented Jun 25, 2020

Hi @vineeths96,

In the original paper (line 4 of Algorithm 1), the compressed gradient is broadcast to all peers. In this case, I think you may need to use the allgather operator.

I have an implementation of 8-bit QSGD with Horovod here, which may be helpful.
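To illustrate the allgather route in isolation (a sketch under assumptions, not the linked code: `compressor` is any Compressor-style object, and for simplicity each worker reuses its own decompression context, whereas QSGD's per-worker norms would also have to be communicated):

```python
import torch
import horovod.torch as hvd

def allgather_average(grad, compressor):
    """Compress locally, gather every worker's compressed gradient,
    then decompress and average on each worker."""
    compressed, ctx = compressor.compress(grad)
    # allgather concatenates along dim 0, so add a leading "worker" dimension.
    gathered = hvd.allgather(compressed.unsqueeze(0))
    decompressed = [compressor.decompress(t, ctx) for t in gathered]
    return torch.stack(decompressed).mean(dim=0)
```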

@vineeths96
Author

Hello @xinyandai,

Thank you for your reply.

I finally got time to go through your code. From what I understand, you made three major changes to the Horovod Torch DistributedOptimizer:

  1. You created compressors for each tensor in your class init.
  2. You replaced the allreduce function with your allgather function.
  3. In the synchronize function, you make sure the decompress functions are called with the proper arguments.

Correct me if I am wrong.
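For anyone reading along, a hypothetical sketch of how those three changes could fit together in an asynchronous, synchronize-style step (illustrative names, not the linked code; Horovod's internal optimizer hooks may differ between versions):

```python
import torch
import horovod.torch as hvd

def synchronize_allgathered(named_grads, compressors):
    """Allgather compressed gradients asynchronously, then decompress and average locally."""
    handles = {}
    for name, grad in named_grads.items():
        # Change 1: a dedicated compressor per tensor.
        compressed, ctx = compressors[name].compress(grad)
        # Change 2: allgather instead of allreduce (leading dim indexes the worker).
        handles[name] = (hvd.allgather_async(compressed.unsqueeze(0), name=name), ctx)
    for name, (handle, ctx) in handles.items():
        gathered = hvd.synchronize(handle)
        # Change 3: decompress each gathered piece (per-worker contexts such as
        # QSGD norms would need to be gathered too), then average locally.
        pieces = [compressors[name].decompress(t, ctx) for t in gathered]
        named_grads[name].copy_(torch.stack(pieces).mean(dim=0))
```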
