
Horovod Gradient Compression #2025

Closed
vineeths96 opened this issue Jun 15, 2020 · 4 comments

@vineeths96

I have a question about how gradient compression is performed within Horovod. As mentioned in the source code here, we need to inherit the base Compressor class and override the compress and decompress functions.
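For reference, a minimal sketch of such a subclass (assuming Horovod's `Compressor` interface in `horovod.torch.compression`; it essentially mirrors the built-in fp16 compressor and is only meant to show where `compress`/`decompress` plug in):

```python
import torch
from horovod.torch.compression import Compressor

class MyFP16Compressor(Compressor):
    """Illustrative compressor: cast floating-point gradients to FP16."""

    @staticmethod
    def compress(tensor):
        # Return the compressed tensor plus the context needed to decompress it.
        if tensor.dtype.is_floating_point:
            return tensor.type(torch.float16), tensor.dtype
        return tensor, tensor.dtype

    @staticmethod
    def decompress(tensor, ctx):
        # Restore the original dtype recorded in the compression context.
        return tensor.type(ctx)

# Passed to the optimizer wrapper, e.g.:
# optimizer = hvd.DistributedOptimizer(optimizer,
#                                      named_parameters=model.named_parameters(),
#                                      compression=MyFP16Compressor)
```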

My doubt is about the order in which the operations are performed. More specifically, is the order as shown below?

  1. Calculate gradient on each worker.
  2. Compress the gradients at each worker.
  3. Communicate the gradients (compressed) to all workers.
  4. Decompress the collection of gradients at each worker.
  5. Perform an all_reduce of gradients and update the parameters.
@tgaddair
Collaborator

Hey @vineeths96, the order is:

  1. Calculate gradient on each worker.
  2. Compress the gradient on each worker.
  3. Allreduce and average the compressed gradients across all workers.
  4. Decompress the allreduced gradients.
  5. Update the parameters.

So I think your intuition was correct, but the only difference is that the communication step is also when we perform the allreduce on the gradients (in compressed form to reduce network overhead).
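To make the sequence concrete, here is a rough, hand-written sketch of what happens per gradient inside the wrapped optimizer (an illustration only, using the built-in fp16 compressor, a plain SGD update, and stand-in tensors):

```python
import torch
import horovod.torch as hvd
from horovod.torch.compression import Compression

hvd.init()
compression = Compression.fp16
lr = 0.01
param = torch.randn(10)
grad = torch.randn(10)  # stand-in for param.grad after backward() (step 1)

# Step 2: compress the locally computed gradient.
tensor_compressed, ctx = compression.compress(grad)

# Step 3: allreduce and average the compressed gradients across workers.
# (average=True follows the API of the time; newer releases use op=hvd.Average.)
averaged_compressed = hvd.allreduce(tensor_compressed, average=True)

# Step 4: decompress the allreduced result.
grad_avg = compression.decompress(averaged_compressed, ctx)

# Step 5: apply the update (plain SGD shown for illustration).
param.add_(grad_avg, alpha=-lr)
```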

In case you're worried about overflow / underflow, there is a PR in review that addresses this (#1949).

@vineeths96
Author

Hello @tgaddair.

Thank you for your reply.

I am interested in developing compressors such as QSGD. There are already a few implementations, like the one here, where the compressed gradient at each worker is a tuple.

But as per Step 3 in your reply above, Horovod averages the compressed gradients. I am not able to understand how you would average the compressed gradients in this case, where we have a list of tuples.

@xinyandai

xinyandai commented Jun 25, 2020

Hi @vineeths96,

In the original paper (line 4 of Algorithm 1), the compressed gradient is broadcast to all peers. In this case, I think you may need to use the allgather operator.

I have an implementation of 8-bit QSGD with Horovod here, which may be helpful.
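To illustrate the allgather route in isolation (a sketch under assumptions, not the linked code: `compressor` is any Compressor-style object, and for simplicity each worker reuses its own decompression context, whereas QSGD's per-worker norms would also have to be communicated):

```python
import torch
import horovod.torch as hvd

def allgather_average(grad, compressor):
    """Compress locally, gather every worker's compressed gradient,
    then decompress and average on each worker."""
    compressed, ctx = compressor.compress(grad)
    # allgather concatenates along dim 0, so add a leading "worker" dimension.
    gathered = hvd.allgather(compressed.unsqueeze(0))
    decompressed = [compressor.decompress(t, ctx) for t in gathered]
    return torch.stack(decompressed).mean(dim=0)
```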

@vineeths96
Author

Hello @xinyandai,

Thank you for your reply.

I finally got time to go through your code. From what I understand, you made three major changes to the Horovod Torch DistributedOptimizer:

  1. You created compressors for each tensor in your class init.
  2. You replaced the allreduce function with your allgather function.
  3. In the synchronize function, you make sure the decompress functions are called with the proper arguments.

Correct me if I am wrong.
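For anyone reading along, a hypothetical sketch of how those three changes could fit together in an asynchronous, synchronize-style step (illustrative names, not the linked code; Horovod's internal optimizer hooks may differ between versions):

```python
import torch
import horovod.torch as hvd

def synchronize_allgathered(named_grads, compressors):
    """Allgather compressed gradients asynchronously, then decompress and average locally."""
    handles = {}
    for name, grad in named_grads.items():
        # Change 1: a dedicated compressor per tensor.
        compressed, ctx = compressors[name].compress(grad)
        # Change 2: allgather instead of allreduce (leading dim indexes the worker).
        handles[name] = (hvd.allgather_async(compressed.unsqueeze(0), name=name), ctx)
    for name, (handle, ctx) in handles.items():
        gathered = hvd.synchronize(handle)
        # Change 3: decompress each gathered piece (per-worker contexts such as
        # QSGD norms would need to be gathered too), then average locally.
        pieces = [compressors[name].decompress(t, ctx) for t in gathered]
        named_grads[name].copy_(torch.stack(pieces).mean(dim=0))
```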
