Add support for gradient_predivide_factor and averaging in Horovod backend. #1949
Conversation
horovod/tensorflow/__init__.py
if not rocm_built():
    if (op == Average):
        # Averaging happens via postscale_factor.
        postscale_factor /= size()
For Elastic mode, having the size defined only during graph construction like this is causing problems when the world size scales up or down. If possible, it would be good if we could move this into the C++ layer to avoid this problem.
One way we could do this: remove the "true op" concept and just pass the actual op enum to C++. If it's AVERAGE, we can scale the postscale_factor dynamically within the TensorFlow custom op using the updated world size. Then, we can still pass in SUM to the enqueue function so everything else works the same.
What do you think?
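To make the concern concrete, here is a small hedged sketch (illustrative only, not the PR's code) of why computing the factor in Python is fragile under elastic scaling:

```python
import horovod.tensorflow as hvd

def compute_postscale(op, postscale_factor=1.0):
    # hvd.size() is evaluated once here, at graph-construction time, so the
    # resulting Python float is baked into the graph. If the world size later
    # changes under elastic training, the stale constant is still applied.
    if op == hvd.Average:
        postscale_factor /= hvd.size()
    return postscale_factor
```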
I see what you are saying. Perhaps we can move the modification of postscale_factor into EnqueueTensorAllreduce here, if op is average:
horovod/horovod/common/operations.cc
Lines 830 to 837 in b94807d
// AVERAGE should be taken care of in the framework layer. Enqueuing it here directly is not allowed.
// For an example of how to deal with op=hvd.Average in the framework layer, please refer to function
// `def _allreduce_async(tensor, output, name, op)` in
// horovod/horovod/torch/mpi_ops.py
if (reduce_op == ReduceOp::AVERAGE) {
  LOG(ERROR, horovod_global.controller->GetRank()) << "Enqueuing AVERAGE allreduce is not allowed.";
  return status.Aborted("AVERAGE not allowed.");
}
But then this also leads to the question of how to handle the gradient_predivide_factor as well. For elastic, are we OK if the postscale_factor and prescale_factor remain constant w.r.t. the world size? In general, the gradient_predivide_factor is set based on the number of workers.
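For context, a hedged usage sketch of how the factor is typically tied to the worker count (assuming the gradient_predivide_factor argument this PR adds to DistributedOptimizer; the model and optimizer are placeholders):

```python
import torch
import horovod.torch as hvd

hvd.init()
model = torch.nn.Linear(10, 1)
base_opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Pre-divide gradients by the world size before the allreduce-sum; the
# remaining post-divide factor (size / size == 1) is then a no-op.
opt = hvd.DistributedOptimizer(base_opt,
                               named_parameters=model.named_parameters(),
                               gradient_predivide_factor=hvd.size())
```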
horovod/common/controller.cc
@@ -449,6 +449,36 @@ Response Controller::ConstructResponse(std::string& name, int joined_size) {
  }
}

// If we are doing an allreduce, check that prescaling and postscaling factors
// are identical.
For clarity, can you change this to "are identical across ranks"?
Yep, no problem. Done.
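For illustration, a rough Python sketch of the cross-rank consistency check being discussed (names and structure are assumptions; the actual change is C++ in Controller::ConstructResponse):

```python
from dataclasses import dataclass

@dataclass
class AllreduceRequest:
    prescale_factor: float
    postscale_factor: float

def check_scale_factors(requests):
    # All ranks participating in a fused allreduce must agree on the
    # pre- and post-scale factors, or the response cannot be constructed.
    first = (requests[0].prescale_factor, requests[0].postscale_factor)
    for r in requests[1:]:
        if (r.prescale_factor, r.postscale_factor) != first:
            raise ValueError(
                "prescale/postscale factors are not identical across ranks")
```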
Overall very awesome PR! Just want to get some clarity on a few things to make sure the API is as clear as possible.
horovod/tensorflow/__init__.py
average_in_framework = False
if rocm_built():
    # For ROCm, perform averaging at framework level
    if op == Average or op == Adasum:
Nit: I think this would be easier to parse (for me) as:
average_in_framework = op == Average or op == Adasum
op = Sum if op == Average else op
Works for me. Changed it.
horovod/torch/optimizer.py
@@ -113,7 +116,18 @@ def _allreduce_grad_async(self, p):
    tensor = p.grad
    tensor_compressed, ctx = self._compression.compress(tensor)

    handle = allreduce_async_(tensor_compressed, name=name, op=self.op)
    if self.op == Average and self.gradient_predivide_factor != 1.0:
Since allreduce_async_ is in the public API, I think we should consider how this affects the user's use of this function. Here we're defining an Average op as a Sum op with a post-scale factor scaled by 1 / size. But if the user were to call allreduce_async_ directly with op=Average and postscale_factor=s, they would get a different result.
In the implementation of _allreduce_async, it seems that the divisor is only size when ROCm is used, which would seem to break op=Average when calling allreduce_async_ directly. Maybe we can pull this logic into _allreduce_async to simplify things?
This is similar to TF, in that in this section of the code the user does not have explicit control over the pre- and post-scale factors, but only indirect control through gradient_predivide_factor.
For the allreduce_async_ implementation, they could set op=Average and both scale factors and it should work fine (i.e. the tensors will be prescaled before the allreduce by prescale_factor, postscaled after the allreduce by postscale_factor, and divided by size for the average). When op == Average, we do not use the divisor, but instead fold the additional size factor into postscale_factor in EnqueueTensorAllreduce here:
horovod/horovod/common/operations.cc
Line 840 in b0e51db
postscale_factor /= horovod_global.controller->GetSize();
I can't think of a situation where someone would want to mix pre- and post-scale factors with op == Average, but I do think this is doing what would be expected in that case.
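To make the described semantics concrete, here is a hedged single-process simulation (plain Python; the function and its names are illustrative assumptions, not Horovod API):

```python
def simulated_allreduce(tensors_per_rank, op="average",
                        prescale_factor=1.0, postscale_factor=1.0):
    """Model: postscale * sum_over_ranks(prescale * t), with an extra 1/size
    folded into postscale when op is "average", as described above."""
    size = len(tensors_per_rank)
    if op == "average":
        # Mirrors the backend folding size into the post-scale factor.
        postscale_factor /= size
    total = sum(prescale_factor * t for t in tensors_per_rank)
    return postscale_factor * total

# 4 ranks each contribute 2.0: prescale 0.5 and postscale 2.0 around an
# average yields 0.5 * 2.0 * 2.0 = 2.0.
assert simulated_allreduce([2.0] * 4, prescale_factor=0.5,
                           postscale_factor=2.0) == 2.0
```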
LGTM! Nice work!
@szhengac FYI
This PR implements support for gradient_predivide_factor in the DistributedOptimizer, similar to that available in APEX DDP. The purpose of this is to enable splitting the averaging before and after the allreduce, which can be useful in managing the numerical range for mixed precision computations (see #1699). The gradient_predivide_factor is applied as follows:
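A hedged sketch of this split (plain Python; the function and the single-process stand-in for the allreduce are assumptions, not the PR's verbatim code):

```python
def averaged_gradient(local_grads_per_rank, gradient_predivide_factor):
    """Average gradients across workers, splitting the division into a
    pre-divide (before the allreduce) and a post-divide (after it)."""
    size = len(local_grads_per_rank)
    # 1. Each worker pre-divides its gradient by gradient_predivide_factor.
    prescaled = [g / gradient_predivide_factor for g in local_grads_per_rank]
    # 2. Allreduce (sum) across workers.
    summed = sum(prescaled)
    # 3. Post-divide by (size / gradient_predivide_factor); the net effect is
    #    sum(local grads) / size, i.e. the plain average.
    return summed / (size / gradient_predivide_factor)
```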
To facilitate this, additional arguments, prescale_factor and postscale_factor, have been added to the basic hvd.allreduce functions, enabling the definition of multiplicative factors to scale the tensors before and after the allreduce, respectively. For efficiency, the pre- and post-scaling is implemented in the Horovod backend on the fused tensor buffer, rather than through framework-level operations. For GPU, this required a CUDA kernel implementation to scale the GPU buffer, which in turn required adding compilation of CUDA code to the current build infrastructure. The changes to the build are a more general version of the build code changes to enable CUDA compilation proposed in #1760.

As an additional general benefit from these changes, gradient averaging in the optimizer can now be carried out within the Horovod backend on the fused tensor buffer using the postscale_factor argument, rather than on a tensor-by-tensor basis at the framework level. This should, in general, be more efficient. A usage sketch of the new arguments follows this description.

Marking the current PR as WIP, but would appreciate feedback to finish up. The implementation is complete, but there are still a few details that need to be worked out (adding support for Keras, if/how to add this to the Adasum path), as well as some additional testing.
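A minimal, hedged usage sketch of the new arguments (assuming the horovod.torch surface described above; the tensor is a placeholder):

```python
import torch
import horovod.torch as hvd

hvd.init()
grad = torch.ones(4)

# Sum across workers, scaling the fused buffer before and after the
# allreduce via the new arguments; the net effect here is half the average.
scaled = hvd.allreduce(grad, op=hvd.Sum,
                       prescale_factor=0.5,
                       postscale_factor=1.0 / hvd.size())
```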