
Add support for gradient_predivide_factor and averaging in Horovod backend. #1949

Merged
30 commits merged into horovod:master on Aug 17, 2020

Conversation

romerojosh (Collaborator)

This PR implements support for gradient_predivide_factor in the DistributedOptimizer, similar to the option available in APEX DDP. The purpose is to allow the averaging to be split before and after the allreduce, which can be useful for managing the numerical range of mixed precision computations (see #1699). The gradient_predivide_factor is applied as follows:

        If op == Average, gradient_predivide_factor splits the averaging
        before and after the sum. Gradients are scaled by
        1.0 / gradient_predivide_factor before the sum and
        gradient_predivide_factor / size after the sum. 
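
As a quick sanity check on the arithmetic (illustrative values only, not code from this PR), the two partial scalings compose to an exact 1/size average:

f, N = 8.0, 4.0       # example gradient_predivide_factor and world size
prescale = 1.0 / f    # applied to each gradient before the sum
postscale = f / N     # applied to the summed buffer after the allreduce
# The factor f cancels, leaving the plain average regardless of its value.
assert prescale * postscale == 1.0 / N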

To facilitate this, additional arguments, prescale_factor and postscale_factor, have been added to the basic hvd.allreduce functions, enabling the definition of multiplicative factors to scale the tensors before and after the allreduce, respectively. For efficiency, the pre- and post-scaling are implemented in the Horovod backend on the fused tensor buffer, rather than through framework-level operations. For GPU, this required a CUDA kernel implementation to scale the GPU buffer, which in turn required adding compilation of CUDA code to the current build infrastructure. The changes to the build are a more general version of the build code changes proposed in #1760 to enable CUDA compilation.
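
As an illustration, a minimal sketch of the new arguments through the TensorFlow binding as it looks after this change (the gradient tensor and the factor f below are placeholders, not names from the PR):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
grad = tf.random.normal([1024])   # stand-in for a real gradient tensor
f = 2.0                           # example predivide factor

# Sum allreduce with explicit scaling: each rank's tensor is multiplied by
# prescale_factor before the fused allreduce, and the summed result is
# multiplied by postscale_factor afterwards, both inside the backend.
avg_grad = hvd.allreduce(grad, op=hvd.Sum,
                         prescale_factor=1.0 / f,
                         postscale_factor=f / hvd.size())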

As an additional general benefit of these changes, gradient averaging in the optimizer can now be carried out within the Horovod backend on the fused tensor buffer using the postscale_factor argument, rather than on a tensor-by-tensor basis at the framework level. This should, in general, be more efficient.
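
At the optimizer level, usage could look like the following hedged sketch (assuming the TensorFlow DistributedOptimizer signature proposed here; the base optimizer and the factor value are placeholders):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
base_opt = tf.compat.v1.train.GradientDescentOptimizer(0.01 * hvd.size())

# With the default averaging op and a non-unit gradient_predivide_factor,
# gradients are scaled by 1/factor before the summing allreduce and by
# factor/size afterwards, with the scaling applied to the fused buffer.
opt = hvd.DistributedOptimizer(base_opt, gradient_predivide_factor=2.0)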

Marking the current PR as WIP, but I would appreciate feedback to finish up. The implementation is complete, but there are still a few details to work out (adding support for Keras, and if/how to add this to the Adasum path), as well as some additional testing.

tgaddair added this to the 0.20.0 milestone on May 13, 2020
if not rocm_built():
    if (op == Average):
        # Averaging happens via postscale_factor.
        postscale_factor /= size()
Collaborator:

For Elastic mode, having the size defined only during graph construction like this is causing problems when the world size scales up or down. If possible, it would be good if we could move this into the C++ layer to avoid this problem.

One way we could do this: remove the "true op" concept and just pass the actual op enum to C++. If it's AVERAGE, we can scale the postscale_factor dynamically within the TensorFlow custom op using the updated world size. Then, we can still pass in SUM to the enqueue function so everything else works the same.

What do you think?

romerojosh (Collaborator, Author):

I see what you are saying. Perhaps we can move the modification of postscale_factor into EnqueueTensorAllreduce here, if op is average:

// AVERAGE should be taken care of in the framework layer. Enqueuing it here directly is not allowed.
// For example of how to deal with op=hvd.Average in framework layer, please refer to function
// `def _allreduce_async(tensor, output, name, op)` in
// horovod/horovod/torch/mpi_ops.py
if (reduce_op == ReduceOp::AVERAGE) {
  LOG(ERROR, horovod_global.controller->GetRank()) << "Enqueuing AVERAGE allreduce is not allowed.";
  return status.Aborted("AVERAGE not allowed.");
}

But this also raises the question of how to handle the gradient_predivide_factor. For elastic, are we ok if the postscale_factor and prescale_factor remain constant w.r.t. the world size? In general, the gradient_predivide_factor is set based on the number of workers.

romerojosh changed the title from "[WIP] Add support for gradient_predivide_factor and averaging in Horovod backend." to "Add support for gradient_predivide_factor and averaging in Horovod backend." on May 29, 2020
@@ -449,6 +449,36 @@ Response Controller::ConstructResponse(std::string& name, int joined_size) {
}
}

// If we are doing an allreduce, check that prescaling and postscaling factors
// are identical.
Collaborator:

For clarity, can you change this to "are identical across ranks"?

romerojosh (Collaborator, Author):

Yep, no problem. Done.

tgaddair (Collaborator) left a comment:

Overall very awesome PR! Just want to get some clarity on a few things to make sure the API is as clear as possible.

Resolved review threads: horovod/common/controller.cc, horovod/common/half.cc, horovod/common/operations.cc, horovod/tensorflow/__init__.py
average_in_framework = False
if rocm_built():
    # For ROCm, perform averaging at framework level
    if op == Average or op == Adasum:
Collaborator:

Nit: I think this would be easier to parse (for me) as:

average_in_framework = op == Average or op == Adasum
op = Sum if op == Average else op

romerojosh (Collaborator, Author):

Works for me. Changed it.

Resolved review threads: horovod/tensorflow/__init__.py, examples/mxnet_imagenet_resnet50.py
@@ -113,7 +116,18 @@ def _allreduce_grad_async(self, p):
tensor = p.grad
tensor_compressed, ctx = self._compression.compress(tensor)

handle = allreduce_async_(tensor_compressed, name=name, op=self.op)
if self.op == Average and self.gradient_predivide_factor != 1.0:
Collaborator:

Since allreduce_async_ is in the public API, I think we should consider how this affects the way users call this function. Here we're defining an Average op as a Sum op with a post-scale factor scaled by 1 / size. But if the user were to call allreduce_async_ directly with op=Average and postscale_factor=s, they would get a different result.

In the implementation of _allreduce_async, it seems that the divisor is only size when ROCm is used, which would seem to break op=Average when calling allreduce_async_ directly. Maybe we can pull this logic into _allreduce_async to simplify things?

romerojosh (Collaborator, Author):

This is similar to TF, in that in this section of the code the user does not have explicit control over the pre- and post-scale factors, only indirect control through gradient_predivide_factor.

For the allreduce_async_ implementation, they could set op=Average and both scale factors and it should work fine (i.e. the tensors will be prescaled before the allreduce by prescale_factor, postscaled after the allreduce by postscale_factor, and divided by size for the average). When op == Average, we do not use divisor, but instead apply the additional size factor to postscale_factor in EnqueueTensorAllreduce here:

postscale_factor /= horovod_global.controller->GetSize();

I can't think of a situation where someone would want to mix pre- and post-scale factors with op == Average, but I do think this is doing what would be expected in that case.
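
To make the direct-call case concrete, a hedged sketch against the horovod.torch binding (assuming the post-change allreduce_async_ signature; the tensor and name are placeholders):

import torch
import horovod.torch as hvd

hvd.init()
grad = torch.ones(4)

# op=Average together with explicit scale factors: the tensor is multiplied by
# prescale_factor before the allreduce and by postscale_factor after it, while
# the backend folds the extra 1/size needed for the average into postscale_factor.
handle = hvd.allreduce_async_(grad, name='example.grad', op=hvd.Average,
                              prescale_factor=0.5, postscale_factor=2.0)
avg = hvd.synchronize(handle)   # net scaling: 0.5 * 2.0 * (1 / size) == plain average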

Resolved review thread: setup.py
tgaddair (Collaborator) left a comment:

LGTM! Nice work!

tgaddair merged commit e4554de into horovod:master on Aug 17, 2020
eric-haibin-lin (Collaborator): @szhengac FYI
