Add support for gradient_predivide_factor and averaging in Horovod backend. #1949
Conversation
horovod/tensorflow/__init__.py
if not rocm_built():
    if (op == Average):
        # Averaging happens via postscale_factor.
        postscale_factor /= size()
For Elastic mode, having the size defined only during graph construction like this is causing problems when the world size scales up or down. If possible, it would be good if we could move this into the C++ layer to avoid this problem.
One way we could do this: remove the "true op" concept and just pass the actual op enum to C++. If it's AVERAGE, we can scale the postscale_factor dynamically within the TensorFlow custom op using the updated world size. Then, we can still pass in SUM to the enqueue function so everything else works the same.
What do you think?
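To make the concern concrete, here is a small hedged sketch (illustrative only, not the PR's code) of why computing the factor in Python is fragile under elastic scaling:

```python
import horovod.tensorflow as hvd

def compute_postscale(op, postscale_factor=1.0):
    # hvd.size() is evaluated once here, at graph-construction time, so the
    # resulting Python float is baked into the graph. If the world size later
    # changes under elastic training, the stale constant is still applied.
    if op == hvd.Average:
        postscale_factor /= hvd.size()
    return postscale_factor
```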
I see what you are saying. Perhaps we can move the modification of postscale_factor into EnqueueTensorAllreduce here, if op is average:
horovod/horovod/common/operations.cc
Lines 830 to 837 in b94807d
// AVERAGE should be taken care of in the framework layer. Enqueuing it here directly is not allowed.
// For an example of how to deal with op=hvd.Average in the framework layer, please refer to function
// `def _allreduce_async(tensor, output, name, op)` in
// horovod/horovod/torch/mpi_ops.py
if (reduce_op == ReduceOp::AVERAGE) {
  LOG(ERROR, horovod_global.controller->GetRank()) << "Enqueuing AVERAGE allreduce is not allowed.";
  return status.Aborted("AVERAGE not allowed.");
}
But then this also leads to the question of how to handle the gradient_predivide_factor as well. For elastic, are we OK if the postscale_factor and prescale_factor remain constant w.r.t. the world size? In general, the gradient_predivide_factor is set based on the number of workers.
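For context, a hedged usage sketch of how the factor is typically tied to the worker count (assuming the gradient_predivide_factor argument this PR adds to DistributedOptimizer; the model and optimizer are placeholders):

```python
import torch
import horovod.torch as hvd

hvd.init()
model = torch.nn.Linear(10, 1)
base_opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Pre-divide gradients by the world size before the allreduce-sum; the
# remaining post-divide factor (size / size == 1) is then a no-op.
opt = hvd.DistributedOptimizer(base_opt,
                               named_parameters=model.named_parameters(),
                               gradient_predivide_factor=hvd.size())
```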
horovod/common/controller.cc
@@ -449,6 +449,36 @@ Response Controller::ConstructResponse(std::string& name, int joined_size) {
  }
}

// If we are doing an allreduce, check that prescaling and postscaling factors
// are identical.
For clarity, can you change this to "are identical across ranks"?
Yep, no problem. Done.
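For illustration, a rough Python sketch of the cross-rank consistency check being discussed (names and structure are assumptions; the actual change is C++ in Controller::ConstructResponse):

```python
from dataclasses import dataclass

@dataclass
class AllreduceRequest:
    prescale_factor: float
    postscale_factor: float

def check_scale_factors(requests):
    # All ranks participating in a fused allreduce must agree on the
    # pre- and post-scale factors, or the response cannot be constructed.
    first = (requests[0].prescale_factor, requests[0].postscale_factor)
    for r in requests[1:]:
        if (r.prescale_factor, r.postscale_factor) != first:
            raise ValueError(
                "prescale/postscale factors are not identical across ranks")
```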
Overall very awesome PR! Just want to get some clarity on a few things to make sure the API is as clear as possible.
horovod/tensorflow/__init__.py
average_in_framework = False
if rocm_built():
    # For ROCm, perform averaging at framework level
    if op == Average or op == Adasum:
Nit: I think this would be easier to parse (for me) as:
average_in_framework = op == Average or op == Adasum
op = Sum if op == Average else op
Works for me. Changed it.
horovod/torch/optimizer.py
@@ -113,7 +116,18 @@ def _allreduce_grad_async(self, p):
    tensor = p.grad
    tensor_compressed, ctx = self._compression.compress(tensor)

    handle = allreduce_async_(tensor_compressed, name=name, op=self.op)
    if self.op == Average and self.gradient_predivide_factor != 1.0:
Since allreduce_async_ is in the public API, I think we should consider how this affects the user's use of this function. Here we're defining an Average op as a Sum op with a post-scale factor scaled by 1 / size. But if the user were to call allreduce_async_ directly with op=Average and postscale_factor=s, they would get a different result.
In the implementation of _allreduce_async, it seems that the divisor is only size when ROCm is used, which would seem to break op=Average when calling allreduce_async_ directly. Maybe we can pull this logic into _allreduce_async to simplify things?
This is similar to TF, in that in this section of the code the user does not have explicit control over the pre- and post-scale factors, but only indirect control through gradient_predivide_factor.
For the allreduce_async_ implementation, they could set op=Average and both scale factors and it should work fine (i.e. the tensors will be prescaled before the allreduce by prescale_factor, postscaled after the allreduce by postscale_factor, and divided by size for the average). When op == Average, we do not use the divisor, but instead fold the additional size factor into postscale_factor in EnqueueTensorAllreduce here:
horovod/horovod/common/operations.cc
Line 840 in b0e51db
postscale_factor /= horovod_global.controller->GetSize();
I can't think of a situation where someone would want to mix pre- and post-scale factors with op == Average, but I do think this is doing what would be expected in that case.
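To make the described semantics concrete, here is a hedged single-process simulation (plain Python; the function and its names are illustrative assumptions, not Horovod API):

```python
def simulated_allreduce(tensors_per_rank, op="average",
                        prescale_factor=1.0, postscale_factor=1.0):
    """Model: postscale * sum_over_ranks(prescale * t), with an extra 1/size
    folded into postscale when op is "average", as described above."""
    size = len(tensors_per_rank)
    if op == "average":
        # Mirrors the backend folding size into the post-scale factor.
        postscale_factor /= size
    total = sum(prescale_factor * t for t in tensors_per_rank)
    return postscale_factor * total

# 4 ranks each contribute 2.0: prescale 0.5 and postscale 2.0 around an
# average yields 0.5 * 2.0 * 2.0 = 2.0.
assert simulated_allreduce([2.0] * 4, prescale_factor=0.5,
                           postscale_factor=2.0) == 2.0
```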
LGTM! Nice work!
@szhengac FYI
This PR implements support for gradient_predivide_factor in the DistributedOptimizer, similar to that available in APEX DDP. The purpose of this is to enable splitting the averaging before and after the allreduce, which can be useful in managing the numerical range for mixed precision computations (see #1699). The gradient_predivide_factor is applied as follows:
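A hedged sketch of this split (plain Python; the function and the single-process stand-in for the allreduce are assumptions, not the PR's verbatim code):

```python
def averaged_gradient(local_grads_per_rank, gradient_predivide_factor):
    """Average gradients across workers, splitting the division into a
    pre-divide (before the allreduce) and a post-divide (after it)."""
    size = len(local_grads_per_rank)
    # 1. Each worker pre-divides its gradient by gradient_predivide_factor.
    prescaled = [g / gradient_predivide_factor for g in local_grads_per_rank]
    # 2. Allreduce (sum) across workers.
    summed = sum(prescaled)
    # 3. Post-divide by (size / gradient_predivide_factor); the net effect is
    #    sum(local grads) / size, i.e. the plain average.
    return summed / (size / gradient_predivide_factor)
```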
To facilitate this, additional arguments, prescale_factor and postscale_factor, have been added to the basic hvd.allreduce functions, enabling the definition of multiplicative factors to scale the tensors before and after the allreduce, respectively. For efficiency, the pre- and post-scaling is implemented in the Horovod backend on the fused tensor buffer, rather than through framework-level operations. For GPU, this required a CUDA kernel implementation to scale the GPU buffer, which in turn required adding compilation of CUDA code to the current build infrastructure. The changes to the build are a more general version of the build code changes to enable CUDA compilation proposed in #1760.

As an additional general benefit from these changes, gradient averaging in the optimizer can now be carried out within the Horovod backend on the fused tensor buffer using the postscale_factor argument, rather than on a tensor-by-tensor basis at the framework level. This should, in general, be more efficient. A usage sketch of the new arguments follows this description.

Marking the current PR as WIP, but would appreciate feedback to finish up. The implementation is complete, but there are still a few details that need to be worked out (adding support for Keras, if/how to add this to the Adasum path), as well as some additional testing.
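A minimal, hedged usage sketch of the new arguments (assuming the horovod.torch surface described above; the tensor is a placeholder):

```python
import torch
import horovod.torch as hvd

hvd.init()
grad = torch.ones(4)

# Sum across workers, scaling the fused buffer before and after the
# allreduce via the new arguments; the net effect here is half the average.
scaled = hvd.allreduce(grad, op=hvd.Sum,
                       prescale_factor=0.5,
                       postscale_factor=1.0 / hvd.size())
```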