TF: Add register_local_var to distributed optimizers and gradient aggregators #3695
Conversation
cc @romerojosh @maxhgerlach, reopening another one as #3663 was reverted.
Unit Test Results (with flaky tests)

1 263 files +82 · 1 263 suites +82 · 11h 39m 9s ⏱️ +3m 40s

Results for commit d4e91b0. ± Comparison against base commit ab97fd1.

♻️ This comment has been updated with latest results.
@MrAta, this seems to be failing on GPU consistently (potentially also causing a hang). Failed GitHub pipeline: (link). Error message on Buildkite, which that pipeline leads to: (link).
Hi @maxhgerlach, can you please help me find the test config as well as the stack traces for the failure?
@MrAta The CI reporting on GitHub is a bit odd, but if you click on the "Build and Test GPU (on Buildkite)" entry, you can eventually get to this page, which reports the GPU test results: https://buildkite.com/horovod/horovod/builds/8318 The test logs here show some useful stalling information: https://buildkite.com/horovod/horovod/builds/8318#01832614-9be1-4c56-98ea-767272468e6e/6-4795 Looks like maybe one of the ranks in the test isn't broadcasting variables?
@MrAta I tried out the failing tests on a local GPU system, and it seems the problem might stem from GPU device assignment. I was getting an error similar to one reported in an old Horovod issue with TF1 (#646). In your case, the issues seem to stem from some of the manual device placement and also from the call at horovod/test/parallel/test_tensorflow2_keras.py, lines 61 to 62 (commit 94529cc).
I'm not sure I can push to the master branch of the repository this PR is sourced from, so instead here is a git diff/patch that I used to get the tests to pass on my system:
Thank you so much @romerojosh! Let me apply that patch and push.
Thanks for the support, @romerojosh! @MrAta, I resolved a minor conflict in `CHANGELOG.md`.
Does it scale the gradients of local vars? The reason I ask: allreducing data-parallel variables with a mean implies the gradient is divided by the allreduce size, so the gradients of local vars have to be scaled by the same factor. Otherwise, the gradients of the local vars and the rest are technically calculated from different loss functions, which leads to a tiny accuracy loss that is very hard to spot.
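To make the concern concrete, here is a minimal sketch (not Horovod code; `scale_local_grads` and its arguments are hypothetical) of the scaling that would keep local gradients consistent with mean-allreduced ones:

```python
import horovod.tensorflow as hvd

# Allreduce-with-mean divides shared-variable gradients by hvd.size().
# Applying the same 1/size factor to rank-local gradients keeps every
# gradient expressed w.r.t. the same (mean) loss.
def scale_local_grads(grads_and_vars, local_vars):
    local_ids = {id(v) for v in local_vars}
    return [(g / hvd.size() if id(v) in local_ids else g, v)
            for g, v in grads_and_vars]
```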
Unit tests all pass now -- great work!
Josh already reviewed the changes in the earlier version of this PR, so I have very little to add; I really appreciate the effort of adding proper tests for the changes. One thing: #3700 just landed, and you should update some of the TF version checks accordingly after rebasing to master.
CHANGELOG.md
Outdated
@@ -22,6 +22,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
- Added `transformation_edit_fields` and `transformation_removed_fields` param for EstimatorParams. ([#3651](https://github.com/horovod/horovod/pull/3651))
- Added `PartialDistributedGradientTape()` API for model parallel use cases. ([#3643](https://github.com/horovod/horovod/pull/3643))
- Enable use of native `ncclAvg` op for NCCL allreduces. ([#3646](https://github.com/horovod/horovod/pull/3646))
- Added `register_local_var` functionality to distributed optimizers and local gradient aggregators. ([3695](https://github.com/horovod/horovod/pull/3695))
Suggested change:
```diff
-- Added `register_local_var` functionality to distributed optimizers and local gradient aggregators. ([3695](https://github.com/horovod/horovod/pull/3695))
+- TensorFlow: Added `register_local_var` functionality to distributed optimizers and local gradient aggregators. ([3695](https://github.com/horovod/horovod/pull/3695))
```
I don't think that behavior is included with this PR, but I'm also not sure if that's really something that Horovod's DistributedOptimizer should do automatically, or rather something that's better controlled explicitly in user code.
I think it is fine to control it in user code with explicit use of a gradient tape. But the optimizer wraps gradient calculation, communication, and weight update; bolting on another level of gradient-scale handling seems to more or less defeat the purpose of using a distributed optimizer. Maybe add options to the optimizer?
Not sure if I'm understanding your point well or not. But note that this is for model-parallel use cases, not data-parallel ones. In model-parallel use cases, each rank exclusively owns its local layers (vars, and hence their gradients), which are not shared among ranks. Therefore, averaging the gradients of local layers defeats the purpose of "model parallelism". The example model in the unit test is probably not the best showcase, because all ranks deem the first (few) layers local. But in practice, in more real-world use cases (our models at LinkedIn, for example), models are multi-tower (block), where each tower is local to only one rank.
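For illustration, a hedged sketch of that multi-tower pattern (the tower construction and names are hypothetical, not taken from this PR):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Each rank builds only its own tower; this tower's variables never exist on
# any other rank, so averaging their gradients across ranks is meaningless.
my_tower = tf.keras.layers.Embedding(input_dim=10_000, output_dim=64,
                                     name=f"tower_{hvd.rank()}")
my_tower.build((None,))

opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3))
for var in my_tower.trainable_variables:
    opt.register_local_var(var)  # keep this tower's gradients rank-local
```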
TF: Add register_local_var to distributed optimizers and gradient aggregators. Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
I'm talking about the hybrid case, where part of the model is model parallel (e.g., different embedding tables on different ranks) and part is data parallel (e.g., an MLP sitting on top of the embeddings), with an alltoall between the model-parallel and data-parallel regions. In the data-parallel region, gradients are averaged by allreduce, so technically they are computed w.r.t. the average global loss. But when it backpropagates into the model-parallel region, the activation gradient w.r.t. the global batch is available on one rank, and the weight gradient computed from it is w.r.t. the sum of the global loss, not the mean; hence the discrepancy. The model will still train, but gets slightly worse accuracy.
Suppose we have two ranks, each with one local embedding layer: E1 on rank 1 and E2 on rank 2. Backprop through the alltoall gives each embedding a gradient w.r.t. the sum of both ranks' losses, while the MLP gradients are averaged. After dividing the gradients of E1 and E2 by 2, all the gradients (embedding and MLP) are calculated w.r.t. the same loss function, and there is no issue anymore.
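A toy two-rank calculation of that fix (the numbers are made up for illustration):

```python
size = 2  # hvd.size()

# Data-parallel MLP weight: allreduce with mean averages the per-rank grads,
# i.e. it yields the gradient of the *mean* of the two losses.
g_mlp_rank1, g_mlp_rank2 = 4.0, 6.0
g_mlp = (g_mlp_rank1 + g_mlp_rank2) / size  # 5.0 -> grad of mean loss

# Local embedding E1 on rank 1: the alltoall hands it the activation gradient
# for the whole global batch, so its raw weight gradient is w.r.t. the *sum*
# of the per-rank losses.
g_e1_sum = 10.0  # grad of (L1 + L2)

# Dividing by size re-expresses it w.r.t. the mean loss, consistent with g_mlp.
g_e1 = g_e1_sum / size  # 5.0 -> grad of mean loss
```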
Will do. I'll probably create a simple repro case.
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Checklist before submitting
Description
This continues the work done in #3628 and #3643 and adds the same functionality to the distributed optimizers and local gradient aggregators. It is useful for model-parallel use cases that use distributed optimizers and want to skip syncing their "local" gradients.
An example usage is shown in the unit tests included in this PR, but in short it looks like the following:
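(A minimal sketch, assuming a Keras setup; the model, shapes, and names are illustrative rather than taken from the PR's tests.)

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Rank-local block (model parallel): its variables exist only on this rank.
local_layer = tf.keras.layers.Dense(16, name=f"local_{hvd.rank()}")
# Shared block (data parallel): replicated and allreduced as usual.
shared_layer = tf.keras.layers.Dense(1, name="shared")

model = tf.keras.Sequential([local_layer, shared_layer])
model.build(input_shape=(None, 8))

opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01))

# Tell the distributed optimizer (and its local gradient aggregator) to skip
# allreduce for the local variables.
for var in local_layer.trainable_variables:
    opt.register_local_var(var)

model.compile(loss="mse", optimizer=opt)
```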
If this change gets merged then, similar to #3643, we can possibly add a new API called `PartialDistributedOptimizer`.