Fixes gradient update timing in TF AggregationHelperEager #3496

plliao · 2022-03-25T23:53:25Z

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

Fixes gradient update timing in TF LocalGradientAggregationHelperEager.

Why?
The gradient updating timing in LocalGradientAggregationHelperEager was set one step before the locally aggregated gradients are allreduced. Therefore, the applied gradient are the locally aggregated gradient instead of the allreduced gradient. This PR fixes the issue.

The PR also improves the gradient aggregation test case to cover tf.IndexedSlices gradient and validates the updated values. Before the change, the grad and values are always zero which does not test anything.

Review process to land

All tests and other checks must succeed.
At least one member of the technical steering committee must review and approve.
If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: Pei-Lun Liao <pliao@linkedin.com>

github-actions · 2022-03-26T03:15:04Z

Unit Test Results

    446 files -   375   446 suites - 375 7h 5m 13s ⏱️ - 2h 29m 5s
    766 tests +      1   615 ✔️ -   107   149 💤 +  106 2 ❌ +2
10 330 runs - 8 506 7 134 ✔️ - 6 419 3 194 💤 - 2 089 2 ❌ +2

For more details on these failures, see this check.

Results for commit c5f57d2. ± Comparison against base commit 4b8cc49.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.

test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation

test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_0
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_1

This pull request skips 106 tests.

test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_orthogonal
test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_parallel
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_allgather_object
test.parallel.test_mxnet2.MX2Tests ‑ test_broadcast_object
test.parallel.test_mxnet2.MX2Tests ‑ test_compression_fp16
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
…

github-actions · 2022-03-26T03:15:15Z

Unit Test Results (with flaky tests)

    501 files -   404   501 suites - 404 8h 17m 34s ⏱️ - 1h 38m 26s
    766 tests +      1   612 ✔️ -   110   149 💤 +  106 5 ❌ +5
11 544 runs - 9 488 8 005 ✔️ - 6 880 3 530 💤 - 2 617 9 ❌ +9

For more details on these failures, see this check.

Results for commit c5f57d2. ± Comparison against base commit 4b8cc49.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.

test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation

test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_0
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_1

This pull request skips 106 tests.

test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_orthogonal
test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_parallel
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_allgather_object
test.parallel.test_mxnet2.MX2Tests ‑ test_broadcast_object
test.parallel.test_mxnet2.MX2Tests ‑ test_compression_fp16
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
…

Tixxx · 2022-03-30T06:09:42Z

I'm not entirely sure if we should remove it. From the comments in "non_aggregation_step":

            # In TF 2.4+ `_aggregate_gradients()` is called from inside of `apply_gradients()`.

               # We account for this by calling `_aggregate_gradients()` for steps where we do
               # not call `apply_gradients()`.

My understanding is that they return 'true' 1 step early to let apply_gradients() do the final step which will then call _allreduce again which returns reduced tensors.
@tgaddair @aaron276h

plliao · 2022-03-30T19:11:29Z

I'm not entirely sure if we should remove it. From the comments in "non_aggregation_step":
            # In TF 2.4+ `_aggregate_gradients()` is called from inside of `apply_gradients()`.
               # We account for this by calling `_aggregate_gradients()` for steps where we do
               # not call `apply_gradients()`.
My understanding is that they return 'true' 1 step early to let apply_gradients() do the final step which will then call _allreduce again which returns reduced tensors. @tgaddair @aaron276h

I think behavior was changed in this PR. After TF 2.4, the gradients are allreduced in _compute_gradients instead of _aggregate_gradients. Hence, the applied gradients is not reduced and averaged.
#2647

Tixxx · 2022-03-30T23:47:00Z

I'm not entirely sure if we should remove it. From the comments in "non_aggregation_step":
            # In TF 2.4+ `_aggregate_gradients()` is called from inside of `apply_gradients()`.
               # We account for this by calling `_aggregate_gradients()` for steps where we do
               # not call `apply_gradients()`.
My understanding is that they return 'true' 1 step early to let apply_gradients() do the final step which will then call _allreduce again which returns reduced tensors. @tgaddair @aaron276h
I think behavior was changed in this PR. After TF 2.4, the gradients are allreduced in _compute_gradients instead of _aggregate_gradients. Hence, the applied gradients is not reduced and averaged. #2647

I see, thanks for the explanation.

shaowei-su · 2022-05-02T19:07:57Z

Quick follow up on this issue: @Tixxx could you help push this issue fixes (with the other PR: #3499) into production releases? the latest release 0.24.3 does not seem to come with this fix. Thanks!

Tixxx · 2022-05-02T19:13:32Z

Quick follow up on this issue: @Tixxx could you help push this issue fixes (with the other PR: #3499) into production releases? the latest release 0.24.3 does not seem to come with this fix. Thanks!

Done, added the two PRs to 0.25.0 release milestone. You can check them out here

shaowei-su · 2022-05-02T19:18:38Z

Thank you @Tixxx for your quick responses and help!

Fixes gradient update timing in TF AggregationHelperEager

c5f57d2

Signed-off-by: Pei-Lun Liao <pliao@linkedin.com>

plliao force-pushed the fix_apply_gradient_timing_in_TF_AggregationHelper branch from ecda0ce to c5f57d2 Compare March 25, 2022 23:54

Tixxx approved these changes Mar 30, 2022

View reviewed changes

Tixxx merged commit 9e87451 into horovod:master Mar 31, 2022

Tixxx added this to the v0.25.0 milestone May 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes gradient update timing in TF AggregationHelperEager #3496

Fixes gradient update timing in TF AggregationHelperEager #3496

plliao commented Mar 25, 2022

github-actions bot commented Mar 26, 2022

github-actions bot commented Mar 26, 2022

Tixxx commented Mar 30, 2022 •

edited

plliao commented Mar 30, 2022

Tixxx commented Mar 30, 2022

shaowei-su commented May 2, 2022

Tixxx commented May 2, 2022

shaowei-su commented May 2, 2022

Fixes gradient update timing in TF AggregationHelperEager #3496

Fixes gradient update timing in TF AggregationHelperEager #3496

Conversation

plliao commented Mar 25, 2022

Checklist before submitting

Description

Review process to land

github-actions bot commented Mar 26, 2022

Unit Test Results

github-actions bot commented Mar 26, 2022

Unit Test Results (with flaky tests)

Tixxx commented Mar 30, 2022 • edited

plliao commented Mar 30, 2022

Tixxx commented Mar 30, 2022

shaowei-su commented May 2, 2022

Tixxx commented May 2, 2022

shaowei-su commented May 2, 2022

Tixxx commented Mar 30, 2022 •

edited