feat: support gradient accumulation in spark torch estimator #3681

Merged: 7 commits merged into horovod:master from the spark-torch-gradient-accumulation branch on Sep 13, 2022

Conversation

@thinkall (Contributor) commented Sep 7, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

We can use backward_passes_per_step in torch, although we need to adjust the training code to make it work as expected (a sketch follows below). However, there is no implementation supporting it in the Spark Torch estimator.
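
For context, here is a minimal sketch (not the code changed by this PR) of the kind of training-code adjustment plain Horovod PyTorch needs for gradient accumulation; the toy model, the synthetic batches, and the ACCUM constant are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()
ACCUM = 4  # assumed number of backward passes per optimizer step

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Horovod delays the gradient allreduce until ACCUM backward passes have run.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    backward_passes_per_step=ACCUM)

batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]
for step, (x, y) in enumerate(batches):
    loss = F.cross_entropy(model(x), y)
    loss = loss / ACCUM              # average gradients over the micro-batches
    loss.backward()
    if (step + 1) % ACCUM == 0:      # update weights once per ACCUM batches
        optimizer.step()
        optimizer.zero_grad()
```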

In this PR, we added support to the Spark Torch estimator, so gradient accumulation can be applied by simply setting the backward_passes_per_step param in TorchEstimator. The example and tests are modified accordingly.
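
A hedged sketch of what this enables on the estimator side; the store path, column names, input shapes, and the value 4 are illustrative assumptions rather than the PR's actual example:

```python
import torch
import torch.nn.functional as F
import horovod.spark.torch as hvd_spark
from horovod.spark.common.store import Store

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

store = Store.create('/tmp/horovod-work')  # assumed local work directory
estimator = hvd_spark.TorchEstimator(
    num_proc=2,
    store=store,
    model=model,
    optimizer=optimizer,
    loss=F.cross_entropy,
    input_shapes=[[-1, 10]],
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=32,
    epochs=1,
    backward_passes_per_step=4)  # the param this PR wires through

# torch_model = estimator.fit(train_df)  # train_df: a Spark DataFrame of features/labels (assumed)
```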

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

EverybodyHops and others added 4 commits on September 6, 2022 07:07 (Signed-off-by: Li Jiang <bnujli@gmail.com>)
@EnricoMi (Collaborator) commented Sep 9, 2022

Please rebase with latest master, which includes a fix for the CI.

@thinkall force-pushed the spark-torch-gradient-accumulation branch from 3591557 to ac55b99 on September 9, 2022 07:02
@thinkall (Contributor, Author) commented Sep 9, 2022

> Please rebase with latest master, which includes a fix for the CI.

Thanks @EnricoMi, done!

@chongxiaoc (Collaborator) commented

Out of office this week. I will take a look when I'm back.

github-actions bot commented Sep 9, 2022

Unit Test Results

1,049 files (±0), 1,049 suites (±0), 11h 3m 32s ⏱️ (+6m 40s)
813 tests (±0): 755 ✔️ passed (±0), 58 💤 skipped (±0), 0 failed (±0)
20,592 runs (±0): 14,536 ✔️ passed (±0), 6,056 💤 skipped (±0), 0 failed (±0)

Results for commit 6d33ba4. Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

github-actions bot commented Sep 9, 2022

Unit Test Results (with flaky tests)

1,169 files (-35), 1,169 suites (-35), 11h 33m 24s ⏱️ (-9m 14s)
813 tests (±0): 755 ✔️ passed (+1), 58 💤 skipped (±0), 0 failed (-1)
23,072 runs (-515): 16,010 ✔️ passed (-347), 7,062 💤 skipped (-167), 0 failed (-1)

Results for commit 6d33ba4. Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

Review comment on horovod/spark/torch/remote.py (outdated, resolved)
@thinkall force-pushed the spark-torch-gradient-accumulation branch from bd25468 to e6133e3 on September 13, 2022 02:13
Commit: do loss.div_ only when backward_passes_per_step > 1 (Signed-off-by: Li Jiang <bnujli@gmail.com>)
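
A hedged illustration of the guard this commit message describes; the helper name and values are hypothetical, not the code in horovod/spark/torch/remote.py:

```python
import torch

def scale_loss(loss: torch.Tensor, backward_passes_per_step: int) -> torch.Tensor:
    # Only pay for the in-place division when gradients are actually being
    # accumulated over several backward passes; dividing by 1 is a wasted op.
    if backward_passes_per_step > 1:
        loss.div_(backward_passes_per_step)
    return loss

loss = torch.tensor(2.0)
scale_loss(loss, 4)  # loss becomes 0.5
scale_loss(loss, 1)  # left as-is: no division when accumulation is off
```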
@thinkall requested review from Tixxx and removed the request for chongxiaoc on September 13, 2022 12:49
@chongxiaoc merged commit 94529cc into horovod:master on Sep 13, 2022
@thinkall deleted the spark-torch-gradient-accumulation branch on September 14, 2022 06:15
Labels: none yet
Linked issues: none yet
Participants: 5