feat: support gradient accumulation in spark torch estimator #3681

Merged: 7 commits merged into horovod:master from the spark-torch-gradient-accumulation branch on Sep 13, 2022

Conversation

@thinkall (Contributor) commented Sep 7, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

We can use backward_passes_per_step in torch, although we need to adjust the training code to make it work as expected (a sketch follows below). However, there is no implementation supporting it in the Spark Torch estimator.
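
For context, here is a minimal sketch (not the code changed by this PR) of the kind of training-code adjustment plain Horovod PyTorch needs for gradient accumulation; the toy model, the synthetic batches, and the ACCUM constant are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()
ACCUM = 4  # assumed number of backward passes per optimizer step

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Horovod delays the gradient allreduce until ACCUM backward passes have run.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    backward_passes_per_step=ACCUM)

batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]
for step, (x, y) in enumerate(batches):
    loss = F.cross_entropy(model(x), y)
    loss = loss / ACCUM              # average gradients over the micro-batches
    loss.backward()
    if (step + 1) % ACCUM == 0:      # update weights once per ACCUM batches
        optimizer.step()
        optimizer.zero_grad()
```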

In this PR, we added support to the Spark Torch estimator, so gradient accumulation can be applied by simply setting the backward_passes_per_step param in TorchEstimator. The example and tests are modified accordingly.
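
A hedged sketch of what this enables on the estimator side; the store path, column names, input shapes, and the value 4 are illustrative assumptions rather than the PR's actual example:

```python
import torch
import torch.nn.functional as F
import horovod.spark.torch as hvd_spark
from horovod.spark.common.store import Store

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

store = Store.create('/tmp/horovod-work')  # assumed local work directory
estimator = hvd_spark.TorchEstimator(
    num_proc=2,
    store=store,
    model=model,
    optimizer=optimizer,
    loss=F.cross_entropy,
    input_shapes=[[-1, 10]],
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=32,
    epochs=1,
    backward_passes_per_step=4)  # the param this PR wires through

# torch_model = estimator.fit(train_df)  # train_df: a Spark DataFrame of features/labels (assumed)
```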

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

EverybodyHops and others added 4 commits on September 6, 2022 07:07 (Signed-off-by: Li Jiang <bnujli@gmail.com>)
@EnricoMi (Collaborator) commented Sep 9, 2022

Please rebase with latest master, which includes a fix for the CI.

@thinkall force-pushed the spark-torch-gradient-accumulation branch from 3591557 to ac55b99 on September 9, 2022 07:02
@thinkall (Contributor, Author) commented Sep 9, 2022

> Please rebase with latest master, which includes a fix for the CI.

Thanks @EnricoMi, done!

@chongxiaoc (Collaborator) commented

Out of office this week. I will take a look when I'm back.

github-actions bot commented Sep 9, 2022

Unit Test Results

1,049 files (±0), 1,049 suites (±0), 11h 3m 32s ⏱️ (+6m 40s)
813 tests (±0): 755 ✔️ passed (±0), 58 💤 skipped (±0), 0 failed (±0)
20,592 runs (±0): 14,536 ✔️ passed (±0), 6,056 💤 skipped (±0), 0 failed (±0)

Results for commit 6d33ba4. Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

github-actions bot commented Sep 9, 2022

Unit Test Results (with flaky tests)

1,169 files (-35), 1,169 suites (-35), 11h 33m 24s ⏱️ (-9m 14s)
813 tests (±0): 755 ✔️ passed (+1), 58 💤 skipped (±0), 0 failed (-1)
23,072 runs (-515): 16,010 ✔️ passed (-347), 7,062 💤 skipped (-167), 0 failed (-1)

Results for commit 6d33ba4. Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

Review comment on horovod/spark/torch/remote.py (outdated, resolved)
@thinkall force-pushed the spark-torch-gradient-accumulation branch from bd25468 to e6133e3 on September 13, 2022 02:13
Commit: do loss.div_ only when backward_passes_per_step > 1 (Signed-off-by: Li Jiang <bnujli@gmail.com>)
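
A hedged illustration of the guard this commit message describes; the helper name and values are hypothetical, not the code in horovod/spark/torch/remote.py:

```python
import torch

def scale_loss(loss: torch.Tensor, backward_passes_per_step: int) -> torch.Tensor:
    # Only pay for the in-place division when gradients are actually being
    # accumulated over several backward passes; dividing by 1 is a wasted op.
    if backward_passes_per_step > 1:
        loss.div_(backward_passes_per_step)
    return loss

loss = torch.tensor(2.0)
scale_loss(loss, 4)  # loss becomes 0.5
scale_loss(loss, 1)  # left as-is: no division when accumulation is off
```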
@thinkall requested review from Tixxx and removed the request for chongxiaoc on September 13, 2022 12:49
@chongxiaoc merged commit 94529cc into horovod:master on Sep 13, 2022
@thinkall deleted the spark-torch-gradient-accumulation branch on September 14, 2022 06:15
Labels: none yet
Linked issues: none yet
Participants: 5