
Added trainer.gradient_accumulation_steps for increasing effective batch size #3305

Merged: 2 commits merged into master from grad-accum on Mar 31, 2023

Conversation

tgaddair (Collaborator) commented:

Benefits include:

  • Lower network bandwidth overhead by reducing the frequency of allreduce / gradient synchronization
  • A larger effective batch size, which smooths out gradient variance when training very large models
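To make the mechanics concrete, here is a minimal, generic PyTorch sketch of gradient accumulation (illustrative only, not the implementation in this PR); the model and synthetic data are placeholders:

```python
# Minimal gradient accumulation sketch (generic PyTorch, not Ludwig's code).
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

gradient_accumulation_steps = 4  # mirrors the new trainer option
# Synthetic micro-batches of size 8; effective batch size = 8 * 4 = 32.
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches one large batch.
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        # Under DDP, gradients only need to be synchronized here, once per
        # accumulation window (e.g. by using no_sync() on the other steps).
        optimizer.step()
        optimizer.zero_grad()
```

Because the optimizer step (and, in distributed training, the gradient synchronization) happens only once per accumulation window, the allreduce frequency drops while each update reflects 32 samples instead of 8.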

On Mar 29, 2023, @tgaddair changed the title from "Added trainer.gradient_accumulation option for increasing effective batch size" to "Added trainer.gradient_accumulation_steps for increasing effective batch size".
github-actions bot commented on Mar 29, 2023:

Unit Test Results

  6 files ±0 · 6 suites ±0 · ⏱️ 1h 52m 27s (+1h 30m 16s)
  153 tests +141: 140 ✔️ passed (+130), 13 💤 skipped (+11), 0 failed (±0)
  193 runs +133: 172 ✔️ passed (+124), 21 💤 skipped (+9), 0 failed (±0)

Results for commit 37b6678, compared against base commit 531e024.

♻️ This comment has been updated with latest results.

@justinxzhao (Collaborator) left a comment:

C00L! 👍


```python
# Just test that training completes without error.
# TODO(travis): We may want to expand upon this in the future to include some checks on model
# convergence like gradient magnitudes, etc. Should also add distributed tests.
```
Collaborator (inline comment on the lines above):

Should we re-use the test utility for distributed tests?

```python
run_test_suite(config, dataset, "ray")
```
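For reference, a sketch of what a config exercising the new option might look like is below; the feature entries are hypothetical placeholders, and only trainer.gradient_accumulation_steps is the option this PR adds:

```python
# Hypothetical config (placeholder features; not the PR's actual test config).
config = {
    "input_features": [{"name": "text_in", "type": "text"}],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {
        "batch_size": 8,
        "gradient_accumulation_steps": 4,  # new option: effective batch size 8 * 4 = 32
    },
}
```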

@tgaddair (Collaborator, Author) replied on Mar 31, 2023:

Yeah, I think we should refactor to make this possible in a follow-up. The reason I didn't do it here is that that function is much more expensive, since it runs many additional tests. But in general we should rely on standard test-suite functions that can run with different levels of checks.
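As a loose illustration of that idea, a tiered helper might look like the sketch below; the names and signature are hypothetical, and the repo's actual run_test_suite differs:

```python
# Hypothetical tiered test-suite helper (illustrative only).

def train(config, dataset, backend):
    # Placeholder for the real training entry point.
    return {"train_loss": 0.1}

def check_convergence(results):
    # Placeholder for deeper checks (gradient magnitudes, loss curves, ...).
    assert results["train_loss"] < 1.0

def run_suite(config, dataset, backend, level="smoke"):
    results = train(config, dataset, backend)
    assert results is not None  # "smoke": training completed without error
    if level == "full":
        check_convergence(results)  # heavier, more expensive checks
```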

@tgaddair merged commit 16fed3a into master on Mar 31, 2023.
@tgaddair deleted the grad-accum branch on Mar 31, 2023 at 17:17.
@arnavgarg1 (Contributor) left a comment:

Nice!
