
Control ZeRO wall clock timers #849

Merged · 6 commits from olruwase/control_zero_timers into master on Mar 11, 2021
Conversation

tjruwase (Contributor)

Ensure ZeRO 2&3 wall clock timers are controlled from deepspeed config.
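For context, ZeRO's wall clock timers are gated by the `wall_clock_breakdown` flag in the DeepSpeed JSON config. A minimal config sketch (field names per the DeepSpeed config schema; the batch size and stage values are illustrative):

```json
{
  "train_batch_size": 8,
  "wall_clock_breakdown": false,
  "zero_optimization": {
    "stage": 2
  }
}
```

With `wall_clock_breakdown` set to `false` (the default), per-step timer output should stay suppressed after this change.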

stas00 (Contributor) commented Mar 10, 2021

This fixed zero2, thank you!

zero3 still prints for each step, albeit it looks like a different debug message:

[2021-03-10 15:27:46,893] [INFO] [utils.py:555:see_memory_usage] After zero_optimizer step
[2021-03-10 15:27:46,894] [INFO] [utils.py:556:see_memory_usage] MA 0.06 GB         Max_MA 0.26 GB         CA 0.71 GB         Max_CA 1 GB
[2021-03-10 15:27:46,895] [INFO] [utils.py:564:see_memory_usage] CPU Virtual Memory:  used = 62.62 GB, percent = 49.8%

stas00 (Contributor) commented Mar 11, 2021

After your last commit, other debug prints replaced the old ones; now I get these for each step:

[2021-03-10 18:22:31,608] [INFO] [logging.py:60:log_dist] [Rank 0] step=25, skipped=5, lr=[1.446140557591026e-05], mom=[[0.8, 0.999]]
[2021-03-10 18:22:31,608] [INFO] [timer.py:154:stop] 0/25, SamplesPerSec=6.302019280317165

It looks like those were added when you rebased this PR branch.

tjruwase (Contributor, Author) commented Mar 11, 2021

> After your last commit, other debug prints replaced the old ones; now I get these for each step:
>
> [2021-03-10 18:22:31,608] [INFO] [logging.py:60:log_dist] [Rank 0] step=25, skipped=5, lr=[1.446140557591026e-05], mom=[[0.8, 0.999]]
> [2021-03-10 18:22:31,608] [INFO] [timer.py:154:stop] 0/25, SamplesPerSec=6.302019280317165
>
> It looks like those were added when you rebased this PR branch.

@stas00, what is your `steps_per_print`?

stas00 (Contributor) commented Mar 11, 2021

You're correct, I had `steps_per_print` set to 1 in the config — my bad. I retested and all is perfect!
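For anyone hitting the same symptom: the `log_dist`/`SamplesPerSec` lines above are throttled by `steps_per_print` in the DeepSpeed config (default 10), not by the timer flag. A sketch of the relevant config fragment, values illustrative:

```json
{
  "steps_per_print": 10,
  "wall_clock_breakdown": false
}
```

Setting `steps_per_print` to 1 makes those lines appear on every step, which is what was observed here.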

Thank you!

@tjruwase tjruwase merged commit 311795d into master Mar 11, 2021
jeffra added a commit to jeffra/DeepSpeed that referenced this pull request Aug 25, 2021
* set adamw_mode default true (follows FusedAdam and < 0.3.11 logic) (microsoft#844)

* less scary overflow notice (microsoft#833)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Add optimizers and schedules to RTD and updated the corresponding part in the website (microsoft#799)

* add optimizers and schedules to rtd

* update ds website and fix links

* add optimizers and schedules to rtd

* update ds website and fix links

* add flops profiler to rtd

* fix

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

* small tweaks (microsoft#839)

* Control ZeRO wall clock timers (microsoft#849)

* Control ZeRO wall clock timers

* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (microsoft#772)

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* bump to v0.3.12

* Bug fix: Remove client optimizer param_group list item that does not have 'params' (microsoft#827)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline doc typos/improvements (microsoft#659)

Admin merging for pure-doc PR that does not trigger build.

* Samyamr/inference hook fix (microsoft#851)

* Fix mis-aligned-grad

When a parameter is not divisible by world size, the partitioned gradients are mis-aligned due to incorrect padding handling. This PR fixes that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO Stage 2: Clear reduced gradients (microsoft#856)

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Squash stage3 v1 (microsoft#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* formatting fix (microsoft#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (microsoft#151)

* fp16 Z3 API update and bugfix

* revert debug change

* docs

* filling in allocation docs

* better assumption docs

* doc progress

* config json

* major docs edits

* auto registration works for accessed cases

* working on small models.

* debugging large-model discovery?

* fix discovery to first forward pass?

* return obj ext param

* support None parameters in auto-discovery

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
@mrwyattii mrwyattii deleted the olruwase/control_zero_timers branch July 7, 2023 02:40
3 participants