
Control ZeRO wall clock timers #849

Merged · 6 commits from olruwase/control_zero_timers into master on Mar 11, 2021
Conversation

tjruwase (Contributor)

Ensure ZeRO 2&3 wall clock timers are controlled from deepspeed config.
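For context, ZeRO's wall clock timers are gated by the `wall_clock_breakdown` flag in the DeepSpeed JSON config. A minimal config sketch (field names per the DeepSpeed config schema; the batch size and stage values are illustrative):

```json
{
  "train_batch_size": 8,
  "wall_clock_breakdown": false,
  "zero_optimization": {
    "stage": 2
  }
}
```

With `wall_clock_breakdown` set to `false` (the default), per-step timer output should stay suppressed after this change.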

stas00 (Contributor) commented Mar 10, 2021

This fixed zero2, thank you!

zero3 still prints for each step, albeit it looks like a different debug message:

[2021-03-10 15:27:46,893] [INFO] [utils.py:555:see_memory_usage] After zero_optimizer step
[2021-03-10 15:27:46,894] [INFO] [utils.py:556:see_memory_usage] MA 0.06 GB         Max_MA 0.26 GB         CA 0.71 GB         Max_CA 1 GB
[2021-03-10 15:27:46,895] [INFO] [utils.py:564:see_memory_usage] CPU Virtual Memory:  used = 62.62 GB, percent = 49.8%

stas00 (Contributor) commented Mar 11, 2021

After your last commit, other debug prints replaced the old ones; now I get these for each step:

[2021-03-10 18:22:31,608] [INFO] [logging.py:60:log_dist] [Rank 0] step=25, skipped=5, lr=[1.446140557591026e-05], mom=[[0.8, 0.999]]
[2021-03-10 18:22:31,608] [INFO] [timer.py:154:stop] 0/25, SamplesPerSec=6.302019280317165

It looks like those were added when you rebased this PR branch.

tjruwase (Contributor, Author) commented Mar 11, 2021

> After your last commit, other debug prints replaced the old ones; now I get these for each step:
>
> [2021-03-10 18:22:31,608] [INFO] [logging.py:60:log_dist] [Rank 0] step=25, skipped=5, lr=[1.446140557591026e-05], mom=[[0.8, 0.999]]
> [2021-03-10 18:22:31,608] [INFO] [timer.py:154:stop] 0/25, SamplesPerSec=6.302019280317165
>
> It looks like those were added when you rebased this PR branch.

@stas00, what is your `steps_per_print`?

stas00 (Contributor) commented Mar 11, 2021

You're correct, I had `steps_per_print` set to 1 in the config — my bad. I retested and all is perfect!
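For anyone hitting the same symptom: the `log_dist`/`SamplesPerSec` lines above are throttled by `steps_per_print` in the DeepSpeed config (default 10), not by the timer flag. A sketch of the relevant config fragment, values illustrative:

```json
{
  "steps_per_print": 10,
  "wall_clock_breakdown": false
}
```

Setting `steps_per_print` to 1 makes those lines appear on every step, which is what was observed here.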

Thank you!

@tjruwase tjruwase merged commit 311795d into master Mar 11, 2021
jeffra added a commit to jeffra/DeepSpeed that referenced this pull request Aug 25, 2021
* set adamw_mode default true (follows FusedAdam and < 0.3.11 logic) (microsoft#844)

* less scary overflow notice (microsoft#833)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Add optimizers and schedules to RTD and updated the corresponding part in the website (microsoft#799)

* add optimizers and schedules to rtd

* update ds website and fix links

* add optimizers and schedules to rtd

* update ds website and fix links

* add flops profiler to rtd

* fix

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

* small tweaks (microsoft#839)

* Control ZeRO wall clock timers (microsoft#849)

* Control ZeRO wall clock timers

* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (microsoft#772)

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* bump to v0.3.12

* Bug fix: Remove client optimizer param_group list item that does not have 'params' (microsoft#827)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline doc typos/improvements (microsoft#659)

Admin merging for pure-doc PR that does not trigger build.

* Samyamr/inference hook fix (microsoft#851)

* Fix mis-aligned-grad

When a parameter is not divisible by world size, the partitioned gradients are mis-aligned due to incorrect padding handling. This PR fixes that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO Stage 2: Clear reduced gradients (microsoft#856)

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Squash stage3 v1 (microsoft#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* formatting fix (microsoft#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (microsoft#151)

* fp16 Z3 API update and bugfix

* revert debug change

* docs

* filling in allocation docs

* better assumption docs

* doc progress

* config json

* major docs edits

* auto registration works for accessed cases

* working on small models.

* debugging large-model discovery?

* fix discovery to first forward pass?

* return obj ext param

* support None parameters in auto-discovery

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
@mrwyattii mrwyattii deleted the olruwase/control_zero_timers branch July 7, 2023 02:40
3 participants