ZeRO with non-zero loss scale crashes #90
Labels: bug (Something isn't working)
Comments
I second that
We're working on a fix. In the meantime, I was able to work around this in my case by setting a smaller initial scale power. See the 'initial_scale_power' fp16 parameter here: https://github.com/microsoft/DeepSpeed/blob/master/docs/config_json.md#fp16-training-options
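A minimal sketch of what that workaround could look like, assuming dynamic loss scaling (`loss_scale` of 0) so that `initial_scale_power` takes effect. The helper name `make_fp16_config`, the batch size, and the power value 8 are illustrative choices, not values taken from this thread; the initial loss scale is 2 raised to `initial_scale_power`.

```python
import json


def make_fp16_config(initial_scale_power):
    """Build a DeepSpeed-style config dict (hypothetical helper for illustration)."""
    return {
        "train_batch_size": 8,  # illustrative value
        "fp16": {
            "enabled": True,
            "loss_scale": 0,  # 0 selects dynamic loss scaling
            "initial_scale_power": initial_scale_power,
        },
    }


# Assumed value: a smaller power means a smaller starting loss scale.
config = make_fp16_config(initial_scale_power=8)
initial_scale = 2 ** config["fp16"]["initial_scale_power"]

print(json.dumps(config, indent=2))
print("initial loss scale:", initial_scale)
```

The printed JSON can be saved to a file and passed as the DeepSpeed config; whether a given power avoids the crash depends on the model and is not something this sketch verifies.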
samyam pushed a commit that referenced this issue on Nov 24, 2020:
* ZeRO stage 3 checkpointing support; add unit tests; minor fixes in installation scripts
* Formatting fixes
* Remove dead code
* Add assert
delock added a commit to delock/DeepSpeedSYCLSupport that referenced this issue on Nov 8, 2022
Users typically want dynamic loss scaling. While developing a new feature for ZeRO, I discovered that ZeRO crashes when given a non-zero (that is, static) loss scale value in DeepSpeed's config JSON. I've added two unit tests: one passes with ZeRO disabled, and the other, with ZeRO enabled, triggers this bug, so we can re-run it once the fix is in.
https://github.com/microsoft/DeepSpeed/blob/jeffra/zero_loss_scale_bug/tests/unit/test_fp16.py#L211-L285
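The linked tests are not reproduced here. As a rough sketch of the two configurations they contrast, the snippet below builds one config with a static loss scale and ZeRO off, and one with ZeRO on. The helper `make_static_scale_config`, the loss scale 128.0, and the batch size are hypothetical. The dict form of `zero_optimization` with a `stage` key follows the current DeepSpeed config schema and is an assumption here; the DeepSpeed version this issue was filed against may have used a different form.

```python
import json


def make_static_scale_config(zero_enabled, loss_scale=128.0):
    """Config with a non-zero (static) loss scale; hypothetical helper."""
    return {
        "train_batch_size": 8,  # illustrative value
        "fp16": {
            "enabled": True,
            "loss_scale": loss_scale,  # non-zero means static loss scaling
        },
        # Assumed schema: stage 0 disables ZeRO, stage 1 enables it.
        "zero_optimization": {"stage": 1 if zero_enabled else 0},
    }


baseline = make_static_scale_config(zero_enabled=False)  # reported to pass
failing = make_static_scale_config(zero_enabled=True)    # reported to crash

print(json.dumps(baseline, indent=2))
print(json.dumps(failing, indent=2))
```

The two configs differ only in the ZeRO setting, which is what isolates the bug to the combination of ZeRO and a static loss scale.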
Failed test with stack trace: https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/results?buildId=198&view=logs&j=75347757-894e-5c54-3c11-df095f4d729a&t=50de4b86-57af-55e8-ca98-b1a0d42235e2
Here's the stack trace: