ZeRO with non-zero loss scale crashes #90

Closed
jeffra opened this issue Feb 19, 2020 · 3 comments · Fixed by #166
Labels
bug Something isn't working

Comments

@jeffra
Contributor

jeffra commented Feb 19, 2020

Typically users want to use dynamic loss scaling. While developing a new feature for ZeRO, I discovered that ZeRO crashes when given a non-zero loss scale value in DeepSpeed's config JSON. I've added two unit tests: one passes with ZeRO disabled, and the other, with ZeRO enabled, triggers this bug, so we can re-run them once it is fixed.

https://github.com/microsoft/DeepSpeed/blob/jeffra/zero_loss_scale_bug/tests/unit/test_fp16.py#L211-L285

Failed test with stack trace: https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/results?buildId=198&view=logs&j=75347757-894e-5c54-3c11-df095f4d729a&t=50de4b86-57af-55e8-ca98-b1a0d42235e2

Here's the stack trace:
[image: screenshot of the stack trace]

@jeffra jeffra added the bug Something isn't working label Feb 19, 2020
@LvanderGoten

I second that

@jeffra
Contributor Author

jeffra commented Feb 24, 2020

We're working on a fix. In the meantime, I was able to get around this in my case by setting a smaller initial scale power. See the 'initial_scale_power' fp16 parameter here: https://github.com/microsoft/DeepSpeed/blob/master/docs/config_json.md#fp16-training-options
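
For anyone trying the workaround above, here is a rough sketch of what the config might look like, built as a Python dict and dumped to JSON. The keys under `fp16` (`enabled`, `loss_scale`, `initial_scale_power`) come from the DeepSpeed config docs linked above. The batch size, the value 16, and the helper name `make_fp16_zero_config` are assumptions for illustration only; the thread does not say which value was used. The boolean form of `zero_optimization` is also an assumption about the config schema at the time of this issue.

```python
import json


def make_fp16_zero_config(initial_scale_power=16):
    """Hypothetical helper: build a DeepSpeed-style config fragment.

    The default of 16 is an illustrative assumption, not a value
    taken from this issue.
    """
    return {
        "train_batch_size": 8,       # assumed placeholder value
        "zero_optimization": True,   # assumed boolean form of the ZeRO switch
        "fp16": {
            "enabled": True,
            "loss_scale": 0,         # 0 selects dynamic loss scaling
            "initial_scale_power": initial_scale_power,
        },
    }


config = make_fp16_zero_config(initial_scale_power=16)

# The initial dynamic loss scale is 2 ** initial_scale_power.
initial_scale = 2 ** config["fp16"]["initial_scale_power"]

print(json.dumps(config, indent=2))
print("initial loss scale:", initial_scale)
```

This is only a config fragment; a real run would also need an optimizer section and the rest of the training setup.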

@ShadenSmith ShadenSmith linked a pull request Mar 23, 2020 that will close this issue
@ShadenSmith
Contributor

@jeffra I believe PR #166 addresses this issue. Let me know what you think.

samyam pushed a commit that referenced this issue Nov 24, 2020
* ZeRO stage 3 checkpointing support
Add unit tests
Minor fixes in installation scripts

* Formatting fixes

* Remove dead code

* Add assert
delock added a commit to delock/DeepSpeedSYCLSupport that referenced this issue Nov 8, 2022