Improve loss overflow logs #3008

Merged
merged 4 commits into from Mar 15, 2023

Quentin-Anthony commented Mar 13, 2023

This is a small quality-of-life PR that resolves two issues:

  1. The loss overflow code is duplicated across ZeRO-3 and ZeRO-1/2.
  2. The loss overflow log messages don't account for fp16 hysteresis, which produces confusing messages in which the loss scale doesn't change, such as those reported in "[loss OVERFLOW] Several Issues" (#1599):

[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536

With this PR, loss overflow logs instead look like:

...
OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 3. Reducing hysteresis to 2
OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
...
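
For readers unfamiliar with hysteresis in dynamic loss scaling, the behavior behind these messages can be sketched roughly as follows. This is a toy illustration, not DeepSpeed's actual code: the class name `HysteresisLossScaler`, the method `on_overflow`, and all parameter names and defaults are assumptions made for this example. The idea is that an overflow only lowers the loss scale once the hysteresis counter has been used up; before that, each overflow just decrements the counter.

```python
class HysteresisLossScaler:
    """Toy dynamic loss scaler with hysteresis (hypothetical, for illustration only)."""

    def __init__(self, init_scale=65536, scale_factor=2, hysteresis=3, min_scale=1):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.cur_hysteresis = hysteresis
        self.min_scale = min_scale

    def on_overflow(self, rank=0):
        """Handle one overflowed step and return the log message for it."""
        prefix = (f"OVERFLOW! Rank {rank} Skipping step. "
                  f"Attempted loss scale: {self.cur_scale}")
        if self.cur_hysteresis > 1:
            # Hysteresis budget remains: keep the scale, spend one unit of budget.
            old = self.cur_hysteresis
            self.cur_hysteresis -= 1
            return (f"{prefix}, but hysteresis is {old}. "
                    f"Reducing hysteresis to {self.cur_hysteresis}")
        # Budget exhausted: actually lower the loss scale (never below min_scale).
        new_scale = max(self.cur_scale // self.scale_factor, self.min_scale)
        self.cur_scale = new_scale
        return f"{prefix}, reducing to {new_scale}"


if __name__ == "__main__":
    scaler = HysteresisLossScaler()
    for _ in range(4):  # four consecutive overflowed steps
        print(scaler.on_overflow())
```

With the assumed defaults (initial scale 65536, hysteresis 3, factor 2), four consecutive overflows print the same four lines as the example log above: two hysteresis reductions followed by two scale reductions.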

Fixes: #1599

@stas00 @jeffra @tjruwase

stas00 commented Mar 14, 2023

Thank you for adding these improvements, @Quentin-Anthony!

You can add:

Fixes: https://github.com/microsoft/DeepSpeed/issues/1599

to your OP and it'll automatically close that issue on merge.

stas00 commented Mar 14, 2023

I'm looking into the breakage on the HF side.

stas00 commented Mar 14, 2023

OK, this is fixed on the HF Transformers side. @Quentin-Anthony, could you please run:

git commit --allow-empty -m "Trigger CI"
git push

to re-run the CI? Thank you!

stas00 commented Mar 14, 2023

Excellent. Thank you for re-testing, @Quentin-Anthony!

The AMD CI job seems to fail everywhere.

@tjruwase tjruwase merged commit ac2c9ff into microsoft:master Mar 15, 2023