fix: Validate fp16.loss_scale is finite and non-negative #7889

tohtana merged 12 commits into deepspeedai:master
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0059a795a
```python
Loss scaling value. Default value of 0 means dynamic loss scaling instead of static loss scale.
"""

@field_validator("loss_scale")
```
Run loss_scale validator before type coercion
This validator is declared with the default mode="after", so Pydantic will coerce inputs to float first; as a result, the new isinstance(v, bool) guard never triggers because true/false become 1.0/0.0 before _validate_loss_scale runs. In configs that set fp16.loss_scale to a boolean, the value is still silently accepted, which defeats the stated validation goal and can unexpectedly switch to static scaling (true -> 1.0).
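To make the coercion ordering concrete, here is a minimal plain-Python sketch (no Pydantic required; the function names are illustrative, not from the PR) of why a mode="after" guard cannot see the original bool while a mode="before" guard can:

```python
def validate_after_coercion(v):
    # Simulates mode="after": Pydantic has already coerced the input,
    # so True has become 1.0 before the validator runs.
    v = float(v)
    if isinstance(v, bool):  # never triggers: v is already a plain float here
        raise ValueError("fp16.loss_scale must be a number, not bool")
    return v

def validate_before_coercion(v):
    # Simulates mode="before": the validator sees the raw input, so the
    # bool guard fires before any coercion to float.
    if isinstance(v, bool):
        raise ValueError("fp16.loss_scale must be a number, not bool")
    return float(v)

print(validate_after_coercion(True))   # 1.0 -- bool silently accepted
try:
    validate_before_coercion(True)
except ValueError as exc:
    print(exc)                         # the guard fires as intended
```

This is exactly the failure mode described above: with mode="after", a config setting fp16.loss_scale to true silently switches to static scaling at 1.0.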
I think this comment makes sense. Can you address it? @nathon-lee
Thanks — I agree this comment makes sense. I’ll address it and push an update shortly. @tohtana
Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).
PKUWZP
left a comment
Switch to mode="before" and add some tests.
| """ | ||
|
|
||
| @field_validator("loss_scale") | ||
| @classmethod |
- Consider using mode="before" for the entire validator rather than splitting into two validators. A single mode="before" validator can handle both the bool check and the finite/negative checks:

```python
@classmethod
def _validate_loss_scale(cls, v):
    if isinstance(v, bool):
        raise ValueError("fp16.loss_scale must be a number, not bool")
    v = float(v)
    if not math.isfinite(v):
        raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
    if v < 0:
        raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
    return v
```

- Test coverage: There are no tests included. A few unit tests in tests/unit/runtime/ asserting that invalid loss_scale values (-1, float('inf'), float('nan'), True) raise ValidationError would strengthen this PR and prevent regressions. The existing pattern in the repo uses DeepSpeedFP16Config(loss_scale=...) directly, which makes such tests straightforward.
Thanks — good suggestion. I’ll consolidate into a single mode="before" validator and add unit tests (e.g. -1, inf, nan, True -> ValidationError) using DeepSpeedFP16Config(loss_scale=...). I’ll push an update shortly. @PKUWZP
@nathon-lee If we pass an invalid value like [] and {}, won't float() raise TypeError now?
The current master raises Pydantic's ValidationError for these, which is clearer than a raw TypeError.
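A quick stdlib check confirms the concern: float() raises TypeError for list/dict inputs but ValueError for unparsable strings, which is why a bare float(v) in a mode="before" validator would leak a raw TypeError for [] and {}:

```python
# What float() raises for a few invalid inputs (plain Python, no Pydantic needed):
for bad in ([], {}, "not-a-number"):
    try:
        float(bad)
    except (TypeError, ValueError) as exc:
        print(type(bad).__name__, "->", type(exc).__name__)
```

Pydantic only wraps ValueError (and AssertionError) from validators into a ValidationError, so a TypeError escaping the validator would surface directly to the user.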
@tohtana Thanks for catching this.
I added a try/except (TypeError, ValueError) around float(v) in the mode="before" validator so invalid types (e.g. [], {}) are converted into a clear ValueError, which Pydantic wraps as ValidationError (instead of surfacing a raw TypeError). I also added unit tests covering [] and {} to prevent regressions.
tohtana
left a comment
Thank you for the fix! @nathon-lee
@PKUWZP Can you confirm that your request has been met? We need your confirmation to merge this PR.
PKUWZP
left a comment
The changes look good to me.
fix: Validate fp16.loss_scale is finite and non-negative (#7889)

Add a Pydantic field validator to DeepSpeedFP16Config to reject NaN/inf/-inf and negative values for fp16.loss_scale (while keeping 0 as dynamic loss scaling). This prevents invalid configs from silently initializing and causing NaNs during training.

Test: Run pytest -q tests/unit/runtime/test_precision_config_loss_scale.py

Result:

```
root@72170d0458e9:/home/DeepSpeed_woo# pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
=================================================================== test session starts ===================================================================
platform linux -- Python 3.11.10, pytest-8.3.5, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
Using --randomly-seed=1526199052
rootdir: /home/DeepSpeed_woo/tests
configfile: pytest.ini
plugins: xdist-3.8.0, randomly-4.0.1, forked-1.6.0, anyio-4.6.0
collected 10 items

tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[3] PASSED [ 10%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[0] PASSED [ 20%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[inf] PASSED [ 30%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[1] PASSED [ 40%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[nan] PASSED [ 50%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[2.0] PASSED [ 60%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[True] PASSED [ 70%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale0] PASSED [ 80%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[-1] PASSED [ 90%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale1] PASSED [100%]

(30 durations < 1s hidden. Use -vv to show these durations.)
============================================================= 10 passed, 16 warnings in 4.18s =============================================================
```

Fix issue deepspeedai#7852

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>