[deepseek_v4] keep hc_head / sinks / position_bias in fp32#46198

Merged
ArthurZucker merged 1 commit into main from fix-deepseek-v4-keep-in-fp32 on May 27, 2026

@ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented May 25, 2026

Fixes #46167: adds `hc_head`, `sinks`, and `position_bias` to `_keep_in_fp32_modules_strict` so the remaining 112 fp32 tensors stop being silently downcast to bf16.

Issue #46167: 417 fp32 plumbing tensors get downcast to bf16 because
`_keep_in_fp32_modules_strict` was missing entries for `hc_head` (top-level +
MTP), `sinks` (per-attention sink token), and `position_bias` (compressor and
indexer compressor). Adds the three patterns so save_pretrained preserves the
source dtype for the full set of 417 tensors instead of 305.
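
A minimal sketch of how name patterns like these could select which tensors stay in fp32. It assumes a pattern matches a whole dot-separated component of a parameter name; the helper functions and the parameter names below are hypothetical and are not taken from the deepseek_v4 checkpoint or from the library's implementation.

```python
import re

# The three patterns added by this PR. The matching rule below is an
# assumption made for illustration only.
KEEP_IN_FP32 = ["hc_head", "sinks", "position_bias"]


def build_keep_regex(patterns):
    """Match a pattern only when it is a full dot-separated name component."""
    parts = [rf"(^|\.){re.escape(p)}($|\.)" for p in patterns]
    return re.compile("|".join(parts))


def split_by_dtype_policy(names, patterns):
    """Return (names kept in fp32, names free to be cast to the target dtype)."""
    regex = build_keep_regex(patterns)
    keep = [n for n in names if regex.search(n)]
    cast = [n for n in names if not regex.search(n)]
    return keep, cast


# Hypothetical parameter names, chosen to exercise each pattern.
names = [
    "model.hc_head.weight",
    "model.layers.0.self_attn.sinks",
    "model.layers.0.self_attn.compressor.position_bias",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.sinks_like.weight",  # similar name, not a full component
]

keep, cast = split_by_dtype_policy(names, KEEP_IN_FP32)
print("keep in fp32:", keep)
print("cast to bf16:", cast)
```

Anchoring on component boundaries, rather than using a plain substring test, keeps a name such as `sinks_like` from being pinned to fp32 by accident.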
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v4

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested review from Cyrilvallez and vasqu May 25, 2026 10:21
Member

@Cyrilvallez Cyrilvallez left a comment

All right

Contributor

@vasqu vasqu left a comment

That's quite a lot of small params being kept strictly in fp32. Is the remote code the same, or what are they doing?

Anyways, trusting you and seems reasonable to me :D

@vasqu
Contributor

vasqu commented May 25, 2026

Ah, seems like it only happens on save? Do we maybe have something silent like `nn.Parameter(..., dtype=torch.float32)`? That might cause the discrepancy between loading and saving.

@ArthurZucker
Collaborator Author

No no I think it happens on load as well, but I'll check the original parameter dtypes to be sure!
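
The question in this exchange can be explored in isolation with plain PyTorch. The sketch below uses a hypothetical `TinyBlock` module, not the deepseek_v4 code, and only shows that a parameter created in fp32 does not stay fp32 once the whole module is cast, unless something restores it afterwards. It makes no claim about where in the load or save path the real downcast occurs.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical module with one parameter explicitly created in fp32."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        # Created in fp32, but nothing here pins the dtype permanently.
        self.sinks = nn.Parameter(torch.zeros(4, dtype=torch.float32))


# Casting the whole module converts every floating-point parameter.
block = TinyBlock().to(torch.bfloat16)
after_cast = {name: p.dtype for name, p in block.named_parameters()}

# Restoring fp32 has to be done explicitly for the parameters of interest.
block.sinks.data = block.sinks.data.float()
after_restore = {name: p.dtype for name, p in block.named_parameters()}

print(after_cast)
print(after_restore)
```

An explicit keep-in-fp32 list addresses this kind of gap: the dtype chosen at construction time is a starting value that a later cast can override.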

@ArthurZucker ArthurZucker merged commit 9ded3db into main May 27, 2026
43 checks passed
@ArthurZucker ArthurZucker deleted the fix-deepseek-v4-keep-in-fp32 branch May 27, 2026 09:49
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request May 28, 2026
…ce#46198)

kashif pushed a commit to kashif/transformers that referenced this pull request Jun 1, 2026
…ce#46198)


Development

Successfully merging this pull request may close these issues.

[deepseek_v4] save_pretrained silently downcasts FP32 tensors to BF16 (hc_*, attn_sink, ffn.gate.bias, compressor.ape, indexer.compressor.ape)

4 participants