
Update modeling_llama4.py #41557

Open

alfredo-etched wants to merge 2 commits into huggingface:main from alfredo-etched:patch-1

Conversation

@alfredo-etched

Something is off: the code defines Llama4TextL2Norm, but it actually computes an RMS norm. Which one is correct for Llama 4, RMS or L2? The class should either be renamed (if RMS is intended) or have its math fixed (if L2 is intended).

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

Good point, this function is probably wrong! The main reason it wasn't visible as a bug is that Llama4TextL2Norm is only used under specific conditions:

  if self.config.use_qk_norm and self.use_rope:
      self.qk_norm = Llama4TextL2Norm(config.rms_norm_eps)

I believe use_qk_norm is False for the released Llama-4 checkpoints, and so this code is never called unless a user makes their own Llama-4 model using qk_norm. Still, might be worth fixing! cc @ArthurZucker do you know if this was in the original codebase?
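For reference, the two candidate operations can be sketched in plain Python. This is an illustrative sketch only, not the transformers implementation: the real Llama4TextL2Norm operates on torch tensors, and the function names here are my own.

```python
import math

def rms_normalize(x, eps=1e-6):
    # What the class actually computes: divide by the root-MEAN-square,
    # i.e. sqrt of the mean of squares (plus eps for stability).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def l2_normalize(x, eps=1e-6):
    # A true L2 (unit-vector) normalization: divide by the sqrt
    # of the SUM of squares instead of the mean.
    norm = math.sqrt(sum(v * v for v in x) + eps)
    return [v / norm for v in x]
```

For `x = [3.0, 4.0]`, `l2_normalize` returns roughly `[0.6, 0.8]` (unit length), while `rms_normalize` returns the same direction scaled up by `sqrt(len(x))`.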

@alfredo-etched
Author

@Rocketknight1 in the config.json for Llama 4 Scout, use_qk_norm is set to true. But if you look at the Llama4TextAttention class, it also requires use_rope to be true (line 310). And use_rope is always true, because no_rope_layers is an empty list in config.json. Therefore qk_norm is always computed for Llama 4 Scout. My question remains: is the norm supposed to be an actual L2 norm or an RMS norm?

@Rocketknight1
Member

Rocketknight1 commented Oct 15, 2025

Hi @alfredo-etched, I didn't realize that scout uses that code path - I only checked maverick, my bad!

If Scout gives good output, though, then the code here is probably correct: we would have noticed the large output difference if one norm had been swapped for the other. So I think the problem is just that the "L2Norm" class is misnamed. The code difference is small, though: if it used .sum(-1) instead of .mean(-1), I believe this would be equivalent to L2 normalization.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama4

@alfredo-etched
Author

> Hi @alfredo-etched, I didn't realize that scout uses that code path - I only checked maverick, my bad!
>
> If scout gives good output, though, then the code here is probably correct, just because of the large output difference we would have noticed if we'd swapped one norm for a different one. Therefore, I think the problem is just that the "L2Norm" class is mis-named. The difference is small, though - if they used .sum(-1) instead of .mean(-1) I believe this would be equivalent to L2 normalization.

I was hoping to get confirmation of which normalization is correct from someone at Meta who helped write this code, rather than relying on guesswork. Precisely because the difference is small, the two models might behave almost the same while only one is actually correct, and it could be hard to ascertain which by testing the model's inference output.

@Rocketknight1
Member

Hi @alfredo-etched, although the difference is small in terms of code changes, it's very large numerically! The difference between `sqrt(sum())` and `sqrt(mean())` is a factor of `sqrt(dim)` in the output, and I'm pretty sure the model would break completely if it was trained with one and inferenced with the other.
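The factor-of-`sqrt(dim)` claim is easy to check numerically. A plain-Python sketch, not the model code (the variable names are mine):

```python
import math

dim = 128
x = [0.5] * dim  # any nonzero vector; a constant one keeps the arithmetic obvious

l2 = math.sqrt(sum(v * v for v in x))         # sqrt(sum(x**2))
rms = math.sqrt(sum(v * v for v in x) / dim)  # sqrt(mean(x**2))

# The two disagree by exactly sqrt(dim), so swapping one for the other at
# inference time would rescale every normalized query/key by that factor.
assert abs(l2 / rms - math.sqrt(dim)) < 1e-9
```

For a 128-dimensional head that is a rescaling of more than 11x, which is why a model trained with one norm would produce garbage if run with the other.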

@Rocketknight1
Member

Feel free to test it, though: change Llama4TextL2Norm to be a true L2 norm and see what effect that has on Scout's output.
