
Update modeling_llama4.py #41557

Open

alfredo-etched wants to merge 2 commits into huggingface:main from alfredo-etched:patch-1

Conversation

@alfredo-etched

Something is off: the code defines Llama4TextL2Norm, but it actually computes an RMS norm. Which one is correct for Llama 4, RMS or L2? The class should either be renamed (if RMS is intended) or have its math fixed (if L2 is intended).

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

Good point, this function is probably wrong! The main reason it wasn't visible as a bug is that Llama4TextL2Norm is only used under specific conditions:

  if self.config.use_qk_norm and self.use_rope:
      self.qk_norm = Llama4TextL2Norm(config.rms_norm_eps)

I believe use_qk_norm is False for the released Llama-4 checkpoints, and so this code is never called unless a user makes their own Llama-4 model using qk_norm. Still, might be worth fixing! cc @ArthurZucker do you know if this was in the original codebase?
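For reference, the two candidate operations can be sketched in plain Python. This is an illustrative sketch only, not the transformers implementation: the real Llama4TextL2Norm operates on torch tensors, and the function names here are my own.

```python
import math

def rms_normalize(x, eps=1e-6):
    # What the class actually computes: divide by the root-MEAN-square,
    # i.e. sqrt of the mean of squares (plus eps for stability).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def l2_normalize(x, eps=1e-6):
    # A true L2 (unit-vector) normalization: divide by the sqrt
    # of the SUM of squares instead of the mean.
    norm = math.sqrt(sum(v * v for v in x) + eps)
    return [v / norm for v in x]
```

For `x = [3.0, 4.0]`, `l2_normalize` returns roughly `[0.6, 0.8]` (unit length), while `rms_normalize` returns the same direction scaled up by `sqrt(len(x))`.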

@alfredo-etched
Author

@Rocketknight1 in the config.json for Llama 4 Scout, use_qk_norm is set to true. But if you look at the Llama4TextAttention class, it also requires use_rope to be true (line 310). And use_rope is always true, because no_rope_layers is an empty list in config.json. Therefore qk_norm is always computed for Llama 4 Scout. My question remains: is the norm supposed to be an actual L2 norm or an RMS norm?

@Rocketknight1
Member

Rocketknight1 commented Oct 15, 2025

Hi @alfredo-etched, I didn't realize that scout uses that code path - I only checked maverick, my bad!

If Scout gives good output, though, then the code here is probably correct: we would have noticed the large output difference if one norm had been swapped for the other. So I think the problem is just that the "L2Norm" class is misnamed. The code difference is small, though: if it used .sum(-1) instead of .mean(-1), I believe this would be equivalent to L2 normalization.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama4

@alfredo-etched
Author

> Hi @alfredo-etched, I didn't realize that scout uses that code path - I only checked maverick, my bad!
>
> If scout gives good output, though, then the code here is probably correct, just because of the large output difference we would have noticed if we'd swapped one norm for a different one. Therefore, I think the problem is just that the "L2Norm" class is mis-named. The difference is small, though - if they used .sum(-1) instead of .mean(-1) I believe this would be equivalent to L2 normalization.

I was hoping to get confirmation of which normalization is correct from someone at Meta who helped write this code, rather than relying on guesswork. Precisely because the difference is small, the two models might behave almost the same while only one is actually correct, and it could be hard to ascertain which by testing the model's inference output.

@Rocketknight1
Member

Hi @alfredo-etched, although the difference is small in terms of code changes, it's very large numerically! The difference between `sqrt(sum())` and `sqrt(mean())` is a factor of `sqrt(dim)` in the output, and I'm pretty sure the model would break completely if it was trained with one and inferenced with the other.
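The factor-of-`sqrt(dim)` claim is easy to check numerically. A plain-Python sketch, not the model code (the variable names are mine):

```python
import math

dim = 128
x = [0.5] * dim  # any nonzero vector; a constant one keeps the arithmetic obvious

l2 = math.sqrt(sum(v * v for v in x))         # sqrt(sum(x**2))
rms = math.sqrt(sum(v * v for v in x) / dim)  # sqrt(mean(x**2))

# The two disagree by exactly sqrt(dim), so swapping one for the other at
# inference time would rescale every normalized query/key by that factor.
assert abs(l2 / rms - math.sqrt(dim)) < 1e-9
```

For a 128-dimensional head that is a rescaling of more than 11x, which is why a model trained with one norm would produce garbage if run with the other.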

@Rocketknight1
Member

Feel free to test it, though: change Llama4TextL2Norm to be a true L2 norm and see what effect that has on Scout's output.
