Fix unused parameters in attention layers #462
Merged
jpata merged 3 commits into jpata:main on Mar 20, 2026
Conversation
Pull request overview
Fixes a training-time bug in PreLnSelfAttentionLayer where the FFN output was computed but never applied to the residual stream, which could leave FFN/LN parameters unused and trigger stricter DDP/Ray Train unused-parameter failures.
Changes:
- Add the missing residual update `x = residual + ffn_out` after the FFN block in `PreLnSelfAttentionLayer.forward` (see the sketch below).
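For posterity, a minimal, self-contained sketch of a pre-LN self-attention layer with the corrected residual path; the layer names and dimensions here are illustrative and not copied from `mlpf.py`:

```python
import torch
import torch.nn as nn

class PreLnSelfAttentionSketch(nn.Module):
    """Illustrative pre-LN attention block showing where the residual update belongs."""

    def __init__(self, dim: int = 128, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Attention block: pre-norm, attend, add back to the residual stream.
        residual = x
        h = self.norm1(x)
        attn_out, _ = self.mha(h, h, h, need_weights=False)
        x = residual + attn_out

        # FFN block: before the fix, ffn_out was computed but never added back,
        # leaving the second LayerNorm and the FFN without gradients.
        residual = x
        ffn_out = self.ffn(self.norm2(x))
        x = residual + ffn_out  # the missing line this PR adds
        return x

# quick shape check
y = PreLnSelfAttentionSketch()(torch.randn(2, 16, 128))
assert y.shape == (2, 16, 128)
```

With the `x = residual + ffn_out` line in place, every registered parameter contributes to the output, so DDP's gradient reducer sees a gradient for each parameter bucket.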
Comments suppressed due to low confidence (1)
mlpf/model/mlpf.py:319
The `save_attention` branch can raise runtime errors for `attention_type == LINEAR`: `att_mat` is never defined in the LINEAR path, and `self.mha.in_proj_weight` doesn't exist on `LinearAttention`. Consider guarding this block with `self.attention_type != AttentionType.LINEAR` (and/or handling `LinearAttention` separately), and ensure `att_mat` is always defined before use.
```python
if not self.use_simplified_attention and self.save_attention:
    np.savez(
        open("{}/attn_{}_{}.npz".format(self.outdir, self.name, self.att_mat_idx), "wb"),
        att=att_mat,
        x=x.detach().cpu().numpy(),
        in_proj_weight=self.mha.in_proj_weight.detach().cpu().numpy(),
    )
```
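For reference, a hedged sketch of the guard the review suggests, reusing the attribute names from the snippet above (`AttentionType` is assumed to be the enum already used elsewhere in `mlpf.py`); here the LINEAR path is simply skipped rather than handled separately:

```python
# Sketch only: skip attention-matrix dumping for linear attention, where
# att_mat is never computed and self.mha has no in_proj_weight attribute.
if (
    not self.use_simplified_attention
    and self.save_attention
    and self.attention_type != AttentionType.LINEAR
):
    with open("{}/attn_{}_{}.npz".format(self.outdir, self.name, self.att_mat_idx), "wb") as f:
        np.savez(
            f,
            att=att_mat,
            x=x.detach().cpu().numpy(),
            in_proj_weight=self.mha.in_proj_weight.detach().cpu().numpy(),
        )
```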
Owner
oh, good catch! can you just run a quick before/after training, like 1h each, and post the losses for posterity?
Collaborator (Author)
Owner
That's awesome! Glad you spotted the issue!
erwulff added a commit to erwulff/particleflow that referenced this pull request on Mar 23, 2026
* fix: self-attention layer missing residual connection
* disable automatic metric logging in Comet ML
* use mlpf_config instead of args in distributed_ray.py

This PR fixes a critical bug in `mlpf/model/mlpf.py` where the output of the FFN in `PreLnSelfAttentionLayer` was computed but never added to the residual connection. As a result, the FFN block and the second LayerNorm were effectively unused parameters, causing `RuntimeError: Expected to have finished reduction...` failures during distributed training (DDP).

This was discovered when running model training with Ray Train, where the error above was encountered. It probably appears only when running with Ray Train, and not with bare DDP, because Ray Train enforces stricter checks by default.
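As a rough illustration (not code from this repository), a single-process toy module with a parameter that never reaches the loss reproduces the same class of failure under DDP's strict default; exact behavior can vary by PyTorch version:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Toy stand-in for the broken layer: `unused` is registered but never used in
# forward(), mirroring the dangling FFN/LayerNorm before this fix.
class ToyBroken(torch.nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.used = torch.nn.Linear(dim, dim)
        self.unused = torch.nn.Linear(dim, dim)  # never contributes to the loss

    def forward(self, x):
        return self.used(x)

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    # find_unused_parameters defaults to False (the strict behavior);
    # passing find_unused_parameters=True would only mask the bug, not fix it.
    model = DDP(ToyBroken())
    for step in range(2):
        # with unused parameters, the second iteration typically fails with
        # "RuntimeError: Expected to have finished reduction in the prior iteration ..."
        loss = model(torch.randn(4, 8)).sum()
        loss.backward()

    dist.destroy_process_group()
```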
TODO: