
Fix unused parameters in attention layers #462

Merged
jpata merged 3 commits into jpata:main from erwulff:self-attention-fix
Mar 20, 2026
Conversation

@erwulff
Collaborator

@erwulff erwulff commented Mar 19, 2026

This PR fixes a critical bug in mlpf/model/mlpf.py where the output of the FFN in PreLnSelfAttentionLayer was computed but never added to the residual connection.
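For context, a minimal sketch of the pre-LN pattern is below (illustrative names, not the actual mlpf.py implementation): in the buggy version the FFN sub-block computed ffn_out but returned x without adding it back, so the FFN and its LayerNorm never influenced the loss.

    import torch.nn as nn

    class PreLnSelfAttentionSketch(nn.Module):
        # Illustrative pre-LN block; layer names here are hypothetical, not the mlpf.py ones.
        def __init__(self, dim, num_heads=8, ffn_mult=4):
            super().__init__()
            self.ln1 = nn.LayerNorm(dim)
            self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(
                nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim)
            )

        def forward(self, x, key_padding_mask=None):
            # attention sub-block: pre-norm, attend, add residual
            residual = x
            h = self.ln1(x)
            attn_out, _ = self.mha(h, h, h, key_padding_mask=key_padding_mask)
            x = residual + attn_out
            # FFN sub-block: the bug computed ffn_out but returned x without the line below,
            # leaving ln2 and ffn without gradients
            residual = x
            ffn_out = self.ffn(self.ln2(x))
            x = residual + ffn_out  # the missing residual update restored by this PR
            return x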

This was discovered when running model training with Ray Train, which failed with the following error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 7: 102 103 104 105 106 107 114 115 116 117 118 119

This probably appears only when running with Ray Train, and not with bare DDP, because Ray Train enforces stricter checks by default.

The FFN block and second LayerNorm were effectively unused parameters, causing RuntimeError: Expected to have finished reduction... failures during distributed training (DDP).
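For posterity, two ways to surface this class of problem are sketched below, assuming hypothetical model, batch, compute_loss, and local_rank names: a local check for parameters whose .grad stays None after a backward pass, and the find_unused_parameters flag that the DDP error message mentions, which tolerates (and helps locate) such parameters at some overhead. The proper fix remains making every parameter contribute to the loss.

    # Local diagnostic (no DDP needed): after one backward pass, list parameters that
    # never received a gradient. `model`, `batch`, and `compute_loss` are placeholders.
    loss = compute_loss(model(batch), batch)
    loss.backward()
    unused = [n for n, p in model.named_parameters() if p.requires_grad and p.grad is None]
    print("parameters without gradients:", unused)

    # DDP workaround mentioned in the error message; `local_rank` comes from the launcher.
    import torch
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        find_unused_parameters=True,  # detect and skip unused params instead of erroring
    )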

TODO:

  • Run a quick before/after validation (1 GPU, 1 hour is enough) and post the train and val losses.

@erwulff erwulff changed the title fix: self-attention layer missing residual connection Fix Unused Parameters in Attention Layers Mar 19, 2026
@erwulff erwulff changed the title Fix Unused Parameters in Attention Layers Fix unused parameters in attention layers Mar 19, 2026
@erwulff erwulff marked this pull request as ready for review March 19, 2026 16:32
Copilot AI review requested due to automatic review settings March 19, 2026 16:32

Copilot AI left a comment


Pull request overview

Fixes a training-time bug in PreLnSelfAttentionLayer where the FFN output was computed but never applied to the residual stream, which could leave FFN/LN parameters unused and trigger stricter DDP/Ray Train unused-parameter failures.

Changes:

  • Add the missing residual update x = residual + ffn_out after the FFN block in PreLnSelfAttentionLayer.forward.
Comments suppressed due to low confidence (1)

mlpf/model/mlpf.py:319

  • save_attention branch can raise runtime errors for attention_type == LINEAR: att_mat is never defined in the LINEAR path, and self.mha.in_proj_weight doesn't exist on LinearAttention. Consider guarding this block with self.attention_type != AttentionType.LINEAR (and/or handling LinearAttention separately), and ensure att_mat is always defined before use.
        if not self.use_simplified_attention and self.save_attention:
            np.savez(
                open("{}/attn_{}_{}.npz".format(self.outdir, self.name, self.att_mat_idx), "wb"),
                att=att_mat,
                x=x.detach().cpu().numpy(),
                in_proj_weight=self.mha.in_proj_weight.detach().cpu().numpy(),
            )
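A possible shape for that guard, following the suggestion above (a sketch only; AttentionType.LINEAR and self.attention_type are taken from the review comment and not verified against the file):

        if (
            not self.use_simplified_attention
            and self.save_attention
            and self.attention_type != AttentionType.LINEAR  # LinearAttention has no att_mat / in_proj_weight
        ):
            np.savez(
                open("{}/attn_{}_{}.npz".format(self.outdir, self.name, self.att_mat_idx), "wb"),
                att=att_mat,
                x=x.detach().cpu().numpy(),
                in_proj_weight=self.mha.in_proj_weight.detach().cpu().numpy(),
            )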


Comment thread: mlpf/model/mlpf.py
@jpata
Owner

jpata commented Mar 19, 2026

oh, good catch! can you just run a quick before/after training, like 1h each, and post the losses for posterity?

@erwulff
Collaborator Author

erwulff commented Mar 20, 2026

oh, good catch! can you just run a quick before/after training, like 1h each, and post the losses for posterity?

Sure! I ran a slightly longer test than you suggested. 4h on 8xH100. Losses are slightly lower after the fix.

Figure (step_train_loss_Total vs. step): total training loss before and after the fix; after the fix in purple, before the fix in yellow.

@jpata
Owner

jpata commented Mar 20, 2026

That's awesome! Glad you spotted the issue!

@jpata jpata merged commit 69b9178 into jpata:main Mar 20, 2026
2 checks passed
erwulff added a commit to erwulff/particleflow that referenced this pull request Mar 23, 2026
* fix: self-attention layer missing residual connection

* disable automatic metric logging in Comet ML

* use mlpf_config instead of args in distributed_ray.py