Skip to content

Use attention interface in RoFormerSelfAttention#46147

Closed
HamzaDogann wants to merge 2 commits into
huggingface:mainfrom
HamzaDogann:fix/roformer-attention-interface
Closed

Use attention interface in RoFormerSelfAttention#46147
HamzaDogann wants to merge 2 commits into
huggingface:mainfrom
HamzaDogann:fix/roformer-attention-interface

Conversation

@HamzaDogann
Copy link
Copy Markdown

RoFormer's self-attention was using a hardcoded eager implementation,
making it impossible to use alternative attention backends.

This PR replaces the hardcoded computation with ALL_ATTENTION_FUNCTIONS
dispatch and adds a local eager_attention_forward as the default fallback,
preserving existing behavior while enabling flash_attention_2, sdpa,
and custom attention implementations via _attn_implementation.

Closes #46144

Replace hardcoded eager attention computation with `ALL_ATTENTION_FUNCTIONS`
dispatch, enabling users to plug in custom attention backends (e.g.
`flash_attention_2`, `sdpa`) via the standard `_attn_implementation` config
field. Adds a local `eager_attention_forward` function as the default
fallback that preserves existing behavior.

Closes huggingface#46144
Copilot AI review requested due to automatic review settings May 21, 2026 20:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates RoFormer’s self-attention to use the shared Transformers attention interface (ALL_ATTENTION_FUNCTIONS) instead of a hardcoded eager implementation, enabling alternative attention backends and custom attention implementations via config._attn_implementation.

Changes:

  • Added a local eager_attention_forward fallback and switched RoFormerSelfAttention to dispatch through ALL_ATTENTION_FUNCTIONS.get_interface(...).
  • Stored config and precomputed scaling in RoFormerSelfAttention for reuse by attention backends.

Comment thread src/transformers/models/roformer/modeling_roformer.py
Comment thread src/transformers/models/roformer/modeling_roformer.py
Comment thread src/transformers/models/roformer/modeling_roformer.py
…lags

- Pass an explicit `is_causal=False` and forward `output_attentions` to the
  attention interface, so `flash_attention_2`/`sdpa` no longer depend on a
  missing `module.is_causal` attribute (RoFormer precomputes a bidirectional
  mask, so no extra causal mask is needed).
- Mark `eager_attention_forward` as `# Copied from` BERT so `make fixup` keeps
  it in sync with the shared implementation.
- Declare `_supports_flash_attn`, `_supports_sdpa`, `_supports_flex_attn` and
  `_supports_attention_backend` on `RoFormerPreTrainedModel` so the new
  backends can actually be selected.
@github-actions
Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: roformer

@Rocketknight1
Copy link
Copy Markdown
Member

When the issue creator says

The fix is simple and I would like to create a PR for it.

This means you should not rush in and immediately vibe code a PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RoFormer attention implementation does not use attention interface

3 participants