🚨 Enable SDPA (and other attention backends) for T5 and propagate to the T5 family #47014

Open
jiqing-feng wants to merge 5 commits into huggingface:main from jiqing-feng:sdpa

@jiqing-feng jiqing-feng commented Jul 2, 2026

Contributor

What this PR does

Refactors the T5 attention stack to route through ALL_ATTENTION_FUNCTIONS,
so T5 can dispatch to sdpa / eager instead of the old eager-only path, and
propagates the change to the copied-from family.

The following models now set _supports_sdpa = True / _supports_attention_backend = True
and run test_eager_matches_sdpa_inference:

  • t5 (reference), mt5, udop, pop2piano
  • pix2struct: also required migrating the vision-tower attention
    (Pix2StructVisionAttention), a standard dense attention that was not on the
    interface; without this, SDPA is unusable for the full model.
  • umt5 — its attention is its own dense implementation (not # Copied from
    T5); migrated the same way.

switch_transformers (modular, inherits T5 attention) is regenerated so its
modeling matches the refactored T5, but it stays eager-only (_supports_sdpa
not set). It is not on the SDPA path; this PR only propagates the eager refactor to it.

Alignment applied to migrated modules: attention goes through
ALL_ATTENTION_FUNCTIONS (with eager_attention_forward fallback), self.scaling,
is_causal; the relative position bias is folded into the additive attention mask;
forwards use **kwargs: Unpack[TransformersKwargs] + @can_return_tuple /
@auto_docstring / @merge_with_config_defaults / @capture_outputs, dropping the
manual return_dict / output_attentions / output_hidden_states resolution.

🚨 Behavior changes

  • With sdpa, softmax no longer force-upcasts to fp32 as the old path did;
    numerics stay within the eager_matches_sdpa tolerance.
  • output_attentions removed from internal block/layer tuple returns; attentions
    are collected via capture_outputs.
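
To give a feel for the first behavior change, the sketch below compares a softmax that upcasts to fp32 and casts back (as the old eager path did) with a softmax run directly in bfloat16. It only illustrates the order of magnitude of the difference; the internal precision of a fused SDPA kernel is backend-dependent and is not modeled here. The tensor shapes and the score scale are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)
# Arbitrary attention scores in half precision: (batch, heads, query, key).
scores = (torch.randn(2, 4, 6, 6) * 4).to(torch.bfloat16)

# Old eager behavior: softmax in fp32, then cast back to the input dtype.
probs_upcast = torch.softmax(scores.float(), dim=-1).type_as(scores)

# No explicit upcast: softmax is computed on the bfloat16 tensor directly.
probs_native = torch.softmax(scores, dim=-1)

max_abs_diff = (probs_upcast.float() - probs_native.float()).abs().max().item()
```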

test_eager_matches_sdpa_inference is not skipped for any enabled model.

longt5 stays not-SDPA-enabled

The regular LongT5Attention (copied from T5) is on the interface, but the encoder
always runs the block-sparse LongT5LocalAttention / LongT5TransientGlobalAttention.
These operate on 5D blocked tensors (batch, num_blocks, heads, block_len, 3*block_len),
while sdpa_attention_forward assumes 4D (batch, heads, seq, dim) — they cannot go
through the interface at all. LongT5EncoderModel is purely block-sparse, so the
framework refuses to switch its attention implementation:

LongT5EncoderModel does not support setting its attention implementation
dynamically, because it does not follow the functional approach based on
AttentionInterface

Forcing _supports_sdpa = True makes the encoder-only sdpa tests fail (fp16
mean relative difference: nan). Same precedent as pegasus_x (block-sparse
encoder + regular decoder), which also sets _supports_sdpa = False.
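
The shape mismatch can be reproduced with a simplified blocking sketch. This is not the LongT5 implementation: it only splits the sequence into blocks, lets each block see itself and its two neighbours, and prints nothing beyond the resulting shapes. The sizes and variable names are made up for the example.

```python
import torch
import torch.nn.functional as F

batch, seq_len, heads, dim, block_len = 2, 12, 4, 8, 3
num_blocks = seq_len // block_len

q = torch.randn(batch, seq_len, heads, dim)
k = torch.randn(batch, seq_len, heads, dim)

# Dense attention scores are 4D: (batch, heads, seq, seq).
dense_scores = torch.einsum("bqhd,bkhd->bhqk", q, k)

# Split the sequence into blocks: (batch, num_blocks, block_len, heads, dim).
q_blocks = q.reshape(batch, num_blocks, block_len, heads, dim)
k_blocks = k.reshape(batch, num_blocks, block_len, heads, dim)

# Pad one empty block on each side of the num_blocks axis so the first and
# last blocks also have a left and right neighbour to concatenate.
k_padded = F.pad(k_blocks, (0, 0, 0, 0, 0, 0, 1, 1))
k_local = torch.cat([k_padded[:, i : i + num_blocks] for i in range(3)], dim=2)

# Blocked scores are 5D: (batch, num_blocks, heads, block_len, 3 * block_len).
local_scores = torch.einsum("...qhd,...khd->...hqk", q_blocks, k_local)
```

A kernel that expects (batch, heads, seq, dim) inputs has no slot for the extra num_blocks axis, which is why these layers stay off the interface.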

Testing

utils/check_copies.py passes; full sweep green:

tests/models/{t5,mt5,umt5,udop,pop2piano,longt5,pix2struct,switch_transformers}/test_modeling_*.py
-> 1817 passed, 1958 skipped, 18 xfailed, 6 xpassed, 16699 subtests passed, 0 failed

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng jiqing-feng changed the title from "enable t5 sdpa" to "🚨 Enable SDPA (and other attention backends) for T5 and propagate to the T5 family" Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: longt5, mt5, pix2struct, pop2piano, switch_transformers, t5, udop, umt5

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

CI recap

Dashboard: View test results in Grafana
Latest run: 28567574479:2
Result: success | Jobs: 2 | Tests: 9 | Failures: 2 | Duration: 51s

@jiqing-feng jiqing-feng marked this pull request as ready for review July 2, 2026 06:45
@Rocketknight1

Rocketknight1 commented Jul 2, 2026

Member

Attention dispatch so cc @ArthurZucker @Cyrilvallez

@vasqu vasqu self-assigned this Jul 2, 2026
@vasqu

vasqu commented Jul 2, 2026

Collaborator

Will check it out when I have time, but it's a big PR so expect delays 🙏


3 participants