🚨 Enable SDPA (and other attention backends) for T5 and propagate to the T5 family by jiqing-feng · Pull Request #47014 · huggingface/transformers

jiqing-feng · 2026-07-02T01:54:46Z

What this PR does

Refactors the T5 attention stack to route through ALL_ATTENTION_FUNCTIONS,
so T5 can dispatch to sdpa / eager instead of the old eager-only path, and
propagates the change to the copied-from family.
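
As a rough illustration of the dispatch pattern described above (not the actual transformers code), the sketch below looks up an attention callable by name and falls back to an eager implementation. `ATTENTION_FUNCTIONS`, `eager_attention`, `streaming_attention`, and `pick_attention` are hypothetical names for this example only, and the "sdpa" stand-in is a pure-Python online-softmax loop rather than a real fused kernel.

```python
import math


def eager_attention(q, k, v, scaling):
    """Reference path: softmax(q @ k^T * scaling) @ v for one head, no mask."""
    out = []
    for qi in q:
        scores = [scaling * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out


def streaming_attention(q, k, v, scaling):
    """Hypothetical stand-in for a fused backend: online softmax, one pass over keys."""
    out = []
    for qi in q:
        running_max = -math.inf
        denom = 0.0
        acc = [0.0] * len(v[0])
        for kj, vj in zip(k, v):
            s = scaling * sum(a * b for a, b in zip(qi, kj))
            new_max = max(running_max, s)
            correction = math.exp(running_max - new_max)  # 0.0 on the first key
            w = math.exp(s - new_max)
            denom = denom * correction + w
            acc = [a * correction + w * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


# Hypothetical registry; "eager" is handled as the fallback, not as an entry.
ATTENTION_FUNCTIONS = {"sdpa": streaming_attention}


def pick_attention(name):
    """Return the backend registered under `name`, or the eager fallback."""
    if name == "eager":
        return eager_attention
    return ATTENTION_FUNCTIONS[name]


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# scaling=1.0 is just an illustrative value.
eager_out = pick_attention("eager")(q, k, v, scaling=1.0)
sdpa_out = pick_attention("sdpa")(q, k, v, scaling=1.0)
max_diff = max(abs(a - b) for ra, rb in zip(eager_out, sdpa_out) for a, b in zip(ra, rb))
```

Both code paths compute the same quantity, so `max_diff` should be at floating-point noise level; this is the kind of agreement that an eager-vs-sdpa comparison test checks within a tolerance.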

The following models now set _supports_sdpa = True / _supports_attention_backend = True
and run test_eager_matches_sdpa_inference:

t5 (reference), mt5, udop, pop2piano
pix2struct — also required migrating the vision-tower attention
(Pix2StructVisionAttention), a standard dense attention that was not on the
interface, otherwise SDPA is unusable for the full model.
umt5 — its attention is its own dense implementation (not # Copied from
T5); migrated the same way.

switch_transformers (modular, inherits T5 attention) is regenerated so its
modeling matches the refactored T5, but it stays eager-only (_supports_sdpa
not set): it is not on the SDPA path, and this change only propagates the eager
refactor to it.

Alignment applied to migrated modules: attention goes through
ALL_ATTENTION_FUNCTIONS (with eager_attention_forward fallback), self.scaling,
is_causal; the relative position bias is folded into the additive attention mask;
forwards use **kwargs: Unpack[TransformersKwargs] + @can_return_tuple /
@auto_docstring / @merge_with_config_defaults / @capture_outputs, dropping the
manual return_dict / output_attentions / output_hidden_states resolution.
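
To make the "bias folded into the additive attention mask" point concrete, here is a minimal sketch on nested lists. It assumes only that both the position bias and the mask are added to the scores before softmax, so summing them first gives the same weights. The helper names and the `NEG` constant are hypothetical and do not come from the PR.

```python
import math


def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, *additive_terms):
    """Softmax over scores plus any number of same-shaped additive terms."""
    combined = [
        [s + sum(term[i][j] for term in additive_terms) for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
    return [softmax(row) for row in combined]


NEG = -1e9  # illustrative large negative value for masked positions

scores = [[0.5, 1.0, -0.3], [0.2, 0.1, 0.7]]
position_bias = [[0.0, -0.5, -1.0], [0.5, 0.0, -0.5]]
padding_mask = [[0.0, 0.0, NEG], [0.0, 0.0, NEG]]  # last key is padding

# Variant A: bias and mask passed as two separate additive terms.
separate = attention_weights(scores, position_bias, padding_mask)

# Variant B: fold the bias into the mask once, then pass one additive mask,
# which is the single-mask shape a generic attention function expects.
folded_mask = [
    [b + m for b, m in zip(bias_row, mask_row)]
    for bias_row, mask_row in zip(position_bias, padding_mask)
]
folded = attention_weights(scores, folded_mask)

max_diff = max(abs(a - b) for ra, rb in zip(separate, folded) for a, b in zip(ra, rb))
```

The practical benefit is that a backend which only accepts one additive mask argument can still apply the relative position bias without a T5-specific code path.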

🚨 Behavior changes

With sdpa, softmax no longer force-upcasts to fp32 as the old path did;
numerics stay within the eager_matches_sdpa tolerance.
output_attentions removed from internal block/layer tuple returns; attentions
are collected via capture_outputs.
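
The first behavior change can be pictured with a toy model of softmax precision. This sketch is an assumption-laden illustration, not a reproduction of either code path: Python floats stand in for the upcast computation, `struct`'s half-precision format simulates fp16 rounding, and rounding every intermediate is deliberately pessimistic (real fused kernels may accumulate in higher precision internally). All function names are hypothetical.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def softmax_upcast(row):
    """fp16 inputs, higher-precision intermediates, result cast back to fp16."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [to_fp16(e / total) for e in exps]


def softmax_fp16(row):
    """Every intermediate is rounded to fp16 (pessimistic low-precision model)."""
    top = max(row)
    exps = [to_fp16(math.exp(to_fp16(x - top))) for x in row]
    total = 0.0
    for e in exps:
        total = to_fp16(total + e)
    return [to_fp16(e / total) for e in exps]


scores = [to_fp16(s) for s in [1.5, 0.25, -2.0, 3.0]]
upcast = softmax_upcast(scores)
low_precision = softmax_fp16(scores)
max_abs_diff = max(abs(a - b) for a, b in zip(upcast, low_precision))
```

On small, well-scaled inputs like these the two results differ only in the last bits of the fp16 representation, which is consistent with the claim that the numerics stay within a comparison tolerance; it says nothing about extreme score ranges.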

test_eager_matches_sdpa_inference is not skipped for any enabled model.

longt5 stays not-SDPA-enabled

The regular LongT5Attention (copied from T5) is on the interface, but the encoder
always runs the block-sparse LongT5LocalAttention / LongT5TransientGlobalAttention.
These operate on 5D blocked tensors (batch, num_blocks, heads, block_len, 3*block_len),
while sdpa_attention_forward assumes 4D (batch, heads, seq, dim) — they cannot go
through the interface at all. LongT5EncoderModel is purely block-sparse, so the
framework refuses to switch its attention implementation:

LongT5EncoderModel does not support setting its attention implementation
dynamically, because it does not follow the functional approach based on
AttentionInterface

Forcing _supports_sdpa = True makes the encoder-only sdpa tests fail (fp16
mean relative difference: nan). Same precedent as pegasus_x (block-sparse
encoder + regular decoder), which also sets _supports_sdpa = False.
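
The layout mismatch can be sketched with plain shape tuples. The 5D and 4D layouts are taken from the description above; the `num_blocks = ceil(seq_len / block_len)` rule and both helper names are assumptions made for this example.

```python
import math


def blocked_score_shape(batch, heads, seq_len, block_len):
    """5D layout of block-sparse scores as described in the PR text.

    Assumes the sequence is padded up to a whole number of blocks.
    """
    num_blocks = math.ceil(seq_len / block_len)
    return (batch, num_blocks, heads, block_len, 3 * block_len)


def require_dense_layout(shape):
    """Accept only a 4D (batch, heads, seq, dim) layout, as a dense backend would."""
    if len(shape) != 4:
        raise ValueError(f"expected 4D (batch, heads, seq, dim), got {len(shape)}D {shape}")
    return shape


dense = require_dense_layout((2, 8, 1000, 64))
blocked = blocked_score_shape(batch=2, heads=8, seq_len=1000, block_len=128)

try:
    require_dense_layout(blocked)
    rejected = False
except ValueError:
    rejected = True
```

A blocked tensor carries an extra `num_blocks` axis and a `3 * block_len` key window per block, so there is no single reshape that turns it into the dense layout without changing which keys each query attends to.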

Testing

utils/check_copies.py passes; full sweep green:

tests/models/{t5,mt5,umt5,udop,pop2piano,longt5,pix2struct,switch_transformers}/test_modeling_*.py
-> 1817 passed, 1958 skipped, 18 xfailed, 6 xpassed, 16699 subtests passed, 0 failed

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

github-actions · 2026-07-02T06:03:44Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: longt5, mt5, pix2struct, pop2piano, switch_transformers, t5, udop, umt5

github-actions · 2026-07-02T06:14:13Z

CI recap

Dashboard: View test results in Grafana
Latest run: 28567574479:2
Result: success | Jobs: 2 | Tests: 9 | Failures: 2 | Duration: 51s

Rocketknight1 · 2026-07-02T11:52:08Z

Attention dispatch so cc @ArthurZucker @Cyrilvallez

vasqu · 2026-07-02T14:07:43Z

Will check it out when I have time, but it's a big PR so expect delays 🙏

jiqing-feng added 5 commits July 1, 2026 02:03

enable t5 sdpa

42d8877

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

update

2b821a2

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

update

a172063

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

update

e76fc48

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix format

0109081

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

jiqing-feng changed the title ~~enable t5 sdpa~~ 🚨 Enable SDPA (and other attention backends) for T5 and propagate to the T5 family Jul 2, 2026

jiqing-feng marked this pull request as ready for review July 2, 2026 06:45

github-actions Bot requested review from ArthurZucker and zucchini-nlp July 2, 2026 06:45

vasqu self-assigned this Jul 2, 2026

jiqing-feng wants to merge 5 commits into huggingface:main from jiqing-feng:sdpa

3 participants
