Conversation

vasqu
Contributor

@vasqu vasqu commented May 22, 2025

Keeping track of the models that are done:

  • Bert
  • Roberta
  • Albert
  • Data2VecText
  • Xmod
  • Electra
  • XLM Roberta
  • Roberta Prelayernorm
  • Ernie
  • Camembert
  • Bert Generation
  • XLM Roberta XL
  • RoCBert
  • Mobile Bert

Up for discussion:

  • Flash attention is flaky; I suspect the norm layers are responsible (I encountered similar behavior with gemma3)

Would need another round; questionable whether it's worth it (ordered by priority):

  • Tapas
  • Vision x Text models, e.g.
    • Bridgetower (text would be good to go, would need image counterpart)
    • Altclip
    • ...
  • RemBert
  • Megatron Bert

@vasqu vasqu changed the title 🔴[Atttention] Bert-based Models Attention Refactor 🔴[Attention] Bert-based Models Attention Refactor May 22, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu
Contributor Author

vasqu commented Jun 30, 2025

run-slow: bert

This comment contains run-slow, running the specified jobs:

models: ['models/bert']
quantizations: [] ...

Collaborator

@ArthurZucker ArthurZucker left a comment

Nice! I am asking a lot, but I believe this will help unbloat our code even more!

@vasqu vasqu marked this pull request as ready for review September 17, 2025 13:55
Collaborator

@ArthurZucker ArthurZucker left a comment

Let's GOOOOOO


 if position_ids is None:
-    position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]
+    position_ids = self.position_ids[:, :seq_length]
Collaborator

That's because we expect the position ids to be correct, right?

Contributor Author

Yes, either:

  • In base transformers fashion, we expect the correct positions or use this as a fallback (it will always work with right padding); a rough sketch of the fallback is below.
  • In vLLM, we expect the correct position ids. This is also covered by the padding vs padding-free test. I also informed Harry about this requirement.
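
A minimal sketch of that fallback behavior, assuming a hypothetical helper and buffer name (this is not the exact code in the PR): without explicit position_ids, positions simply count up from 0 for the current sequence, which is only guaranteed to be correct with right padding or no cache offset.

    import torch

    def resolve_position_ids(position_ids, position_ids_buffer, seq_length):
        # Hypothetical helper: trust caller-provided positions as-is.
        if position_ids is not None:
            return position_ids  # e.g. vLLM is expected to pass these explicitly
        # Fallback: positions 0..seq_length-1, correct only with right padding
        # (or no cache offset), hence the "be strict" expectation on callers.
        return position_ids_buffer[:, :seq_length]

    # Usage sketch:
    buffer = torch.arange(512).unsqueeze(0)   # analogous to self.position_ids
    ids = resolve_position_ids(None, buffer, seq_length=7)  # tensor([[0, 1, ..., 6]])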

@ArthurZucker
Collaborator

run-slow: bert, auto, bart, roberta

This comment contains run-slow, running the specified jobs:

models: ['models/auto', 'models/bart', 'models/bert', 'models/roberta']
quantizations: [] ...

@vasqu
Contributor Author

vasqu commented Sep 18, 2025

run-slow: bert, auto, bart, roberta

This comment contains run-slow, running the specified jobs:

models: ['models/auto', 'models/bart', 'models/bert', 'models/roberta']
quantizations: [] ...

@vasqu
Contributor Author

vasqu commented Sep 18, 2025

The same tests fail on main (examples_torch), and some FA tests fail, but those are known to be flaky. Otherwise, looks good!

@ArthurZucker ArthurZucker merged commit 155f7e2 into main Sep 19, 2025
21 of 25 checks passed
@ArthurZucker ArthurZucker deleted the vas-bert-attn-refactors branch September 19, 2025 09:24
Member

@Cyrilvallez Cyrilvallez left a comment

Was just checking the PR for other things and noticed what is probably a typing issue!

Comment on lines -1094 to +1102
-    past_key_values: Optional[Cache] = None,
+    past_key_values: Optional[Union[list[torch.FloatTensor], Cache]] = None,
Member

This is a typo, no, @vasqu? I don't think we can ever have a list here.

Contributor Author

It's for legacy caches (the old typing we had there). It will be removed in v5, when cache classes are finally mandatory!
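
For context, a hedged sketch of how the legacy format is typically bridged to the Cache classes, assuming the current transformers cache_utils API (the helper name is made up; this is not the code in this PR):

    from transformers.cache_utils import Cache, DynamicCache

    def ensure_cache(past_key_values):
        # Hypothetical helper: legacy caches are a list/tuple of (key, value)
        # tensor pairs per layer; new-style caches are Cache instances.
        if past_key_values is None or isinstance(past_key_values, Cache):
            return past_key_values
        # Convert the legacy format; this path goes away in v5 once Cache
        # classes are mandatory.
        return DynamicCache.from_legacy_cache(past_key_values)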

ErfanBaghaei pushed a commit to ErfanBaghaei/transformers that referenced this pull request Sep 25, 2025
* clean start to bert refactor

* some test fixes

* style

* fix last tests

* be strict on positional embeddings, fixup according tests

* cache support

* more cache fixes, new causal API

* simplify masks, fix tests for gen

* flex attn, static cache support, round of fixes

* ?

* this time

* style

* fix flash attention tests, flex attention requires torch 2.7.x to work with multiple classes (as recompile strats force a size call which is wrongly interpreted before)

* roberta

* fixup sdpa remains

* attention split, simplify args and kwargs, better typing

* fix encoder decoder

* fix test

* modular roberta

* albert

* data2vectext, making it modular tomorrow

* modular data2vec text

* tmp disable

* xmod + cache position fixes

* whoops

* electra + markuplm, small fixes

* remove wrong copy

* xlm_roberta + some embedding fixes

* roberta prelayernorm

* RemBert: remove copy, maybe doing it later

* ernie

* fix roberta offloading

* camembert

* copy fixes

* bert generation + fixes on eager

* xlm roberta xl

* bridgetower (text) + seamlessv2 copy fixes

* rocbert + small fixes

* whoops

* small round of fixups

* NOTE: kernels didnt load with an earlier version, some fixup (needs another look bc cross deps)

* the end of the tunnel?

* fixup nllbmoe + style

* we dont need this anymore

* megatron bert is barely used, low prio skip for now

* Modernize bert (template for others)

NOTE: trying to push this through, might be overdue if not in time possible

* check inputs for all others (if checkmarked)

* fix bridgetower

* style

* fix encoder decoder (partially but cause found and fix also, just needs to be done for everything else)

* proper fix for bert to force intermediate dict outputs

* propagate to others

* style

* xlm roberta xl investigation, its the layernorm...

* mobile bert

* revert this, might cause issues with composed models

* review

* style
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request Oct 4, 2025