
Remove cache_position in more models (2) #44602

Merged
@Cyrilvallez merged 14 commits into main from more-cache-pos on Mar 12, 2026
Conversation

@Cyrilvallez (Member) commented Mar 11, 2026

What does this PR do?

As per the title. Follow up of #44330

Also taking the opportunity to simplify t5 and its children, because the way they computed position_bias was super convoluted/overcomplicated and required several additional unnecessary arguments.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@huggingface huggingface deleted a comment from github-actions bot Mar 11, 2026
@Cyrilvallez (Member, Author) commented:
run-slow: bridgetower mt5 qwen2_5_vl xlm_roberta audioflamingo3 data2vec_text gpt2 qwen2_5_omni roberta smolvlm camembert gpt_neo paddleocr_vl pix2struct idefics3 llama4 bark mllama bert_generation qwen2_vl whisper xlm_roberta_xl electra esm clip bert decision_transformer idefics2 pop2piano t5 blt blip_text idefics udop bloom roc_bert roberta_prelayernorm autoformer qwen2_audio ernie switch_transformers longt5 xmod

@huggingface huggingface deleted a comment from github-actions bot Mar 11, 2026
@github-actions (Contributor) commented:
Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/autoformer", "models/bark", "models/bert", "models/bert_generation", "models/bloom", "models/blt", "models/bridgetower", "models/camembert", "models/clip", "models/decision_transformer", "models/electra", "models/ernie", "models/esm", "models/gpt2", "models/gpt_neo", "models/idefics", "models/idefics2", "models/idefics3", "models/llama4", "models/longt5", "models/mllama", "models/mt5", "models/paddleocr_vl", "models/pix2struct", "models/pop2piano", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen2_audio", "models/qwen2_vl", "models/roberta", "models/roberta_prelayernorm", "models/roc_bert", "models/smolvlm", "models/switch_transformers", "models/t5", "models/udop", "models/whisper", "models/xlm_roberta", "models/xlm_roberta_xl", "models/xmod"]
quantizations: []

@github-actions (Contributor) commented:

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit   | Description
RUN     | d5c33a0e | workflow commit (merge commit)
PR      | 17fd0942 | branch commit (from PR)
main    | 852f785a | base commit (on main)

Model CI Report

2 new failed tests from this PR 😭

  • idefics2:
    tests/models/idefics2/test_modeling_idefics2.py::Idefics2ForConditionalGenerationIntegrationTest::test_integration_test_4bit (❌ ⟹ ❌)
    tests/models/idefics2/test_modeling_idefics2.py::Idefics2ForConditionalGenerationIntegrationTest::test_integration_test_4bit_batch2 (❌ ⟹ ❌)

@Cyrilvallez (Member, Author) commented:

The 2 failed idefics2 tests are false positives: they both pass locally, on this PR and on main.
So there are no new failures at all!

@huggingface huggingface deleted a comment from github-actions bot Mar 12, 2026
@huggingface huggingface deleted a comment from github-actions bot Mar 12, 2026
@huggingface huggingface deleted a comment from github-actions bot Mar 12, 2026
@vasqu (Contributor) left a comment:

Just a few smaller comments, looks overall good, thanks

Probably needs to resolve some merge conflicts because of the capturing PR I merged, sorry 😬

return relative_buckets

-    def compute_bias(self, query_length, key_length, device=None, cache_position=None):
+    def compute_bias(self, query_length, key_length, device=None, past_seen_tokens=0):
@vasqu (Contributor):

That's the cleanup I suppose

@Cyrilvallez (Member, Author):

Well, that and all the weird stuff around real_seq_length, passing a query_length argument to the Attention's forward that was not the actual query length, etc.
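For readers following along, the effect of the new signature can be sketched in isolation. This is an illustrative standalone snippet, not the actual t5 code (the helper name relative_positions is mine): with past_seen_tokens, the absolute positions of the current queries are simply offset by the cache length, so no cache_position tensor or fake query_length needs to be threaded through the attention layers.

```python
import torch

def relative_positions(query_length: int, key_length: int, past_seen_tokens: int = 0) -> torch.Tensor:
    """Relative (key - query) offsets that a T5-style bias bucketing consumes.

    Sketch only; the real model additionally maps these offsets to buckets
    and looks them up in a learned embedding to get the position bias.
    """
    # Queries start right after the cached tokens, so during cached decoding
    # (query_length == 1) the single query sits at position past_seen_tokens.
    context_position = torch.arange(past_seen_tokens, past_seen_tokens + query_length)[:, None]
    # Keys cover the whole sequence seen so far, cache included.
    memory_position = torch.arange(key_length)[None, :]
    # Shape (query_length, key_length): offset of each key from each query.
    return memory_position - context_position
```

During prefill this reduces to the usual full offset matrix; during decoding with 4 cached tokens, relative_positions(1, 5, past_seen_tokens=4) yields [[-4, -3, -2, -1, 0]], the same offsets the old cache_position-based path produced.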

class LongT5Block(GradientCheckpointingLayer):
    def __init__(self, config, has_relative_attention_bias=False, layer_idx: int | None = None):
        super().__init__()
        self.layer_idx = layer_idx
@vasqu (Contributor):

Interesting, was this not needed before?

@Cyrilvallez (Member, Author):

Wrong leftover! Good catch!

@github-actions (Contributor) commented:

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, autoformer, bark, bert, bert_generation, blip, bloom, blt, bridgetower, camembert, clip, data2vec, decision_transformer, electra, ernie, ernie4_5_vl_moe

@Cyrilvallez merged commit ca960f0 into main on Mar 12, 2026
29 checks passed
@Cyrilvallez deleted the more-cache-pos branch on March 12, 2026 at 22:38