Remove cache_position in more models (2)#44602
Conversation
run-slow: bridgetower mt5 qwen2_5_vl xlm_roberta audioflamingo3 data2vec_text gpt2 qwen2_5_omni roberta smolvlm camembert gpt_neo paddleocr_vl pix2struct idefics3 llama4 bark mllama bert_generation qwen2_vl whisper xlm_roberta_xl electra esm clip bert decision_transformer idefics2 pop2piano t5 blt blip_text idefics udop bloom roc_bert roberta_prelayernorm autoformer qwen2_audio ernie switch_transformers longt5 xmod
CI Results
Model CI Report: ❌ 2 new failed tests from this PR 😭
The 2 failed idefics2 tests are false positives: they both pass locally, on this PR and on main.
vasqu left a comment:
Just a few smaller comments; overall this looks good, thanks!
You'll probably need to resolve some merge conflicts because of the capturing PR I merged, sorry 😬
```diff
     return relative_buckets

-def compute_bias(self, query_length, key_length, device=None, cache_position=None):
+def compute_bias(self, query_length, key_length, device=None, past_seen_tokens=0):
```
That's the cleanup, I suppose.
Well, that and all the weird stuff around real_seq_length, passing a query_length to the Attention's forward that is not the actual query length, etc.
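The scalar offset works because, with a cache, the only thing the bias computation needs to know is where the current query positions start. A minimal, hypothetical sketch of the indexing idea (the helper name and signature are illustrative, not the actual transformers code):

```python
def relative_positions(query_length, key_length, past_seen_tokens=0):
    """Illustrative only: relative distances between cached-decoding query
    positions and key positions, T5-style.

    With a cache, the new query positions simply start at `past_seen_tokens`,
    so a scalar offset can replace a full `cache_position` tensor.
    """
    return [
        [k - (past_seen_tokens + q) for k in range(key_length)]
        for q in range(query_length)
    ]

# Decoding one new token with 4 tokens already in the cache: the single
# query row sees keys at distances -4..0.
print(relative_positions(1, 5, past_seen_tokens=4))  # [[-4, -3, -2, -1, 0]]
```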
```python
class LongT5Block(GradientCheckpointingLayer):
    def __init__(self, config, has_relative_attention_bias=False, layer_idx: int | None = None):
        super().__init__()
        self.layer_idx = layer_idx
```
Interesting, was this not needed before?
Wrong leftover! Good catch!
[For maintainers] Suggested jobs to run (before merge) run-slow: audioflamingo3, autoformer, bark, bert, bert_generation, blip, bloom, blt, bridgetower, camembert, clip, data2vec, decision_transformer, electra, ernie, ernie4_5_vl_moe
What does this PR do?
As per the title. Follow-up of #44330.
It also takes the opportunity to simplify t5 and its children, because the way they computed position_bias was super convoluted and required several additional, unnecessary arguments.
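For context, T5-style models map those relative distances into a fixed number of buckets (exact distances near zero, log-spaced ones further out) before looking up a learned bias per bucket. A simplified scalar sketch of that bucketing idea, illustrative only and not the actual transformers implementation:

```python
import math

def relative_position_bucket(relative_position, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Scalar sketch of T5-style relative position bucketing.

    Half the buckets (in the bidirectional case) go to positive distances;
    small distances get one bucket each, larger ones share log-spaced buckets.
    """
    bucket = 0
    if bidirectional:
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        relative_position = abs(relative_position)
    else:
        relative_position = -min(relative_position, 0)
    max_exact = num_buckets // 2
    if relative_position < max_exact:
        # Small distances map one-to-one onto buckets.
        bucket += relative_position
    else:
        # Larger distances are compressed logarithmically, capped at the
        # last available bucket.
        large = max_exact + int(
            math.log(relative_position / max_exact)
            / math.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        )
        bucket += min(large, num_buckets - 1)
    return bucket
```

Note that nothing here depends on per-token cache positions: once the relative distance is known, bucketing is a pure function of that distance, which is why a scalar past-length offset suffices.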