
🚨 Generation cache preparation #43679

Merged
zucchini-nlp merged 11 commits into huggingface:main from zucchini-nlp:cache-prefill-chunk on Feb 4, 2026

Conversation

@zucchini-nlp (Member) commented on Feb 2, 2026

What does this PR do?

I also want to see if the linear cache change can be squeezed into this PR. If it requires a big diff, I'll split it into two PRs.

Fixes #43673

Sidenote: kinda breaking, but in a good way. Previously, models initialized their cache in model.forward without passing the config, which meant sliding windows weren't always respected (e.g. Afmoe or remote models). From now on, the sliding window is always respected if it is set in the config.
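
To make the sidenote concrete, here is a minimal, hypothetical sketch (toy names, not the actual transformers API) of the behavioral difference: when the config is passed at cache-creation time, a sliding window declared in it bounds the cache length; when it is not passed, the window is silently ignored.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    # Hypothetical stand-in for a model config that declares a sliding window.
    sliding_window: Optional[int] = None


def prepare_cache(max_cache_len: int, config: Optional[ToyConfig] = None) -> int:
    # Old behavior (sketch): the cache was created inside model.forward without the
    # config, so config.sliding_window never reached it and the cache grew to the
    # full sequence length.
    # New behavior (sketch): the config is passed in, so the sliding window bounds
    # the cache length whenever it is set.
    if config is not None and config.sliding_window is not None:
        return min(max_cache_len, config.sliding_window)
    return max_cache_len


assert prepare_cache(4096) == 4096                 # no config: window ignored
assert prepare_cache(4096, ToyConfig(128)) == 128  # config passed: window respected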

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines -395 to -399
# 10. Prefill
model_inputs.update({"output_attentions": generation_config.output_attentions})
model_inputs.update({"output_hidden_states": generation_config.output_hidden_states})
outputs = self(**model_inputs, return_dict=True)

Member Author

The model was running prefill twice because self._sample also calls prefill. That caused the cache built by the first prefill to be flushed, and the second prefill to start over.
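
A toy reproduction of the double-prefill issue (hypothetical names, not the real GenerationMixin internals): because the sampling step runs prefill itself, an extra prefill beforehand only fills a cache that is immediately flushed and rebuilt.

class ToyGenerator:
    # Hypothetical stand-in for the generation flow, for illustration only.
    def __init__(self):
        self.cache = None
        self.prefill_calls = 0

    def prefill(self, prompt_ids):
        # Every prefill starts from an empty cache, discarding whatever was there.
        self.prefill_calls += 1
        self.cache = list(prompt_ids)

    def sample(self, prompt_ids):
        # The sampling loop runs prefill on its own ...
        self.prefill(prompt_ids)
        return self.cache


gen = ToyGenerator()
gen.prefill([1, 2, 3])          # ... so this earlier prefill (the removed lines) is wasted work
gen.sample([1, 2, 3])
assert gen.prefill_calls == 2   # prompt processed twice before the fix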

logits_processor=prepared_logits_processor,
stopping_criteria=prepared_stopping_criteria,
generation_config=generation_config,
prefill_outputs=outputs,
Member Author

prefill_outputs is not an expected argument for self._sample

parent,
batch_size=4,
seq_length=128,
seq_length=12,
Member Author

The model has a sliding window of 128, so when generating, the cache is cropped to a max of 128. We would need to either override many generation tests to match the expected length, or use a smaller seq length (illustrated below).

Contributor

Totally fine, this is absurdly high so better to reduce
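
For context, a toy calculation (not the actual test code) of why seq_length=128 collides with the model's sliding window of 128: the cache length saturates at the window, so generic assertions of the form cache_len == input_len + new_tokens stop holding, while seq_length=12 stays well below the window.

from typing import Optional


def expected_cache_len(input_len: int, new_tokens: int, sliding_window: Optional[int]) -> int:
    # With a sliding window, the cache is cropped to at most sliding_window entries.
    total = input_len + new_tokens
    return total if sliding_window is None else min(total, sliding_window)


assert expected_cache_len(128, 20, sliding_window=128) == 128  # saturated: 148 != 128
assert expected_cache_len(12, 20, sliding_window=128) == 32    # below the window: usual expectation holds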

Comment on lines -1380 to 1383
super().__init__(config.get_text_config())
super().__init__(config)
self.text_config = config.get_text_config()
Member Author

It's not good practice for self.config to be only the decoder sub-config.
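
A minimal sketch of the pattern the diff switches to (the class name is illustrative; get_text_config mirrors the composite-config API): the wrapper keeps the full composite config as self.config and holds the decoder sub-config under a separate attribute.

class ToyConditionalGeneration:
    # Hypothetical composite (e.g. vision + text) model wrapper, for illustration only.
    def __init__(self, config):
        # self.config stays the full composite config ...
        self.config = config
        # ... while the decoder/text sub-config is kept separately, instead of
        # being passed to the parent constructor as if it were the whole config.
        self.text_config = config.get_text_config()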

Comment on lines -1861 to +1863
def _get_cache(self, cache_implementation: str, batch_size: int, max_cache_len: int, model_kwargs) -> Cache:
def _prepare_static_cache(
self, cache_implementation: str, batch_size: int, max_cache_len: int, model_kwargs
) -> Cache:
Member Author

Just a naming change; imo this is more descriptive.

@zucchini-nlp requested a review from @vasqu on February 2, 2026 at 15:58
@vasqu (Contributor) left a comment

LGTM, added a few smaller comments but nothing major

Let's run slow tests for the special models (blt, dia, kyutai) plus the general special ones (mamba1/2, bamba, etc.).

Comment on lines 2014 to 2024
if backend == "quanto" and not is_optimum_quanto_available():
raise ImportError(
"You need to install optimum-quanto in order to use KV cache quantization with optimum-quanto "
"backend. Please install it via with `pip install optimum-quanto`"
)
elif backend == "HQQ" and not is_hqq_available():
raise ImportError(
"You need to install `HQQ` in order to use KV cache quantization with HQQ backend. "
"Please install it via with `pip install hqq`"
)
model_kwargs["past_key_values"] = QuantizedCache(backend=backend, **cache_config)
Contributor

Imo, this error should be raised within the constructor of the cache - we should not have to check this ourselves here

Member Author

Actually, it is checked inside the cache class as well. I don't remember exactly what the reason was for also checking it here; I'll git blame and see.
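
A minimal sketch of the reviewer's suggestion (toy class and helper, not the real QuantizedCache): when the backend availability check lives in the cache constructor, call sites like the excerpt above don't need to duplicate the ImportError logic.

from importlib.metadata import PackageNotFoundError, version


def _is_installed(dist_name: str) -> bool:
    # Lightweight availability probe using only the standard library.
    try:
        version(dist_name)
        return True
    except PackageNotFoundError:
        return False


class ToyQuantizedCache:
    # Hypothetical stand-in for a quantized KV cache, for illustration only.
    _REQUIRED_PACKAGE = {"quanto": "optimum-quanto", "hqq": "hqq"}

    def __init__(self, backend: str, **cache_config):
        package = self._REQUIRED_PACKAGE.get(backend.lower())
        if package is None:
            raise ValueError(f"Unknown quantization backend: {backend!r}")
        if not _is_installed(package):
            # Raising here keeps the check in one place instead of at every call site.
            raise ImportError(f"Backend {backend!r} requires `pip install {package}`.")
        self.backend = backend
        self.cache_config = cache_config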


@zucchini-nlp changed the title from "Generation cache preparation" to "🚨 Generation cache preparation" on Feb 3, 2026
@zucchini-nlp (Member Author)

Remark: kinda breaking, but in a good way. Previously, models initialized their cache in model.forward without passing the config, which meant sliding windows weren't always respected (e.g. Afmoe). From now on, the sliding window is always respected if it is set in the config.

Added a 🚨

@zucchini-nlp enabled auto-merge (squash) on February 3, 2026 at 12:01
github-actions bot (Contributor) commented on Feb 4, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, blt, dia, janus, kyutai_speech_to_text

@zucchini-nlp (Member Author)

@vasqu can you force merge? CI is still super flaky and I've already retriggered 5-6 times 😢

@zucchini-nlp merged commit 1687954 into huggingface:main on Feb 4, 2026
25 checks passed

Successfully merging this pull request may close these issues: GenerationMixin cache missing in v5.0.0 during chunked_prefill