
🚨 Generation cache preparation #43679

Merged
zucchini-nlp merged 11 commits into huggingface:main from zucchini-nlp:cache-prefill-chunk on Feb 4, 2026

Conversation

@zucchini-nlp (Member) commented on Feb 2, 2026

What does this PR do?

I also want to see if the linear cache change can be squeezed into this PR. If it requires a big diff, I'll split it into two PRs.

Fixes #43673

Sidenote: kinda breaking, but in a good way. Previously, models initialized their cache in model.forward without passing the config, which meant sliding windows weren't always respected (e.g. Afmoe or remote models). From now on, the sliding window is always respected if it is set in the config.
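
To make the sidenote concrete, here is a minimal, hypothetical sketch (toy names, not the actual transformers API) of the behavioral difference: when the config is passed at cache-creation time, a sliding window declared in it bounds the cache length; when it is not passed, the window is silently ignored.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    # Hypothetical stand-in for a model config that declares a sliding window.
    sliding_window: Optional[int] = None


def prepare_cache(max_cache_len: int, config: Optional[ToyConfig] = None) -> int:
    # Old behavior (sketch): the cache was created inside model.forward without the
    # config, so config.sliding_window never reached it and the cache grew to the
    # full sequence length.
    # New behavior (sketch): the config is passed in, so the sliding window bounds
    # the cache length whenever it is set.
    if config is not None and config.sliding_window is not None:
        return min(max_cache_len, config.sliding_window)
    return max_cache_len


assert prepare_cache(4096) == 4096                 # no config: window ignored
assert prepare_cache(4096, ToyConfig(128)) == 128  # config passed: window respected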

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines -395 to -399
# 10. Prefill
model_inputs.update({"output_attentions": generation_config.output_attentions})
model_inputs.update({"output_hidden_states": generation_config.output_hidden_states})
outputs = self(**model_inputs, return_dict=True)

Member Author

The model was running prefill twice because self._sample also calls prefill. That caused the cache built by the first prefill to be flushed, and the second prefill to start over.
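
A toy reproduction of the double-prefill issue (hypothetical names, not the real GenerationMixin internals): because the sampling step runs prefill itself, an extra prefill beforehand only fills a cache that is immediately flushed and rebuilt.

class ToyGenerator:
    # Hypothetical stand-in for the generation flow, for illustration only.
    def __init__(self):
        self.cache = None
        self.prefill_calls = 0

    def prefill(self, prompt_ids):
        # Every prefill starts from an empty cache, discarding whatever was there.
        self.prefill_calls += 1
        self.cache = list(prompt_ids)

    def sample(self, prompt_ids):
        # The sampling loop runs prefill on its own ...
        self.prefill(prompt_ids)
        return self.cache


gen = ToyGenerator()
gen.prefill([1, 2, 3])          # ... so this earlier prefill (the removed lines) is wasted work
gen.sample([1, 2, 3])
assert gen.prefill_calls == 2   # prompt processed twice before the fix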

logits_processor=prepared_logits_processor,
stopping_criteria=prepared_stopping_criteria,
generation_config=generation_config,
prefill_outputs=outputs,
Member Author

prefill_outputs is not an expected argument for self._sample

parent,
batch_size=4,
seq_length=128,
seq_length=12,
Member Author

The model has a sliding window of 128, so when generating, the cache is cropped to a max of 128. We would need to either override many generation tests to match the expected length, or use a smaller seq length (illustrated below).

Contributor

Totally fine, this is absurdly high so better to reduce
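
For context, a toy calculation (not the actual test code) of why seq_length=128 collides with the model's sliding window of 128: the cache length saturates at the window, so generic assertions of the form cache_len == input_len + new_tokens stop holding, while seq_length=12 stays well below the window.

from typing import Optional


def expected_cache_len(input_len: int, new_tokens: int, sliding_window: Optional[int]) -> int:
    # With a sliding window, the cache is cropped to at most sliding_window entries.
    total = input_len + new_tokens
    return total if sliding_window is None else min(total, sliding_window)


assert expected_cache_len(128, 20, sliding_window=128) == 128  # saturated: 148 != 128
assert expected_cache_len(12, 20, sliding_window=128) == 32    # below the window: usual expectation holds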

Comment on lines -1380 to 1383
super().__init__(config.get_text_config())
super().__init__(config)
self.text_config = config.get_text_config()
Member Author

It's not good practice for self.config to be only the decoder sub-config.
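
A minimal sketch of the pattern the diff switches to (the class name is illustrative; get_text_config mirrors the composite-config API): the wrapper keeps the full composite config as self.config and holds the decoder sub-config under a separate attribute.

class ToyConditionalGeneration:
    # Hypothetical composite (e.g. vision + text) model wrapper, for illustration only.
    def __init__(self, config):
        # self.config stays the full composite config ...
        self.config = config
        # ... while the decoder/text sub-config is kept separately, instead of
        # being passed to the parent constructor as if it were the whole config.
        self.text_config = config.get_text_config()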

Comment on lines -1861 to +1863
def _get_cache(self, cache_implementation: str, batch_size: int, max_cache_len: int, model_kwargs) -> Cache:
def _prepare_static_cache(
self, cache_implementation: str, batch_size: int, max_cache_len: int, model_kwargs
) -> Cache:
Member Author

Just a naming change; imo this is more descriptive.

@zucchini-nlp requested a review from @vasqu on February 2, 2026 at 15:58
@vasqu (Contributor) left a comment

LGTM, added a few smaller comments but nothing major

Let's run slow tests for the special models (blt, dia, kyutai) plus the general special ones (mamba1/2, bamba, etc.).

Comment on lines 2014 to 2024
if backend == "quanto" and not is_optimum_quanto_available():
raise ImportError(
"You need to install optimum-quanto in order to use KV cache quantization with optimum-quanto "
"backend. Please install it via with `pip install optimum-quanto`"
)
elif backend == "HQQ" and not is_hqq_available():
raise ImportError(
"You need to install `HQQ` in order to use KV cache quantization with HQQ backend. "
"Please install it via with `pip install hqq`"
)
model_kwargs["past_key_values"] = QuantizedCache(backend=backend, **cache_config)
Contributor

Imo, this error should be raised within the constructor of the cache - we should not have to check this ourselves here

Member Author

Actually, it is checked inside the cache class as well. I don't remember exactly what the reason was for also checking it here; I'll git blame and see.
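
A minimal sketch of the reviewer's suggestion (toy class and helper, not the real QuantizedCache): when the backend availability check lives in the cache constructor, call sites like the excerpt above don't need to duplicate the ImportError logic.

from importlib.metadata import PackageNotFoundError, version


def _is_installed(dist_name: str) -> bool:
    # Lightweight availability probe using only the standard library.
    try:
        version(dist_name)
        return True
    except PackageNotFoundError:
        return False


class ToyQuantizedCache:
    # Hypothetical stand-in for a quantized KV cache, for illustration only.
    _REQUIRED_PACKAGE = {"quanto": "optimum-quanto", "hqq": "hqq"}

    def __init__(self, backend: str, **cache_config):
        package = self._REQUIRED_PACKAGE.get(backend.lower())
        if package is None:
            raise ValueError(f"Unknown quantization backend: {backend!r}")
        if not _is_installed(package):
            # Raising here keeps the check in one place instead of at every call site.
            raise ImportError(f"Backend {backend!r} requires `pip install {package}`.")
        self.backend = backend
        self.cache_config = cache_config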


@zucchini-nlp changed the title from "Generation cache preparation" to "🚨 Generation cache preparation" on Feb 3, 2026
@zucchini-nlp (Member Author)

Remark: kinda breaking, but in a good way. Previously, models initialized their cache in model.forward without passing the config, which meant sliding windows weren't always respected (e.g. Afmoe). From now on, the sliding window is always respected if it is set in the config.

Added a 🚨

@zucchini-nlp enabled auto-merge (squash) on February 3, 2026 at 12:01
github-actions bot (Contributor) commented on Feb 4, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, blt, dia, janus, kyutai_speech_to_text

@zucchini-nlp (Member Author)

@vasqu can you force merge? CI is still super flaky and I've already retriggered 5-6 times 😢

@zucchini-nlp merged commit 1687954 into huggingface:main on Feb 4, 2026
25 checks passed

Successfully merging this pull request may close these issues: GenerationMixin cache missing in v5.0.0 during chunked_prefill