enable cpu paged cache #42869

Merged

Cyrilvallez merged 50 commits into huggingface:main from jiqing-feng:cpu_paged on Jan 29, 2026

Conversation

@jiqing-feng
Contributor

@jiqing-feng jiqing-feng commented Dec 15, 2025

CPU can also use the paged cache with eager or sdpa attention:

python continuous_batching_simple.py --attn sdpa
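
For reference, here is a rough sketch of what the script drives under the hood. The generate_batch call and result fields are assumptions based on transformers' continuous batching API, not a copy of the script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",  # eager works on CPU as well
).to("cpu")

generation_config = GenerationConfig(max_new_tokens=32, do_sample=False)
prompts = ["What is paged attention?", "Explain continuous batching."]
inputs = [tokenizer(p).input_ids for p in prompts]

# generate_batch drives the continuous-batching loop that previously crashed on CPU
results = model.generate_batch(inputs=inputs, generation_config=generation_config)
for request in results.values():
    print(tokenizer.decode(request.generated_tokens, skip_special_tokens=True))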

Without this change, the command above fails with:

Error in generation loop: unsupported operand type(s) for -: 'NoneType' and 'int'
Traceback (most recent call last):
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/continuous_api.py", line 1017, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
                            ^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 191, in __init__
    num_blocks, max_batch_tokens = memory_handler.infer_num_blocks_and_max_batch_tokens(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 481, in infer_num_blocks_and_max_batch_tokens
    num_blocks, max_batch_tokens = self.compute_num_blocks_and_max_batch_tokens(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 522, in compute_num_blocks_and_max_batch_tokens
    cache_memory = self.get_available_memory(max_memory_percent)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 456, in get_available_memory
    available_memory = total - max(allocated, reserved)
                       ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
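
The root cause is that the CUDA memory stats feeding this subtraction are None on CPU. A plausible shape of the fix, assuming a psutil-based fallback for system memory (the PR's actual code may differ):

import psutil
import torch

def get_available_memory(device: torch.device, max_memory_percent: float) -> int:
    # On CUDA, budget what the device has left after PyTorch's own bookkeeping
    if device.type == "cuda":
        total = torch.cuda.get_device_properties(device).total_memory
        allocated = torch.cuda.memory_allocated(device)
        reserved = torch.cuda.memory_reserved(device)
        available = total - max(allocated, reserved)
    else:
        # CPU path: torch.cuda stats are unavailable here, so use free system RAM
        available = psutil.virtual_memory().available
    return int(available * max_memory_percent)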

@remi-or
Collaborator

remi-or commented Dec 15, 2025

Hi @jiqing-feng, thanks for the contribution! Just letting you know that CPU-compatible continuous batching is not a priority right now, so even though this PR is small, it will not be reviewed right away. I am cautious about two things:

  1. How device map "auto" behaves and how it affects the model's repartition
  2. The lack of tests / benchmarks. We have a small template for continuous batching PRs, as in [CB] Easy optimizations for continuous batching #42839; if you can follow it, that would be great.

Will get to review this as soon as I have the bandwidth, thank you!

@jiqing-feng
Contributor Author

  1. device_map="auto" assigns an accelerator such as cuda/xpu if one exists; otherwise it falls back to cpu (sketched below). I reverted this change and only added cpu as an option when cuda is not available, so behavior is unchanged when cuda exists.
  2. OK, I will add tests and benchmarks for it.
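
A minimal sketch of the fallback in point 1 (assumed logic, not the PR's exact code):

import torch

# Keep cuda when present; otherwise fall back to cpu
device = "cuda" if torch.cuda.is_available() else "cpu"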

@jiqing-feng
Contributor Author

jiqing-feng commented Dec 16, 2025

Hi @remi-or. I have updated the tests and examples for CPU; the example and tests now pass on CPU. Please review this PR and let me know your opinion. Thanks!

@jiqing-feng
Contributor Author

Hi @remi-or. Do you have bandwidth to review this PR?

@jiqing-feng
Contributor Author

Hi @SunMarc. We have enabled flash varlen attention for CPU: https://huggingface.co/kernels-community/flash-attn2/tree/main/build.
With this change, CPU can also use paged flash attention in the continuous batching case. The official example continuous_batching_simple.py gains a 1.6x speedup with kernels-community/flash_attention_2 compared to paged|sdpa.
Would you please review this PR? Thanks!
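
For illustration, loading the Hub-hosted kernel could look like this; the attn_implementation string form is assumed from transformers' kernels integration, not taken from this PR:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model choice
    attn_implementation="kernels-community/flash-attn2",  # fetches the kernel build from the Hub
)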

@jiqing-feng
Contributor Author

jiqing-feng commented Jan 13, 2026

Regarding the failed tests: tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_num_return_sequences_1 passes on CPU but fails on an NVIDIA A100. The test also fails on the A100 without my changes.

@remi-or
Collaborator

remi-or commented Jan 13, 2026

Regarding the failed tests: tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_num_return_sequences_1 passes on CPU but fails on an NVIDIA A100. The test also fails on the A100 without my changes.

Yes, that test is a bit flaky; I will look into it soon.

I just merged a big PR which caused conflicts, my bad. Could you update your PR so I can review? Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@remi-or
Collaborator

remi-or commented Jan 26, 2026

Hi @jiqing-feng, I just ran the tests on my end and a lot do not pass. Here are the results:

FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_block_sharing_with_hybrid_model - RuntimeError: No CUDA GPUs are available
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_config_combinations_08 - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_config_combinations_09 - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_config_combinations_20 - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_config_combinations_21 - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_diverse_models_0_TinyLlama_TinyLlama_1_1B_Chat_v1_0 - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_diverse_models_1_TinyLlama_TinyLlama_1_1B_Chat_v1_0 - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_diverse_models_4_google_gemma_2_2b_it - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_continuous_batching_diverse_models_5_google_gemma_2_2b_it - NotImplementedError: Could not run '_flash_attn2_588b404::varlen_fwd' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using c...
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_num_return_sequences_0 - AssertionError: 0 != 2 : Expected 2 results, but got len(results) = 0
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_num_return_sequences_1 - AssertionError: 0 != 2 : Expected 2 results, but got len(results) = 0
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_prefix_sharing - RuntimeError: No CUDA GPUs are available

I am surprised because I thought there was some version of flash attention that worked on CPU, but I might be wrong here. If not, please add back the torch_accelerator decorator for those tests, or a skip clause.
For the OSError, you can probably solve that by connecting to HF using a token or the hf CLI. I ran the tests using: CUDA_VISIBLE_DEVICES="" RUN_SLOW=1 pytest tests/generation/test_continuous_batching.py
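
For reference, gating a test on accelerator availability could look like this; require_torch_accelerator comes from transformers.testing_utils, and the test body here is a placeholder:

import unittest

from transformers.testing_utils import require_torch_accelerator

class ContinuousBatchingGenerationTest(unittest.TestCase):
    @require_torch_accelerator  # auto-skips when no CUDA/XPU device is present
    def test_block_sharing_with_hybrid_model(self):
        self.assertTrue(True)  # placeholder body for illustration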

@jiqing-feng
Contributor Author

jiqing-feng commented Jan 27, 2026

Hi @remi-or. Most of the failed tests you listed pass on my side, and some failures like

FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_num_return_sequences_0 - AssertionError: 0 != 2 : Expected 2 results, but got len(results) = 0
FAILED tests/generation/test_continuous_batching.py::ContinuousBatchingGenerationTest::test_num_return_sequences_1 - AssertionError: 0 != 2 : Expected 2 results, but got len(results) = 0

are fixed in my latest changes.

[screenshot: local run showing the tests passing]

Here are my key packages:

torch                     2.10.0+cpu
kernels                   0.12.1
transformers              5.0.1.dev0

@jiqing-feng
Contributor Author

Hi @remi-or. It seems that you didn't correctly load the latest kernels from here: https://huggingface.co/kernels-community/flash-attn2/tree/main/build.

If possible, I'd like to log in to your node to check the environment. My email is jiqing.feng@intel.com.
Please let me know if you need me to check the env. Thanks!

@jiqing-feng
Contributor Author

Hi @remi-or. I have addressed your comment and now check for cuda before using a cuda stream. Please review the new change. Thanks!
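
A hypothetical shape of that guard (the PR's actual code may differ):

import torch

if torch.cuda.is_available():
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        out = torch.ones(4, device="cuda") * 2  # device-side work on a side stream
    torch.cuda.current_stream().wait_stream(stream)  # sync before consuming out
else:
    out = torch.ones(4) * 2  # plain CPU path, no stream needed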

@jiqing-feng
Contributor Author

The failed CI jobs are not related to my changes; the main branch also fails.

@jiqing-feng
Contributor Author

Hi @remi-or. I've addressed your comment. Please review the new change. Thanks!

@jiqing-feng
Contributor Author

Hi @remi-or , since this PR has been open for a while, I’m hoping we can wrap it up today. I’ll be online for the next few hours to address your feedback immediately. The failed CI is not related to my changes.

@jiqing-feng jiqing-feng requested a review from remi-or January 29, 2026 13:04
Refactor the initialization of _graphs to simplify the condition for using CUDA graphs.
@remi-or
Collaborator

remi-or commented Jan 29, 2026

Hi, I just modified something related to the _graphs attribute, because perhaps my last comment was unclear. Testing and merging if tests pass.
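
For context, the refactor plausibly looks like this (an assumed shape, not the PR's exact code): the attribute itself encodes whether CUDA graphs are enabled, so a single None check gates graph capture everywhere.

import torch

class DecoderState:
    def __init__(self, use_cuda_graph=None):
        if use_cuda_graph is None:
            use_cuda_graph = torch.cuda.is_available()
        # None means "CUDA graphs disabled"; a dict means "enabled, keyed by shape"
        self._graphs = {} if use_cuda_graph else None

    @property
    def use_cuda_graph(self):
        return self._graphs is not None  # the single condition used everywhere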

Collaborator

@remi-or remi-or left a comment

LGTM! Thanks for all the work you put into this. Please commit the 2 suggestions before merging! One is needed, the other will be very useful.

remi-or and others added 4 commits January 29, 2026 15:55
@jiqing-feng
Copy link
Contributor Author

LGTM! Thanks for all the work you put into this. Please commit the 2 suggestions before merging! One is needed, the other will be very useful.

Hi @remi-or. I have committed your suggestions. Thanks!

@Cyrilvallez Cyrilvallez merged commit 071e178 into huggingface:main Jan 29, 2026
19 of 25 checks passed