fix: continuous batching in transformers serve
#40479
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
num_blocks=1,
block_size=1024,
do_sample=False,
max_batch_tokens=10,
Love that!
thx @remi-or 😄
Happy to help 🤗 I just want to point out that while num_blocks and max_batch_tokens can be inferred from available GPU memory, if block_size is not given it simply defaults to 32, which is quite far from the previous 1024 here. Might not be important though!
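To make the point about defaults concrete, here is a minimal sketch of pinning block_size explicitly while leaving the other values to be inferred. It assumes the continuous-batching settings ride on a GenerationConfig, as the snippet above suggests; the exact plumbing inside transformers serve may differ.

```python
from transformers import GenerationConfig

# Sketch only: num_blocks and max_batch_tokens can reportedly be inferred from
# available GPU memory, but block_size falls back to 32 when omitted, so it is
# pinned here to keep the previous 1024 behaviour.
generation_config = GenerationConfig(
    do_sample=False,
    block_size=1024,  # explicit, since the implicit default would be 32
    # num_blocks and max_batch_tokens are left unset so they can be inferred
)
```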
personally lgtm
LGTM, thanks @McPatate! The only thing I'm a bit wary about is the change from attn_implementation toggling CB to the explicit flag continuous_batching, especially as the latter still requires the former to be set. Would it be possible to have the flag --continuous_batching also correctly toggle a paged attention method if not set?
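For illustration only, a rough sketch of what that suggestion could look like on the CLI side; the helper and the "sdpa_paged" backend name are placeholders, not code from this PR.

```python
from argparse import Namespace
from typing import Optional

def resolve_attn_implementation(args: Namespace) -> Optional[str]:
    # Hypothetical: if --continuous_batching is set but no paged attention
    # implementation was chosen, imply one instead of requiring both flags.
    attn = args.attn_implementation
    if args.continuous_batching and (attn is None or "paged" not in attn):
        attn = "sdpa_paged"  # placeholder name for a paged attention backend
    return attn
```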
I understand, I'm not super sure which direction I want to go with this.
Sounds good!
Fixing continuous batching in transformers serve.

- --continuous_batching cmd line flag to enable, open to change this!
- can't repro my previous error, removed the added code and added a test to check if defaults are set correctly
- max_new_tokens can sometimes be None, set a default so it doesn't break CB, which expects it to be set
- added lifespan to the FastAPI instance and stop to TimedModel to make delete_model "public", so we cancel the threading.Timer that was causing the server to hang on SIGINT (see the sketches after this list)
- added request_id_iter to iterate only on tokens linked to a given request_id
- updated get_result to requeue tokens if request_id is not None && req.request_id != request_id (before, we were losing tokens while iterating directly on all output_queue tokens); this is also sketched after the list
- moved the DecodeStream object to live in the RequestState rather than being a single instance linked to the manager
- removed any trace of the tokenizer within the CB impl; it didn't make sense to have it here as we are already expecting encoded tokens, leaving it up to the caller to decode (updated the serving code adequately)
- removed next_token from RequestState as it wasn't used; in streaming I've used generated_tokens[-1] to get the latest token
- changed the prepare_next_batch signature, it now returns a bool to short-circuit the inner generation loop when it didn't prepare anything
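A small sketch of the lifespan / SIGINT fix described above, with TimedModel reduced to a stand-in class. Only the shape of the idea, a FastAPI lifespan hook cancelling the keep-alive threading.Timer on shutdown, is taken from the description; the names and details are assumptions.

```python
import threading
from contextlib import asynccontextmanager

from fastapi import FastAPI

class TimedModel:
    """Stand-in for the real TimedModel: unloads itself after a timeout."""

    def __init__(self, name: str, timeout_s: float = 300.0):
        self.name = name
        self._timer = threading.Timer(timeout_s, self.delete_model)
        self._timer.start()

    def delete_model(self) -> None:
        # "Public" on purpose: the shutdown path needs to cancel the timer,
        # otherwise the non-daemon Timer thread keeps the process alive on SIGINT.
        self._timer.cancel()

loaded_models: dict[str, TimedModel] = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # the server handles requests while suspended here
    for model in loaded_models.values():
        model.delete_model()  # cancel every pending timer on shutdown

app = FastAPI(lifespan=lifespan)
```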
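And a sketch of the requeueing behaviour: results popped from the shared output queue that belong to another request are put back instead of being dropped. The class and helpers below are simplified stand-ins, not the PR's actual signatures.

```python
import queue
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class GenerationOutput:
    """Simplified stand-in for what the batching manager puts on its output queue."""
    request_id: str
    token_id: int

def get_result(
    output_queue: "queue.Queue[GenerationOutput]",
    request_id: Optional[str] = None,
    timeout: float = 0.1,
) -> Optional[GenerationOutput]:
    """Pop one result; requeue it if it belongs to a different request."""
    try:
        out = output_queue.get(timeout=timeout)
    except queue.Empty:
        return None
    if request_id is not None and out.request_id != request_id:
        output_queue.put(out)  # don't lose the other request's token
        return None
    return out

def request_id_iter(
    output_queue: "queue.Queue[GenerationOutput]",
    request_id: str,
) -> Iterator[GenerationOutput]:
    """Yield only results linked to request_id (a real version would also stop on a finish signal)."""
    while True:
        out = get_result(output_queue, request_id=request_id)
        if out is not None:
            yield out
```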