Conversation

@McPatate McPatate commented Aug 27, 2025

Fixes continuous batching in transformers serve.

  • Added a --continuous_batching command-line flag to enable it (open to changing this!)
  • max_new_tokens can sometimes be None, which breaks CB since it expects a value. I couldn't reproduce my earlier error, so I removed the workaround code and instead added a test checking that the defaults are set correctly.
  • Fixed the server hanging on shutdown by adding a lifespan to the FastAPI instance:
    • it calls the continuous batching manager's stop method
    • refactored TimedModel to make delete_model "public", so we can cancel the threading.Timer that was keeping the server alive on SIGINT
  • Added request_id_iter to iterate only over the tokens linked to a given request_id
    • refactored get_result to requeue tokens when request_id is not None and req.request_id != request_id (previously we were losing tokens by iterating directly over all output_queue tokens)
  • Fixed the iterator to keep looping even when output_queue is momentarily empty
  • Moved the DecodeStream object into RequestState rather than being a single instance on the manager, and removed all traces of the tokenizer from the CB implementation; it didn't make sense there, since we already expect encoded tokens. Decoding is now left to the caller (the serving code was updated accordingly).
  • Removed next_token from RequestState since it was unused; in streaming, generated_tokens[-1] now provides the latest token
  • Changed the prepare_next_batch signature: it now returns a bool so the inner generation loop can short-circuit when nothing was prepared
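The token-routing fix above (requeue instead of drop, and keep polling on an empty queue) can be illustrated with a pure-Python sketch. The Token dataclass and its is_last flag are hypothetical stand-ins for the real structures (the actual implementation tracks completion via RequestState); only the requeue behavior mirrors the PR description:

```python
import queue
from dataclasses import dataclass


@dataclass
class Token:
    request_id: str
    token_id: int
    is_last: bool = False  # hypothetical end-of-request marker


def request_id_iter(output_queue: queue.Queue, request_id: str):
    """Yield only the tokens belonging to request_id.

    Tokens for other requests are requeued rather than dropped, so concurrent
    consumers iterating for a different request_id still see them.
    """
    while True:
        try:
            tok = output_queue.get(timeout=0.05)
        except queue.Empty:
            continue  # an empty queue is not the end of the request; keep polling
        if tok.request_id != request_id:
            output_queue.put(tok)  # not ours: requeue instead of losing it
            continue
        yield tok
        if tok.is_last:
            return
```

Note how the except branch keeps the iterator alive when the queue is momentarily empty, and how a foreign token goes back to the tail of the FIFO queue rather than being consumed.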

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik LysandreJik left a comment


Thanks @McPatate, clean!

Pinging @gante for second review as it touches some of his work as well

Comment on lines -777 to -780
num_blocks=1,
block_size=1024,
do_sample=False,
max_batch_tokens=10,

Love that!

@McPatate (author) replied:

thx @remi-or 😄

@remi-or replied:

Happy to help 🤗 I just want to point out that while num_blocks and max_batch_tokens can be inferred from available GPU memory, if block_size is not given it simply defaults to 32, which is quite far from the previous 1024 here. Might not be important though!

@LysandreJik LysandreJik requested a review from gante August 27, 2025 12:19
@McPatate McPatate force-pushed the fix/continuous_batching_serve branch 2 times, most recently from fb0c732 to 5f8c994 Compare August 28, 2025 12:50
@ArthurZucker ArthurZucker left a comment


Personally LGTM!

@McPatate McPatate force-pushed the fix/continuous_batching_serve branch 3 times, most recently from a322182 to b93afe0 Compare September 1, 2025 14:43
@McPatate McPatate force-pushed the fix/continuous_batching_serve branch from b93afe0 to bc392de Compare September 1, 2025 14:44
@LysandreJik LysandreJik left a comment


LGTM, thanks @McPatate! The only thing I'm a bit wary about is the change from attn_implementation toggling CB to the explicit flag continuous_batching, especially as the latter still requires the former to be set.

Would it be possible to have the flag --continuous_batching also correctly toggle a paged attention method if not set?

McPatate commented Sep 2, 2025

I understand, though I'm not yet sure which direction I want to take this.
Which attention implementation should we toggle, though? I assume the idea would be to keep a list of CB-compatible attention implementations and raise an error if the chosen one isn't in it?
Wdyt?
Perhaps we can merge this as is for now and refactor it in a subsequent PR.
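The follow-up idea discussed here could look something like the sketch below. Everything is hypothetical: the function name, the allow-list contents, and the implementation names are assumptions, not transformers APIs; only the intent (default --continuous_batching to a paged attention implementation, reject incompatible ones) comes from the thread:

```python
# Hypothetical allow-list: attention implementations assumed to support
# continuous batching. Real names would come from transformers itself.
CB_COMPATIBLE_ATTN_IMPLS = ("paged_attention", "sdpa_paged")


def resolve_attn_implementation(continuous_batching: bool, attn_implementation=None):
    """Pick (or validate) an attention implementation for continuous batching.

    If CB is requested and no implementation was chosen, default to a paged
    one; if one was chosen but is incompatible, fail loudly.
    """
    if not continuous_batching:
        return attn_implementation
    if attn_implementation is None:
        # --continuous_batching alone toggles a paged attention method
        return CB_COMPATIBLE_ATTN_IMPLS[0]
    if attn_implementation not in CB_COMPATIBLE_ATTN_IMPLS:
        raise ValueError(
            f"attn_implementation={attn_implementation!r} is not compatible with "
            f"continuous batching; pick one of {CB_COMPATIBLE_ATTN_IMPLS}"
        )
    return attn_implementation
```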

@LysandreJik
Sounds good!

@LysandreJik LysandreJik merged commit b2b1c30 into main Sep 2, 2025
23 of 25 checks passed
@LysandreJik LysandreJik deleted the fix/continuous_batching_serve branch September 2, 2025 08:45