fix: continuous batching in transformers serve
#40479
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
num_blocks=1,
block_size=1024,
do_sample=False,
max_batch_tokens=10,
Love that!
thx @remi-or 😄
Happy to help 🤗 I just want to point out that while num_blocks and max_batch_tokens can be inferred from available GPU memory, if block_size is not given it simply defaults to 32, which is quite far from the previous 1024 here. Might not be important though!
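To make the point about defaults concrete, here is a minimal sketch of pinning block_size explicitly while leaving the other values to be inferred. It assumes the continuous-batching settings ride on a GenerationConfig, as the snippet above suggests; the exact plumbing inside transformers serve may differ.

```python
from transformers import GenerationConfig

# Sketch only: num_blocks and max_batch_tokens can reportedly be inferred from
# available GPU memory, but block_size falls back to 32 when omitted, so it is
# pinned here to keep the previous 1024 behaviour.
generation_config = GenerationConfig(
    do_sample=False,
    block_size=1024,  # explicit, since the implicit default would be 32
    # num_blocks and max_batch_tokens are left unset so they can be inferred
)
```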
personally lgtm
LGTM, thanks @McPatate! The only thing I'm a bit wary about is the change from attn_implementation toggling CB to the explicit flag continuous_batching, especially as the latter still requires the former to be set. Would it be possible to have the flag --continuous_batching also correctly toggle a paged attention method if not set?
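For illustration only, a rough sketch of what that suggestion could look like on the CLI side; the helper and the "sdpa_paged" backend name are placeholders, not code from this PR.

```python
from argparse import Namespace
from typing import Optional

def resolve_attn_implementation(args: Namespace) -> Optional[str]:
    # Hypothetical: if --continuous_batching is set but no paged attention
    # implementation was chosen, imply one instead of requiring both flags.
    attn = args.attn_implementation
    if args.continuous_batching and (attn is None or "paged" not in attn):
        attn = "sdpa_paged"  # placeholder name for a paged attention backend
    return attn
```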
I understand, I'm not super sure which direction I want to go with this.
Sounds good!
Fixing continuous batching in transformers serve.

- --continuous_batching cmd line flag to enable, open to change this!
- can't repro my previous error, removed the added code and added a test to check if defaults are set correctly
- max_new_tokens can sometimes be None, set a default so it doesn't break CB, which expects it to be set
- added lifespan to the FastAPI instance and stop to TimedModel to make delete_model "public", so we cancel the threading.Timer that was causing the server to hang on SIGINT (see the sketches after this list)
- added request_id_iter to iterate only on tokens linked to a given request_id
- updated get_result to requeue tokens if request_id is not None && req.request_id != request_id (before, we were losing tokens while iterating directly on all output_queue tokens); this is also sketched after the list
- moved the DecodeStream object to live in the RequestState rather than being a single instance linked to the manager
- removed any trace of the tokenizer within the CB impl; it didn't make sense to have it here as we are already expecting encoded tokens, leaving it up to the caller to decode (updated the serving code adequately)
- removed next_token from RequestState as it wasn't used; in streaming I've used generated_tokens[-1] to get the latest token
- changed the prepare_next_batch signature, it now returns a bool to short-circuit the inner generation loop when it didn't prepare anything
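A small sketch of the lifespan / SIGINT fix described above, with TimedModel reduced to a stand-in class. Only the shape of the idea, a FastAPI lifespan hook cancelling the keep-alive threading.Timer on shutdown, is taken from the description; the names and details are assumptions.

```python
import threading
from contextlib import asynccontextmanager

from fastapi import FastAPI

class TimedModel:
    """Stand-in for the real TimedModel: unloads itself after a timeout."""

    def __init__(self, name: str, timeout_s: float = 300.0):
        self.name = name
        self._timer = threading.Timer(timeout_s, self.delete_model)
        self._timer.start()

    def delete_model(self) -> None:
        # "Public" on purpose: the shutdown path needs to cancel the timer,
        # otherwise the non-daemon Timer thread keeps the process alive on SIGINT.
        self._timer.cancel()

loaded_models: dict[str, TimedModel] = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # the server handles requests while suspended here
    for model in loaded_models.values():
        model.delete_model()  # cancel every pending timer on shutdown

app = FastAPI(lifespan=lifespan)
```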
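And a sketch of the requeueing behaviour: results popped from the shared output queue that belong to another request are put back instead of being dropped. The class and helpers below are simplified stand-ins, not the PR's actual signatures.

```python
import queue
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class GenerationOutput:
    """Simplified stand-in for what the batching manager puts on its output queue."""
    request_id: str
    token_id: int

def get_result(
    output_queue: "queue.Queue[GenerationOutput]",
    request_id: Optional[str] = None,
    timeout: float = 0.1,
) -> Optional[GenerationOutput]:
    """Pop one result; requeue it if it belongs to a different request."""
    try:
        out = output_queue.get(timeout=timeout)
    except queue.Empty:
        return None
    if request_id is not None and out.request_id != request_id:
        output_queue.put(out)  # don't lose the other request's token
        return None
    return out

def request_id_iter(
    output_queue: "queue.Queue[GenerationOutput]",
    request_id: str,
) -> Iterator[GenerationOutput]:
    """Yield only results linked to request_id (a real version would also stop on a finish signal)."""
    while True:
        out = get_result(output_queue, request_id=request_id)
        if out is not None:
            yield out
```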