Fixes for continuous batching #40828
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM. Surprised that sampling is not supported but CUDA graphs are. Wondering if you could just set `slice_inputs = not cuda_graph`?
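For reference, a minimal sketch of what the suggestion above would mean; `use_cuda_graph` is an illustrative flag name, not necessarily the actual attribute in `continuous_api.py`:

```python
# Hypothetical sketch of the reviewer's suggestion: derive input slicing
# from CUDA graph usage. `use_cuda_graph` is an assumed name.
use_cuda_graph = True

# CUDA graphs replay fixed shapes, so inputs must keep their padded size;
# without CUDA graphs, slicing inputs to the true length avoids wasted compute.
slice_inputs = not use_cuda_graph
```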
I think there were issues last time, but I can check
Tried that; it does not work for a few reasons. Will look into restoring it ASAP.
```diff
  # We centralize the logger here to coordinate between logging and progress bar
  logger = logging.getLogger("ContinuousBatchingLogger")
- logger.setLevel(logging.INFO)
+ # logger.setLevel(logging.INFO)
```
Was this intentional, btw?
Yes, it seems like the default should not be INFO. I can remove the commented-out line next time; I agree it will be cleaner :)
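As a usage note, a minimal sketch of how a user could opt back into INFO-level logs after this change, assuming only the logger name shown in the diff above:

```python
import logging

# The library no longer forces INFO on this logger, so callers who want
# continuous batching progress logs enable the level themselves.
logging.getLogger("ContinuousBatchingLogger").setLevel(logging.INFO)
```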
nice ty
* Fix for CB attn mask and refactor
* Tests for CB (not all passing)
* Passing tests and a logger fix
* Fixed the KV metrics that were broken when we moved to hybrid alloc
* Fix circular import and style
* Added tests for FA
* Unfolded test to have device expectations
* Fixes for H100
* more fixes for h100
* H100 are good
* Style
* Adding some comments from huggingface#40831
* Rename test
* Avoid 1 letter variables
* Dictonnary is only removed during kwargs
* Test for supported sample
* Fix a unvoluntary slice
* Fixes for non-sliced inputs and small example improvments
* Slice inputs is more understandabe
* Style
Some architectures, like `llama`, alter the attention mask if it is not a tensor, which was not compatible with the way CB created and handled the attention mask. Now, arguments like `attention_mask`, `cumulative_seqlens_k`, and `max_seqlen_k` are tensors or ints unless the model is hybrid, in which case they are dictionaries (see the sketch below). This is the main fix, but the PR also: `eager_paged`
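To illustrate the argument layout described above, a hedged sketch; the shapes, values, and hybrid key names (`full_attention`, `sliding_attention`) are assumptions for illustration, not the actual transformers code:

```python
import torch

# Non-hybrid model: plain tensor / int arguments (shapes are illustrative).
attention_mask = torch.ones(1, 1, 4, 16, dtype=torch.bool)  # 4D attention mask
cumulative_seqlens_k = torch.tensor([0, 16])                # cumulative key lengths
max_seqlen_k = 16                                           # plain int

# Hybrid model: one entry per attention type, keyed by layer type.
# The key names here are assumptions mirroring transformers' layer_types style.
attention_mask = {
    "full_attention": torch.ones(1, 1, 4, 16, dtype=torch.bool),
    "sliding_attention": torch.ones(1, 1, 4, 8, dtype=torch.bool),
}
max_seqlen_k = {"full_attention": 16, "sliding_attention": 8}
```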