
Conversation

@remi-or (Collaborator) commented Sep 11, 2025

Some architectures, like Llama, alter the attention mask when it is not a tensor, which was not compatible with the way CB (continuous batching) created and handled the attention mask. Now, arguments like attention_mask, cumulative_seqlens_k, and max_seqlen_k are tensors or ints unless the model is hybrid, in which case they are dictionaries (see the sketch after the list below). This is the main fix, but this PR also:

  • cleans up how those arguments are built, reset, and delivered to the model, to make things tidier
  • adds support for attention sinks in eager_paged
  • explicitly disables (for now) CUDA graphs and sampling in CB
  • adds tests for CB that pass on MI325 and H100 (though results are a little sketchy for some models there)
  • fixes some telemetry metrics that were broken when hybrid allocation was added
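
A minimal sketch of the argument contract described above, with hypothetical helper and parameter names (this is not the actual transformers internals): a standard model receives plain tensors and ints, while a hybrid model receives dictionaries keyed by layer type.

import torch

def build_cb_attn_kwargs(
    attention_mask: torch.Tensor,
    cumulative_seqlens_k: torch.Tensor,
    max_seqlen_k: int,
    hybrid_layer_types: list[str] | None = None,
) -> dict:
    # Non-hybrid case: pass tensors / ints directly, so model code that
    # inspects or alters a tensor mask (e.g. Llama) keeps working.
    kwargs = {
        "attention_mask": attention_mask,              # torch.Tensor
        "cumulative_seqlens_k": cumulative_seqlens_k,  # torch.Tensor
        "max_seqlen_k": max_seqlen_k,                  # int
    }
    if hybrid_layer_types is not None:
        # Hybrid case: each argument becomes a dictionary keyed by layer
        # type, e.g. {"full_attention": ..., "sliding_attention": ...}.
        # (In reality each layer type would carry its own values; sharing
        # one value here just keeps the sketch short.)
        kwargs = {
            name: {layer_type: value for layer_type in hybrid_layer_types}
            for name, value in kwargs.items()
        }
    return kwargs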

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available for 30 days after the last update.

@ArthurZucker (Collaborator) left a comment:

LGTM, surprised that sampling is not supported, but CUDA graphs are. Wondering if you just set slice_inputs = not cuda_graph?
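
For context, a minimal sketch of what this suggestion amounts to, with hypothetical variable names: CUDA graphs are captured with static shapes, so inputs would only be sliced to their actual length when graphs are off.

# Hypothetical sketch of the suggestion: tie input slicing to CUDA-graph use.
use_cuda_graphs = False             # CB currently disables CUDA graphs (see above)
slice_inputs = not use_cuda_graphs  # slice dynamic lengths only when not capturing graphs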

@remi-or (Collaborator, Author) commented Sep 12, 2025

surprised that sampling is not supported

I think there were issues last time, but I can check

Wondering if you just set slice_inputs = not cuda_graph?

Tried that; it does not work for a few reasons. Will look into restoring it ASAP.

@Cyrilvallez merged commit ef05393 into huggingface:main on Sep 12, 2025 (21 of 23 checks passed).
# We centralize the logger here to coordinate between logging and progress bar
logger = logging.getLogger("ContinuousBatchingLogger")
-logger.setLevel(logging.INFO)
+# logger.setLevel(logging.INFO)
A Member commented:
was this intentional btw?

@remi-or (Collaborator, Author) replied:

Yes, it seems like the default should not be INFO. I can remove the commented-out line next time; I agree it will be cleaner :)
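
For reference, a user who wants the INFO-level output back can opt in explicitly with the standard logging API (the logger name comes from the snippet above):

import logging

# Opt back in to INFO output from the continuous batching logger,
# since the library no longer forces INFO by default.
logging.getLogger("ContinuousBatchingLogger").setLevel(logging.INFO)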

A Member replied:

nice ty

ErfanBaghaei pushed a commit to ErfanBaghaei/transformers that referenced this pull request Sep 25, 2025
* Fix for CB attn mask and refactor

* Tests for CB (not all passing)

* Passing tests and a logger fix

* Fixed the KV metrics that were broken when we moved to hybrid alloc

* Fix circular import and style

* Added tests for FA

* Unfolded test to have device expectations

* Fixes for H100

* more fixes for h100

* H100 are good

* Style

* Adding some comments from huggingface#40831

* Rename test

* Avoid 1 letter variables

* Dictionary is only removed during kwargs

* Test for supported sample

* Fix an involuntary slice

* Fixes for non-sliced inputs and small example improvements

* Slice inputs is more understandable

* Style
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request Oct 4, 2025