Conversation

@DN6 DN6 (Collaborator) commented Oct 7, 2025

What does this PR do?

Currently, CP inference will run with split hooks even if the attention backend doesn't support it. This can lead to unexpected results (#12443).

This PR:

  1. Fixes the validation check in the AttentionBackendRegistry so that it returns False for invalid backends.
  2. Removes redundant checks in the ContextParallelConfig and adds properties to access device mesh config arguments in a more general way.
  3. Moves the CP-config-related validation checks into the CP config itself, rather than performing them in enable_parallelism (see the sketch after this list).
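
For orientation, here is a minimal sketch of how points 1 and 2 could look. The names below (`supports_context_parallel`, `ToyContextParallelConfig`, the backend strings) are illustrative assumptions, not the actual diffusers API:

```python
# Illustrative sketch only; the real registry and config live in diffusers and differ in detail.
_CP_CAPABLE_BACKENDS = {"backend_a", "backend_b"}  # placeholder backend names, assumed


def supports_context_parallel(backend_name: str) -> bool:
    # Point 1: the registry check returns False for backends that cannot run
    # with context-parallel split hooks, instead of letting them through.
    return backend_name in _CP_CAPABLE_BACKENDS


class ToyContextParallelConfig:
    """Toy stand-in for a CP config exposing mesh arguments as properties (point 2)."""

    def __init__(self, ring_degree: int = 1, ulysses_degree: int = 1):
        self.ring_degree = ring_degree
        self.ulysses_degree = ulysses_degree

    @property
    def mesh_shape(self):
        # Mirrors the (ring, ulysses) ordering used when building the device mesh.
        return (self.ring_degree, self.ulysses_degree)

    @property
    def mesh_dim_names(self):
        return ("ring", "ulysses")
```

The idea is that enable_parallelism can then bail out early when the backend check returns False, instead of silently attaching split hooks as described in the linked issue.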

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul sayakpaul (Member) left a comment

Left some nits, otherwise looks very nice!

raise NotImplementedError(
    f"Only rotate_method='allgather' is supported for now, but got {self.rotate_method}."
)

if self.ulysses_degree * self.ring_degree > world_size:
Member

Should we even hit this line, since both cannot be set, right?

Collaborator Author

Both can technically be set, but currently both can't be > 1. Also, this is for cases where you have 3 GPUs available and set something like ulysses_degree=1 and ring_degree=4 (the number of GPUs being requested is greater than the world size).

Member

Feels slightly confusing to me but since we're erroring out early for unsupported ulysses_degree and ring_degree value combos, I think it's okay.
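
To pin down the check being discussed, here is a small sketch of the degree validation. The function name and error messages are assumptions for illustration, not the actual diffusers code:

```python
def validate_cp_degrees(ring_degree: int, ulysses_degree: int, world_size: int) -> None:
    # Both degrees can technically be set, but currently only one of them may be > 1.
    if ring_degree > 1 and ulysses_degree > 1:
        raise ValueError("Only one of ring_degree or ulysses_degree can be > 1 for now.")
    # Catches requesting more GPUs than are available, e.g. ring_degree=4 on 3 GPUs.
    if ring_degree * ulysses_degree > world_size:
        raise ValueError(
            f"ring_degree * ulysses_degree = {ring_degree * ulysses_degree} "
            f"exceeds world_size = {world_size}."
        )


try:
    # Example from the thread above: 3 GPUs with ulysses_degree=1 and ring_degree=4.
    validate_cp_degrees(ring_degree=4, ulysses_degree=1, world_size=3)
except ValueError as e:
    print(e)  # ring_degree * ulysses_degree = 4 exceeds world_size = 3
```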

Comment on lines +118 to +122
self._flattened_mesh = self._mesh._flatten()
self._ring_mesh = self._mesh["ring"]
self._ulysses_mesh = self._mesh["ulysses"]
self._ring_local_rank = self._ring_mesh.get_local_rank()
self._ulysses_local_rank = self._ulysses_mesh.get_local_rank()
Member

Can't they be None? Why remove the guards?

Collaborator Author

They are internal attributes derived from the mesh, which is set through the setup method. The device mesh object is also only created dynamically when enable_parallelism is called.

The guards are redundant; they would always be None unless set explicitly for some custom debugging.
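
For reference, the attributes in the excerpt above are all derived from the mesh in one place once it exists. A rough sketch of that setup using torch's device-mesh API (the helper function here is hypothetical and assumes a process group has already been initialized, e.g. via torchrun):

```python
from torch.distributed.device_mesh import init_device_mesh


def setup_cp_mesh(ring_degree: int, ulysses_degree: int, device_type: str = "cuda"):
    # Hypothetical helper; the 2D mesh only comes into existence at this point.
    mesh = init_device_mesh(
        device_type,
        mesh_shape=(ring_degree, ulysses_degree),
        mesh_dim_names=("ring", "ulysses"),
    )
    flattened_mesh = mesh._flatten()   # all CP ranks viewed as a single flat mesh
    ring_mesh = mesh["ring"]           # sub-mesh along the ring dimension
    ulysses_mesh = mesh["ulysses"]     # sub-mesh along the ulysses dimension
    ring_local_rank = ring_mesh.get_local_rank()
    ulysses_local_rank = ulysses_mesh.get_local_rank()
    return flattened_mesh, ring_mesh, ulysses_mesh, ring_local_rank, ulysses_local_rank
```

Since these values are only derived after the mesh is created, None-guards around them never trigger in practice, which is the redundancy being removed.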

mesh_shape=(cp_config.ring_degree, cp_config.ulysses_degree),
mesh_dim_names=("ring", "ulysses"),
)
from .attention_dispatch import AttentionBackendName, _AttentionBackendRegistry
Member

(nit): Would prefer to keep the import at the beginning of the method implementation, if possible (after from .attention_processor import Attention, MochiAttention).
