Conversation

Member

@Cyrilvallez Cyrilvallez commented Jan 15, 2026

What does this PR do?

As per the title: the Transpose operation needs to be more general so that it can be used in reverse mode within a chain of several operations.
Supersedes #43201, as that one needed more work on the Transpose operation!

Will also help #43227
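
For context, a minimal sketch of what a reversible Transpose conversion op in a chain can look like. This is illustrative only: the TransposeOp class, its convert/revert methods, and the chain driver below are hypothetical, not the PR's actual implementation.

import torch


class TransposeOp:
    """Hypothetical conversion op: swaps two dims, and can be applied in reverse."""

    def __init__(self, dim0: int = 0, dim1: int = 1):
        self.dim0 = dim0
        self.dim1 = dim1

    def convert(self, tensor: torch.Tensor) -> torch.Tensor:
        # Forward mode: checkpoint layout -> model layout
        return tensor.transpose(self.dim0, self.dim1).contiguous()

    def revert(self, tensor: torch.Tensor) -> torch.Tensor:
        # Reverse mode: a transpose is its own inverse for a fixed pair of dims
        return tensor.transpose(self.dim0, self.dim1).contiguous()


# A chain of ops runs front-to-back in forward mode and back-to-front in reverse mode
ops = [TransposeOp(0, 1)]
weight = torch.randn(4, 8)
converted = weight
for op in ops:
    converted = op.convert(converted)
restored = converted
for op in reversed(ops):
    restored = op.revert(restored)
assert torch.equal(restored, weight)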

Comment on lines 266 to 280
def get_target_pattern(
    self, input_dict: dict[str, torch.Tensor], source_patterns: list[str], target_patterns: list[str]
) -> str:
    if len(input_dict) != 1:
        raise ValueError("Undefined Operation encountered!")
    if len(target_patterns) > 1:
        # Here it's the first operation of a chain, so return the source
        if len(source_patterns) == 1:
            return source_patterns[0]
        else:
            raise ValueError("Undefined Operation encountered!")
    # Here it's the only operation, or the last operation in a chain, so we return the target
    else:
        return target_patterns[0]
Member

@IlyasMoutawwakil IlyasMoutawwakil Jan 15, 2026

So it has to be either the first or the last in a chain of ops? I can see the dequantization ops breaking this as they extend the chain 🥲 Correct me if I'm wrong.

Member Author

Nope, it should be alright with quantization! They do not change the targets/sources!
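
For illustration, a standalone copy of the selection logic from the snippet above, called with made-up patterns (the weight key and pattern names are hypothetical, not from the PR). Extra ops in a chain, such as dequantization, do not change source_patterns/target_patterns, so the selection is unaffected:

import torch


def get_target_pattern(
    input_dict: dict[str, torch.Tensor], source_patterns: list[str], target_patterns: list[str]
) -> str:
    # Standalone copy of the reviewed logic, for illustration only
    if len(input_dict) != 1:
        raise ValueError("Undefined Operation encountered!")
    if len(target_patterns) > 1:
        # First operation of a chain: return the (single) source pattern
        if len(source_patterns) == 1:
            return source_patterns[0]
        raise ValueError("Undefined Operation encountered!")
    # Only operation, or last in a chain: return the target pattern
    return target_patterns[0]


weights = {"model.layers.0.up_proj.weight": torch.randn(8, 4)}

# Only (or last) op in the chain -> resolves to the target pattern
print(get_target_pattern(weights, ["*.up_proj.weight"], ["*.mlp.up_proj.weight"]))

# First op of a chain with several targets -> resolves to the source pattern
print(get_target_pattern(weights, ["*.up_proj.weight"], ["*.gate_proj.weight", "*.up_proj.weight"]))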

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment

Sounds good, but I'm concerned about TP + Transpose; do we need a more abstract solution, like updating the shard dim based on the op at init of the ops?

Comment on lines +253 to +255
def __init__(self, dim0: int = 0, dim1: int = 1):
    self.dim0 = dim0
    self.dim1 = dim1
Collaborator

should we pass the full converter to change the shard dim at init?

Contributor

Hmm, but aren't you assuming that we need to change the shard dim based on the transpose? At least the models that do use it do so to be aligned with other models, i.e. they will need to be sharded the same way as intended.

There might be a few models that shard differently though; I would consider that not part of the conversion op - otherwise we will mix in our assumptions 🤔

Member Author

It's due to op order - sharding happens before Transpose, so if you transpose the dim that was sharded on, you've got an issue.
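
To make the ordering issue concrete, a minimal sketch with a hypothetical 2-way TP plan and made-up shapes (not taken from the PR): the weight is sharded along dim 0 first, then each local shard is transposed, so the plan's original shard dim no longer matches.

import torch

# Hypothetical 2-way TP setup: the plan shards this weight along dim 0
full_weight = torch.arange(12.0).reshape(4, 3)
shards = torch.chunk(full_weight, 2, dim=0)               # sharding happens first...
local = [s.transpose(0, 1).contiguous() for s in shards]  # ...then the Transpose op runs on each shard

expected = full_weight.transpose(0, 1)
# Re-assembling along the plan's original dim 0 no longer reconstructs the transposed weight
print(torch.equal(torch.cat(local, dim=0), expected))  # False
# The shard dim has effectively moved to dim 1 after the transpose
print(torch.equal(torch.cat(local, dim=1), expected))  # True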

Contributor

Ah I messed up the order, I thought it was transpose then shard

Member Author

Life would be too nice... 😆

@Cyrilvallez
Member Author

Actually, I checked and the params on which Transpose is used are not part of the TP plan anyway. So I'd rather merge this PR now, as it fixes the model and unblocks other PRs; we can revisit the TP shard dim in the future if the issue comes up.

@vasqu
Contributor

vasqu commented Jan 15, 2026

Can we at least add a comment, a TODO/FIXME or similar? This is quite important (although not used with TP atm).

@Cyrilvallez
Member Author

Done. I think it's easier this way, as it's not as straightforward as I thought it would be to always switch the shard dim when required.

Contributor

@vasqu vasqu left a comment

Discussed internally; we will need to double-check many of the current ops we already have.

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43307&sha=9a937a

@Cyrilvallez Cyrilvallez merged commit dfdede2 into main Jan 16, 2026
24 of 26 checks passed
@Cyrilvallez Cyrilvallez deleted the fix-transpose branch January 16, 2026 11:27

Development

Successfully merging this pull request may close these issues.

Transformer version: 5.0.0.dev0 breaks Qwen3VL Moe models loading

6 participants