[loading] Fix Transpose Operation, and qwen3_vl_moe mapping #43307
Conversation
```python
def get_target_pattern(
    self, input_dict: dict[str, torch.Tensor], source_patterns: list[str], target_patterns: list[str]
) -> str:
    if len(input_dict) != 1:
        raise ValueError("Undefined Operation encountered!")
    if len(target_patterns) > 1:
        # Here it's the first operation of a chain, so return the source
        if len(source_patterns) == 1:
            return source_patterns[0]
        else:
            raise ValueError("Undefined Operation encountered!")
    # Here it's the only operation, or the last operation in a chain, so we return the target
    else:
        return target_patterns[0]
```
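For illustration, here is a minimal, standalone recreation of this resolution logic with hypothetical weight patterns (none of the pattern strings below come from the PR), showing when the source vs. the target pattern is returned:

```python
import torch

def resolve_pattern(input_dict, source_patterns, target_patterns):
    # Standalone copy of the logic above, for illustration only.
    if len(input_dict) != 1:
        raise ValueError("Undefined Operation encountered!")
    if len(target_patterns) > 1:
        # First operation of a chain: the single source pattern names the tensor.
        if len(source_patterns) == 1:
            return source_patterns[0]
        raise ValueError("Undefined Operation encountered!")
    # Only operation, or last operation of a chain: the single target pattern names the tensor.
    return target_patterns[0]

weights = {"weight": torch.zeros(2, 2)}

# Only (or last) op in a chain: one target -> resolves to the target pattern.
print(resolve_pattern(weights, ["model.attn.qkv"], ["model.self_attn.q_proj"]))

# First op of a chain fanning out into several targets -> resolves to the source pattern.
print(resolve_pattern(weights, ["model.attn.qkv"], ["q_proj", "k_proj", "v_proj"]))
```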
So it has to either be the first or the last in a chain of ops? I can see the dequantization ops breaking this as they extend the chain 🥲 Correct me if I'm wrong.
Nope, should be alright with quantization! They do not change the targets/sources!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment:
Sounds good but concerned about TP + Transpose, we need a more abstract solution like updating the shard dim based on op at init of ops?
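As a rough sketch of that idea (the helper below is hypothetical and not part of the PR), the shard dim would simply follow the transposed dims:

```python
def remap_shard_dim(shard_dim: int, dim0: int, dim1: int) -> int:
    # If the sharded dim is one of the transposed dims, it ends up on the other one.
    if shard_dim == dim0:
        return dim1
    if shard_dim == dim1:
        return dim0
    return shard_dim

# A weight sharded on dim 0 that goes through Transpose(0, 1) would need its
# TP plan updated to shard on dim 1 instead.
assert remap_shard_dim(0, 0, 1) == 1
assert remap_shard_dim(2, 0, 1) == 2  # untouched dims keep their shard placement
```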
```python
def __init__(self, dim0: int = 0, dim1: int = 1):
    self.dim0 = dim0
    self.dim1 = dim1
```
should we pass the full converter to change the shard dim at init?
Hmm, but aren't you assuming that we need to change the shard dim based on the transpose? At least the transposes that do use it make it so that they are aligned with other models, i.e. they will need to be sharded the same way as intended.
There might be a few models that shard differently, though I would consider this not part of the conversion op; otherwise we will mix in our assumptions 🤔
It's due to the op order: sharding happens before Transpose, so if you transpose the dim that was sharded on, you've got an issue.
Ah I messed up the order, I thought it was transpose then shard
Life would be too nice... 😆
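To make the ordering issue above concrete, here is a hedged, self-contained sketch (the chunk-based sharding below is only a stand-in for the actual TP machinery): a tensor sharded on dim 0 and then transposed per rank no longer matches what sharding the transposed tensor on dim 0 would give.

```python
import torch

full = torch.arange(24).reshape(4, 6)

# Sharding first (as in the current op order): each of 2 ranks holds a (2, 6) row slice.
local_shard = torch.chunk(full, chunks=2, dim=0)[0]
# The Transpose op then runs on the local shard...
sharded_then_transposed = local_shard.transpose(0, 1)  # shape (6, 2)

# ...but transposing the full tensor first and sharding on dim 0 gives a different slice.
transposed_then_sharded = torch.chunk(full.transpose(0, 1), chunks=2, dim=0)[0]  # shape (3, 4)

print(sharded_then_transposed.shape, transposed_then_sharded.shape)  # shapes no longer agree
```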
Actually I checked and the params on which …
Can we add a comment at least, a TODO/FIXME or similar? This is quite important (although not used with TP atm).
Done. I think it's easier this way, as it's not as straightforward as I thought it would be to always switch the shard dim when required.
vasqu left a comment:
Discussed internally; we will need to double-check many of the ops we already have.
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43307&sha=9a937a
What does this PR do?
As per the title: the Transpose operation needs to be more general so that it can be used in reverse mode in a chain of several operations.
Supersedes #43201 as it needed more work on the Transpose Operation!
Will also help #43227
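As a rough sketch of what "reverse mode in a chain" means here (class and method names below are illustrative, not the PR's actual API), reverting a chain applies each op's inverse in reverse order, and a transpose of the same two dims is its own inverse:

```python
import torch

class Transpose:
    """Illustrative transpose op: swaps two dims on convert, swaps them back on revert."""

    def __init__(self, dim0: int = 0, dim1: int = 1):
        self.dim0, self.dim1 = dim0, dim1

    def convert(self, t: torch.Tensor) -> torch.Tensor:
        return t.transpose(self.dim0, self.dim1).contiguous()

    def revert(self, t: torch.Tensor) -> torch.Tensor:
        # transpose(dim0, dim1) is its own inverse
        return t.transpose(self.dim0, self.dim1).contiguous()

class Chain:
    """Illustrative chain of conversion ops."""

    def __init__(self, ops):
        self.ops = ops

    def convert(self, t: torch.Tensor) -> torch.Tensor:
        for op in self.ops:
            t = op.convert(t)
        return t

    def revert(self, t: torch.Tensor) -> torch.Tensor:
        # Reverse mode: undo each op, walking the chain backwards.
        for op in reversed(self.ops):
            t = op.revert(t)
        return t

x = torch.randn(3, 5)
chain = Chain([Transpose(0, 1)])
assert torch.equal(chain.revert(chain.convert(x)), x)
```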