Conversation
```diff
         )
         return selected_experts, routing_weights.to(hidden_states.dtype)
-
-        return selected_experts, routing_weights.to(hidden_states.dtype)
```
it's interesting that this dead code is not caught by styling
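For context, the pattern is roughly the following (a minimal sketch with illustrative names, not the actual model code); pyflakes-style linters like the ones behind `make style` check style and undefined names, not reachability, so a duplicated return slips through:

```python
import torch

def route_tokens(hidden_states, router_logits, top_k=2):
    # illustrative stand-in for the routing helper
    routing_weights = router_logits.softmax(dim=-1)
    routing_weights, selected_experts = routing_weights.topk(top_k, dim=-1)
    return selected_experts, routing_weights.to(hidden_states.dtype)
    # everything below the first return is unreachable ("dead"),
    # yet the usual style checks do not report it
    return selected_experts, routing_weights.to(hidden_states.dtype)

selected, weights = route_tokens(torch.randn(4, 16), torch.randn(4, 8))
```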
```diff
-            router_logits,
-            jitter_eps=self.router_jitter_noise,
-            training=self.training,
+            router_logits, jitter_eps=self.router_jitter_noise, training=self.training, top_k=self.top_k
```
top_k was not passed and defaulted to 2
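A hedged sketch of the failure mode (stand-in function body; only the argument names come from the diff): when the call site omits `top_k`, the default of 2 silently wins over whatever the config asked for:

```python
import torch

def sparsemixer(router_logits, jitter_eps=0.0, training=False, top_k=2):
    # stand-in routing function: if the caller forgets to pass top_k,
    # the model silently routes through 2 experts per token
    weights = router_logits.softmax(dim=-1)
    top_w, top_i = weights.topk(top_k, dim=-1)
    return top_i, top_w

logits = torch.randn(4, 8)                   # 4 tokens, 8 experts
idx_default, _ = sparsemixer(logits)         # top_k not passed -> always 2
idx_config, _ = sparsemixer(logits, top_k=3) # what the config actually asked for
print(idx_default.shape, idx_config.shape)   # torch.Size([4, 2]) torch.Size([4, 3])
```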
run-slow: hunyuan_v1_moe, phimoe

This comment contains models: ["models/hunyuan_v1_moe", "models/phimoe"]

CI Results: ✅ No failing test specific to this PR 🎉
both models' integration tests pass locally now
```python
        # Phimoe uses nn.LayerNorm
        self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps, elementwise_affine=True)
        self.post_attention_layernorm = nn.LayerNorm(
            config.hidden_size, eps=config.rms_norm_eps, elementwise_affine=True
        )
```
Phimoe uses nn.LayerNorm with bias
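For reference, `nn.LayerNorm` with `elementwise_affine=True` carries both a `weight` and a `bias` and subtracts the mean, while an RMSNorm has only a `weight` and no centering, so substituting one for the other changes both the parameter set and the math. A quick check (uses `nn.RMSNorm`, available in recent PyTorch):

```python
import torch.nn as nn

ln = nn.LayerNorm(16, eps=1e-6, elementwise_affine=True)
print([name for name, _ in ln.named_parameters()])   # ['weight', 'bias']

rms = nn.RMSNorm(16, eps=1e-6)                        # no bias, no mean subtraction
print([name for name, _ in rms.named_parameters()])  # ['weight']
```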
vasqu left a comment:
Thanks for checking another set of models, I guess we have them all fixed with this?
The phimoe one is an interesting case 😅
```python
        expert_ids = top_k_index.reshape(-1)
        token_idx = torch.arange(num_tokens, device=device).unsqueeze(1).expand(-1, num_top_k).reshape(-1)

        # Resolve routing weights per selected sample, allowing top_k_weights to be either:
```
yeah much cleaner!
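To make the trick concrete, here is a tiny standalone example (shape assumed: `top_k_index` is `(num_tokens, num_top_k)`) showing how the two reshapes produce one `(token, expert)` pair per selected slot:

```python
import torch

num_tokens, num_top_k = 3, 2
top_k_index = torch.tensor([[0, 2], [1, 3], [0, 1]])  # (num_tokens, num_top_k)

expert_ids = top_k_index.reshape(-1)  # (S,) where S = num_tokens * num_top_k
token_idx = torch.arange(num_tokens).unsqueeze(1).expand(-1, num_top_k).reshape(-1)

print(list(zip(token_idx.tolist(), expert_ids.tolist())))
# [(0, 0), (0, 2), (1, 1), (1, 3), (2, 0), (2, 1)]
```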
```python
        # Phimoe uses nn.LayerNorm
        self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps, elementwise_affine=True)
        self.post_attention_layernorm = nn.LayerNorm(
            config.hidden_size, eps=config.rms_norm_eps, elementwise_affine=True
        )
```
Oh wow, that's an insane find - this model must have been broken for a long time
yeh ppl must've continued to use the remote code one
```diff
     def test_model_generation(self):
-        # we will compele this when model file change over
-        # pass
```
I have a PR here #43411 which fixes some wrong RoPE init, will this still work with that fix?
I guess so; here I only removed a comment. I might have misunderstood, but the initialization added in that PR is already in the `__init__` of the class, so why is it necessary in `_init_weights` as well?
Apparently it no longer matters what you init in `__init__`; `_init_weights` will overwrite it in any case, meaning that custom logic in `__init__` is not applied. I want to refactor this so that it is no longer the case, or rather so that we depend on a separate init function for RoPE that allows users to do whatever they want as init.
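A toy sketch of the behavior described above (standalone classes, not the transformers internals): once a post-init pass applies `_init_weights` to every module, whatever custom init was done in `__init__` is clobbered:

```python
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        nn.init.constant_(self.proj.weight, 1.0)  # custom init in __init__ ...

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, std=0.02)  # ... overwritten here

    def post_init(self):
        self.apply(self._init_weights)  # mimics the generic post-init pass

model = ToyModel()
model.post_init()
print(model.proj.weight.unique().numel())  # no longer 1: the constant init is gone
```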
```python
                [-3.4844, -2.4688, -1.1719, 0.5703, -0.4902, -0.0942, 0.7773, -0.2539, 0.3223, -1.0234],
                [-0.9805, 0.0811, -0.5273, 2.3438, 0.6914, 3.0781, 0.3164, 0.2197, 0.5312, -2.1094],
```
I guess that's the GPU diff between T4 and A10
ah yes, rerunning with eager experts impl to see if grouped is also contributing to the diff.
hmm, grouped and eager are equivalent in terms of logits here (on A100). Should I revert this change, or maybe use the `Expectations` class?
We should check against A10 runners - let's revert for now and check with run-slow first. If it needs a change, I can update it
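If the values really do differ per GPU, one option is keying the expected logits by hardware rather than hard-coding a single tensor; a sketch assuming the `Expectations` helper from `transformers.testing_utils` (keys are, to my understanding, `(device_type, major compute capability)` pairs, and the numbers below are placeholders):

```python
import torch
from transformers.testing_utils import Expectations

expected_logits = Expectations(
    {
        # placeholder values, one entry per runner generation
        ("cuda", 7): torch.tensor([-3.4844, -2.4688, -1.1719]),  # e.g. T4 (sm_75)
        ("cuda", 8): torch.tensor([-3.4531, -2.4375, -1.1562]),  # e.g. A10/A100
    }
)
expected = expected_logits.get_expectation()  # picks the entry for the current device
```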
run-slow: hunyuan_v1_moe, phimoe

This comment contains models: ["models/hunyuan_v1_moe", "models/phimoe"]

CI Results: ✅ No failing test specific to this PR 🎉
the CI is struggling with model loading again 😭, even though …
[For maintainers] Suggested jobs to run (before merge): run-slow: hunyuan_v1_moe, phimoe
```python
        # Reshape for easier indexing
        # S is the number of selected tokens-experts pairs (S = num_tokens * num_top_k)
        sample_weights = sample_weights.reshape(-1, 1)  # (S, 1)
```
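My reading of the truncated "either" comment above, stated as an assumption rather than fact: `top_k_weights` may come in as the full `(num_tokens, num_experts)` routing matrix or as the already-selected `(num_tokens, num_top_k)` weights, and both reduce to one weight per selected pair:

```python
import torch

num_tokens, num_experts, num_top_k = 3, 4, 2
top_k_index = torch.tensor([[0, 2], [1, 3], [0, 1]])
token_idx = torch.arange(num_tokens).unsqueeze(1).expand(-1, num_top_k).reshape(-1)
expert_ids = top_k_index.reshape(-1)

full = torch.rand(num_tokens, num_experts)  # case 1: full routing matrix
selected = full.gather(1, top_k_index)      # case 2: pre-selected weights

# both resolve to one weight per (token, expert) pair, shape (S, 1)
w_from_full = full[token_idx, expert_ids].reshape(-1, 1)
w_from_selected = selected.reshape(-1, 1)
print(torch.allclose(w_from_full, w_from_selected))  # True
```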
What does this PR do?
Same as #43288
Also fixes phimoe and its integration tests
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.