tests: Fix flaky test for NLLB-MoE #22880
Conversation
The documentation is not available anymore as the PR was closed or merged.
cc @ydshieh
Thanks for fixing this, for doing such a deep dive into the code, and for taking the time to write such detailed explanations here and on the issue 🙏 ❤️
The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that blah blah blah blah.
In SwitchTransformers, once the masks are computed for each expert, we just index the current hidden_states with the routing mask, and feed the correct tokens to the expert. However here, the implementation varies a lot as the fairseq repository used a different approach.
The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that for each input, only the top two experts are selected based on the …
👀
Thanks a lot for this 😅
Thanks for tackling this 💪🏻
The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that for each input, only the top two experts are selected based on the highest predicted probabilities from the gating network, and the remaining experts are ignored. In SwitchTransformers, once the masks are computed for each expert, we just index the current hidden_states with the routing mask, and feed the correct tokens to the expert. However here, the implementation varies a lot as the fairseq repository used a different approach.
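The top-2 gating described here can be sketched roughly as follows (a numpy illustration; `top_2_gate` and its shapes are hypothetical and not the actual NLLB-MoE code, which operates on batched hidden states and adds load-balancing terms):

```python
import numpy as np

def top_2_gate(logits):
    """Sketch of a top-2 gate: for each token, keep only the two experts
    with the highest gating probabilities (illustrative only)."""
    # softmax over the expert dimension
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # indices of the two highest-probability experts per token
    top2 = np.argsort(probs, axis=-1)[:, -2:][:, ::-1]
    # zero out every expert that was not selected
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top2, 1.0, axis=-1)
    return probs * mask, top2

# two tokens, four experts
logits = np.array([[2.0, 0.5, 1.0, -1.0],
                   [0.1, 3.0, 0.2, 2.5]])
gated, top2 = top_2_gate(logits)
```

Each token ends up with exactly two non-zero expert weights; everything else is masked out before the tokens are dispatched.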
highest predicted probabilities from the gating network, and the remaining experts are ignored. In `SwitchTransformers`, only the top-1 probabilities are computed, which means that tokens have less probability of being forwarded. Moreover, if a token is not routed to any expert, `SwitchTransformers` still adds its unmodified hidden states (kind of like a residual connection) while they are masked in `NLLB`'s top-2 routing mechanism.
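The pass-through-vs-masking difference mentioned here can be illustrated with a small numpy sketch (the function names and the doubling `dummy_expert` are hypothetical, for illustration only):

```python
import numpy as np

def dummy_expert(x):
    # stand-in for an expert FFN (hypothetical)
    return 2.0 * x

def switch_style_route(hidden, routed):
    # SwitchTransformers-style: tokens no expert takes keep their
    # unmodified hidden states (residual-like pass-through)
    out = hidden.copy()
    out[routed] = dummy_expert(hidden[routed])
    return out

def nllb_style_route(hidden, routed):
    # NLLB-MoE-style: unrouted tokens are masked out (zeroed)
    # instead of passing through unchanged
    out = np.zeros_like(hidden)
    out[routed] = dummy_expert(hidden[routed])
    return out

hidden = np.array([[1.0, 1.0], [3.0, 3.0]])
routed = np.array([True, False])  # second token dropped by the router
switch_out = switch_style_route(hidden, routed)
nllb_out = nllb_style_route(hidden, routed)
```

The dropped token survives unchanged in the Switch-style path but comes out as zeros in the NLLB-style path, which is exactly the behavioural gap the comment describes.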
Happy to!
5e25bd3 to 0834d2a
* add test update and docs edits
* docs edit suggestion
What does this PR do?
Fixes #22464 (and added some light docs edits I happened to notice)
From my comment in the issue:
Looked into this and I think the flakiness is caused by the natural variability in the sparse MoE layers. Specifically, when they calculate which experts to use in the gating logic, they compute slightly different probabilities for the two sets of inputs being compared: one with the prior inputs concatenated with the past key values and one with just the past key values.
The test usually passes because the magnitude of the difference is usually small. Notably, when the vocab size is increased, the pass rate goes up (and vice versa), since the increased representational capacity can help the model make more accurate decisions about which experts to use for each input. For example, increasing the vocab size in the config from its current 99 to 999 increases the pass rate from ~80% to ~95%.
I think this flakiness is inherent in the sparse layers, but if I understand right, the point of the test is to check that the decoder uses the past properly, so I edited the test to use dense layers and moved the rtol down to 1e-3 to be in line with the other models' version of this check. Wrote a loop to run the test 1000 times and it passed every time.
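The shape of that repeated check can be sketched like this (a numpy stand-in, not the actual test; `decoder_step` and the `1e-5` noise scale are hypothetical models of the tiny numerical differences between the with-past and without-past code paths once the layers are dense):

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_step(x, noise_scale):
    # hypothetical stand-in: the "with past" path differs from the
    # reference only by small numerical noise when layers are dense
    return x + rng.normal(scale=noise_scale, size=x.shape)

# repeat the comparison many times, like the loop described above
passes = 0
for _ in range(1000):
    reference = rng.normal(size=8)          # "no past" output
    candidate = decoder_step(reference, noise_scale=1e-5)  # "with past" output
    if np.allclose(reference, candidate, rtol=1e-3, atol=1e-3):
        passes += 1
```

With dense layers the discrepancy stays orders of magnitude below the 1e-3 tolerance, so every repetition passes; with sparse routing, an occasional expert flip can push the difference past the tolerance, which is the flakiness being fixed.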
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker, @amyeroberts