
tests: Fix flaky test for NLLB-MoE #22880

Merged

Conversation

connor-henderson
Contributor

What does this PR do?

Fixes #22464 (and adds some light docs edits I happened to notice).

From my comment in the issue:
Looked into this, and I think the flakiness is caused by the natural variability of the sparse MoE layers. Specifically, when the gating logic decides which experts to route each token to, it computes slightly different probabilities for the two inputs the test compares: one where the new tokens are concatenated with the prior inputs and run without a cache, and one where only the new tokens are passed along with the cached past key values.

The test usually passes because the magnitude of the difference is usually small. Notably, when the vocab size is increased this pass rate goes up (and vice versa), since the increased representational capacity can help the model make more accurate decisions about which experts to use for each input. For example, increasing the vocab size in the config from its current 99 to 999 increases the pass rate from ~80% to ~95%.

I think this flakiness is inherent to the sparse layers, but if I understand correctly the point of the test is to check that the decoder uses the past key values properly, so I edited the test to use dense layers and moved the rtol down to 1e-3 to be in line with the other models' version of this check. I wrote a loop to run the test 1000 times and every run passed.
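
(For anyone skimming, here is a rough, hypothetical sketch of the kind of with-cache vs. without-cache check this test performs; it is not the actual transformers test, and the decoder interface, variable names, and tolerances are placeholders.)

```python
import torch

def check_decoder_uses_past(decoder, input_ids, next_tokens, rtol=1e-3, atol=1e-3):
    # Reference: run the whole concatenated sequence in a single forward pass, no cache.
    full_ids = torch.cat([input_ids, next_tokens], dim=-1)
    full_out = decoder(full_ids, use_cache=False).last_hidden_state

    # Incremental: cache the prefix, then feed only the new tokens plus the cache.
    prefix = decoder(input_ids, use_cache=True)
    step_out = decoder(next_tokens, past_key_values=prefix.past_key_values).last_hidden_state

    # Only the positions of the new tokens are comparable between the two runs.
    reference = full_out[:, -next_tokens.shape[-1]:, :]
    assert torch.allclose(reference, step_out, rtol=rtol, atol=atol)
```

With sparse MoE layers the two runs can occasionally route a token to different experts, so the outputs drift apart; with dense layers the comparison only has to absorb ordinary floating-point noise.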

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker, @amyeroberts

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@connor-henderson connor-henderson marked this pull request as ready for review April 20, 2023 04:28
@amyeroberts
Collaborator

cc @ydshieh

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for fixing this, for doing such a deep dive into the code, and for taking the time to write such detailed explanations here and on the issue 🙏 ❤️

The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that blah blah blah blah.
In SwitchTransformers, once the masks are computed for each experts, we just index the current hidden_states with the routing mask, and feed the
correct tokens to the expert. However here, the implementation varies a lot as the fairseq repository used a different approach.
The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that for each input, only the top two experts are selected based on the
Collaborator


👀

Collaborator


Thanks a lot for this 😅

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for tackling this 💪🏻


Comment on lines 47 to 49
highest predicted probabilities from the gating network, and the remaining experts are ignored. In SwitchTransformers, once the masks are computed for each expert,
we just index the current hidden_states with the routing mask, and feed the correct tokens to the expert. However here, the implementation varies a lot as the fairseq
repository used a different approach.
Collaborator


Suggested change
- highest predicted probabilities from the gating network, and the remaining experts are ignored. In SwitchTransformers, once the masks are computed for each expert,
- we just index the current hidden_states with the routing mask, and feed the correct tokens to the expert. However here, the implementation varies a lot as the fairseq
- repository used a different approach.
+ highest predicted probabilities from the gating network, and the remaining experts are ignored. In `SwitchTransformers`, only the top-1 probabilities are computed, which means that tokens have a lower probability of being forwarded. Moreover, if a token is not routed to any expert, `SwitchTransformers` still adds its unmodified hidden states (kind of like a residual connection), while they are masked in `NLLB`'s top-2 routing mechanism.
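
(As an aside, a tiny illustrative sketch of the routing idea described above; this is not the NLLB-MoE or SwitchTransformers code, and the function and variable names are made up. `k=2` keeps the top two experts per token as in NLLB-MoE-style routing, while `k=1` corresponds to SwitchTransformers-style top-1 routing.)

```python
import torch

def top_k_route(hidden_states, gate_weight, k=2):
    # hidden_states: (num_tokens, hidden_dim); gate_weight: (hidden_dim, num_experts)
    router_logits = hidden_states @ gate_weight          # score every expert per token
    router_probs = torch.softmax(router_logits, dim=-1)
    top_probs, top_expert_ids = router_probs.topk(k, dim=-1)
    # Each token is sent only to the experts in top_expert_ids, weighted by top_probs;
    # all other experts are ignored for that token.
    return top_probs, top_expert_ids
```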

@connor-henderson
Contributor Author

Happy to!

@amyeroberts amyeroberts merged commit b950c38 into huggingface:main Apr 21, 2023
@connor-henderson connor-henderson deleted the fix-flaky-nllb_moe-test branch April 28, 2023 19:58
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* add test update and docs edits

* docs edit suggestion
Development

Successfully merging this pull request may close these issues.

Flaky test for NLLB-MoE Model
4 participants