tests: Fix flaky test for NLLB-MoE #22880
Conversation
The documentation is not available anymore as the PR was closed or merged.
cc @ydshieh
Thanks for fixing this, for doing such a deep dive into the code, and for taking the time to write such detailed explanations here and on the issue 🙏 ❤️
The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that blah blah blah blah.
In SwitchTransformers, once the masks are computed for each expert, we just index the current hidden_states with the routing mask, and feed the correct tokens to the expert. However here, the implementation varies a lot as the fairseq repository used a different approach.
The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that for each input, only the top two experts are selected based on the …
👀
Thanks a lot for this 😅
Thanks for tackling this 💪🏻
The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that for each input, only the top two experts are selected based on the highest predicted probabilities from the gating network, and the remaining experts are ignored. In SwitchTransformers, once the masks are computed for each expert, we just index the current hidden_states with the routing mask, and feed the correct tokens to the expert. However here, the implementation varies a lot as the fairseq repository used a different approach.
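The top-2 gating described here can be sketched roughly as follows (a numpy illustration; `top_2_gate` and its shapes are hypothetical and not the actual NLLB-MoE code, which operates on batched hidden states and adds load-balancing terms):

```python
import numpy as np

def top_2_gate(logits):
    """Sketch of a top-2 gate: for each token, keep only the two experts
    with the highest gating probabilities (illustrative only)."""
    # softmax over the expert dimension
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # indices of the two highest-probability experts per token
    top2 = np.argsort(probs, axis=-1)[:, -2:][:, ::-1]
    # zero out every expert that was not selected
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top2, 1.0, axis=-1)
    return probs * mask, top2

# two tokens, four experts
logits = np.array([[2.0, 0.5, 1.0, -1.0],
                   [0.1, 3.0, 0.2, 2.5]])
gated, top2 = top_2_gate(logits)
```

Each token ends up with exactly two non-zero expert weights; everything else is masked out before the tokens are dispatched.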
highest predicted probabilities from the gating network, and the remaining experts are ignored. In `SwitchTransformers`, only the top-1 probabilities are computed, which means that tokens have less probability of being forwarded. Moreover, if a token is not routed to any expert, `SwitchTransformers` still adds its unmodified hidden states (kind of like a residual connection) while they are masked in `NLLB`'s top-2 routing mechanism.
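The pass-through-vs-masking difference mentioned here can be illustrated with a small numpy sketch (the function names and the doubling `dummy_expert` are hypothetical, for illustration only):

```python
import numpy as np

def dummy_expert(x):
    # stand-in for an expert FFN (hypothetical)
    return 2.0 * x

def switch_style_route(hidden, routed):
    # SwitchTransformers-style: tokens no expert takes keep their
    # unmodified hidden states (residual-like pass-through)
    out = hidden.copy()
    out[routed] = dummy_expert(hidden[routed])
    return out

def nllb_style_route(hidden, routed):
    # NLLB-MoE-style: unrouted tokens are masked out (zeroed)
    # instead of passing through unchanged
    out = np.zeros_like(hidden)
    out[routed] = dummy_expert(hidden[routed])
    return out

hidden = np.array([[1.0, 1.0], [3.0, 3.0]])
routed = np.array([True, False])  # second token dropped by the router
switch_out = switch_style_route(hidden, routed)
nllb_out = nllb_style_route(hidden, routed)
```

The dropped token survives unchanged in the Switch-style path but comes out as zeros in the NLLB-style path, which is exactly the behavioural gap the comment describes.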
Happy to!
5e25bd3 to 0834d2a
* add test update and docs edits
* docs edit suggestion
What does this PR do?
Fixes #22464 (and added some light docs edits I happened to notice)
From my comment in the issue:
Looked into this and I think the flakiness is caused by the natural variability in the sparse MoE layers. Specifically, when they calculate which experts to use in the gating logic, they compute slightly different probabilities for the two sets of inputs being compared: one with the prior inputs concatenated with the past key values and one with just the past key values.
The test usually passes because the magnitude of the difference is usually small. Notably, when the vocab size is increased, the pass rate goes up (and vice versa), since the increased representational capacity can help the model make more accurate decisions about which experts to use for each input. For example, increasing the vocab size in the config from its current 99 to 999 increases the pass rate from ~80% to ~95%.
I think this flakiness is inherent in the sparse layers, but if I understand right, the point of the test is to check that the decoder uses the past properly, so I edited the test to use dense layers and moved the rtol down to 1e-3 to be in line with the other models' version of this check. Wrote a loop to run the test 1000 times and it passed every time.
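The shape of that repeated check can be sketched like this (a numpy stand-in, not the actual test; `decoder_step` and the `1e-5` noise scale are hypothetical models of the tiny numerical differences between the with-past and without-past code paths once the layers are dense):

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_step(x, noise_scale):
    # hypothetical stand-in: the "with past" path differs from the
    # reference only by small numerical noise when layers are dense
    return x + rng.normal(scale=noise_scale, size=x.shape)

# repeat the comparison many times, like the loop described above
passes = 0
for _ in range(1000):
    reference = rng.normal(size=8)          # "no past" output
    candidate = decoder_step(reference, noise_scale=1e-5)  # "with past" output
    if np.allclose(reference, candidate, rtol=1e-3, atol=1e-3):
        passes += 1
```

With dense layers the discrepancy stays orders of magnitude below the 1e-3 tolerance, so every repetition passes; with sparse routing, an occasional expert flip can push the difference past the tolerance, which is the flakiness being fixed.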
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker, @amyeroberts