🚨🚨🚨 [NLLB Tokenizer] Fix the prefix tokens 🚨🚨🚨 #22313
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you for the implementation, @ArthurZucker! Could you please put in the PR description the gist of the breaking change with a code sample, and how to revert to the previous behavior if users would like that? Thank you
As @LysandreJik said, we would need a way to enable the old behavior for users who rely on it.
Indeed, thanks for the tip on how to enable that swiftly!
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Perfect, thanks!
One test is failing with NLLB (running the slow ones locally).
docs/source/en/model_doc/nllb.mdx
**DISCLAIMER:** The default behaviour for the tokenizer has recently been fixed (and thus changed)!

The previous version adds `[self.eos_token_id, self.cur_lang_code]` at the end of the token sequence for both target and source tokenization. This is wrong, as the NLLB paper mentions (page 48, 6.1.1. Model Architecture):

> Note that we prefix the source sequence with the source language, as opposed to the target
> language as previously done in several works (Arivazhagan et al., 2019; Johnson et al.,
> 2017). This is primarily because we prioritize optimizing zero-shot performance of our
> model on any pair of 200 languages at a minor cost to supervised performance.

Previous behaviour:
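A minimal sketch of what the old encoding produced (the layout is described in comments; exact token IDs are not reproduced here):

```python
from transformers import NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Old (incorrect) layout: the language code was appended *after* </s>,
# i.e. input_ids ended with [..., eos_token_id, cur_lang_code],
# decoding roughly to 'How was your day?</s>eng_Latn'
print(tokenizer("How was your day?").input_ids)
```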
Needs proofing :-)
Thank you @ArthurZucker!
Co-authored-by: Lysandre Debut <hi@lysand.re>
Thanks both for proofreading! 👍🏻
* fix the prefix tokens
* update fast and test values
* add legacy behaviour
* update disclaimer, link issue/PR and behavioural changes
* Apply suggestions from code review
* styling
* make a quote
* quote this time

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
What does this PR do?
The NLLB tokenizer's suffix and prefix tokens were wrong with respect to the paper.
This breaking change fixes the tokenizer.
It could perhaps be made non-breaking by adding these tokens to the configuration file, but it is a required change either way.
The tests have to be updated, but this should be good.
The big problem was the `prefix` and `suffix` tokens. The previous version adds `[self.eos_token_id, self.cur_lang_code]` at the end of the token sequence for both target and source tokenization. This is wrong, as the NLLB paper mentions (page 48, 6.1.1. Model Architecture).

Previous behaviour:
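A hedged sketch of the old behaviour (layout shown in comments rather than exact IDs):

```python
from transformers import NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
inputs = tokenizer("How was your day?")
# Previous behaviour: the language code trailed the end-of-sentence token,
# input_ids = [ ...text tokens..., eos_token_id, cur_lang_code ]
# i.e. roughly 'How was your day?</s>eng_Latn'
print(inputs.input_ids)
```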
New behaviour:
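And the fixed encoding, where the language code now prefixes the source sequence as the paper describes (same caveat on the comments):

```python
from transformers import NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
inputs = tokenizer("How was your day?")
# New behaviour: the language code leads and </s> closes the sequence,
# input_ids = [ cur_lang_code, ...text tokens..., eos_token_id ]
# i.e. roughly 'eng_Latn How was your day?</s>'
print(inputs.input_ids)
```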
Enabling the old behaviour:
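A sketch assuming the switch is the `legacy_behaviour` flag this PR adds:

```python
from transformers import NllbTokenizer

# Opt back into the old (pre-fix) token layout:
tokenizer = NllbTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", legacy_behaviour=True
)
# Saving should persist the flag in tokenizer_config.json, so that
# from_pretrained restores the legacy layout automatically.
tokenizer.save_pretrained("nllb-legacy-tokenizer")
```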
This parameter should be part of the `tokenizer_config.json`.