🚨🚨🚨 [NLLB Tokenizer] Fix the prefix tokens 🚨🚨🚨 #22313

Merged
merged 10 commits into huggingface:main on Apr 4, 2023

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Mar 22, 2023

What does this PR do?

The NLLB tokenizer's suffix and prefix tokens were wrong with respect to the paper.
This breaking change fixes the tokenizer.
It could be made non-breaking by adding these tokens to the configuration file, but the change itself is required.
The tests still have to be updated, but otherwise this should be good.

The big problem was the prefix and suffix tokens.
The previous version appends [self.eos_token_id, self.cur_lang_code] at the end of the token sequence for both source and target tokenization. This is wrong, as the NLLB paper states (page 48, 6.1.1. Model Architecture):

Note that we prefix the source sequence with the source language, as opposed to the target
language as previously done in several works (Arivazhagan et al., 2019; Johnson et al.,
2017). This is primarily because we prioritize optimizing zero-shot performance of our
model on any pair of 200 languages at a minor cost to supervised performance.

Previous behaviour:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]
>>> # 2: '</s>'
>>> # 256047 : 'eng_Latn'

New behaviour:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[256047, 13374, 1398, 4260, 4039, 248130, 2]
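
With the fix, 256047 ('eng_Latn') now comes first and 2 ('</s>') last. The same change applies to the target side. As a minimal sketch (assuming the text_target / tgt_lang arguments and the fra_Latn language code; the French sentence is only illustrative), the labels should now also start with the language code and end with '</s>':

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn")
>>> labels = tokenizer("How was your day?", text_target="Comment s'est passée ta journée ?")["labels"]
>>> labels[0] == tokenizer.convert_tokens_to_ids("fra_Latn")  # target language code comes first
True
>>> labels[-1] == tokenizer.eos_token_id  # '</s>' comes last
True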

Enabling the old behaviour:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)

This parameter should be part of the tokenizer_config.json.
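
As a quick sanity check (assuming legacy_behaviour simply restores the old suffix placement), the flag reproduces the previous ids:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]
>>> # '</s>' (2) and 'eng_Latn' (256047) are appended at the end again, as in the previous behaviour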

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 22, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker marked this pull request as ready for review March 31, 2023 13:54
@ArthurZucker ArthurZucker changed the title from "[NLLB Tokenizer] Fix the prefix tokens" to "🚨🚨🚨 [NLLB Tokenizer] Fix the prefix tokens 🚨🚨🚨" on Mar 31, 2023
@LysandreJik
Member

Thank you for the implementation, @ArthurZucker!

Could you please put in the PR description the gist of the breaking change with a code sample, and how to revert to the previous behavior if users would like that?

Thank you

Collaborator

@sgugger sgugger left a comment

As @LysandreJik said, we would need a way to enable the old behavior for users who rely on it.

@ArthurZucker
Collaborator Author

Indeed, thanks for the tip on how to enable that swiftly!

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Collaborator

@sgugger sgugger left a comment

Perfect, thanks!

@ArthurZucker
Collaborator Author

ArthurZucker commented Apr 3, 2023

One test is failing with NLLB (running the slow tests locally: test_encode_decode_with_spaces); fixing this before merging.
Edit: the fast and slow tokenizers have different behaviour! spaces_between_special_tokens does not exist on the Rust side (yet; PR coming soon).
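
A minimal sketch of the mismatch (assuming the slow tokenizer's decode accepts spaces_between_special_tokens while the Rust-backed fast tokenizer silently ignores it; the exact decoded strings are checkpoint-dependent and omitted here):

>>> from transformers import NllbTokenizer, NllbTokenizerFast
>>> slow = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> fast = NllbTokenizerFast.from_pretrained("facebook/nllb-200-distilled-600M")
>>> ids = slow("How was your day?").input_ids
>>> slow_text = slow.decode(ids, spaces_between_special_tokens=False)  # the slow tokenizer honours the flag
>>> fast_text = fast.decode(ids, spaces_between_special_tokens=False)  # the fast tokenizer ignores it for now
>>> # slow_text and fast_text can therefore differ around the special tokens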

@LysandreJik
Member

Cool, I like the flag :)

Can the doc be shown more prominently? Maybe replace the disclaimer that mentions tagging me with a disclaimer explaining that the behaviour was changed to what it is now, along with the code snippet?

[screenshot of the current disclaimer in the NLLB docs]

Comment on lines 15 to 24
**DISCLAIMER:** The default behaviour for the tokenizer has recently been fixed (and thus changed)!

The previous version adds [self.eos_token_id, self.cur_lang_code] at the end of the token sequence for both target and source tokenization. This is wrong as the NLLB paper mentions (page 48, 6.1.1. Model Architecture) :

Note that we prefix the source sequence with the source language, as opposed to the target
language as previously done in several works (Arivazhagan et al., 2019; Johnson et al.,
2017). This is primarily because we prioritize optimizing zero-shot performance of our
model on any pair of 200 languages at a minor cost to supervised performance.

Previous behaviour:
Collaborator

Needs proofing :-)

Member

@LysandreJik LysandreJik left a comment

Thank you @ArthurZucker!

4 review comments on docs/source/en/model_doc/nllb.mdx (outdated, resolved)
ArthurZucker and others added 2 commits April 4, 2023 11:11
Co-authored-by: Lysandre Debut <hi@lysand.re>
@ArthurZucker
Collaborator Author

Thanks both for proofreading! 👍🏻

@ArthurZucker ArthurZucker merged commit 00b5887 into huggingface:main Apr 4, 2023
22 checks passed
raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023
* fix the prefix tokens

* update fast and test values

* add legacy behaviour

Co-authored-by: sgugger <sylvain.gugger@gmail.com>

* update disclaimer, linkissue PR and behaviral changes

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <hi@lysand.re>

* styling

* make a quote

* quote this time

---------

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* fix the prefix tokens

* update fast and test values

* add legacy behaviour

Co-authored-by: sgugger <sylvain.gugger@gmail.com>

* update disclaimer, linkissue PR and behaviral changes

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <hi@lysand.re>

* styling

* make a quote

* quote this time

---------

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>

Successfully merging this pull request may close these issues.

NllbTokenizer/NllbTokenizerFast inserts language code incorrectly when tokenizing target text
4 participants