🚨🚨🚨 [NLLB Tokenizer] Fix the prefix tokens 🚨🚨🚨 #22313

Merged
merged 10 commits into huggingface:main on Apr 4, 2023

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Mar 22, 2023

What does this PR do?

The NLLB tokenizer's suffix and prefix tokens were wrong with respect to the paper.
This breaking change fixes the tokenizer.
It could be made non-breaking by adding these tokens to the configuration file, but the change itself is required.
The tests still have to be updated, but otherwise this should be good.

The big problem was the prefix and suffix tokens.
The previous version appends [self.eos_token_id, self.cur_lang_code] at the end of the token sequence for both source and target tokenization. This is wrong, as the NLLB paper states (page 48, 6.1.1. Model Architecture):

Note that we prefix the source sequence with the source language, as opposed to the target
language as previously done in several works (Arivazhagan et al., 2019; Johnson et al.,
2017). This is primarily because we prioritize optimizing zero-shot performance of our
model on any pair of 200 languages at a minor cost to supervised performance.

Previous behaviour:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]
>>> # 2: '</s>'
>>> # 256047 : 'eng_Latn'

New behaviour:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[256047, 13374, 1398, 4260, 4039, 248130, 2]
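
With the fix, 256047 ('eng_Latn') now comes first and 2 ('</s>') last. The same change applies to the target side. As a minimal sketch (assuming the text_target / tgt_lang arguments and the fra_Latn language code; the French sentence is only illustrative), the labels should now also start with the language code and end with '</s>':

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn")
>>> labels = tokenizer("How was your day?", text_target="Comment s'est passée ta journée ?")["labels"]
>>> labels[0] == tokenizer.convert_tokens_to_ids("fra_Latn")  # target language code comes first
True
>>> labels[-1] == tokenizer.eos_token_id  # '</s>' comes last
True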

Enabling the old behaviour:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)

This parameter should be part of the tokenizer_config.json.
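
As a quick sanity check (assuming legacy_behaviour simply restores the old suffix placement), the flag reproduces the previous ids:

>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]
>>> # '</s>' (2) and 'eng_Latn' (256047) are appended at the end again, as in the previous behaviour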

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 22, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker marked this pull request as ready for review March 31, 2023 13:54
@ArthurZucker ArthurZucker changed the title from "[NLLB Tokenizer] Fix the prefix tokens" to "🚨🚨🚨 [NLLB Tokenizer] Fix the prefix tokens 🚨🚨🚨" on Mar 31, 2023
@LysandreJik
Member

Thank you for the implementation, @ArthurZucker!

Could you please put in the PR description the gist of the breaking change with a code sample, and how to revert to the previous behavior if users would like that?

Thank you

Collaborator

@sgugger sgugger left a comment

As @LysandreJik said, we would need a way to enable the old behavior for users who rely on it.

@ArthurZucker
Collaborator Author

Indeed, thanks for the tip on how to enable that swiftly!

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Collaborator

@sgugger sgugger left a comment

Perfect, thanks!

@ArthurZucker
Collaborator Author

ArthurZucker commented Apr 3, 2023

One test is failing with NLLB (running the slow tests locally: test_encode_decode_with_spaces); fixing this before merging.
Edit: the fast and slow tokenizers have different behaviour! spaces_between_special_tokens does not exist on the Rust side (yet; PR coming soon).
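
A minimal sketch of the mismatch (assuming the slow tokenizer's decode accepts spaces_between_special_tokens while the Rust-backed fast tokenizer silently ignores it; the exact decoded strings are checkpoint-dependent and omitted here):

>>> from transformers import NllbTokenizer, NllbTokenizerFast
>>> slow = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> fast = NllbTokenizerFast.from_pretrained("facebook/nllb-200-distilled-600M")
>>> ids = slow("How was your day?").input_ids
>>> slow_text = slow.decode(ids, spaces_between_special_tokens=False)  # the slow tokenizer honours the flag
>>> fast_text = fast.decode(ids, spaces_between_special_tokens=False)  # the fast tokenizer ignores it for now
>>> # slow_text and fast_text can therefore differ around the special tokens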

@LysandreJik
Member

Cool, I like the flag :)

Can the doc be shown more prominently? Maybe replace the disclaimer that mentions tagging me with a disclaimer explaining that the behaviour was changed to what it is now, along with the code snippet?

[screenshot of the current disclaimer in the NLLB docs]

Comment on lines 15 to 24
**DISCLAIMER:** The default behaviour for the tokenizer has recently been fixed (and thus changed)!

The previous version adds [self.eos_token_id, self.cur_lang_code] at the end of the token sequence for both target and source tokenization. This is wrong as the NLLB paper mentions (page 48, 6.1.1. Model Architecture) :

Note that we prefix the source sequence with the source language, as opposed to the target
language as previously done in several works (Arivazhagan et al., 2019; Johnson et al.,
2017). This is primarily because we prioritize optimizing zero-shot performance of our
model on any pair of 200 languages at a minor cost to supervised performance.

Previous behaviour:
Collaborator

Needs proofing :-)

Member

@LysandreJik LysandreJik left a comment

Thank you @ArthurZucker!

4 review comments on docs/source/en/model_doc/nllb.mdx (outdated, resolved)
ArthurZucker and others added 2 commits April 4, 2023 11:11
Co-authored-by: Lysandre Debut <hi@lysand.re>
@ArthurZucker
Collaborator Author

Thanks both for proofreading! 👍🏻

@ArthurZucker ArthurZucker merged commit 00b5887 into huggingface:main Apr 4, 2023
22 checks passed
raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023
* fix the prefix tokens

* update fast and test values

* add legacy behaviour

Co-authored-by: sgugger <sylvain.gugger@gmail.com>

* update disclaimer, linkissue PR and behaviral changes

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <hi@lysand.re>

* styling

* make a quote

* quote this time

---------

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* fix the prefix tokens

* update fast and test values

* add legacy behaviour

Co-authored-by: sgugger <sylvain.gugger@gmail.com>

* update disclaimer, linkissue PR and behaviral changes

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <hi@lysand.re>

* styling

* make a quote

* quote this time

---------

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>

Successfully merging this pull request may close these issues.

NllbTokenizer/NllbTokenizerFast inserts language code incorrectly when tokenizing target text
4 participants