fix: skip `clean_up_tokenization` for BPE tokenizers in `PreTrainedTokenizerFast` (#44915)
maxsloef-goodfire wants to merge 6 commits into huggingface:main

Conversation
clean_up_tokenization applies BERT-era string replacements (` .` → `.`, ` !` → `!`, etc.) that are destructive for BPE tokenizers where spaces are encoded as part of tokens. This adds a guard that skips the cleanup when the backend model is BPE and emits a warning_once suggesting the user set clean_up_tokenization_spaces=False. Fixes huggingface#35175
Verifies that BPE tokenizers preserve spaces before punctuation even when clean_up_tokenization_spaces=True.
clean_up_tokenization is always skipped for BPE tokenizers, even when explicitly requested, because the cleanup is fundamentally wrong for BPE (it strips legitimate spaces that are part of the token encoding). Users who need those string replacements can call clean_up_tokenization() directly. Updated test_tokenization_utils.py to expect preserved spacing for GPT-2 (BPE). Added test in test_tokenization_fast.py verifying the guard works with an explicit True parameter.
Move test_bpe_tokenizer_skips_clean_up_tokenization_spaces to PreTrainedTokenizationFastTest (which has bytelevel_bpe_model_name). Update test_clean_up_tokenization_spaces to use normal text without artificial WordPiece artifacts — BPE roundtrip preserves originals.
ByteLevel BPE tokenizers prepend a space during encoding. Use " Hello world." so the roundtrip matches exactly.
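To illustrate why the space survives the roundtrip, here is a self-contained sketch of the GPT-2-style byte-to-unicode table that ByteLevel BPE uses (reimplemented for illustration, not imported from the library): every raw byte gets a printable stand-in character, so the space byte becomes a visible `Ġ` that is literally part of the token string.

```python
def bytes_to_unicode():
    """GPT-2-style byte-to-unicode table (reimplemented for illustration)."""
    # Printable bytes map to themselves; the rest are shifted into the
    # 256+ range so every byte has a visible stand-in character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
# The space byte (0x20) maps to "Ġ", so a leading space is part of the token:
print(byte_encoder[ord(" ")])                                       # Ġ
print("".join(byte_encoder[b] for b in " Hello".encode("utf-8")))   # ĠHello
```

Because the space is encoded inside the token itself, decoding reconstructs `" Hello world."` exactly, which is why the test input needs the leading space to match.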
ArthurZucker left a comment
Hey! Ty for the PR!
This was actually quite a rollercoaster. (#42898)
We decided to deprecate the flag in #31938. Then we had to introduce it back in #43426.
At this point (again, because we cannot break the models already uploaded to the Hub) we have 2 choices:
- 🔴 this PR as a breaking change to enforce that decoding is 1-1. This would literally affect ALL BPE models in the current state and does not allow an opt-out. I think this would break `gpt2`, which is also BPE and has had this behavior since a long, long time ago.
- We just document this better, pin this issue, idk, but I don't think there's much to do here.
This has been around for a while; the main issue for me is that for Llama the original repo indeed does not clean it up.
I don't mind trying to fix this for Llama 3 specifically! But in this state it's absolutely breaking.
```python
if type(self.backend_tokenizer.model).__name__ == "BPE":
    logger.warning_once(
        "Ignoring clean_up_tokenization_spaces=True for BPE tokenizer"
        f" {self.__class__.__name__}. The clean_up_tokenization post-processing"
        " step is designed for WordPiece tokenizers and is destructive for BPE"
        " (it strips spaces before punctuation). Set"
        " clean_up_tokenization_spaces=False to suppress this warning."
    )
```
I don't think this is something we can do... it's too breaking for anyone who relies on this behavior.
It's a very big breaking change.
Add a `clean_up_tokenization_spaces_even_though_its_wrong_for_bpe` flag so users who rely on the old behavior can opt back in. The warning message now mentions this flag. Added a test for the override path.
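A minimal sketch of the guard plus the opt-out flag described above. The flag name comes from this commit; `cleanup` and the `is_bpe` parameter are stand-ins for the real `_decode()` internals, and the replacement list is a subset reimplemented for illustration.

```python
def cleanup(text: str) -> str:
    # Subset of the BERT-era replacements, for illustration only.
    for src, dst in ((" .", "."), (" !", "!"), (" ?", "?"), (" ,", ",")):
        text = text.replace(src, dst)
    return text

def maybe_clean_up(
    text: str,
    is_bpe: bool,
    clean_up_tokenization_spaces: bool,
    clean_up_tokenization_spaces_even_though_its_wrong_for_bpe: bool = False,
) -> str:
    if not clean_up_tokenization_spaces:
        return text
    if is_bpe and not clean_up_tokenization_spaces_even_though_its_wrong_for_bpe:
        # BPE: spaces are part of the tokens, so the cleanup would strip
        # legitimate characters. Skip it (the real code also warns once).
        return text
    return cleanup(text)

# Guard active: BPE text is preserved.
print(maybe_clean_up("x != y", is_bpe=True,
                     clean_up_tokenization_spaces=True))              # x != y
# Opt-out flag set: the old (destructive) behavior is restored.
print(maybe_clean_up("x != y", is_bpe=True,
                     clean_up_tokenization_spaces=True,
                     clean_up_tokenization_spaces_even_though_its_wrong_for_bpe=True))  # x!= y
```

The long flag name is deliberately awkward so that opting back in reads as a conscious decision rather than a default.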
force-pushed from d11ef36 to ddca57e
Hey @ArthurZucker, thanks for the review and the context on the history here. Two arguments for why this should go in:

**Correctness: cleanup is definitionally wrong for BPE**

BPE tokenizers encode whitespace as part of their tokens. That's the whole point: a token like `Ġworld` carries its own leading space, so decoding reproduces the original text exactly.

Compare with BERT, where WordPiece splits punctuation into separate tokens and naive decoding produces artifacts like `"Hello , world ."`. That is the output the cleanup was designed to repair.

This is also why a Llama-3-specific fix isn't quite right: the problem isn't that Llama 3 has a bad config, it's that cleanup is fundamentally incompatible with BPE as a tokenizer class. A model-specific fix would need to be repeated for every new BPE model that ships with `clean_up_tokenization_spaces=True`.

**Impact: the asymmetry is stark**

Llama 3 is one of the most popular model families on the Hub, and every Llama 3.x model, plus the massive ecosystem of fine-tunes and derivatives, ships with `clean_up_tokenization_spaces=true` in its config.

On the other side, the concern is breaking someone who feeds pre-tokenized text with artificial spacing through a BPE encode→decode→cleanup pipeline as a convenience post-processor. That's a rare and unusual pattern: using a feature designed for WordPiece artifacts as a general-purpose space collapser on a tokenizer that doesn't produce those artifacts.

**Escape hatch added**

We've added a `clean_up_tokenization_spaces_even_though_its_wrong_for_bpe` flag so anyone who relies on the old behavior can opt back in explicitly.
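The WordPiece-vs-BPE contrast can be made concrete with a toy example (hand-written token lists, not a real tokenizer): WordPiece-style decoding joins tokens with spaces, creating the `" ,"` / `" ."` artifacts the cleanup was written to remove, while BPE tokens already carry their own leading spaces, so joining them is lossless.

```python
# Toy illustration of the two decode pipelines (hand-written token lists).
wordpiece_tokens = ["hello", ",", "world", "."]   # punctuation split into its own tokens
bpe_tokens = ["hello", ",", " world", "."]        # the space lives inside " world"

wordpiece_decoded = " ".join(wordpiece_tokens)    # needs cleanup afterwards
bpe_decoded = "".join(bpe_tokens)                 # already correct as-is

print(wordpiece_decoded)   # hello , world .
print(bpe_decoded)         # hello, world.
```

Running the space-stripping cleanup on the first string repairs it; running the same cleanup on the second can only remove spaces that were meant to be there.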
What does this PR do?
`clean_up_tokenization` applies English-specific string replacements (`" ."` → `"."`, `" ?"` → `"?"`, `" ,"` → `","`, etc.) to decoded text. This was designed for BERT-era WordPiece tokenizers, where decoding produced artifacts like `"Hello , world ."`.

For BPE tokenizers (Llama 3, GPT-2, etc.), spaces are encoded as part of the tokens and decoding does not produce these artifacts. The cleanup is actively destructive: it strips legitimate spaces from correctly decoded text. For example, `"x != y"` becomes `"x!= y"`.

This PR adds a guard in
`PreTrainedTokenizerFast._decode()` that unconditionally skips the cleanup when the backend model is BPE, and emits a `logger.warning_once` when `clean_up_tokenization_spaces=True` is set in the tokenizer config. Users who need the string replacements for other purposes can call `tokenizer.clean_up_tokenization()` directly.

Why this matters
All 24 Llama 3.x models on the Hub have `clean_up_tokenization_spaces=true` baked into their `tokenizer_config.json` (inherited from a library default when Llama 3 switched tokenizer classes; see #35175, #31187, #32575). Fixing the config on every model repo (and every downstream fine-tune) is a game of whack-a-mole. This library-level guard ensures the cleanup is never applied to tokenizers where it's incorrect, even if the config says otherwise.

Minimal reproduction (before fix)
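A stand-in for the reproduction that exercises the cleanup's replacement rules directly rather than requiring a model download. The replacement subset below is reimplemented for illustration, not imported from transformers; the real trigger is `tokenizer.decode(ids)` on a BPE tokenizer whose config sets `clean_up_tokenization_spaces=True`.

```python
def clean_up_tokenization(text: str) -> str:
    # Subset of the cleanup's replacement rules, reimplemented for illustration.
    for src, dst in ((" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")):
        text = text.replace(src, dst)
    return text

decoded = "x != y"                      # what a BPE tokenizer actually decodes
print(clean_up_tokenization(decoded))   # x!= y  <- the space before "!" is stripped
```

With the guard in place, a BPE decode returns `"x != y"` unchanged; without it, the config-driven cleanup silently corrupts the output as shown.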
Changes
- `src/transformers/tokenization_utils_tokenizers.py`: in `PreTrainedTokenizerFast._decode()`, check `type(self.backend_tokenizer.model).__name__` and skip `clean_up_tokenization()` for BPE models, emitting a warning via `logger.warning_once`.
- `tests/tokenization/test_tokenization_fast.py`: added `test_bpe_tokenizer_skips_clean_up_tokenization_spaces` verifying the BPE roundtrip preserves text.
- `tests/utils/test_tokenization_utils.py`: updated `test_clean_up_tokenization_spaces` to use clean roundtrip text (GPT-2 is BPE, so cleanup is now correctly skipped).

Fixes #35175
Fixes #31187
Before submitting
Linked issues: #31187, Llama3 Tokenizer Decode Removing Space Character #32575

Who can review?
@ArthurZucker @itazap (tokenizer maintainers). Arthur previously acknowledged this should be `False` in #35175.