
fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast #44915

Open

maxsloef-goodfire wants to merge 6 commits into huggingface:main from maxsloef-goodfire:fix/skip-cleanup-for-bpe

Conversation

Contributor

@maxsloef-goodfire maxsloef-goodfire commented Mar 21, 2026

What does this PR do?

clean_up_tokenization applies English-specific string replacements (` .` → `.`, ` ?` → `?`, ` ,` → `,`, etc.) to decoded text. This was designed for BERT-era WordPiece tokenizers, where decoding produced artifacts like "Hello , world .".

For BPE tokenizers (Llama 3, GPT-2, etc.), spaces are encoded as part of tokens and decoding does not produce these artifacts. The cleanup is actively destructive — it strips legitimate spaces from correctly decoded text. For example, "x != y" becomes "x!= y".
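The replacement table is small enough to show inline. The sketch below is a hand-copied approximation of the library's clean_up_tokenization behavior (not the canonical transformers source), and makes the failure mode concrete: the ` !` rule is what eats the space in "x != y".

```python
def clean_up_tokenization(out_string: str) -> str:
    """Approximation of the BERT-era cleanup: collapse the space that
    WordPiece decoding inserts before punctuation and contractions."""
    for a, b in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]:
        out_string = out_string.replace(a, b)
    return out_string

# WordPiece artifact: cleanup repairs it.
print(clean_up_tokenization("Hello , world ."))      # Hello, world.
# BPE output is already exact, so cleanup corrupts it.
print(clean_up_tokenization("x != y and a.b == c"))  # x!= y and a.b == c
```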

This PR adds a guard in PreTrainedTokenizerFast._decode() that unconditionally skips the cleanup when the backend model is BPE, and emits a logger.warning_once when clean_up_tokenization_spaces=True is set in the tokenizer config. Users who need the string replacements for other purposes can call tokenizer.clean_up_tokenization() directly.

Why this matters

All 24 Llama 3.x models on the Hub have clean_up_tokenization_spaces=true baked into their tokenizer_config.json (inherited from a library default when Llama 3 switched tokenizer classes — see #35175, #31187, #32575). Fixing the config on every model repo (and every downstream fine-tune) is a game of whack-a-mole. This library-level guard ensures the cleanup is never applied to tokenizers where it's incorrect, even if the config says otherwise.

Minimal reproduction (before fix)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)
print(repr(tokenizer.decode(ids)))
# 'x!= y and a.b == c'  ← space before != silently dropped

Changes

  • src/transformers/tokenization_utils_tokenizers.py — in PreTrainedTokenizerFast._decode(), check type(self.backend_tokenizer.model).__name__ and skip clean_up_tokenization() for BPE models, emitting a warning via logger.warning_once.
  • tests/tokenization/test_tokenization_fast.py — added test_bpe_tokenizer_skips_clean_up_tokenization_spaces verifying BPE roundtrip preserves text.
  • tests/utils/test_tokenization_utils.py — updated test_clean_up_tokenization_spaces to use clean roundtrip text (GPT-2 is BPE, so cleanup is now correctly skipped).
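In isolation, the guard reduces to a class-name check on the backend model. A minimal stand-alone sketch, with a stub class standing in for tokenizers.models.BPE:

```python
class BPE:
    """Stub standing in for tokenizers.models.BPE."""

def should_skip_cleanup(backend_model) -> bool:
    # The PR keys the guard off the backend model's class name, so it
    # applies to any fast tokenizer whose backend model is BPE.
    return type(backend_model).__name__ == "BPE"

assert should_skip_cleanup(BPE())          # BPE backend: cleanup skipped
assert not should_skip_cleanup(object())   # anything else: unchanged
```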

Fixes #35175
Fixes #31187

Before submitting

Who can review?

@ArthurZucker @itazap — tokenizer maintainers. Arthur previously acknowledged this should be False in #35175.

clean_up_tokenization applies BERT-era string replacements (` .` → `.`,
` !` → `!`, etc.) that are destructive for BPE tokenizers where spaces
are encoded as part of tokens. This adds a guard that skips the cleanup
when the backend model is BPE and emits a warning_once suggesting the
user set clean_up_tokenization_spaces=False.

Fixes huggingface#35175

Verifies that BPE tokenizers preserve spaces before punctuation
even when clean_up_tokenization_spaces=True.

clean_up_tokenization is always skipped for BPE tokenizers, even when
explicitly requested, because the cleanup is fundamentally wrong for
BPE (it strips legitimate spaces that are part of the token encoding).
Users who need those string replacements can call
clean_up_tokenization() directly.

Updated test_tokenization_utils.py to expect preserved spacing for
GPT-2 (BPE). Added test in test_tokenization_fast.py verifying the
guard works with an explicit True parameter.

Move test_bpe_tokenizer_skips_clean_up_tokenization_spaces to
PreTrainedTokenizationFastTest (which has bytelevel_bpe_model_name).

Update test_clean_up_tokenization_spaces to use normal text without
artificial WordPiece artifacts — BPE roundtrip preserves originals.

ByteLevel BPE tokenizers prepend a space during encoding. Use
" Hello world." so the roundtrip matches exactly.
Collaborator

@ArthurZucker ArthurZucker left a comment
Hey! Ty for the PR!
This was actually quite a rollercoaster. (#42898)
We decided to deprecate the flag in #31938. Then we had to reintroduce it in #43426.

At this point, again because we cannot break the uploaded models on the Hub, we have 2 choices.

  1. 🔴 Take this PR as a breaking change to enforce that decoding is 1-1. This would affect literally ALL BPE models in the current state and does not allow opting out. I think this would break gpt2, which is also BPE and has had this behavior for a long time.
  2. We just document this better, pin this issue, idk, but I don't think there's much to do here.

This has been around for a while; the main issue for me is that for Llama the original repo indeed does not clean it up.

I don't mind trying to fix this for llama3 specifically! But in this state it's absolutely breaking.

Comment on lines +1021 to +1028
if type(self.backend_tokenizer.model).__name__ == "BPE":
    logger.warning_once(
        "Ignoring clean_up_tokenization_spaces=True for BPE tokenizer"
        f" {self.__class__.__name__}. The clean_up_tokenization post-processing"
        " step is designed for WordPiece tokenizers and is destructive for BPE"
        " (it strips spaces before punctuation). Set"
        " clean_up_tokenization_spaces=False to suppress this warning."
    )
Collaborator
I don't think this is something we can do... it's too breaking for anyone who relies on this behavior.
It's a very big breaking change.

Add clean_up_tokenization_spaces_even_though_its_wrong_for_bpe flag
so users who rely on the old behavior can opt back in. The warning
message now mentions this flag. Added test for the override path.
@maxsloef-goodfire maxsloef-goodfire force-pushed the fix/skip-cleanup-for-bpe branch from d11ef36 to ddca57e Compare March 23, 2026 17:12
@maxsloef-goodfire
Contributor Author

Hey @ArthurZucker, thanks for the review and the context on the history here. Two arguments for why this should go in:

Correctness: cleanup is definitionally wrong for BPE

BPE tokenizers encode whitespace as part of their tokens. That's the whole point — the Ġ prefix in GPT-2, the byte-level encoding in Llama. Decode produces a perfect roundtrip without any post-processing. There are no WordPiece-style artifacts to clean up, so clean_up_tokenization can only destroy information, never add correctness. This isn't a Llama-specific quirk, it's a property of how BPE works.

Compare with BERT, where WordPiece splits "it's" into ["it", "'", "s"] and decodes to "it ' s" — cleanup is genuinely needed there. BPE never produces those artifacts.

This is also why a Llama-3-specific fix isn't quite right — the problem isn't that Llama 3 has a bad config, it's that cleanup is fundamentally incompatible with BPE as a tokenizer class. A model-specific fix would need to be repeated for every new BPE model that ships with clean_up_tokenization_spaces: true, and there's nothing stopping that from happening again.

Impact: the asymmetry is stark

Llama 3 is one of the most popular model families on the Hub, and every Llama 3.x model — plus the massive ecosystem of fine-tunes and derivatives — ships with clean_up_tokenization_spaces: true in its config. Every user who doesn't know to manually override this gets silently corrupted output: "x != y" → "x!= y", "! ! !" → "!!!".

On the other side, the concern is breaking someone who feeds pre-tokenized text with artificial spacing through a BPE encode→decode→cleanup pipeline as a convenience post-processor. That's a rare and unusual pattern — using a feature designed for WordPiece artifacts as a general-purpose space collapser on a tokenizer that doesn't produce those artifacts.

Escape hatch added

We've added clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output as an override flag for anyone who truly needs the old behavior.
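A pure-Python sketch of how such an opt-in threads through the decode path. The function and parameter names here (maybe_clean, force_cleanup_for_bpe as shorthand for the PR's long flag name) are illustrative stand-ins, not transformers code, and the replacement rules are abbreviated:

```python
def maybe_clean(decoded: str, backend_is_bpe: bool,
                clean_up_tokenization_spaces: bool = False,
                force_cleanup_for_bpe: bool = False) -> str:
    # Cleanup runs only when requested AND the backend is not BPE,
    # unless the caller explicitly opts back in for BPE.
    if clean_up_tokenization_spaces and (not backend_is_bpe or force_cleanup_for_bpe):
        # Abbreviated replacement rules, for the demo only.
        return decoded.replace(" !", "!").replace(" .", ".")
    return decoded

# BPE: the flag is ignored by default, honored only with the override.
print(maybe_clean("x != y", backend_is_bpe=True,
                  clean_up_tokenization_spaces=True))            # x != y
print(maybe_clean("x != y", backend_is_bpe=True,
                  clean_up_tokenization_spaces=True,
                  force_cleanup_for_bpe=True))                   # x!= y
```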



Development

Successfully merging this pull request may close these issues.

  • Detokenization discrepancy with Llama3.1
  • Original Llama-3 tokenizer behaves differently from transformers version
