
fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast #44915

Open

maxsloef-goodfire wants to merge 6 commits into huggingface:main from maxsloef-goodfire:fix/skip-cleanup-for-bpe

Conversation

Contributor

@maxsloef-goodfire maxsloef-goodfire commented Mar 21, 2026

What does this PR do?

clean_up_tokenization applies English-specific string replacements (` .` → `.`, ` ?` → `?`, ` ,` → `,`, etc.) to decoded text. This was designed for BERT-era WordPiece tokenizers, where decoding produced artifacts like "Hello , world .".

For BPE tokenizers (Llama 3, GPT-2, etc.), spaces are encoded as part of tokens and decoding does not produce these artifacts. The cleanup is actively destructive — it strips legitimate spaces from correctly decoded text. For example, "x != y" becomes "x!= y".
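The replacement table is small enough to show inline. The sketch below is a hand-copied approximation of the library's clean_up_tokenization behavior (not the canonical transformers source), and makes the failure mode concrete: the ` !` rule is what eats the space in "x != y".

```python
def clean_up_tokenization(out_string: str) -> str:
    """Approximation of the BERT-era cleanup: collapse the space that
    WordPiece decoding inserts before punctuation and contractions."""
    for a, b in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]:
        out_string = out_string.replace(a, b)
    return out_string

# WordPiece artifact: cleanup repairs it.
print(clean_up_tokenization("Hello , world ."))      # Hello, world.
# BPE output is already exact, so cleanup corrupts it.
print(clean_up_tokenization("x != y and a.b == c"))  # x!= y and a.b == c
```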

This PR adds a guard in PreTrainedTokenizerFast._decode() that unconditionally skips the cleanup when the backend model is BPE, and emits a logger.warning_once when clean_up_tokenization_spaces=True is set in the tokenizer config. Users who need the string replacements for other purposes can call tokenizer.clean_up_tokenization() directly.

Why this matters

All 24 Llama 3.x models on the Hub have clean_up_tokenization_spaces=true baked into their tokenizer_config.json (inherited from a library default when Llama 3 switched tokenizer classes — see #35175, #31187, #32575). Fixing the config on every model repo (and every downstream fine-tune) is a game of whack-a-mole. This library-level guard ensures the cleanup is never applied to tokenizers where it's incorrect, even if the config says otherwise.

Minimal reproduction (before fix)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)
print(repr(tokenizer.decode(ids)))
# 'x!= y and a.b == c'  ← space before != silently dropped

Changes

  • src/transformers/tokenization_utils_tokenizers.py — in PreTrainedTokenizerFast._decode(), check type(self.backend_tokenizer.model).__name__ and skip clean_up_tokenization() for BPE models, emitting a warning via logger.warning_once.
  • tests/tokenization/test_tokenization_fast.py — added test_bpe_tokenizer_skips_clean_up_tokenization_spaces verifying BPE roundtrip preserves text.
  • tests/utils/test_tokenization_utils.py — updated test_clean_up_tokenization_spaces to use clean roundtrip text (GPT-2 is BPE, so cleanup is now correctly skipped).
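In isolation, the guard reduces to a class-name check on the backend model. A minimal stand-alone sketch, with a stub class standing in for tokenizers.models.BPE:

```python
class BPE:
    """Stub standing in for tokenizers.models.BPE."""

def should_skip_cleanup(backend_model) -> bool:
    # The PR keys the guard off the backend model's class name, so it
    # applies to any fast tokenizer whose backend model is BPE.
    return type(backend_model).__name__ == "BPE"

assert should_skip_cleanup(BPE())          # BPE backend: cleanup skipped
assert not should_skip_cleanup(object())   # anything else: unchanged
```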

Fixes #35175
Fixes #31187

Before submitting

Who can review?

@ArthurZucker @itazap — tokenizer maintainers. Arthur previously acknowledged this should be False in #35175.

clean_up_tokenization applies BERT-era string replacements (` .` → `.`,
` !` → `!`, etc.) that are destructive for BPE tokenizers where spaces
are encoded as part of tokens. This adds a guard that skips the cleanup
when the backend model is BPE and emits a warning_once suggesting the
user set clean_up_tokenization_spaces=False.

Fixes huggingface#35175

Verifies that BPE tokenizers preserve spaces before punctuation
even when clean_up_tokenization_spaces=True.

clean_up_tokenization is always skipped for BPE tokenizers, even when
explicitly requested, because the cleanup is fundamentally wrong for
BPE (it strips legitimate spaces that are part of the token encoding).
Users who need those string replacements can call
clean_up_tokenization() directly.

Updated test_tokenization_utils.py to expect preserved spacing for
GPT-2 (BPE). Added test in test_tokenization_fast.py verifying the
guard works with an explicit True parameter.

Move test_bpe_tokenizer_skips_clean_up_tokenization_spaces to
PreTrainedTokenizationFastTest (which has bytelevel_bpe_model_name).

Update test_clean_up_tokenization_spaces to use normal text without
artificial WordPiece artifacts — BPE roundtrip preserves originals.

ByteLevel BPE tokenizers prepend a space during encoding. Use
" Hello world." so the roundtrip matches exactly.
Collaborator

@ArthurZucker ArthurZucker left a comment
Hey! Ty for the PR!
This was actually quite a rollercoaster. (#42898)
We decided to deprecate the flag in #31938. Then we had to reintroduce it in #43426.

At this point, again because we cannot break the uploaded models on the Hub, we have 2 choices.

  1. 🔴 Take this PR as a breaking change to enforce that decoding is 1-1. This would affect literally ALL BPE models in the current state and does not allow opting out. I think this would break gpt2, which is also BPE and has had this behavior for a long time.
  2. We just document this better, pin this issue, idk, but I don't think there's much to do here.

This has been around for a while; the main issue for me is that for Llama the original repo indeed does not clean it up.

I don't mind trying to fix this for llama3 specifically! But in this state it's absolutely breaking.

Comment on lines +1021 to +1028
if type(self.backend_tokenizer.model).__name__ == "BPE":
    logger.warning_once(
        "Ignoring clean_up_tokenization_spaces=True for BPE tokenizer"
        f" {self.__class__.__name__}. The clean_up_tokenization post-processing"
        " step is designed for WordPiece tokenizers and is destructive for BPE"
        " (it strips spaces before punctuation). Set"
        " clean_up_tokenization_spaces=False to suppress this warning."
    )
Collaborator
I don't think this is something we can do... it's too breaking for anyone who relies on this behavior.
It's a very big breaking change.

Add clean_up_tokenization_spaces_even_though_its_wrong_for_bpe flag
so users who rely on the old behavior can opt back in. The warning
message now mentions this flag. Added test for the override path.
@maxsloef-goodfire maxsloef-goodfire force-pushed the fix/skip-cleanup-for-bpe branch from d11ef36 to ddca57e Compare March 23, 2026 17:12
@maxsloef-goodfire
Contributor Author

Hey @ArthurZucker, thanks for the review and the context on the history here. Two arguments for why this should go in:

Correctness: cleanup is definitionally wrong for BPE

BPE tokenizers encode whitespace as part of their tokens. That's the whole point — the Ġ prefix in GPT-2, the byte-level encoding in Llama. Decode produces a perfect roundtrip without any post-processing. There are no WordPiece-style artifacts to clean up, so clean_up_tokenization can only destroy information, never add correctness. This isn't a Llama-specific quirk, it's a property of how BPE works.

Compare with BERT, where WordPiece splits "it's" into ["it", "'", "s"] and decodes to "it ' s" — cleanup is genuinely needed there. BPE never produces those artifacts.

This is also why a Llama-3-specific fix isn't quite right — the problem isn't that Llama 3 has a bad config, it's that cleanup is fundamentally incompatible with BPE as a tokenizer class. A model-specific fix would need to be repeated for every new BPE model that ships with clean_up_tokenization_spaces: true, and there's nothing stopping that from happening again.

Impact: the asymmetry is stark

Llama 3 is one of the most popular model families on the Hub, and every Llama 3.x model — plus the massive ecosystem of fine-tunes and derivatives — ships with clean_up_tokenization_spaces: true in its config. Every user who doesn't know to manually override this gets silently corrupted output: "x != y" → "x!= y", "! ! !" → "!!!".

On the other side, the concern is breaking someone who feeds pre-tokenized text with artificial spacing through a BPE encode→decode→cleanup pipeline as a convenience post-processor. That's a rare and unusual pattern — using a feature designed for WordPiece artifacts as a general-purpose space collapser on a tokenizer that doesn't produce those artifacts.

Escape hatch added

We've added clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output as an override flag for anyone who truly needs the old behavior.
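A pure-Python sketch of how such an opt-in threads through the decode path. The function and parameter names here (maybe_clean, force_cleanup_for_bpe as shorthand for the PR's long flag name) are illustrative stand-ins, not transformers code, and the replacement rules are abbreviated:

```python
def maybe_clean(decoded: str, backend_is_bpe: bool,
                clean_up_tokenization_spaces: bool = False,
                force_cleanup_for_bpe: bool = False) -> str:
    # Cleanup runs only when requested AND the backend is not BPE,
    # unless the caller explicitly opts back in for BPE.
    if clean_up_tokenization_spaces and (not backend_is_bpe or force_cleanup_for_bpe):
        # Abbreviated replacement rules, for the demo only.
        return decoded.replace(" !", "!").replace(" .", ".")
    return decoded

# BPE: the flag is ignored by default, honored only with the override.
print(maybe_clean("x != y", backend_is_bpe=True,
                  clean_up_tokenization_spaces=True))            # x != y
print(maybe_clean("x != y", backend_is_bpe=True,
                  clean_up_tokenization_spaces=True,
                  force_cleanup_for_bpe=True))                   # x!= y
```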



Development

Successfully merging this pull request may close these issues.

  • Detokenization discrepancy with Llama3.1
  • Original Llama-3 tokenizer behaves differently from transformers version
