
Fix convert_tokens_to_ids performance regression for slow tokenizers (#46315) #46323

Merged: itazap merged 2 commits into huggingface:main from ishan-1010:fix/added-tokens-encoder-perf-46315 on Jun 3, 2026

@ishan-1010


What this fixes

Fixes #46315. For slow (PreTrainedTokenizer) tokenizers, convert_tokens_to_ids became much slower after the v5 tokenizer refactor (#40936). It went from roughly O(T) to roughly O(T * N * log N), where T is the number of tokens converted and N is the number of added tokens.

_convert_token_to_id_with_added_voc resolves each token through the added_tokens_encoder property. That property rebuilds and re-sorts the entire added-token mapping on every access, and the method reads it twice per token. In v4 this lookup used the maintained self._added_tokens_encoder cache; the refactor switched it to the property. The property's own docstring still says the mapping is cached in self._added_tokens_encoder, and every other method in the file already reads that cache directly.
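To make the cost difference concrete, here is a minimal sketch of the two lookup paths. `ToySlowTokenizer`, `convert_via_property`, `convert_via_cache`, and the `rebuilds` counter are hypothetical names for illustration, and plain strings stand in for added-token objects; this is a simplified model of the behavior described above, not the transformers source.

```python
class ToySlowTokenizer:
    """Simplified stand-in for illustration; not the transformers implementation."""

    def __init__(self, added_tokens):
        # id -> token, plus a maintained token -> id cache.
        self._added_tokens_decoder = {i: tok for i, tok in enumerate(added_tokens)}
        self._added_tokens_encoder = {tok: i for i, tok in self._added_tokens_decoder.items()}
        self.rebuilds = 0  # hypothetical counter, only here to make the cost visible

    @property
    def added_tokens_encoder(self):
        # Rebuilds and re-sorts the whole mapping on every access: O(N log N).
        self.rebuilds += 1
        return {tok: i for i, tok in sorted(self._added_tokens_decoder.items())}

    def convert_via_property(self, token):
        # Two property reads per token, so two full rebuilds.
        if token in self.added_tokens_encoder:
            return self.added_tokens_encoder[token]
        return None

    def convert_via_cache(self, token):
        # One O(1) dict lookup, no rebuild.
        return self._added_tokens_encoder.get(token)


tok = ToySlowTokenizer([f"<tok{i}>" for i in range(1000)])
ids_slow = [tok.convert_via_property(t) for t in ("<tok3>", "<tok7>")]
rebuilds_after_property = tok.rebuilds
ids_fast = [tok.convert_via_cache(t) for t in ("<tok3>", "<tok7>")]
```

In this toy model, converting two tokens through the property triggers four rebuilds of a 1,000-entry mapping, while the cached path triggers none and returns the same ids.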

This hits hardest on a slow tokenizer with many added tokens where most tokens land in the added vocabulary, for example a Chinese BertTokenizerLegacy model with many single-character added tokens.

The fix

Read the maintained self._added_tokens_encoder cache instead of the rebuild-on-access property, which is what v4 did. One method, two lines. The cache is kept in sync at every place _added_tokens_decoder is mutated (__init__, the added_tokens_decoder setter, and _add_tokens), so the result is identical and only faster.
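The sync invariant can be sketched as follows. `ToyAddedVocab` is a hypothetical class: the attribute and method names echo the ones mentioned above, but the bodies (including how new ids are assigned) are assumptions for illustration, not the library's code.

```python
class ToyAddedVocab:
    """Hypothetical sketch of keeping an encoder cache in sync with its decoder."""

    def __init__(self):
        self._added_tokens_decoder = {}  # id -> token
        self._added_tokens_encoder = {}  # token -> id (maintained cache)

    @property
    def added_tokens_decoder(self):
        return dict(sorted(self._added_tokens_decoder.items()))

    @added_tokens_decoder.setter
    def added_tokens_decoder(self, value):
        # Mutation point: replace both structures together.
        self._added_tokens_decoder = dict(value)
        self._added_tokens_encoder = {tok: idx for idx, tok in value.items()}

    def add_tokens(self, tokens):
        # Mutation point: write each new token to both structures.
        added = 0
        for tok in tokens:
            if tok in self._added_tokens_encoder:
                continue
            # Assumed id scheme for this sketch: next id after the current maximum.
            new_id = max(self._added_tokens_decoder, default=-1) + 1
            self._added_tokens_decoder[new_id] = tok
            self._added_tokens_encoder[tok] = new_id
            added += 1
        return added


vocab = ToyAddedVocab()
first = vocab.add_tokens(["<a>", "<b>"])
second = vocab.add_tokens(["<b>", "<c>"])  # "<b>" already exists and is skipped
encoder_after_adds = dict(vocab._added_tokens_encoder)

vocab.added_tokens_decoder = {7: "<x>"}
vocab.add_tokens(["<y>"])
encoder_after_set = dict(vocab._added_tokens_encoder)
```

Because every write path updates both dictionaries, a reader of the cache sees the same key-to-value pairs it would get from rebuilding the mapping out of the decoder.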

Impact

Slow tokenizer, N added tokens, converting T = N tokens that all hit the added vocabulary (measured locally):

| N     | before   | after   |
|-------|----------|---------|
| 500   | 59 ms    | 0.06 ms |
| 2,000 | 957 ms   | 0.20 ms |
| 5,000 | 7,294 ms | 0.50 ms |
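A rough way to reproduce the shape of this measurement without transformers installed is sketched below. It is not the benchmark used for the table: `build_encoder`, `convert_rebuilding`, `convert_cached`, and `time_call` are hypothetical helpers that only model the rebuild-per-lookup pattern on plain dictionaries, so absolute timings will differ from the numbers above.

```python
import time


def build_encoder(decoder):
    # Models the rebuild-on-access property: sort by id, then invert.
    return {tok: idx for idx, tok in sorted(decoder.items())}


def convert_rebuilding(tokens, decoder):
    # Two rebuilds per token, as in the regression described above.
    return [build_encoder(decoder)[t] if t in build_encoder(decoder) else None for t in tokens]


def convert_cached(tokens, encoder):
    # One dict lookup per token against a prebuilt mapping.
    return [encoder.get(t) for t in tokens]


def time_call(fn, *args):
    # Returns the result and the elapsed wall-clock time in milliseconds.
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


n = 500
decoder = {i: f"<tok{i}>" for i in range(n)}
cache = build_encoder(decoder)
tokens = list(decoder.values())  # T = N, every token is an added token

slow_ids, slow_ms = time_call(convert_rebuilding, tokens, decoder)
fast_ids, fast_ms = time_call(convert_cached, tokens, cache)
print(f"N={n}: rebuilding {slow_ms:.2f} ms, cached {fast_ms:.2f} ms")
```

Raising `n` should make the rebuilding path grow much faster than the cached path, which is the trend the table shows.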

Correctness

The cache (self._added_tokens_encoder) and the property (added_tokens_encoder) hold the same key-to-value mapping; they differ only in sort order, which this method does not depend on. I checked that convert_tokens_to_ids returns the same results before and after for added and special tokens across multiple add_tokens calls.
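The order-independence argument can be checked on plain dictionaries. This is an illustrative sketch with made-up tokens, relying only on the fact that Python `dict` equality and key lookup ignore insertion order.

```python
# id -> token, deliberately inserted out of id order.
decoder = {2: "<c>", 0: "<a>", 1: "<b>"}

# Maintained cache: inverted in insertion order.
cache = {tok: idx for idx, tok in decoder.items()}

# Property-style rebuild: sorted by id, then inverted.
rebuilt = {tok: idx for idx, tok in sorted(decoder.items())}

same_mapping = cache == rebuilt            # equality ignores order
same_order = list(cache) == list(rebuilt)  # iteration order differs
lookups_agree = all(cache[t] == rebuilt[t] for t in cache)
```

Here `same_mapping` and `lookups_agree` are true while `same_order` is false, so code that only performs membership tests and key lookups cannot tell the two apart.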

Tests

Added tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_convert_tokens_to_ids_does_not_rebuild_added_vocab. It is network-free and uses CTRLTokenizer, a regular non-legacy slow tokenizer (per maintainer request). It asserts the added-token mapping is not rebuilt during convert_tokens_to_ids and that the ids resolve correctly. It fails on main and passes with this change.
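One way such a "does not rebuild" assertion could be written is sketched below with `unittest.mock.PropertyMock`. This is not the test from the PR: `ToyTokenizer` and `check_no_rebuild` are hypothetical, and the toy class stands in for a real slow tokenizer so the example stays self-contained.

```python
from unittest.mock import PropertyMock, patch


class ToyTokenizer:
    """Hypothetical stand-in for a slow tokenizer; not CTRLTokenizer."""

    def __init__(self, added_tokens):
        self._added_tokens_encoder = {tok: i for i, tok in enumerate(added_tokens)}

    @property
    def added_tokens_encoder(self):
        # Models the expensive rebuild-on-access property.
        return dict(sorted(self._added_tokens_encoder.items(), key=lambda kv: kv[1]))

    def convert_tokens_to_ids(self, tokens):
        # Reads the maintained cache, never the property.
        return [self._added_tokens_encoder.get(t) for t in tokens]


def check_no_rebuild(tokenizer, tokens):
    # Replace the property with a mock so that any access is recorded.
    with patch.object(type(tokenizer), "added_tokens_encoder", new_callable=PropertyMock) as prop:
        ids = tokenizer.convert_tokens_to_ids(tokens)
    prop.assert_not_called()
    return ids


ids = check_no_rebuild(ToyTokenizer(["<a>", "<b>", "<c>"]), ["<c>", "<a>"])
```

If `convert_tokens_to_ids` were changed to read `self.added_tokens_encoder`, `prop.assert_not_called()` would raise an `AssertionError`, which is the kind of failure a regression test of this shape is meant to catch.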

# regression test (fails on main, passes with this change)
python -m pytest "tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_convert_tokens_to_ids_does_not_rebuild_added_vocab"
# full file
python -m pytest tests/tokenization/test_tokenization_utils.py
# result: 15 passed, 4 skipped, 2 failed

The 2 failures (test_encode_message, test_special_tokens_overwrite) also fail on unmodified main and require network or model downloads, so they are unrelated to this change.

Not a duplicate

Checked per the contribution policy: no open PR references #46315, and no open PR touches added_tokens_encoder or convert_tokens_to_ids.

Coordination

I raised this on the issue before opening the PR. Maintainer @itazap approved and asked for the regression test to use a non-legacy tokenizer, which is now CTRLTokenizer: #46315 (comment)

AI assistance

I used an AI coding assistant (Claude Code) while preparing this. I reviewed every changed line, verified the diagnosis against the source and the v4 implementation myself, ran the tests and the benchmark above, and can defend the change.

…uggingface#46315)

Slow (PreTrainedTokenizer) tokenizers resolved added-vocabulary tokens
through the added_tokens_encoder property, which rebuilds and re-sorts the
whole added-token mapping on every access and is read twice per token. That
made convert_tokens_to_ids roughly O(T * N * log N) for N added tokens, a
regression from the v5 tokenizer refactor (huggingface#40936).

Read the maintained _added_tokens_encoder cache instead, restoring the v4
behavior that every other method in the file already relies on. Adds a
network-free regression test using CTRLTokenizer.
@ishan-1010

Author

Thanks for the quick look on the issue, @itazap. I moved the regression test to CTRLTokenizer (a regular non-legacy slow tokenizer) as you suggested. It's network-free and checks that the added-token mapping isn't rebuilt per token during convert_tokens_to_ids. Ready for review whenever you have a moment.

Comment thread tests/tokenization/test_tokenization_utils.py Outdated
@ishan-1010

Author

Good call, removed the test. The fix and the inline comment in tokenization_python.py stand on their own. Thanks for reviewing.

@ishan-1010 ishan-1010 requested a review from itazap June 2, 2026 17:36
@github-actions

github-actions Bot commented Jun 2, 2026

Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46323&sha=d2eaaf

@itazap itazap left a comment

Collaborator


Thank you!

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap

itazap commented Jun 3, 2026

Collaborator

run-slow: auto, llama

@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/auto", "models/llama"]
quantizations: []

@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                   |
|---------|----------|-------------------------------|
| RUN     | 19acfcd0 | workflow commit (merge commit) |
| PR      | d2eaaf78 | branch commit (from PR)        |
| main    | 595721c4 | base commit (on main)          |

✅ No failing test specific to this PR 🎉 👏 !

@itazap itazap added this pull request to the merge queue Jun 3, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 3, 2026
@ishan-1010

Author

The merge queue failures look unrelated: tests_processors was a worker crash (0 test failures), and the only failure in tests_torch was test_from_pretrained_dynamic_model_distant (remote-code loading). All tokenization jobs passed.

@itazap itazap added this pull request to the merge queue Jun 3, 2026
Merged via the queue into huggingface:main with commit d3f0591 Jun 3, 2026
30 checks passed


Development

Successfully merging this pull request may close these issues.

Regression: convert_tokens_to_ids is much slower in v5 than v4 for slow tokenizers with many added tokens

3 participants