Fix convert_tokens_to_ids performance regression for slow tokenizers (#46315) by ishan-1010 · Pull Request #46323 · huggingface/transformers

ishan-1010 · 2026-06-01T12:12:04Z

What this fixes

Fixes #46315. For slow (PreTrainedTokenizer) tokenizers, convert_tokens_to_ids became much slower after the v5 tokenizer refactor (#40936). It went from roughly O(T) to roughly O(T * N * log N), where T is the number of tokens converted and N is the number of added tokens.

_convert_token_to_id_with_added_voc resolves each token through the added_tokens_encoder property. That property rebuilds and re-sorts the entire added-token mapping on every access, and the method reads it twice per token. In v4 this lookup used the maintained self._added_tokens_encoder cache; the refactor switched it to the property. The property's own docstring still says the mapping is cached in self._added_tokens_encoder, and every other method in the file already reads that cache directly.
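
The access pattern described above can be sketched with a toy class. This is a minimal illustration, not the transformers implementation: `ToyTokenizer`, `convert_via_property`, `convert_via_cache`, and the `rebuilds` counter are hypothetical names, and added tokens are modeled as plain strings.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a slow tokenizer; not the transformers class."""

    def __init__(self, added_tokens):
        # id -> token: the source of truth for added tokens.
        self._added_tokens_decoder = dict(enumerate(added_tokens))
        # token -> id: the maintained cache.
        self._added_tokens_encoder = {
            tok: idx for idx, tok in self._added_tokens_decoder.items()
        }
        self.rebuilds = 0  # counts how often the property rebuilds the mapping

    @property
    def added_tokens_encoder(self):
        # Rebuilds and re-sorts the whole mapping on every access.
        self.rebuilds += 1
        return {tok: idx for idx, tok in sorted(self._added_tokens_decoder.items())}

    def convert_via_property(self, token):
        # Two property reads per token: membership test, then lookup.
        if token in self.added_tokens_encoder:
            return self.added_tokens_encoder[token]
        return None

    def convert_via_cache(self, token):
        # Plain dict lookups against the maintained cache; nothing is rebuilt.
        if token in self._added_tokens_encoder:
            return self._added_tokens_encoder[token]
        return None


tok = ToyTokenizer([f"<tok_{i}>" for i in range(1000)])
tokens = ["<tok_3>", "<tok_500>", "<tok_999>"]

ids_slow = [tok.convert_via_property(t) for t in tokens]
rebuilds_after_property_path = tok.rebuilds  # two rebuilds per converted token

ids_fast = [tok.convert_via_cache(t) for t in tokens]
rebuilds_after_cache_path = tok.rebuilds  # unchanged by the cache path
```

In this sketch both paths return the same ids; only the property path pays for a full rebuild on each read.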

This hits hardest on a slow tokenizer with many added tokens where most tokens land in the added vocabulary, for example a Chinese BertTokenizerLegacy model with many single-character added tokens.

The fix

Read the maintained self._added_tokens_encoder cache instead of the rebuild-on-access property, which is what v4 did. One method, two lines. The cache is kept in sync at every place _added_tokens_decoder is mutated (__init__, the added_tokens_decoder setter, and _add_tokens), so the result is identical and only faster.

Impact

Slow tokenizer, N added tokens, converting T = N tokens that all hit the added vocabulary (measured locally):

N	before	after
500	59 ms	0.06 ms
2,000	957 ms	0.20 ms
5,000	7,294 ms	0.50 ms
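
As a rough sanity check on the complexity claim, the table can be compared against a simple growth model. The sketch below uses only the numbers in the table; the helper names `growth` and `predicted_n2_log_n` are hypothetical, and the model ignores constant factors and measurement noise, so only the order of magnitude is meaningful.

```python
import math

# Measurements copied from the table above, in milliseconds.
before_ms = {500: 59.0, 2000: 957.0, 5000: 7294.0}
after_ms = {500: 0.06, 2000: 0.20, 5000: 0.50}


def growth(times, n_small, n_large):
    """Measured slowdown factor when N grows from n_small to n_large."""
    return times[n_large] / times[n_small]


def predicted_n2_log_n(n_small, n_large):
    """Growth factor of N^2 * log N, i.e. O(T * N * log N) with T = N."""
    return (n_large**2 * math.log(n_large)) / (n_small**2 * math.log(n_small))


for n_small, n_large in [(500, 2000), (2000, 5000)]:
    print(
        f"N {n_small} -> {n_large}: "
        f"before x{growth(before_ms, n_small, n_large):.1f}, "
        f"N^2 log N model x{predicted_n2_log_n(n_small, n_large):.1f}, "
        f"after x{growth(after_ms, n_small, n_large):.1f}, "
        f"linear model x{n_large / n_small:.1f}"
    )
```

Under these assumptions the "before" column grows at about the rate the N² log N model predicts, while the "after" column grows roughly linearly in N, which is what an O(T) lookup with T = N would give.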

Correctness

The cache (self._added_tokens_encoder) and the property (added_tokens_encoder) hold the same key-to-value mapping; they differ only in sort order, which this method does not depend on. I checked that convert_tokens_to_ids returns the same results before and after for added and special tokens across multiple add_tokens calls.
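
The equivalence argument can be illustrated with a small sketch. It assumes a toy class (`ToyAddedVocab`, a hypothetical name) that updates the cache wherever the decoder is mutated, and it relies on the fact that Python dict equality ignores insertion order.

```python
class ToyAddedVocab:
    """Hypothetical model of the decoder / encoder-cache pair."""

    def __init__(self):
        self._added_tokens_decoder = {}  # id -> token
        self._added_tokens_encoder = {}  # token -> id (maintained cache)

    def add_tokens(self, tokens, start_id):
        for offset, token in enumerate(tokens):
            idx = start_id + offset
            self._added_tokens_decoder[idx] = token
            self._added_tokens_encoder[token] = idx  # keep the cache in sync

    @property
    def added_tokens_encoder(self):
        # Rebuilt from the decoder, sorted by id, on every access.
        return {tok: idx for idx, tok in sorted(self._added_tokens_decoder.items())}


vocab = ToyAddedVocab()
vocab.add_tokens(["<b>", "<c>"], start_id=200)
vocab.add_tokens(["<a>"], start_id=100)  # a lower id added in a later call

cache = vocab._added_tokens_encoder
rebuilt = vocab.added_tokens_encoder

same_mapping = cache == rebuilt            # dict equality ignores order
same_order = list(cache) == list(rebuilt)  # insertion order vs. id order
```

Here the two dicts compare equal even though their iteration order differs, so a key lookup gives the same id from either one.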

Tests

Added tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_convert_tokens_to_ids_does_not_rebuild_added_vocab. It is network-free and uses CTRLTokenizer, a regular non-legacy slow tokenizer (per maintainer request). It asserts the added-token mapping is not rebuilt during convert_tokens_to_ids and that the ids resolve correctly. It fails on main and passes with this change.
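
One way to express a "does not rebuild" assertion is to patch the property and check that it is never read. The sketch below is not the test from this PR: it uses a hypothetical `ToySlowTokenizer` instead of CTRLTokenizer so that it runs without transformers installed, and it relies only on the standard library's `unittest.mock.PropertyMock`.

```python
import unittest
from unittest import mock


class ToySlowTokenizer:
    """Hypothetical stand-in; the real test targeted CTRLTokenizer."""

    def __init__(self, added_tokens):
        self._added_tokens_decoder = dict(enumerate(added_tokens))
        self._added_tokens_encoder = {
            tok: idx for idx, tok in self._added_tokens_decoder.items()
        }

    @property
    def added_tokens_encoder(self):
        # Rebuild-on-access property that the fixed code path should avoid.
        return {tok: idx for idx, tok in sorted(self._added_tokens_decoder.items())}

    def convert_tokens_to_ids(self, tokens):
        # Fixed behavior: read the maintained cache directly.
        return [self._added_tokens_encoder.get(tok) for tok in tokens]


class ToyRegressionTest(unittest.TestCase):
    def test_convert_does_not_rebuild_added_vocab(self):
        tok = ToySlowTokenizer(["<x>", "<y>"])
        # Replace the property with a mock so that any read is recorded.
        with mock.patch.object(
            ToySlowTokenizer, "added_tokens_encoder", new_callable=mock.PropertyMock
        ) as prop:
            ids = tok.convert_tokens_to_ids(["<y>", "<x>"])
        prop.assert_not_called()
        self.assertEqual(ids, [1, 0])
```

If `convert_tokens_to_ids` in this sketch were changed to read `self.added_tokens_encoder`, `assert_not_called` would fail, which is the kind of signal such a regression test is meant to give.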

# regression test (fails on main, passes with this change)
python -m pytest "tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_convert_tokens_to_ids_does_not_rebuild_added_vocab"
# full file
python -m pytest tests/tokenization/test_tokenization_utils.py
# result: 15 passed, 4 skipped, 2 failed

The 2 failures (test_encode_message, test_special_tokens_overwrite) also fail on unmodified main and require network or model downloads, so they are unrelated to this change.

Not a duplicate

Checked per the contribution policy: no open PR references #46315, and no open PR touches added_tokens_encoder or convert_tokens_to_ids.

Coordination

I raised this on the issue before opening the PR. Maintainer @itazap approved and asked for the regression test to use a non-legacy tokenizer, which is now CTRLTokenizer: #46315 (comment)

AI assistance

I used an AI coding assistant (Claude Code) while preparing this. I reviewed every changed line, verified the diagnosis against the source and the v4 implementation myself, ran the tests and the benchmark above, and can defend the change.

…uggingface#46315) Slow (PreTrainedTokenizer) tokenizers resolved added-vocabulary tokens through the added_tokens_encoder property, which rebuilds and re-sorts the whole added-token mapping on every access and is read twice per token. That made convert_tokens_to_ids roughly O(T * N * log N) for N added tokens, a regression from the v5 tokenizer refactor (huggingface#40936). Read the maintained _added_tokens_encoder cache instead, restoring the v4 behavior that every other method in the file already relies on. Adds a network-free regression test using CTRLTokenizer.

ishan-1010 · 2026-06-01T12:19:27Z

Thanks for the quick look on the issue, @itazap. I moved the regression test to CTRLTokenizer (a regular non-legacy slow tokenizer) as you suggested. It's network-free and checks that the added-token mapping isn't rebuilt per token during convert_tokens_to_ids. Ready for review whenever you have a moment.

ishan-1010 · 2026-06-02T17:25:07Z

Good call, removed the test. The fix and the inline comment in tokenization_python.py stand on their own. Thanks for reviewing.

github-actions · 2026-06-02T17:38:32Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46323&sha=d2eaaf

itazap

Thank you!

HuggingFaceDocBuilderDev · 2026-06-03T09:22:50Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

itazap · 2026-06-03T12:36:42Z

run-slow: auto, llama

github-actions · 2026-06-03T12:37:58Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/auto", "models/llama"]
quantizations: []

github-actions · 2026-06-03T13:03:48Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	19acfcd0	workflow commit (merge commit)
PR	d2eaaf78	branch commit (from PR)
main	595721c4	base commit (on `main`)

✅ No failing test specific to this PR 🎉 👏 !

ishan-1010 · 2026-06-03T15:09:51Z

The merge queue reds look unrelated: tests_processors was a worker crash (0 test failures), and tests_torch's only fail was test_from_pretrained_dynamic_model_distant (remote-code loading). Tokenization jobs all passed.

itazap reviewed Jun 2, 2026

Comment thread tests/tokenization/test_tokenization_utils.py Outdated

Remove regression test per reviewer feedback

d2eaaf7

ishan-1010 requested a review from itazap June 2, 2026 17:36

itazap approved these changes Jun 3, 2026

itazap added this pull request to the merge queue Jun 3, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 3, 2026

itazap added this pull request to the merge queue Jun 3, 2026

Merged via the queue into huggingface:main with commit d3f0591 Jun 3, 2026
30 checks passed

ishan-1010 mentioned this pull request Jun 4, 2026

added_tokens_encoder rebuilds the entire added-token map on every access #46396

Open

Achyuthan-S mentioned this pull request Jun 6, 2026

fix: use _added_tokens_encoder cache instead of property in convert_tokens_to_string and get_vocab #46464

Closed

itazap merged 2 commits into huggingface:main from ishan-1010:fix/added-tokens-encoder-perf-46315