Fix convert_tokens_to_ids performance regression for slow tokenizers (#46315)#46323
Conversation
…uggingface#46315) Slow (PreTrainedTokenizer) tokenizers resolved added-vocabulary tokens through the added_tokens_encoder property, which rebuilds and re-sorts the whole added-token mapping on every access and is read twice per token. That made convert_tokens_to_ids roughly O(T * N * log N) for N added tokens, a regression from the v5 tokenizer refactor (huggingface#40936). Read the maintained _added_tokens_encoder cache instead, restoring the v4 behavior that every other method in the file already relies on. Adds a network-free regression test using CTRLTokenizer.
|
Thanks for the quick look on the issue, @itazap. I moved the regression test to |
|
Good call, removed the test. The fix and the inline comment in tokenization_python.py stand on their own. Thanks for reviewing. |
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46323&sha=d2eaaf |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
run-slow: auto, llama |
|
This comment contains models: ["models/auto", "models/llama"] |
|
The merge queue reds look unrelated: tests_processors was a worker crash (0 test failures), and tests_torch's only fail was test_from_pretrained_dynamic_model_distant (remote-code loading). Tokenization jobs all passed. |
What this fixes
Fixes #46315. For slow (
PreTrainedTokenizer) tokenizers,convert_tokens_to_idsbecame much slower after the v5 tokenizer refactor (#40936). It went from roughly O(T) to roughly O(T * N * log N), where T is the number of tokens converted and N is the number of added tokens._convert_token_to_id_with_added_vocresolves each token through theadded_tokens_encoderproperty. That property rebuilds and re-sorts the entire added-token mapping on every access, and the method reads it twice per token. In v4 this lookup used the maintainedself._added_tokens_encodercache; the refactor switched it to the property. The property's own docstring still says the mapping is cached inself._added_tokens_encoder, and every other method in the file already reads that cache directly.This hits hardest on a slow tokenizer with many added tokens where most tokens land in the added vocabulary, for example a Chinese
BertTokenizerLegacymodel with many single-character added tokens.The fix
Read the maintained
self._added_tokens_encodercache instead of the rebuild-on-access property, which is what v4 did. One method, two lines. The cache is kept in sync at every place_added_tokens_decoderis mutated (__init__, theadded_tokens_decodersetter, and_add_tokens), so the result is identical and only faster.Impact
Slow tokenizer, N added tokens, converting T = N tokens that all hit the added vocabulary (measured locally):
Correctness
The cache (
self._added_tokens_encoder) and the property (added_tokens_encoder) hold the same key-to-value mapping; they differ only in sort order, which this method does not depend on. I checked thatconvert_tokens_to_idsreturns the same results before and after for added and special tokens across multipleadd_tokenscalls.Tests
Added
tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_convert_tokens_to_ids_does_not_rebuild_added_vocab. It is network-free and usesCTRLTokenizer, a regular non-legacy slow tokenizer (per maintainer request). It asserts the added-token mapping is not rebuilt duringconvert_tokens_to_idsand that the ids resolve correctly. It fails onmainand passes with this change.The 2 failures (
test_encode_message,test_special_tokens_overwrite) also fail on unmodifiedmainand require network or model downloads, so they are unrelated to this change.Not a duplicate
Checked per the contribution policy: no open PR references #46315, and no open PR touches
added_tokens_encoderorconvert_tokens_to_ids.Coordination
I raised this on the issue before opening the PR. Maintainer @itazap approved and asked for the regression test to use a non-legacy tokenizer, which is now
CTRLTokenizer: #46315 (comment)AI assistance
I used an AI coding assistant (Claude Code) while preparing this. I reviewed every changed line, verified the diagnosis against the source and the v4 implementation myself, ran the tests and the benchmark above, and can defend the change.