Summary
Since the v5 tokenizer refactor (#40936, "rm slow tokenizers"), PreTrainedTokenizer._convert_token_to_id_with_added_voc looks tokens up via the added_tokens_encoder property, which re-sorts and rebuilds a dict of all added tokens on every access (and is accessed up to twice per token). Because convert_tokens_to_ids calls this method once per token, converting a sequence of T tokens for a tokenizer that has N added tokens now costs O(T · N · log N) instead of the previous O(T).
This is a silent performance regression. It is most severe for tokenizers that carry a large N of added tokens and produce many tokens that hit the added vocabulary — e.g. Chinese BertTokenizerLegacy models whose added_tokens contain many CJK single characters, where essentially every token is an added token, so each one triggers two full rebuild-and-sort passes.
Versions
Affected: v5.x including current main (5.10.0.dev0).
Last good: v4.x (e.g. 4.57.6).
Introduced by: rm slow tokenizers #40936 (commit 05c0e1d39), which renamed tokenization_utils.py → tokenization_python.py and, in that move, changed the lookup from the cached dict to the property.
Root cause
Line references below point at the v5.9.0 tag; the same code is present on current main.
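convert_tokens_to_ids loops over every token, calling _convert_token_to_id_with_added_voc once per token:
https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_utils_base.py#L1477
_convert_token_to_id_with_added_voc looks the token up via the added_tokens_encoder property, accessed up to twice per token (token in ... then ...[token]):
https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_python.py#L689-L692
That property re-sorts and rebuilds a dict of all added tokens on every access:
https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_python.py#L457-L463
Its own docstring states the cache lives in self._added_tokens_encoder "for performance optimisation", but the call site no longer uses that cache.
In v4 the call site used the cached self._added_tokens_encoder dict directly, giving O(1) per token:
https://github.com/huggingface/transformers/blob/v4.57.6/src/transformers/tokenization_utils.py#L732-L738
self._added_tokens_encoder is still maintained in v5 (initialized in __init__, updated in _add_tokens), so it is in sync with the property and safe to use.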
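The access pattern described in this section can be modelled without transformers installed. The following is a minimal sketch, not the library's code: ToyTokenizer, its rebuilds counter, and the two lookup_* methods are hypothetical names, and the property body is only an assumption about what "re-sort and rebuild" amounts to. It counts how many rebuilds each lookup style triggers for 200 tokens that all hit the added vocabulary.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a slow tokenizer with added tokens."""

    def __init__(self, added_tokens, base_vocab_size=100):
        # id -> token, analogous to an added-tokens decoder.
        self._added_tokens_decoder = {
            base_vocab_size + i: tok for i, tok in enumerate(added_tokens)
        }
        # token -> id cache, built once alongside the decoder.
        self._added_tokens_encoder = {
            tok: idx for idx, tok in self._added_tokens_decoder.items()
        }
        self.rebuilds = 0  # instrumentation only

    @property
    def added_tokens_encoder(self):
        # Assumed shape of the property: sort every entry, build a fresh dict.
        self.rebuilds += 1
        return {tok: idx for idx, tok in sorted(self._added_tokens_decoder.items())}

    def lookup_via_property(self, token):
        # A hit costs two property accesses: membership test, then indexing.
        if token in self.added_tokens_encoder:
            return self.added_tokens_encoder[token]
        return None

    def lookup_via_cache(self, token):
        # One O(1) dict probe, no rebuild.
        return self._added_tokens_encoder.get(token)


added = [chr(c) for c in range(0x4E00, 0x4E00 + 1000)]
tok = ToyTokenizer(added)
tokens = added[:200]  # every token hits the added vocabulary

ids_property = [tok.lookup_via_property(t) for t in tokens]
rebuilds_property = tok.rebuilds

tok.rebuilds = 0
ids_cache = [tok.lookup_via_cache(t) for t in tokens]
rebuilds_cache = tok.rebuilds

print(rebuilds_property, rebuilds_cache)  # prints: 400 0
```

Both paths return the same ids. The only difference is that the property path rebuilds the whole mapping twice per token.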
Reproduction
On v5 (affected):
# transformers v5 (e.g. 5.9.0)
import time
from transformers import BertTokenizerLegacy

tok = BertTokenizerLegacy.from_pretrained("google-bert/bert-base-chinese")
# Mimic a model that ships many single-character added tokens.
tok.add_tokens([chr(c) for c in range(0x4E00, 0x4E00 + 5000)])
words = tok.tokenize("这是一个用于测试分词速度的中文句子。" * 20)
print("N (added tokens) =", len(tok._added_tokens_encoder), " T (tokens) =", len(words))
t = time.perf_counter()
for _ in range(100):
    tok.convert_tokens_to_ids(words)
print("convert_tokens_to_ids x100:", time.perf_counter() - t, "s")
For comparison, run the same script on v4 with BertTokenizerLegacy replaced by BertTokenizer. v4 has no BertTokenizerLegacy; its slow tokenizer is BertTokenizer, the same Python implementation that was later renamed to BertTokenizerLegacy in v5.
With the same N and T, v5 runtime scales with N (the number of added tokens) — even though N is irrelevant to converting an already-tokenized sequence — while on v4 the same loop is effectively independent of N.
Example measurements
Time per convert_tokens_to_ids call, converting T = 200 tokens that all hit the added vocabulary (the Chinese worst case). Same machine, CPU, Python 3.12.13 for both; small local vocab:
| N (added tokens) | v4.57.6 (BertTokenizer) | v5.9.0 (BertTokenizerLegacy) | slowdown |
|---|---|---|---|
| 0 | 0.06 ms | 1.20 ms | ~19× |
| 1,000 | 0.06 ms | 146 ms | ~2,300× |
| 5,000 | 0.11 ms | 782 ms | ~7,400× |
| 20,000 | 0.09 ms | 3,734 ms | ~40,000× |
| 50,000 | 0.07 ms | 12,508 ms | ~190,000× |
v4 is flat (independent of N); v5 grows roughly linearly in N. At N = 20,000, converting a single 200-token sequence takes ~3.7 s on v5 versus ~0.1 ms on v4.
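As a rough sanity check on the "roughly linear" claim, the v5 timings can be divided by the number of property accesses per call. This is back-of-the-envelope arithmetic, not a new measurement. It assumes two accesses per token (every token hits the added vocabulary) and that the whole call time is spent in the property, which ignores the small fixed overhead visible at N = 0.

```python
T = 200                 # tokens per convert_tokens_to_ids call (from the table)
ACCESSES_PER_TOKEN = 2  # assumption: membership test plus indexing on every hit

# N -> v5.9.0 time per call in ms, copied from the measurements above.
v5_ms = {1_000: 146, 5_000: 782, 20_000: 3_734, 50_000: 12_508}

per_access_ms = {n: ms / (T * ACCESSES_PER_TOKEN) for n, ms in v5_ms.items()}
per_entry_us = {n: per_access_ms[n] * 1000 / n for n in v5_ms}

for n in v5_ms:
    print(f"N={n:>6}: {per_access_ms[n]:7.3f} ms per access, "
          f"{per_entry_us[n]:.3f} us per added token")
```

Under these assumptions, each access costs about 0.37 ms at N = 1,000 and about 31 ms at N = 50,000. The cost per added token rises only slowly with N, which is what an O(N · log N) rebuild would look like over this range.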
Expected behavior
convert_tokens_to_ids should be O(T) and independent of N, as it was in v4.
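One possible shape for a lookup with this behavior is sketched below on a toy class. CachedLookupTokenizer is hypothetical and is not the transformers implementation. It only reuses the attribute and method names from this report to show the cache being updated together with the decoder, and the per-token lookup probing that cache once.

```python
class CachedLookupTokenizer:
    """Hypothetical toy model; not the transformers implementation."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)          # base vocabulary: token -> id
        self._added_tokens_decoder = {}   # id -> added token
        self._added_tokens_encoder = {}   # added token -> id (the cache)

    def add_tokens(self, new_tokens):
        # Update decoder and cache together so they never drift apart.
        added = 0
        for token in new_tokens:
            if token in self.vocab or token in self._added_tokens_encoder:
                continue
            idx = len(self.vocab) + len(self._added_tokens_decoder)
            self._added_tokens_decoder[idx] = token
            self._added_tokens_encoder[token] = idx
            added += 1
        return added

    def _convert_token_to_id(self, token):
        # Base-vocabulary lookup with an unknown-token fallback.
        return self.vocab.get(token, self.vocab["[UNK]"])

    def _convert_token_to_id_with_added_voc(self, token):
        if token is None:
            return None
        # Single O(1) probe of the cached dict; nothing is sorted or rebuilt.
        idx = self._added_tokens_encoder.get(token)
        if idx is not None:
            return idx
        return self._convert_token_to_id(token)

    def convert_tokens_to_ids(self, tokens):
        return [self._convert_token_to_id_with_added_voc(t) for t in tokens]


tok = CachedLookupTokenizer({"[UNK]": 0, "hello": 1, "world": 2})
tok.add_tokens(["这", "是"])
ids = tok.convert_tokens_to_ids(["hello", "这", "是", "missing"])
print(ids)  # prints: [1, 3, 4, 0]
```

With this layout the cost of converting a sequence depends only on its length, however many tokens have been added.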
System Info
transformers version: 5.9.0
Who can help?
@itazap