Regression: convert_tokens_to_ids is much slower in v5 than v4 for slow tokenizers with many added tokens #46315

@ichizok

System Info

  • transformers version: 5.9.0
  • Platform: Linux-6.8.0-117-generic-x86_64-with-glibc2.39
  • Python version: 3.12.13
  • PyTorch version (accelerator?): 2.8.0+cu128 (NA)

Who can help?

@itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Summary

Since the v5 tokenizer refactor (#40936, "rm slow tokenizers"), PreTrainedTokenizer._convert_token_to_id_with_added_voc looks tokens up via the added_tokens_encoder property, which re-sorts and rebuilds a dict of all added tokens on every access (and is accessed up to twice per token). Because convert_tokens_to_ids calls this method once per token, converting a sequence of T tokens for a tokenizer that has N added tokens now costs O(T · N · log N) instead of the previous O(T).

This is a silent performance regression. It is most severe for tokenizers that carry a large N of added tokens and produce many tokens that hit the added vocabulary — e.g. Chinese BertTokenizerLegacy models whose added_tokens contain many CJK single characters, where essentially every token is an added token, so each one triggers two full rebuild-and-sort passes.

Versions

  • Affected: v5.x including current main (5.10.0.dev0).
  • Last good: v4.x (e.g. 4.57.6).
  • Introduced by: rm slow tokenizers #40936 (commit 05c0e1d39), which renamed tokenization_utils.py → tokenization_python.py and, in that move, changed the lookup from the cached dict to the property.

Root cause

Line references below point at the v5.9.0 tag; the same code is present on current main.

convert_tokens_to_ids loops over every token, calling _convert_token_to_id_with_added_voc once per token:

https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_utils_base.py#L1477

_convert_token_to_id_with_added_voc looks the token up via the added_tokens_encoder property, accessed up to twice per token (token in ... then ...[token]):

https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_python.py#L689-L692

That property re-sorts and rebuilds a dict of all added tokens on every access:

https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_python.py#L457-L463

Its own docstring states the cache lives in self._added_tokens_encoder "for performance optimisation" — but the call site no longer uses that cache.
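To make the access pattern concrete, here is a minimal toy model of it. ToyTokenizer, its rebuilds counter and the first_id parameter are illustrative assumptions, not the transformers code; only the method and property names mirror the ones discussed above.

```python
class ToyTokenizer:
    """Toy model of the access pattern; not the transformers implementation."""

    def __init__(self, added_tokens, first_id=100):
        # id -> token, standing in for the added-tokens decoder.
        self._added_tokens_decoder = {first_id + i: t for i, t in enumerate(added_tokens)}
        self.rebuilds = 0  # how many times the property body ran

    @property
    def added_tokens_encoder(self):
        # Sort by id and build a fresh token -> id dict on every access.
        self.rebuilds += 1
        return {tok: idx for idx, tok in sorted(self._added_tokens_decoder.items())}

    def _convert_token_to_id_with_added_voc(self, token):
        if token in self.added_tokens_encoder:       # first rebuild
            return self.added_tokens_encoder[token]  # second rebuild
        return None  # a real tokenizer would fall back to its base vocab

    def convert_tokens_to_ids(self, tokens):
        return [self._convert_token_to_id_with_added_voc(t) for t in tokens]


tok = ToyTokenizer(["这", "是", "一"])
ids = tok.convert_tokens_to_ids(["这", "是", "x"])
print(ids)           # [100, 101, None]
print(tok.rebuilds)  # 5: two per hit, one per miss
```

Each rebuild touches all N added tokens, so the number of rebuilds (up to 2T) multiplied by the cost of one rebuild gives the O(T · N · log N) behaviour described in the summary.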

In v4 the call site used the cached self._added_tokens_encoder dict directly, giving O(1) per token:

https://github.com/huggingface/transformers/blob/v4.57.6/src/transformers/tokenization_utils.py#L732-L738

self._added_tokens_encoder is still maintained in v5 (initialized in __init__, updated in _add_tokens), so it is in sync with the property and safe to use.
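One possible shape for the fix, or for a user-side workaround until a fix lands, is sketched below. CachedAddedVocLookupMixin and FakeSlowTokenizer are hypothetical names introduced for illustration. The sketch assumes, as stated above, that self._added_tokens_encoder stays in sync with the property, and that the host class provides _convert_token_to_id for the base vocabulary.

```python
class CachedAddedVocLookupMixin:
    """Hypothetical mixin: resolve added tokens from the cached dict, not the property."""

    def _convert_token_to_id_with_added_voc(self, token):
        if token is None:
            return None
        cache = self._added_tokens_encoder  # plain dict, O(1) lookup
        if token in cache:
            return cache[token]
        # Fall back to the base-vocabulary lookup provided by the host class.
        return self._convert_token_to_id(token)


class FakeSlowTokenizer:
    """Stand-in host class so the sketch runs without transformers installed."""

    def __init__(self):
        self._vocab = {"[UNK]": 0, "hello": 1}
        self._added_tokens_encoder = {"<new>": 2}

    @property
    def added_tokens_encoder(self):
        # Fail loudly if the hot path ever reaches the rebuilding property.
        raise AssertionError("hot path must not touch the rebuilding property")

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, self._vocab["[UNK]"])

    def _convert_token_to_id_with_added_voc(self, token):
        # Mimics the v5 call site: goes through the property twice.
        if token in self.added_tokens_encoder:
            return self.added_tokens_encoder[token]
        return self._convert_token_to_id(token)


class PatchedTokenizer(CachedAddedVocLookupMixin, FakeSlowTokenizer):
    pass


tok = PatchedTokenizer()
ids = [tok._convert_token_to_id_with_added_voc(t) for t in ["<new>", "hello", "???"]]
print(ids)  # [2, 1, 0]
```

With the real library, the same mixin would presumably be placed in front of the slow tokenizer class in the bases list (for example BertTokenizerLegacy). That combination is untested here and relies on the attribute and method names being as described in this issue.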

Reproduction

On v5 (affected):

# transformers v5 (e.g. 5.9.0)
import time
from transformers import BertTokenizerLegacy

tok = BertTokenizerLegacy.from_pretrained("google-bert/bert-base-chinese")

# Mimic a model that ships many single-character added tokens.
tok.add_tokens([chr(c) for c in range(0x4E00, 0x4E00 + 5000)])

words = tok.tokenize("这是一个用于测试分词速度的中文句子。" * 20)
print("N (added tokens) =", len(tok._added_tokens_encoder), " T (tokens) =", len(words))

t = time.perf_counter()
for _ in range(100):
    tok.convert_tokens_to_ids(words)
print("convert_tokens_to_ids x100:", time.perf_counter() - t, "s")

The same script on v4 for comparison. v4 has no BertTokenizerLegacy; its slow tokenizer is BertTokenizer, the same Python implementation that was later renamed to BertTokenizerLegacy in v5:

# transformers v4 (e.g. 4.57.6)
import time
from transformers import BertTokenizer  # slow / Python tokenizer

tok = BertTokenizer.from_pretrained("google-bert/bert-base-chinese")
tok.add_tokens([chr(c) for c in range(0x4E00, 0x4E00 + 5000)])

words = tok.tokenize("这是一个用于测试分词速度的中文句子。" * 20)
print("N (added tokens) =", len(tok._added_tokens_encoder), " T (tokens) =", len(words))

t = time.perf_counter()
for _ in range(100):
    tok.convert_tokens_to_ids(words)
print("convert_tokens_to_ids x100:", time.perf_counter() - t, "s")

With the same N and T, v5 runtime scales with N (the number of added tokens) — even though N is irrelevant to converting an already-tokenized sequence — while on v4 the same loop is effectively independent of N.

Example measurements

Time per convert_tokens_to_ids call, converting T = 200 tokens that all hit the added vocabulary (the Chinese worst case). Same machine, CPU, Python 3.12.13 for both; small local vocab:

| N (added tokens) | v4.57.6 (BertTokenizer) | v5.9.0 (BertTokenizerLegacy) | Slowdown |
|---|---|---|---|
| 0 | 0.06 ms | 1.20 ms | ~19× |
| 1,000 | 0.06 ms | 146 ms | ~2,300× |
| 5,000 | 0.11 ms | 782 ms | ~7,400× |
| 20,000 | 0.09 ms | 3,734 ms | ~40,000× |
| 50,000 | 0.07 ms | 12,508 ms | ~190,000× |

v4 is flat (independent of N); v5 grows roughly linearly in N. At N = 20,000, converting a single 200-token sequence takes ~3.7 s on v5 versus ~0.1 ms on v4.
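As a rough sanity check on "roughly linearly", the v5 timings can be normalized by T · N. The short script below only re-uses the numbers reported above; the helper name us_per_pair is made up for this illustration.

```python
T = 200  # tokens per call, from the measurements above

# N (added tokens) -> v5.9.0 time per call in ms, copied from the table.
v5_ms = {1_000: 146.0, 5_000: 782.0, 20_000: 3_734.0, 50_000: 12_508.0}


def us_per_pair(n_added, ms_per_call, n_tokens=T):
    """Microseconds spent per (converted token, added token) pair."""
    return ms_per_call * 1_000 / (n_tokens * n_added)


costs = {n: us_per_pair(n, ms) for n, ms in v5_ms.items()}
for n, cost in costs.items():
    print(f"N={n:>6,}: {cost:.2f} us per (token, added token) pair")
```

The normalized cost rises from about 0.73 µs at N = 1,000 to about 1.25 µs at N = 50,000. A cost that was exactly linear in N would stay constant here, so the slow rise is consistent with the extra log N factor from sorting (and possibly cache effects), although these four points alone cannot separate the two.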

Expected behavior

convert_tokens_to_ids should be O(T) and independent of N, as it was in v4.
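A regression test for this does not have to rely on timing. One deterministic option is to count how often the rebuilding property is read during a conversion. The helper count_property_reads and the ToySlowTokenizer class below are hypothetical, written only to show the idea without depending on transformers.

```python
from unittest import mock


def count_property_reads(cls, prop_name, action):
    """Run action() and return how many times the property cls.prop_name was read."""
    original = getattr(cls, prop_name)  # the property object itself
    reads = 0

    def counting_getter(self):
        nonlocal reads
        reads += 1
        return original.fget(self)

    # Temporarily swap in a counting property; mock restores the class on exit.
    with mock.patch.object(cls, prop_name, property(counting_getter)):
        action()
    return reads


class ToySlowTokenizer:
    """Hypothetical stand-in that re-uses the attribute names from this issue."""

    def __init__(self, added):
        self._added_tokens_encoder = {tok: i for i, tok in enumerate(added)}

    @property
    def added_tokens_encoder(self):
        # Rebuilds a sorted copy on every access, like the property in question.
        return dict(sorted(self._added_tokens_encoder.items(), key=lambda kv: kv[1]))

    def convert_tokens_to_ids(self, tokens):
        # Hot path reads the cached dict, never the property.
        return [self._added_tokens_encoder.get(t) for t in tokens]


tok = ToySlowTokenizer(["你", "好"])
reads = count_property_reads(
    ToySlowTokenizer,
    "added_tokens_encoder",
    lambda: tok.convert_tokens_to_ids(["你", "好", "你"]),
)
assert reads == 0  # conversion cost cannot depend on N via the property
```

Asserting a read count of zero (or at most a constant per call) pins down the "independent of N" requirement without flaky wall-clock thresholds.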
