Regression: `convert_tokens_to_ids` is much slower in v5 than v4 for slow tokenizers with many added tokens

### System Info

- `transformers` version: 5.9.0
- Platform: Linux-6.8.0-117-generic-x86_64-with-glibc2.39
- Python version: 3.12.13
- PyTorch version (accelerator?): 2.8.0+cu128 (NA)

### Who can help?

@itazap 

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Summary

Since the v5 tokenizer refactor (#40936, "rm slow tokenizers"), `PreTrainedTokenizer._convert_token_to_id_with_added_voc` looks tokens up via the `added_tokens_encoder` property. That property re-sorts and rebuilds a dict of all added tokens on every access, and it is accessed up to twice per token. Because `convert_tokens_to_ids` calls this method once per token, converting a sequence of `T` tokens with a tokenizer that has `N` added tokens now costs O(T · N · log N) instead of the previous O(T).

This is a silent performance regression. It is most severe for tokenizers that carry many added tokens and whose output mostly hits the added vocabulary. One example is Chinese `BertTokenizerLegacy` models whose `added_tokens` contain many single CJK characters: essentially every token is an added token, so each one triggers two full rebuild-and-sort passes.

### Versions

- Affected: v5.x including current `main` (`5.10.0.dev0`).
- Last good: v4.x (e.g. `4.57.6`).
- Introduced by: #40936 (commit `05c0e1d39`), which renamed `tokenization_utils.py` → `tokenization_python.py` and, in that move, changed the lookup from the cached dict to the property.

### Root cause

Line references below point at the `v5.9.0` tag; the same code is present on current `main`.

`convert_tokens_to_ids` loops over every token, calling `_convert_token_to_id_with_added_voc` once per token:

https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_utils_base.py#L1477

`_convert_token_to_id_with_added_voc` looks the token up via the `added_tokens_encoder` property, accessed up to twice per token (`token in ...` then `...[token]`):

https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_python.py#L689-L692

That property re-sorts and rebuilds a dict of all added tokens on every access:

https://github.com/huggingface/transformers/blob/v5.9.0/src/transformers/tokenization_python.py#L457-L463

Its own docstring states the cache lives in `self._added_tokens_encoder` "for performance optimisation", but the call site no longer uses that cache.

In v4 the call site used the cached `self._added_tokens_encoder` dict directly, giving O(1) per token:

https://github.com/huggingface/transformers/blob/v4.57.6/src/transformers/tokenization_utils.py#L732-L738

`self._added_tokens_encoder` is still maintained in v5 (initialized in `__init__`, updated in `_add_tokens`), so it is in sync with the property and safe to use.
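To make the difference between the two call sites concrete, here is a minimal toy model of the pattern. `ToyTokenizer`, `lookup_via_property`, `lookup_via_cache`, and the `rebuilds` counter are hypothetical names invented for illustration. This is a sketch of the lookup pattern described above, not the actual `transformers` code:

```python
class ToyTokenizer:
    """Hypothetical stand-in for the lookup pattern; not the transformers code."""

    def __init__(self, added_tokens):
        # id -> token, playing the role of the added-tokens decoder mapping.
        self._added_tokens_decoder = dict(enumerate(added_tokens))
        # token -> id cache, built once and kept in sync when tokens are added.
        self._added_tokens_encoder = {
            tok: idx for idx, tok in self._added_tokens_decoder.items()
        }
        self.rebuilds = 0  # counts how often the property rebuilds its dict

    @property
    def added_tokens_encoder(self):
        # Sorts and rebuilds the full mapping on every access: O(N log N).
        self.rebuilds += 1
        return {tok: idx for idx, tok in sorted(self._added_tokens_decoder.items())}

    def lookup_via_property(self, token):
        # v5-style call site: membership test, then indexing = two rebuilds.
        if token in self.added_tokens_encoder:
            return self.added_tokens_encoder[token]
        return None

    def lookup_via_cache(self, token):
        # v4-style call site: a single O(1) lookup in the cached dict.
        return self._added_tokens_encoder.get(token)


toy = ToyTokenizer([chr(c) for c in range(0x4E00, 0x4E00 + 1000)])
tokens = ["一", "丁", "七"] * 10  # T = 30 tokens, all in the added vocabulary

via_property = [toy.lookup_via_property(t) for t in tokens]
rebuilds_via_property = toy.rebuilds
via_cache = [toy.lookup_via_cache(t) for t in tokens]

print(via_property == via_cache)             # both paths return the same ids
print(rebuilds_via_property)                 # 2 rebuilds per token -> 60
print(toy.rebuilds - rebuilds_via_property)  # the cache path triggers none -> 0
```

Both paths return identical ids. The only difference is that the property path pays for a full sort-and-rebuild twice per token, which is where the dependence on `N` comes from.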

### Reproduction

On v5 (affected):

```python
# transformers v5 (e.g. 5.9.0)
import time
from transformers import BertTokenizerLegacy

tok = BertTokenizerLegacy.from_pretrained("google-bert/bert-base-chinese")

# Mimic a model that ships many single-character added tokens.
tok.add_tokens([chr(c) for c in range(0x4E00, 0x4E00 + 5000)])

words = tok.tokenize("这是一个用于测试分词速度的中文句子。" * 20)
print("N (added tokens) =", len(tok._added_tokens_encoder), " T (tokens) =", len(words))

t = time.perf_counter()
for _ in range(100):
    tok.convert_tokens_to_ids(words)
print("convert_tokens_to_ids x100:", time.perf_counter() - t, "s")
```

The same script on v4 for comparison. v4 has no `BertTokenizerLegacy`; its slow tokenizer is `BertTokenizer`, the same Python implementation that was later renamed to `BertTokenizerLegacy` in v5:

```python
# transformers v4 (e.g. 4.57.6)
import time
from transformers import BertTokenizer  # slow / Python tokenizer

tok = BertTokenizer.from_pretrained("google-bert/bert-base-chinese")
tok.add_tokens([chr(c) for c in range(0x4E00, 0x4E00 + 5000)])

words = tok.tokenize("这是一个用于测试分词速度的中文句子。" * 20)
print("N (added tokens) =", len(tok._added_tokens_encoder), " T (tokens) =", len(words))

t = time.perf_counter()
for _ in range(100):
    tok.convert_tokens_to_ids(words)
print("convert_tokens_to_ids x100:", time.perf_counter() - t, "s")
```

With the same `N` and `T`, v5 runtime scales with `N` (the number of added tokens) — even though `N` is irrelevant to converting an already-tokenized sequence — while on v4 the same loop is effectively independent of `N`.

### Example measurements

Time per `convert_tokens_to_ids` call, converting `T = 200` tokens that all hit the added vocabulary (the Chinese worst case). Same machine, CPU, Python 3.12.13 for both; small local vocab:

| N (added tokens) | v4.57.6 (`BertTokenizer`) | v5.9.0 (`BertTokenizerLegacy`) | slowdown |
|---:|---:|---:|---:|
| 0 | 0.06 ms | 1.20 ms | ~19× |
| 1,000 | 0.06 ms | 146 ms | ~2,300× |
| 5,000 | 0.11 ms | 782 ms | ~7,400× |
| 20,000 | 0.09 ms | 3,734 ms | ~40,000× |
| 50,000 | 0.07 ms | 12,508 ms | ~190,000× |

v4 is flat (independent of `N`); v5 grows roughly linearly in `N`. At `N = 20,000`, converting a single 200-token sequence takes ~3.7 s on v5 versus ~0.1 ms on v4.
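As a rough sanity check, the v5 figures in the table can be broken down per rebuild. The sketch below assumes that every token triggers exactly two property accesses (the worst case described in the summary) and that essentially all of the measured time is spent in those rebuilds. Both are simplifying assumptions, not measurements:

```python
# Figures copied from the table above: N -> v5.9.0 time per call, in ms.
v5_ms_per_call = {1_000: 146, 5_000: 782, 20_000: 3_734, 50_000: 12_508}

T = 200                 # tokens per call, all hitting the added vocabulary
ACCESSES_PER_TOKEN = 2  # assumption: membership test plus indexing

rebuilds_per_call = T * ACCESSES_PER_TOKEN

for n, ms in v5_ms_per_call.items():
    ms_per_rebuild = ms / rebuilds_per_call
    us_per_entry = ms_per_rebuild * 1_000 / n
    print(f"N={n:>6,}: {ms_per_rebuild:6.2f} ms per rebuild, {us_per_entry:.2f} us per entry")
```

Under these assumptions, each call performs 400 rebuilds, and the implied cost per added-token entry stays at a fraction of a microsecond across all four rows. That is consistent with each property access doing work proportional to `N`.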

### Expected behavior

`convert_tokens_to_ids` should be O(T) and independent of `N`, as it was in v4.
