BPE cache: per-thread read-through cache to avoid RwLock atomics on hits #2028

Merged
ArthurZucker merged 1 commit into huggingface:main from sebpop:p1
Apr 23, 2026

Conversation

@sebpop
Contributor

@sebpop sebpop commented Apr 22, 2026

On origin/main each BPE cache hit acquires a read-lock on a shared
`Cache<String, Word>`, emitting `ldadd4_rel` LSE atomics on aarch64
that dominate `tokenize_with_cache` under parallel `encode_batch`.

Remove the shared `RwLock<AHashMap>` from the BPE hot path entirely.
`BPE::cache` becomes an `Option<BpeCache>` carrying only an
`AtomicU64` generation id and a capacity; the actual cache lives in a
thread-local `AHashMap<u64, AHashMap<String, Word>>` where the outer
key is the `BpeCache::id`. Lookups and inserts then need no atomic
synchronization: per-thread access, no `RwLock`, no `Mutex`.
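The two-level thread-local lookup described above can be sketched as follows. This is a minimal illustration with hypothetical names (`BPE_CACHE`, `cache_get`, `cache_set`) using a plain `HashMap` and bare token ids as the cached value; the PR itself caches `Word` values in an `AHashMap`:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Sketch: one map per thread, keyed first by the cache-instance id,
// then by the input sequence. Not the PR's exact code.
thread_local! {
    static BPE_CACHE: RefCell<HashMap<u64, HashMap<String, Vec<u32>>>> =
        RefCell::new(HashMap::new());
}

// Hit path: two plain hash lookups, no RwLock, no atomics.
fn cache_get(id: u64, seq: &str) -> Option<Vec<u32>> {
    BPE_CACHE.with(|c| c.borrow().get(&id).and_then(|m| m.get(seq).cloned()))
}

// Miss path: insert into this thread's map, bounded by the capacity field.
fn cache_set(id: u64, capacity: usize, seq: String, word: Vec<u32>) {
    BPE_CACHE.with(|c| {
        let mut outer = c.borrow_mut();
        let inner = outer.entry(id).or_default();
        if inner.len() < capacity {
            inner.insert(seq, word);
        }
    });
}
```

Because the map is `thread_local!`, each rayon worker mutates only its own copy, which is what removes the shared-lock traffic from the hit path.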

Multiple `BPE` instances sharing the same rayon worker thread never
see each other's entries because each `BpeCache` gets a distinct id
assigned from a process-wide `AtomicU64` counter at construction.
`BPE::clear_cache()` bumps the id, invalidating every thread's
entries for this BPE in a single `fetch_add`; `BPE::resize_cache(n)`
updates the capacity field. `utils/cache.rs` is untouched;
Unigram keeps using `Cache<K,V>` unchanged.
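The generation-id scheme can be sketched like this. Names (`NEXT_CACHE_ID`, `clear`, `id`) are illustrative assumptions, not the PR's exact code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter handing out fresh generation ids.
static NEXT_CACHE_ID: AtomicU64 = AtomicU64::new(0);

pub struct BpeCache {
    id: AtomicU64,
    pub capacity: usize,
}

impl BpeCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            id: AtomicU64::new(NEXT_CACHE_ID.fetch_add(1, Ordering::Relaxed)),
            capacity,
        }
    }

    // Current generation id; used as the outer key in the thread-local map.
    pub fn id(&self) -> u64 {
        self.id.load(Ordering::Relaxed)
    }

    // One fetch_add invalidates this instance's entries on every thread:
    // lookups under the old id simply stop matching. No per-thread work.
    pub fn clear(&self) {
        self.id
            .store(NEXT_CACHE_ID.fetch_add(1, Ordering::Relaxed), Ordering::Relaxed);
    }
}
```

Stale entries under an old id stay in the thread-local maps until evicted, which is the price paid for an O(1) clear.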

New unit test `test_cache_is_per_bpe_instance` builds two BPEs with
different merges and confirms that alternating tokenization on the
same thread returns each model's own tokenization. Without the
per-instance keying the test panics on `vocab_r[&id]` with a
wrong-model token id.

`cargo test --lib --features http`: 201 passed, 0 failed.

Perf evidence on Vera (88-core Olympus, 176 logical),
bpe_benchmark/bpe-encode/BPE GPT2 encode batch at 88T,
perf record -g --call-graph fp -F 4999:

                                       BASE    PATCHED
  ModelWrapper::tokenize               60.79%   ~5%
  __aarch64_ldadd4_rel (RwLock read)    9.47%  <0.01%   (eliminated)
  crossbeam_epoch::try_advance          8.31%  25.93%   (now dominant)
  crossbeam_epoch::with_handle          6.03%  21.41%
  rayon_core::WorkerThread::wait        3.05%   8.40%

The BPE cache lock is gone from the hot path; the remaining ceiling
at 88T is rayon::broadcast / crossbeam-epoch dispatch, not BPE
synchronization.

Throughput on Vera, bpe-encode/BPE GPT2 encode batch
(data/big.txt, encode_batch through the full post-processor):

  threads  before         after          change
  -------  ------         ------         ------
  1T       3.92 MiB/s     3.98 MiB/s     +1.5% (noise)
  88T      12.95 MiB/s    20.97 MiB/s    +62%
  176T     12.79 MiB/s    18.83 MiB/s    +47%

@ArthurZucker ArthurZucker requested a review from McPatate April 22, 2026 13:41
@ArthurZucker
Collaborator

/benchmark

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread on tokenizers/src/models/bpe/model.rs (Outdated)

    let ret: Vec<Token> = self.word_to_tokens(&word).collect();
    if sequence.len() < MAX_LENGTH {
  -     cache.set(sequence.to_owned(), word);
  +     cache.set(sequence.to_owned(), word.clone());
Member


we never cache.get, why do we set it?

Contributor Author


Right, once we check only the thread-local map, the shared Cache is write-only.
I removed this.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this dead code also gives a minor additional speedup at 88T:

  threads  before         version 0      version 1
  -------  ------         ------         ------
  1T       3.92 MiB/s     3.86 MiB/s     3.98 MiB/s
  88T      12.95 MiB/s    20.51 MiB/s    20.97 MiB/s
  176T     12.79 MiB/s    18.73 MiB/s    18.83 MiB/s

Member


I wonder if it wouldn't be better to have a specific struct `BpeCache` (or something like that) in the BPE model file, rather than modifying the global cache implementation for something that is, at the moment, specific to BPE. Also, since in BPE we never read the global/shared cache and have a local cache impl over there, it makes sense for the struct to look like this:

struct BpeCache {
  id: AtomicU64,
  capacity: usize,
}

wdyt?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.
I've refactored along the lines you suggested: utils/cache.rs is back to main and there's a small local BpeCache { id: AtomicU64, capacity: usize } in models/bpe/model.rs. The miss path no longer writes to any shared structure.

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM great addition! 🤗

@ArthurZucker ArthurZucker merged commit bcdd25b into huggingface:main Apr 23, 2026
33 of 34 checks passed


4 participants