BPE cache: per-thread read-through cache to avoid RwLock atomics on hits #2028

Merged
ArthurZucker merged 1 commit into huggingface:main from sebpop:p1
Apr 23, 2026

Conversation

@sebpop
Contributor

@sebpop sebpop commented Apr 22, 2026

On origin/main each BPE cache hit acquires a read-lock on a shared
`Cache<String, Word>`, emitting `ldadd4_rel` LSE atomics on aarch64
that dominate `tokenize_with_cache` under parallel `encode_batch`.

Remove the shared `RwLock<AHashMap>` from the BPE hot path entirely.
`BPE::cache` becomes an `Option<BpeCache>` carrying only an
`AtomicU64` generation id and a capacity; the actual cache lives in a
thread-local `AHashMap<u64, AHashMap<String, Word>>` where the outer
key is the `BpeCache::id`. Lookups and inserts then need no atomic
synchronization: per-thread access, no `RwLock`, no `Mutex`.
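The two-level thread-local lookup described above can be sketched as follows. This is a minimal illustration with hypothetical names (`BPE_CACHE`, `cache_get`, `cache_set`) using a plain `HashMap` and bare token ids as the cached value; the PR itself caches `Word` values in an `AHashMap`:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Sketch: one map per thread, keyed first by the cache-instance id,
// then by the input sequence. Not the PR's exact code.
thread_local! {
    static BPE_CACHE: RefCell<HashMap<u64, HashMap<String, Vec<u32>>>> =
        RefCell::new(HashMap::new());
}

// Hit path: two plain hash lookups, no RwLock, no atomics.
fn cache_get(id: u64, seq: &str) -> Option<Vec<u32>> {
    BPE_CACHE.with(|c| c.borrow().get(&id).and_then(|m| m.get(seq).cloned()))
}

// Miss path: insert into this thread's map, bounded by the capacity field.
fn cache_set(id: u64, capacity: usize, seq: String, word: Vec<u32>) {
    BPE_CACHE.with(|c| {
        let mut outer = c.borrow_mut();
        let inner = outer.entry(id).or_default();
        if inner.len() < capacity {
            inner.insert(seq, word);
        }
    });
}
```

Because the map is `thread_local!`, each rayon worker mutates only its own copy, which is what removes the shared-lock traffic from the hit path.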

Multiple `BPE` instances sharing the same rayon worker thread never
see each other's entries because each `BpeCache` gets a distinct id
assigned from a process-wide `AtomicU64` counter at construction.
`BPE::clear_cache()` bumps the id, invalidating every thread's
entries for this BPE in a single `fetch_add`; `BPE::resize_cache(n)`
updates the capacity field. `utils/cache.rs` is untouched;
Unigram keeps using `Cache<K,V>` unchanged.
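The generation-id scheme can be sketched like this. Names (`NEXT_CACHE_ID`, `clear`, `id`) are illustrative assumptions, not the PR's exact code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter handing out fresh generation ids.
static NEXT_CACHE_ID: AtomicU64 = AtomicU64::new(0);

pub struct BpeCache {
    id: AtomicU64,
    pub capacity: usize,
}

impl BpeCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            id: AtomicU64::new(NEXT_CACHE_ID.fetch_add(1, Ordering::Relaxed)),
            capacity,
        }
    }

    // Current generation id; used as the outer key in the thread-local map.
    pub fn id(&self) -> u64 {
        self.id.load(Ordering::Relaxed)
    }

    // One fetch_add invalidates this instance's entries on every thread:
    // lookups under the old id simply stop matching. No per-thread work.
    pub fn clear(&self) {
        self.id
            .store(NEXT_CACHE_ID.fetch_add(1, Ordering::Relaxed), Ordering::Relaxed);
    }
}
```

Stale entries under an old id stay in the thread-local maps until evicted, which is the price paid for an O(1) clear.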

New unit test `test_cache_is_per_bpe_instance` builds two BPEs with
different merges and confirms that alternating tokenization on the
same thread returns each model's own tokenization. Without the
per-instance keying the test panics on `vocab_r[&id]` with a
wrong-model token id.

`cargo test --lib --features http`: 201 passed, 0 failed.

Perf evidence on Vera (88-core Olympus, 176 logical),
bpe_benchmark/bpe-encode/BPE GPT2 encode batch at 88T,
perf record -g --call-graph fp -F 4999:

                                       BASE    PATCHED
  ModelWrapper::tokenize               60.79%   ~5%
  __aarch64_ldadd4_rel (RwLock read)    9.47%  <0.01%   (eliminated)
  crossbeam_epoch::try_advance          8.31%  25.93%   (now dominant)
  crossbeam_epoch::with_handle          6.03%  21.41%
  rayon_core::WorkerThread::wait        3.05%   8.40%

The BPE cache lock is gone from the hot path; the remaining ceiling
at 88T is rayon::broadcast / crossbeam-epoch dispatch, not BPE
synchronization.

Throughput on Vera, bpe-encode/BPE GPT2 encode batch
(data/big.txt, encode_batch through the full post-processor):

  threads  before         after          change
  -------  ------         ------         ------
  1T       3.92 MiB/s     3.98 MiB/s     +1.5% (noise)
  88T      12.95 MiB/s    20.97 MiB/s    +62%
  176T     12.79 MiB/s    18.83 MiB/s    +47%

@ArthurZucker ArthurZucker requested a review from McPatate April 22, 2026 13:41
@ArthurZucker
Collaborator

/benchmark

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread on tokenizers/src/models/bpe/model.rs (Outdated)

    let ret: Vec<Token> = self.word_to_tokens(&word).collect();
    if sequence.len() < MAX_LENGTH {
  -     cache.set(sequence.to_owned(), word);
  +     cache.set(sequence.to_owned(), word.clone());
Member


we never cache.get, why do we set it?

Contributor Author


Right, once we check only the thread-local map, the shared Cache is write-only.
I removed this.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this dead code also gives a minor additional speedup at 88T:

  threads  before         version 0      version 1
  -------  ------         ------         ------
  1T       3.92 MiB/s     3.86 MiB/s     3.98 MiB/s
  88T      12.95 MiB/s    20.51 MiB/s    20.97 MiB/s
  176T     12.79 MiB/s    18.73 MiB/s    18.83 MiB/s

Member


I wonder if it wouldn't be better to have a specific struct `BpeCache` (or something like that) in the BPE model file, rather than modifying the global cache implementation for something that is, at the moment, specific to BPE. Also, since in BPE we never read the global/shared cache and have a local cache impl over there, it makes sense for the struct to look like this:

struct BpeCache {
  id: AtomicU64,
  capacity: usize,
}

wdyt?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.
I've refactored along the lines you suggested: utils/cache.rs is back to main and there's a small local BpeCache { id: AtomicU64, capacity: usize } in models/bpe/model.rs. The miss path no longer writes to any shared structure.

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM great addition! 🤗

@ArthurZucker ArthurZucker merged commit bcdd25b into huggingface:main Apr 23, 2026
33 of 34 checks passed


4 participants