BPE cache: per-thread read-through cache to avoid RwLock atomics on hits#2028
ArthurZucker merged 1 commit into huggingface:main
Conversation
/benchmark
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
    let ret: Vec<Token> = self.word_to_tokens(&word).collect();
    if sequence.len() < MAX_LENGTH {
-       cache.set(sequence.to_owned(), word);
+       cache.set(sequence.to_owned(), word.clone());
we never cache.get, why do we set it?
Right, once we check only the thread-local map, the shared Cache is write-only.
I removed this.
Removing this dead code also has a minor speedup effect at 88T:

threads   before        version 0     version 1
-------   -----------   -----------   -----------
1T        3.92 MiB/s    3.86 MiB/s    3.98 MiB/s
88T       12.95 MiB/s   20.51 MiB/s   20.97 MiB/s
176T      12.79 MiB/s   18.73 MiB/s   18.83 MiB/s
I wonder if it wouldn't be better to have a specific struct BpeCache (or something like that) in the BPE model file, rather than modifying the global cache implementation for something that is tied to BPE at the moment. Also, since in BPE we never read the global/shared cache and have the local cache impl over there, it makes sense for the struct to look like this:

struct BpeCache {
    id: AtomicU64,
    capacity: usize,
}

wdyt?
Right.
I've refactored along the lines you suggested: utils/cache.rs is back to main and there's a small local BpeCache { id: AtomicU64, capacity: usize } in models/bpe/model.rs. The miss path no longer writes to any shared structure.
On origin/main each BPE cache hit acquires a read-lock on a shared
`Cache<String, Word>`, emitting `ldadd4_rel` LSE atomics on aarch64
that dominate `tokenize_with_cache` under parallel `encode_batch`.
Remove the shared `RwLock<AHashMap>` from the BPE hot path entirely.
`BPE::cache` becomes an `Option<BpeCache>` carrying only an
`AtomicU64` generation id and a capacity; the actual cache lives in a
thread-local `AHashMap<u64, AHashMap<String, Word>>` where the outer
key is the `BpeCache::id`. Lookups and inserts then need no atomic
synchronization: per-thread access, no `RwLock`, no `Mutex`.
Multiple `BPE` instances sharing the same rayon worker thread never
see each other's entries because each `BpeCache` gets a distinct id
assigned from a process-wide `AtomicU64` counter at construction.
`BPE::clear_cache()` bumps the id, invalidating every thread's
entries for this BPE in a single `fetch_add`; `BPE::resize_cache(n)`
updates the capacity field. `utils/cache.rs` is untouched;
Unigram keeps using `Cache<K,V>` unchanged.
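The scheme described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: std `HashMap` stands in for `AHashMap`, `Vec<u32>` stands in for `Word`, and the `NEXT_ID` counter name is hypothetical.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter handing out a distinct generation id per cache.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

thread_local! {
    // The actual cache data: outer key is the BpeCache id, so instances
    // sharing a worker thread never see each other's entries.
    static LOCAL: RefCell<HashMap<u64, HashMap<String, Vec<u32>>>> =
        RefCell::new(HashMap::new());
}

// Per-instance handle: carries no data, only an id and a capacity bound.
struct BpeCache {
    id: AtomicU64,
    capacity: usize,
}

impl BpeCache {
    fn new(capacity: usize) -> Self {
        BpeCache {
            id: AtomicU64::new(NEXT_ID.fetch_add(1, Ordering::Relaxed)),
            capacity,
        }
    }

    // clear_cache(): move to a fresh id; every thread's old entries are
    // invalidated at once because the old id is never looked up again.
    fn clear(&self) {
        self.id
            .store(NEXT_ID.fetch_add(1, Ordering::Relaxed), Ordering::Relaxed);
    }

    // Hit path: one relaxed id load, then a purely thread-local lookup.
    // No RwLock, no Mutex.
    fn get(&self, key: &str) -> Option<Vec<u32>> {
        let id = self.id.load(Ordering::Relaxed);
        LOCAL.with(|m| m.borrow().get(&id).and_then(|inner| inner.get(key).cloned()))
    }

    // Miss path: insert into this thread's map, bounded by capacity.
    fn set(&self, key: String, value: Vec<u32>) {
        let id = self.id.load(Ordering::Relaxed);
        LOCAL.with(|m| {
            let inner = m.borrow_mut().entry(id).or_default().len();
            let mut outer = m.borrow_mut();
            let _ = inner; // capacity check below uses a fresh borrow
            let map = outer.entry(id).or_default();
            if map.len() < self.capacity {
                map.insert(key, value);
            }
        });
    }
}

fn main() {
    let a = BpeCache::new(1024);
    let b = BpeCache::new(1024);
    a.set("low".into(), vec![1, 2]);
    b.set("low".into(), vec![9]);
    assert_eq!(a.get("low"), Some(vec![1, 2]));
    assert_eq!(b.get("low"), Some(vec![9]));
    a.clear();
    assert_eq!(a.get("low"), None);
    assert_eq!(b.get("low"), Some(vec![9]));
    println!("ok");
}
```

One caveat of this sketch: after `clear`, the stale inner map keyed by the old id lingers in each thread until some eviction policy reclaims it; the real implementation presumably bounds this via the capacity field.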
New unit test `test_cache_is_per_bpe_instance` builds two BPEs with
different merges and confirms that alternating tokenization on the
same thread returns each model's own tokenization. Without the
per-instance keying the test panics on `vocab_r[&id]` with a
wrong-model token id.
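The per-instance keying that this test relies on can be demonstrated in isolation. Again a sketch with std `HashMap`; ids 0 and 1 play the roles of the two BPE models alternating on the same worker thread, and the `set`/`get` helpers are hypothetical.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // Outer key: cache generation id; inner map: sequence -> token id.
    static CACHE: RefCell<HashMap<u64, HashMap<String, u32>>> =
        RefCell::new(HashMap::new());
}

fn set(id: u64, key: &str, val: u32) {
    CACHE.with(|c| {
        c.borrow_mut().entry(id).or_default().insert(key.to_owned(), val);
    });
}

fn get(id: u64, key: &str) -> Option<u32> {
    CACHE.with(|c| c.borrow().get(&id).and_then(|m| m.get(key).copied()))
}

fn main() {
    // Two "models" alternate on one thread with the same surface string.
    set(0, "hello", 42);
    set(1, "hello", 7);
    assert_eq!(get(0, "hello"), Some(42)); // model 0 sees its own entry
    assert_eq!(get(1, "hello"), Some(7));  // no cross-contamination
    println!("ok");
}
```

Without the outer id key, the second `set` would overwrite the first and one model would receive the other's token id, which is exactly the `vocab_r[&id]` panic the test guards against.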
cargo test --lib --features http: 201 passed, 0 failed.
Perf evidence on Vera (88-core Olympus, 176 logical),
`bpe_benchmark`/`bpe-encode/BPE GPT2 encode batch` at 88T,
`perf record -g --call-graph fp -F 4999`:
                                     BASE     PATCHED
ModelWrapper::tokenize               60.79%   ~5%
__aarch64_ldadd4_rel (RwLock read)    9.47%   <0.01% (eliminated)
crossbeam_epoch::try_advance          8.31%   25.93% (now dominant)
crossbeam_epoch::with_handle          6.03%   21.41%
rayon_core::WorkerThread::wait        3.05%    8.40%
The BPE cache lock is gone from the hot path; the remaining ceiling
at 88T is rayon::broadcast / crossbeam-epoch dispatch, not BPE
synchronization.
Throughput on Vera, `bpe-encode/BPE GPT2 encode batch`
(data/big.txt, encode_batch through the full post-processor):
threads   before        after         change
-------   -----------   -----------   -------------
1T        3.92 MiB/s    3.98 MiB/s    +1.5% (noise)
88T       12.95 MiB/s   20.97 MiB/s   +62%
176T      12.79 MiB/s   18.83 MiB/s   +47%
ArthurZucker left a comment
LGTM great addition! 🤗