Batch encode: simple lock-free scheduler #2044
Open
sebpop wants to merge 1 commit into huggingface:main from
Conversation
Replace the batch encode parallel-iterator `collect` path with a compact scoped scheduler that uses a single cache-line-padded atomic counter to hand out contiguous windows of work. This keeps the throughput win with less code than the standalone batch utility while preserving the fork-safety parallelism marker for direct Rayon use.

The speedup comes from avoiding Rayon's per-item iterator plumbing and mutex-backed worker wakeups in this hot loop. On aarch64 those wakeups show up as LSE atomic helper calls such as `__aarch64_cas4_acq` and `__aarch64_ldadd8_acq_rel`.

The scheduler claims contiguous windows, so one producer thread fills a run of adjacent input and result slots before another thread touches the next run. Completed windows flow through the shared L3 cache as whole runs, avoiding producer/consumer ping-pong where two cores repeatedly alternate ownership of the same L1 cache line.

Tests cover the window sizing policy, ordered output for all three batch encode entry points, parallel error propagation, and the parallelism-used marker.

Verification:
- `cargo fmt --manifest-path tokenizers/Cargo.toml`
- `cargo test --manifest-path tokenizers/Cargo.toml --lib --features http`: 205 passed
- On Vera (88 Rayon threads), bpe-encode/BPE GPT2 encode batch: LSE samples dropped from ~5.5% on main to ~0.9% on p2-prime; throughput measured at main 20.67 MiB/s, p2 24.53 MiB/s, p2-prime 23.55 MiB/s.
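To illustrate the idea, here is a minimal sketch of a scoped scheduler driven by one padded atomic counter. This is not the PR's code: `parallel_map`, `PaddedCounter`, and the fixed `window` parameter are illustrative names, and the real change derives the window size from a sizing policy and writes results in place rather than stitching at the end.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Align the counter to its own cache line so claims from different
// threads do not false-share with neighboring data.
#[repr(align(128))]
struct PaddedCounter(AtomicUsize);

// Map `f` over `inputs` in parallel; each fetch_add claims a contiguous
// window of indices, so one thread fills a whole run of adjacent slots.
fn parallel_map<T: Sync, R: Send>(
    inputs: &[T],
    window: usize,
    f: impl Fn(&T) -> R + Sync,
) -> Vec<R> {
    assert!(window > 0, "a zero-sized window would never make progress");
    let counter = PaddedCounter(AtomicUsize::new(0));
    let n_threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let mut pieces: Vec<(usize, Vec<R>)> = thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|_| {
                let counter = &counter;
                let f = &f;
                s.spawn(move || {
                    let mut local = Vec::new();
                    loop {
                        // One lock-free fetch_add hands out the next window.
                        let start = counter.0.fetch_add(window, Ordering::Relaxed);
                        if start >= inputs.len() {
                            break;
                        }
                        let end = (start + window).min(inputs.len());
                        local.push((start, inputs[start..end].iter().map(f).collect()));
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    // Stitch the windows back into input order.
    pieces.sort_by_key(|(start, _)| *start);
    pieces.into_iter().flat_map(|(_, v)| v).collect()
}
```

Because the only shared mutable state is the single counter, threads never block on a mutex to obtain work, and each claimed window stays contiguous in memory.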