Batch encode: simple lock-free scheduler#2044

Open
sebpop wants to merge 1 commit into huggingface:main from sebpop:p2-prime

Conversation

Contributor

@sebpop sebpop commented Apr 28, 2026

Replace the batch encode parallel-iterator collect path with a compact scoped scheduler that uses one cache-line-padded atomic counter to hand out contiguous windows of work. This keeps the throughput win with less code than the standalone batch utility while preserving the fork-safety parallelism marker for direct Rayon use.

The speedup comes from avoiding Rayon per-item iterator plumbing and mutex-backed worker wakeups in this hot loop. On aarch64 those wakeups show up as LSE atomic helper calls such as __aarch64_cas4_acq and __aarch64_ldadd8_acq_rel.

The scheduler claims contiguous windows, so one producer thread fills a run of adjacent input and result slots before another thread touches the next run. Completed windows flow through the shared L3 cache as whole runs, avoiding producer/consumer ping-pong, where two cores repeatedly alternate ownership of the same L1 cache line.
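The windowed claiming described above can be sketched in safe Rust as follows. This is an illustrative simplification, not the patch itself: the names (`PaddedCounter`, `scoped_windows`) and the final stitch-by-sort step are assumptions; the actual scheduler presumably writes into preallocated result slots directly. The core idea is the same: a single cache-line-aligned atomic cursor, and one `fetch_add` per claim hands a whole contiguous window to a thread.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Align the cursor to its own cache line so claims from different
// threads do not false-share with neighboring data; 128 bytes covers
// common x86_64/aarch64 prefetch-pair sizes.
#[repr(align(128))]
struct PaddedCounter(AtomicUsize);

/// Map `work` over `items` with `threads` scoped workers, each claiming
/// contiguous windows of `window` slots from one atomic cursor.
/// Output order matches input order.
fn scoped_windows<T, R, F>(items: &[T], window: usize, threads: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let cursor = PaddedCounter(AtomicUsize::new(0));
    let mut claimed: Vec<(usize, Vec<R>)> = thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|_| {
                s.spawn(|| {
                    let mut local = Vec::new();
                    loop {
                        // One fetch_add claims a whole contiguous window;
                        // no mutex, no per-item atomic traffic.
                        let start = cursor.0.fetch_add(window, Ordering::Relaxed);
                        if start >= items.len() {
                            break;
                        }
                        let end = (start + window).min(items.len());
                        local.push((start, items[start..end].iter().map(&work).collect()));
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    // Stitch claimed windows back into input order.
    claimed.sort_by_key(|&(start, _)| start);
    claimed.into_iter().flat_map(|(_, v)| v).collect()
}
```

Because each window is a contiguous slice, a worker streams through adjacent inputs and produces adjacent results, which is what keeps whole runs together in cache rather than interleaving lines between cores.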

Tests cover the window sizing policy, ordered output for all three batch encode entry points, parallel error propagation, and the parallelism-used marker.

Verification:

  • cargo fmt --manifest-path tokenizers/Cargo.toml.
  • cargo test --manifest-path tokenizers/Cargo.toml --lib --features http: 205 passed.
  • on Vera, 88 Rayon threads, bpe-encode/BPE GPT2 encode batch: LSE samples dropped from ~5.5% on main to ~0.9% on p2-prime; throughput measured at main 20.67 MiB/s, p2 24.53 MiB/s, p2-prime 23.55 MiB/s.
