Batch encode: simple lock-free scheduler#2044

Open
sebpop wants to merge 1 commit into huggingface:main from sebpop:p2-prime

Conversation

Contributor

@sebpop sebpop commented Apr 28, 2026

Replace the batch encode parallel-iterator collect path with a compact scoped scheduler that uses one cache-line-padded atomic counter to hand out contiguous windows of work. This keeps the throughput win with less code than the standalone batch utility while preserving the fork-safety parallelism marker for direct Rayon use.

The speedup comes from avoiding Rayon per-item iterator plumbing and mutex-backed worker wakeups in this hot loop. On aarch64 those wakeups show up as LSE atomic helper calls such as __aarch64_cas4_acq and __aarch64_ldadd8_acq_rel.

The scheduler claims contiguous windows, so one producer thread fills a run of adjacent input and result slots before another thread touches the next run. Completed windows flow through the shared L3 cache as whole runs, avoiding producer/consumer ping-pong, where two cores repeatedly alternate ownership of the same L1 cache line.
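The windowed claiming described above can be sketched in safe Rust as follows. This is an illustrative simplification, not the patch itself: the names (`PaddedCounter`, `scoped_windows`) and the final stitch-by-sort step are assumptions; the actual scheduler presumably writes into preallocated result slots directly. The core idea is the same: a single cache-line-aligned atomic cursor, and one `fetch_add` per claim hands a whole contiguous window to a thread.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Align the cursor to its own cache line so claims from different
// threads do not false-share with neighboring data; 128 bytes covers
// common x86_64/aarch64 prefetch-pair sizes.
#[repr(align(128))]
struct PaddedCounter(AtomicUsize);

/// Map `work` over `items` with `threads` scoped workers, each claiming
/// contiguous windows of `window` slots from one atomic cursor.
/// Output order matches input order.
fn scoped_windows<T, R, F>(items: &[T], window: usize, threads: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let cursor = PaddedCounter(AtomicUsize::new(0));
    let mut claimed: Vec<(usize, Vec<R>)> = thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|_| {
                s.spawn(|| {
                    let mut local = Vec::new();
                    loop {
                        // One fetch_add claims a whole contiguous window;
                        // no mutex, no per-item atomic traffic.
                        let start = cursor.0.fetch_add(window, Ordering::Relaxed);
                        if start >= items.len() {
                            break;
                        }
                        let end = (start + window).min(items.len());
                        local.push((start, items[start..end].iter().map(&work).collect()));
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    // Stitch claimed windows back into input order.
    claimed.sort_by_key(|&(start, _)| start);
    claimed.into_iter().flat_map(|(_, v)| v).collect()
}
```

Because each window is a contiguous slice, a worker streams through adjacent inputs and produces adjacent results, which is what keeps whole runs together in cache rather than interleaving lines between cores.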

Tests cover the window sizing policy, ordered output for all three batch encode entry points, parallel error propagation, and the parallelism-used marker.

Verification:

  • cargo fmt --manifest-path tokenizers/Cargo.toml.
  • cargo test --manifest-path tokenizers/Cargo.toml --lib --features http: 205 passed.
  • on Vera, 88 Rayon threads, bpe-encode/BPE GPT2 encode batch: LSE samples dropped from ~5.5% on main to ~0.9% on p2-prime; throughput measured at main 20.67 MiB/s, p2 24.53 MiB/s, p2-prime 23.55 MiB/s.
