bench(engine): drain_parallel alternatives at 10K/100K items (#1682) by oferchen · Pull Request #4214 · oferchen/rsync

oferchen · 2026-05-17T18:56:58Z

Summary

Adds crates/engine/benches/drain_parallel_alternatives.rs, a Criterion micro-benchmark that compares three fan-in strategies for WorkQueueReceiver::drain_parallel:
1. sharded_mutex_vec - the current Vec<Mutex<Vec<R>>> indexed by rayon::current_thread_index(), one rayon task per item.
2. per_thread_vec - one rayon task per worker via par_chunks; each task owns its Vec<R> exclusively, no mutex on the hot path.
3. mpsc_unbounded_channel - crossbeam_channel::unbounded MPSC drain, one task per item.
Sweeps 10K and 100K items across 4 / 8 / 16 rayon workers; reports throughput in Throughput::Elements(N) so reviewers can read elements/sec directly off the Criterion summary.
Pre-allocates DeltaWork items outside the timed section. Each thread count gets its own private rayon::ThreadPool so the global pool cannot skew the measurement.
Registers the bench in crates/engine/Cargo.toml.
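
The three collectors above can be sketched with the standard library alone. This is a simplified illustration, not the bench code: it assumes `std::thread::scope` in place of a rayon pool, `std::sync::mpsc::channel` in place of `crossbeam_channel::unbounded`, one thread per chunk rather than one task per item, and a shard index taken from the chunk position rather than `rayon::current_thread_index()`. `process`, `chunk_len`, and `u64` items are hypothetical stand-ins for the simulated per-item compute and `DeltaWork`.

```rust
use std::sync::{mpsc, Mutex};
use std::thread;

// Hypothetical stand-in for the per-item compute shared by all strategies.
fn process(item: u64) -> u64 {
    item.wrapping_mul(2_654_435_761).rotate_left(7)
}

// Chunk size that splits `len` items across at most `workers` chunks.
fn chunk_len(len: usize, workers: usize) -> usize {
    assert!(workers > 0, "need at least one worker");
    ((len + workers - 1) / workers).max(1)
}

// Strategy 1: one Mutex<Vec<_>> shard per worker, locked once per item.
fn sharded_mutex_vec(items: &[u64], workers: usize) -> Vec<u64> {
    let shards: Vec<Mutex<Vec<u64>>> =
        (0..workers).map(|_| Mutex::new(Vec::new())).collect();
    let chunk = chunk_len(items.len(), workers);
    thread::scope(|s| {
        for (idx, part) in items.chunks(chunk).enumerate() {
            let shard = &shards[idx];
            s.spawn(move || {
                for &item in part {
                    shard.lock().unwrap().push(process(item));
                }
            });
        }
    });
    shards
        .into_iter()
        .flat_map(|m| m.into_inner().unwrap())
        .collect()
}

// Strategy 2: each worker owns its Vec exclusively; concatenate at the end.
fn per_thread_vec(items: &[u64], workers: usize) -> Vec<u64> {
    let chunk = chunk_len(items.len(), workers);
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || part.iter().map(|&i| process(i)).collect::<Vec<u64>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect::<Vec<u64>>()
    })
}

// Strategy 3: workers send results into an unbounded MPSC channel.
fn mpsc_channel(items: &[u64], workers: usize) -> Vec<u64> {
    let chunk = chunk_len(items.len(), workers);
    let (tx, rx) = mpsc::channel::<u64>();
    thread::scope(|s| {
        for part in items.chunks(chunk) {
            let tx = tx.clone();
            s.spawn(move || {
                for &item in part {
                    tx.send(process(item)).unwrap();
                }
            });
        }
    });
    drop(tx); // close the channel so the drain below terminates
    rx.iter().collect()
}

fn main() {
    let items: Vec<u64> = (0..10_000).collect();
    for workers in [4, 8, 16] {
        let mut a = sharded_mutex_vec(&items, workers);
        let mut b = per_thread_vec(&items, workers);
        let mut c = mpsc_channel(&items, workers);
        // Arrival order differs between collectors; compare as multisets.
        a.sort_unstable();
        b.sort_unstable();
        c.sort_unstable();
        assert_eq!(a, b);
        assert_eq!(a, c);
    }
    println!("all strategies agree");
}
```

Because all three functions share `process`, any timing difference between them in a sketch like this comes from the collector, which is the same isolation the bench aims for.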

Bench only. No production changes.

Action this evidence informs

#1681 (lock-free MPSC drain_parallel replacement): if MPSC or per-thread-Vec beats the sharded mutex at T=8/16 by a margin large enough to justify the churn, #1681 picks the winner; otherwise #1681 closes as "no change warranted".
Refs #4170 "bench: profile Arc<Mutex<Vec>> contention in parallel-stat path (#1192)" (parallel-stat collector contention bench, established pattern), #4173 "docs(audits): WorkQueueSender multi-producer usage audit (#1383)" (Mutex<Vec> audit naming this site), #4203 "bench(transfer): stdlib mpsc vs crossbeam channel overhead at 100K items (#1592)" (sync-channel overhead bench, parity reference for the MPSC strategy).

Test plan

CI: fmt + clippy clean on the new bench file
CI: cross-platform compile (bench gated by existing [target.'cfg(unix)'.dev-dependencies] criterion entry, matches sibling benches)
Local: cargo bench -p engine --bench drain_parallel_alternatives -- --quick produces all 18 benchmark IDs (3 strategies x 3 thread counts x 2 item counts)
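
As a quick check on the 18-ID count, the sweep matrix can be enumerated directly. The strategy names, thread counts, and item counts come from the summary; the `strategy/tN/nM` ID layout and the `benchmark_ids` helper are assumptions for illustration, and the real bench may format its IDs differently.

```rust
// Enumerate the sweep: 3 strategies x 3 thread counts x 2 item counts.
fn benchmark_ids() -> Vec<String> {
    let strategies = ["sharded_mutex_vec", "per_thread_vec", "mpsc_unbounded_channel"];
    let thread_counts = [4usize, 8, 16];
    let item_counts = [10_000usize, 100_000];

    let mut ids = Vec::new();
    for strategy in strategies {
        for threads in thread_counts {
            for items in item_counts {
                // Hypothetical ID layout, chosen only for this sketch.
                ids.push(format!("{strategy}/t{threads}/n{items}"));
            }
        }
    }
    ids
}

fn main() {
    let ids = benchmark_ids();
    assert_eq!(ids.len(), 18);
    for id in &ids {
        println!("{id}");
    }
}
```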

Design note for the lock-free MPSC variant of WorkQueueReceiver::drain_parallel. Sketches the crossbeam_channel swap, documents the ordering contract delegated to ReorderBuffer, defines the 20% threshold on the #4214 drain_parallel_alternatives bench that gates the migration, and lays out a feature-flag rollout plan. Recommendation is to defer the implementation until the #4214 numbers land on the reference host. Cross-refs #4170, #4173, #4203, #4214.
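
One way to read the 20% gate is as a relative throughput gain over the sharded-mutex baseline. The sketch below assumes that reading; the note summarized above does not say here whether the threshold applies to throughput or to time, or at which thread counts. `relative_gain`, `clears_gate`, and the sample numbers are hypothetical and are not measurements.

```rust
// Relative throughput gain of `candidate` over `baseline`, e.g. 0.25 for +25%.
fn relative_gain(baseline_eps: f64, candidate_eps: f64) -> f64 {
    (candidate_eps - baseline_eps) / baseline_eps
}

// True when the candidate beats the baseline by at least `threshold`.
fn clears_gate(baseline_eps: f64, candidate_eps: f64, threshold: f64) -> bool {
    baseline_eps > 0.0 && relative_gain(baseline_eps, candidate_eps) >= threshold
}

fn main() {
    // Made-up elements/sec figures purely for illustration.
    let baseline = 1_000_000.0;
    assert!(clears_gate(baseline, 1_250_000.0, 0.20)); // +25% clears the gate
    assert!(!clears_gate(baseline, 1_100_000.0, 0.20)); // +10% does not
    println!("gate checks passed");
}
```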

Captures the design for parallelizing the receiver's per-file delta apply loop while preserving per-file token order and wire-format parity. Documents the current sequential surface, the dormant ParallelDeltaPipeline that would host the change, the backpressure model, and the gating prerequisites - chiefly the parity-test gap flagged by audit #4205 (G2) - that block default adoption. Recommends a phased opt-in rollout: land the sequential-vs-parallel parity test first, then add a hidden CLI gate, then collect #4214 / #4206 bench evidence, and only then consider flipping the default.

…1405) (#4224)

Replaces the prior multi-root-focused note at the same path with a focused #1405 design discussion for parallel generator fan-in. Surveys the current SP / MP shape (feature-gated Clone at work_queue/multi_producer.rs), the candidate use cases, the three adjacent designs (Clone via #1569, Arc-wrap via #1610, explicit producer-vector constructor), and the ordering, capacity, and bench evidence that would have to land before flipping the default. Recommends keeping WorkQueueSender Send + !Clone in default builds and the multi-producer cargo feature opt-in. Documents the cost so reviewers can reject naive MP refactors. Promotion to default-on is gated on PR #4209 SP-vs-MP throughput parity, PR #4214 drain bench showing no regression, and an actual fan-in caller materialising.

Add a design note that inventories the current two-thread delta-drain/delta-reorder topology, proves the ordering contract is preserved by the per-result sequence key, sketches the single-thread alternative, and recommends deferring the refactor until the drain_parallel_alternatives bench from #4214 lands. Includes proposed metric additions (force_insert counter, drain batch size and pause histograms) that should land independently as a prerequisite to any later capacity-contract work.
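
The idea that a per-result sequence key preserves ordering regardless of arrival order can be illustrated with a small reorder buffer. This is a minimal sketch under stated assumptions: `SeqReorder` is a hypothetical type written for this example, not the project's ReorderBuffer, and it assumes sequence keys are dense and start at 0.

```rust
use std::collections::BTreeMap;

// Hypothetical sequence-keyed reorder buffer.
struct SeqReorder<T> {
    next: u64,                // next sequence number to release
    pending: BTreeMap<u64, T>, // results that arrived ahead of `next`
}

impl<T> SeqReorder<T> {
    fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    // Accept a result in any arrival order and return every result that
    // has become deliverable in sequence order.
    fn insert(&mut self, seq: u64, value: T) -> Vec<T> {
        self.pending.insert(seq, value);
        let mut ready = Vec::new();
        while let Some(v) = self.pending.remove(&self.next) {
            ready.push(v);
            self.next += 1;
        }
        ready
    }
}

fn main() {
    let mut buf = SeqReorder::new();
    let mut out = Vec::new();
    // Results arrive out of order, as they might from parallel workers.
    for (seq, v) in [(2, "c"), (0, "a"), (1, "b"), (3, "d")] {
        out.extend(buf.insert(seq, v));
    }
    assert_eq!(out, ["a", "b", "c", "d"]);
    println!("{out:?}");
}
```

Under this model the collector upstream is free to deliver results in any order, which is why the choice of fan-in strategy need not affect the ordering contract.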

Adds drain_parallel_alternatives benchmark comparing three fan-in strategies on the WorkQueueReceiver drain path: the current sharded Mutex<Vec> indexed by rayon thread index, a per-thread Vec with final concat (no mutex), and a crossbeam_channel MPSC drain. Runs at 10K and 100K items across 4, 8, and 16 rayon workers, reporting throughput in elements per second so reviewers can compare per-iteration cost directly. Pre-allocates DeltaWork items outside the timed section, isolates each worker count in a private rayon pool, and shares the same simulated per-item compute across strategies so the only delta between groups is the collector itself. Closes the measurement gap that #1681 needs to decide whether the current sharded Mutex<Vec> warrants replacement (#1682, refs #4170 / #4173 / #4203).