bench(engine): drain_parallel alternatives at 10K/100K items (#1682) #4214
Merged
This was referenced May 17, 2026
oferchen added a commit that referenced this pull request May 17, 2026
Design note for the lock-free MPSC variant of WorkQueueReceiver::drain_parallel. Sketches the crossbeam_channel swap, documents the ordering contract delegated to ReorderBuffer, defines the 20% threshold on the #4214 drain_parallel_alternatives bench that gates the migration, and lays out a feature-flag rollout plan. Recommendation is to defer the implementation until the #4214 numbers land on the reference host. Cross-refs #4170, #4173, #4203, #4214.
This was referenced May 17, 2026
2e2eeef to aa91838
oferchen added a commit that referenced this pull request May 17, 2026
Captures the design for parallelizing the receiver's per-file delta apply loop while preserving per-file token order and wire-format parity. Documents the current sequential surface, the dormant ParallelDeltaPipeline that would host the change, the backpressure model, and the gating prerequisites - chiefly the parity-test gap flagged by audit #4205 (G2) - that block default adoption. Recommends a phased opt-in rollout: land the sequential-vs-parallel parity test first, then add a hidden CLI gate, then collect #4214 / #4206 bench evidence, and only then consider flipping the default.
oferchen added a commit that referenced this pull request May 17, 2026
…1405) (#4224) Replaces the prior multi-root-focused note at the same path with a focused #1405 design discussion for parallel generator fan-in. Surveys the current SP / MP shape (feature-gated Clone at work_queue/multi_producer.rs), the candidate use cases, the three adjacent designs (Clone via #1569, Arc-wrap via #1610, explicit producer-vector constructor), and the ordering, capacity, and bench evidence that would have to land before flipping the default. Recommends keeping WorkQueueSender Send + !Clone in default builds and the multi-producer cargo feature opt-in. Documents the cost so reviewers can reject naive MP refactors. Promotion to default-on is gated on PR #4209 SP-vs-MP throughput parity, PR #4214 drain bench showing no regression, and an actual fan-in caller materialising.
oferchen added a commit that referenced this pull request May 17, 2026
Add a design note that inventories the current two-thread delta-drain/delta-reorder topology, proves the ordering contract is preserved by the per-result sequence key, sketches the single-thread alternative, and recommends deferring the refactor until the drain_parallel_alternatives bench from #4214 lands. Includes proposed metric additions (force_insert counter, drain batch size and pause histograms) that should land independently as a prerequisite to any later capacity-contract work.
Adds drain_parallel_alternatives benchmark comparing three fan-in strategies on the WorkQueueReceiver drain path: the current sharded Mutex<Vec> indexed by rayon thread index, a per-thread Vec with final concat (no mutex), and a crossbeam_channel MPSC drain. Runs at 10K and 100K items across 4, 8, and 16 rayon workers, reporting throughput in elements per second so reviewers can compare per-iteration cost directly. Pre-allocates DeltaWork items outside the timed section, isolates each worker count in a private rayon pool, and shares the same simulated per-item compute across strategies so the only delta between groups is the collector itself. Closes the measurement gap that #1681 needs to decide whether the current sharded Mutex<Vec> warrants replacement (#1682, refs #4170 / #4173 / #4203).
aa91838 to 7cc48f1
Summary
- Adds crates/engine/benches/drain_parallel_alternatives.rs, a Criterion micro-benchmark that compares three fan-in strategies for WorkQueueReceiver::drain_parallel:
  - sharded_mutex_vec - the current Vec<Mutex<Vec<R>>> indexed by rayon::current_thread_index(), one rayon task per item.
  - per_thread_vec - one rayon task per worker via par_chunks; each task owns its Vec<R> exclusively, no mutex on the hot path.
  - mpsc_unbounded_channel - crossbeam_channel::unbounded MPSC drain, one task per item.
- Reports Throughput::Elements(N) so reviewers can read elements/sec directly off the Criterion summary.
- Pre-allocates DeltaWork items outside the timed section. Each thread count gets its own private rayon::ThreadPool so the global pool cannot skew the measurement.
- Updates crates/engine/Cargo.toml.

Bench only. No production changes.
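For readers who want to see the shape of the three collectors side by side, here is a minimal sketch. It is not the bench code: it assumes std::thread::scope in place of rayon and std::sync::mpsc in place of crossbeam_channel so that it compiles with no dependencies, shards by chunk index rather than rayon::current_thread_index(), and uses a hypothetical process function as a stand-in for the simulated per-item compute. Only the three strategy names come from the PR.

```rust
use std::sync::{mpsc, Mutex};
use std::thread;

/// Hypothetical stand-in for the simulated per-item compute.
fn process(item: u64) -> u64 {
    item.wrapping_mul(2_654_435_761) ^ (item >> 7)
}

/// Items per worker, rounded up; at least 1 so `chunks` never panics.
fn chunk_len(len: usize, workers: usize) -> usize {
    let workers = workers.max(1);
    ((len + workers - 1) / workers).max(1)
}

/// Sharded collector: one Mutex<Vec> per worker, locked once per item.
/// Assumption: shards are indexed by chunk index, not rayon thread index.
fn sharded_mutex_vec(items: &[u64], workers: usize) -> Vec<u64> {
    let shards: Vec<Mutex<Vec<u64>>> =
        (0..workers.max(1)).map(|_| Mutex::new(Vec::new())).collect();
    thread::scope(|s| {
        for (idx, part) in items.chunks(chunk_len(items.len(), workers)).enumerate() {
            let shards = &shards;
            s.spawn(move || {
                for &item in part {
                    shards[idx].lock().unwrap().push(process(item));
                }
            });
        }
    });
    shards
        .into_iter()
        .flat_map(|shard| shard.into_inner().unwrap())
        .collect()
}

/// Per-thread collector: each worker owns its Vec, results are concatenated
/// after the join. No mutex on the hot path.
fn per_thread_vec(items: &[u64], workers: usize) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_len(items.len(), workers))
            .map(|part| s.spawn(move || part.iter().map(|&i| process(i)).collect::<Vec<u64>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().unwrap())
            .collect::<Vec<u64>>()
    })
}

/// Channel collector: every worker sends each result through an unbounded
/// MPSC channel; the caller drains the receiver once all senders are gone.
fn mpsc_unbounded_channel(items: &[u64], workers: usize) -> Vec<u64> {
    let (tx, rx) = mpsc::channel::<u64>();
    thread::scope(|s| {
        for part in items.chunks(chunk_len(items.len(), workers)) {
            let tx = tx.clone();
            s.spawn(move || {
                for &item in part {
                    tx.send(process(item)).unwrap();
                }
            });
        }
    });
    drop(tx); // close the last sender so the drain below terminates
    rx.iter().collect()
}
```

All three functions return the same multiset of results. The channel variant gives no ordering guarantee across workers, so any comparison between strategies should sort the outputs first or rely on a downstream reorder step.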
Action this evidence informs
Mutex<Vec> audit naming this site), bench(transfer): stdlib mpsc vs crossbeam channel overhead at 100K items (#1592) #4203 (sync-channel overhead bench, parity reference for the MPSC strategy).
Test plan
- [target.'cfg(unix)'.dev-dependencies] criterion entry, matches sibling benches)
- cargo bench -p engine --bench drain_parallel_alternatives -- --quick produces all 18 benchmark IDs (3 strategies x 3 thread counts x 2 item counts)
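The count of 18 follows from the cross product of the three axes. A small sketch that enumerates the matrix is below; the strategy names, the 4/8/16 thread counts, and the 10K/100K item counts come from this PR, while the ID string layout is an assumption made for illustration and may not match what Criterion actually prints.

```rust
/// Enumerates the strategy x thread-count x item-count matrix.
/// The "strategy/threads/items" layout is hypothetical.
fn benchmark_ids() -> Vec<String> {
    let strategies = ["sharded_mutex_vec", "per_thread_vec", "mpsc_unbounded_channel"];
    let thread_counts = [4usize, 8, 16];
    let item_counts = [10_000usize, 100_000];

    let mut ids = Vec::new();
    for strategy in strategies {
        for threads in thread_counts {
            for items in item_counts {
                ids.push(format!("{strategy}/{threads}/{items}"));
            }
        }
    }
    ids
}
```

A run that reports fewer IDs than this list suggests that one axis was skipped, which is easy to miss when reading the Criterion summary by eye.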