bench(engine): drain_parallel alternatives at 10K/100K items (#1682) #4214

Merged
oferchen merged 1 commit into master from bench/drain-parallel-alternatives-1682
May 17, 2026

@oferchen
Owner

Summary

  • Adds crates/engine/benches/drain_parallel_alternatives.rs, a Criterion micro-benchmark that compares three fan-in strategies for WorkQueueReceiver::drain_parallel:
    1. sharded_mutex_vec - the current Vec<Mutex<Vec<R>>> indexed by rayon::current_thread_index(), one rayon task per item.
    2. per_thread_vec - one rayon task per worker via par_chunks; each task owns its Vec<R> exclusively, no mutex on the hot path.
    3. mpsc_unbounded_channel - crossbeam_channel::unbounded MPSC drain, one task per item.
  • Sweeps 10K and 100K items across 4 / 8 / 16 rayon workers; reports throughput in Throughput::Elements(N) so reviewers can read elements/sec directly off the Criterion summary.
  • Pre-allocates DeltaWork items outside the timed section. Each thread count gets its own private rayon::ThreadPool so the global pool cannot skew the measurement.
  • Registers the bench in crates/engine/Cargo.toml.

Bench only. No production changes.
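
For readers who want the shape of the three collectors without opening the bench file, here is a minimal, dependency-free sketch. It is not the bench code: it assumes `std::thread::scope` in place of a rayon pool, `std::sync::mpsc` in place of `crossbeam_channel::unbounded`, a `u64` in place of `DeltaWork`, and a hypothetical `process` function for the per-item work. It also chunks the input per worker in all three variants, whereas the bench spawns one rayon task per item for strategies 1 and 3.

```rust
use std::sync::{mpsc, Mutex};
use std::thread;

/// Hypothetical stand-in for the per-item compute shared by all strategies.
pub fn process(item: u64) -> u64 {
    item.wrapping_mul(2654435761).rotate_left(7)
}

/// Items per worker, rounded up; never zero so `chunks` cannot panic.
fn chunk_len(items: usize, workers: usize) -> usize {
    let workers = workers.max(1);
    ((items + workers - 1) / workers).max(1)
}

/// Strategy 1: one Mutex<Vec<R>> shard per worker, locked on every push.
pub fn sharded_mutex_vec(items: &[u64], workers: usize) -> Vec<u64> {
    let shards: Vec<Mutex<Vec<u64>>> =
        (0..workers.max(1)).map(|_| Mutex::new(Vec::new())).collect();
    thread::scope(|s| {
        for (idx, part) in items.chunks(chunk_len(items.len(), workers)).enumerate() {
            let shards = &shards;
            s.spawn(move || {
                for &item in part {
                    shards[idx].lock().unwrap().push(process(item));
                }
            });
        }
    });
    // Fan-in: unwrap each shard and concatenate.
    shards.into_iter().flat_map(|m| m.into_inner().unwrap()).collect()
}

/// Strategy 2: each worker owns its Vec<R>; no mutex on the hot path.
pub fn per_thread_vec(items: &[u64], workers: usize) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_len(items.len(), workers))
            .map(|part| s.spawn(move || part.iter().map(|&i| process(i)).collect::<Vec<u64>>()))
            .collect();
        // Joining in spawn order keeps the concatenation in input order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect::<Vec<u64>>()
    })
}

/// Strategy 3: workers send results over an unbounded MPSC channel.
pub fn mpsc_unbounded_channel(items: &[u64], workers: usize) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    thread::scope(|s| {
        for part in items.chunks(chunk_len(items.len(), workers)) {
            let tx = tx.clone();
            s.spawn(move || {
                for &item in part {
                    tx.send(process(item)).unwrap();
                }
            });
        }
    });
    // Drop the last sender so the receiver iterator terminates.
    drop(tx);
    rx.into_iter().collect()
}
```

Only `per_thread_vec` returns results in input order in this sketch; the other two yield the same multiset in an unspecified order, so any comparison between them should sort first or rely on a downstream reorder step.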

Action this evidence informs

Test plan

  • CI: fmt + clippy clean on the new bench file
  • CI: cross-platform compile (bench gated by existing [target.'cfg(unix)'.dev-dependencies] criterion entry, matches sibling benches)
  • Local: cargo bench -p engine --bench drain_parallel_alternatives -- --quick produces all 18 benchmark IDs (3 strategies x 3 thread counts x 2 item counts)
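
As a sanity check on the count of 18, the sweep can be enumerated as a plain cross product. The ID layout below (`<strategy>/<items>_items/<workers>_workers`) is a hypothetical format chosen for illustration; the IDs Criterion actually prints for this bench may be shaped differently.

```rust
/// Enumerates the sweep as strategy x item count x worker count.
/// The string layout is an assumption, not the bench's real ID format.
pub fn benchmark_ids() -> Vec<String> {
    const STRATEGIES: [&str; 3] = [
        "sharded_mutex_vec",
        "per_thread_vec",
        "mpsc_unbounded_channel",
    ];
    const ITEM_COUNTS: [usize; 2] = [10_000, 100_000];
    const WORKERS: [usize; 3] = [4, 8, 16];

    let mut ids = Vec::new();
    for strategy in STRATEGIES {
        for items in ITEM_COUNTS {
            for workers in WORKERS {
                ids.push(format!("{strategy}/{items}_items/{workers}_workers"));
            }
        }
    }
    ids
}
```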

oferchen added a commit that referenced this pull request May 17, 2026
Design note for the lock-free MPSC variant of
WorkQueueReceiver::drain_parallel. Sketches the crossbeam_channel
swap, documents the ordering contract delegated to ReorderBuffer,
defines the 20% threshold on the #4214 drain_parallel_alternatives
bench that gates the migration, and lays out a feature-flag rollout
plan. Recommendation is to defer the implementation until the
#4214 numbers land on the reference host. Cross-refs #4170, #4173,
#4203, #4214.
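
One possible reading of the 20% gate, sketched under stated assumptions: the threshold is treated here as a minimum relative throughput gain of the candidate over the baseline, both in elements per second. The commit message does not spell out the exact definition, so `clears_threshold` and its parameters are hypothetical.

```rust
/// Returns true when the candidate's throughput exceeds the baseline's by at
/// least `min_gain` (0.20 for a 20% threshold). Assumed interpretation only.
pub fn clears_threshold(baseline_eps: f64, candidate_eps: f64, min_gain: f64) -> bool {
    // A non-positive baseline carries no signal, so the gate stays closed.
    baseline_eps > 0.0 && (candidate_eps - baseline_eps) / baseline_eps >= min_gain
}
```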
@oferchen force-pushed the bench/drain-parallel-alternatives-1682 branch from 2e2eeef to aa91838 on May 17, 2026 19:19
oferchen added a commit that referenced this pull request May 17, 2026
Captures the design for parallelizing the receiver's per-file delta
apply loop while preserving per-file token order and wire-format
parity. Documents the current sequential surface, the dormant
ParallelDeltaPipeline that would host the change, the backpressure
model, and the gating prerequisites - chiefly the parity-test gap
flagged by audit #4205 (G2) - that block default adoption.

Recommends a phased opt-in rollout: land the sequential-vs-parallel
parity test first, then add a hidden CLI gate, then collect #4214 /
#4206 bench evidence, and only then consider flipping the default.
oferchen added a commit that referenced this pull request May 17, 2026
…1405) (#4224)

Replaces the prior multi-root-focused note at the same path with a
focused #1405 design discussion for parallel generator fan-in.
Surveys the current SP / MP shape (feature-gated Clone at
work_queue/multi_producer.rs), the candidate use cases, the three
adjacent designs (Clone via #1569, Arc-wrap via #1610, explicit
producer-vector constructor), and the ordering, capacity, and bench
evidence that would have to land before flipping the default.

Recommends keeping WorkQueueSender Send + !Clone in default builds
and the multi-producer cargo feature opt-in. Documents the cost so
reviewers can reject naive MP refactors. Promotion to default-on is
gated on PR #4209 SP-vs-MP throughput parity, PR #4214 drain bench
showing no regression, and an actual fan-in caller materialising.
oferchen added a commit that referenced this pull request May 17, 2026
Add a design note that inventories the current two-thread
delta-drain/delta-reorder topology, proves the ordering contract is
preserved by the per-result sequence key, sketches the single-thread
alternative, and recommends deferring the refactor until the
drain_parallel_alternatives bench from #4214 lands. Includes proposed
metric additions (force_insert counter, drain batch size and pause
histograms) that should land independently as a prerequisite to any
later capacity-contract work.
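
To illustrate why a per-result sequence key is enough to restore order after an unordered drain, here is a minimal sketch of a reorder step. `SeqReorder` is a hypothetical type written for this example and makes no claim about how the project's ReorderBuffer is implemented; it assumes sequence numbers start at 0 and are contiguous.

```rust
use std::collections::BTreeMap;

/// Hypothetical reorder step: buffers out-of-order results and releases them
/// strictly by ascending sequence key.
pub struct SeqReorder<T> {
    next: u64,
    pending: BTreeMap<u64, T>,
}

impl<T> Default for SeqReorder<T> {
    fn default() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }
}

impl<T> SeqReorder<T> {
    pub fn new() -> Self {
        Self::default()
    }

    /// Accepts one result in any order and returns every result that has
    /// become deliverable, in sequence order.
    pub fn push(&mut self, seq: u64, value: T) -> Vec<T> {
        self.pending.insert(seq, value);
        let mut ready = Vec::new();
        while let Some(v) = self.pending.remove(&self.next) {
            ready.push(v);
            self.next += 1;
        }
        ready
    }
}
```

Because release depends only on the key, the collector upstream is free to hand results over in whatever order the workers finish.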
Adds drain_parallel_alternatives benchmark comparing three fan-in
strategies on the WorkQueueReceiver drain path: the current sharded
Mutex<Vec> indexed by rayon thread index, a per-thread Vec with final
concat (no mutex), and a crossbeam_channel MPSC drain. Runs at 10K and
100K items across 4, 8, and 16 rayon workers, reporting throughput in
elements per second so reviewers can compare per-iteration cost
directly.

Pre-allocates DeltaWork items outside the timed section, isolates each
worker count in a private rayon pool, and shares the same simulated
per-item compute across strategies so the only delta between groups is
the collector itself. Closes the measurement gap that #1681 needs to
decide whether the current sharded Mutex<Vec> warrants replacement
(#1682, refs #4170 / #4173 / #4203).
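
The measurement discipline described above (input built before timing, identical per-item compute, throughput in elements per second) can be sketched without Criterion. This is an illustrative stand-in, not the bench harness: `simulated_compute`, `elements_per_sec`, and `measure` are hypothetical names, and a plain `Instant` loop lacks Criterion's warm-up and statistical analysis.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical per-item compute shared by every collector under test.
pub fn simulated_compute(item: u64) -> u64 {
    item.wrapping_mul(2654435761).rotate_left(7)
}

/// Throughput in elements per second.
pub fn elements_per_sec(elements: usize, elapsed: Duration) -> f64 {
    elements as f64 / elapsed.as_secs_f64()
}

/// Times only the collector call. The caller allocates `input` beforehand,
/// so allocation of the work items stays outside the timed section.
pub fn measure<F>(input: &[u64], iterations: u32, collector: F) -> f64
where
    F: Fn(&[u64]) -> Vec<u64>,
{
    let mut total = Duration::ZERO;
    for _ in 0..iterations {
        let start = Instant::now();
        let out = collector(black_box(input));
        total += start.elapsed();
        // Keep the result observable so the work is not optimised away.
        black_box(out);
    }
    elements_per_sec(input.len() * iterations as usize, total)
}
```

Passing each collector to the same `measure` call with the same pre-built input keeps the collector as the only variable between runs.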
@oferchen force-pushed the bench/drain-parallel-alternatives-1682 branch from aa91838 to 7cc48f1 on May 17, 2026 19:45
@oferchen merged commit 665cd8e into master on May 17, 2026
13 checks passed
@oferchen deleted the bench/drain-parallel-alternatives-1682 branch on May 17, 2026 19:45