
perf: sharded mode pipeline throughput optimization#112

Merged
kacy merged 3 commits into main from perf/sharded-pipeline-throughput
Feb 12, 2026

Conversation


@kacy kacy commented Feb 12, 2026

summary

three optimizations to reduce channel overhead in sharded mode, targeting the 4-8x throughput gap vs concurrent mode at high pipeline depths (P=16).

root cause: every pipelined command in sharded mode pays ~300-450ns of channel overhead (oneshot allocation + mpsc send + atomic wakeup + oneshot recv). the actual keyspace operation takes ~20-50ns. channel overhead is 6-9x the useful work.

changes

  1. shard channel draining — after recv() returns a message, drain pending messages with try_recv() before re-entering select!. this amortizes the select! overhead across bursts of pipelined commands.

  2. eliminate key clones — make Engine::shard_for_key public and compute the shard index before moving the key into ShardRequest, avoiding one String heap allocation per single-key command (~30 command types).

  3. dispatch-collect pipeline — replace join_all(futures) with a two-phase pattern: dispatch all commands to shards via mpsc sends (fast, non-blocking), then collect oneshot responses in order. eliminates N large async state machines from join_all. uses a lightweight ResponseTag enum to guide ShardResponse → Frame conversion in the collect phase.

what was tested

  • cargo clippy --workspace --features protobuf -- -D warnings — clean
  • cargo test --workspace --features protobuf — all 363 unit tests + 113 integration tests pass (occasional flaky port-collision failures pre-exist on main)
  • the optimizations are pure refactoring of the dispatch/response path with no behavioral changes

design considerations

  • single-key commands (GET, SET, INCR, etc.) use the fast dispatch path. complex commands (broadcast, multi-key, cluster, pub/sub) fall through to the existing execute() function.
  • ResponseTag is a small enum (a ~1-byte discriminant) that tells the collect phase how to convert each ShardResponse into a Frame, so the full Command no longer has to stay alive across the await.
  • AOF ordering is preserved since the shard loop still processes messages serially.
  • the process_message helper in shard.rs handles the same dispatch + AOF + special-request logic as before, just extracted to avoid duplication.

kacy added 3 commits February 12, 2026 16:14
perf: drain shard channel after recv to reduce select! overhead

extract the message-processing body into a process_message helper,
then add a try_recv drain loop after the initial recv(). this
amortizes the tokio::select! overhead across bursts of pipelined
commands — multiple messages are processed per wakeup instead of
re-entering select! for each one.
perf: eliminate key clones in single-key shard commands

make Engine::shard_for_key and ShardHandle::dispatch public, add
Engine::dispatch_to_shard for the upcoming dispatch-collect pipeline.

update all ~30 single-key command arms in execute() to compute the
shard index first, then move the key into ShardRequest instead of
cloning it. saves one String heap allocation per request.

multi-key commands (MGET, MSET, DEL, EXISTS) still clone since
keys go to different shards.
perf: replace join_all with dispatch-collect pipeline

replace the join_all(futures) pipeline pattern with a two-phase
dispatch-collect approach:

1. dispatch phase: parse each frame, send the request to the owning
   shard via mpsc (fast — completes immediately with channel capacity),
   storing a oneshot receiver + lightweight ResponseTag

2. collect phase: await each oneshot in order, convert ShardResponse
   to Frame using the tag

this eliminates N large async state machines from join_all (each was
a full process() + execute() future, ~1KB+ on the stack for P=16).
instead, all dispatches are simple mpsc sends, and shards process in
parallel while the connection handler waits.

single-key commands (GET, SET, INCR, etc) use the fast dispatch path.
complex commands (broadcast, multi-key, cluster) fall through to the
existing execute() function.
@kacy kacy merged commit d0d48e6 into main Feb 12, 2026
6 of 7 checks passed
@kacy kacy deleted the perf/sharded-pipeline-throughput branch February 12, 2026 21:36
kacy added a commit that referenced this pull request Feb 19, 2026
