
perf: sharded mode pipeline throughput optimization#112

Merged
kacy merged 3 commits into main from perf/sharded-pipeline-throughput
Feb 12, 2026

Conversation


@kacy kacy commented Feb 12, 2026

summary

three optimizations to reduce channel overhead in sharded mode, targeting the 4-8x throughput gap vs concurrent mode at high pipeline depths (P=16).

root cause: every pipelined command in sharded mode pays ~300-450ns of channel overhead (oneshot allocation + mpsc send + atomic wakeup + oneshot recv). the actual keyspace operation takes ~20-50ns. channel overhead is 6-9x the useful work.

changes

  1. shard channel draining — after recv() returns a message, drain pending messages with try_recv() before re-entering select!. this amortizes the select! overhead across bursts of pipelined commands.

  2. eliminate key clones — make Engine::shard_for_key public and compute the shard index before moving the key into ShardRequest, avoiding one String heap allocation per single-key command (~30 command types).

  3. dispatch-collect pipeline — replace join_all(futures) with a two-phase pattern: dispatch all commands to shards via mpsc sends (fast, non-blocking), then collect oneshot responses in order. eliminates N large async state machines from join_all. uses a lightweight ResponseTag enum to guide ShardResponse → Frame conversion in the collect phase.

what was tested

  • cargo clippy --workspace --features protobuf -- -D warnings — clean
  • cargo test --workspace --features protobuf — all 363 unit tests + 113 integration tests pass (occasional flaky port-collision failures pre-exist on main)
  • the optimizations are pure refactoring of the dispatch/response path with no behavioral changes

design considerations

  • single-key commands (GET, SET, INCR, etc.) use the fast dispatch path. complex commands (broadcast, multi-key, cluster, pub/sub) fall through to the existing execute() function.
  • ResponseTag is a small enum (a ~1-byte discriminant) that tells the collect phase how to convert each ShardResponse into a Frame, so the full Command no longer has to stay alive across the await.
  • AOF ordering is preserved since the shard loop still processes messages serially.
  • the process_message helper in shard.rs handles the same dispatch + AOF + special-request logic as before, just extracted to avoid duplication.

kacy added 3 commits February 12, 2026 16:14
perf: drain shard channel after recv to reduce select! overhead

extract the message-processing body into a process_message helper,
then add a try_recv drain loop after the initial recv(). this
amortizes the tokio::select! overhead across bursts of pipelined
commands — multiple messages are processed per wakeup instead of
re-entering select! for each one.
perf: eliminate key clones in single-key shard commands

make Engine::shard_for_key and ShardHandle::dispatch public, add
Engine::dispatch_to_shard for the upcoming dispatch-collect pipeline.

update all ~30 single-key command arms in execute() to compute the
shard index first, then move the key into ShardRequest instead of
cloning it. saves one String heap allocation per request.

multi-key commands (MGET, MSET, DEL, EXISTS) still clone since
keys go to different shards.
perf: replace join_all with dispatch-collect pipeline

replace the join_all(futures) pipeline pattern with a two-phase
dispatch-collect approach:

1. dispatch phase: parse each frame, send the request to the owning
   shard via mpsc (fast — completes immediately with channel capacity),
   storing a oneshot receiver + lightweight ResponseTag

2. collect phase: await each oneshot in order, convert ShardResponse
   to Frame using the tag

this eliminates N large async state machines from join_all (each was
a full process() + execute() future, ~1KB+ on the stack for P=16).
instead, all dispatches are simple mpsc sends, and shards process in
parallel while the connection handler waits.

single-key commands (GET, SET, INCR, etc) use the fast dispatch path.
complex commands (broadcast, multi-key, cluster) fall through to the
existing execute() function.
@kacy kacy merged commit d0d48e6 into main Feb 12, 2026
6 of 7 checks passed
@kacy kacy deleted the perf/sharded-pipeline-throughput branch February 12, 2026 21:36
kacy added a commit that referenced this pull request Feb 19, 2026
