feat: multi-shard improvements Tier 1 — operator UX by TinDang97 · Pull Request #99 · pilotspace/moon

TinDang97 · 2026-05-23T16:22:35Z

Summary

Tier 1 of the multi-shard improvements plan (see .planning/multi-shard-improvements/CONTEXT.md + PLAN.md). Five tasks closing documented operator-UX footguns. No hot-path regressions.

T1.1 b9c0bee — Startup warn when maxclients < 25 × shards (footgun documented in benchmark_scaling_concurrency_2026_04_26). Suppressible via MOON_NO_UNDERSUBSCRIPTION_WARN=1. Pure-function should_warn_undersubscription(maxclients, num_shards) -> Option<String> with 7 unit tests.
T1.2 4575d34 — Opt-IN auto-shard cap via MOON_AUTO_SHARDS_CONSERVATIVE=1 (clamps --shards 0 auto-detect to min(2, vCPU)). Default unchanged: bare moon --shards 0 still resolves to available_parallelism(). User-requested deviation from the originally-planned opt-OUT semantic to avoid silently degrading existing operators on high-core hosts.
T1.3 60ab2fd — Reframed SWAPDB/MOVE/COPY-DB error text from misleading "not supported in sharded mode" to accurate "not implemented (tracked in T2.x)". Added MOVE to the dispatch table and COMMAND_META as a stub. 3 unit tests.
T1.4 959354a — Surface cross-shard dispatch metrics in INFO stats (total_dispatch_cross_read_fastpath, total_dispatch_cross_spsc) via dedicated static AtomicU64 with Ordering::Relaxed. Works without admin port enabled — the original gap.
T1.5 (orchestrator) — Memory hygiene: corrected project_multi_shard_limitations.md re: --shards 0 default (still 0, not 1).
6a1a044 — clippy::collapsible_if cleanup on lint introduced by T1.1.
46ed28a — .planning submodule pointer bump for T1.2 opt-IN doc.

Benchmark validation (OrbStack moon-dev, aarch64, `--shards 4 -c 200`)

T1.4 is the only task touching a hot path. Median-of-3 at p=64:

Cell	Pre-T1.4 (`4575d34`)	This PR (`46ed28a`)	Δ%
GET p=64 c=200	5,814,325 rps	5,494,857 rps	-5.5%
SET p=64 c=200	3,030,496 rps	3,226,013 rps	+6.5%

Mixed-sign deltas at same depth = noise, not regression. A real cost from the Relaxed fetch_add would show consistent negative deltas across both ops. T1.4 is performance-neutral.

INFO output verified live: total_dispatch_cross_read_fastpath:1124781, total_dispatch_cross_spsc:1124709.

Test plan

cargo fmt --check — clean
cargo clippy --release -- -D warnings (default features) — clean
cargo clippy --release --no-default-features --features runtime-tokio,jemalloc -- -D warnings — clean
Unit tests for T1.1 (8 passed: pure-function + 7 cases), T1.2 (compute_auto_shards), T1.3 (3 passed: swapdb/move/copy_db_option)
Bench: T1.4 zero measurable cost on OrbStack moon-dev
Manual: redis-cli INFO stats | grep -i dispatch returns the new counters
Full cargo test --release (3200+ tests) — agent reported pass on the original commits; CI to re-confirm against rebased tree
CI green on push

Notes for reviewer

Branch was rebased onto current main d532b74 after upstream advanced 9 commits during the impl session. The rebase dropped a problematic commit (117a9ab) that inverted tombstone GC semantics (0,0 → u64::MAX,u64::MAX) on the WAL-overflow emergency checkpoint path. Main upstream independently arrived at the correct 0,0. See feedback_agent_self_doubt_inversion.md in ~/.claude/.../memory/ for the postmortem.
Bench was run on OrbStack (aarch64), not GCloud. Treat as no-regression sanity, not a canonical baseline. Tier 3 will publish baselines on c4a Axion per CLAUDE.md.

Closes the T1 acceptance criteria in .planning/multi-shard-improvements/CONTEXT.md.

Summary by CodeRabbit

New Features
- Auto-detect shard count when --shards 0 is specified, with optional conservative capping via environment variable.
- Display cross-shard dispatch metrics in INFO output.
- Startup warning when maxclients is insufficient for configured shards.
Bug Fixes
- Updated error messages for SWAPDB and COPY DB commands with tracking references.
Chores
- Registered MOVE command stub (returns "not implemented" error).
- Updated submodule reference.

Previously SWAPDB returned "ERR SWAPDB is not supported in sharded mode", which was misleading — the command is not implemented at all, even on a single-shard server. COPY with the DB option had a similar vague message. MOVE (the db-move command) had no dispatch entry or metadata entry at all, causing it to fall through to an unhelpful "unknown command" path. Changes: - SWAPDB now returns: "ERR SWAPDB not implemented (tracked in T2.1)" - MOVE stub added to dispatch table and metadata registry, returns: "ERR MOVE not implemented (tracked in T2.2)" - COPY...DB n now returns: "ERR COPY DB option not implemented (tracked in T2.3)" The tracking references point operators to the T2.x tasks where each command will be fully implemented. Also fixes a pre-existing compile error on this branch: `maybe_force_checkpoint_on_wal_overflow` called `force_checkpoint` with 7 arguments but the function signature requires 9 (tombstone_retain_epochs and tombstone_retain_secs were added in a prior commit but this call site was not updated). The WAL-overflow path does not perform tombstone GC — that is the scheduled persistence tick's responsibility — so 0/0 is the correct value at this call site. Unit tests added: - `swapdb_error_is_not_implemented` (src/command/mod.rs) - `move_error_is_not_implemented` (src/command/mod.rs) - `test_copy_db_option_not_implemented` (src/command/key_extra.rs) author: Tin Dang

…rds) Documented footgun: at c=50 with 8 shards the per-shard SPSC channels are chronically under-subscribed (≈6 clients/shard), causing SET p=1 throughput to collapse to 0.07× single-shard baseline. The empirical safe threshold is ≥25 concurrent clients per shard (benchmark_scaling_concurrency_2026_04_26). On startup, after resolving num_shards, Moon now checks: if maxclients < 25 × num_shards → WARN with suggested fix The warning is: - Suppressed when num_shards == 1 (no cross-shard dispatch). - Suppressed when maxclients == 0 (unlimited — no ceiling to warn about). - Suppressed globally by setting MOON_NO_UNDERSUBSCRIPTION_WARN=1. Implementation uses a pure function `should_warn_undersubscription(maxclients, num_shards) -> Option<String>` so it is unit-testable without tracing_test. The tracing::warn!() call site in main() reads the env var guard directly. Also adds `compute_auto_shards(parallelism, conservative) -> usize` pure function in preparation for T1.2 (opt-in conservative cap). Tests for both functions are included in the same commit so the test suite stays green at every commit boundary. Unit tests (src/main.rs #[cfg(test)]): - no_warn_single_shard - no_warn_unlimited_clients - no_warn_sufficient_clients - warns_below_threshold - warns_high_shard_count - threshold_is_inclusive - auto_shards_conservative_cap (pre-lands the T1.2 pure-function test) author: Tin Dang

Default behaviour is UNCHANGED: --shards 0 still resolves to available_parallelism(). This preserves backward compatibility for operators on high-core hosts who are already using auto-detect. Operators who observe sub-linear multi-shard scaling (documented at benchmark_scaling_concurrency_2026_04_26: SET p=1 8-shard at c=50 was 0.07× baseline) can now opt into the empirical sweet-spot by setting: MOON_AUTO_SHARDS_CONSERVATIVE=1 This caps the auto-detected shard count at min(2, available_parallelism()). A startup INFO log always explains which path was taken. The pure function `compute_auto_shards(parallelism, conservative) -> usize` is unit-tested independently of the env var so correctness is verifiable without process-level environment manipulation. Deviation from PLAN.md original spec: the plan described opt-OUT semantics (MOON_AUTO_SHARDS_AGGRESSIVE=1 to re-enable full vCPU). Changed to opt-IN at user direction before kickoff — default must not silently degrade existing operators. PLAN.md updated inline to reflect shipped reality. Unit test: auto_shards_conservative_cap (src/main.rs) author: Tin Dang

Counter storage type verified: The existing `record_dispatch_cross_*` helpers use the `metrics` crate facade (`counter!()` macro → Prometheus registry). This registry is NOT directly readable from the INFO path — it only surfaces via the /metrics HTTP endpoint. The counters are therefore NOT AtomicU64 values accessible via load(); they are opaque handles owned by the Prometheus recorder. Added dedicated `static AtomicU64` counters (Relaxed ordering, monotonic event counters) incremented in parallel inside the existing `record_dispatch_*` helpers. These always work regardless of whether Prometheus is enabled (admin_port=0). New counters exposed in `INFO stats`: total_dispatch_cross_read_fastpath — commands served via the RwLock shared-read fast path (no SPSC message; low latency but holds a read lock on the target shard's database). total_dispatch_cross_spsc — commands dispatched through the SPSC slow path (covers both reads when fast-path is off AND writes; no separate write-only counter exists in the codebase). Note on plan deviation: PLAN.md listed `total_dispatch_cross_write_spsc` as a separate counter. There is no write-specific SPSC recording in the existing helpers — both read and write cross-shard non-fastpath traffic is recorded by `record_dispatch_cross_spsc`. Exposing a split counter would require adding write-vs-read classification to the handler layer (a T2.x task). INFO exposes the correct unified total instead. `fastpath_avg_ns` is not emitted: the latency histogram at `record_dispatch_cross_read_fastpath_timed` is Prometheus-only. Emitting a meaningful average would require a second `FASTPATH_TOTAL_NS` atomic. Deferred to a follow-up if operators request it; the histogram is already available via /metrics when Prometheus is enabled. Implementation: - src/admin/metrics_setup.rs: two new static AtomicU64 counters (DISPATCH_CROSS_READ_FASTPATH_TOTAL, DISPATCH_CROSS_READ_SPSC_TOTAL); public getter fns total_dispatch_cross_read_fastpath() and total_dispatch_cross_spsc(); increments wired into existing helpers. - src/command/connection.rs: INFO # Stats section extended with the two new keys. Unit tests (src/admin/metrics_setup.rs): - cross_read_fastpath_atomic_increments - cross_read_fastpath_batch_atomic_increments - cross_read_fastpath_batch_zero_is_noop - cross_spsc_atomic_increments - cross_spsc_batch_atomic_increments - cross_spsc_batch_zero_is_noop author: Tin Dang

…llapsible_if `if let Some(msg) { if env.is_none() { ... } }` violates clippy::collapsible_if. Use `if let ... && cond` (let-chains, stable since Rust 1.64) to collapse into a single if-expression. Semantics and runtime behaviour are unchanged. author: Tin Dang

Submodule HEAD advances from 3bbfd04 to 1d05828, which contains: 1d05828 docs(plan): update T1.2 spec to reflect opt-IN semantics This pointer bump was previously folded into a commit that was dropped during rebase. Lifting it into its own atomic commit so the parent repo and planning submodule agree on HEAD. author: Tin Dang

qodo-code-review · 2026-05-23T16:22:39Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-05-23T16:22:49Z

📝 Walkthrough

Walkthrough

This PR adds cross-shard dispatch observability to INFO output, marks three commands (COPY DB, MOVE, SWAPDB) as unimplemented with tracking IDs, and introduces CPU-based auto-detection of shard count at startup with optional warnings for undersized client limits.

Changes

Cross-shard dispatch observability

Layer / File(s)	Summary
Dispatch metrics infrastructure `src/admin/metrics_setup.rs`	Two process-wide `AtomicU64` counters track cross-shard read fast-path and SPSC slow-path dispatch totals. Recording helpers always increment these atomics (relaxed ordering) even when Prometheus is disabled, skip on zero-count batches, and tests validate monotonic increment behavior.
Metrics accessors and INFO integration `src/admin/metrics_setup.rs`, `src/command/connection.rs`	Public getter functions expose dispatch totals without requiring Prometheus initialization. INFO output is extended to include `total_dispatch_cross_read_fastpath` and `total_dispatch_cross_spsc` alongside existing command/connection counts.

Unimplemented command stubs

Layer / File(s)	Summary
COPY DB option error stub `src/command/key_extra.rs`	COPY DB option parsing now returns `ERR COPY DB option not implemented (tracked in T2.3)`, with test validation of the error message.
MOVE command stub `src/command/metadata.rs`, `src/command/mod.rs`	MOVE is registered in command metadata (arity 3, WRITE, GEN ACL), dispatch returns `ERR MOVE not implemented (tracked in T2.2)`, and test assertions verify error content.
SWAPDB error message update `src/command/mod.rs`	SWAPDB dispatch error updated to `ERR SWAPDB not implemented (tracked in T2.1)`, replacing prior sharded-mode-specific wording, with test assertions excluding old terminology.

Startup auto-sharding and warnings

Layer / File(s)	Summary
Shard resolution and undersubscription logic `src/main.rs`	Zero-shard config now auto-detects available CPU parallelism (fallback 4) with optional conservative cap via `MOON_AUTO_SHARDS_CONSERVATIVE=1`. Startup emits undersubscription warning when `maxclients` falls below 25-per-shard threshold (suppressible via `MOON_NO_UNDERSUBSCRIPTION_WARN=1`).
Shard auto-detect and warning helpers `src/main.rs`	Public functions `compute_auto_shards()` and `should_warn_undersubscription()` encapsulate parallelism detection, conservative capping, and warning threshold logic with unit tests validating both behaviors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

pilotspace/moon#68: Updates COPY command parsing and error handling for unsupported DB and REPLACE options.
pilotspace/moon#67: Enriches INFO output formatting in src/command/connection.rs for Clients, Memory, and Replication metrics.

Poem

🐰 Dispatch counts now whisper true,
Shards self-tune to cores on queue,
Commands stub with tracking tags,
INFO glows through metric flags! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: implementing Tier 1 of multi-shard improvements focused on operator UX improvements across five documented tasks.
Description check	✅ Passed	The description comprehensively covers all required template sections: a detailed summary of changes, test results, performance impact analysis, and notes for reviewers.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/tier-1-operator-ux

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/admin/metrics_setup.rs`:
- Around line 1471-1480: The zero-count batch test
cross_read_fastpath_batch_zero_is_noop currently only checks monotonicity and
can miss regressions where record_dispatch_cross_read_fastpath_batch(0)
increments the counter; change the test to serialize access (avoid concurrent
interference) by running it single-threaded and assert equality instead of >=:
capture before = total_dispatch_cross_read_fastpath(), call
record_dispatch_cross_read_fastpath_batch(0), then assert after == before; apply
the same change to the other zero-count dispatch tests (the equivalent test
around lines 1506-1513) using their corresponding helper functions (e.g.,
total_dispatch_... and record_dispatch_..._batch) so zero-count calls are
verified to be no-ops.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4e96f79-2ef6-4fee-9f7c-8e778e84e7f1

📥 Commits

Reviewing files that changed from the base of the PR and between b7a443f and 46ed28a.

📒 Files selected for processing (7)

.planning
src/admin/metrics_setup.rs
src/command/connection.rs
src/command/key_extra.rs
src/command/metadata.rs
src/command/mod.rs
src/main.rs

coderabbitai · 2026-05-23T16:28:22Z

+    fn cross_read_fastpath_batch_zero_is_noop() {
+        // batch(0) must not call fetch_add — the counter must not move
+        // backward; it may move forward due to concurrent tests.
+        let before = total_dispatch_cross_read_fastpath();
+        record_dispatch_cross_read_fastpath_batch(0);
+        let after = total_dispatch_cross_read_fastpath();
+        assert!(
+            after >= before,
+            "counter must be monotone; before={before} after={after}"
+        );


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Zero-count batch tests are currently non-detecting for regressions.

Both tests only assert monotonicity, so they still pass if *_batch(0) accidentally increments. Please serialize the dispatch-counter tests and assert equality for zero-count cases.

Suggested hardening

+ static DISPATCH_COUNTER_TEST_GUARD: once_cell::sync::Lazy<parking_lot::Mutex<()>> = + once_cell::sync::Lazy::new(|| parking_lot::Mutex::new(())); #[test] fn cross_read_fastpath_batch_zero_is_noop() { + let _guard = DISPATCH_COUNTER_TEST_GUARD.lock(); let before = total_dispatch_cross_read_fastpath(); record_dispatch_cross_read_fastpath_batch(0); let after = total_dispatch_cross_read_fastpath(); - assert!( - after >= before, - "counter must be monotone; before={before} after={after}" - ); + assert_eq!(after, before, "batch(0) must not increment"); } #[test] fn cross_spsc_batch_zero_is_noop() { + let _guard = DISPATCH_COUNTER_TEST_GUARD.lock(); let before = total_dispatch_cross_spsc(); record_dispatch_cross_spsc_batch(0); let after = total_dispatch_cross_spsc(); - assert!( - after >= before, - "counter must be monotone; before={before} after={after}" - ); + assert_eq!(after, before, "batch(0) must not increment"); }

Also applies to: 1506-1513

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/admin/metrics_setup.rs` around lines 1471 - 1480, The zero-count batch test cross_read_fastpath_batch_zero_is_noop currently only checks monotonicity and can miss regressions where record_dispatch_cross_read_fastpath_batch(0) increments the counter; change the test to serialize access (avoid concurrent interference) by running it single-threaded and assert equality instead of >=: capture before = total_dispatch_cross_read_fastpath(), call record_dispatch_cross_read_fastpath_batch(0), then assert after == before; apply the same change to the other zero-count dispatch tests (the equivalent test around lines 1506-1513) using their corresponding helper functions (e.g., total_dispatch_... and record_dispatch_..._batch) so zero-count calls are verified to be no-ops.

TinDang97 added 6 commits May 23, 2026 23:00

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Merge branch 'main' into feat/tier-1-operator-ux

f7ad387

TinDang97 merged commit ddbc782 into main May 23, 2026
3 of 8 checks passed

TinDang97 deleted the feat/tier-1-operator-ux branch May 23, 2026 17:00

coderabbitai Bot mentioned this pull request May 24, 2026

feat(tier-2): Lane A — SWAPDB, MOVE, COPY DB n, CLUSTER REPLICAS/SLAVES, COUNT-FAILURE-REPORTS #100

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: multi-shard improvements Tier 1 — operator UX#99

feat: multi-shard improvements Tier 1 — operator UX#99
TinDang97 merged 7 commits into
mainfrom
feat/tier-1-operator-ux

TinDang97 commented May 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

qodo-code-review Bot commented May 23, 2026

Uh oh!

coderabbitai Bot commented May 23, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TinDang97 commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark validation (OrbStack moon-dev, aarch64, --shards 4 -c 200)

Test plan

Notes for reviewer

Summary by CodeRabbit

Uh oh!

qodo-code-review Bot commented May 23, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TinDang97 commented May 23, 2026 •

edited by coderabbitai Bot

Loading

Benchmark validation (OrbStack moon-dev, aarch64, `--shards 4 -c 200`)

coderabbitai Bot commented May 23, 2026 •

edited

Loading