fix: wave 0-4 gap closure — correctness, observability, CI, durability, compat#67

Merged
pilotspacex-byte merged 11 commits into main from fix/wave0-5-todos on Apr 10, 2026

Conversation

Collaborator

@TinDang97 TinDang97 commented Apr 9, 2026

Summary

Codebase-qualified gap closure across 5 areas identified by tracing all remaining TODOs against actual code state.

  • Wave 0 — ZREVRANGEBYSCORE fix: Double-swap of min/max bounds in zrange_by_score and zrange_by_lex caused empty results for finite score ranges (e.g., ZREVRANGEBYSCORE key 3 1). Same bug pattern fixed in lex variant.
  • Wave 1 — Observability (Phase 101): INFO Clients/Memory/Replication sections filled with real data. 6 new #[instrument] tracing spans. Replication lag Prometheus gauge wired (had zero call sites).
  • Wave 2 — Release Pipeline (Phase 105): cargo deny check + cargo audit enforced in CI. aarch64-unknown-linux-gnu release binary added via cross.
  • Wave 3 — Durability (Phase 103): BGSAVE and BGREWRITEAOF crash matrix cells added (6/7 coverage).
  • Wave 4 — Compatibility (Phase 104): Stream, Lua scripting, and ACL compat tests added to redis_compat.rs.

Files changed (16 files, +495 -51)

Area | Files
Correctness | src/command/sorted_set/mod.rs, scripts/test-commands.sh
Observability | src/admin/metrics_setup.rs, src/command/connection.rs, src/main.rs, src/server/listener.rs
Tracing | src/server/conn/handler_single.rs, src/server/conn/handler_monoio.rs, src/replication/master.rs, src/vector/segment/compaction.rs, src/persistence/aof.rs
CI/Release | .github/workflows/ci.yml, .github/workflows/release.yml
Tests | tests/durability/crash_matrix.rs, tests/redis_compat.rs

Test plan

  • cargo fmt --check clean
  • cargo clippy --no-default-features --features runtime-tokio,jemalloc -- -D warnings clean
  • cargo test --no-default-features --features runtime-tokio,jemalloc --release --lib — 1895 passed
  • Sorted set unit tests pass including test_zrevrangebyscore
  • cargo test --test redis_compat -- --ignored (requires running Moon)
  • cargo test --test durability_crash_matrix -- --ignored (requires built binary)
  • cargo test --test replication_hardening -- --ignored (requires built binary)

Deferred

  • SIGHUP TLS reload (ArcSwap refactor, separate PR)
  • Phase 102 ConnectionCore extraction (high-risk refactor, post-GA)
  • Vector client SDK tests (Python infra needed)
  • WAL rotation crash cell (7th matrix cell)

Summary by CodeRabbit

  • Bug Fixes

    • Fixed ZREVRANGEBYSCORE returning empty results for finite score ranges
  • New Features

    • INFO now shows connected clients, memory metrics, and replication info
  • Tests

    • Added crash-injection tests for BGSAVE and AOF rewrite
    • Expanded Redis compatibility tests (Streams, Lua scripting, ACL)
  • Chores

    • Added supply-chain security checks to CI
    • Extended release builds to include Linux ARM64
    • Added tracing instrumentation across several components

coderabbitai Bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 25873be4-ed43-4ae2-9bf4-e19dbe1944cf

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5372a and 81cad44.

📒 Files selected for processing (1)
  • CHANGELOG.md
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md

📝 Walkthrough

Walkthrough

Adds CI supply-chain checks and an aarch64 cross release target; fixes sorted-set reverse-range bound handling; enriches INFO output with connected clients, RSS memory, and replication info; registers global replication state at startup; adds tracing instrumentation to multiple hot paths; and expands crash and Redis-compat tests.

Changes

Cohort / File(s) | Summary
CI Security & Release Matrix (.github/workflows/ci.yml, .github/workflows/release.yml) | Added supply-chain CI job running cargo deny check and cargo audit. Extended release matrix with linux-aarch64-tokio using cross; conditional cross install and packaging/checksum updates for new artifact.
Sorted Set Logic & Tests (src/command/sorted_set/mod.rs, scripts/test-commands.sh) | Removed min/max swap when rev is set; always parse min/max in-order and reverse results after collection; updated test to assert non-empty reverse score range behavior.
Metrics, INFO Plumbing & Startup (src/admin/metrics_setup.rs, src/command/connection.rs, src/main.rs, src/server/listener.rs) | Added CONNECTED_CLIENTS counter, get_rss_bytes() (Linux /proc/self/status), global GLOBAL_REPL_STATE with setters/getters, and get_replication_info() that updates a replication-lag gauge. Wired INFO output to these helpers and registered global repl state at startup; added server-ready signal.
Tracing Instrumentation (src/persistence/aof.rs, src/replication/master.rs, src/server/conn/handler_monoio.rs, src/server/conn/handler_single.rs, src/vector/segment/compaction.rs) | Annotated several public handlers and functions with #[tracing::instrument(skip_all, level = "...")] to emit spans for AOF rewrite, PSYNC handling (both runtimes), connection handlers, and vector compaction.
Tests: Durability & Redis Compatibility (tests/durability/crash_matrix.rs, tests/redis_compat.rs) | Added ignored crash-injection tests for BGSAVE and BGREWRITEAOF. Added multiple ignored Redis-compat tests for Streams, Lua scripting (EVAL/EVALSHA/SCRIPT), and ACL commands.
Changelog (CHANGELOG.md) | Updated Unreleased notes documenting sorted-set fix, INFO enrichment, tracing additions, CI supply-chain checks, release matrix update, and expanded tests.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ConnHandler as Connection<br/>Handler
    participant InfoCmd as info()<br/>Command
    participant Metrics as metrics_setup
    participant ReplState as GLOBAL_REPL_STATE

    Client->>ConnHandler: INFO
    ConnHandler->>InfoCmd: handle INFO
    InfoCmd->>Metrics: connected_clients()
    Metrics-->>InfoCmd: u64
    InfoCmd->>Metrics: get_rss_bytes()
    Metrics->>Metrics: read /proc/self/status (Linux)
    Metrics-->>InfoCmd: u64
    InfoCmd->>Metrics: get_replication_info()
    Metrics->>ReplState: read replication state (Arc<RwLock>)
    Metrics->>Metrics: record_replication_lag()
    Metrics-->>InfoCmd: (role, connected_slaves, offset, repl_id)
    InfoCmd->>InfoCmd: format_memory_human()
    InfoCmd-->>ConnHandler: formatted INFO payload
    ConnHandler-->>Client: INFO response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement

Poem

🐰 Hopped through code with nimble paws,

bounds fixed, and cleared the claws.
Metrics counted, spans aglow,
Supply-chain guards in tidy row.
Tests now leap — safe paths to trod, hooray!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: wave 0-4 gap closure — correctness, observability, CI, durability, compat' accurately reflects the five major areas of changes across the pull request.
Description check ✅ Passed The PR description is comprehensive and covers all required sections, but the checklist items are marked incomplete (lacking checkmarks for test execution status).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

@qodo-code-review

Review Summary by Qodo

Wave 0-4 gap closure: correctness, observability, durability, CI, compatibility, and documentation

🐞 Bug fix ✨ Enhancement 🧪 Tests 📝 Documentation

Walkthroughs

Description
  **Correctness & Protocol**
• Fixed double-swap bug in ZREVRANGEBYSCORE and ZREVRANGEBYLEX causing empty results for
  finite score ranges
• Implemented strict RESP protocol integer parsing via strict_atoi to reject trailing bytes and
  validate RESP3 null type
• Fixed stream consumer group logic and NaN-safe distance sorting in cold tier
  **Observability & Metrics (Phase 101)**
• Added Prometheus metrics exporter with /metrics, /healthz, /readyz HTTP endpoints on
  configurable --admin-port
• Enriched INFO command with real metrics: connected clients, RSS memory, CPU usage, replication
  state
• Implemented Redis-compatible SLOWLOG with per-command latency tracking and configurable thresholds
• Added 6 new #[tracing::instrument] spans across connection handlers, replication, AOF rewrite,
  and vector compaction
• Wired replication lag Prometheus gauge with global state registration
  **Durability & Crash Recovery (Phase 103)**
• Implemented crash injection test matrix covering 6 persistence modes × 4 write phases (6/7 cells)
• Added Jepsen-lite linearizability test for crash-recovery validation
• Added WAL v3 torn write detection via CRC32C validation tests
• Added backup/restore workflow and upgrade compatibility tests
  **Release Pipeline & Supply Chain (Phase 105)**
• Added cargo deny check and cargo audit enforcement in CI to block vulnerable/unlicensed
  dependencies
• Implemented aarch64-unknown-linux-gnu cross-compilation via cross tool
• Added SBOM generation, SHA256 checksums, and cosign artifact signing to release pipeline
• Enforced CHANGELOG.md updates via dedicated CI gate
  **Compatibility (Phase 104)**
• Added 40+ Redis compatibility integration tests covering strings, hashes, lists, sets, sorted
  sets, transactions, pub/sub, streams, Lua scripting, and ACL
• Added multi-language client compatibility CI matrix (Python, Go, Node.js, Rust, C, Java)
• Added vector search smoke test script
  **Infrastructure & Robustness**
• Migrated RuntimeConfig from std::sync::RwLock to parking_lot::RwLock across 15+ modules for
  non-poisonable semantics
• Replaced 40+ un-annotated .unwrap() calls with explicit error handling or safety annotations
• Added explicit TLS cipher suite allowlist (AEAD-only, PFS-required)
• Added performance regression gate workflow with memory and latency benchmarks
  **Documentation**
• Added comprehensive threat model and security boundaries document
• Added Lua sandbox security audit with CVE analysis
• Added complete configuration reference guide (50+ options)
• Added 7 operational runbooks: rolling restart, TLS rotation, corrupted AOF recovery, replica lag,
  OOM, disk full, and replication hardening
• Added monitoring setup guide with Prometheus/Grafana examples and alerting rules
• Added Redis protocol compatibility matrix and versioning policy
Diagram
flowchart LR
  A["ZREVRANGEBYSCORE<br/>Bug Fix"] --> B["Strict RESP<br/>Parsing"]
  B --> C["Protocol<br/>Correctness"]
  
  D["Prometheus<br/>Metrics"] --> E["SLOWLOG<br/>Tracking"]
  E --> F["Tracing<br/>Spans"]
  F --> G["Observability<br/>Phase 101"]
  
  H["Crash Matrix<br/>Tests"] --> I["Jepsen-lite<br/>Linearizability"]
  I --> J["WAL/CRC<br/>Validation"]
  J --> K["Durability<br/>Phase 103"]
  
  L["cargo deny<br/>+ audit"] --> M["aarch64<br/>Build"]
  M --> N["SBOM +<br/>Signing"]
  N --> O["Release<br/>Phase 105"]
  
  P["Redis Compat<br/>Tests"] --> Q["Multi-lang<br/>CI Matrix"]
  Q --> R["Compatibility<br/>Phase 104"]
  
  C --> S["Production<br/>Ready"]
  G --> S
  K --> S
  O --> S
  R --> S
  
  T["parking_lot<br/>Migration"] --> U["Error<br/>Handling"]
  U --> V["Robustness"]
  V --> S
  
  W["Threat Model<br/>+ Runbooks"] --> X["Documentation"]
  X --> S

File Changes

1. tests/redis_compat.rs 🧪 Tests +644/-0

Redis compatibility test battery with 40+ integration tests

• New comprehensive Redis compatibility test suite with 40+ integration tests covering strings,
 hashes, lists, sets, sorted sets, keys, transactions, pub/sub, streams, Lua scripting, and ACL
 commands
• Tests connect to a running Moon instance (default 127.0.0.1:6379) and are marked #[ignore]
 requiring manual server startup
• Covers edge cases like nonexistent keys, type overwrites, and cross-type operations
• Includes RESP protocol parsing and command execution via the redis crate client library

tests/redis_compat.rs


2. src/admin/metrics_setup.rs Observability +606/-0

Prometheus metrics initialization and recording helpers

• New Prometheus metrics initialization and recording module with atomic counters for INFO command
 (total commands, connections, connected clients)
• Implements command metrics recording with sanitized labels to prevent cardinality explosion, plus
 keyspace hits/misses, eviction, persistence, replication lag, and memory metrics
• Provides helper functions for RSS bytes, CPU usage via getrusage, and replication state tracking
• Integrates global slowlog and replication state management for INFO command population

src/admin/metrics_setup.rs
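
For readers unfamiliar with the RSS helper mentioned above, here is a minimal std-only sketch of how resident memory can be read from /proc/self/status on Linux. The function names `parse_vm_rss_bytes` and `rss_bytes_sketch` are hypothetical, and the fallback to 0 on non-Linux platforms is an assumption, not something the PR confirms.

```rust
/// Extract the resident set size in bytes from the text of /proc/self/status.
/// The kernel reports the `VmRSS:` field in kB (units of 1024 bytes).
fn parse_vm_rss_bytes(status: &str) -> Option<u64> {
    for line in status.lines() {
        if let Some(rest) = line.strip_prefix("VmRSS:") {
            let mut parts = rest.split_whitespace();
            let value: u64 = parts.next()?.parse().ok()?;
            return match parts.next() {
                Some("kB") => value.checked_mul(1024),
                _ => None, // unexpected unit: refuse to guess
            };
        }
    }
    None
}

/// Hypothetical wrapper: returns 0 when the file is missing or unparsable
/// (for example on platforms without procfs). This fallback is an assumption.
fn rss_bytes_sketch() -> u64 {
    std::fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|s| parse_vm_rss_bytes(&s))
        .unwrap_or(0)
}
```

Keeping the parser separate from the file read makes it testable with a fixed string, independent of the host platform.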


3. tests/replication_hardening.rs 🧪 Tests +363/-0

Replication hardening tests for PSYNC2 scenarios

• New replication hardening test suite with 4 integration tests covering partial resync within
 backlog, replica kill-restart parity, replica promotion, full resync outside backlog, and network
 partition recovery
• Tests spawn Moon server processes on different ports, simulate network partitions and crashes via
 SIGKILL, and verify data consistency
• Validates PSYNC2 correctness across reconnect scenarios and replica state transitions

tests/replication_hardening.rs


4. src/protocol/parse.rs 🐞 Bug fix +101/-15

Strict RESP protocol integer parsing and null validation

• Introduces strict_atoi function using FromRadix10SignedChecked to reject inputs with trailing
 bytes (e.g., b"5\n" now fails instead of silently ignoring the newline)
• Adds validation for RESP3 null type (_\r\n) to reject junk data before CRLF
• Replaces all atoi::atoi calls with strict_atoi for consistent strict parsing across frame
 parsing, validation, and zerocopy paths
• Adds regression test for crash artifact with bare LF in frame count and test for RESP3 null junk
 rejection

src/protocol/parse.rs
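
The strict-parsing behaviour described above can be illustrated with a small std-only sketch. The summary says the real `strict_atoi` is built on `FromRadix10SignedChecked`; the version below does not use that trait and is only a stand-in showing the contract (whole input must be a signed decimal integer). The name `strict_atoi_sketch` is hypothetical.

```rust
/// Parse a signed decimal integer from raw bytes, rejecting empty input,
/// overflow, and any byte that is not part of the number (such as a bare LF).
fn strict_atoi_sketch(bytes: &[u8]) -> Option<i64> {
    if bytes.is_empty() {
        return None;
    }
    // Allow one optional leading sign, then require at least one digit.
    let digits = match bytes[0] {
        b'-' | b'+' => &bytes[1..],
        _ => bytes,
    };
    if digits.is_empty() || !digits.iter().all(u8::is_ascii_digit) {
        return None;
    }
    // The input is pure ASCII at this point; parse() also rejects overflow.
    std::str::from_utf8(bytes).ok()?.parse::<i64>().ok()
}
```

Under this contract `b"5\n"` fails instead of being read as 5, which matches the regression case named in the summary.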


5. tests/durability/crash_matrix.rs 🧪 Tests +396/-0

Crash injection durability test matrix (6 cells)

• New crash injection test matrix covering 6 persistence modes (none, RDB, AOF-always, AOF-everysec,
 WAL+RDB, disk-offload) × 4 write phases
• Implements framework to start Moon server, write keys, SIGKILL at phase, restart, and verify RPO
 bounds
• Includes 6 test cells: AOF-always/everysec during SET, no persistence, BGSAVE crash, BGREWRITEAOF
 crash, and disk-offload spill crash
• Validates data survival and recovery correctness across crash scenarios

tests/durability/crash_matrix.rs


6. src/admin/slowlog.rs ✨ Enhancement +336/-0

Redis-compatible slowlog with latency tracking

• New slowlog module implementing Redis-compatible SLOWLOG GET/LEN/RESET/HELP commands with
 per-command latency tracking
• Uses global monotonic ID counter and ring buffer (VecDeque) with configurable max length and
 threshold
• Records command args (truncated to 128 bytes per arg), client address, client name, and execution
 duration in microseconds
• Provides handle_slowlog dispatcher and entry_to_frame serializer for RESP protocol output

src/admin/slowlog.rs
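
A rough sketch of the ring-buffer design described above, assuming a single-owner structure rather than the global state the PR uses. The type names, the `>=` threshold comparison, and the use of a plain counter field instead of a global atomic are all assumptions made for illustration.

```rust
use std::collections::VecDeque;

/// Per-argument truncation limit taken from the summary above.
const MAX_ARG_BYTES: usize = 128;

#[derive(Debug, Clone, PartialEq)]
struct SlowlogEntry {
    id: u64,
    duration_us: u64,
    args: Vec<Vec<u8>>,
}

/// Hypothetical slowlog: newest entry at the front, oldest dropped at the back.
struct Slowlog {
    next_id: u64,
    threshold_us: u64,
    max_len: usize,
    entries: VecDeque<SlowlogEntry>,
}

impl Slowlog {
    fn new(threshold_us: u64, max_len: usize) -> Self {
        Slowlog { next_id: 0, threshold_us, max_len, entries: VecDeque::new() }
    }

    /// Record a command if it ran at least `threshold_us`; returns whether it was logged.
    fn record(&mut self, duration_us: u64, args: &[&str]) -> bool {
        if self.max_len == 0 || duration_us < self.threshold_us {
            return false;
        }
        let args = args
            .iter()
            .map(|a| {
                let bytes = a.as_bytes();
                bytes[..bytes.len().min(MAX_ARG_BYTES)].to_vec()
            })
            .collect();
        let entry = SlowlogEntry { id: self.next_id, duration_us, args };
        self.next_id += 1; // IDs keep increasing even after eviction or reset
        self.entries.push_front(entry);
        self.entries.truncate(self.max_len); // drops the oldest entries
        true
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    fn reset(&mut self) {
        self.entries.clear();
    }
}
```

With the defaults listed in src/config.rs this would be `Slowlog::new(10_000, 128)`.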


7. tests/jepsen_lite.rs 🧪 Tests +323/-0

Jepsen-lite linearizability test for crash recovery

• New Jepsen-lite crash-recovery linearizability test spawning Moon with AOF appendfsync=always,
 running concurrent writers, periodically SIGKILLing, and verifying per-key monotonicity
• Implements 4 concurrent writer threads writing to disjoint key spaces across 3 restart cycles
• Validates that no key value decreases after restart (linearizability violation detection)
• Tracks ACK'd writes and verifies they survive or are missing (ACK-lost), but never regress

tests/jepsen_lite.rs
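
The per-key monotonicity check described above could be sketched as follows. This assumes each key holds an increasing counter and that the harness snapshots values before a crash and again after restart; the function name and the decision to tolerate missing keys (treated as ACK-lost) are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Return the keys whose value after restart is lower than the value seen
/// before the crash. A key that is missing after restart is tolerated here
/// (counted as ACK-lost, not as a regression).
fn find_regressions(
    before: &HashMap<String, u64>,
    after: &HashMap<String, u64>,
) -> Vec<String> {
    let mut regressed: Vec<String> = before
        .iter()
        .filter_map(|(key, &prev)| match after.get(key) {
            Some(&now) if now < prev => Some(key.clone()),
            _ => None, // equal, advanced, or missing
        })
        .collect();
    regressed.sort(); // deterministic order for reporting
    regressed
}
```

A test cycle would then fail whenever `find_regressions` returns a non-empty list.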


8. src/server/conn/handler_sharded.rs ✨ Enhancement +43/-8

Connection metrics, slowlog integration, and lock improvements

• Adds #[tracing::instrument] span to handle_connection_sharded for observability
• Records connection open/close metrics via record_connection_opened() and
 record_connection_closed()
• Adds SLOWLOG command handler dispatching to handle_slowlog
• Changes runtime_config from std::sync::RwLock to parking_lot::RwLock and removes .unwrap()
 calls on lock acquisition
• Records command latency and slowlog entries for both write and read dispatch paths with timing
 instrumentation

src/server/conn/handler_sharded.rs


9. src/config.rs ⚙️ Configuration changes +16/-0

Configuration options for admin port and slowlog

• Adds admin_port config option (default 0 = disabled) for Prometheus metrics HTTP server serving
 /metrics, /healthz, /readyz
• Adds slowlog_log_slower_than config (default 10000 microseconds) and slowlog_max_len config
 (default 128 entries)
• Adds check_config flag to validate configuration and exit without starting server

src/config.rs


10. src/command/connection.rs ✨ Enhancement +110/-6

Observability: health checks and INFO metrics expansion

• Added HEALTHZ and READYZ commands for liveness/readiness probes
• Enhanced INFO command with real metrics: connected clients, memory usage (RSS), CPU, replication
 state, and stats
• Replaced .unwrap() calls with explicit error handling for poisoned ACL locks (fail-closed
 pattern)
• Added human-readable memory formatting helper function

src/command/connection.rs
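
As an illustration of the human-readable memory helper mentioned above, here is a minimal sketch. It assumes the output follows the style Redis uses for fields such as used_memory_human (for example 1.00M); the exact format in the PR is not shown in this summary, and the name `format_memory_human_sketch` is hypothetical.

```rust
/// Format a byte count using binary units with two decimals (B, K, M, G).
fn format_memory_human_sketch(bytes: u64) -> String {
    const KIB: f64 = 1024.0;
    let b = bytes as f64;
    if bytes < 1024 {
        format!("{}B", bytes)
    } else if b < KIB * KIB {
        format!("{:.2}K", b / KIB)
    } else if b < KIB * KIB * KIB {
        format!("{:.2}M", b / (KIB * KIB))
    } else {
        format!("{:.2}G", b / (KIB * KIB * KIB))
    }
}
```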


11. src/command/acl.rs Error handling +29/-12

ACL: parking_lot migration and poison-safe error handling

• Migrated RuntimeConfig from std::sync::RwLock to parking_lot::RwLock for non-poisonable
 semantics
• Replaced all .unwrap() calls on lock reads/writes with explicit error handling
• Improved lock scope management by explicitly dropping locks before I/O operations

src/command/acl.rs
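
For context on why lock poisoning matters in this migration: `std::sync::RwLock` becomes poisoned when a thread panics while holding the write guard, so every later `read()` or `write()` returns `Err` and an `.unwrap()` on it would panic too. The sketch below shows the fail-closed style of handling with the standard-library lock; the function `is_allowed` and the `HashSet` stand-in for an ACL table are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

/// Hypothetical permission check that denies access when the lock is
/// poisoned, instead of panicking via `.unwrap()`.
fn is_allowed(acl: &RwLock<HashSet<String>>, user: &str) -> bool {
    match acl.read() {
        Ok(table) => table.contains(user),
        // A writer panicked earlier: the table may be half-updated, so deny.
        Err(_poisoned) => false,
    }
}
```

`parking_lot::RwLock` has no poisoned state, which is why the migrated call sites no longer need this branch.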


12. src/server/conn/handler_single.rs ✨ Enhancement +44/-9

Handler: tracing instrumentation and slowlog integration

• Added #[tracing::instrument] span for connection handler observability
• Integrated command timing and slowlog recording at dispatch points (write and read paths)
• Replaced .unwrap() on RuntimeConfig reads with direct access (parking_lot is non-poisonable)
• Records command latency in microseconds and feeds slowlog with peer address and client name

src/server/conn/handler_single.rs


13. tests/durability/jepsen_lite.rs 🧪 Tests +191/-0

Durability: Jepsen-lite linearizability test harness

• New Jepsen-lite linearizability harness for durability testing
• Spawns 4 writer threads incrementing sequence numbers, periodically SIGKILLs server, verifies
 monotonicity on restart
• Validates that committed values are linearizable (no gaps) across crash cycles

tests/durability/jepsen_lite.rs


14. src/shard/conn_accept.rs ✨ Enhancement +15/-24

Connection acceptance: parking_lot migration and cleanup

• Migrated RuntimeConfig from std::sync::RwLock to parking_lot::RwLock across all connection
 spawn functions
• Removed error handling for poisoned locks (parking_lot never poisons)
• Added #[allow] annotations for startup-phase .expect() calls on Lua VM initialization

src/shard/conn_accept.rs


15. src/server/conn/handler_monoio.rs ✨ Enhancement +30/-8

Monoio handler: tracing and slowlog instrumentation

• Added #[tracing::instrument] span for monoio connection handler
• Integrated command timing and slowlog recording at dispatch points (write and read paths)
• Migrated RuntimeConfig to parking_lot::RwLock and removed .unwrap() calls
• Records command latency and slowlog entries with peer address and client name

src/server/conn/handler_monoio.rs


16. src/command/mod.rs ✨ Enhancement +45/-2

Command dispatch: health checks and slowlog routing

• Added HEALTHZ and READYZ command routing in dispatch tables
• Added SLOWLOG command routing for both write and read dispatch paths
• Refactored dispatch functions to separate metrics recording (owned by handler layer) from command
 execution
• Updated is_dispatch_read_supported to include SLOWLOG as a read-safe command

src/command/mod.rs


17. src/main.rs ✨ Enhancement +37/-13

Main: startup observability and readiness signaling

• Added persistence directory validation at startup with early error reporting
• Added --check-config flag support for configuration validation without starting server
• Initialized Prometheus metrics exporter and global slowlog with user-configured thresholds
• Registered global replication state for INFO command queries
• Migrated RuntimeConfig to parking_lot::RwLock
• Marked server as ready after all shards recover from persistence

src/main.rs


18. src/admin/http_server.rs ✨ Enhancement +154/-0

Admin HTTP server: health and metrics endpoints

• New custom HTTP server for /metrics, /healthz, /readyz endpoints
• Runs on dedicated thread with single-threaded tokio runtime
• Serves Prometheus metrics alongside health/readiness probes on a single admin port
• Implements readiness flag synchronization for startup completion signaling

src/admin/http_server.rs
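
The relationship between the readiness flag and the three endpoints can be sketched without any HTTP library. The status codes, response bodies, and the names `READY`, `mark_ready`, and `route` below are assumptions for illustration; the actual server is built on hyper and its responses are not shown in this summary.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical process-wide readiness flag, flipped once startup recovery completes.
static READY: AtomicBool = AtomicBool::new(false);

fn mark_ready() {
    READY.store(true, Ordering::Release);
}

/// Map a request path to an (HTTP status, body) pair.
fn route(path: &str) -> (u16, &'static str) {
    match path {
        // Liveness: the process is up and able to answer.
        "/healthz" => (200, "ok"),
        // Readiness: only succeeds after mark_ready() has been called.
        "/readyz" if READY.load(Ordering::Acquire) => (200, "ready"),
        "/readyz" => (503, "not ready"),
        // Placeholder body; the real exporter would render Prometheus text here.
        "/metrics" => (200, "# metrics rendered by the exporter"),
        _ => (404, "not found"),
    }
}
```

Separating liveness from readiness lets an orchestrator keep a recovering node out of rotation without restarting it.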


19. tests/durability/torn_write.rs 🧪 Tests +106/-0

Durability: WAL v3 torn write and CRC validation tests

• New test suite for WAL v3 torn write detection via CRC32C validation
• Tests recovery of complete records and clean truncation at corruption points
• Validates CRC mismatch detection and empty/short data handling

tests/durability/torn_write.rs
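
To make the torn-write idea concrete, here is a self-contained sketch of CRC32C-checked record recovery. The CRC32C routine is the standard Castagnoli checksum computed bit by bit; the record layout (`[len: u32 LE][payload][crc32c(payload): u32 LE]`) and the function names are purely hypothetical and are not the WAL v3 format.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, otherwise zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Encode one record in the hypothetical layout described above.
fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 8);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out
}

/// Return the payloads of all intact leading records, stopping cleanly at
/// the first short (torn) or checksum-mismatched record.
fn recover_records(mut buf: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    while buf.len() >= 4 {
        let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
        let total = match len.checked_add(8) {
            Some(t) if t <= buf.len() => t,
            _ => break, // torn write: the record is incomplete
        };
        let payload = &buf[4..4 + len];
        let stored = u32::from_le_bytes([
            buf[4 + len], buf[5 + len], buf[6 + len], buf[7 + len],
        ]);
        if crc32c(payload) != stored {
            break; // corruption: truncate recovery here
        }
        out.push(payload.to_vec());
        buf = &buf[total..];
    }
    out
}
```

Recovery keeps every record before the damage and discards everything from the first bad record onward, which is the "clean truncation at corruption points" behaviour the tests are described as checking.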


20. tests/durability/backup_restore.rs 🧪 Tests +123/-0

Durability: backup and restore workflow test

• New backup/restore workflow test: BGSAVE → copy snapshot → restore on fresh node
• Validates data parity via DBSIZE comparison between primary and restored instances

tests/durability/backup_restore.rs


21. tests/upgrade_test.rs 🧪 Tests +92/-0

Upgrade: AOF persistence compatibility test

• New upgrade smoke test validating AOF persistence format compatibility
• Verifies that version upgrades preserve persisted state across restarts
• Tests both AOF data preservation and empty directory initialization

tests/upgrade_test.rs


22. src/shard/persistence_tick.rs ✨ Enhancement +8/-9

Persistence tick: parking_lot migration

• Migrated RuntimeConfig to parking_lot::RwLock across eviction and pressure cascade functions
• Removed error handling for poisoned locks and .unwrap() calls
• Simplified lock scope management with direct access patterns

src/shard/persistence_tick.rs


23. src/command/sorted_set/mod.rs 🐞 Bug fix +16/-41

Sorted set: ZREVRANGEBYSCORE min/max bounds correctness fix

• Fixed double-swap bug in zrange_by_score and zrange_by_lex where min/max bounds were
 incorrectly swapped for reverse queries
• Simplified logic: all callers pass (min, max) in semantic order; rev flag only affects iteration
 direction
• Removed redundant conditional branches

src/command/sorted_set/mod.rs
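
The double-swap failure mode and its fix can be shown with a small model. This sketch uses a pre-sorted slice in place of the real sorted-set structure, and both function names are hypothetical; it only demonstrates the rule stated above, that bounds stay in (min, max) order and `rev` affects nothing but output direction.

```rust
/// Members with min <= score <= max. `rev` reverses only the output order;
/// it never swaps the bounds. `sorted` must be ascending by score.
fn zrange_by_score_sketch(
    sorted: &[(f64, &str)],
    min: f64,
    max: f64,
    rev: bool,
) -> Vec<String> {
    let mut out: Vec<String> = sorted
        .iter()
        .filter(|(score, _)| *score >= min && *score <= max)
        .map(|(_, member)| member.to_string())
        .collect();
    if rev {
        out.reverse();
    }
    out
}

/// ZREVRANGEBYSCORE takes its arguments as (max, min). The command layer
/// reorders them exactly once before calling the shared helper.
fn zrevrangebyscore_sketch(sorted: &[(f64, &str)], max: f64, min: f64) -> Vec<String> {
    zrange_by_score_sketch(sorted, min, max, true)
}
```

If the helper swapped the bounds again when `rev` is set, `ZREVRANGEBYSCORE key 3 1` would end up filtering on min = 3, max = 1 and match nothing, which is the empty-result symptom described in this PR.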


24. src/tls.rs Security +34/-18

TLS: explicit cipher suite allowlist for security hardening

• Added explicit default cipher suite allowlist (AEAD-only, PFS-required)
• Prevents rustls upgrades from silently enabling weaker cipher suites
• Includes TLS 1.3 and 1.2 variants: AES-256-GCM, AES-128-GCM, CHACHA20-POLY1305

src/tls.rs


25. src/shard/event_loop.rs ✨ Enhancement +6/-9

Event loop: parking_lot migration

• Migrated RuntimeConfig and AclTable to parking_lot::RwLock
• Removed error handling for poisoned locks
• Simplified lock access patterns throughout event loop initialization

src/shard/event_loop.rs


26. src/replication/master.rs ✨ Enhancement +2/-0

Replication: PSYNC handler tracing instrumentation

• Added #[tracing::instrument] spans to both tokio and monoio variants of handle_psync_on_master
• Captures replication ID and client offset in trace fields for observability

src/replication/master.rs


27. src/storage/stream.rs 🐞 Bug fix +10/-5

Stream: consumer group and PEL safety improvements

• Fixed consumer group logic: check for existing consumer before insertion to avoid unnecessary
 allocation
• Replaced .unwrap() on non-empty BTreeMap keys with explicit guards using .next() and
 .next_back()

src/storage/stream.rs


28. src/server/conn/shared.rs ✨ Enhancement +5/-3

Shared connection: parking_lot migration

• Migrated RuntimeConfig to parking_lot::RwLock
• Removed .unwrap() calls on lock operations

src/server/conn/shared.rs


29. src/server/listener.rs ✨ Enhancement +4/-1

Listener: parking_lot migration and repl state registration

• Migrated RuntimeConfig to parking_lot::RwLock
• Registered global replication state for INFO command queries

src/server/listener.rs


30. src/shard/timers.rs ✨ Enhancement +3/-4

Timers: parking_lot migration

• Migrated RuntimeConfig to parking_lot::RwLock
• Removed .unwrap() calls on lock reads

src/shard/timers.rs


31. src/server/conn/blocking.rs Error handling +6/-1

Blocking: safe last element access

• Replaced .unwrap() on args.last() with explicit guard pattern

src/server/conn/blocking.rs


32. src/persistence/redis_rdb.rs Miscellaneous +2/-0

RDB: CRC64 extraction safety annotation

• Added #[allow] annotation for .unwrap() on CRC64 checksum extraction (guaranteed 8-byte slice)

src/persistence/redis_rdb.rs


33. src/command/set/set_write.rs Error handling +4/-1

Set write: safe set creation error handling

• Replaced .unwrap() on get_or_create_set() with explicit error guard

src/command/set/set_write.rs


34. src/persistence/aof.rs ✨ Enhancement +1/-0

AOF: rewrite operation tracing instrumentation

• Added #[tracing::instrument] span to rewrite_aof for AOF rewrite observability

src/persistence/aof.rs


35. src/storage/dashtable/mod.rs Error handling +4/-2

DashTable: safe slab capacity check

• Replaced .unwrap() on slabs.last() with .map_or() pattern for safe capacity check

src/storage/dashtable/mod.rs


36. src/command/sorted_set/sorted_set_write.rs Formatting +1/-3

Sorted set write: idiomatic score comparison

• Replaced chained .is_some() and .unwrap() with .is_some_and() for cleaner score comparison

src/command/sorted_set/sorted_set_write.rs
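
A before/after sketch of this kind of cleanup, using a hypothetical "is the score unchanged" check since the actual comparison is not shown in the summary:

```rust
/// Older style: check, then unwrap.
fn score_unchanged_before(existing: Option<f64>, new_score: f64) -> bool {
    existing.is_some() && existing.unwrap() == new_score
}

/// Same logic with Option::is_some_and: no unwrap to audit.
fn score_unchanged_after(existing: Option<f64>, new_score: f64) -> bool {
    existing.is_some_and(|old| old == new_score)
}
```

Both functions return the same result for every input; the second simply removes the `.unwrap()` call site.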


37. src/storage/tiered/cold_tier.rs 🐞 Bug fix +1/-1

Cold tier: NaN-safe distance sorting

• Replaced .unwrap() on partial_cmp() with .unwrap_or(Equal) for NaN-safe sorting

src/storage/tiered/cold_tier.rs
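
The pattern named above, shown in isolation. `partial_cmp` returns `None` when either operand is NaN, so unwrapping it panics on a NaN distance; falling back to `Ordering::Equal` avoids that. The helper names and the `(id, distance)` tuple shape are hypothetical.

```rust
use std::cmp::Ordering;

/// Compare two distances without panicking on NaN.
fn cmp_distance(a: f32, b: f32) -> Ordering {
    a.partial_cmp(&b).unwrap_or(Ordering::Equal)
}

/// Sort (id, distance) hits so the nearest comes first.
fn sort_by_distance(hits: &mut [(u32, f32)]) {
    hits.sort_by(|a, b| cmp_distance(a.1, b.1));
}
```

One caveat: with this comparator a NaN compares equal to every value, so where a NaN entry lands in the sorted output is not well defined. `f32::total_cmp` is a standard-library alternative that orders NaN deterministically; whether it would suit the cold tier is not addressed in this summary.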


38. src/lib.rs ✨ Enhancement +1/-0

Library: admin module export

• Added pub mod admin export for observability modules

src/lib.rs


39. src/server/conn_state.rs ✨ Enhancement +1/-1

Connection state: parking_lot migration

• Migrated RuntimeConfig to parking_lot::RwLock

src/server/conn_state.rs


40. src/storage/compact_key.rs Miscellaneous +1/-0

Compact key: heap pointer extraction safety annotation

• Added #[allow] annotation for .unwrap() on heap pointer extraction (guaranteed 8-byte slice)

src/storage/compact_key.rs


41. src/shard/spsc_handler.rs ✨ Enhancement +1/-0

SPSC handler: message drain tracing instrumentation

• Added #[tracing::instrument] span to drain_spsc_shared for SPSC message processing
 observability

src/shard/spsc_handler.rs


42. src/storage/compact_value.rs Miscellaneous +1/-0

Compact value: tagged pointer extraction safety annotation

• Added #[allow] annotation for .unwrap() on tagged pointer extraction (guaranteed 8-byte slice)

src/storage/compact_value.rs


43. src/shard/mesh.rs Miscellaneous +1/-0

Mesh: connection receiver take safety annotation

• Added #[allow] annotation for .expect() on connection receiver take (startup-phase,
 double-take is a logic bug)

src/shard/mesh.rs


44. tests/durability/mod.rs 🧪 Tests +10/-0

Durability: test module infrastructure

• New durability test module aggregating crash recovery, torn write, and backup/restore tests

tests/durability/mod.rs


45. src/storage/tiered/spill_thread.rs Miscellaneous +2/-0

Spill thread: spawn safety annotation

• Added #[allow] annotation for .expect() on spill thread spawn (startup-phase, spawn failure is
 fatal)

src/storage/tiered/spill_thread.rs


46. src/vector/segment/compaction.rs ✨ Enhancement +1/-0

Vector compaction: tracing instrumentation

• Added #[tracing::instrument] span to compact() for vector segment compaction observability

src/vector/segment/compaction.rs


47. src/admin/mod.rs ✨ Enhancement +8/-0

Admin module: observability infrastructure

• New admin module aggregating HTTP server, metrics setup, and slowlog functionality

src/admin/mod.rs


48. tests/durability_tests.rs 🧪 Tests +6/-0

Durability tests: suite entry point

• New test suite entry point for durability tests

tests/durability_tests.rs


49. scripts/test-vector-clients.sh 🧪 Tests +292/-0

Vector client smoke test script

• New bash script for vector search (FT.*) smoke tests via redis-cli
• Tests FT.CREATE, HSET vector ingest, FT.SEARCH, FT.INFO, FT.DROPINDEX
• Supports custom port, shard count, and skip-build options

scripts/test-vector-clients.sh


50. scripts/bench-memory.sh 🧪 Tests +195/-0

Memory regression benchmark gate script

• New bash script for RSS memory regression gate
• Writes 1M keys via redis-benchmark, measures RSS delta, compares against baseline
• Fails if RSS-per-key exceeds baseline by >10%

scripts/bench-memory.sh
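
The gate arithmetic described above is simple enough to state precisely. The real check lives in a bash script; the Rust sketch below only restates the calculation, and the function names and the choice to return `None` for zero keys are assumptions.

```rust
/// RSS growth per key, in bytes. Returns None when no keys were written.
fn rss_per_key(rss_before: u64, rss_after: u64, keys: u64) -> Option<f64> {
    if keys == 0 {
        return None;
    }
    Some(rss_after.saturating_sub(rss_before) as f64 / keys as f64)
}

/// True when the measurement exceeds the baseline by more than `tolerance`
/// (0.10 corresponds to the 10% limit mentioned above).
fn exceeds_baseline(measured: f64, baseline: f64, tolerance: f64) -> bool {
    measured > baseline * (1.0 + tolerance)
}
```

For example, a baseline of 100 bytes per key and a measurement of 120 bytes per key fails the gate, while 109 bytes per key passes.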


51. scripts/audit-unwrap.sh Miscellaneous +27/-6

Unwrap audit: stricter detection and zero baseline

• Updated baseline from 98 to 0 (target: zero un-annotated unwrap/expect in hot-path modules)
• Improved detection: skips test-only modules, comment-only lines, and checks 30-line window for
 #[allow] annotations

scripts/audit-unwrap.sh


52. scripts/test-commands.sh 🧪 Tests +1/-2

Command tests: ZREVRANGEBYSCORE correctness validation

• Removed TODO comment about ZREVRANGEBYSCORE empty result bug (now fixed)
• Added test case for ZREVRANGEBYSCORE with finite score range (3 to 1)

scripts/test-commands.sh


53. .github/workflows/bench-gate.yml ⚙️ Configuration changes +180/-0

CI: performance regression gate workflow

• New CI workflow for performance regression gate
• Runs critical benchmarks (get_hotpath, dispatch_baseline, resp_parsing, etc.) on PR and main
• Compares against baseline from main branch; fails if regression exceeds threshold
• Includes RSS memory gate (100K keys baseline 150MB)

.github/workflows/bench-gate.yml


54. Cargo.toml Dependencies +10/-3

Dependencies: observability and parking_lot infrastructure

• Added metrics and metrics-exporter-prometheus dependencies for observability
• Added hyper, hyper-util, http-body-util for custom admin HTTP server
• Migrated to parking_lot RwLock (implicit via dependency updates)
• Updated roaring from 0.10 to 0.11
• Refactored tokio features: base (rt, net, macros) always available; runtime-tokio adds full
 feature set

Cargo.toml


55. src/admin/metrics_setup.rs ✨ Enhancement +606/-0

Metrics setup: observability infrastructure

• Implements Prometheus metrics collection and readiness flag management

src/admin/metrics_setup.rs


56. src/admin/slowlog.rs ✨ Enhancement +336/-0

Slowlog: command latency tracking

• New slowlog module (inferred from usage in command dispatch)
• Provides handle_slowlog() command handler and slowlog recording with latency thresholds

src/admin/slowlog.rs


57. .github/workflows/compat.yml 🧪 Tests +350/-0

Multi-language client compatibility CI matrix

• New CI workflow testing client compatibility across 6 languages (Python, Go, Node.js, Rust, C,
 Java)
• Each job builds Moon, starts it on port 6399, and runs language-specific smoke tests
• Tests cover basic operations (SET/GET), hashes, lists, sets, sorted sets, pipelines, and INFO
 command
• Scheduled weekly and runs on pull requests to main branch

.github/workflows/compat.yml


58. docs/guides/configuration.md 📝 Documentation +138/-0

Complete command-line configuration reference guide

• Comprehensive configuration reference documenting all CLI flags and defaults
• Organized by category: Network, Server, Persistence, Memory, TLS, ACL, Cluster, Slowlog, io_uring,
 Disk Offload, WAL, Checkpoint, Vector Search
• Includes environment variables and size syntax documentation
• Covers 50+ configuration options with descriptions and defaults

docs/guides/configuration.md


59. docs/runbooks/rolling-restart.md 📝 Documentation +161/-0

Zero-downtime rolling restart operational runbook

• Step-by-step procedure for zero-downtime binary upgrades in primary+replica topology
• Covers replica drain, upgrade, promotion, and optional re-promotion workflows
• Includes replication lag monitoring and rollback procedures
• Emphasizes preserving at least one healthy node at all times

docs/runbooks/rolling-restart.md


60. docs/security/lua-sandbox.md 📝 Documentation +104/-0

Lua sandbox security audit and configuration

• Audit of Lua 5.4 sandbox configuration via mlua 0.11 vendored source
• Documents allowed libraries (base, string, table, math, cjson) and blocked libraries (io, os,
 debug, package)
• Reviews CVEs affecting Lua 5.4 (all fixed in vendored 5.4.7)
• Evaluates potential escape vectors and provides recommendations for monitoring/fuzzing

docs/security/lua-sandbox.md


61. docs/THREAT-MODEL.md 📝 Documentation +128/-0

Comprehensive threat model and security boundaries

• Defines assets (user data, credentials, availability, memory safety) and their protection
 mechanisms
• Identifies 5 attacker classes: network, authenticated client, malicious Lua, replica impersonator,
 local user
• Maps trust boundaries and risk matrix with likelihood/impact/mitigation status
• Covers RESP parser, ACL, Lua sandbox, TLS, replication, and supply chain threats

docs/THREAT-MODEL.md


62. docs/guides/monitoring.md 📝 Documentation +144/-0

Prometheus monitoring and alerting setup guide

• Guide for enabling Prometheus metrics on --admin-port with /metrics, /healthz, /readyz
 endpoints
• Documents key metrics (connected_clients, used_memory, commands_processed, keyspace hits/misses,
 eviction)
• Provides Prometheus scrape config, Grafana dashboard queries, and Kubernetes/Docker health check
 examples
• Includes alerting rules for down, high memory, and high eviction rate scenarios

docs/guides/monitoring.md


63. CHANGELOG.md 📝 Documentation +21/-0

Wave 0-4 gap closure and production readiness phases

• Added Wave 0-4 gap closure section documenting ZREVRANGEBYSCORE fix, INFO enrichment, 6 new
 tracing spans, replication lag metric wiring, CI supply chain security, aarch64 release build, crash
 matrix expansion, and compatibility tests
• Added Production Readiness Phases 92-105 section covering observability, durability, replication,
 compatibility, performance, security, and release engineering improvements
• Entries reference specific phases and include implementation details

CHANGELOG.md


64. .github/workflows/release.yml ⚙️ Configuration changes +50/-2

Release pipeline with aarch64 build, SBOM, checksums, and signing

• Added aarch64-unknown-linux-gnu build matrix entry using cross tool for cross-compilation
• Generates SBOM via cargo-cyclonedx in JSON format
• Computes SHA256 checksums for all binaries and SBOM
• Signs artifacts with cosign using COSIGN_EXPERIMENTAL mode
• Uploads SBOM, checksums, and signatures to release assets

.github/workflows/release.yml


65. docs/redis-compat.md 📝 Documentation +77/-0

Redis protocol and command compatibility matrix

• Documents RESP2/RESP3 protocol compatibility and pipelining/transactions/pub-sub support
• Lists client compatibility matrix (redis-py, go-redis, redis-rs, jedis, ioredis, hiredis tested or
 planned)
• Identifies known incompatibilities: unimplemented commands (DEBUG, ACL LOG, WAIT, MODULE,
 SENTINEL, FUNCTION), RESP3 pub/sub framing, custom RDB format, memory reporting differences
• Covers vector search command support (FT.CREATE, FT.SEARCH implemented; FT.AGGREGATE, FT.ALTER not
 implemented)

docs/redis-compat.md


66. docs/runbooks/tls-cert-rotation.md 📝 Documentation +86/-0

TLS certificate rotation operational runbook

• Procedure for rotating TLS certificates on running Moon without downtime via SIGHUP signal
• Includes certificate validation, file placement, reload, and verification steps
• Provides rollback procedure if new certificate causes handshake failures
• Notes that SIGHUP only reloads TLS, not the entire server

docs/runbooks/tls-cert-rotation.md


67. docs/guides/getting-started.md 📝 Documentation +84/-0

Getting started quick-start guide

• Quick-start guide covering prerequisites, build from source, server startup, redis-cli connection
• Demonstrates basic operations (SET/GET, INCR, HSET, LPUSH, RPOP)
• Shows AOF persistence enablement with --appendonly yes
• Links to configuration, monitoring, persistence, and TLS guides

docs/guides/getting-started.md


68. docs/versioning.md 📝 Documentation +57/-0

Semantic versioning policy and compatibility guarantees

• Defines SemVer policy: major for format/protocol breaking changes, minor for new features, patch
 for bug fixes
• Documents format versioning for RDB (magic MOON), WAL v3, and AOF with forward/backward
 compatibility rules
• Covers pre-1.0 stability guarantees and upgrade/downgrade procedures
• References Production Contract for SLO guarantees

docs/versioning.md


69. SECURITY.md 📝 Documentation +55/-0

Security policy and vulnerability reporting process

• Defines supported versions (0.1.x) and vulnerability reporting process via email/GitHub Security
 Advisories
• Specifies response timeline: 48h acknowledgment, 7d triage, 30d fix for Critical/High, 90d for
 Medium/Low
• Lists in-scope threats (memory safety, RESP parsing, ACL bypass, Lua sandbox, TLS, DoS,
 replication) and out-of-scope items
• Documents security measures: fuzzing, unsafe audit, supply chain checks, SBOM, signed releases

SECURITY.md


70. docs/runbooks/corrupted-aof-recovery.md 📝 Documentation +61/-0

Corrupted AOF recovery operational runbook

• Runbook for recovering from corrupted AOF files with symptoms and root causes
• Describes automatic recovery (Moon truncates at first corrupted record) and manual recovery steps
• Includes verification and prevention recommendations (appendfsync policy, disk monitoring, UPS)

docs/runbooks/corrupted-aof-recovery.md


71. .github/workflows/ci.yml ⚙️ Configuration changes +37/-0

CI gates for changelog and supply chain security

• Added changelog job that enforces CHANGELOG.md updates or skip-changelog label on PRs
• Added supply-chain job running cargo deny check and cargo audit to block
 vulnerable/unlicensed dependencies
• Both jobs run on pull requests to main branch

.github/workflows/ci.yml


72. docs/runbooks/replica-fell-behind.md 📝 Documentation +58/-0

Replica replication lag recovery operational runbook

• Runbook for addressing replica replication lag with symptoms and root causes
• Covers replication status checks, partial sync waiting, full resync triggering, and backlog sizing
• Includes prevention recommendations: size backlog for 2x write volume, monitor lag metric, ensure
 replica bandwidth

docs/runbooks/replica-fell-behind.md


73. docs/runbooks/oom-during-snapshot.md 📝 Documentation +47/-0

OOM during snapshot recovery operational runbook

• Runbook for OOM killer events during BGSAVE with recovery steps
• Covers restart, data verification, and root cause mitigation (increase memory, set maxmemory, use
 AOF-only)
• Includes monitoring recommendations for RSS alerts and Prometheus metrics

docs/runbooks/oom-during-snapshot.md


74. docs/runbooks/disk-full-during-wal-rotation.md 📝 Documentation +47/-0

Disk full during WAL rotation recovery operational runbook

• Runbook for disk full errors during WAL segment rotation
• Covers immediate disk space recovery, Moon restart, AOF compaction, and prevention measures
• Recommends disk monitoring, --max-wal-size bounds, dedicated partition, and disk-offload tiering

docs/runbooks/disk-full-during-wal-rotation.md


75. .github/workflows/changelog-gate.yml ⚙️ Configuration changes +35/-0

Dedicated CHANGELOG enforcement workflow

• New workflow enforcing CHANGELOG.md updates on pull requests to main
• Allows bypass via skip-changelog label for documentation-only or internal changes
• Runs on all PRs and fails if CHANGELOG.md not modified and label not present

.github/workflows/changelog-gate.yml


76. deny.toml ⚙️ Configuration changes +38/-0

Cargo-deny supply chain security configuration

• New cargo-deny configuration file blocking vulnerable advisories and unmaintained crates
• Allows common open-source licenses (MIT, Apache-2.0, BSD variants, ISC, Zlib, OpenSSL, BSL-1.0)
• Denies copyleft licenses, multiple versions of same crate, unknown registries/git sources
• Enforces crates.io registry only

deny.toml


77. .planning Miscellaneous +1/-1

Planning submodule commit update

• Subproject commit reference updated to reflect planning/tracking state

.planning



@qodo-code-review

qodo-code-review Bot commented Apr 9, 2026

Code Review by Qodo

🐞 Bugs (4)   📘 Rule violations (4)   📎 Requirement gaps (0)   🎨 UX Issues (0)
🐞 ≡ Correctness (1) ☼ Reliability (1) ➹ Performance (1) ◔ Observability (1)
📘 ☼ Reliability (1) ➹ Performance (3)


Action required

1. format! in format_memory_human 📘
Description
src/command/connection.rs introduces format!-based string building in the command path, which
performs heap allocations and formatting work in a hot module. This violates the
no-allocation/formatting requirement for code under src/command/.
Code

src/command/connection.rs[R133-147]

+/// Format bytes as human-readable (e.g. "1.23M", "456.78K").
+fn format_memory_human(bytes: u64) -> String {
+    const KB: f64 = 1024.0;
+    const MB: f64 = 1024.0 * 1024.0;
+    const GB: f64 = 1024.0 * 1024.0 * 1024.0;
+    let b = bytes as f64;
+    if b >= GB {
+        format!("{:.2}G", b / GB)
+    } else if b >= MB {
+        format!("{:.2}M", b / MB)
+    } else if b >= KB {
+        format!("{:.2}K", b / KB)
+    } else {
+        format!("{bytes}B")
+    }
Evidence
The checklist forbids introducing format!() in command dispatch modules under src/command/. The
added helper format_memory_human() uses multiple format! calls to build strings.

CLAUDE.md
src/command/connection.rs[133-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`src/command/connection.rs` adds `format!` calls for INFO/`format_memory_human()`, which violates the hot-path allocation/formatting restrictions for `src/command/`.
## Issue Context
Compliance requires avoiding `format!()`/`to_string()`/expensive allocations in command dispatch paths.
## Fix Focus Areas
- src/command/connection.rs[133-147]
- src/command/connection.rs[171-180]

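One possible direction for issue 1, as a minimal sketch only: have the caller pass a reusable buffer so that no fresh `String` is allocated per call. The name `write_memory_human` is hypothetical. The sketch still goes through `core::fmt`, so whether it satisfies the project's rule for `src/command/` is an assumption to confirm.

```rust
use std::fmt::Write;

/// Append a human-readable size (e.g. "1.23M") to `out`, reusing its capacity.
/// Hypothetical replacement shape for `format_memory_human`.
fn write_memory_human(out: &mut String, bytes: u64) {
    const KB: f64 = 1024.0;
    const MB: f64 = KB * 1024.0;
    const GB: f64 = MB * 1024.0;
    let b = bytes as f64;
    // Writing into a String cannot fail, so the fmt::Result is ignored.
    let _ = if b >= GB {
        write!(out, "{:.2}G", b / GB)
    } else if b >= MB {
        write!(out, "{:.2}M", b / MB)
    } else if b >= KB {
        write!(out, "{:.2}K", b / KB)
    } else {
        write!(out, "{bytes}B")
    };
}
```

The INFO builder could then append each section to one pre-sized buffer instead of concatenating intermediate strings.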


2. format! added in parser 📘
Description
src/protocol/parse.rs adds new format! allocations when rejecting malformed RESP3 null frames.
This adds allocation/formatting work inside the protocol parser hot path and can be amplified by
malicious clients sending invalid frames.
Code

src/protocol/parse.rs[R234-245]

+            // RESP3 Null: `_\r\n` — verify CRLF immediately follows type byte
+            if *pos + 1 >= buf.len() {
+                return Err(ParseError::Incomplete);
+            }
+            if buf[*pos] != b'\r' || buf[*pos + 1] != b'\n' {
+                return Err(ParseError::Invalid {
+                    message: format!(
+                        "RESP3 null has trailing data before CRLF at offset {}",
+                        *pos
+                    ),
+                    offset: *pos,
+                });
Evidence
The checklist prohibits introducing format!() inside protocol parsing (src/protocol/). The added
RESP3 null validation constructs error messages using format!, which allocates.

CLAUDE.md
src/protocol/parse.rs[234-245]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New RESP3 null validation paths in `src/protocol/parse.rs` use `format!()` to build error strings, introducing allocations in the protocol parser.
## Issue Context
The parser is a hot path and should avoid heap allocations/formatting even on invalid inputs to reduce DoS amplification risk.
## Fix Focus Areas
- src/protocol/parse.rs[234-245]
- src/protocol/parse.rs[629-635]
- src/protocol/parse.rs[874-880]

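A sketch of how issue 2 could be addressed: keep the message static and carry the position only in the `offset` field, so rejecting a malformed frame never allocates. The `ParseError` shape below is a hypothetical stand-in; the real type in `src/protocol/parse.rs` may differ (for example, it may store an owned `String`).

```rust
/// Hypothetical error type; the real `ParseError` may differ.
#[derive(Debug, PartialEq)]
enum ParseError {
    Incomplete,
    Invalid { message: &'static str, offset: usize },
}

/// Check that CRLF follows the RESP3 null type byte at `*pos`,
/// advancing past it on success.
fn parse_null_terminator(buf: &[u8], pos: &mut usize) -> Result<(), ParseError> {
    if *pos + 1 >= buf.len() {
        return Err(ParseError::Incomplete);
    }
    if buf[*pos] != b'\r' || buf[*pos + 1] != b'\n' {
        // Static message: the offset is reported via the field, not the text.
        return Err(ParseError::Invalid {
            message: "RESP3 null has trailing data before CRLF",
            offset: *pos,
        });
    }
    *pos += 2;
    Ok(())
}
```

Callers that need the offset in a log line can format it at the reporting site, outside the parser.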


3. std::sync::RwLock in metrics 📘
Description
src/admin/metrics_setup.rs introduces std::sync::RwLock for global replication state, contrary
to the project locking rule requiring parking_lot primitives. This exposes the code to
lock-poisoning behavior and breaks the project's standardization on a single lock primitive.
Code

src/admin/metrics_setup.rs[R535-548]

+static GLOBAL_REPL_STATE: once_cell::sync::OnceCell<
+    std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
+> = once_cell::sync::OnceCell::new();
+
+/// Register the global replication state for INFO queries.
+pub fn set_global_repl_state(
+    state: std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
+) {
+    let _ = GLOBAL_REPL_STATE.set(state);
+}
+
+/// Get replication info for INFO command: (role, connected_slaves, master_repl_offset, repl_id).
+/// Also updates the Prometheus replication lag gauge as a side-effect.
+pub fn get_replication_info() -> (&'static str, usize, u64, String) {
Evidence
The checklist requires locks to use parking_lot::RwLock/parking_lot::Mutex instead of
std::sync locks. The new global replication state is stored behind std::sync::RwLock.

CLAUDE.md
src/admin/metrics_setup.rs[535-548]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new `std::sync::RwLock` is introduced for `GLOBAL_REPL_STATE` in `src/admin/metrics_setup.rs`, violating the requirement to use `parking_lot` locks.
## Issue Context
Project locking rules standardize on `parking_lot` to avoid poisoning semantics and improve performance/consistency.
## Fix Focus Areas
- src/admin/metrics_setup.rs[535-548]
- src/admin/metrics_setup.rs[549-582]



4. READYZ always not ready 🐞
Description
connection::readyz() gates on metrics_setup::is_server_ready(), but startup only sets the admin
HTTP readiness flag and never sets SERVER_READY, so READYZ keeps returning ERR server not ready
after initialization.
Code

src/command/connection.rs[R125-131]

+pub fn readyz() -> Frame {
+    if crate::admin::metrics_setup::is_server_ready() {
+        Frame::SimpleString(Bytes::from_static(b"OK"))
+    } else {
+        Frame::Error(Bytes::from_static(b"ERR server not ready"))
+    }
+}
Evidence
READYZ checks the SERVER_READY atomic, while main only flips the Prometheus admin server’s
readiness flag; the SERVER_READY atomic is never updated in the startup path.

src/command/connection.rs[123-131]
src/admin/metrics_setup.rs[10-21]
src/main.rs[89-96]
src/main.rs[366-370]
src/admin/http_server.rs[41-47]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`READYZ` uses `metrics_setup::SERVER_READY`, but startup only sets the admin HTTP server readiness flag. This makes `READYZ` always return not-ready.
## Issue Context
- Admin HTTP `/readyz` uses an `Arc<AtomicBool>` created in `init_metrics()`.
- Redis command `READYZ` uses `metrics_setup::is_server_ready()`.
## Fix Focus Areas
- src/main.rs[366-370]
- src/admin/metrics_setup.rs[10-21]
- src/command/connection.rs[123-131]
## Suggested fix
- When shards are recovered, call `moon::admin::metrics_setup::set_server_ready()` in addition to setting the HTTP readiness flag.
- Alternatively, remove the separate `SERVER_READY` atomic and have `READYZ` read the same readiness source used by the admin HTTP server (store it in a global OnceCell).

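The second suggested fix (one readiness source for both endpoints) might look like the sketch below. All names apart from `SERVER_READY`, `set_server_ready` and `is_server_ready` are hypothetical, the real `READYZ` handler returns a `Frame` rather than a string, and the 200/503 status codes are an assumption about the admin server.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Single process-wide readiness flag shared by READYZ and HTTP /readyz.
static SERVER_READY: AtomicBool = AtomicBool::new(false);

/// Called once, after shard recovery completes.
fn set_server_ready() {
    SERVER_READY.store(true, Ordering::Release);
}

fn is_server_ready() -> bool {
    SERVER_READY.load(Ordering::Acquire)
}

/// Reply text for the READYZ command (stand-in for the `Frame` value).
fn readyz_reply() -> &'static str {
    if is_server_ready() { "OK" } else { "ERR server not ready" }
}

/// Status code for the admin HTTP /readyz endpoint, reading the same flag.
fn http_readyz_status() -> u16 {
    if is_server_ready() { 200 } else { 503 }
}
```

With one flag, the two endpoints cannot disagree about readiness.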


5. INFO keys prefixed by spaces 🐞
Description
INFO uses multi-line string literals with \ line continuation plus indentation, which embeds
leading spaces into field names (e.g. " used_memory_human"), breaking Redis INFO compatibility and
parsers.
Code

src/command/connection.rs[R174-177]

+        "used_memory:{rss}\r\n\
+         used_memory_human:{human}\r\n\
+         used_memory_rss:{rss}\r\n\
+         used_memory_peak:{rss}\r\n",
Evidence
The INFO builder includes indented continuations after \, so the output field names will contain a
leading space on every continued line (Memory/Persistence/Vector/Stats/CPU/Replication sections).

src/command/connection.rs[170-190]
src/command/connection.rs[240-272]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
INFO output includes leading spaces in keys due to indented `\`-continued string literals, which breaks Redis INFO formatting expectations.
## Issue Context
Rust string literal line continuation `\` removes the newline but preserves any indentation whitespace on the next source line.
## Fix Focus Areas
- src/command/connection.rs[170-190]
- src/command/connection.rs[240-272]
## Suggested fix
Rewrite the format strings so continued lines start immediately (no indentation), e.g.:
- "used_memory:{rss}\r\nused_memory_human:{human}\r\nused_memory_rss:{rss}\r\n..."
Apply the same change to Persistence/Vector/Stats/CPU/Replication sections.

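Before applying this fix, the premise is worth checking. According to the Rust Reference, a `\` immediately before a line break in a string literal skips the line break and all leading whitespace on the next line, so the indented continuations should not add spaces to the keys. A minimal check, with a hypothetical helper and made-up values:

```rust
/// Builds a fragment of the Memory section using indented `\` continuations,
/// mirroring the layout in the reviewed snippet.
fn memory_section(rss: u64, human: &str) -> String {
    format!(
        "used_memory:{rss}\r\n\
         used_memory_human:{human}\r\n\
         used_memory_rss:{rss}\r\n"
    )
}
```

If the assertion style below holds in the real code base as well, the finding can likely be closed without code changes. A regression test on the INFO output would settle it either way.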



Remediation recommended

6. Unconditional Instant::now() dispatch 📘
Description
src/server/conn/handler_single.rs adds ungated per-command Instant::now() timing on the dispatch
path, increasing overhead in a critical throughput path. This violates the performance invariant
that core request processing should avoid repeated Instant::now() calls.
Code

src/server/conn/handler_single.rs[R685-690]

+                                    let dispatch_start = std::time::Instant::now();
                                     let result = dispatch(&mut *guard, d_cmd, d_args, &mut selected_db, db_count);
+                                    let elapsed_us = dispatch_start.elapsed().as_micros() as u64;
+                                    if let Ok(cmd_str) = std::str::from_utf8(d_cmd) {
+                                        crate::admin::metrics_setup::record_command(cmd_str, elapsed_us);
+                                    }
Evidence
The checklist requires maintaining performance invariants and explicitly calls out avoiding
Instant::now() in critical paths. The new dispatch timing uses Instant::now() on each dispatched
command and is not gated by is_metrics_enabled() or similar.

CLAUDE.md
src/server/conn/handler_single.rs[685-690]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`handler_single` now calls `std::time::Instant::now()` on every command dispatch to measure latency, even when metrics/slowlog may be disabled.
## Issue Context
This is a core throughput path; the checklist requires avoiding repeated `Instant::now()` calls in critical request processing (prefer cached timestamps or feature-gated timing).
## Fix Focus Areas
- src/server/conn/handler_single.rs[685-698]
- src/server/conn/handler_single.rs[1093-1106]
- src/server/conn/handler_single.rs[1178-1191]

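A sketch of gated timing for issue 6, assuming a cheap global switch is available. `METRICS_ENABLED` and `timed_dispatch` are hypothetical stand-ins for `is_metrics_enabled()` and the inline code in the handler.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Instant;

/// Stand-in for the project's metrics switch.
static METRICS_ENABLED: AtomicBool = AtomicBool::new(false);

/// Run `dispatch`, reading the clock only when metrics are enabled.
/// Returns the result and the elapsed microseconds, if measured.
fn timed_dispatch<T>(dispatch: impl FnOnce() -> T) -> (T, Option<u64>) {
    if !METRICS_ENABLED.load(Ordering::Relaxed) {
        // Fast path: no clock reads at all.
        return (dispatch(), None);
    }
    let start = Instant::now();
    let result = dispatch();
    (result, Some(start.elapsed().as_micros() as u64))
}
```

The disabled path costs one relaxed atomic load per command, which is the trade-off this sketch assumes is acceptable.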


7. Client metrics only sharded 🐞
Description
INFO reports connected_clients and total_connections_received from metrics_setup atomics,
but only the sharded Tokio handler updates those counters. The non-sharded Tokio listener path and
the monoio handler never increment or decrement them, so INFO reports incorrect values in those modes.
Code

src/server/conn/handler_sharded.rs[R113-114]

+    crate::admin::metrics_setup::record_connection_opened();
     let peer_addr = stream
Evidence
INFO reads connected_clients() which is only updated by record_connection_opened/closed. The
sharded handler calls record_connection_opened(), but the non-sharded listener spawns
handler_single::handle_connection() which does not call these functions, so counters remain 0 in
that mode.

src/command/connection.rs[162-167]
src/admin/metrics_setup.rs[329-355]
src/server/conn/handler_sharded.rs[112-117]
src/server/listener.rs[214-233]
src/server/conn/handler_single.rs[57-85]
src/server/conn/handler_monoio.rs[76-101]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Connection counters used by INFO are only updated in the sharded Tokio handler. Other runtimes/handlers do not update them, producing incorrect INFO output.
## Issue Context
`connected_clients()` and `total_connections_received()` are backed by atomics incremented in `record_connection_opened/closed()`.
## Fix Focus Areas
- src/server/conn/handler_sharded.rs[112-231]
- src/server/conn/handler_single.rs[57-85]
- src/server/conn/handler_monoio.rs[76-101]
- src/admin/metrics_setup.rs[329-349]
## Suggested fix
- Call `record_connection_opened()` at the beginning of `handler_single::handle_connection()` and `handle_connection_sharded_monoio()`.
- Ensure `record_connection_closed()` runs on all exit paths (use a guard/defer pattern).
- Consider making `record_connection_closed()` saturating to avoid underflow if mismatched.

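The guard pattern and saturating decrement suggested above could be sketched as follows. `ConnectionGuard` and the two statics are hypothetical; the real counters live in `metrics_setup`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static CONNECTED_CLIENTS: AtomicU64 = AtomicU64::new(0);
static TOTAL_CONNECTIONS: AtomicU64 = AtomicU64::new(0);

/// RAII guard: counts the connection on creation and uncounts it on drop,
/// so every exit path (return, `?`, panic unwind) is covered.
struct ConnectionGuard;

impl ConnectionGuard {
    fn open() -> Self {
        CONNECTED_CLIENTS.fetch_add(1, Ordering::Relaxed);
        TOTAL_CONNECTIONS.fetch_add(1, Ordering::Relaxed);
        ConnectionGuard
    }
}

impl Drop for ConnectionGuard {
    fn drop(&mut self) {
        // Saturating decrement: never wraps below zero on mismatched calls.
        let _ = CONNECTED_CLIENTS.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
            Some(n.saturating_sub(1))
        });
    }
}

fn connected_clients() -> u64 {
    CONNECTED_CLIENTS.load(Ordering::Relaxed)
}
```

Each handler would create the guard as its first statement and hold it for the lifetime of the connection.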


8. Metrics init leaves flag set 🐞
Description
init_metrics() sets METRICS_INITIALIZED=true before calling metrics::set_global_recorder; if
set_global_recorder fails, it returns None but leaves METRICS_INITIALIZED set, making
is_metrics_enabled() report enabled even though the exporter/admin server didn’t start.
Code

src/admin/metrics_setup.rs[R53-69]

+    if METRICS_INITIALIZED
+        .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
+        .is_ok()
+    {
+        let recorder = metrics_exporter_prometheus::PrometheusBuilder::new().build_recorder();
+        let prometheus_handle = recorder.handle();
+
+        // Install as the global metrics recorder
+        if let Err(e) = metrics::set_global_recorder(recorder) {
+            tracing::error!("Failed to set global metrics recorder: {}", e);
+            return None;
+        }
+
+        let ready = std::sync::Arc::new(AtomicBool::new(false));
+        crate::admin::http_server::spawn_admin_server(addr, prometheus_handle, ready.clone());
+        Some(ready)
+    } else {
Evidence
The error path returns early without resetting METRICS_INITIALIZED, and is_metrics_enabled()
directly reads that atomic.

src/admin/metrics_setup.rs[52-71]
src/admin/metrics_setup.rs[76-81]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`METRICS_INITIALIZED` is set before a fallible `set_global_recorder()` call and is not rolled back on failure.
## Issue Context
This can cause `is_metrics_enabled()` to return true even when metrics were not successfully initialized.
## Fix Focus Areas
- src/admin/metrics_setup.rs[52-71]
- src/admin/metrics_setup.rs[76-81]
## Suggested fix
- Move the `METRICS_INITIALIZED` store/compare-exchange to after `set_global_recorder()` succeeds, or
- If keeping compare_exchange, reset `METRICS_INITIALIZED` back to false on the error path.

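The rollback variant of the suggested fix, as a minimal sketch. `init_metrics_with` is hypothetical and takes the fallible install step as a closure, standing in for `metrics::set_global_recorder` and the admin server spawn.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static METRICS_INITIALIZED: AtomicBool = AtomicBool::new(false);

/// Claim the init slot, run the fallible `install` step, and roll the flag
/// back if installation fails. Returns true only when metrics are live.
fn init_metrics_with(install: impl FnOnce() -> Result<(), String>) -> bool {
    if METRICS_INITIALIZED
        .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
        .is_err()
    {
        return false; // already initialized, or another caller holds the slot
    }
    if install().is_err() {
        // Undo the claim so is_metrics_enabled() does not report true.
        METRICS_INITIALIZED.store(false, Ordering::SeqCst);
        return false;
    }
    true
}

fn is_metrics_enabled() -> bool {
    METRICS_INITIALIZED.load(Ordering::SeqCst)
}
```

One caveat with this shape: between the claim and the rollback, a concurrent reader can briefly observe the flag as set. Setting the flag only after a successful install avoids that window, at the cost of a separate guard against double initialization.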


9. Per-command label allocates 🐞
Description
sanitize_cmd_label() calls to_ascii_lowercase() which allocates a temporary String per
command; combined with unconditional per-command timing in handlers, this adds avoidable overhead on
the hot path when metrics are enabled.
Code

src/admin/metrics_setup.rs[R89-105]

+fn sanitize_cmd_label(cmd: &str) -> &'static str {
+    if cmd.len() > 20 || cmd.is_empty() {
+        return "unknown";
+    }
+    if !cmd.bytes().all(|b| b.is_ascii_alphabetic() || b == b'.') {
+        return "unknown";
+    }
+    // Map to a static string to avoid per-call allocation.
+    // The match covers all commands Moon dispatches; anything else is "unknown".
+    match cmd.to_ascii_lowercase().as_str() {
+        // String
+        "get" => "get",
+        "set" => "set",
+        "mget" => "mget",
+        "mset" => "mset",
+        "append" => "append",
+        "incr" => "incr",
Evidence
sanitize_cmd_label() lowercases by allocating a String (to_ascii_lowercase). Handlers call
record_command() for each command and measure latency via Instant::now()/elapsed(), so this
work occurs on the command execution path.

src/admin/metrics_setup.rs[89-105]
src/admin/metrics_setup.rs[306-316]
src/server/conn/handler_single.rs[683-697]
src/server/conn/handler_sharded.rs[1398-1411]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Command metrics label sanitization allocates a temporary lowercase `String` per command. This is avoidable overhead on the hot path.
## Issue Context
The function returns a `&'static str`, but still allocates due to `to_ascii_lowercase()`.
## Fix Focus Areas
- src/admin/metrics_setup.rs[89-105]
- src/admin/metrics_setup.rs[306-316]
## Suggested fix
- Avoid allocation by matching case-insensitively without building a `String` (e.g., `match` on `cmd.as_bytes()` with ASCII folding, or use `eq_ignore_ascii_case` against a small curated set).
- If you keep a mapping table, consider a static perfect-hash (`phf`) keyed by lowercase command bytes or pre-normalize once when parsing commands.

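The `eq_ignore_ascii_case` option could be sketched as below. `KNOWN_LABELS` is a hypothetical, shortened table; the real function covers every command Moon dispatches and also filters on allowed characters.

```rust
/// Curated label set (shortened for illustration).
const KNOWN_LABELS: &[&str] = &["get", "set", "mget", "mset", "append", "incr"];

/// Map a command name to a static lowercase label without allocating.
fn sanitize_cmd_label(cmd: &str) -> &'static str {
    if cmd.is_empty() || cmd.len() > 20 {
        return "unknown";
    }
    KNOWN_LABELS
        .iter()
        .copied()
        // Case-insensitive comparison on the borrowed bytes, no temporary String.
        .find(|label| label.eq_ignore_ascii_case(cmd))
        .unwrap_or("unknown")
}
```

This is a linear scan, so for a large command set a perfect-hash table or pre-normalizing at parse time, as suggested above, is probably the better fit.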



Comment thread src/command/connection.rs
Comment on lines +133 to +147
/// Format bytes as human-readable (e.g. "1.23M", "456.78K").
fn format_memory_human(bytes: u64) -> String {
    const KB: f64 = 1024.0;
    const MB: f64 = 1024.0 * 1024.0;
    const GB: f64 = 1024.0 * 1024.0 * 1024.0;
    let b = bytes as f64;
    if b >= GB {
        format!("{:.2}G", b / GB)
    } else if b >= MB {
        format!("{:.2}M", b / MB)
    } else if b >= KB {
        format!("{:.2}K", b / KB)
    } else {
        format!("{bytes}B")
    }

Action required

1. format! in format_memory_human 📘 Rule violation ➹ Performance

src/command/connection.rs introduces format!-based string building in the command path, which
performs heap allocations and formatting work in a hot module. This violates the
no-allocation/formatting requirement for code under src/command/.
Agent Prompt
## Issue description
`src/command/connection.rs` adds `format!` calls for INFO/`format_memory_human()`, which violates the hot-path allocation/formatting restrictions for `src/command/`.

## Issue Context
Compliance requires avoiding `format!()`/`to_string()`/expensive allocations in command dispatch paths.

## Fix Focus Areas
- src/command/connection.rs[133-147]
- src/command/connection.rs[171-180]


Comment thread src/protocol/parse.rs
Comment on lines +234 to +245
// RESP3 Null: `_\r\n` — verify CRLF immediately follows type byte
if *pos + 1 >= buf.len() {
    return Err(ParseError::Incomplete);
}
if buf[*pos] != b'\r' || buf[*pos + 1] != b'\n' {
    return Err(ParseError::Invalid {
        message: format!(
            "RESP3 null has trailing data before CRLF at offset {}",
            *pos
        ),
        offset: *pos,
    });

Action required

2. format! added in parser 📘 Rule violation ➹ Performance

src/protocol/parse.rs adds new format! allocations when rejecting malformed RESP3 null frames.
This adds allocation/formatting work inside the protocol parser hot path and can be amplified by
malicious clients sending invalid frames.
Agent Prompt
## Issue description
New RESP3 null validation paths in `src/protocol/parse.rs` use `format!()` to build error strings, introducing allocations in the protocol parser.

## Issue Context
The parser is a hot path and should avoid heap allocations/formatting even on invalid inputs to reduce DoS amplification risk.

## Fix Focus Areas
- src/protocol/parse.rs[234-245]
- src/protocol/parse.rs[629-635]
- src/protocol/parse.rs[874-880]


Comment on lines +535 to +548
static GLOBAL_REPL_STATE: once_cell::sync::OnceCell<
    std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
> = once_cell::sync::OnceCell::new();

/// Register the global replication state for INFO queries.
pub fn set_global_repl_state(
    state: std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
) {
    let _ = GLOBAL_REPL_STATE.set(state);
}

/// Get replication info for INFO command: (role, connected_slaves, master_repl_offset, repl_id).
/// Also updates the Prometheus replication lag gauge as a side-effect.
pub fn get_replication_info() -> (&'static str, usize, u64, String) {

Action required

3. std::sync::RwLock in metrics 📘 Rule violation ☼ Reliability

src/admin/metrics_setup.rs introduces std::sync::RwLock for global replication state, contrary
to the project locking rule requiring parking_lot primitives. This exposes the code to
lock-poisoning behavior and breaks the project's standardization on a single lock primitive.
Agent Prompt
## Issue description
A new `std::sync::RwLock` is introduced for `GLOBAL_REPL_STATE` in `src/admin/metrics_setup.rs`, violating the requirement to use `parking_lot` locks.

## Issue Context
Project locking rules standardize on `parking_lot` to avoid poisoning semantics and improve performance/consistency.

## Fix Focus Areas
- src/admin/metrics_setup.rs[535-548]
- src/admin/metrics_setup.rs[549-582]


Comment thread src/command/connection.rs
Comment thread src/command/connection.rs
Comment on lines +174 to +177
        "used_memory:{rss}\r\n\
         used_memory_human:{human}\r\n\
         used_memory_rss:{rss}\r\n\
         used_memory_peak:{rss}\r\n",

Action required

5. INFO keys prefixed by spaces 🐞 Bug ≡ Correctness

INFO uses multi-line string literals with \ line continuation plus indentation, which embeds
leading spaces into field names (e.g. " used_memory_human"), breaking Redis INFO compatibility and
parsers.
Agent Prompt
## Issue description
INFO output includes leading spaces in keys due to indented `\`-continued string literals, which breaks Redis INFO formatting expectations.

## Issue Context
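Rust string literal line continuation `\` removes the newline and also skips leading whitespace (spaces, tabs, line breaks) at the start of the next source line, so indentation after a `\` continuation does not end up in the string. Confirm that the emitted INFO output actually contains leading spaces before changing the format strings.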

## Fix Focus Areas
- src/command/connection.rs[170-190]
- src/command/connection.rs[240-272]

## Suggested fix
Rewrite the format strings so continued lines start immediately (no indentation), e.g.:
- "used_memory:{rss}\r\nused_memory_human:{human}\r\nused_memory_rss:{rss}\r\n..."
Apply the same change to Persistence/Vector/Stats/CPU/Replication sections.
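
Before rewriting the format strings, it may be worth confirming what the literal actually produces. Per the Rust reference, a `\` immediately before a line break skips the line break and any leading spaces or tabs on the following source line. The sketch below uses a hypothetical `memory_section` helper (an illustration, not the code in `src/command/connection.rs`) together with a hypothetical `has_indented_keys` check to test whether an indented continuation yields keys with leading spaces.

```rust
/// Hypothetical stand-in for the INFO Memory formatter; names are illustrative.
fn memory_section(rss: u64, human: &str) -> String {
    // Each line ends with an explicit "\r\n" escape followed by a `\`
    // continuation; the indentation on the next source line is skipped.
    format!(
        "used_memory:{rss}\r\n\
         used_memory_human:{human}\r\n\
         used_memory_rss:{rss}\r\n"
    )
}

/// Returns true if any INFO line starts with a space or tab.
fn has_indented_keys(section: &str) -> bool {
    section
        .split("\r\n")
        .any(|line| line.starts_with(' ') || line.starts_with('\t'))
}
```

If a check like `has_indented_keys` finds no indented keys in the real INFO output, the finding may not apply as described and the format strings can stay as they are.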

Base automatically changed from v0.1.3-phases-92-100 to main April 10, 2026 00:51
… finite score ranges

The rev branch in zrange_by_score/zrange_by_lex swapped min_arg and max_arg
during parsing, but all callers already pass them in semantic (min, max) order.
This double-swap produced min_bound > max_bound, making the filter
(s >= min AND s <= max) reject everything for finite ranges like "3 1".

Fix: remove the rev-specific swap — rev only affects iteration direction
(entries.reverse), not bound parsing. Added ZREVRANGEBYSCORE finite range
test to test-commands.sh.
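
The effect of the double swap can be illustrated with a minimal sketch. The `range_by_score` function below is hypothetical and far simpler than `zrange_by_score`; it assumes callers pass bounds in semantic `(min, max)` order and that `rev` only flips the order of the returned entries.

```rust
/// Hypothetical, simplified model of a score-range query.
/// `entries` are (member, score) pairs sorted by ascending score.
fn range_by_score<'a>(
    entries: &[(&'a str, f64)],
    min: f64,
    max: f64,
    rev: bool,
) -> Vec<&'a str> {
    // Bounds are never swapped here: callers already supply (min, max).
    let mut hits: Vec<&'a str> = entries
        .iter()
        .filter(|(_, score)| *score >= min && *score <= max)
        .map(|(member, _)| *member)
        .collect();
    // `rev` affects only the order in which matches are returned.
    if rev {
        hits.reverse();
    }
    hits
}
```

Calling this sketch with the bounds swapped a second time (`min = 3.0`, `max = 1.0`) makes the filter reject every entry, which mirrors the empty result described in the commit message.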

- INFO Clients: connected_clients from atomic counter
- INFO Memory: used_memory/used_memory_human/used_memory_rss from /proc/self/status
- INFO Replication: role/connected_slaves/master_replid/master_repl_offset
  from global ReplicationState (registered at startup via set_global_repl_state)
- Added #[instrument] spans to: handle_connection (single/monoio),
  handle_psync_on_master (tokio/monoio), compact(), rewrite_aof()
- Added get_rss_bytes() for Linux /proc parsing, connected_clients()
  atomic counter, get_replication_info() global accessor

get_replication_info() now computes max lag across all replicas and
calls record_replication_lag() to update the moon_replication_lag_bytes
Prometheus gauge. Previously the function existed but had zero call sites.

- CI: add supply-chain job running cargo deny check + cargo audit
  (deny.toml already existed but was not enforced in pipeline)
- Release: add linux-aarch64-tokio matrix entry using cross for
  cross-compilation (aarch64 is the primary production target)
- Release: update checksums and release artifact list

Two new crash matrix test cells:
- crash_during_bgsave: SIGKILL mid-RDB snapshot, verify AOF recovery
- crash_during_bgrewriteaof: SIGKILL mid-AOF compaction, verify
  original AOF intact for recovery

Both use appendfsync=always so all 500 keys must survive (RPO=0).
This brings crash matrix coverage to 6/7 cells.

New redis_compat.rs test coverage:
- Streams: XADD, XLEN, XRANGE, XTRIM MAXLEN
- Lua: EVAL return string, EVAL with KEYS/ARGV, EVALSHA after SCRIPT LOAD,
  SCRIPT EXISTS/FLUSH
- ACL: WHOAMI, LIST (verify default user exists)

All tests are #[ignore] — require a running Moon instance.

Covers: ZREVRANGEBYSCORE fix, INFO enrichment, tracing spans,
repl lag metric, CI supply chain, aarch64 release build, crash matrix,
and expanded compat tests.

READYZ command always returned "ERR server not ready" because
set_server_ready() was never called. The HTTP /readyz endpoint
worked via a separate readiness_flag AtomicBool, but the Redis
READYZ command used is_server_ready() which checks SERVER_READY.

Added set_server_ready() call after shard recovery completes,
immediately before the existing readiness_flag.store().
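
The two-flag situation described above can be sketched as follows. The names `SERVER_READY`, `set_server_ready`, `is_server_ready`, and `readiness_flag` come from the commit message; `readyz_reply`, `mark_recovery_complete`, the `"OK"` reply, and the memory orderings are assumptions made for illustration only.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Flag consulted by the Redis READYZ command.
static SERVER_READY: AtomicBool = AtomicBool::new(false);

fn set_server_ready() {
    SERVER_READY.store(true, Ordering::Release);
}

fn is_server_ready() -> bool {
    SERVER_READY.load(Ordering::Acquire)
}

/// Hypothetical READYZ handler: the success reply text is an assumption.
fn readyz_reply() -> &'static str {
    if is_server_ready() {
        "OK"
    } else {
        "ERR server not ready"
    }
}

/// Hypothetical startup hook: flips both flags once shard recovery completes,
/// so the Redis command and the HTTP endpoint agree.
fn mark_recovery_complete(readiness_flag: &AtomicBool) {
    set_server_ready();
    readiness_flag.store(true, Ordering::Release);
}
```

In this sketch, omitting the `set_server_ready()` call leaves `readyz_reply` stuck on the error string even though `readiness_flag` is set, which matches the behavior the commit describes.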

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (2)
.github/workflows/ci.yml (1)

89-90: Pin the security tools in CI.

Installing whatever cargo-deny and cargo-audit happen to be latest makes this job non-reproducible. A new upstream release can start failing unrelated PRs. Please install pinned versions with --locked or use a pinned installer action.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 89 - 90, The CI step "Install
cargo-deny and cargo-audit" currently installs latest releases; change it to
install pinned, reproducible versions by either adding the cargo install flag
--locked with a Cargo.lock that pins versions or by switching to a pinned
installer action (or specifying exact versions with --version) for cargo-deny
and cargo-audit in the step named "Install cargo-deny and cargo-audit" so CI
uses deterministic tool versions.
.github/workflows/release.yml (1)

52-54: Pin cross version for reproducibility.

The --locked flag ensures consistent dependencies within cross, but cross itself isn't version-pinned. A breaking cross release could fail future builds. Pinning to the latest stable version (v0.2.5) prevents unexpected build failures.

♻️ Suggested fix
      - name: Install cross (aarch64)
        if: matrix.cross
-       run: cargo install cross --locked
+       run: cargo install cross@0.2.5 --locked
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/release.yml around lines 52 - 54, Update the GitHub
Actions step that installs cross (the step named "Install cross (aarch64)" which
currently runs "cargo install cross --locked") to pin the cross crate to v0.2.5
by adding the --version (or --vers) flag; e.g. change the run command to use
"cargo install cross --locked --version 0.2.5" so the workflow installs a
reproducible, pinned cross release.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff0d84a9-1624-44d3-8be4-790dc9745429

📥 Commits

Reviewing files that changed from the base of the PR and between 24ee60e and 3d5372a.

📒 Files selected for processing (16)
  • .github/workflows/ci.yml
  • .github/workflows/release.yml
  • CHANGELOG.md
  • scripts/test-commands.sh
  • src/admin/metrics_setup.rs
  • src/command/connection.rs
  • src/command/sorted_set/mod.rs
  • src/main.rs
  • src/persistence/aof.rs
  • src/replication/master.rs
  • src/server/conn/handler_monoio.rs
  • src/server/conn/handler_single.rs
  • src/server/listener.rs
  • src/vector/segment/compaction.rs
  • tests/durability/crash_matrix.rs
  • tests/redis_compat.rs

Comment on lines 354 to 359
pub fn record_connection_closed() {
    CONNECTED_CLIENTS.fetch_sub(1, Ordering::Relaxed);
    if !METRICS_INITIALIZED.load(Ordering::Relaxed) {
        return;
    }
    gauge!("moon_connected_clients").decrement(1.0);

⚠️ Potential issue | 🟠 Major

Guard the connected-client count against unmatched closes.

Line 355 unconditionally decrements both the atomic state and the Prometheus gauge. If a close path fires twice or after a partially-opened connection, CONNECTED_CLIENTS wraps to u64::MAX and the gauge can go negative.

Suggested fix
 pub fn record_connection_closed() {
-    CONNECTED_CLIENTS.fetch_sub(1, Ordering::Relaxed);
+    let prev = CONNECTED_CLIENTS
+        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
+            Some(n.saturating_sub(1))
+        })
+        .unwrap_or(0);
+    if prev == 0 {
+        return;
+    }
     if !METRICS_INITIALIZED.load(Ordering::Relaxed) {
         return;
     }
     gauge!("moon_connected_clients").decrement(1.0);
 }
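
A stdlib-only sketch of the guarded decrement follows. It is a variant of the suggestion above rather than the project's code: it uses `checked_sub` inside `fetch_update` so the update is rejected when the counter is already zero, and it returns a `bool` (an assumption added for illustration) that a caller could use to decide whether to touch the Prometheus gauge.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static CONNECTED_CLIENTS: AtomicU64 = AtomicU64::new(0);

fn record_connection_opened() {
    CONNECTED_CLIENTS.fetch_add(1, Ordering::Relaxed);
}

/// Returns true only if the counter was actually decremented.
/// When the counter is already 0, `checked_sub` yields `None`,
/// `fetch_update` leaves the value untouched and returns `Err`.
fn record_connection_closed() -> bool {
    CONNECTED_CLIENTS
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| n.checked_sub(1))
        .is_ok()
}

fn connected_clients() -> u64 {
    CONNECTED_CLIENTS.load(Ordering::Relaxed)
}
```

A caller would decrement the gauge only when `record_connection_closed()` returns `true`, which keeps the atomic and the exported metric in step.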
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/admin/metrics_setup.rs` around lines 354 - 359, record_connection_closed
currently unconditionally decrements CONNECTED_CLIENTS and the
"moon_connected_clients" gauge, which can underflow if called when count is
zero; change record_connection_closed to atomically decrement CONNECTED_CLIENTS
only when its current value is > 0 (use fetch_update or a compare_exchange loop
on CONNECTED_CLIENTS) and only call
gauge!("moon_connected_clients").decrement(...) if the atomic decrement actually
occurred and METRICS_INITIALIZED.load(...) is true, ensuring the atomic and
Prometheus gauge stay in sync and never go negative; reference
record_connection_closed, CONNECTED_CLIENTS, METRICS_INITIALIZED, and the
"moon_connected_clients" gauge in the fix.

Comment on lines +546 to +553
static GLOBAL_REPL_STATE: once_cell::sync::OnceCell<
    std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
> = once_cell::sync::OnceCell::new();

/// Register the global replication state for INFO queries.
pub fn set_global_repl_state(
    state: std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
) {

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and examine the code at the specified lines
if [ -f "src/admin/metrics_setup.rs" ]; then
  echo "=== File exists. Checking lines 546-553 ==="
  sed -n '546,553p' src/admin/metrics_setup.rs
  echo ""
  echo "=== Checking for std::sync::RwLock usage in this file ==="
  grep -n "std::sync::RwLock\|std::sync::Mutex\|std::sync::Arc" src/admin/metrics_setup.rs | head -20
else
  echo "File not found at src/admin/metrics_setup.rs"
fi

Repository: pilotspace/moon

Length of output: 865


🏁 Script executed:

# Get context around lines 546-553 to understand the usage better
if [ -f "src/admin/metrics_setup.rs" ]; then
  echo "=== Context around lines 546-553 ==="
  sed -n '540,560p' src/admin/metrics_setup.rs
fi

Repository: pilotspace/moon

Length of output: 925


🏁 Script executed:

# Check if parking_lot is available in the project dependencies
echo "=== Checking Cargo.toml for parking_lot dependency ==="
if grep -r "parking_lot" Cargo.toml 2>/dev/null; then
  echo "parking_lot found in main Cargo.toml"
else
  echo "parking_lot may be in workspace member Cargo.toml or feature-gated"
  find . -name "Cargo.toml" -type f -exec grep -l "parking_lot" {} \; 2>/dev/null | head -10
fi

Repository: pilotspace/moon

Length of output: 171


Replace std::sync::RwLock with parking_lot::RwLock in the global replication state.

Both the static type (line 547) and the function parameter (line 552) use std::sync::RwLock, which violates the coding guideline: never use std::sync locks. Update both to use parking_lot::RwLock instead. parking_lot is already a project dependency.

Code snippet
static GLOBAL_REPL_STATE: once_cell::sync::OnceCell<
    std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
> = once_cell::sync::OnceCell::new();

pub fn set_global_repl_state(
    state: std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/admin/metrics_setup.rs` around lines 546 - 553, The GLOBAL_REPL_STATE
static and the set_global_repl_state function parameter use std::sync::RwLock;
change both to use parking_lot::RwLock instead (keep Arc as std::sync::Arc).
Update the type in the OnceCell declaration for GLOBAL_REPL_STATE and the
parameter type of set_global_repl_state to
std::sync::Arc<parking_lot::RwLock<crate::replication::state::ReplicationState>>
(or fully-qualified parking_lot::RwLock) and adjust any imports if necessary.
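
For context on why the guideline exists, the stdlib-only sketch below shows the poisoning behavior of `std::sync::RwLock`: once a thread panics while holding the write guard, later `read()` calls return `Err`. The function name is hypothetical. `parking_lot::RwLock` does not poison and returns guards directly from `read()` and `write()`, so readers such as an INFO handler would not need to handle this case.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical demo: returns true if a reader sees a poisoned lock
/// after a writer thread panicked while holding the write guard.
fn read_fails_after_writer_panic() -> bool {
    let state = Arc::new(RwLock::new(0u64));
    let writer = Arc::clone(&state);

    // The spawned thread panics with the write guard still alive,
    // which marks the std lock as poisoned.
    let outcome = thread::spawn(move || {
        let _guard = writer.write().unwrap();
        panic!("simulated failure while holding the write lock");
    })
    .join();
    assert!(outcome.is_err());

    // Every subsequent read() now reports the poison as an Err.
    let read_failed = state.read().is_err();
    read_failed && state.is_poisoned()
}
```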

Comment on lines +569 to +585
            // Update Prometheus lag gauge: max lag across all replicas.
            if !guard.replicas.is_empty() {
                let max_lag_bytes = guard
                    .replicas
                    .iter()
                    .map(|r| {
                        let ack: u64 = r
                            .ack_offsets
                            .iter()
                            .map(|a| a.load(Ordering::Relaxed))
                            .sum();
                        offset.saturating_sub(ack)
                    })
                    .max()
                    .unwrap_or(0);
                record_replication_lag(max_lag_bytes, 0);
            }

⚠️ Potential issue | 🟠 Major

Reset replication lag to zero when no replicas are connected.

The gauge is only updated inside the non-empty branch, so after the last replica disconnects moon_replication_lag_bytes keeps exporting the previous non-zero lag forever. That makes the new metric operationally misleading.

Suggested fix
-            if !guard.replicas.is_empty() {
-                let max_lag_bytes = guard
-                    .replicas
-                    .iter()
-                    .map(|r| {
-                        let ack: u64 = r
-                            .ack_offsets
-                            .iter()
-                            .map(|a| a.load(Ordering::Relaxed))
-                            .sum();
-                        offset.saturating_sub(ack)
-                    })
-                    .max()
-                    .unwrap_or(0);
-                record_replication_lag(max_lag_bytes, 0);
-            }
+            let max_lag_bytes = guard
+                .replicas
+                .iter()
+                .map(|r| {
+                    let ack: u64 = r
+                        .ack_offsets
+                        .iter()
+                        .map(|a| a.load(Ordering::Relaxed))
+                        .sum();
+                    offset.saturating_sub(ack)
+                })
+                .max()
+                .unwrap_or(0);
+            record_replication_lag(max_lag_bytes, 0);
             return (role, slaves, offset, repl_id);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/admin/metrics_setup.rs` around lines 569 - 585, The Prometheus gauge is
only updated when guard.replicas is non-empty, so when the last replica
disconnects the previous non-zero value remains exported; modify the logic
around guard.replicas in the function containing this block (the code that
computes max_lag_bytes and calls record_replication_lag) to explicitly set the
metric to zero when replicas.is_empty() (e.g., call record_replication_lag(0, 0)
or equivalent in the empty branch) so moon_replication_lag_bytes is reset to 0
when no replicas are connected.
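
The lag computation, including the empty-replica case, can be sketched in isolation. The `Replica` struct and `max_replication_lag` function are hypothetical simplifications: they use plain `u64` values in place of the atomics in the real `ReplicationState`.

```rust
/// Hypothetical, simplified replica: one acknowledged offset per shard.
struct Replica {
    ack_offsets: Vec<u64>,
}

/// Largest lag in bytes across all replicas; 0 when none are connected.
fn max_replication_lag(master_offset: u64, replicas: &[Replica]) -> u64 {
    replicas
        .iter()
        .map(|r| {
            let ack: u64 = r.ack_offsets.iter().sum();
            // saturating_sub avoids underflow if a replica reports more
            // than the master offset.
            master_offset.saturating_sub(ack)
        })
        .max()
        // An empty iterator has no maximum, so fall back to 0.
        .unwrap_or(0)
}
```

Because `max()` on an empty iterator returns `None`, computing the value unconditionally and passing it to the gauge resets the metric to zero after the last replica disconnects.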

Comment thread src/command/connection.rs
Comment on lines +171 to +177
let rss = crate::admin::metrics_setup::get_rss_bytes();
let _ = write!(
    sections,
    "used_memory:{rss}\r\n\
     used_memory_human:{human}\r\n\
     used_memory_rss:{rss}\r\n\
     used_memory_peak:{rss}\r\n",

⚠️ Potential issue | 🟠 Major

used_memory_peak should not mirror current RSS.

This now reports the current RSS as the peak, so the “peak” value will shrink again after memory is released. INFO consumers usually assume used_memory_peak is monotonic. Please track a real high-water mark in metrics_setup and use that here instead of reusing rss.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/command/connection.rs` around lines 171 - 177, The current code prints
used_memory_peak using the current rss value, which makes the peak
non-monotonic; update metrics tracking in crate::admin::metrics_setup to
maintain a high-water mark (e.g., add or expose a get_rss_peak_bytes or
update_rss_peak function that stores the max observed RSS) and then replace the
peak usage here to call that new accessor instead of reusing rss (reference
get_rss_bytes and the used_memory_peak output string in this block to locate
where to swap rss for the real peak value); ensure the peak is only increased
when a new rss > stored_peak so used_memory_peak remains monotonic.
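
One way to keep a monotonic high-water mark is an atomic `fetch_max`. The sketch below is an assumption about how such tracking could look, not the project's implementation; `RSS_PEAK_BYTES`, `update_rss_peak`, and `get_rss_peak_bytes` are illustrative names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical global holding the largest RSS sample seen so far.
static RSS_PEAK_BYTES: AtomicU64 = AtomicU64::new(0);

/// Records a new RSS sample and returns the peak after the update.
fn update_rss_peak(current_rss: u64) -> u64 {
    // fetch_max stores max(old, current_rss) and returns the old value.
    let previous = RSS_PEAK_BYTES.fetch_max(current_rss, Ordering::Relaxed);
    previous.max(current_rss)
}

fn get_rss_peak_bytes() -> u64 {
    RSS_PEAK_BYTES.load(Ordering::Relaxed)
}
```

The INFO handler could then call `update_rss_peak(rss)` each time it samples RSS and print the returned value as `used_memory_peak`, so the field never decreases within a process lifetime.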

Comment on lines +269 to +275
// Trigger BGSAVE then immediately kill
send_resp_command(addr, "BGSAVE");
thread::sleep(Duration::from_millis(50));

// SAFETY: valid PID, SIGKILL is always valid
let ret = unsafe { libc::kill(server.id() as i32, libc::SIGKILL) };
assert_eq!(ret, 0);

⚠️ Potential issue | 🟠 Major

Wait for the background persistence job to actually start before SIGKILL.

Both code paths are asynchronous after the command returns: BGSAVE schedules spawn_blocking in src/command/persistence.rs:36-120, and BGREWRITEAOF is handled in the background writer loop in src/persistence/aof.rs:292-307. The fixed 50ms sleep does not guarantee either job is mid-flight, so these tests can pass while only covering a plain crash-after-writes scenario.

Please poll an observable in-progress signal before killing the process, e.g. INFO persistence / a rewrite temp-file side effect with a bounded timeout.

Also applies to: 330-336

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/durability/crash_matrix.rs` around lines 269 - 275, The test currently
sleeps a fixed 50ms after send_resp_command("BGSAVE")/BGREWRITEAOF then SIGKILL,
which is racy; instead poll for an in-progress persistence indicator before
killing: after calling send_resp_command(addr, "BGSAVE") or "BGREWRITEAOF" query
the server with "INFO persistence" and wait until the output shows a child or
rewrite in progress, or watch for the AOF rewrite temp-file appearing in the
background writer path, with a bounded timeout (e.g. loop with short sleep up to
N ms); only call libc::kill(...) once the observable in-progress flag is true
(ref: send_resp_command usage, BGSAVE handling in src/command/persistence.rs
spawn_blocking and BGREWRITEAOF handled by the background writer in
src/persistence/aof.rs).
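
The bounded polling described above could be factored into a small helper. `wait_until` is a hypothetical name, and the predicate is left generic because the source does not show which `INFO persistence` field or temp-file path signals an in-progress job.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Polls `in_progress` until it returns true or `timeout` elapses.
/// Returns true if the condition was observed before the deadline.
fn wait_until<F>(timeout: Duration, interval: Duration, mut in_progress: F) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if in_progress() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

In the crash tests, the predicate would issue the persistence query through `send_resp_command` and inspect the reply; the test would send SIGKILL only when `wait_until` returns `true`, and fail otherwise so a missed window is not silently counted as coverage.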

@qodo-code-review

CI Feedback 🧐

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: RSS Memory Gate

Failed stage: Set up job [❌]

Failed test name: ""

Failure summary:

The workflow failed during action preparation because it attempted to use disallowed third-party
actions:
- actions/checkout@v6
- dtolnay/rust-toolchain@1.94.0
- swatinem/rust-cache@v2

Repository policy for pilotspace/moon requires all actions to come from repositories owned by
pilotspace, so these actions were blocked before any jobs/tests could run.

Relevant error logs:
1:  ##[group]Runner Image Provisioner
2:  Hosted Compute Agent
...

13:  ##[group]Runner Image
14:  Image: ubuntu-24.04
15:  Version: 20260406.80.1
16:  Included Software: https://github.com/actions/runner-images/blob/ubuntu24/20260406.80/images/ubuntu/Ubuntu2404-Readme.md
17:  Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu24%2F20260406.80
18:  ##[endgroup]
19:  ##[group]GITHUB_TOKEN Permissions
20:  Contents: read
21:  Metadata: read
22:  Packages: read
23:  ##[endgroup]
24:  Secret source: Actions
25:  Prepare workflow directory
26:  Prepare all required actions
27:  Getting action download info
28:  ##[error]The actions actions/checkout@v6, dtolnay/rust-toolchain@1.94.0, and swatinem/rust-cache@v2 are not allowed in pilotspace/moon because all actions must be from a repository owned by pilotspace.
