fix: wave 0-4 gap closure — correctness, observability, CI, durability, compat#67
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds CI supply-chain checks and an aarch64 Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant ConnHandler as Connection\r\nHandler
participant InfoCmd as info()\r\nCommand
participant Metrics as metrics_setup
participant ReplState as GLOBAL_REPL_STATE
Client->>ConnHandler: INFO
ConnHandler->>InfoCmd: handle INFO
InfoCmd->>Metrics: connected_clients()
Metrics-->>InfoCmd: u64
InfoCmd->>Metrics: get_rss_bytes()
Metrics->>Metrics: read /proc/self/status (Linux)
Metrics-->>InfoCmd: u64
InfoCmd->>Metrics: get_replication_info()
Metrics->>ReplState: read replication state (Arc<RwLock>)
Metrics->>Metrics: record_replication_lag()
Metrics-->>InfoCmd: (role, connected_slaves, offset, repl_id)
InfoCmd->>InfoCmd: format_memory_human()
InfoCmd-->>ConnHandler: formatted INFO payload
ConnHandler-->>Client: INFO response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Review Summary by QodoWave 0-4 gap closure: correctness, observability, durability, CI, compatibility, and documentation
WalkthroughsDescription**Correctness & Protocol** • Fixed double-swap bug in ZREVRANGEBYSCORE and ZREVRANGEBYSCORE causing empty results for finite score ranges • Implemented strict RESP protocol integer parsing via strict_atoi to reject trailing bytes and validate RESP3 null type • Fixed stream consumer group logic and NaN-safe distance sorting in cold tier **Observability & Metrics (Phase 101)** • Added Prometheus metrics exporter with /metrics, /healthz, /readyz HTTP endpoints on configurable --admin-port • Enriched INFO command with real metrics: connected clients, RSS memory, CPU usage, replication state • Implemented Redis-compatible SLOWLOG with per-command latency tracking and configurable thresholds • Added 6 new #[tracing::instrument] spans across connection handlers, replication, AOF rewrite, and vector compaction • Wired replication lag Prometheus gauge with global state registration **Durability & Crash Recovery (Phase 103)** • Implemented crash injection test matrix covering 6 persistence modes × 4 write phases (6/7 cells) • Added Jepsen-lite linearizability test for crash-recovery validation • Added WAL v3 torn write detection via CRC32C validation tests • Added backup/restore workflow and upgrade compatibility tests **Release Pipeline & Supply Chain (Phase 105)** • Added cargo deny check and cargo audit enforcement in CI to block vulnerable/unlicensed dependencies • Implemented aarch64-unknown-linux-gnu cross-compilation via cross tool • Added SBOM generation, SHA256 checksums, and cosign artifact signing to release pipeline • Enforced CHANGELOG.md updates via dedicated CI gate **Compatibility (Phase 104)** • Added 40+ Redis compatibility integration tests covering strings, hashes, lists, sets, sorted sets, transactions, pub/sub, streams, Lua scripting, and ACL • Added multi-language client compatibility CI matrix (Python, Go, Node.js, Rust, C, Java) • Added vector search smoke test script **Infrastructure & Robustness** • Migrated RuntimeConfig from std::sync::RwLock to parking_lot::RwLock across 15+ modules for non-poisonable semantics • Replaced 40+ un-annotated .unwrap() calls with explicit error handling or safety annotations • Added explicit TLS cipher suite allowlist (AEAD-only, PFS-required) • Added performance regression gate workflow with memory and latency benchmarks **Documentation** • Added comprehensive threat model and security boundaries document • Added Lua sandbox security audit with CVE analysis • Added complete configuration reference guide (50+ options) • Added 7 operational runbooks: rolling restart, TLS rotation, corrupted AOF recovery, replica lag, OOM, disk full, and replication hardening • Added monitoring setup guide with Prometheus/Grafana examples and alerting rules • Added Redis protocol compatibility matrix and versioning policy Diagramflowchart LR
A["ZREVRANGEBYSCORE<br/>Bug Fix"] --> B["Strict RESP<br/>Parsing"]
B --> C["Protocol<br/>Correctness"]
D["Prometheus<br/>Metrics"] --> E["SLOWLOG<br/>Tracking"]
E --> F["Tracing<br/>Spans"]
F --> G["Observability<br/>Phase 101"]
H["Crash Matrix<br/>Tests"] --> I["Jepsen-lite<br/>Linearizability"]
I --> J["WAL/CRC<br/>Validation"]
J --> K["Durability<br/>Phase 103"]
L["cargo deny<br/>+ audit"] --> M["aarch64<br/>Build"]
M --> N["SBOM +<br/>Signing"]
N --> O["Release<br/>Phase 105"]
P["Redis Compat<br/>Tests"] --> Q["Multi-lang<br/>CI Matrix"]
Q --> R["Compatibility<br/>Phase 104"]
C --> S["Production<br/>Ready"]
G --> S
K --> S
O --> S
R --> S
T["parking_lot<br/>Migration"] --> U["Error<br/>Handling"]
U --> V["Robustness"]
V --> S
W["Threat Model<br/>+ Runbooks"] --> X["Documentation"]
X --> S
File Changes1. tests/redis_compat.rs
|
Code Review by Qodo
|
| /// Format bytes as human-readable (e.g. "1.23M", "456.78K"). | ||
| fn format_memory_human(bytes: u64) -> String { | ||
| const KB: f64 = 1024.0; | ||
| const MB: f64 = 1024.0 * 1024.0; | ||
| const GB: f64 = 1024.0 * 1024.0 * 1024.0; | ||
| let b = bytes as f64; | ||
| if b >= GB { | ||
| format!("{:.2}G", b / GB) | ||
| } else if b >= MB { | ||
| format!("{:.2}M", b / MB) | ||
| } else if b >= KB { | ||
| format!("{:.2}K", b / KB) | ||
| } else { | ||
| format!("{bytes}B") | ||
| } |
There was a problem hiding this comment.
1. format! in format_memory_human 📘 Rule violation ➹ Performance
src/command/connection.rs introduces format!-based string building in the command path, which performs heap allocations and formatting work in a hot module. This violates the no-allocation/formatting requirement for code under src/command/.
Agent Prompt
## Issue description
`src/command/connection.rs` adds `format!` calls for INFO/`format_memory_human()`, which violates the hot-path allocation/formatting restrictions for `src/command/`.
## Issue Context
Compliance requires avoiding `format!()`/`to_string()`/expensive allocations in command dispatch paths.
## Fix Focus Areas
- src/command/connection.rs[133-147]
- src/command/connection.rs[171-180]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| // RESP3 Null: `_\r\n` — verify CRLF immediately follows type byte | ||
| if *pos + 1 >= buf.len() { | ||
| return Err(ParseError::Incomplete); | ||
| } | ||
| if buf[*pos] != b'\r' || buf[*pos + 1] != b'\n' { | ||
| return Err(ParseError::Invalid { | ||
| message: format!( | ||
| "RESP3 null has trailing data before CRLF at offset {}", | ||
| *pos | ||
| ), | ||
| offset: *pos, | ||
| }); |
There was a problem hiding this comment.
2. format! added in parser 📘 Rule violation ➹ Performance
src/protocol/parse.rs adds new format! allocations when rejecting malformed RESP3 null frames. This adds allocation/formatting work inside the protocol parser hot path and can be amplified by malicious clients sending invalid frames.
Agent Prompt
## Issue description
New RESP3 null validation paths in `src/protocol/parse.rs` use `format!()` to build error strings, introducing allocations in the protocol parser.
## Issue Context
The parser is a hot path and should avoid heap allocations/formatting even on invalid inputs to reduce DoS amplification risk.
## Fix Focus Areas
- src/protocol/parse.rs[234-245]
- src/protocol/parse.rs[629-635]
- src/protocol/parse.rs[874-880]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| static GLOBAL_REPL_STATE: once_cell::sync::OnceCell< | ||
| std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>, | ||
| > = once_cell::sync::OnceCell::new(); | ||
|
|
||
| /// Register the global replication state for INFO queries. | ||
| pub fn set_global_repl_state( | ||
| state: std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>, | ||
| ) { | ||
| let _ = GLOBAL_REPL_STATE.set(state); | ||
| } | ||
|
|
||
| /// Get replication info for INFO command: (role, connected_slaves, master_repl_offset, repl_id). | ||
| /// Also updates the Prometheus replication lag gauge as a side-effect. | ||
| pub fn get_replication_info() -> (&'static str, usize, u64, String) { |
There was a problem hiding this comment.
3. std::sync::rwlock in metrics 📘 Rule violation ☼ Reliability
src/admin/metrics_setup.rs introduces std::sync::RwLock for global replication state, contrary to the project locking rule requiring parking_lot primitives. This increases risk of poisoning behavior and violates the lock primitive standardization requirement.
Agent Prompt
## Issue description
A new `std::sync::RwLock` is introduced for `GLOBAL_REPL_STATE` in `src/admin/metrics_setup.rs`, violating the requirement to use `parking_lot` locks.
## Issue Context
Project locking rules standardize on `parking_lot` to avoid poisoning semantics and improve performance/consistency.
## Fix Focus Areas
- src/admin/metrics_setup.rs[535-548]
- src/admin/metrics_setup.rs[549-582]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| "used_memory:{rss}\r\n\ | ||
| used_memory_human:{human}\r\n\ | ||
| used_memory_rss:{rss}\r\n\ | ||
| used_memory_peak:{rss}\r\n", |
There was a problem hiding this comment.
5. Info keys prefixed by spaces 🐞 Bug ≡ Correctness
INFO uses multi-line string literals with \ line continuation plus indentation, which embeds leading spaces into field names (e.g. " used_memory_human"), breaking Redis INFO compatibility and parsers.
Agent Prompt
## Issue description
INFO output includes leading spaces in keys due to indented `\`-continued string literals, which breaks Redis INFO formatting expectations.
## Issue Context
Rust string literal line continuation `\` removes the newline but preserves any indentation whitespace on the next source line.
## Fix Focus Areas
- src/command/connection.rs[170-190]
- src/command/connection.rs[240-272]
## Suggested fix
Rewrite the format strings so continued lines start immediately (no indentation), e.g.:
- "used_memory:{rss}\r\nused_memory_human:{human}\r\nused_memory_rss:{rss}\r\n..."
Apply the same change to Persistence/Vector/Stats/CPU/Replication sections.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
f971c69 to
d33e9f6
Compare
… finite score ranges The rev branch in zrange_by_score/zrange_by_lex swapped min_arg and max_arg during parsing, but all callers already pass them in semantic (min, max) order. This double-swap produced min_bound > max_bound, making the filter (s >= min AND s <= max) reject everything for finite ranges like "3 1". Fix: remove the rev-specific swap — rev only affects iteration direction (entries.reverse), not bound parsing. Added ZREVRANGEBYSCORE finite range test to test-commands.sh.
- INFO Clients: connected_clients from atomic counter - INFO Memory: used_memory/used_memory_human/used_memory_rss from /proc/self/status - INFO Replication: role/connected_slaves/master_replid/master_repl_offset from global ReplicationState (registered at startup via set_global_repl_state) - Added #[instrument] spans to: handle_connection (single/monoio), handle_psync_on_master (tokio/monoio), compact(), rewrite_aof() - Added get_rss_bytes() for Linux /proc parsing, connected_clients() atomic counter, get_replication_info() global accessor
get_replication_info() now computes max lag across all replicas and calls record_replication_lag() to update the moon_replication_lag_bytes Prometheus gauge. Previously the function existed but had zero call sites.
- CI: add supply-chain job running cargo deny check + cargo audit (deny.toml already existed but was not enforced in pipeline) - Release: add linux-aarch64-tokio matrix entry using cross for cross-compilation (aarch64 is the primary production target) - Release: update checksums and release artifact list
Two new crash matrix test cells: - crash_during_bgsave: SIGKILL mid-RDB snapshot, verify AOF recovery - crash_during_bgrewriteaof: SIGKILL mid-AOF compaction, verify original AOF intact for recovery Both use appendfsync=always so all 500 keys must survive (RPO=0). This brings crash matrix coverage to 6/7 cells.
New redis_compat.rs test coverage: - Streams: XADD, XLEN, XRANGE, XTRIM MAXLEN - Lua: EVAL return string, EVAL with KEYS/ARGV, EVALSHA after SCRIPT LOAD, SCRIPT EXISTS/FLUSH - ACL: WHOAMI, LIST (verify default user exists) All tests are #[ignore] — require a running Moon instance.
Covers: ZREVRANGEBYSCORE fix, INFO enrichment, tracing spans, repl lag metric, CI supply chain, aarch64 release build, crash matrix, and expanded compat tests.
READYZ command always returned "ERR server not ready" because set_server_ready() was never called. The HTTP /readyz endpoint worked via a separate readiness_flag AtomicBool, but the Redis READYZ command used is_server_ready() which checks SERVER_READY. Added set_server_ready() call after shard recovery completes, immediately before the existing readiness_flag.store().
a54cad4 to
3d5372a
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (2)
.github/workflows/ci.yml (1)
89-90: Pin the security tools in CI.Installing whatever
cargo-denyandcargo-audithappen to be latest makes this job non-reproducible. A new upstream release can start failing unrelated PRs. Please install pinned versions with--lockedor use a pinned installer action.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 89 - 90, The CI step "Install cargo-deny and cargo-audit" currently installs latest releases; change it to install pinned, reproducible versions by either adding the cargo install flag --locked with a Cargo.lock that pins versions or by switching to a pinned installer action (or specifying exact versions with --version) for cargo-deny and cargo-audit in the step named "Install cargo-deny and cargo-audit" so CI uses deterministic tool versions..github/workflows/release.yml (1)
52-54: Pincrossversion for reproducibility.The
--lockedflag ensures consistent dependencies within cross, but cross itself isn't version-pinned. A breaking cross release could fail future builds. Pinning to the latest stable version (v0.2.5) prevents unexpected build failures.♻️ Suggested fix
- name: Install cross (aarch64) if: matrix.cross - run: cargo install cross --locked + run: cargo install cross@0.2.5 --locked🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/release.yml around lines 52 - 54, Update the GitHub Actions step that installs cross (the step named "Install cross (aarch64)" which currently runs "cargo install cross --locked") to pin the cross crate to v0.2.5 by adding the --version (or --vers) flag; e.g. change the run command to use "cargo install cross --locked --version 0.2.5" so the workflow installs a reproducible, pinned cross release.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/admin/metrics_setup.rs`:
- Around line 569-585: The Prometheus gauge is only updated when guard.replicas
is non-empty, so when the last replica disconnects the previous non-zero value
remains exported; modify the logic around guard.replicas in the function
containing this block (the code that computes max_lag_bytes and calls
record_replication_lag) to explicitly set the metric to zero when
replicas.is_empty() (e.g., call record_replication_lag(0, 0) or equivalent in
the empty branch) so moon_replication_lag_bytes is reset to 0 when no replicas
are connected.
- Around line 354-359: record_connection_closed currently unconditionally
decrements CONNECTED_CLIENTS and the "moon_connected_clients" gauge, which can
underflow if called when count is zero; change record_connection_closed to
atomically decrement CONNECTED_CLIENTS only when its current value is > 0 (use
fetch_update or a compare_exchange loop on CONNECTED_CLIENTS) and only call
gauge!("moon_connected_clients").decrement(...) if the atomic decrement actually
occurred and METRICS_INITIALIZED.load(...) is true, ensuring the atomic and
Prometheus gauge stay in sync and never go negative; reference
record_connection_closed, CONNECTED_CLIENTS, METRICS_INITIALIZED, and the
"moon_connected_clients" gauge in the fix.
- Around line 546-553: The GLOBAL_REPL_STATE static and the
set_global_repl_state function parameter use std::sync::RwLock; change both to
use parking_lot::RwLock instead (keep Arc as std::sync::Arc). Update the type in
the OnceCell declaration for GLOBAL_REPL_STATE and the parameter type of
set_global_repl_state to
std::sync::Arc<parking_lot::RwLock<crate::replication::state::ReplicationState>>
(or fully-qualified parking_lot::RwLock) and adjust any imports if necessary.
In `@src/command/connection.rs`:
- Around line 171-177: The current code prints used_memory_peak using the
current rss value, which makes the peak non-monotonic; update metrics tracking
in crate::admin::metrics_setup to maintain a high-water mark (e.g., add or
expose a get_rss_peak_bytes or update_rss_peak function that stores the max
observed RSS) and then replace the peak usage here to call that new accessor
instead of reusing rss (reference get_rss_bytes and the used_memory_peak output
string in this block to locate where to swap rss for the real peak value);
ensure the peak is only increased when a new rss > stored_peak so
used_memory_peak remains monotonic.
In `@tests/durability/crash_matrix.rs`:
- Around line 269-275: The test currently sleeps a fixed 50ms after
send_resp_command("BGSAVE")/BGREWRITEAOF then SIGKILL, which is racy; instead
poll for an in-progress persistence indicator before killing: after calling
send_resp_command(addr, "BGSAVE") or "BGREWRITEAOF" query the server with "INFO
persistence" and wait until the output shows a child or rewrite in progress, or
watch for the AOF rewrite temp-file appearing in the background writer path,
with a bounded timeout (e.g. loop with short sleep up to N ms); only call
libc::kill(...) once the observable in-progress flag is true (ref:
send_resp_command usage, BGSAVE handling in src/command/persistence.rs
spawn_blocking and BGREWRITEAOF handled by the background writer in
src/persistence/aof.rs).
---
Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 89-90: The CI step "Install cargo-deny and cargo-audit" currently
installs latest releases; change it to install pinned, reproducible versions by
either adding the cargo install flag --locked with a Cargo.lock that pins
versions or by switching to a pinned installer action (or specifying exact
versions with --version) for cargo-deny and cargo-audit in the step named
"Install cargo-deny and cargo-audit" so CI uses deterministic tool versions.
In @.github/workflows/release.yml:
- Around line 52-54: Update the GitHub Actions step that installs cross (the
step named "Install cross (aarch64)" which currently runs "cargo install cross
--locked") to pin the cross crate to v0.2.5 by adding the --version (or --vers)
flag; e.g. change the run command to use "cargo install cross --locked --version
0.2.5" so the workflow installs a reproducible, pinned cross release.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ff0d84a9-1624-44d3-8be4-790dc9745429
📒 Files selected for processing (16)
.github/workflows/ci.yml.github/workflows/release.ymlCHANGELOG.mdscripts/test-commands.shsrc/admin/metrics_setup.rssrc/command/connection.rssrc/command/sorted_set/mod.rssrc/main.rssrc/persistence/aof.rssrc/replication/master.rssrc/server/conn/handler_monoio.rssrc/server/conn/handler_single.rssrc/server/listener.rssrc/vector/segment/compaction.rstests/durability/crash_matrix.rstests/redis_compat.rs
| pub fn record_connection_closed() { | ||
| CONNECTED_CLIENTS.fetch_sub(1, Ordering::Relaxed); | ||
| if !METRICS_INITIALIZED.load(Ordering::Relaxed) { | ||
| return; | ||
| } | ||
| gauge!("moon_connected_clients").decrement(1.0); |
There was a problem hiding this comment.
Guard the connected-client count against unmatched closes.
Line 355 unconditionally decrements both the atomic state and the Prometheus gauge. If a close path fires twice or after a partially-opened connection, CONNECTED_CLIENTS wraps to u64::MAX and the gauge can go negative.
Suggested fix
pub fn record_connection_closed() {
- CONNECTED_CLIENTS.fetch_sub(1, Ordering::Relaxed);
+ let prev = CONNECTED_CLIENTS
+ .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
+ Some(n.saturating_sub(1))
+ })
+ .unwrap_or(0);
+ if prev == 0 {
+ return;
+ }
if !METRICS_INITIALIZED.load(Ordering::Relaxed) {
return;
}
gauge!("moon_connected_clients").decrement(1.0);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pub fn record_connection_closed() { | |
| CONNECTED_CLIENTS.fetch_sub(1, Ordering::Relaxed); | |
| if !METRICS_INITIALIZED.load(Ordering::Relaxed) { | |
| return; | |
| } | |
| gauge!("moon_connected_clients").decrement(1.0); | |
| pub fn record_connection_closed() { | |
| let prev = CONNECTED_CLIENTS | |
| .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| { | |
| Some(n.saturating_sub(1)) | |
| }) | |
| .unwrap_or(0); | |
| if prev == 0 { | |
| return; | |
| } | |
| if !METRICS_INITIALIZED.load(Ordering::Relaxed) { | |
| return; | |
| } | |
| gauge!("moon_connected_clients").decrement(1.0); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/admin/metrics_setup.rs` around lines 354 - 359, record_connection_closed
currently unconditionally decrements CONNECTED_CLIENTS and the
"moon_connected_clients" gauge, which can underflow if called when count is
zero; change record_connection_closed to atomically decrement CONNECTED_CLIENTS
only when its current value is > 0 (use fetch_update or a compare_exchange loop
on CONNECTED_CLIENTS) and only call
gauge!("moon_connected_clients").decrement(...) if the atomic decrement actually
occurred and METRICS_INITIALIZED.load(...) is true, ensuring the atomic and
Prometheus gauge stay in sync and never go negative; reference
record_connection_closed, CONNECTED_CLIENTS, METRICS_INITIALIZED, and the
"moon_connected_clients" gauge in the fix.
| static GLOBAL_REPL_STATE: once_cell::sync::OnceCell< | ||
| std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>, | ||
| > = once_cell::sync::OnceCell::new(); | ||
|
|
||
| /// Register the global replication state for INFO queries. | ||
| pub fn set_global_repl_state( | ||
| state: std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>, | ||
| ) { |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
# Check if the file exists and examine the code at the specified lines
if [ -f "src/admin/metrics_setup.rs" ]; then
echo "=== File exists. Checking lines 546-553 ==="
sed -n '546,553p' src/admin/metrics_setup.rs
echo ""
echo "=== Checking for std::sync::RwLock usage in this file ==="
grep -n "std::sync::RwLock\|std::sync::Mutex\|std::sync::Arc" src/admin/metrics_setup.rs | head -20
else
echo "File not found at src/admin/metrics_setup.rs"
fiRepository: pilotspace/moon
Length of output: 865
🏁 Script executed:
# Get context around lines 546-553 to understand the usage better
if [ -f "src/admin/metrics_setup.rs" ]; then
echo "=== Context around lines 546-553 ==="
sed -n '540,560p' src/admin/metrics_setup.rs
fiRepository: pilotspace/moon
Length of output: 925
🏁 Script executed:
# Check if parking_lot is available in the project dependencies
echo "=== Checking Cargo.toml for parking_lot dependency ==="
if grep -r "parking_lot" Cargo.toml 2>/dev/null; then
echo "parking_lot found in main Cargo.toml"
else
echo "parking_lot may be in workspace member Cargo.toml or feature-gated"
find . -name "Cargo.toml" -type f -exec grep -l "parking_lot" {} \; 2>/dev/null | head -10
fiRepository: pilotspace/moon
Length of output: 171
Replace std::sync::RwLock with parking_lot::RwLock in the global replication state.
Both the static type (line 547) and the function parameter (line 552) use std::sync::RwLock, which violates the coding guideline: never use std::sync locks. Update both to use parking_lot::RwLock instead. parking_lot is already a project dependency.
Code snippet
static GLOBAL_REPL_STATE: once_cell::sync::OnceCell<
std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
> = once_cell::sync::OnceCell::new();
pub fn set_global_repl_state(
state: std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/admin/metrics_setup.rs` around lines 546 - 553, The GLOBAL_REPL_STATE
static and the set_global_repl_state function parameter use std::sync::RwLock;
change both to use parking_lot::RwLock instead (keep Arc as std::sync::Arc).
Update the type in the OnceCell declaration for GLOBAL_REPL_STATE and the
parameter type of set_global_repl_state to
std::sync::Arc<parking_lot::RwLock<crate::replication::state::ReplicationState>>
(or fully-qualified parking_lot::RwLock) and adjust any imports if necessary.
| // Update Prometheus lag gauge: max lag across all replicas. | ||
| if !guard.replicas.is_empty() { | ||
| let max_lag_bytes = guard | ||
| .replicas | ||
| .iter() | ||
| .map(|r| { | ||
| let ack: u64 = r | ||
| .ack_offsets | ||
| .iter() | ||
| .map(|a| a.load(Ordering::Relaxed)) | ||
| .sum(); | ||
| offset.saturating_sub(ack) | ||
| }) | ||
| .max() | ||
| .unwrap_or(0); | ||
| record_replication_lag(max_lag_bytes, 0); | ||
| } |
There was a problem hiding this comment.
Reset replication lag to zero when no replicas are connected.
The gauge is only updated inside the non-empty branch, so after the last replica disconnects moon_replication_lag_bytes keeps exporting the previous non-zero lag forever. That makes the new metric operationally misleading.
Suggested fix
- if !guard.replicas.is_empty() {
- let max_lag_bytes = guard
- .replicas
- .iter()
- .map(|r| {
- let ack: u64 = r
- .ack_offsets
- .iter()
- .map(|a| a.load(Ordering::Relaxed))
- .sum();
- offset.saturating_sub(ack)
- })
- .max()
- .unwrap_or(0);
- record_replication_lag(max_lag_bytes, 0);
- }
+ let max_lag_bytes = guard
+ .replicas
+ .iter()
+ .map(|r| {
+ let ack: u64 = r
+ .ack_offsets
+ .iter()
+ .map(|a| a.load(Ordering::Relaxed))
+ .sum();
+ offset.saturating_sub(ack)
+ })
+ .max()
+ .unwrap_or(0);
+ record_replication_lag(max_lag_bytes, 0);
return (role, slaves, offset, repl_id);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Update Prometheus lag gauge: max lag across all replicas. | |
| if !guard.replicas.is_empty() { | |
| let max_lag_bytes = guard | |
| .replicas | |
| .iter() | |
| .map(|r| { | |
| let ack: u64 = r | |
| .ack_offsets | |
| .iter() | |
| .map(|a| a.load(Ordering::Relaxed)) | |
| .sum(); | |
| offset.saturating_sub(ack) | |
| }) | |
| .max() | |
| .unwrap_or(0); | |
| record_replication_lag(max_lag_bytes, 0); | |
| } | |
| let max_lag_bytes = guard | |
| .replicas | |
| .iter() | |
| .map(|r| { | |
| let ack: u64 = r | |
| .ack_offsets | |
| .iter() | |
| .map(|a| a.load(Ordering::Relaxed)) | |
| .sum(); | |
| offset.saturating_sub(ack) | |
| }) | |
| .max() | |
| .unwrap_or(0); | |
| record_replication_lag(max_lag_bytes, 0); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/admin/metrics_setup.rs` around lines 569 - 585, The Prometheus gauge is
only updated when guard.replicas is non-empty, so when the last replica
disconnects the previous non-zero value remains exported; modify the logic
around guard.replicas in the function containing this block (the code that
computes max_lag_bytes and calls record_replication_lag) to explicitly set the
metric to zero when replicas.is_empty() (e.g., call record_replication_lag(0, 0)
or equivalent in the empty branch) so moon_replication_lag_bytes is reset to 0
when no replicas are connected.
| let rss = crate::admin::metrics_setup::get_rss_bytes(); | ||
| let _ = write!( | ||
| sections, | ||
| "used_memory:{rss}\r\n\ | ||
| used_memory_human:{human}\r\n\ | ||
| used_memory_rss:{rss}\r\n\ | ||
| used_memory_peak:{rss}\r\n", |
There was a problem hiding this comment.
used_memory_peak should not mirror current RSS.
This now reports the current RSS as the peak, so the “peak” value will shrink again after memory is released. INFO consumers usually assume used_memory_peak is monotonic. Please track a real high-water mark in metrics_setup and use that here instead of reusing rss.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/command/connection.rs` around lines 171 - 177, The current code prints
used_memory_peak using the current rss value, which makes the peak
non-monotonic; update metrics tracking in crate::admin::metrics_setup to
maintain a high-water mark (e.g., add or expose a get_rss_peak_bytes or
update_rss_peak function that stores the max observed RSS) and then replace the
peak usage here to call that new accessor instead of reusing rss (reference
get_rss_bytes and the used_memory_peak output string in this block to locate
where to swap rss for the real peak value); ensure the peak is only increased
when a new rss > stored_peak so used_memory_peak remains monotonic.
| // Trigger BGSAVE then immediately kill | ||
| send_resp_command(addr, "BGSAVE"); | ||
| thread::sleep(Duration::from_millis(50)); | ||
|
|
||
| // SAFETY: valid PID, SIGKILL is always valid | ||
| let ret = unsafe { libc::kill(server.id() as i32, libc::SIGKILL) }; | ||
| assert_eq!(ret, 0); |
There was a problem hiding this comment.
Wait for the background persistence job to actually start before SIGKILL.
Both code paths are asynchronous after the command returns: BGSAVE schedules spawn_blocking in src/command/persistence.rs:36-120, and BGREWRITEAOF is handled in the background writer loop in src/persistence/aof.rs:292-307. The fixed 50ms sleep does not guarantee either job is mid-flight, so these tests can pass while only covering a plain crash-after-writes scenario.
Please poll an observable in-progress signal before killing the process, e.g. INFO persistence / a rewrite temp-file side effect with a bounded timeout.
Also applies to: 330-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/durability/crash_matrix.rs` around lines 269 - 275, The test currently
sleeps a fixed 50ms after send_resp_command("BGSAVE")/BGREWRITEAOF then SIGKILL,
which is racy; instead poll for an in-progress persistence indicator before
killing: after calling send_resp_command(addr, "BGSAVE") or "BGREWRITEAOF" query
the server with "INFO persistence" and wait until the output shows a child or
rewrite in progress, or watch for the AOF rewrite temp-file appearing in the
background writer path, with a bounded timeout (e.g. loop with short sleep up to
N ms); only call libc::kill(...) once the observable in-progress flag is true
(ref: send_resp_command usage, BGSAVE handling in src/command/persistence.rs
spawn_blocking and BGREWRITEAOF handled by the background writer in
src/persistence/aof.rs).
CI Feedback 🧐A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
|
6d7ec22 to
81cad44
Compare
Summary
Codebase-qualified gap closure across 5 areas identified by tracing all remaining TODOs against actual code state.
zrange_by_scoreandzrange_by_lexcaused empty results for finite score ranges (e.g.,ZREVRANGEBYSCORE key 3 1). Same bug pattern fixed in lex variant.#[instrument]tracing spans. Replication lag Prometheus gauge wired (had zero call sites).cargo deny check+cargo auditenforced in CI. aarch64-unknown-linux-gnu release binary added viacross.redis_compat.rs.Files changed (16 files, +495 -51)
src/command/sorted_set/mod.rs,scripts/test-commands.shsrc/admin/metrics_setup.rs,src/command/connection.rs,src/main.rs,src/server/listener.rssrc/server/conn/handler_single.rs,src/server/conn/handler_monoio.rs,src/replication/master.rs,src/vector/segment/compaction.rs,src/persistence/aof.rs.github/workflows/ci.yml,.github/workflows/release.ymltests/durability/crash_matrix.rs,tests/redis_compat.rsTest plan
cargo fmt --checkcleancargo clippy --no-default-features --features runtime-tokio,jemalloc -- -D warningscleancargo test --no-default-features --features runtime-tokio,jemalloc --release --lib— 1895 passedtest_zrevrangebyscorecargo test --test redis_compat -- --ignored(requires running Moon)cargo test --test durability_crash_matrix -- --ignored(requires built binary)cargo test --test replication_hardening -- --ignored(requires built binary)Deferred
Summary by CodeRabbit
Bug Fixes
New Features
Tests
Chores