fix: rate-limited ENOSPC handling for AOF writes #249

kacy merged 1 commit into main from fix/persistence-resilience on Feb 23, 2026
@kacy kacy commented Feb 23, 2026

summary

the AOF writer already propagated I/O errors, but all callers logged them at warn! on every failure. under a sustained disk-full condition this floods logs at the rate of incoming writes — potentially millions per second. this PR adds severity-aware, rate-limited logging for AOF errors.

also verified audit item #4 (snapshot serialization panics) — all format::write_* calls in non-test snapshot code already use ? propagation. no unwraps to fix.

what changed

  • log_aof_error helper in shard/aof.rs: detects ENOSPC (28) and EDQUOT (69/122) via raw OS error codes, logs at error! for disk-full vs warn! for other I/O errors
  • rate-limiting: logs first failure immediately, then every 1000th consecutive failure. uses saturating_add to prevent overflow.
  • recovery logging: when writes succeed after consecutive failures, logs an info! message with the suppressed error count
  • aof_errors: u32 counter added to ProcessCtx — only touched on the error path, zero cost on success
  • applied to all 4 AOF write sites, including process_single, write_aof_record (blocking ops), and the periodic fsync tick
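
a minimal sketch of the rate-limit and recovery bookkeeping described above, not the code from the PR. the names `LOG_EVERY`, `note_failure`, and `note_success` are hypothetical; only the `u32` counter, the 1000 interval, and the use of `saturating_add` come from the description. the value returned on recovery here is the total number of consecutive failures, which may differ from the exact count the PR reports. this sketch also reads the counter on success to detect recovery, and the PR may structure that differently.

```rust
/// Hypothetical constant: the PR states the interval (1000) but not its name.
const LOG_EVERY: u32 = 1000;

/// Record one failed AOF write against the consecutive-failure counter.
/// Returns true when this failure should be logged: the first failure in a
/// run, then every 1000th consecutive failure after that.
fn note_failure(aof_errors: &mut u32) -> bool {
    // saturating_add pins the counter at u32::MAX instead of wrapping to 0,
    // so a very long outage cannot look like a fresh "first failure".
    *aof_errors = aof_errors.saturating_add(1);
    *aof_errors == 1 || *aof_errors % LOG_EVERY == 0
}

/// Record one successful AOF write. If it follows a run of failures, reset
/// the counter and return how many consecutive failures preceded it, so the
/// caller can emit a single recovery message.
fn note_success(aof_errors: &mut u32) -> Option<u32> {
    if *aof_errors == 0 {
        return None; // no outage in progress, nothing to report
    }
    // mem::take returns the old value and leaves 0 (u32::default()) behind.
    Some(std::mem::take(aof_errors))
}
```

with this shape, a caller would log only when `note_failure` returns true, and would log the recovery message only when `note_success` returns `Some(n)`.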

what was tested

  • cargo fmt --all — clean
  • cargo clippy --workspace -- -D warnings — clean
  • cargo test -p ember-persistence — 74 tests pass
  • cargo test -p emberkv-core — 368 tests pass
  • cargo test -p ember-integration-tests --test integration -- --test-threads=1 — 79 tests pass

design considerations

  • zero hot-path cost: the counter increment and check only run inside if let Err(e) branches — the happy path has no new instructions
  • rate-limit at 1000: balances visibility (operators still see periodic errors) with protection (not millions/sec). first failure always logs immediately.
  • raw OS error codes: io::ErrorKind::StorageFull is nightly-only, so we check raw_os_error() directly. covers both Linux (28/122) and macOS (28/69).
  • recovery message: explicit signal that durability is restored — important for alerting systems that might have triggered on the error
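
for illustration, the raw OS error check could look roughly like the sketch below. the constant names, the `Severity` enum, and the `is_disk_full` / `severity_for` helpers are assumptions made for this example; the numeric codes (28, 69, 122) and the error-vs-warn split are taken from the description. errors without an OS code, such as ones built with `io::Error::new`, fall through to the warn case.

```rust
use std::io;

// Numeric values as listed in the PR description; the names are hypothetical.
const ENOSPC: i32 = 28; // no space left on device (Linux and macOS)
const EDQUOT_MACOS: i32 = 69; // disk quota exceeded on macOS
const EDQUOT_LINUX: i32 = 122; // disk quota exceeded on Linux

/// Hypothetical severity type standing in for the error!/warn! log levels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warn,
}

/// True when the error carries one of the disk-full or quota errno values.
fn is_disk_full(err: &io::Error) -> bool {
    // raw_os_error() is None for errors that did not originate from the OS.
    matches!(
        err.raw_os_error(),
        Some(ENOSPC | EDQUOT_MACOS | EDQUOT_LINUX)
    )
}

/// Disk-full conditions map to Error; every other I/O error maps to Warn.
fn severity_for(err: &io::Error) -> Severity {
    if is_disk_full(err) {
        Severity::Error
    } else {
        Severity::Warn
    }
}
```

one trade-off of matching on bare numbers: the same check runs on every platform, so a code that means something else on another OS would also be classified as disk-full there.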

the AOF writer already propagated I/O errors, but callers logged them
at warn level on every failure. under a sustained disk-full condition
this floods logs with millions of warnings per second — worse than the
original problem.

changes:
- log_aof_error helper: detects ENOSPC/EDQUOT via raw OS error codes,
  logs at error! level for disk-full vs warn! for other I/O errors
- rate-limiting: first failure logs immediately, then every 1000th
  consecutive failure. uses saturating_add to avoid overflow.
- recovery logging: when writes succeed after consecutive failures,
  logs an info message with the count of suppressed errors
- applied consistently across all AOF write sites: process_single,
  write_aof_record (blocking ops), and periodic fsync tick

zero hot-path impact — the counter is only checked on the error path.
@kacy kacy merged commit 26a4aab into main Feb 23, 2026
@kacy kacy deleted the fix/persistence-resilience branch February 23, 2026 16:31