fix: rate-limited ENOSPC handling for AOF writes by kacy · Pull Request #249 · kacy/ember

kacy · 2026-02-23T16:31:07Z

summary

the AOF writer already propagated I/O errors, but all callers logged them at warn! on every failure. under a sustained disk-full condition this floods logs at the rate of incoming writes — potentially millions per second. this PR adds severity-aware, rate-limited logging for AOF errors.

also verified audit item #4 (snapshot serialization panics) — all format::write_* calls in non-test snapshot code already use ? propagation. no unwraps to fix.

what changed

log_aof_error helper in shard/aof.rs: detects ENOSPC (28) and EDQUOT (69/122) via raw OS error codes, logs at error! for disk-full vs warn! for other I/O errors
rate-limiting: logs first failure immediately, then every 1000th consecutive failure. uses saturating_add to prevent overflow.
recovery logging: when writes succeed after consecutive failures, logs an info! message with the suppressed error count
aof_errors: u32 counter added to ProcessCtx — only touched on the error path, zero cost on success
applied to all 4 AOF write sites: process_single, write_aof_record (blocking ops), periodic fsync tick
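
A minimal sketch of the rate-limiting and recovery logic described above, not the code from the PR. `AofErrorCounter`, `record_failure`, `record_success`, and `LOG_EVERY` are hypothetical names standing in for the `aof_errors: u32` field on `ProcessCtx` and its call sites. The methods return values instead of calling the logging macros so the block stays std-only.

```rust
/// Hypothetical stand-in for the `aof_errors: u32` counter on `ProcessCtx`.
#[derive(Debug, Default)]
pub struct AofErrorCounter {
    consecutive: u32,
}

/// Assumed constant name; the PR logs every 1000th consecutive failure.
pub const LOG_EVERY: u32 = 1000;

impl AofErrorCounter {
    /// Record a failed AOF write. Returns true when this failure should be
    /// logged: the first in a streak, then every `LOG_EVERY`th one.
    pub fn record_failure(&mut self) -> bool {
        // saturating_add pins the counter at u32::MAX instead of wrapping.
        self.consecutive = self.consecutive.saturating_add(1);
        self.consecutive == 1 || self.consecutive % LOG_EVERY == 0
    }

    /// Record a successful AOF write. Returns the length of the failure
    /// streak that just ended, or None if there was no streak. A caller
    /// would emit the recovery message when this returns Some.
    pub fn record_success(&mut self) -> Option<u32> {
        if self.consecutive == 0 {
            return None;
        }
        Some(std::mem::take(&mut self.consecutive))
    }
}
```

In this sketch the success path still performs one comparison against zero. How the PR keeps the success path free of that check is not shown in the description.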

what was tested

cargo fmt --all — clean
cargo clippy --workspace -- -D warnings — clean
cargo test -p ember-persistence — 74 tests pass
cargo test -p emberkv-core — 368 tests pass
cargo test -p ember-integration-tests --test integration -- --test-threads=1 — 79 tests pass

design considerations

zero hot-path cost: the counter increment and check only run inside if let Err(e) branches — the happy path has no new instructions
rate-limit at 1000: balances visibility (operators still see periodic errors) with protection (not millions/sec). first failure always logs immediately.
raw OS error codes: io::ErrorKind::StorageFull is nightly-only, so we check raw_os_error() directly. covers both Linux (28/122) and macOS (28/69).
recovery message: explicit signal that durability is restored — important for alerting systems that might have triggered on the error
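
The raw-code check can be sketched as below, assuming the helper only needs to tell disk-full errors apart from everything else. `AofErrorSeverity` and `classify` are illustrative names, not taken from the PR. The sketch also assumes all three codes are matched on every platform, and the PR may gate them per target instead.

```rust
use std::io;

/// Hypothetical severity split: disk-full maps to error!, the rest to warn!.
#[derive(Debug, PartialEq, Eq)]
pub enum AofErrorSeverity {
    DiskFull,
    Other,
}

// errno values named in the PR description.
const ENOSPC: i32 = 28; // Linux and macOS
const EDQUOT_MACOS: i32 = 69;
const EDQUOT_LINUX: i32 = 122;

/// Classify an I/O error by its raw OS error code.
/// Errors without an OS code (e.g. built with io::Error::new) are `Other`.
pub fn classify(err: &io::Error) -> AofErrorSeverity {
    match err.raw_os_error() {
        Some(ENOSPC) | Some(EDQUOT_MACOS) | Some(EDQUOT_LINUX) => AofErrorSeverity::DiskFull,
        _ => AofErrorSeverity::Other,
    }
}
```

Matching the codes unconditionally keeps the sketch short, at the cost of treating code 69 as disk-full on Linux, where that number means something else.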

the AOF writer already propagated I/O errors, but callers logged them at warn level on every failure. under a sustained disk-full condition this floods logs with millions of warnings per second, which is worse than the original problem.

changes:
- log_aof_error helper: detects ENOSPC/EDQUOT via raw OS error codes, logs at error! level for disk-full vs warn! for other I/O errors
- rate-limiting: first failure logs immediately, then every 1000th consecutive failure. uses saturating_add to avoid overflow.
- recovery logging: when writes succeed after consecutive failures, logs an info message with the count of suppressed errors
- applied consistently across all AOF write sites: process_single, write_aof_record (blocking ops), and periodic fsync tick

zero hot-path impact: the counter is only checked on the error path.

kacy merged commit 26a4aab into main Feb 23, 2026

kacy deleted the fix/persistence-resilience branch February 23, 2026 16:31
