fix: rate-limited ENOSPC handling for AOF writes #249
Merged
the AOF writer already propagated I/O errors, but callers logged them at warn level on every failure. under a sustained disk-full condition this floods logs at the rate of incoming writes (potentially millions of warnings per second), which is worse than the original problem.

changes:
- `log_aof_error` helper: detects ENOSPC/EDQUOT via raw OS error codes; logs at `error!` level for disk-full and `warn!` for other I/O errors
- rate limiting: the first failure logs immediately, then every 1000th consecutive failure. uses `saturating_add` to avoid overflow.
- recovery logging: when writes succeed after consecutive failures, logs an info message with the count of suppressed errors
- applied consistently across all AOF write sites: `process_single`, `write_aof_record` (blocking ops), and the periodic fsync tick

zero hot-path impact: the counter is only checked on the error path.
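The rate-limiting and recovery rules above can be sketched as a small counter. This is a minimal sketch under assumptions: `AofErrorCounter`, `on_error`, `on_success`, and `LOG_EVERY` are hypothetical names (the PR keeps the count in an `aof_errors: u32` field on `ProcessCtx`), and instead of calling `warn!`/`error!`/`info!` directly, the methods return whether and what to log so the logic can be tested in isolation. The value returned on recovery is the length of the failure streak; whether the real message counts every failure or only the unlogged ones is an assumption.

```rust
/// Log the first failure, then every 1000th consecutive failure.
const LOG_EVERY: u32 = 1000;

/// Hypothetical stand-in for the `aof_errors: u32` counter on `ProcessCtx`.
#[derive(Debug, Default)]
struct AofErrorCounter {
    /// Consecutive AOF write failures since the last successful write.
    consecutive: u32,
}

impl AofErrorCounter {
    /// Call only on the error path. Returns `true` when this failure should
    /// be logged: the 1st, 1000th, 2000th, ... consecutive failure.
    fn on_error(&mut self) -> bool {
        // saturating_add pins the counter at u32::MAX instead of wrapping
        // back to 0 (or panicking in debug builds, as plain `+` would), so a
        // very long outage never looks like a fresh "first" failure.
        self.consecutive = self.consecutive.saturating_add(1);
        self.consecutive == 1 || self.consecutive % LOG_EVERY == 0
    }

    /// Call after a successful write. If the write ended a failure streak,
    /// resets the counter and returns the streak length for the recovery
    /// (info-level) message; otherwise returns `None`.
    fn on_success(&mut self) -> Option<u32> {
        if self.consecutive == 0 {
            return None;
        }
        let streak = self.consecutive;
        self.consecutive = 0;
        Some(streak)
    }
}
```

For example, 2,500 consecutive failures produce three log lines (failures 1, 1000, and 2000), and the next successful write reports a streak of 2500 and clears the counter.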
summary
the AOF writer already propagated I/O errors, but all callers logged them at `warn!` on every failure. under a sustained disk-full condition this floods logs at the rate of incoming writes, potentially millions per second. this PR adds severity-aware, rate-limited logging for AOF errors.

also verified audit item #4 (snapshot serialization panics): all `format::write_*` calls in non-test snapshot code already use `?` propagation. no unwraps to fix.

what changed
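- `log_aof_error` helper in `shard/aof.rs`: detects ENOSPC (28) and EDQUOT (69/122) via raw OS error codes, logs at `error!` for disk-full vs `warn!` for other I/O errors
- rate limiting: first failure logs immediately, then every 1000th consecutive failure. uses `saturating_add` to prevent overflow.
- recovery logging: when writes succeed after consecutive failures, logs an `info!` message with the suppressed error count
- `aof_errors: u32` counter added to `ProcessCtx`: only touched on the error path, zero cost on success
- applied to all AOF write sites: `process_single`, `write_aof_record` (blocking ops), periodic fsync tick

what was tested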
- `cargo fmt --all`: clean
- `cargo clippy --workspace -- -D warnings`: clean
- `cargo test -p ember-persistence`: 74 tests pass
- `cargo test -p emberkv-core`: 368 tests pass
- `cargo test -p ember-integration-tests --test integration -- --test-threads=1`: 79 tests pass

design considerations
- the counter is only checked inside `if let Err(e)` branches; the happy path has no new instructions
- `io::ErrorKind::StorageFull` is nightly-only, so we check `raw_os_error()` directly. this covers both Linux (28/122) and macOS (28/69).
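The raw-errno classification described above might look like the following sketch. `is_disk_full`, `Severity`, and `severity` are hypothetical names, not the actual contents of `shard/aof.rs`. Gating EDQUOT by target OS with `cfg` is also an assumption; the real helper may simply match both 69 and 122. A caller would pick `error!` for `Severity::Error` and `warn!` for `Severity::Warn`.

```rust
use std::io;

/// ENOSPC ("no space left on device") is 28 on both Linux and macOS.
const ENOSPC: i32 = 28;

/// Which log level an AOF write error should be reported at.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    /// Disk full or quota exceeded: reported with `error!`.
    Error,
    /// Any other I/O error: reported with `warn!`.
    Warn,
}

/// Returns true for ENOSPC, or for EDQUOT on the current platform.
/// Errors without an OS code (e.g. ones built with `io::Error::new`)
/// have `raw_os_error() == None` and are never treated as disk-full.
fn is_disk_full(err: &io::Error) -> bool {
    match err.raw_os_error() {
        Some(ENOSPC) => true,
        // EDQUOT ("disk quota exceeded") has a different value per platform.
        #[cfg(target_os = "linux")]
        Some(122) => true,
        #[cfg(target_os = "macos")]
        Some(69) => true,
        _ => false,
    }
}

/// Maps an AOF write error to the log level it should be reported at.
fn severity(err: &io::Error) -> Severity {
    if is_disk_full(err) {
        Severity::Error
    } else {
        Severity::Warn
    }
}
```

Matching on `raw_os_error()` keeps the check free of `io::ErrorKind` variants, at the cost of hard-coding platform errno values.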