Add files_cleanup.max_committed_ledger_chunks config option#7788
Merged
Add files_cleanup.max_committed_ledger_chunks config option#7788
Conversation
Add periodic cleanup of old committed ledger chunk files from the main ledger directory, controlled by a new files_cleanup.max_committed_ledger_chunks configuration option. When read-only ledger directories are configured, a chunk is only deleted if an identical copy (verified by SHA-256 digest) exists in at least one read-only directory; otherwise the chunk is kept and a warning is logged. Implementation: - Rename SnapshotCleanupImpl to FilesCleanupImpl, handling both snapshot and ledger chunk cleanup in a single UV timer - Extract ledger filename helpers into src/host/ledger_filenames.h for reuse - Add config, serialization, schema, and test infrastructure plumbing - Add three e2e tests: basic cleanup, read-only dir verification, and digest mismatch handling Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace files::slurp + Sha256Hash(bytes) with streaming reads through ISha256Hash (make_incremental_sha256). Files are read in 64KB chunks, avoiding loading entire ledger chunks into memory. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new host configuration option to periodically prune old committed ledger chunk files from the main ledger directory, guarded by digest verification against configured read-only ledger directories. This extends the existing snapshot cleanup into a unified “files cleanup” timer.
Changes:
- Introduces
files_cleanup.max_committed_ledger_chunks(config + schema + docs + changelog) and wires it through host startup/config parsing. - Replaces
SnapshotCleanupImplwithFilesCleanupImpl, performing snapshot + ledger-chunk cleanup on a single periodic timer. - Adds new E2E coverage for committed-ledger-chunk cleanup, including read-only-dir verification and digest mismatch behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/infra/network.py | Forwards new cleanup config arg into node startup. |
| tests/e2e_operations.py | Adds E2E scenarios for committed chunk pruning and digest verification. |
| tests/config.jinja | Renders new files_cleanup.max_committed_ledger_chunks into generated host config. |
| src/host/run.cpp | Instantiates new unified FilesCleanupTimer when either retention option is set. |
| src/host/snapshot_cleanup_timer.h | Removed; functionality subsumed by files_cleanup_timer.h. |
| src/host/files_cleanup_timer.h | Implements snapshot + committed ledger chunk cleanup, including digest checks. |
| src/host/ledger_filenames.h | Extracts ledger filename parsing helpers for reuse. |
| src/host/ledger.h | Switches to extracted ledger filename helpers. |
| src/common/configuration.h | Adds JSON plumbing for new optional config field. |
| include/ccf/node/startup_config.h | Adds new startup config field for committed ledger chunk retention. |
| doc/operations/ledger_snapshot.rst | Documents new retention option and read-only-dir requirement. |
| doc/host_config_schema/cchost_config.json | Extends schema with max_committed_ledger_chunks and updates interval description. |
| CHANGELOG.md | Notes the new configuration option and behavior. |
- Add missing standard includes to files_cleanup_timer.h: <algorithm>, <optional>, <utility> - Add missing <string_view> include to ledger_filenames.h - Raise 'keeping ledger chunk' log from INFO to FAIL level to match the documented 'warning is logged' behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use size_t{64} to avoid bugprone-implicit-widening-of-multiplication-result
- Remove else-after-return for readability-else-after-return
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Change std::stol to std::stoull in ledger_filenames.h for size_t correctness
- Fix stale variable in wait_for_cleanup timeout error message
- Add try/except around test read-only dir cleanup for robustness
- Extract static cleanup helpers from FilesCleanupImpl into
asynchost::files_cleanup namespace for testability
- Add files_cleanup_test with 32 test cases covering:
- find_committed_ledger_chunks (empty dir, sorting, skip filters, nonexistent)
- hash_file (normal, different content, empty file, nonexistent)
- file_exists_with_matching_digest (match, mismatch, no copy, deleted,
multi-dir, empty dirs list)
- cleanup_old_ledger_chunks (empty dir, delete backed up, keep unbacked,
max_retained=0, within limit, digest mismatch)
- FilesCleanupImpl constructor validation
- ledger_filenames.h helper functions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add std::atomic<bool> cleanup_in_progress to FilesCleanupImpl that is set via compare_exchange_strong before scheduling cleanup work and reset in the completion callback. If the previous task is still running when the timer fires, the new task is skipped with a FAIL log message. Also adds a DEBUG log at the start of each cleanup task for observability (the completion log already existed). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the existing ledger chunk and snapshot cleanup notes with cross-references to a new 'Periodic File Cleanup' section. The new section explains that both cleanup types share a single timer cycle, and that only one cleanup task runs at a time (overlapping cycles are skipped with a warning). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ledger chunk cleanup tests intentionally trigger scenarios where chunks are kept (no matching backup, or digest mismatch). These produce expected [fail] log lines that the test infra treats as fatal errors. Register ignore_error_pattern_on_shutdown for 'Keeping ledger chunk' and 'digest does not match' so the tests pass cleanly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
achamayou
commented
Mar 31, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Never delete a committed ledger chunk whose end sequence number is at or beyond the newest committed snapshot. This ensures a complete ledger history exists from that snapshot onwards for disaster recovery. The committed snapshot list is gathered once per cleanup cycle and shared between snapshot cleanup and the watermark calculation, avoiding redundant directory scans. Also replaces non-ASCII em-dashes with hyphens in comments and docs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix use-after-free risk in cleanup_in_progress: change from raw atomic<bool> member to shared_ptr<atomic<bool>> so the flag remains valid even if FilesCleanupTimer is destroyed while a cleanup task is still running on the libuv thread pool. - Use non-throwing filesystem overloads: replace fs::exists() and fs::is_regular_file() with std::error_code overloads in file_exists_with_matching_digest to prevent exceptions from permission issues or broken mounts escaping into the cleanup worker. - Fix doc wording: change 'a warning is logged' to 'a failure-level log message is emitted' to match the actual LOG_FAIL_FMT severity. - Use unique temp directories in tests: replace fixed temp dir names with mkdtemp-based make_unique_test_dir() helper to prevent cross-test interference during parallel execution. - Propagate snapshot listing errors: change find_committed_snapshots to return std::optional to distinguish 'no snapshots' from 'listing failed'. When listing fails, skip all file cleanup to avoid deleting ledger chunks without a valid snapshot watermark. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In file_exists_with_matching_digest, handle each error_code result separately rather than collapsing all errors into 'file gone'. If fs::exists or fs::is_regular_file sets an error_code (e.g. permission denied, broken mount), log it as a failure and return false to keep the chunk safe. Only return true (safe to delete) when the file is genuinely absent or not a regular file with no error. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
eddyashton
reviewed
Mar 31, 2026
eddyashton
reviewed
Mar 31, 2026
achamayou
commented
Mar 31, 2026
eddyashton
reviewed
Mar 31, 2026
eddyashton
reviewed
Mar 31, 2026
eddyashton
approved these changes
Mar 31, 2026
Add two new e2e tests: 1. test_post_snapshot_chunks_retained: Verifies that committed ledger chunks created after the latest snapshot are NOT deleted even when the total count exceeds max_committed_ledger_chunks. Sets max_retained=1 and generates 4+ post-snapshot chunks, then asserts all post-watermark chunks survive cleanup. 2. run_ledger_cleanup_no_read_only_dir_check: Verifies that a node fails to start with a clear error message when max_committed_ledger_chunks is configured but no read-only ledger directory is provided. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run_ledger_cleanup_no_read_only_dir_check: use start-stop-restart pattern (start healthy, stop, patch config, restart) since the infra setup cannot start a network with invalid config. Check stderr (not stdout) for the std::logic_error message. Use ignore_errors_on_shutdown() and skip_verify_chunking since the node intentionally crashes. - Minor whitespace fix from auto-formatter in assertion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace start-stop-restart pattern with a direct network.start() that fails immediately. The node binary starts, reads the config with max_committed_ledger_chunks but no read-only dirs, throws std::logic_error, and exits. Much simpler and tests the actual first-boot path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…on handling Replace bool return from file_exists_with_matching_digest() with a DigestCheckResult enum (match_found, no_match, file_gone). This addresses two review comments: - The function no longer returns true when the local file is concurrently deleted, which was semantically confusing. - The caller now skips already-deleted files without attempting std::filesystem::remove(), avoiding spurious FAIL logs. - As a TOCTOU safety net, errc::no_such_file_or_directory from remove() is also treated as success (logged at INFO). - Rename function to check_digest_against_read_only_dirs() to better reflect the tri-state semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use std::uint8_t as the underlying type for the 3-value enum. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eneration The test captured the snapshot seqno before generating post-snapshot chunks, but new snapshots could be auto-triggered during that phase. The C++ cleanup uses the latest snapshot as its watermark, so the test must re-query after chunk generation to compare against the same value. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The new ledger cleanup sub-tests add significant runtime to schema_test, causing it to exceed the default 360s CTest timeout on VMSS Virtual B/C CI runners. Set an explicit 900s (15 min) timeout for this test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add optional periodic cleanup of old committed ledger chunk files from the main ledger directory, controlled by a new
files_cleanup.max_committed_ledger_chunksconfiguration option.This is only enabled when read-only ledger directories are configured. A chunk is deleted if an identical copy (verified by SHA-256 digest) exists in at least one read-only directory; otherwise the chunk is kept and a warning is logged.
Chunks that contain transactions more recent than the most recent snapshot are also never deleted, even if they exceed the
files_cleanup.max_committed_ledger_chunksvalue, to always allow DR from local files.Key changes:
SnapshotCleanupImpltoFilesCleanupImpl, handling both snapshot and ledger chunk cleanup in a single libuv timersrc/host/ledger_filenames.hfor reusecleanup_pendingguard to log an error (FAIL) is a previous cleanup is still running and not schedule a new one.