Skip to content

Add files_cleanup.max_committed_ledger_chunks config option#7788

Merged
achamayou merged 36 commits intomainfrom
cleanup_ledger_files
Apr 1, 2026
Merged

Add files_cleanup.max_committed_ledger_chunks config option#7788
achamayou merged 36 commits intomainfrom
cleanup_ledger_files

Conversation

@achamayou
Copy link
Copy Markdown
Member

@achamayou achamayou commented Mar 27, 2026

Add optional periodic cleanup of old committed ledger chunk files from the main ledger directory, controlled by a new files_cleanup.max_committed_ledger_chunks configuration option.

This is only enabled when read-only ledger directories are configured. A chunk is deleted if an identical copy (verified by SHA-256 digest) exists in at least one read-only directory; otherwise the chunk is kept and a warning is logged.

Chunks that contain transactions more recent than the most recent snapshot are also never deleted, even if they exceed the files_cleanup.max_committed_ledger_chunks value, to always allow DR from local files.

Key changes:

  • Rename SnapshotCleanupImpl to FilesCleanupImpl, handling both snapshot and ledger chunk cleanup in a single libuv timer
  • Extract ledger filename helpers into src/host/ledger_filenames.h for reuse
  • Add config, serialization, schema, and test infrastructure plumbing
  • Add three e2e tests: basic cleanup, read-only dir verification, and digest mismatch handling
  • Add cleanup_pending guard to log an error (FAIL) is a previous cleanup is still running and not schedule a new one.

Add periodic cleanup of old committed ledger chunk files from the main
ledger directory, controlled by a new files_cleanup.max_committed_ledger_chunks
configuration option.

When read-only ledger directories are configured, a chunk is only deleted
if an identical copy (verified by SHA-256 digest) exists in at least one
read-only directory; otherwise the chunk is kept and a warning is logged.

Implementation:
- Rename SnapshotCleanupImpl to FilesCleanupImpl, handling both snapshot
  and ledger chunk cleanup in a single UV timer
- Extract ledger filename helpers into src/host/ledger_filenames.h for reuse
- Add config, serialization, schema, and test infrastructure plumbing
- Add three e2e tests: basic cleanup, read-only dir verification, and
  digest mismatch handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@achamayou achamayou requested a review from a team as a code owner March 27, 2026 17:36
Copilot AI review requested due to automatic review settings March 27, 2026 17:36
Replace files::slurp + Sha256Hash(bytes) with streaming reads through
ISha256Hash (make_incremental_sha256). Files are read in 64KB chunks,
avoiding loading entire ledger chunks into memory.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new host configuration option to periodically prune old committed ledger chunk files from the main ledger directory, guarded by digest verification against configured read-only ledger directories. This extends the existing snapshot cleanup into a unified “files cleanup” timer.

Changes:

  • Introduces files_cleanup.max_committed_ledger_chunks (config + schema + docs + changelog) and wires it through host startup/config parsing.
  • Replaces SnapshotCleanupImpl with FilesCleanupImpl, performing snapshot + ledger-chunk cleanup on a single periodic timer.
  • Adds new E2E coverage for committed-ledger-chunk cleanup, including read-only-dir verification and digest mismatch behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/infra/network.py Forwards new cleanup config arg into node startup.
tests/e2e_operations.py Adds E2E scenarios for committed chunk pruning and digest verification.
tests/config.jinja Renders new files_cleanup.max_committed_ledger_chunks into generated host config.
src/host/run.cpp Instantiates new unified FilesCleanupTimer when either retention option is set.
src/host/snapshot_cleanup_timer.h Removed; functionality subsumed by files_cleanup_timer.h.
src/host/files_cleanup_timer.h Implements snapshot + committed ledger chunk cleanup, including digest checks.
src/host/ledger_filenames.h Extracts ledger filename parsing helpers for reuse.
src/host/ledger.h Switches to extracted ledger filename helpers.
src/common/configuration.h Adds JSON plumbing for new optional config field.
include/ccf/node/startup_config.h Adds new startup config field for committed ledger chunk retention.
doc/operations/ledger_snapshot.rst Documents new retention option and read-only-dir requirement.
doc/host_config_schema/cchost_config.json Extends schema with max_committed_ledger_chunks and updates interval description.
CHANGELOG.md Notes the new configuration option and behavior.

Comment thread src/host/files_cleanup_timer.h
Comment thread src/host/files_cleanup_timer.h
Comment thread src/host/files_cleanup_timer.h
Comment thread src/host/files_cleanup_timer.h
Comment thread src/host/ledger_filenames.h
Comment thread doc/host_config_schema/cchost_config.json Outdated
Comment thread src/host/files_cleanup_timer.h
achamayou and others added 2 commits March 27, 2026 18:04
- Add missing standard includes to files_cleanup_timer.h: <algorithm>,
  <optional>, <utility>
- Add missing <string_view> include to ledger_filenames.h
- Raise 'keeping ledger chunk' log from INFO to FAIL level to match
  the documented 'warning is logged' behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@achamayou achamayou added this to the 7.0.0-rc1 milestone Mar 30, 2026
achamayou and others added 4 commits March 30, 2026 13:00
- Use size_t{64} to avoid bugprone-implicit-widening-of-multiplication-result
- Remove else-after-return for readability-else-after-return

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread src/host/files_cleanup_timer.h
Comment thread src/host/files_cleanup_timer.h Outdated
Comment thread src/host/files_cleanup_timer.h
achamayou and others added 9 commits March 30, 2026 15:43
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Change std::stol to std::stoull in ledger_filenames.h for size_t correctness
- Fix stale variable in wait_for_cleanup timeout error message
- Add try/except around test read-only dir cleanup for robustness
- Extract static cleanup helpers from FilesCleanupImpl into
  asynchost::files_cleanup namespace for testability
- Add files_cleanup_test with 32 test cases covering:
  - find_committed_ledger_chunks (empty dir, sorting, skip filters, nonexistent)
  - hash_file (normal, different content, empty file, nonexistent)
  - file_exists_with_matching_digest (match, mismatch, no copy, deleted,
    multi-dir, empty dirs list)
  - cleanup_old_ledger_chunks (empty dir, delete backed up, keep unbacked,
    max_retained=0, within limit, digest mismatch)
  - FilesCleanupImpl constructor validation
  - ledger_filenames.h helper functions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add std::atomic<bool> cleanup_in_progress to FilesCleanupImpl that is
set via compare_exchange_strong before scheduling cleanup work and
reset in the completion callback. If the previous task is still running
when the timer fires, the new task is skipped with a FAIL log message.

Also adds a DEBUG log at the start of each cleanup task for
observability (the completion log already existed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the existing ledger chunk and snapshot cleanup notes with
cross-references to a new 'Periodic File Cleanup' section. The new
section explains that both cleanup types share a single timer cycle,
and that only one cleanup task runs at a time (overlapping cycles are
skipped with a warning).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ledger chunk cleanup tests intentionally trigger scenarios where
chunks are kept (no matching backup, or digest mismatch). These produce
expected [fail] log lines that the test infra treats as fatal errors.

Register ignore_error_pattern_on_shutdown for 'Keeping ledger chunk'
and 'digest does not match' so the tests pass cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread doc/operations/ledger_snapshot.rst Outdated
achamayou and others added 3 commits March 31, 2026 11:45
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Never delete a committed ledger chunk whose end sequence number is at
or beyond the newest committed snapshot. This ensures a complete ledger
history exists from that snapshot onwards for disaster recovery.

The committed snapshot list is gathered once per cleanup cycle and
shared between snapshot cleanup and the watermark calculation, avoiding
redundant directory scans.

Also replaces non-ASCII em-dashes with hyphens in comments and docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Comment thread src/host/files_cleanup_timer.h
Comment thread src/host/files_cleanup_timer.h
Comment thread doc/operations/ledger_snapshot.rst Outdated
Comment thread src/host/test/files_cleanup_test.cpp
Comment thread src/host/files_cleanup_timer.h
- Fix use-after-free risk in cleanup_in_progress: change from raw
  atomic<bool> member to shared_ptr<atomic<bool>> so the flag remains
  valid even if FilesCleanupTimer is destroyed while a cleanup task is
  still running on the libuv thread pool.

- Use non-throwing filesystem overloads: replace fs::exists() and
  fs::is_regular_file() with std::error_code overloads in
  file_exists_with_matching_digest to prevent exceptions from
  permission issues or broken mounts escaping into the cleanup worker.

- Fix doc wording: change 'a warning is logged' to 'a failure-level
  log message is emitted' to match the actual LOG_FAIL_FMT severity.

- Use unique temp directories in tests: replace fixed temp dir names
  with mkdtemp-based make_unique_test_dir() helper to prevent
  cross-test interference during parallel execution.

- Propagate snapshot listing errors: change find_committed_snapshots
  to return std::optional to distinguish 'no snapshots' from 'listing
  failed'. When listing fails, skip all file cleanup to avoid deleting
  ledger chunks without a valid snapshot watermark.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Comment thread src/host/ledger_filenames.h
Comment thread src/host/files_cleanup_timer.h Outdated
achamayou and others added 2 commits March 31, 2026 13:11
In file_exists_with_matching_digest, handle each error_code result
separately rather than collapsing all errors into 'file gone'. If
fs::exists or fs::is_regular_file sets an error_code (e.g. permission
denied, broken mount), log it as a failure and return false to keep
the chunk safe. Only return true (safe to delete) when the file is
genuinely absent or not a regular file with no error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread doc/operations/ledger_snapshot.rst Outdated
Comment thread src/host/files_cleanup_timer.h
Comment thread doc/operations/ledger_snapshot.rst Outdated
Comment thread src/host/files_cleanup_timer.h
Comment thread tests/e2e_operations.py
achamayou and others added 4 commits March 31, 2026 16:53
Add two new e2e tests:

1. test_post_snapshot_chunks_retained: Verifies that committed ledger
   chunks created after the latest snapshot are NOT deleted even when
   the total count exceeds max_committed_ledger_chunks. Sets
   max_retained=1 and generates 4+ post-snapshot chunks, then asserts
   all post-watermark chunks survive cleanup.

2. run_ledger_cleanup_no_read_only_dir_check: Verifies that a node
   fails to start with a clear error message when
   max_committed_ledger_chunks is configured but no read-only ledger
   directory is provided.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run_ledger_cleanup_no_read_only_dir_check: use start-stop-restart pattern
  (start healthy, stop, patch config, restart) since the infra setup
  cannot start a network with invalid config. Check stderr (not stdout)
  for the std::logic_error message. Use ignore_errors_on_shutdown() and
  skip_verify_chunking since the node intentionally crashes.

- Minor whitespace fix from auto-formatter in assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace start-stop-restart pattern with a direct network.start()
that fails immediately. The node binary starts, reads the config
with max_committed_ledger_chunks but no read-only dirs, throws
std::logic_error, and exits. Much simpler and tests the actual
first-boot path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Comment thread src/host/files_cleanup_timer.h Outdated
Comment thread src/host/files_cleanup_timer.h
achamayou and others added 2 commits March 31, 2026 20:13
…on handling

Replace bool return from file_exists_with_matching_digest() with a
DigestCheckResult enum (match_found, no_match, file_gone). This
addresses two review comments:

- The function no longer returns true when the local file is
  concurrently deleted, which was semantically confusing.
- The caller now skips already-deleted files without attempting
  std::filesystem::remove(), avoiding spurious FAIL logs.
- As a TOCTOU safety net, errc::no_such_file_or_directory from
  remove() is also treated as success (logged at INFO).
- Rename function to check_digest_against_read_only_dirs() to
  better reflect the tri-state semantics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use std::uint8_t as the underlying type for the 3-value enum.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@achamayou achamayou removed the run-long-test Run Long Test job label Apr 1, 2026
achamayou and others added 4 commits April 1, 2026 07:58
…eneration

The test captured the snapshot seqno before generating post-snapshot chunks,
but new snapshots could be auto-triggered during that phase. The C++ cleanup
uses the latest snapshot as its watermark, so the test must re-query after
chunk generation to compare against the same value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The new ledger cleanup sub-tests add significant runtime to schema_test,
causing it to exceed the default 360s CTest timeout on VMSS Virtual B/C
CI runners. Set an explicit 900s (15 min) timeout for this test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@achamayou achamayou merged commit 5cfe7ff into main Apr 1, 2026
19 checks passed
@achamayou achamayou deleted the cleanup_ledger_files branch April 1, 2026 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants