@taco-paco taco-paco commented Dec 1, 2025

Previously, Ledger::delete_slot_range was a very read/write-intensive operation: it traversed the whole DB to delete every item in the requested range.
This was likely a cause of response-time degradation and of stalled concurrent DB writes from TX processing.

This PR proposes an alternative approach:

  1. We identify the slot up to which we want to truncate the ledger.
  2. We use Rocksdb::delete_range only for those columns where the range is known up front, since this is a single DB write.
  3. Step 2 also lets us keep track of the oldest slot across restarts. We use Blockhash as a beacon column: it is used to fetch the lowest slot on restart. This way, when a new compaction starts, it also cleans up columns that were left uncleaned by a crash or by a shutdown with compaction in progress.
  4. We start a manual compaction for the columns and rely on CompactionFilter to do the maintenance for us, e.g. delete slots earlier than oldest_slot.
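
As a rough illustration of the four steps, here is a toy model in Rust. Everything in it is a hypothetical stand-in: ToyColumn, truncate, recover_oldest_slot, and the BTreeMap-backed storage are assumptions made for the sketch, not the RocksDB or magicblock-ledger API.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for one slot-keyed column (hypothetical, not the RocksDB API).
struct ToyColumn {
    entries: BTreeMap<u64, &'static str>,
}

impl ToyColumn {
    fn with_slots(slots: std::ops::RangeInclusive<u64>) -> Self {
        Self { entries: slots.map(|slot| (slot, "data")).collect() }
    }

    /// Models a range delete over the half-open range [from, to).
    fn delete_range(&mut self, from: u64, to: u64) {
        self.entries.retain(|slot, _| *slot < from || *slot >= to);
    }

    /// Models a compaction pass whose filter drops slots below `oldest_slot`.
    fn compact(&mut self, oldest_slot: u64) {
        self.entries.retain(|slot, _| *slot >= oldest_slot);
    }

    fn lowest_slot(&self) -> Option<u64> {
        self.entries.keys().next().copied()
    }
}

/// Steps 1, 2 and 4: range-delete the beacon column, then let "compaction"
/// clean the other column. Returns the new oldest slot.
fn truncate(beacon: &mut ToyColumn, other: &mut ToyColumn, truncate_to_slot: u64) -> u64 {
    let oldest_slot = truncate_to_slot.saturating_add(1);
    beacon.delete_range(0, oldest_slot);
    other.compact(oldest_slot);
    oldest_slot
}

/// Step 3: after a restart, the beacon column's lowest slot tells us where
/// cleanup had got to, even if compaction of other columns never finished.
fn recover_oldest_slot(beacon: &ToyColumn) -> u64 {
    beacon.lowest_slot().unwrap_or(0)
}

fn main() {
    let mut blockhash = ToyColumn::with_slots(0..=9);
    let mut other = ToyColumn::with_slots(0..=9);
    let oldest_slot = truncate(&mut blockhash, &mut other, 4);
    println!("oldest_slot = {oldest_slot}");
    println!("recovered after restart = {}", recover_oldest_slot(&blockhash));
}
```

The point of the model is that the beacon column is cleaned with one cheap range delete, so its lowest key survives a crash and can seed the next compaction run.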

Summary by CodeRabbit

  • New Features

    • Added performance metrics for ledger truncation (compaction and deletion timing).
    • Added range-based deletion across ledger columns for bulk removals.
  • Improvements

    • More efficient truncation via multi-column tombstone deletions and improved compaction flow.
    • Expanded public API to support generic, column-oriented range deletions and better lifecycle initialization.
  • Chores

    • Added a workspace dependency for metrics integration.

coderabbitai bot commented Dec 1, 2025

Walkthrough

This pull request adds range-based deletion to the ledger: LedgerColumn::delete_range and Rocks::delete_range_cf, updates batched deletion to accept from/to ranges, and replaces the legacy delete_slot_range API with a generic delete_range_cf via a new HasColumn<C> trait and macro implementations. Truncation logic in truncate_slot_range and truncate_fat_ledger is refactored to perform tombstone-based multi-column range deletes (via a new delete_slots helper), followed by flush and compaction. New Prometheus histograms and timer helpers for truncator delete and compaction are added and the magicblock-metrics workspace dependency is introduced.

Suggested reviewers

  • thlorenz
  • GabrielePicco
  • bmuddha

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4ec184 and 1618b75.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • test-integration/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • magicblock-metrics/src/metrics/mod.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 596
File: magicblock-processor/src/scheduler.rs:1-1
Timestamp: 2025-10-28T13:15:42.706Z
Learning: In magicblock-processor, transaction indexes were always set to 0 even before the changes in PR #596. The proper transaction indexing within slots will be addressed during the planned ledger rewrite.
🔇 Additional comments (3)
magicblock-metrics/src/metrics/mod.rs (3)

102-119: Histogram definitions for truncator compaction/delete look consistent and correct

Names, descriptions, and bucket choices for LEDGER_TRUNCATOR_COMPACTION_SECONDS and LEDGER_TRUNCATOR_DELETE_SECONDS align with how other duration histograms are defined here (seconds-based, coarse buckets for long-running ops). No issues from a metrics-design or correctness standpoint.


378-379: New truncator histograms are properly registered

Both LEDGER_TRUNCATOR_COMPACTION_SECONDS and LEDGER_TRUNCATOR_DELETE_SECONDS are registered alongside the other collectors, so they’ll be exposed exactly once via the shared registry. Looks good.


478-485: Helper APIs mirror existing patterns and are type‑sound

start_ledger_truncator_compaction_timer and observe_ledger_truncator_delete follow the same patterns as other helpers in this module (observe_columns_count_duration, timer-based APIs). Generic signature on observe_ledger_truncator_delete is correct and keeps call sites clean.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
magicblock-ledger/src/ledger_truncator.rs (3)

132-143: Potential underflow when num_slots_to_truncate is zero.

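If slot_size > excess, then num_slots_to_truncate will be 0 and truncate_to_slot becomes lowest_slot + 0 - 1, i.e. lowest_slot - 1, which lies below the lowest slot in the ledger. When lowest_slot is 0, the subtraction underflows u64: it panics in builds with overflow checks enabled and wraps to u64::MAX otherwise. Either way, this would cause incorrect truncation behavior.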

Add a guard before the subtraction:

         let num_slots_to_truncate = excess / slot_size;
+        if num_slots_to_truncate == 0 {
+            info!("Fat truncation: excess size too small for even one slot, skipping");
+            return Ok(());
+        }

         // Calculating up to which slot we're truncating
         let truncate_to_slot = lowest_slot + num_slots_to_truncate - 1;
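
An alternative to the explicit guard is to compute the boundary with checked arithmetic. The sketch below is illustrative only: compute_truncate_to_slot is a hypothetical helper, and the parameter names simply mirror the snippet above.

```rust
/// Returns the last slot to truncate, or None when nothing should be truncated.
/// Hypothetical helper, not part of the PR.
fn compute_truncate_to_slot(lowest_slot: u64, excess: u64, slot_size: u64) -> Option<u64> {
    // checked_div yields None for slot_size == 0 instead of panicking.
    let num_slots_to_truncate = excess.checked_div(slot_size)?;
    if num_slots_to_truncate == 0 {
        // Excess is smaller than a single slot: skip truncation.
        return None;
    }
    // num_slots_to_truncate >= 1 here, so the subtraction cannot underflow.
    lowest_slot.checked_add(num_slots_to_truncate - 1)
}

fn main() {
    println!("{:?}", compute_truncate_to_slot(100, 250, 100));
    println!("{:?}", compute_truncate_to_slot(0, 5, 8));
}
```

Returning Option lets the caller treat "nothing to truncate" as a normal outcome rather than an error.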

127-130: Replace informal debug message with descriptive warning.

"Nani3?" provides no context for operators reviewing logs. Consider a descriptive message.

         if lowest_slot == highest_slot {
-            warn!("Nani3?");
+            warn!("Fat truncation skipped: only one slot in ledger (slot {})", lowest_slot);
             return Ok(());
         }

316-319: Replace informal debug message with descriptive warning.

         if to_slot < from_slot {
-            warn!("LedgerTruncator: Nani2?");
+            warn!("LedgerTruncator: skipping compaction, invalid range to_slot ({}) < from_slot ({})", to_slot, from_slot);
             return;
         }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd324b5 and c4ec184.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • test-integration/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • magicblock-ledger/Cargo.toml (1 hunks)
  • magicblock-ledger/src/database/ledger_column.rs (1 hunks)
  • magicblock-ledger/src/database/rocks_db.rs (1 hunks)
  • magicblock-ledger/src/ledger_truncator.rs (9 hunks)
  • magicblock-ledger/src/store/api.rs (3 hunks)
  • magicblock-metrics/src/metrics/mod.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 596
File: magicblock-processor/src/scheduler.rs:1-1
Timestamp: 2025-10-28T13:15:42.706Z
Learning: In magicblock-processor, transaction indexes were always set to 0 even before the changes in PR #596. The proper transaction indexing within slots will be addressed during the planned ledger rewrite.
📚 Learning: 2025-11-24T14:21:00.996Z
Learnt from: Dodecahedr0x
Repo: magicblock-labs/magicblock-validator PR: 639
File: Cargo.toml:58-58
Timestamp: 2025-11-24T14:21:00.996Z
Learning: In the magicblock-validator codebase, magicblock-api/Cargo.toml intentionally uses borsh = "1.5.3" (instead of the workspace version 0.10.4) because it needs to deserialize types from the magic-domain-program external dependency, which requires borsh 1.5.x compatibility. This is an intentional exception for interoperability with the magic domain program.

Applied to files:

  • magicblock-ledger/Cargo.toml
📚 Learning: 2025-10-21T14:00:54.642Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579

Applied to files:

  • magicblock-ledger/Cargo.toml
📚 Learning: 2025-10-28T13:15:42.706Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 596
File: magicblock-processor/src/scheduler.rs:1-1
Timestamp: 2025-10-28T13:15:42.706Z
Learning: In magicblock-processor, transaction indexes were always set to 0 even before the changes in PR #596. The proper transaction indexing within slots will be addressed during the planned ledger rewrite.

Applied to files:

  • magicblock-ledger/src/ledger_truncator.rs
  • magicblock-ledger/src/store/api.rs
📚 Learning: 2025-11-21T10:22:07.520Z
Learnt from: taco-paco
Repo: magicblock-labs/magicblock-validator PR: 661
File: magicblock-committor-service/src/intent_executor/single_stage_executor.rs:20-28
Timestamp: 2025-11-21T10:22:07.520Z
Learning: In magicblock-committor-service's SingleStageExecutor and TwoStageExecutor (single_stage_executor.rs and two_stage_executor.rs), the fields transaction_strategy, junk, and patched_errors are intentionally public because these executors are designed to be used independently outside of the IntentExecutor scope, and callers need access to these execution reports for cleanup and error handling.

Applied to files:

  • magicblock-ledger/src/ledger_truncator.rs
📚 Learning: 2025-11-07T13:20:13.793Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 589
File: magicblock-processor/src/scheduler/coordinator.rs:227-238
Timestamp: 2025-11-07T13:20:13.793Z
Learning: In magicblock-processor's ExecutionCoordinator (scheduler/coordinator.rs), the `account_contention` HashMap intentionally does not call `shrink_to_fit()`. Maintaining slack capacity is beneficial for performance by avoiding frequent reallocations during high transaction throughput. As long as empty entries are removed from the map (which `clear_account_contention` does), the capacity overhead is acceptable.

Applied to files:

  • magicblock-ledger/src/ledger_truncator.rs
🧬 Code graph analysis (3)
magicblock-ledger/src/database/rocks_db.rs (3)
magicblock-ledger/src/store/api.rs (1)
  • delete_range_cf (1220-1230)
magicblock-ledger/src/database/db.rs (1)
  • delete_range_cf (130-147)
magicblock-ledger/src/database/write_batch.rs (1)
  • delete_range_cf (57-65)
magicblock-ledger/src/database/ledger_column.rs (1)
magicblock-ledger/src/database/columns.rs (8)
  • key (125-125)
  • key (170-174)
  • key (234-243)
  • key (318-323)
  • key (393-398)
  • key (495-497)
  • key (566-571)
  • key (641-643)
magicblock-ledger/src/store/api.rs (4)
magicblock-ledger/src/ledger_truncator.rs (4)
  • ledger (173-173)
  • ledger (181-181)
  • ledger (189-189)
  • ledger (200-200)
magicblock-ledger/src/database/rocks_db.rs (1)
  • delete_range_cf (124-132)
magicblock-ledger/src/database/db.rs (1)
  • delete_range_cf (130-147)
magicblock-ledger/src/database/write_batch.rs (1)
  • delete_range_cf (57-65)
🔇 Additional comments (8)
magicblock-ledger/Cargo.toml (1)

23-23: LGTM!

The workspace dependency addition is appropriate for integrating the new ledger truncator metrics.

magicblock-ledger/src/database/rocks_db.rs (1)

124-132: LGTM!

Clean low-level wrapper for RocksDB's range delete. The method correctly delegates range semantics to the caller, consistent with the existing API pattern.

magicblock-ledger/src/database/ledger_column.rs (1)

257-264: LGTM!

The delete_range method follows the established patterns in LedgerColumn for key conversion and backend delegation.

magicblock-metrics/src/metrics/mod.rs (1)

478-484: LGTM!

The helper functions follow the established patterns for histogram timers and closure-based observation.

magicblock-ledger/src/ledger_truncator.rs (3)

162-227: LGTM - well-structured range deletion logic.

The implementation correctly:

  • Uses to_slot + 1 for RocksDB's exclusive end semantics
  • Decreases counters by the exact slot count for slot-indexed columns
  • Resets counters to DIRTY_COUNT for columns where element count is unknown
  • Uses composite key boundaries correctly for SlotSignatures

The macro for resetting entry counters is a clean pattern.
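
To make the exclusive-end point concrete, here is a small sketch. It assumes slot keys are encoded as 8 big-endian bytes so that byte order matches numeric order; that encoding, and the helpers slot_key, range_bounds, and in_range, are assumptions for illustration rather than the ledger's actual key layout.

```rust
/// Assumed key encoding: the slot as 8 big-endian bytes,
/// so lexicographic byte order matches numeric order.
fn slot_key(slot: u64) -> [u8; 8] {
    slot.to_be_bytes()
}

/// Bounds for a half-open delete [start, end) covering slots from_slot..=to_slot.
/// Returns None for an inverted range, or when to_slot is u64::MAX and no
/// exclusive end key fits in 8 bytes.
fn range_bounds(from_slot: u64, to_slot: u64) -> Option<([u8; 8], [u8; 8])> {
    if to_slot < from_slot {
        return None;
    }
    let end = to_slot.checked_add(1)?;
    Some((slot_key(from_slot), slot_key(end)))
}

/// True when `key` falls inside the half-open range described by `bounds`.
fn in_range(key: &[u8; 8], bounds: &([u8; 8], [u8; 8])) -> bool {
    *key >= bounds.0 && *key < bounds.1
}

fn main() {
    let bounds = range_bounds(3, 7).expect("valid range");
    println!("slot 7 deleted: {}", in_range(&slot_key(7), &bounds));
    println!("slot 8 deleted: {}", in_range(&slot_key(8), &bounds));
}
```

Passing to_slot itself as the end key would leave the last slot behind, which is why the end bound is to_slot + 1.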


329-344: LGTM!

Clean RAII pattern combining the existing Measure with the new HistogramTimer. Both record on drop for consistent timing measurement.
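
For readers unfamiliar with the pattern, a minimal record-on-drop guard might look like the sketch below. ToyHistogram and ScopedTimer are hypothetical stand-ins written with the standard library only; they are not the Measure or HistogramTimer types used in the PR.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a histogram: it just collects observed durations.
#[derive(Default)]
struct ToyHistogram {
    samples: RefCell<Vec<Duration>>,
}

/// Guard that records the elapsed time into the histogram when dropped.
struct ScopedTimer {
    start: Instant,
    histogram: Rc<ToyHistogram>,
}

impl ScopedTimer {
    fn start(histogram: Rc<ToyHistogram>) -> Self {
        Self { start: Instant::now(), histogram }
    }
}

impl Drop for ScopedTimer {
    fn drop(&mut self) {
        self.histogram.samples.borrow_mut().push(self.start.elapsed());
    }
}

/// Stand-in for a long-running operation such as a manual compaction.
fn compact(histogram: &Rc<ToyHistogram>) {
    // Bind to a named variable so the guard lives until the end of the scope.
    let _timer = ScopedTimer::start(Rc::clone(histogram));
    let _work: u64 = (0..1_000u64).sum();
}

fn main() {
    let histogram = Rc::new(ToyHistogram::default());
    compact(&histogram);
    println!("samples recorded: {}", histogram.samples.borrow().len());
}
```

Note that the guard must be bound to a name such as `_timer`; `let _ = ...` would drop it immediately and record a near-zero duration.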


275-304: LGTM!

The refactored flow correctly sequences: set cleanup boundary → insert tombstones → flush → compact. This aligns with the PR objective of using range deletes with CompactionFilter cleanup.

magicblock-ledger/src/store/api.rs (1)

174-179: Initialize lowest_cleanup_slot on open – behavior change to be aware of

Calling initialize_lowest_cleanup_slot() during do_open means that, for a populated ledger, lowest_cleanup_slot (and thus the oldest_slot beacon used by the compaction filter) is now derived from the current lowest Blockhash slot instead of always starting at 0. That seems aligned with the goal of resuming cleanup across restarts; just make sure there are no callers/tests that implicitly relied on get_lowest_cleanup_slot() returning 0 immediately after Ledger::open on a non-empty store.

@bmuddha bmuddha left a comment

LGTM with a few minor nits

LEDGER_COLUMNS_COUNT_DURATION_SECONDS.observe_closure_duration(f)
}

pub fn start_ledger_truncator_compaction_timer() -> HistogramTimer {
Contributor

nit: a cleaner approach is to use scoped timer guards; closure-based measuring makes the invocation site quite cumbersome to read.

Contributor Author

We still use a scoped guard at the call site:

        let _measure = CompactionMeasure {
            measure: Measure::start("Manual compaction"),
            _histogram_timer: start_ledger_truncator_compaction_timer(),
        };

Contributor

it's just more code imo

}
}

pub trait HasColumn<C>
Contributor

nit: Better name would be TruncateableColumn

Contributor Author

Disagree: this isn't only for columns we delete from. It is implemented for every column we have and just returns the corresponding &LedgerColumn.

In some cases we want to map the passed generic C to its column, and without this trait there's no good way to do that. The only alternative would be matching C::Name against an expected set of names, but that would make the function return Option<LedgerColumn>, which is ugly imo.
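
A minimal sketch of that idea, assuming a trait parameterized by the column type and implemented per column via a macro. The marker types, field names, impl_has_column macro, and column_name helper below are hypothetical and only illustrate the shape; they are not the code in this PR.

```rust
use std::marker::PhantomData;

// Hypothetical column marker types.
struct Blockhash;
struct SlotSignatures;

/// Hypothetical typed handle to one column.
struct LedgerColumn<C> {
    name: &'static str,
    _marker: PhantomData<C>,
}

/// Maps a column type C to the handle that stores it.
trait HasColumn<C> {
    fn column(&self) -> &LedgerColumn<C>;
}

struct Ledger {
    blockhash_cf: LedgerColumn<Blockhash>,
    slot_signatures_cf: LedgerColumn<SlotSignatures>,
}

// One impl per column, generated by a macro to avoid repetition.
macro_rules! impl_has_column {
    ($column:ty, $field:ident) => {
        impl HasColumn<$column> for Ledger {
            fn column(&self) -> &LedgerColumn<$column> {
                &self.$field
            }
        }
    };
}

impl_has_column!(Blockhash, blockhash_cf);
impl_has_column!(SlotSignatures, slot_signatures_cf);

/// Generic code resolves the column at compile time, so no Option is needed.
fn column_name<C>(ledger: &Ledger) -> &'static str
where
    Ledger: HasColumn<C>,
{
    <Ledger as HasColumn<C>>::column(ledger).name
}

fn new_ledger() -> Ledger {
    Ledger {
        blockhash_cf: LedgerColumn { name: "blockhash", _marker: PhantomData },
        slot_signatures_cf: LedgerColumn { name: "slot_signatures", _marker: PhantomData },
    }
}

fn main() {
    let ledger = new_ledger();
    println!("{}", column_name::<Blockhash>(&ledger));
    println!("{}", column_name::<SlotSignatures>(&ledger));
}
```

Because the lookup is a trait bound, asking for a column the ledger does not have is a compile error rather than a runtime None.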

Contributor

I'm not against using the trait, just pointing out that the name is confusing

@taco-paco taco-paco merged commit 2be43c5 into master Dec 2, 2025
18 checks passed
@taco-paco taco-paco deleted the fix/ledger-truncator/heavy-deletes branch December 2, 2025 08:47
Dodecahedr0x pushed a commit that referenced this pull request Dec 2, 2025