-
Notifications
You must be signed in to change notification settings - Fork 25
Fix: abort truncation if validator exited #689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughThis PR updates the console-subscriber dependency from 0.2.0 to 0.5.0 in workspace dependencies. It adds graceful shutdown handling for the ledger truncator in magicblock-api by awaiting and logging errors from the truncator's join result. The ledger truncator is enhanced with cancellation support through a new CancellationToken parameter on truncate_slot_range and compact_slot_range functions, enabling early exit during long-running operations. A new compact_cf_or_return macro simplifies cancellation-aware compaction logic, and a CompactionMeasure wrapper is introduced to log compaction durations. Possibly related PRs
Suggested reviewers
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Manual Deploy AvailableYou can trigger a manual deploy of this PR branch to testnet: Alternative: Comment
Comment updated automatically when the PR is synchronized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
magicblock-api/src/magic_validator.rs (1)
744-766: Based on my thorough investigation, I can now provide the final rewritten review comment. The analysis confirms the original concern is valid and identifies a real race condition.Move
ledger_truncator.join()beforeledger.shutdown()to ensure the truncator finishes all compaction operations before database shutdown.The
ledger_truncator.stop()call (line 744) only signals cancellation via a token; it does not wait for the worker thread to complete. If the truncator is mid-compaction whenledger.shutdown(true)(line 760) is called, the blocking RocksDBcompact_range()operation cannot be interrupted. However,shutdown(true)callscancel_all_background_work(true), which may interfere with the in-flight compaction, potentially causing data corruption or incomplete operations.The cancellation token is only checked between individual column compactions (via the
compact_cf_or_return!macro), not within eachcompact_range()call. By movingledger_truncator.join()to immediately afterledger_truncator.stop()and beforeledger.shutdown(true), you guarantee the truncator thread has fully exited and all compaction is complete before the database begins its shutdown sequence.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
Cargo.toml(1 hunks)magicblock-api/src/magic_validator.rs(1 hunks)magicblock-ledger/src/ledger_truncator.rs(7 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-11-07T14:20:31.457Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.
Applied to files:
magicblock-api/src/magic_validator.rs
📚 Learning: 2025-10-21T14:00:54.642Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
Applied to files:
magicblock-api/src/magic_validator.rs
📚 Learning: 2025-11-19T12:55:48.931Z
Learnt from: Dodecahedr0x
Repo: magicblock-labs/magicblock-validator PR: 639
File: test-integration/test-tools/src/validator.rs:193-197
Timestamp: 2025-11-19T12:55:48.931Z
Learning: In the magicblock-validator codebase, when constructing arguments for the light CLI test validator in start_light_validator_with_config (test-integration/test-tools/src/validator.rs), the shlex::split approach is necessary because the light CLI does not handle whitespaces properly in validator arguments. The string concatenation + shlex parsing pattern should not be refactored to direct argument construction in this specific case.
Applied to files:
magicblock-api/src/magic_validator.rs
📚 Learning: 2025-11-24T14:21:00.996Z
Learnt from: Dodecahedr0x
Repo: magicblock-labs/magicblock-validator PR: 639
File: Cargo.toml:58-58
Timestamp: 2025-11-24T14:21:00.996Z
Learning: In the magicblock-validator codebase, magicblock-api/Cargo.toml intentionally uses borsh = "1.5.3" (instead of the workspace version 0.10.4) because it needs to deserialize types from the magic-domain-program external dependency, which requires borsh 1.5.x compatibility. This is an intentional exception for interoperability with the magic domain program.
Applied to files:
Cargo.toml
🧬 Code graph analysis (1)
magicblock-ledger/src/ledger_truncator.rs (3)
magicblock-ledger/src/database/ledger_column.rs (1)
Self(294-294)magicblock-ledger/src/store/api.rs (1)
compact_slot_range_cf(1327-1337)magicblock-ledger/src/database/columns.rs (1)
columns(105-117)
🔇 Additional comments (8)
magicblock-ledger/src/ledger_truncator.rs (7)
1-48: LGTM!The addition of
ControlFlowimport andCancellationTokento the worker struct properly supports the cancellation-aware truncation flow.
50-102: LGTM!The
tokio::select!correctly handles cancellation, and the token is properly propagated totruncate_slot_range. The async cancellation and synchronous cancellation check cooperation is well-designed.
104-152: LGTM!The cancellation token is correctly propagated to
compact_slot_range. The early-exit check withincompact_slot_range(line 270-273) will handle cancellation before starting the expensive compaction work.
279-291: LGTM!The
CompactionMeasurewrapper withDropimplementation is a clean pattern that ensures compaction duration is logged even on early return due to cancellation.
434-456: LGTM!The
compact_cf_or_returnmacro cleanly encapsulates the cancellation-aware compaction pattern. The two variants handle both slot-range and full-compaction cases correctly, and the post-compaction cancellation check with informative logging aids debugging during shutdown.
409-431: LGTM!The
stop()andjoin()methods implement a clean state machine that supports both immediate cancellation and graceful awaiting of the worker thread. This properly addresses the PR objective of allowing shutdown during long-running truncation.
293-345: The review comment's concern about AccountModDatas is unfounded—it is intentionally excluded from compaction.The column count claim is also inaccurate.
columns()defines exactly 9 column families, not 11. More importantly,AccountModDatasis intentionally not compacted during slot truncation because itsColumntrait implementation explicitly returnstruefromkeep_all_on_compaction()(line 663 indatabase/columns.rs). The accompanying comment states: "We don't clean AccountModData on compaction as it isn't slot based." This is by design—sinceAccountModDatasis indexed byu64rather thanSlot, slot-range truncation does not apply to it.Likely an incorrect or invalid review comment.
magicblock-api/src/magic_validator.rs (1)
764-766: LGTM!The addition of
ledger_truncator.join()with error logging ensures the truncator thread is properly awaited during shutdown, addressing the PR objective where the thread was previously being aborted by the OS without proper awaiting. The error handling provides visibility if the truncator encounters issues during shutdown.
| chrono = "0.4" | ||
| clap = "4.5.40" | ||
| console-subscriber = "0.2.0" | ||
| console-subscriber = "0.5.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the tokio version and other relevant dependencies in Cargo.toml
head -n 100 Cargo.toml | grep -E "(tokio|console-subscriber)"Repository: magicblock-labs/magicblock-validator
Length of output: 108
🌐 Web query:
console-subscriber 0.5.0 changelog breaking changes
💡 Result:
Breaking changes in console-subscriber 0.5.0
- Upgraded tonic to 0.13 — breaks compatibility with tonic 0.12.x (affects users of console-api and console-subscriber). [1]
- Upgraded tonic to 0.14 — breaks compatibility with tonic 0.13.x (affects users of console-api and console-subscriber). [2]
Sources: [1] docs.rs CHANGELOG for console-subscriber 0.5.0; [2] tokio-rs/console GitHub release notes for console-subscriber v0.5.0.
🌐 Web query:
console-subscriber 0.2.0 vs 0.5.0 migration guide
💡 Result:
Short migration checklist: console-subscriber 0.2.0 -> 0.5.0
-
Read the 0.5.0 changelog (breaking changes). The release upgrades tonic (breaking for consumers of console-api/console-subscriber) and continues the series of tonic/prost bumps introduced in earlier releases. [1]
-
Update Cargo.toml
- Bump console-subscriber = "0.5" (or exact 0.5.0).
- Make your tonic/prost versions compatible with console-subscriber’s public dependencies (console-subscriber 0.5.x requires newer tonic; older tonic 0.12.x/0.11.x/0.9.x are incompatible). Align your workspace/crate tonic to the same major/minor that console-subscriber exposes. [1][2]
- Resolve dependency conflicts
- If you also depend on tonic (or console-api) directly, update those crates to the matching tonic version (the changelog notes upgrades to tonic 0.10 / 0.12 / 0.13 / 0.14 across recent releases). Use cargo update -p tonic and cargo tree to find version mismatches; fix them by pinning compatible versions or using cargo patch if necessary. [1][2]
- Prost/prost-types and MSRV
- Prior releases moved prost/prost-types and raised MSRV (see 0.4.0 notes). If you use prost types or support older Rust toolchains, ensure your toolchain meets the minimum required Rust version (MSRV raised in 0.4.x) and update prost/prost-types to compatible versions. [1]
- API / runtime checks
- Typical init/Builder APIs (ConsoleLayer::builder(), init(), spawn(), etc.) remain available; verify your usage still compiles (see docs). If you integrate console-api protos into your own tonic servers/clients, re-generate protobuf code if necessary to match the console-api proto changes. [2][3]
- Test
- Rebuild, run unit/integration tests, and run the tokio-console consumer to confirm telemetry is reported and the gRPC wire compatibility works between producer and consumer.
Quick references
- console-subscriber CHANGELOG (0.5.0 breaking notes). [1]
- console-subscriber docs / API (Builder/init examples). [2]
- repo/PR notes (dependabot bump to 0.5.0). [3]
If you want, I can:
- produce the exact Cargo.toml diffs for your project (paste your Cargo.toml),
- or run cargo-tree style guidance to resolve specific version conflicts. Which would you prefer?
Sources
[1] console-subscriber CHANGELOG (0.5.0 breaking notes).
[2] console-subscriber docs.rs (crate docs/Builder/init examples).
[3] dependabot/PR release notes bumping console-subscriber to 0.5.0.
🏁 Script executed:
#!/bin/bash
# Get the full Cargo.toml dependencies section
cat Cargo.tomlRepository: magicblock-labs/magicblock-validator
Length of output: 7895
🏁 Script executed:
#!/bin/bash
# Search for console-subscriber usage in the codebase
rg -n "console_subscriber|console-subscriber" --type rust -C 3Repository: magicblock-labs/magicblock-validator
Length of output: 592
🏁 Script executed:
#!/bin/bash
# Search for tonic usage
rg -n "tonic" --type rust -C 2Repository: magicblock-labs/magicblock-validator
Length of output: 1205
🏁 Script executed:
#!/bin/bash
# Search for tonic in individual crate Cargo.toml files
find . -name "Cargo.toml" -type f ! -path "./target/*" | xargs grep -l "^tonic\s*=" 2>/dev/null | head -20Repository: magicblock-labs/magicblock-validator
Length of output: 62
🌐 Web query:
console-subscriber 0.5.0 tonic version dependency
💡 Result:
console-subscriber v0.5.0 requires tonic >= 0.13 (the release upgraded the public tonic dependency to 0.13, breaking compatibility with tonic 0.12.x — later notes also reference upgrading to 0.14). [1][2]
Sources:
- console-subscriber 0.5.0 changelog (docs.rs). [1]
- tokio-rs/console release notes (GitHub). [2]
🌐 Web query:
tonic-build 0.9.2 compatible tonic version
💡 Result:
Use the matching 0.9.2 release — tonic = "0.9.2" is compatible with tonic-build 0.9.2. [1][2]
Sources:
[1] docs.rs entry for tonic 0.9.2.
[2] hyperium/tonic GitHub issue showing tonic and tonic-build both at v0.9.2.
Critical compatibility issue: console-subscriber 0.5.0 requires tonic >= 0.13, but tonic-build 0.9.2 (used in the build) is incompatible with that version.
The upgrade from 0.2.0 to 0.5.0 introduces a breaking dependency conflict. Console-subscriber 0.5.0 brings tonic >= 0.13 as a transitive dependency, while tonic-build 0.9.2 (in storage-proto/build.rs) expects tonic 0.9.2. These versions are incompatible and will cause build failures or runtime issues.
Required action: Either:
- Update
tonic-buildto a version compatible withtonic >= 0.13(e.g., 0.13.x or later), or - Revert
console-subscriberto a version with compatible tonic requirements (e.g., 0.2.x or earlier).
Consider adding an explicit tonic dependency to the workspace to pin the version and avoid hidden conflicts.
|
|
||
| ControlFlow::Continue(()) | ||
| }); | ||
|
|
||
| // Flush memtables with tombstones prior to compaction | ||
| if let Err(err) = ledger_copy.flush() { | ||
| error!("Failed to flush ledger: {err}"); | ||
| } | ||
| Self::compact_slot_range(ledger, from_slot, to_slot); | ||
| Self::compact_slot_range( | ||
| ledger, | ||
| from_slot, | ||
| to_slot, | ||
| cancellation_token, | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick | 🔵 Trivial
Consider adding cancellation check before flush and compaction.
When the deletion loop breaks early due to cancellation (via ControlFlow::Break), the code still proceeds to flush and call compact_slot_range. While compact_slot_range does check for cancellation at its start (line 270-273), the flush operation will still execute unnecessarily.
Apply this diff to skip post-deletion work when cancelled:
ControlFlow::Continue(())
});
+ if cancellation_token.is_cancelled() {
+ info!("Validator shutting down - skipping post-truncation flush and compaction");
+ return;
+ }
+
// Flush memtables with tombstones prior to compaction
if let Err(err) = ledger_copy.flush() {
error!("Failed to flush ledger: {err}");
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ControlFlow::Continue(()) | |
| }); | |
| // Flush memtables with tombstones prior to compaction | |
| if let Err(err) = ledger_copy.flush() { | |
| error!("Failed to flush ledger: {err}"); | |
| } | |
| Self::compact_slot_range(ledger, from_slot, to_slot); | |
| Self::compact_slot_range( | |
| ledger, | |
| from_slot, | |
| to_slot, | |
| cancellation_token, | |
| ); | |
| ControlFlow::Continue(()) | |
| }); | |
| if cancellation_token.is_cancelled() { | |
| info!("Validator shutting down - skipping post-truncation flush and compaction"); | |
| return; | |
| } | |
| // Flush memtables with tombstones prior to compaction | |
| if let Err(err) = ledger_copy.flush() { | |
| error!("Failed to flush ledger: {err}"); | |
| } | |
| Self::compact_slot_range( | |
| ledger, | |
| from_slot, | |
| to_slot, | |
| cancellation_token, | |
| ); |
🤖 Prompt for AI Agents
In magicblock-ledger/src/ledger_truncator.rs around lines 240 to 253, the
deletion loop may have been terminated early by cancellation but the code still
proceeds to flush memtables and call compact_slot_range; add a cancellation
check after the loop and before the flush/compaction so that if the
cancellation_token indicates cancellation you skip both the ledger_copy.flush()
and Self::compact_slot_range(...) calls (return or continue accordingly),
otherwise proceed with the existing flush and compaction logic.
bmuddha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if cancelled() check in the macro, only short circuits after the compaction operation finished, and considering that it might take a while, the truncator thread might still be blocking the shutdown sequence due to the introduction of thread joining. It's still better than waiting for the whole thing to finish though.
Not entirely correct. On |
bmuddha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* master: fix: don't unbork for same delegation slot as remote slot (#702) feat: use simplified configuration management (#685) Report Intent's patched errors (#667) fix: replace cargo-expand with syn-based verification (#611) fix: make db thread safe (#680) fix: reset program cache upon error (#696) chore: use smaller runners (#695) feat: cranked commits (#656) fix: refresh stuck undelegating accounts that are closed on-chain (#691) Fix: abort truncation if validator exited (#689)
In
LedgerTrunctationWorker::runthe branch of executionLedgerTrunctationWorker::truncatemay take a very long time. If in the meantime validator shutsdown this branch may hold it until it finishes execution.UPD: with latest changes the
LedgerTruncatorthread is just aborted by OS as we don't await on it.Summary by CodeRabbit
Release Notes
Dependencies
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.