Reject snapshot-less join against primary holding a more recent snapshot#7844
Open
Reject snapshot-less join against primary holding a more recent snapshot#7844
Conversation
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/d3383e6c-3063-4f30-ba6d-e324856cc41f Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Implement behavior for joining nodes to original ledger
Reject snapshot-less join against primary holding a more recent snapshot
Apr 28, 2026
cjen1-msft
reviewed
Apr 28, 2026
cjen1-msft
reviewed
Apr 28, 2026
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/ee92835a-b9f3-4451-aa71-bfc04a2d774e Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens the node join protocol to prevent snapshot-less joiners from being accepted by an “original” primary that has already committed snapshots, avoiding full-ledger replay by requiring a sufficiently recent startup snapshot on the initial join attempt.
Changes:
- Add
retry_countto the node-to-node join request and track it on the joiner to distinguish first-attempt joins from retries after snapshot fetch. - Update primary-side join acceptance logic to compare the joiner’s
startup_seqnoagainst the latest committed snapshot on disk (first attempt only), falling back to the previous check on retries. - Add an E2E regression test covering rejection of snapshot-less joins and successful join after fetching a snapshot; add a changelog entry.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/node/rpc/node_call_types.h |
Adds optional retry_count to join request type. |
src/node/rpc/serialization.h |
Adds retry_count to JSON serialization for join request. |
src/node/node_state.h |
Tracks and sends join_retry_count when retrying join after snapshot fetch. |
src/node/rpc/node_frontend.h |
Implements primary-side min acceptable seqno check using latest on-disk committed snapshot on first attempt. |
tests/reconfiguration.py |
Adds regression test for joining without a snapshot against an original primary holding committed snapshots. |
CHANGELOG.md |
Adds [Unreleased] fixed entry describing the behavior change. |
eddyashton
reviewed
Apr 28, 2026
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/77ef7836-6f94-4a76-a8fd-ec303efa9014 Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Contributor
Author
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/0152a93d-bea4-4acf-8376-7be729e36da8 Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
eddyashton
reviewed
Apr 28, 2026
…ror messages Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/1dbff290-d0e6-4564-a52f-6633edf7df3e Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
eddyashton
reviewed
Apr 28, 2026
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/a18950f8-94db-4a15-b9a5-cae39445e222 Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
achamayou
reviewed
Apr 28, 2026
| ccf::errors::StartupSeqnoIsOld, | ||
| fmt::format( | ||
| "Node requested to join from seqno {} which is older than " | ||
| "this node's preferred recent snapshot seqno {} (the latest " |
Member
There was a problem hiding this comment.
We should avoid apostrophes/single quotes in error messages and keep them compact:
"Joiner startup snapshot sequence number ({}) is lower than minimum acceptable value ({})", in.startup_seqno.value(), preferred_seqno
achamayou
reviewed
Apr 28, 2026
| "Node requested to join from seqno {} which is older than this " | ||
| "node startup seqno {}. A snapshot at least as recent as {} must " | ||
| "be used instead.", | ||
| "node's required minimum snapshot seqno {} (this node's " |
achamayou
approved these changes
Apr 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The primary should express an opinion about what snapshot is acceptable. Currently this opinion is "at least my startup_seqno".
To fix #7835 we should have a two step system.
On the first try the primary should require the joiner to have a snapshot which is at its latest snapshot number.
On subsequent tries, that primary should fall back to the safety constraint of greater than
startup_seqno.Copilot generated description:
A joiner with
startup_seqno == 0was accepted by an original (alsostartup_seqno == 0) primary even when that primary held committed snapshots, causing the joiner to replay the entire ledger.Changes
retry_countfield onJoinNetworkNodeToNode::In, incremented by the joiner each time aStartupSeqnoIsOldresponse triggers a snapshot fetch + retry.node_frontend.haccepthandler):retry_count == 0: requirestartup_seqno >= max(this_startup_seqno, latest_on_disk_snapshot_seqno). Themaxkeeps the existing behaviour when this node holds no snapshot.retry_count > 0: fall back to the priorstartup_seqno >= this_startup_seqnocheck, so the joiner doesn't chase a snapshot committed mid-fetch.NodeConfigurationSubsystem+snapshots::find_latest_committed_snapshot_in_directory(i.e. what the joiner could actually fetch from this node), not from theSNAPSHOT_STATUSKV table.NodeState): tracksjoin_retry_count, bumped where the existingStartupSeqnoIsOld+fetch_recent_snapshotpath is taken, and sent on the next request.run_join_no_snapshot_against_original_primaryintests/reconfiguration.pystarts a single-node network, generates a committed snapshot, then asserts:from_snapshot=False, fetch_recent_snapshot=Falseis rejected withStartupSeqnoIsOld(previously silently replayed the ledger);from_snapshot=False, fetch_recent_snapshot=Truesucceeds and the joiner reports a non-zerostartup_seqno.[Unreleased] / Fixedentry.