Reject snapshot-less join against primary holding a more recent snapshot #7844

Open
Copilot wants to merge 7 commits into main from copilot/implement-snapshot-joining-behaviour

Contributor

Copilot AI commented Apr 28, 2026

The primary should express an opinion about which snapshot is acceptable. Currently this opinion is "at least my startup_seqno".
To fix #7835 we should have a two-step system.
On the first try, the primary should require the joiner to have a snapshot at least as recent as the primary's latest snapshot.
On subsequent tries, the primary should fall back to the safety constraint: the joiner's startup_seqno must be at least the primary's startup_seqno.

Copilot generated description:

A joiner with startup_seqno == 0 was accepted by an original (also startup_seqno == 0) primary even when that primary held committed snapshots, causing the joiner to replay the entire ledger.

Changes

  • Join request: new optional retry_count field on JoinNetworkNodeToNode::In, incremented by the joiner each time a StartupSeqnoIsOld response triggers a snapshot fetch + retry.
  • Primary-side check (node_frontend.h accept handler):
    • retry_count == 0: require startup_seqno >= max(this_startup_seqno, latest_on_disk_snapshot_seqno). The max keeps the existing behaviour when this node holds no snapshot.
    • retry_count > 0: fall back to the prior startup_seqno >= this_startup_seqno check, so the joiner doesn't chase a snapshot committed mid-fetch.
  • Latest snapshot source: read from disk via NodeConfigurationSubsystem + snapshots::find_latest_committed_snapshot_in_directory (i.e. what the joiner could actually fetch from this node), not from the SNAPSHOT_STATUS KV table.
  • Joiner (NodeState): tracks join_retry_count, bumped where the existing StartupSeqnoIsOld + fetch_recent_snapshot path is taken, and sent on the next request.
  • Test: run_join_no_snapshot_against_original_primary in tests/reconfiguration.py starts a single-node network, generates a committed snapshot, then asserts:
    • join with from_snapshot=False, fetch_recent_snapshot=False is rejected with StartupSeqnoIsOld (previously silently replayed the ledger);
    • join with from_snapshot=False, fetch_recent_snapshot=True succeeds and the joiner reports a non-zero startup_seqno.
  • CHANGELOG: new [Unreleased] / Fixed entry.
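
The two-step primary-side check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in node_frontend.h: `JoinRequest`, `min_acceptable_seqno` and `accept_join` are hypothetical names, and treating an absent `retry_count` as a first attempt is an assumption based on the field being optional.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins: the names follow the PR description,
// not the actual CCF types.
using SeqNo = uint64_t;

struct JoinRequest
{
  SeqNo startup_seqno = 0;
  // Assumption: an absent retry_count is treated like a first attempt.
  std::optional<uint64_t> retry_count;
};

// Minimum startup_seqno the primary would accept for this request.
inline SeqNo min_acceptable_seqno(
  const JoinRequest& req,
  SeqNo this_startup_seqno,
  std::optional<SeqNo> latest_on_disk_snapshot_seqno)
{
  const bool first_attempt = req.retry_count.value_or(0) == 0;
  if (first_attempt)
  {
    // The max keeps the previous behaviour when no snapshot is on disk.
    return std::max(
      this_startup_seqno, latest_on_disk_snapshot_seqno.value_or(0));
  }
  // On retries, fall back to the safety constraint only, so the joiner
  // does not chase a snapshot committed while it was fetching.
  return this_startup_seqno;
}

// False corresponds to responding with StartupSeqnoIsOld.
inline bool accept_join(
  const JoinRequest& req,
  SeqNo this_startup_seqno,
  std::optional<SeqNo> latest_on_disk_snapshot_seqno)
{
  return req.startup_seqno >=
    min_acceptable_seqno(
           req, this_startup_seqno, latest_on_disk_snapshot_seqno);
}
```

Under these assumptions, a joiner with startup_seqno 0 is rejected on its first attempt by an original primary holding a snapshot at seqno 42, but the same request with retry_count 1 passes the fallback check.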

Copilot AI changed the title from "[WIP] Implement behavior for joining nodes to original ledger" to "Reject snapshot-less join against primary holding a more recent snapshot" Apr 28, 2026
Copilot AI requested a review from cjen1-msft April 28, 2026 11:22
Comment thread src/node/rpc/node_call_types.h
Comment thread src/node/rpc/node_frontend.h
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/ee92835a-b9f3-4451-aa71-bfc04a2d774e

Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
Copilot AI requested a review from cjen1-msft April 28, 2026 12:14
@cjen1-msft cjen1-msft marked this pull request as ready for review April 28, 2026 12:21
@cjen1-msft cjen1-msft requested a review from a team as a code owner April 28, 2026 12:21
Copilot AI review requested due to automatic review settings April 28, 2026 12:21
Contributor

Copilot AI left a comment

Pull request overview

This PR hardens the node join protocol to prevent snapshot-less joiners from being accepted by an “original” primary that has already committed snapshots, avoiding full-ledger replay by requiring a sufficiently recent startup snapshot on the initial join attempt.

Changes:

  • Add retry_count to the node-to-node join request and track it on the joiner to distinguish first-attempt joins from retries after snapshot fetch.
  • Update primary-side join acceptance logic to compare the joiner’s startup_seqno against the latest committed snapshot on disk (first attempt only), falling back to the previous check on retries.
  • Add an E2E regression test covering rejection of snapshot-less joins and successful join after fetching a snapshot; add a changelog entry.
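
The joiner side of the retry handshake summarised above might look like the sketch below. It is an illustration only: `Joiner`, `JoinStatus`, the callback signatures and the `max_attempts` bound are hypothetical, and only `join_retry_count`, `startup_seqno`, `fetch_recent_snapshot` and `StartupSeqnoIsOld` are names taken from the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

using SeqNo = uint64_t;

enum class JoinStatus
{
  Accepted,
  StartupSeqnoIsOld
};

// Hypothetical joiner model, not the CCF NodeState class.
struct Joiner
{
  SeqNo startup_seqno = 0;
  uint64_t join_retry_count = 0;
  bool fetch_recent_snapshot = true;

  // send_join(startup_seqno, retry_count) models the request to the primary.
  // fetch_snapshot() models fetching the primary's latest snapshot and
  // returns its seqno, or nullopt if nothing could be fetched.
  bool join(
    const std::function<JoinStatus(SeqNo, uint64_t)>& send_join,
    const std::function<std::optional<SeqNo>()>& fetch_snapshot,
    uint64_t max_attempts = 3)
  {
    assert(max_attempts > 0);
    for (uint64_t attempt = 0; attempt < max_attempts; ++attempt)
    {
      if (send_join(startup_seqno, join_retry_count) == JoinStatus::Accepted)
      {
        return true;
      }
      if (!fetch_recent_snapshot)
      {
        return false; // rejected, and not allowed to fetch a snapshot
      }
      const auto fetched = fetch_snapshot();
      if (!fetched.has_value())
      {
        return false;
      }
      startup_seqno = *fetched;
      // Bumped only on the StartupSeqnoIsOld + fetch path, so the next
      // request is recognisable as a retry.
      ++join_retry_count;
    }
    return false;
  }
};
```

The point of the counter is that a snapshot committed on the primary while the joiner is fetching an older one does not cause a second rejection: the retry is checked only against the primary's own startup_seqno.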

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/node/rpc/node_call_types.h | Adds optional retry_count to join request type. |
| src/node/rpc/serialization.h | Adds retry_count to JSON serialization for join request. |
| src/node/node_state.h | Tracks and sends join_retry_count when retrying join after snapshot fetch. |
| src/node/rpc/node_frontend.h | Implements primary-side min acceptable seqno check using latest on-disk committed snapshot on first attempt. |
| tests/reconfiguration.py | Adds regression test for joining without a snapshot against an original primary holding committed snapshots. |
| CHANGELOG.md | Adds [Unreleased] fixed entry describing the behavior change. |

Comment thread CHANGELOG.md Outdated
Comment thread src/node/rpc/node_frontend.h Outdated
Comment thread CHANGELOG.md Outdated
Contributor Author

Copilot AI commented Apr 28, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh gh pr view --json number (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/0152a93d-bea4-4acf-8376-7be729e36da8

Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
Comment thread src/node/rpc/node_frontend.h Outdated
…ror messages

Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/1dbff290-d0e6-4564-a52f-6633edf7df3e

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Copilot AI requested a review from eddyashton April 28, 2026 12:35
Comment thread src/node/rpc/node_frontend.h Outdated
ccf::errors::StartupSeqnoIsOld,
fmt::format(
"Node requested to join from seqno {} which is older than "
"this node's preferred recent snapshot seqno {} (the latest "
Member

We should avoid apostrophes/single quotes in error messages and keep them compact:

"Joiner startup snapshot sequence number ({}) is lower than minimum acceptable value ({})", in.startup_seqno.value(), preferred_seqno
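
For illustration, the compact wording suggested above could be produced by a small helper like the one below. This is a sketch only: `startup_seqno_too_old_message` is a hypothetical name, and it uses `std::ostringstream` rather than the `fmt::format` call in the diff so that it has no dependency outside the standard library.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: builds the compact, apostrophe-free message
// proposed in the review comment.
inline std::string startup_seqno_too_old_message(
  uint64_t joiner_startup_seqno, uint64_t min_acceptable_seqno)
{
  std::ostringstream os;
  os << "Joiner startup snapshot sequence number (" << joiner_startup_seqno
     << ") is lower than minimum acceptable value (" << min_acceptable_seqno
     << ")";
  return os.str();
}
```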

"Node requested to join from seqno {} which is older than this "
"node startup seqno {}. A snapshot at least as recent as {} must "
"be used instead.",
"node's required minimum snapshot seqno {} (this node's "
Member

Same as above.

Development

Successfully merging this pull request may close these issues.

Join of a snapshot-less node to an original node replays the ledger

5 participants