fix: add monotonic term/epoch for replication fencing (PILOT-328) #18

Open
matthew-pilot wants to merge 1 commit into main from openclaw/pilot-328-20260530-025812

@matthew-pilot
Collaborator

Summary

Adds a monotonic replication term/epoch to prevent stale-primary writes. When a standby is promoted to primary, the term increments, fencing any resurrected former primary. Standbys reject snapshots with a stale term.

Changes

  • wal/snapshot.go: Add Term uint64 to Snapshot struct
  • server.go: Add term uint64 field to Server
  • server_api.go: Add PromoteToPrimary() and Term() methods
  • replication.go: Include term in snapshotJSON(); reject stale-term snapshots in applySnapshot()
  • server_persist.go: Restore term on load()

Verification

  • Build: go build ./...
  • Tests: go test ./... ✅ (all 19 packages, 0 failures)
  • Vet: go vet ./...

Scope

 replication.go    | 14 ++++++++++++++
 server.go         |  4 ++++
 server_api.go     | 20 ++++++++++++++++++++
 server_persist.go |  1 +
 wal/snapshot.go   |  4 ++++
 5 files changed, 43 insertions(+)

Ticket

PILOT-328 — registry replication: no fencing/lease → old primary can keep writing after partition

Add Term field to wal.Snapshot for persistence and replication.
Snapshot includes the current term; applySnapshot rejects stale-term
pushes from a resurrected former primary. PromoteToPrimary() increments
the term and persists immediately, fencing the old primary.

Tracking: PILOT-328
Scope: wal/snapshot.go, server.go, server_api.go, replication.go, server_persist.go
matthew-pilot added the matthew-fix-larger label (Medium-scope autonomous fix, ≤10 files, ≤200 LoC) on May 30, 2026
@codecov

codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@matthew-pilot
Collaborator Author

📊 Status — PR #18 (PILOT-328)

  • PR state: OPEN, MERGEABLE ✅ (CLEAN — no merge conflicts)
  • CI: 2/2 ✅ (test ✅, codecov/patch ✅)
  • Canary: not configured — rendezvous has no canary workflow in pilot-canary
  • Jira: PILOT-328, IN WORK (unassigned)
  • Scope: 5 files, +43 LoC — medium tier (matthew-fix-larger)
  • Operator action: no TeoSlayer activity on this PR since creation (02:58 UTC)

@matthew-pilot
Collaborator Author

🔍 Walkthrough — PILOT-328: Replication Term Fencing

This PR adds a monotonic term/epoch to prevent a resurrected former primary from overwriting state after a failover. 5 files, 43 lines added.

wal/snapshot.go:105-108 (struct field)

Adds Term uint64 to the Snapshot struct, serialized as "term" in the JSON snapshot. Old snapshots without the field deserialize as Term=0 (the Go zero value), which keeps them readable; omitempty only affects output, leaving the field out of serialized snapshots while the term is still 0.

server.go:142-145 (field)

Adds term uint64 to the Server struct. This is the in-memory replication epoch, incremented on primary promotion. Zero-initialized at startup (or restored from snapshot).

server_api.go:30-49 (new methods)

  • PromoteToPrimary() — increments s.term under lock, calls flushSave() to persist, logs the promotion at WARN level. Call this on the standby being promoted.
  • Term() — thread-safe getter for the current epoch.
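
A minimal sketch of how these two methods might look, assuming a sync.RWMutex guards the term. The persist callback is a hypothetical stand-in for flushSave(), and the return values are an assumption; the PR's actual signatures are not shown in this walkthrough:

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// Server is a hypothetical stand-in holding only the replication term.
type Server struct {
	mu   sync.RWMutex
	term uint64
	// persist is an assumed hook standing in for flushSave().
	persist func(term uint64) error
}

// PromoteToPrimary increments the term under the write lock, persists it,
// and logs the promotion. Returning the new term is an illustrative choice.
func (s *Server) PromoteToPrimary() (uint64, error) {
	s.mu.Lock()
	s.term++
	term := s.term
	s.mu.Unlock()

	// Persist before announcing, so a crash cannot roll the term back.
	if err := s.persist(term); err != nil {
		return term, fmt.Errorf("persist term %d: %w", term, err)
	}
	log.Printf("WARN promoted to primary, term=%d", term)
	return term, nil
}

// Term returns the current epoch under the read lock.
func (s *Server) Term() uint64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.term
}

func main() {
	s := &Server{persist: func(uint64) error { return nil }}
	if _, err := s.PromoteToPrimary(); err != nil {
		panic(err)
	}
	if _, err := s.PromoteToPrimary(); err != nil {
		panic(err)
	}
	fmt.Println(s.Term()) // 2
}
```

Persisting before returning matters for fencing: if the incremented term were lost on a crash, a restarted node could come back with a term no higher than the old primary's.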

replication.go:177-179, 245-255, 433 (fencing logic)

  • Line 178: snapshotJSON() now includes snap.Term = s.term in every outgoing snapshot.
  • Lines 245-255: applySnapshot() holds the core fencing check. If the incoming snapshot's Term is lower than the server's current term, it came from a stale former primary, so the standby logs a warning and ignores it (return nil). This is a soft reject: no error, just a no-op.
  • Line 433: After a valid snapshot is applied, s.term = snap.Term advances the local epoch to match.
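
The fencing rule in the bullets above can be sketched as follows. The Standby and Snapshot types and the Data field are hypothetical, and this version returns a bool so the outcome is visible in a demo, whereas the PR describes returning nil on a stale push:

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// Snapshot is a hypothetical stand-in carrying only a term and a payload.
type Snapshot struct {
	Term uint64
	Data string
}

// Standby is a hypothetical receiver of replicated snapshots.
type Standby struct {
	mu   sync.Mutex
	term uint64
	data string
}

// applySnapshot drops snapshots whose term is lower than the local term
// (soft reject) and otherwise applies them and adopts their term.
// It reports whether the snapshot was applied.
func (s *Standby) applySnapshot(snap Snapshot) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	if snap.Term < s.term {
		log.Printf("WARN ignoring stale snapshot: term %d < local term %d", snap.Term, s.term)
		return false
	}
	s.data = snap.Data
	s.term = snap.Term // advance the local epoch to match the sender
	return true
}

func main() {
	s := &Standby{term: 2, data: "current"}

	fmt.Println(s.applySnapshot(Snapshot{Term: 1, Data: "stale"})) // false
	fmt.Println(s.applySnapshot(Snapshot{Term: 3, Data: "fresh"})) // true
	fmt.Println(s.term, s.data)                                    // 3 fresh
}
```

Note that the comparison is strictly "lower than", so a snapshot carrying the same term as the receiver is still applied.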

server_persist.go:512 (restore)

On server restart, the saved term is restored from the on-disk snapshot during load().

Design summary

| Scenario | Before | After |
| --- | --- | --- |
| Primary fails, standby promoted | Old primary can keep writing | Standby increments term → old primary snapshots rejected |
| Network partition heals | Both sides write, last-write-wins on merge | Term comparison gates out stale writes |
| Cold restart | Term reset to 0 | Term restored from snapshot on disk |

The fencing is soft (log + ignore) rather than hard (panic/exit): stale snapshots are dropped on the receiving side, and the revived-but-stale primary stays alive until an operator demotes it.
