fix: add monotonic term/epoch for replication fencing (PILOT-328) #18
matthew-pilot wants to merge 1 commit into
Add Term field to wal.Snapshot for persistence and replication. Each snapshot carries the current term; applySnapshot rejects stale-term pushes from a resurrected former primary. PromoteToPrimary() increments the term and persists it immediately, fencing the old primary.

Tracking: PILOT-328

Scope: wal/snapshot.go, server.go, server_api.go, replication.go, server_persist.go
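The promote-and-persist path described above could look roughly like the sketch below. It is not the code in this PR: the `Snapshot` and `Server` types are trimmed-down stand-ins, and the `path` field, the `persist` helper, the JSON-on-disk layout, and `demo` are assumptions made for illustration. It is single-threaded and skips fsync for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Snapshot is a hypothetical stand-in for wal.Snapshot, reduced to the term.
type Snapshot struct {
	Term uint64 `json:"term"`
}

// Server is a hypothetical stand-in with only the fields this sketch needs.
type Server struct {
	term uint64
	path string // assumed location of the on-disk snapshot
}

// persist writes the current term to a temp file, then renames it into place.
func (s *Server) persist() error {
	data, err := json.Marshal(Snapshot{Term: s.term})
	if err != nil {
		return err
	}
	tmp := s.path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, s.path)
}

// PromoteToPrimary increments the term and persists it before returning.
func (s *Server) PromoteToPrimary() (uint64, error) {
	s.term++
	if err := s.persist(); err != nil {
		s.term-- // keep the in-memory term in step with what is on disk
		return s.term, err
	}
	return s.term, nil
}

// load restores the term from disk; a missing file means a fresh node at term 0.
func (s *Server) load() error {
	data, err := os.ReadFile(s.path)
	if os.IsNotExist(err) {
		s.term = 0
		return nil
	}
	if err != nil {
		return err
	}
	var snap Snapshot
	if err := json.Unmarshal(data, &snap); err != nil {
		return err
	}
	s.term = snap.Term
	return nil
}

// demo promotes twice, simulates a cold restart, and returns the restored term.
func demo() (uint64, error) {
	dir, err := os.MkdirTemp("", "term-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "snapshot.json")
	s := &Server{path: path}
	for i := 0; i < 2; i++ {
		if _, err := s.PromoteToPrimary(); err != nil {
			return 0, err
		}
	}

	restarted := &Server{path: path}
	if err := restarted.load(); err != nil {
		return 0, err
	}
	return restarted.term, nil
}

func main() {
	term, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("term after restart:", term) // prints "term after restart: 2"
}
```

Persisting inside PromoteToPrimary() matters because a term that only lives in memory would fall back to its old value on restart, which is exactly the cold-restart regression this change is meant to close.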
Codecov Report
✅ All modified and coverable lines are covered by tests.
📊 Status — PR #18 (PILOT-328)
🔍 Walkthrough — PILOT-328: Replication Term Fencing
This PR adds a monotonic term/epoch to prevent a resurrected former primary from overwriting state after a failover. 5 files, 43 lines added.

| Scenario | Before | After |
|---|---|---|
| Primary fails, standby promoted | Old primary can keep writing | Standby increments term → old primary snapshots rejected |
| Network partition heals | Both sides write, last-write-wins on merge | Term comparison gates out stale writes |
| Cold restart | Term reset to 0 | Term restored from snapshot on disk |
The fencing is soft (log + ignore) rather than hard (panic/exit), which keeps the revived-but-stale primary alive as a read-only standby until an operator demotes it.
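A minimal sketch of the soft-fencing check follows, under stated assumptions: the `Data` payload field, the boolean return value, and the strict `<` comparison (a snapshot at the current term is still accepted) are illustrative choices, not confirmed details of applySnapshot() in this PR.

```go
package main

import (
	"fmt"
	"log"
)

// Snapshot is a hypothetical stand-in for wal.Snapshot.
type Snapshot struct {
	Term uint64
	Data string // assumed payload, for illustration only
}

// Server is a hypothetical stand-in for the replicated server state.
type Server struct {
	term uint64
	data string
}

// applySnapshot reports whether the snapshot was applied.
// Soft fencing: a stale-term snapshot is logged and ignored, never fatal.
func (s *Server) applySnapshot(snap Snapshot) bool {
	if snap.Term < s.term {
		log.Printf("ignoring stale snapshot: term %d < current term %d", snap.Term, s.term)
		return false
	}
	// An equal or newer term is accepted; adopt the sender's term.
	s.term = snap.Term
	s.data = snap.Data
	return true
}

func main() {
	standby := &Server{term: 2, data: "from-new-primary"}

	// A push from the resurrected former primary, still at term 1, is dropped.
	fmt.Println(standby.applySnapshot(Snapshot{Term: 1, Data: "from-old-primary"})) // false

	// Pushes at the current or a later term go through.
	fmt.Println(standby.applySnapshot(Snapshot{Term: 2, Data: "update"}))         // true
	fmt.Println(standby.applySnapshot(Snapshot{Term: 3, Data: "after-failover"})) // true
}
```

Returning a value instead of panicking is what keeps the process alive after it sees a stale push; a hard-fencing variant would exit at the point where this sketch logs.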
Summary
Adds a monotonic replication term/epoch to prevent stale-primary writes. When a standby is promoted to primary, the term increments, fencing any resurrected former primary. Standbys reject snapshots with a stale term.
Changes
- wal/snapshot.go: Add `Term uint64` to Snapshot struct
- server.go: Add `term uint64` field to Server
- server_api.go: Add `PromoteToPrimary()` and `Term()` methods
- replication.go: Include term in snapshotJSON(); reject stale-term snapshots in applySnapshot()
- server_persist.go: Restore term on load()

Verification
- `go build ./...` ✅
- `go test ./...` ✅ (all 19 packages, 0 failures)
- `go vet ./...` ✅

Scope
Ticket
PILOT-328 — registry replication: no fencing/lease → old primary can keep writing after partition