fix(module-postgres): phase detection fix + diagnostics WAL budget warnings #606
Merged · Sleepful merged 2 commits into powersync-ja:main · Apr 14, 2026
🦋 Changeset detected. Latest commit: a829796. The changes in this PR will be included in the next version bump. This PR includes changesets to release 12 packages.
Force-pushed bdd65b8 to 0f17640:
Test exercises `initSlot()` with a lost slot and `snapshotDone === false` (Case 1). Currently fails: `initSlot()` hardcodes the phase as `streaming` instead of deriving it from `snapshotDone`. The test expects the phase to be `snapshot`, so that retry is blocked for snapshot failures that require operator intervention to recover.
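The behavior this test pins down can be sketched as follows. This is an illustrative sketch, not the actual `WalStream.ts` API: `ReplicationPhase`, `phaseForLostSlot`, and the boolean signature of `shouldRetryReplication` are hypothetical names used here to show the intended logic.

```typescript
// Illustrative sketch; names are hypothetical, not the real WalStream.ts API.
type ReplicationPhase = 'snapshot' | 'streaming';

// When initSlot() finds a lost slot, derive the phase from the persisted
// snapshotDone flag instead of hardcoding 'streaming'.
function phaseForLostSlot(snapshotDone: boolean): ReplicationPhase {
  return snapshotDone ? 'streaming' : 'snapshot';
}

// Snapshot-phase failures require operator intervention and must not be
// retried; streaming failures are transient and safe to retry.
function shouldRetryReplication(phase: ReplicationPhase): boolean {
  return phase === 'streaming';
}
```

With `snapshotDone === false`, a lost slot reports `phase: 'snapshot'` and the retry is blocked, which is exactly the failing case the test exercises.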
Force-pushed 0f17640 to 54dddd0:
When a slot is found to be lost in `initSlot()`, use the persisted `snapshotDone` flag to determine the phase instead of hardcoding `streaming`. If the snapshot was not completed (`snapshotDone === false`), report the phase as `snapshot` to block retries for snapshot failures that require operator intervention to recover. Also add WAL budget warnings to the diagnostics API `errors` array: a fatal error when the slot is lost, and a warning when the budget is at or below 50%. Guard the `getSlotWalBudget` call with a `slot_name` check for the validation endpoint. Clamp negative `safe_wal_size` to 0% in the budget percentage calculation.
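The diagnostics side of this commit can be sketched roughly as below. The shapes and the helper name `walBudgetErrors` are assumptions for illustration; the real `ReplicationError` type and thresholds live in `packages/service-core/src/api/diagnostics.ts`.

```typescript
// Simplified sketch of the diagnostics warnings; not the real service-core
// types. Error levels and the 50% threshold follow the commit description.
interface DiagnosticsError {
  level: 'warning' | 'fatal';
  message: string;
}

function walBudgetErrors(
  walStatus: string,
  budgetPercent: number,
  slotName: string
): DiagnosticsError[] {
  const errors: DiagnosticsError[] = [];
  if (walStatus === 'lost') {
    // Slot invalidated by max_slot_wal_keep_size: fatal, operator action needed.
    errors.push({
      level: 'fatal',
      message: `[PSYNC_S1146] Replication slot WAL status is 'lost' for ${slotName}`
    });
  } else if (budgetPercent <= 50) {
    // Budget at or below 50%: warn before the slot is actually lost.
    errors.push({
      level: 'warning',
      message: `WAL budget is low: ${budgetPercent}% remaining`
    });
  }
  return errors;
}
```

A lost slot yields a fatal entry regardless of the budget; a healthy slot yields a warning only once the remaining budget drops to 50% or below.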
Force-pushed 54dddd0 to a829796:
rkistner approved these changes · Apr 14, 2026
## Summary
Follow-up to #554. Two fixes from review feedback:
1. **Fix phase detection on retry.** When a snapshot fails due to slot invalidation, the retry loop incorrectly reported `phase: 'streaming'` on the next `initSlot()` call, allowing `restartReplication()` and creating an infinite loop of snapshot failures requiring operator intervention. Fix: derive the phase from the persisted `snapshotDone` flag instead of hardcoding `'streaming'`.
2. **Diagnostics API WAL budget warnings.** Add `ReplicationError` entries to the diagnostics `errors` array when the WAL budget is low (warning at 50%) or the slot is lost (fatal with `PSYNC_S1146`). Guard `getSlotWalBudget` with a `slot_name` check for the validation endpoint.

## Why `snapshotDone` is reliable for phase detection

`snapshotDone` is derived from persisted storage state (`snapshot_done` and `checkpoint_lsn`). The state is per-sync-rules-version, lives in the storage database (not in memory), and is the same flag the existing code already uses to decide whether to run `startInitialReplication()` or go straight to streaming.

## Recovery flow
After the fix, when a snapshot fails due to slot invalidation:
- `checkSlotHealth()` throws with `phase: 'snapshot'` → retry blocked
- `initSlot()` sees lost slot + `snapshotDone === false` → throws with `phase: 'snapshot'` → retry blocked again
- Operator sees `last_fatal_error` with `PSYNC_S1146`, adjusts `max_slot_wal_keep_size`, and deletes the slot
- `initSlot()` sees no slot → creates a new slot, starts a fresh snapshot

## Files changed
- `modules/module-postgres/src/replication/WalStream.ts`: `phase: snapshotDone ? 'streaming' : 'snapshot'`
- `modules/module-postgres/test/src/wal_stream.test.ts`
- `packages/service-core/src/api/diagnostics.ts`
- `packages/service-core/test/src/diagnostics.test.ts`
- `.changeset/bright-foxes-leap.md`

## Testing
- Unit test for `initSlot()` (Case 1)
- Unit tests for diagnostics (`getSyncRulesStatus()`)
- Manual test run (below)

### Manual test run
#### Setup
#### Snapshot phase (~30 seconds)
```
Created replication slot powersync_1_a587
Replicating "public"."test_data" 10000/~2000000
Flushed 2000 + 0 + 2000 updates, 1083kb
```

#### Self-invalidation (~2 min after snapshot start)
The snapshot's own storage writes exceeded the 1MB limit. At the next `checkSlotHealth()` call:

```
[PSYNC_S1146] Replication slot powersync_1_a587 was invalidated during snapshot (limit: 1.0MB)
```

`checkSlotHealth()` detected `wal_status: 'lost'` and threw with `phase: 'snapshot'`. Snapshot aborted.

#### Retry loop (immediate, continuous)
```
Replication error [PSYNC_S1146] ... {"phase":"snapshot","walStatus":"lost"}
```

(repeated) `initSlot()` finds the slot still lost with `snapshotDone === false` → reports `phase: 'snapshot'` → `shouldRetryReplication()` returns `false` → no `restartReplication()` → the job spins idly, with no new `Created replication slot` messages.

#### Diagnostics check (during retry loop)
```json
{
  "errors": [
    { "level": "fatal", "message": "[PSYNC_S1146] Replication slot powersync_1_a587 was invalidated..." },
    { "level": "fatal", "message": "[PSYNC_S1146] Replication slot WAL status is 'lost'..." }
  ],
  "connection": {
    "wal_status": "lost",
    "max_slot_wal_keep_size": 1048576,
    "initial_replication_done": false
  }
}
```

Two errors: the `last_fatal_error` from the sync process plus the new diagnostics warning. The slot is `lost` and the snapshot never completed.

#### Diagnostics check (budget warning, transient state)
Between health checks, PG briefly changed `wal_status` from `lost` to `unreserved` after recycling WAL. The diagnostics API caught the low budget warning:

```json
{
  "errors": [
    { "level": "warning", "message": "WAL budget is low: -71596% remaining..." }
  ],
  "connection": {
    "wal_status": "unreserved",
    "max_slot_wal_keep_size": 1048576
  }
}
```

The warning fired correctly: `safe_wal_size` was deeply negative (2.4GB consumed against a 1MB limit), which proves the budget warning logic works. The negative percentage was a display bug, fixed by clamping with `Math.max(0, ...)`.

#### Recovery
~15 seconds later, the sync process detects the missing slot and recovers:
```
Created replication slot powersync_1_a587
```

`initSlot()` sees the missing slot with `snapshotDone === false` → case 2 → creates a new slot.

```
Replicating "public"."test_data" 600000/~2000000
```

#### Diagnostics check (after recovery)
```json
{
  "errors": [],
  "connection": {
    "wal_status": "reserved",
    "max_slot_wal_keep_size": null,
    "initial_replication_done": false
  }
}
```

Clean: no errors, the slot is healthy (`reserved`), the limit is unlimited (`null`), and the snapshot is in progress.

## Bug found during manual testing
`safe_wal_size` can go negative when the WAL consumed exceeds `max_slot_wal_keep_size` but the slot has not yet been checkpointed as `lost` (`wal_status: 'unreserved'`). The budget percentage was computed as `-71596%`. Fixed by clamping with `Math.max(0, ...)`.
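The clamped calculation can be sketched as below; `walBudgetPercent` is an illustrative name, not the actual helper in the codebase.

```typescript
// Illustrative sketch: clamp negative safe_wal_size to 0 so the budget
// percentage never goes negative while wal_status is still 'unreserved'.
function walBudgetPercent(safeWalSize: number, maxSlotWalKeepSize: number): number {
  return Math.round((Math.max(0, safeWalSize) / maxSlotWalKeepSize) * 100);
}
```

With the manual-test numbers (roughly -2.4GB `safe_wal_size` against a 1MB limit), this reports 0% instead of -71596%.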