
fix(module-postgres): phase detection fix + diagnostics WAL budget warnings #606

Merged: Sleepful merged 2 commits into powersync-ja:main from Sleepful:wal-slot-phase-fix on Apr 14, 2026

fix(module-postgres): phase detection fix + diagnostics WAL budget warnings#606
Sleepful merged 2 commits intopowersync-ja:mainfrom
Sleepful:wal-slot-phase-fix

Conversation

@Sleepful (Contributor) commented Apr 14, 2026

Summary

Follow-up to #554. Two fixes from review feedback:

  1. Fix phase detection on retry — When a snapshot fails due to slot invalidation, the retry loop incorrectly reported phase: 'streaming' on the next initSlot() call, which allowed restartReplication() to run and created an infinite loop of snapshot failures that required operator intervention. Fix: derive the phase from the persisted snapshotDone flag instead of hardcoding 'streaming'.

  2. Diagnostics API WAL budget warnings — Add ReplicationError entries to the diagnostics errors array when the WAL budget is low (warning at 50%) or the slot is lost (fatal with PSYNC_S1146). Guard getSlotWalBudget with slot_name check for the validation endpoint.
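The warning logic described in point 2 can be sketched roughly as follows. This is an illustrative sketch only, not the actual diagnostics.ts code; the function name walBudgetErrors and the ReplicationError shape are placeholders.

```typescript
// Illustrative sketch of the diagnostics WAL budget checks; names are hypothetical.
interface ReplicationError {
  level: 'warning' | 'fatal';
  message: string;
}

function walBudgetErrors(
  walStatus: string,
  safeWalSize: number | null,
  maxSlotWalKeepSize: number | null
): ReplicationError[] {
  const errors: ReplicationError[] = [];
  if (walStatus === 'lost') {
    // Slot invalidated: replication cannot continue without operator intervention.
    errors.push({
      level: 'fatal',
      message: `[PSYNC_S1146] Replication slot WAL status is 'lost'`
    });
  } else if (safeWalSize != null && maxSlotWalKeepSize != null && maxSlotWalKeepSize > 0) {
    // Clamp negative safe_wal_size to 0% (see the display bug found in manual testing).
    const pct = Math.max(0, (safeWalSize / maxSlotWalKeepSize) * 100);
    if (pct <= 50) {
      errors.push({
        level: 'warning',
        message: `WAL budget is low: ${pct.toFixed(0)}% remaining`
      });
    }
  }
  return errors;
}
```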

Why snapshotDone is reliable for phase detection

snapshotDone is derived from persisted storage state:

const snapshotDone = status.snapshot_done && status.checkpoint_lsn != null;
| Scenario | snapshot_done | checkpoint_lsn | snapshotDone | Phase | Correct? |
| --- | --- | --- | --- | --- | --- |
| Never started | false | null | false | snapshot | Yes — retry is futile |
| Interrupted mid-snapshot | false | null | false | snapshot | Yes — same reason |
| Snapshot done, no streaming yet | true | null | false | snapshot | Safe — no checkpoint to resume from; would redo the entire snapshot anyway |
| Snapshot done, streaming active | true | non-null | true | streaming | Yes — streaming retry is reasonable |
| Process restarted | persisted | persisted | persisted | — | Survives restarts |

The state is per-sync-rules-version, lives in the storage database (not in-memory), and is the same flag the existing code already uses to decide whether to run startInitialReplication() or go straight to streaming.
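The derivation above can be sketched as a small pure function. This is a minimal sketch under the assumptions in the table, not the actual WalStream code; detectPhase and SyncRulesStatus are hypothetical names.

```typescript
// Minimal sketch: derive the retry phase from persisted storage state
// instead of hardcoding 'streaming'. Names here are illustrative only.
interface SyncRulesStatus {
  snapshot_done: boolean;
  checkpoint_lsn: string | null;
}

function detectPhase(status: SyncRulesStatus): 'snapshot' | 'streaming' {
  // Both conditions must hold: the snapshot completed AND there is a
  // checkpoint LSN to resume streaming from.
  const snapshotDone = status.snapshot_done && status.checkpoint_lsn != null;
  return snapshotDone ? 'streaming' : 'snapshot';
}
```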

Recovery flow

After the fix, when a snapshot fails due to slot invalidation:

  1. checkSlotHealth() throws phase: 'snapshot' → retry blocked
  2. Job retries → initSlot() sees lost slot + snapshotDone === false → throws phase: 'snapshot' → retry blocked again
  3. Job spins checking slot status — no work is done per iteration
  4. Error visible in diagnostics as last_fatal_error with PSYNC_S1146
  5. Operator increases max_slot_wal_keep_size and deletes the slot
  6. Next retry → initSlot() sees no slot → creates new slot, starts fresh snapshot

Files changed

| File | Change |
| --- | --- |
| modules/module-postgres/src/replication/WalStream.ts | One-line fix: `phase: snapshotDone ? 'streaming' : 'snapshot'` |
| modules/module-postgres/test/src/wal_stream.test.ts | Integration test: interrupt snapshot, invalidate slot, retry |
| packages/service-core/src/api/diagnostics.ts | WAL budget warnings in errors array + slot_name guard + negative budget clamp |
| packages/service-core/test/src/diagnostics.test.ts | 5 unit tests for diagnostics warnings |
| .changeset/bright-foxes-leap.md | Changeset |

Testing

  • 1 new integration test (phase detection — exercises previously untested initSlot() Case 1)
  • 5 new unit tests (diagnostics warnings — first test coverage for getSyncRulesStatus())
  • All existing tests pass
  • CI run with just the "red test" failure
Manual test run

Setup

# Create dedicated test database with 2M rows (216 MB)
psql -U postgres -c "CREATE DATABASE powersync_manual_test"
psql -U postgres -d powersync_manual_test -c "CREATE PUBLICATION powersync FOR ALL TABLES"
# insert 2000000 rows to db "FROM generate_series(1, ${ROW_COUNT}) AS i;"

# Set WAL budget to 1MB (slot will be invalidated during snapshot)
psql -U postgres -d powersync_manual_test -c "ALTER SYSTEM SET max_slot_wal_keep_size = '1MB'"
psql -U postgres -d powersync_manual_test -c "SELECT pg_reload_conf()"

# Start API process (background)
node powersync-service/service/lib/entry.js start -r api -c wal-test-powersync.yaml > out/wal-test-api.log 2>&1 &

# Start sync process (foreground, watch logs)
node powersync-service/service/lib/entry.js start -r sync -c wal-test-powersync.yaml 2>&1 | tee out/wal-test-sync.log

Snapshot phase (~30 seconds)

| Log output | Meaning |
| --- | --- |
| `Created replication slot powersync_1_a587` | Slot created, snapshot begins |
| `Replicating "public"."test_data" 10000/~2000000` | Snapshot progressing through 2M rows |
| `Flushed 2000 + 0 + 2000 updates, 1083kb` | Each chunk writes ~1MB to storage — generates WAL accumulating against the 1MB slot limit |

Self-invalidation (~2 min after snapshot start)

The snapshot's own storage writes exceeded the 1MB limit. At the next checkSlotHealth() call:

| Log output | Meaning |
| --- | --- |
| `[PSYNC_S1146] Replication slot powersync_1_a587 was invalidated during snapshot (limit: 1.0MB)` | checkSlotHealth() detected wal_status: 'lost' and threw with phase: 'snapshot'. Snapshot aborted. |

Retry loop (immediate, continuous)

| Log output | Meaning |
| --- | --- |
| `Replication error [PSYNC_S1146] ... {"phase":"snapshot","walStatus":"lost"}` (repeated) | initSlot() finds the slot still lost with snapshotDone === false → reports phase: 'snapshot'; shouldRetryReplication() returns false → no restartReplication() → job spins idly |
| No `Created replication slot` messages | Retry is blocked — no futile snapshot restart |

Diagnostics check (during retry loop)

curl -s http://localhost:8080/api/admin/v1/diagnostics -X POST -H "Authorization: Bearer dev" \
  | jq '.data.deploying_sync_rules | {errors, connection: .connections[0]}'
{
  "errors": [
    {"level": "fatal", "message": "[PSYNC_S1146] Replication slot powersync_1_a587 was invalidated..."},
    {"level": "fatal", "message": "[PSYNC_S1146] Replication slot WAL status is 'lost'..."}
  ],
  "connection": {
    "wal_status": "lost",
    "max_slot_wal_keep_size": 1048576,
    "initial_replication_done": false
  }
}

Two errors: the last_fatal_error from the sync process + our new diagnostics warning. Slot is lost, snapshot never completed.

Diagnostics check (budget warning — transient state)

Between health checks, PG briefly changed wal_status from lost to unreserved after recycling WAL. The diagnostics API caught the low budget warning:

{
  "errors": [
    {"level": "warning", "message": "WAL budget is low: -71596% remaining..."}
  ],
  "connection": { "wal_status": "unreserved", "max_slot_wal_keep_size": 1048576 }
}

The warning fired correctly — safe_wal_size was deeply negative (2.4GB consumed against 1MB limit), which proves the budget warning logic works. The negative percentage was a display bug, fixed by clamping to Math.max(0, ...).

Recovery

# Remove the WAL limit
psql -U postgres -d powersync_manual_test -c "ALTER SYSTEM SET max_slot_wal_keep_size = '-1'"
psql -U postgres -d powersync_manual_test -c "SELECT pg_reload_conf()"

# Delete the lost slot to trigger recovery
psql -U postgres -d powersync_manual_test -c "SELECT pg_drop_replication_slot('powersync_1_a587')"

~15 seconds later, the sync process detects the missing slot and recovers:

| Log output | Meaning |
| --- | --- |
| `Created replication slot powersync_1_a587` | initSlot() sees missing slot + snapshotDone === false → case 2 → creates new slot |
| `Replicating "public"."test_data" 600000/~2000000` | Fresh snapshot progressing with unlimited WAL budget |

Diagnostics check (after recovery)

{
  "errors": [],
  "connection": {
    "wal_status": "reserved",
    "max_slot_wal_keep_size": null,
    "initial_replication_done": false
  }
}

Clean — no errors, slot healthy (reserved), limit unlimited (null), snapshot in progress.

Bug found during manual testing

safe_wal_size can go negative when the WAL consumed exceeds max_slot_wal_keep_size but the slot has not yet been checkpointed as lost (wal_status: 'unreserved'). The budget percentage was computed as -71596%. Fixed by clamping with Math.max(0, ...).
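A minimal sketch of the fix, assuming a percentage helper along these lines (budgetPercent is a hypothetical name, not the actual diagnostics.ts function):

```typescript
// Illustrative only: budget percentage with the Math.max(0, ...) clamp.
// safe_wal_size can be negative when consumed WAL exceeds the limit but the
// slot is still 'unreserved' rather than 'lost'; clamp to 0% for display.
function budgetPercent(safeWalSize: number, maxSlotWalKeepSize: number): number {
  return Math.max(0, (safeWalSize / maxSlotWalKeepSize) * 100);
}
```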

@changeset-bot (bot) commented Apr 14, 2026

🦋 Changeset detected

Latest commit: a829796

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 12 packages
| Name | Type |
| --- | --- |
| @powersync/service-core | Patch |
| @powersync/service-module-postgres | Patch |
| @powersync/service-core-tests | Patch |
| @powersync/service-module-core | Patch |
| @powersync/service-module-mongodb-storage | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-module-mssql | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres-storage | Patch |
| @powersync/service-image | Patch |
| test-client | Patch |
| @powersync/service-schema | Patch |


@Sleepful force-pushed the wal-slot-phase-fix branch from bdd65b8 to 0f17640 on April 14, 2026 08:08
Test exercises initSlot() with a lost slot and snapshotDone === false
(Case 1). Currently fails: initSlot() hardcodes phase as streaming
instead of deriving it from snapshotDone. The test expects phase to
be snapshot so retry is blocked for snapshot failures requiring
operator intervention to recover.
@Sleepful force-pushed the wal-slot-phase-fix branch from 0f17640 to 54dddd0 on April 14, 2026 08:11
When a slot is found lost in initSlot(), use the persisted snapshotDone
flag to determine the phase instead of hardcoding streaming. If the
snapshot was not completed (snapshotDone === false), report phase as
snapshot to block retries for snapshot failures requiring operator
intervention to recover.

Also add WAL budget warnings to diagnostics API errors array: fatal
error when slot is lost, warning when budget at or below 50%. Guard
getSlotWalBudget call with slot_name check for validation endpoint.
Clamp negative safe_wal_size to 0% in budget percentage calculation.
@Sleepful force-pushed the wal-slot-phase-fix branch from 54dddd0 to a829796 on April 14, 2026 08:13
@Sleepful Sleepful marked this pull request as ready for review April 14, 2026 08:36
@Sleepful Sleepful requested a review from rkistner April 14, 2026 08:47
@Sleepful Sleepful merged commit 9add445 into powersync-ja:main Apr 14, 2026
32 checks passed