
fix(module-postgres): phase detection fix + diagnostics WAL budget warnings #606

Merged: Sleepful merged 2 commits into powersync-ja:main from Sleepful:wal-slot-phase-fix on Apr 14, 2026

fix(module-postgres): phase detection fix + diagnostics WAL budget warnings#606
Sleepful merged 2 commits intopowersync-ja:mainfrom
Sleepful:wal-slot-phase-fix

Conversation

@Sleepful (Contributor) commented Apr 14, 2026

Summary

Follow-up to #554. Two fixes from review feedback:

  1. Fix phase detection on retry — When a snapshot fails due to slot invalidation, the retry loop incorrectly reported phase: 'streaming' on the next initSlot() call, which allowed restartReplication() to run and created an infinite loop of snapshot failures that required operator intervention. Fix: derive the phase from the persisted snapshotDone flag instead of hardcoding 'streaming'.

  2. Diagnostics API WAL budget warnings — Add ReplicationError entries to the diagnostics errors array when the WAL budget is low (warning at 50%) or the slot is lost (fatal with PSYNC_S1146). Guard getSlotWalBudget with slot_name check for the validation endpoint.
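The warning logic described in point 2 can be sketched roughly as follows. This is an illustrative sketch only, not the actual diagnostics.ts code; the function name walBudgetErrors and the ReplicationError shape are placeholders.

```typescript
// Illustrative sketch of the diagnostics WAL budget checks; names are hypothetical.
interface ReplicationError {
  level: 'warning' | 'fatal';
  message: string;
}

function walBudgetErrors(
  walStatus: string,
  safeWalSize: number | null,
  maxSlotWalKeepSize: number | null
): ReplicationError[] {
  const errors: ReplicationError[] = [];
  if (walStatus === 'lost') {
    // Slot invalidated: replication cannot continue without operator intervention.
    errors.push({
      level: 'fatal',
      message: `[PSYNC_S1146] Replication slot WAL status is 'lost'`
    });
  } else if (safeWalSize != null && maxSlotWalKeepSize != null && maxSlotWalKeepSize > 0) {
    // Clamp negative safe_wal_size to 0% (see the display bug found in manual testing).
    const pct = Math.max(0, (safeWalSize / maxSlotWalKeepSize) * 100);
    if (pct <= 50) {
      errors.push({
        level: 'warning',
        message: `WAL budget is low: ${pct.toFixed(0)}% remaining`
      });
    }
  }
  return errors;
}
```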

Why snapshotDone is reliable for phase detection

snapshotDone is derived from persisted storage state:

const snapshotDone = status.snapshot_done && status.checkpoint_lsn != null;
| Scenario | snapshot_done | checkpoint_lsn | snapshotDone | Phase | Correct? |
| --- | --- | --- | --- | --- | --- |
| Never started | false | null | false | snapshot | Yes — retry is futile |
| Interrupted mid-snapshot | false | null | false | snapshot | Yes — same reason |
| Snapshot done, no streaming yet | true | null | false | snapshot | Safe — no checkpoint to resume from; would redo the entire snapshot anyway |
| Snapshot done, streaming active | true | non-null | true | streaming | Yes — streaming retry is reasonable |
| Process restarted | persisted | persisted | persisted | — | Survives restarts |

The state is per-sync-rules-version, lives in the storage database (not in-memory), and is the same flag the existing code already uses to decide whether to run startInitialReplication() or go straight to streaming.
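The derivation above can be sketched as a small pure function. This is a minimal sketch under the assumptions in the table, not the actual WalStream code; detectPhase and SyncRulesStatus are hypothetical names.

```typescript
// Minimal sketch: derive the retry phase from persisted storage state
// instead of hardcoding 'streaming'. Names here are illustrative only.
interface SyncRulesStatus {
  snapshot_done: boolean;
  checkpoint_lsn: string | null;
}

function detectPhase(status: SyncRulesStatus): 'snapshot' | 'streaming' {
  // Both conditions must hold: the snapshot completed AND there is a
  // checkpoint LSN to resume streaming from.
  const snapshotDone = status.snapshot_done && status.checkpoint_lsn != null;
  return snapshotDone ? 'streaming' : 'snapshot';
}
```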

Recovery flow

After the fix, when a snapshot fails due to slot invalidation:

  1. checkSlotHealth() throws phase: 'snapshot' → retry blocked
  2. Job retries → initSlot() sees lost slot + snapshotDone === false → throws phase: 'snapshot' → retry blocked again
  3. Job spins checking slot status — no work is done per iteration
  4. Error visible in diagnostics as last_fatal_error with PSYNC_S1146
  5. Operator increases max_slot_wal_keep_size and deletes the slot
  6. Next retry → initSlot() sees no slot → creates new slot, starts fresh snapshot

Files changed

| File | Change |
| --- | --- |
| modules/module-postgres/src/replication/WalStream.ts | One-line fix: `phase: snapshotDone ? 'streaming' : 'snapshot'` |
| modules/module-postgres/test/src/wal_stream.test.ts | Integration test: interrupt snapshot, invalidate slot, retry |
| packages/service-core/src/api/diagnostics.ts | WAL budget warnings in errors array + slot_name guard + negative budget clamp |
| packages/service-core/test/src/diagnostics.test.ts | 5 unit tests for diagnostics warnings |
| .changeset/bright-foxes-leap.md | Changeset |

Testing

  • 1 new integration test (phase detection — exercises previously untested initSlot() Case 1)
  • 5 new unit tests (diagnostics warnings — first test coverage for getSyncRulesStatus())
  • All existing tests pass
  • CI run with just the "red test" failure
Manual test run

Setup

# Create dedicated test database with 2M rows (216 MB)
psql -U postgres -c "CREATE DATABASE powersync_manual_test"
psql -U postgres -d powersync_manual_test -c "CREATE PUBLICATION powersync FOR ALL TABLES"
# insert 2000000 rows to db "FROM generate_series(1, ${ROW_COUNT}) AS i;"

# Set WAL budget to 1MB (slot will be invalidated during snapshot)
psql -U postgres -d powersync_manual_test -c "ALTER SYSTEM SET max_slot_wal_keep_size = '1MB'"
psql -U postgres -d powersync_manual_test -c "SELECT pg_reload_conf()"

# Start API process (background)
node powersync-service/service/lib/entry.js start -r api -c wal-test-powersync.yaml > out/wal-test-api.log 2>&1 &

# Start sync process (foreground, watch logs)
node powersync-service/service/lib/entry.js start -r sync -c wal-test-powersync.yaml 2>&1 | tee out/wal-test-sync.log

Snapshot phase (~30 seconds)

| Log output | Meaning |
| --- | --- |
| `Created replication slot powersync_1_a587` | Slot created, snapshot begins |
| `Replicating "public"."test_data" 10000/~2000000` | Snapshot progressing through 2M rows |
| `Flushed 2000 + 0 + 2000 updates, 1083kb` | Each chunk writes ~1MB to storage — generates WAL accumulating against the 1MB slot limit |

Self-invalidation (~2 min after snapshot start)

The snapshot's own storage writes exceeded the 1MB limit. At the next checkSlotHealth() call:

| Log output | Meaning |
| --- | --- |
| `[PSYNC_S1146] Replication slot powersync_1_a587 was invalidated during snapshot (limit: 1.0MB)` | checkSlotHealth() detected wal_status: 'lost' and threw with phase: 'snapshot'. Snapshot aborted. |

Retry loop (immediate, continuous)

| Log output | Meaning |
| --- | --- |
| `Replication error [PSYNC_S1146] ... {"phase":"snapshot","walStatus":"lost"}` (repeated) | initSlot() finds the slot still lost with snapshotDone === false → reports phase: 'snapshot'; shouldRetryReplication() returns false → no restartReplication() → job spins idly |
| No `Created replication slot` messages | Retry is blocked — no futile snapshot restart |

Diagnostics check (during retry loop)

curl -s http://localhost:8080/api/admin/v1/diagnostics -X POST -H "Authorization: Bearer dev" \
  | jq '.data.deploying_sync_rules | {errors, connection: .connections[0]}'
{
  "errors": [
    {"level": "fatal", "message": "[PSYNC_S1146] Replication slot powersync_1_a587 was invalidated..."},
    {"level": "fatal", "message": "[PSYNC_S1146] Replication slot WAL status is 'lost'..."}
  ],
  "connection": {
    "wal_status": "lost",
    "max_slot_wal_keep_size": 1048576,
    "initial_replication_done": false
  }
}

Two errors: the last_fatal_error from the sync process + our new diagnostics warning. Slot is lost, snapshot never completed.

Diagnostics check (budget warning — transient state)

Between health checks, PG briefly changed wal_status from lost to unreserved after recycling WAL. The diagnostics API caught the low budget warning:

{
  "errors": [
    {"level": "warning", "message": "WAL budget is low: -71596% remaining..."}
  ],
  "connection": { "wal_status": "unreserved", "max_slot_wal_keep_size": 1048576 }
}

The warning fired correctly — safe_wal_size was deeply negative (2.4GB consumed against 1MB limit), which proves the budget warning logic works. The negative percentage was a display bug, fixed by clamping to Math.max(0, ...).

Recovery

# Remove the WAL limit
psql -U postgres -d powersync_manual_test -c "ALTER SYSTEM SET max_slot_wal_keep_size = '-1'"
psql -U postgres -d powersync_manual_test -c "SELECT pg_reload_conf()"

# Delete the lost slot to trigger recovery
psql -U postgres -d powersync_manual_test -c "SELECT pg_drop_replication_slot('powersync_1_a587')"

~15 seconds later, the sync process detects the missing slot and recovers:

| Log output | Meaning |
| --- | --- |
| `Created replication slot powersync_1_a587` | initSlot() sees missing slot + snapshotDone === false → case 2 → creates new slot |
| `Replicating "public"."test_data" 600000/~2000000` | Fresh snapshot progressing with unlimited WAL budget |

Diagnostics check (after recovery)

{
  "errors": [],
  "connection": {
    "wal_status": "reserved",
    "max_slot_wal_keep_size": null,
    "initial_replication_done": false
  }
}

Clean — no errors, slot healthy (reserved), limit unlimited (null), snapshot in progress.

Bug found during manual testing

safe_wal_size can go negative when the WAL consumed exceeds max_slot_wal_keep_size but the slot has not yet been checkpointed as lost (wal_status: 'unreserved'). The budget percentage was computed as -71596%. Fixed by clamping with Math.max(0, ...).
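A minimal sketch of the fix, assuming a percentage helper along these lines (budgetPercent is a hypothetical name, not the actual diagnostics.ts function):

```typescript
// Illustrative only: budget percentage with the Math.max(0, ...) clamp.
// safe_wal_size can be negative when consumed WAL exceeds the limit but the
// slot is still 'unreserved' rather than 'lost'; clamp to 0% for display.
function budgetPercent(safeWalSize: number, maxSlotWalKeepSize: number): number {
  return Math.max(0, (safeWalSize / maxSlotWalKeepSize) * 100);
}
```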

@changeset-bot (bot) commented Apr 14, 2026

🦋 Changeset detected

Latest commit: a829796

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 12 packages
| Name | Type |
| --- | --- |
| @powersync/service-core | Patch |
| @powersync/service-module-postgres | Patch |
| @powersync/service-core-tests | Patch |
| @powersync/service-module-core | Patch |
| @powersync/service-module-mongodb-storage | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-module-mssql | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres-storage | Patch |
| @powersync/service-image | Patch |
| test-client | Patch |
| @powersync/service-schema | Patch |


@Sleepful force-pushed the wal-slot-phase-fix branch from bdd65b8 to 0f17640 on April 14, 2026 08:08
Test exercises initSlot() with a lost slot and snapshotDone === false
(Case 1). Currently fails: initSlot() hardcodes phase as streaming
instead of deriving it from snapshotDone. The test expects phase to
be snapshot so retry is blocked for snapshot failures requiring
operator intervention to recover.
@Sleepful force-pushed the wal-slot-phase-fix branch from 0f17640 to 54dddd0 on April 14, 2026 08:11
When a slot is found lost in initSlot(), use the persisted snapshotDone
flag to determine the phase instead of hardcoding streaming. If the
snapshot was not completed (snapshotDone === false), report phase as
snapshot to block retries for snapshot failures requiring operator
intervention to recover.

Also add WAL budget warnings to diagnostics API errors array: fatal
error when slot is lost, warning when budget at or below 50%. Guard
getSlotWalBudget call with slot_name check for validation endpoint.
Clamp negative safe_wal_size to 0% in budget percentage calculation.
@Sleepful force-pushed the wal-slot-phase-fix branch from 54dddd0 to a829796 on April 14, 2026 08:13
@Sleepful Sleepful marked this pull request as ready for review April 14, 2026 08:36
@Sleepful Sleepful requested a review from rkistner April 14, 2026 08:47
@Sleepful Sleepful merged commit 9add445 into powersync-ja:main Apr 14, 2026
32 checks passed