Fix replication switchover #308
I went through the logic flow and could not spot any issues. This LGTM.
Background
When deploying new sync rules, we create a new stream (using, for example, a new Postgres logical replication slot) for the new version, and process it while the current version stays active. When initial replication is complete, clients switch over to sync from the new copy.
For the new sync rules themselves, replication works roughly as follows:
1. An initial snapshot copies the existing data for the new sync rules version.
2. Streaming replication then processes changes made since the snapshot, until the new version has caught up.
3. Clients switch over to sync from the new version.
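As a rough illustration of this lifecycle only, here is a minimal TypeScript sketch; the type and function names below are hypothetical and not taken from the actual codebase:

```ts
// Hypothetical sketch of the sync rules version lifecycle described above.
type SyncRulesState =
  | 'initial_snapshot' // copying existing data for the new version
  | 'catching_up'      // streaming replication processing changes since the snapshot
  | 'active'           // clients sync from this version
  | 'retired';         // replaced by a newer version

interface SyncRulesVersion {
  id: number;
  state: SyncRulesState;
}

// The previous version stays 'active' until the new one is ready;
// only then do clients switch over.
function switchOver(current: SyncRulesVersion, next: SyncRulesVersion): void {
  next.state = 'active';
  current.state = 'retired';
}
```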
The issue
The main issue is that when the initial snapshot is complete, there could still be a long period before streaming replication has caught up. This is typically not a problem for instances with small data volumes, but it can be significant when replication takes a couple of hours and a lot of new data has come in during that time.
A secondary issue is specific to replicating MongoDB data: until replication has caught up, inconsistent data could be synced to clients.
The fix
This refactors the "autoActivate" behavior: we now switch over to the new sync rules version only when it has a consistent checkpoint.
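A minimal sketch of the changed activation logic, assuming hypothetical storage method names (only "autoActivate" and the consistent-checkpoint requirement come from this PR):

```ts
// Hypothetical sketch: previously the new version could be activated as soon as
// the initial snapshot finished; now activation waits for a consistent checkpoint.
interface SyncRulesStorage {
  snapshotDone(): Promise<boolean>;                    // placeholder
  getConsistentCheckpoint(): Promise<string | null>;  // placeholder
  activate(): Promise<void>;                           // placeholder
}

async function maybeAutoActivate(storage: SyncRulesStorage): Promise<void> {
  if (!(await storage.snapshotDone())) {
    return;
  }
  // New behavior: require a consistent checkpoint before switching clients over.
  const checkpoint = await storage.getConsistentCheckpoint();
  if (checkpoint == null) {
    // Still catching up on streaming replication; keep the old version active.
    return;
  }
  await storage.activate();
}
```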
Additionally, for MongoDB replication, we update streaming progress during the initial catch-up phase, so that we can resume replication at the same point in the case of restart.
This is not a complete fix yet: at that point, replication of the new sync rules could still be behind and take a while to fully catch up, but it is already a significant improvement. For this, we're repurposing "snapshot_lsn" as a more general "resume_from_lsn".
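As a hedged sketch of how progress could be persisted during the catch-up phase, with placeholder names apart from "resume_from_lsn":

```ts
// Hypothetical sketch: persist replication progress while catching up, so a
// process restart resumes from resume_from_lsn rather than re-replicating
// from the original snapshot position.
interface SyncRulesProgressStore {
  // Stores the latest confirmed position: a Postgres LSN, or an encoded
  // MongoDB resume token.
  setResumeFromLsn(syncRulesId: number, position: string): Promise<void>;
}

async function streamUntilCaughtUp(
  store: SyncRulesProgressStore,
  syncRulesId: number,
  batches: AsyncIterable<{ lastPosition: string }>
): Promise<void> {
  for await (const batch of batches) {
    // ...apply the batch to storage here...
    // Record progress during the initial catch-up phase as well,
    // not only once the version is active.
    await store.setResumeFromLsn(syncRulesId, batch.lastPosition);
  }
}
```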
Additional smaller fixes