Fix replication switchover #308
I went through the logic flow and could not spot any issues. This LGTM.
Background
When deploying new sync rules, we create a new stream (using, for example, a new Postgres logical replication slot) for the new version, and process it while the current version stays active. When initial replication is complete, clients switch over to sync from the new copy.
For the new sync rules themselves, replication works roughly as follows:
1. An initial snapshot copies the existing data for the new sync rules version.
2. Streaming replication then processes changes made since the snapshot, until the new version has caught up.
3. Clients switch over to sync from the new version.
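As a rough illustration of this lifecycle only, here is a minimal TypeScript sketch; the type and function names below are hypothetical and not taken from the actual codebase:

```ts
// Hypothetical sketch of the sync rules version lifecycle described above.
type SyncRulesState =
  | 'initial_snapshot' // copying existing data for the new version
  | 'catching_up'      // streaming replication processing changes since the snapshot
  | 'active'           // clients sync from this version
  | 'retired';         // replaced by a newer version

interface SyncRulesVersion {
  id: number;
  state: SyncRulesState;
}

// The previous version stays 'active' until the new one is ready;
// only then do clients switch over.
function switchOver(current: SyncRulesVersion, next: SyncRulesVersion): void {
  next.state = 'active';
  current.state = 'retired';
}
```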
The issue
The main issue is that when the initial snapshot is complete, there could still be a long period before streaming replication has caught up. This is typically not a problem for instances with small data volumes, but it can be significant when replication takes a couple of hours and a lot of new data has come in during that time.
A secondary issue is specific to replicating MongoDB data: until replication has caught up, inconsistent data could be synced to clients.
The fix
This refactors the "autoActivate" behavior: we now switch over to the new sync rules version only when it has a consistent checkpoint.
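A minimal sketch of the changed activation logic, assuming hypothetical storage method names (only "autoActivate" and the consistent-checkpoint requirement come from this PR):

```ts
// Hypothetical sketch: previously the new version could be activated as soon as
// the initial snapshot finished; now activation waits for a consistent checkpoint.
interface SyncRulesStorage {
  snapshotDone(): Promise<boolean>;                    // placeholder
  getConsistentCheckpoint(): Promise<string | null>;  // placeholder
  activate(): Promise<void>;                           // placeholder
}

async function maybeAutoActivate(storage: SyncRulesStorage): Promise<void> {
  if (!(await storage.snapshotDone())) {
    return;
  }
  // New behavior: require a consistent checkpoint before switching clients over.
  const checkpoint = await storage.getConsistentCheckpoint();
  if (checkpoint == null) {
    // Still catching up on streaming replication; keep the old version active.
    return;
  }
  await storage.activate();
}
```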
Additionally, for MongoDB replication, we update streaming progress during the initial catch-up phase, so that we can resume replication at the same point in the case of restart.
This is not a complete fix yet: at that point, replication of the new sync rules could still be behind and take a while to fully catch up, but it is already a significant improvement. For this, we're repurposing "snapshot_lsn" as a more general "resume_from_lsn".
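As a hedged sketch of how progress could be persisted during the catch-up phase, with placeholder names apart from "resume_from_lsn":

```ts
// Hypothetical sketch: persist replication progress while catching up, so a
// process restart resumes from resume_from_lsn rather than re-replicating
// from the original snapshot position.
interface SyncRulesProgressStore {
  // Stores the latest confirmed position: a Postgres LSN, or an encoded
  // MongoDB resume token.
  setResumeFromLsn(syncRulesId: number, position: string): Promise<void>;
}

async function streamUntilCaughtUp(
  store: SyncRulesProgressStore,
  syncRulesId: number,
  batches: AsyncIterable<{ lastPosition: string }>
): Promise<void> {
  for await (const batch of batches) {
    // ...apply the batch to storage here...
    // Record progress during the initial catch-up phase as well,
    // not only once the version is active.
    await store.setResumeFromLsn(syncRulesId, batch.lastPosition);
  }
}
```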
Additional smaller fixes