
[MongoDB] Fix checkpoint stalling#659

Merged
rkistner merged 7 commits into main from fix-checkpoint-stalling
Jun 4, 2026

@rkistner rkistner commented Jun 3, 2026

This fixes an issue surfaced by #573 (v1.20.3). That PR did not directly cause the issue, but it removed a fallback path that would have bypassed it.

What happens:

  1. In the replication process, createCheckpoint() is called.
  2. A connection error causes the write to fail after the server received it.
  3. The driver retries the command automatically.
  4. The server sees the retry as a no-op since it already applied the original one, and does not apply it again.
  5. The client session's operationTime still increases after the retry.
  6. Now, the operationTime reports the time of the second operation, while the change stream event contains the clusterTime of the first operation.
  7. This causes the lsn >= waitForCheckpointLsn condition to not be hit when the change stream event is received.
  8. Since [MongoDB] Fix write checkpoint throughput #573, nothing else creates an update that passes the filter. The replication process indefinitely waits for a change stream event that it never receives.
  9. This causes the replication process to never create a new sync checkpoint.
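
The sequence above can be modelled with a small sketch. This is an illustration only, not the actual PowerSync or driver code: `Timestamp`, `simulateCheckpointWrite` and `checkpointReached` are hypothetical names, the timestamp values are made up, and it assumes that the awaited LSN is derived from the session's operationTime while the event's LSN is derived from its clusterTime.

```typescript
// Hypothetical, simplified model of the stall. None of these names come from
// the PowerSync codebase or the MongoDB driver.
interface Timestamp {
  t: number; // seconds
  i: number; // increment within the second
}

function compareTimestamps(a: Timestamp, b: Timestamp): number {
  return a.t !== b.t ? a.t - b.t : a.i - b.i;
}

interface CheckpointAttempt {
  // clusterTime carried by the change stream event (time of the applied write).
  eventClusterTime: Timestamp;
  // operationTime seen on the client session after the command returns.
  sessionOperationTime: Timestamp;
}

function simulateCheckpointWrite(driverRetried: boolean): CheckpointAttempt {
  const firstApply: Timestamp = { t: 100, i: 1 };
  // Assumption: the retried command is a no-op on the server, but the session
  // still observes a later operationTime.
  const retryAck: Timestamp = { t: 100, i: 2 };
  return {
    eventClusterTime: firstApply,
    sessionOperationTime: driverRetried ? retryAck : firstApply
  };
}

// Models the `lsn >= waitForCheckpointLsn` condition.
function checkpointReached(attempt: CheckpointAttempt): boolean {
  return compareTimestamps(attempt.eventClusterTime, attempt.sessionOperationTime) >= 0;
}
```

In this model, `checkpointReached(simulateCheckpointWrite(false))` is true, while the retried case is false: the only event for that write is older than the time being waited for.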

Issue symptoms

The issue shows up as replication lag building up, without an actual backlog of data to process. Specifically:

  1. An initial slow batch with long processing time.
  2. After that, replication lag builds up indefinitely.
  3. No unusually large data volumes that would explain the replication lag. For example, there may be many small batches, where the majority of the time is spent waiting for new changes.
  4. A restart resolves the issue, without processing a backlog.

The fix

A good long-term fix would be to use a sentinel-based approach, rather than relying on MongoDB clusterTime behavior and potentially making bad assumptions about it (see #605). However, that's a big and risky change.

The fix here is to bypass the driver's internal retry mechanism for writes. By adding our own retry logic, each retry actually produces a new write with a new change stream event, which avoids this issue.
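
As a rough sketch of what application-level retry logic can look like (not the implementation in this PR): `retryWithNewWrite` and its parameters are hypothetical, and it assumes the driver's own write retries have already been disabled for the wrapped operation. The key property is that every attempt calls `write` again, so each attempt issues a fresh command instead of replaying the original one.

```typescript
// Hypothetical helper, not taken from the PowerSync codebase.
// Each attempt invokes `write` again, so a retry is a brand-new write.
async function retryWithNewWrite<T>(
  write: (attempt: number) => Promise<T>,
  maxAttempts = 3,
  delayMs = 10,
  isRetryable: (error: unknown) => boolean = () => true
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await write(attempt);
    } catch (error) {
      lastError = error;
      // Stop on the last attempt, or when the error should not be retried.
      if (attempt === maxAttempts || !isRetryable(error)) {
        break;
      }
      // Brief pause before issuing the next, independent write.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

With this shape, the time recorded for the checkpoint comes from the attempt that succeeded, and that attempt also produced its own change stream event.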

AI usage

Used Codex gpt-5.5 to debug the issue and assist in writing a test for reproducing it.

The fix here was implemented manually.

changeset-bot Bot commented Jun 3, 2026

🦋 Changeset detected

Latest commit: 3b3d65c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 12 packages

| Name | Type |
| --- | --- |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-schema | Patch |
| @powersync/service-image | Patch |
| @powersync/service-core | Patch |
| @powersync/service-module-convex | Patch |
| @powersync/service-module-core | Patch |
| @powersync/service-module-mongodb-storage | Patch |
| @powersync/service-module-mssql | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres-storage | Patch |
| @powersync/service-module-postgres | Patch |
| test-client | Patch |

Comment thread modules/module-mongodb/src/replication/ChangeStream.ts
@rkistner rkistner marked this pull request as ready for review June 3, 2026 16:38
@rkistner rkistner requested a review from stevensJourney June 3, 2026 16:38

@stevensJourney stevensJourney left a comment

The changes here look good to me

@rkistner rkistner merged commit adca892 into main Jun 4, 2026
46 checks passed
@rkistner rkistner deleted the fix-checkpoint-stalling branch June 4, 2026 08:11

2 participants