
MongoDB replication tracing #616

Merged
rkistner merged 18 commits into main from change-stream-tracing on May 4, 2026

Conversation

@rkistner (Contributor)

This is a re-implementation of #615, using a new lightweight tracing API, adding the timings to the "Processed batch" logs:

info: [powersync_4] Flushed 1 + 0 + 1 updates, 1kb in 1ms. Last op_id: 5515892. Replication lag: 7s {"flushed":{"bucket_data_count":1,"current_data_count":1,"duration":1,"parameter_data_count":0,"replication_lag_seconds":7,"size":609}}
info: [powersync_4] Processed batch of 1 changes / 992 bytes in 37ms {"bytes":992,"count":1,"duration":37,"t":{"changestream":2,"evaluate":1,"processing":4,"storage":31}}
info: [powersync_4] Flushed 2000 + 0 + 2000 updates, 1190kb in 51ms. Last op_id: 5517892 {"flushed":{"bucket_data_count":2000,"current_data_count":2000,"duration":51,"parameter_data_count":0,"size":1218433}}
info: [powersync_4] Flushed 2000 + 0 + 2000 updates, 1190kb in 43ms. Last op_id: 5519892 {"flushed":{"bucket_data_count":2000,"current_data_count":2000,"duration":43,"parameter_data_count":0,"size":1218546}}
info: [powersync_4] Flushed 1000 + 0 + 1000 updates, 595kb in 26ms. Last op_id: 5520892. Replication lag: 7s {"flushed":{"bucket_data_count":1000,"current_data_count":1000,"duration":26,"parameter_data_count":0,"replication_lag_seconds":7,"size":609231}}
info: [powersync_4] Processed batch of 6000 changes / 4055118 bytes in 327ms {"bytes":4055118,"count":6000,"duration":327,"t":{"changestream":70,"evaluate":85,"processing":91,"storage":150}}

The goal here is to help diagnose slow replication issues at a high level. When replication is slower than expected, we want to know whether the bottleneck is (1) the source db or network, (2) the storage db, or (3) replication process CPU. The stats here help identify that.

Timings:

  1. duration: Duration for processing the batch, excluding changestream.
  2. changestream: Time spent waiting for the next batch on the change stream, including the source db waiting for more changes, scanning the oplog, processing the change stream pipeline, and network transfer time.
  3. processing: Time spent converting from raw change stream buffers to input for sync config, as well as other batch processing overhead.
  4. evaluate: Time spent evaluating sync queries.
  5. storage: Time spent writing changes to the storage database.
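
As a quick sanity check against the second "Processed batch" log above: processing + evaluate + storage = 91 + 85 + 150 = 326ms, which accounts for essentially all of the reported 327ms duration, while the 70ms of changestream time is excluded as described.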

This also removes the previous behavior of updating the resume LSN (`Updating resume LSN to ${lsn}`) after every 20000 changes. We don't need that anymore - the flush after each change stream batch takes care of it now.

Implementation

#615 used manual tracking of start and end timestamps for various replication sections, and reported those in the APIs.
This instead keeps the APIs clean, only passing a PerformanceTracer instance through where needed. By using the Disposable interface, we can keep the code to generate the spans fairly clean.

While the code keeps track of the entire trace structure with nested spans, the main output currently is just the time we spend in each span itself, excluding nested spans - this is what we use for the logs (see the sketch below).
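
To illustrate the pattern, here is a minimal sketch of Disposable-based spans that accumulate per-span "self time" (excluding nested child spans). All names below are illustrative assumptions, not the actual PerformanceTracer API:

```ts
// Requires TypeScript 5.2+ explicit resource management (`using` / Symbol.dispose).
class PerformanceTracer {
  readonly selfTimes = new Map<string, number>();
  private current: Span | undefined;

  startSpan(name: string): Span {
    const span = new Span(this, name, this.current);
    this.current = span;
    return span;
  }

  endSpan(span: Span, elapsed: number, childTime: number): void {
    // "Self time" is the span's elapsed time minus time spent in child spans.
    const self = elapsed - childTime;
    this.selfTimes.set(span.name, (this.selfTimes.get(span.name) ?? 0) + self);
    this.current = span.parent;
    span.parent?.addChildTime(elapsed);
  }
}

class Span implements Disposable {
  private readonly start = performance.now();
  private childTime = 0;

  constructor(
    private readonly tracer: PerformanceTracer,
    readonly name: string,
    readonly parent: Span | undefined
  ) {}

  addChildTime(ms: number): void {
    this.childTime += ms;
  }

  [Symbol.dispose](): void {
    this.tracer.endSpan(this, performance.now() - this.start, this.childTime);
  }
}

// Usage: a span ends automatically when its block exits, even on errors.
async function processBatch(tracer: PerformanceTracer) {
  using _processing = tracer.startSpan('processing');
  // ... convert raw change stream buffers ...
  {
    using _evaluate = tracer.startSpan('evaluate');
    // ... evaluate sync queries; this time is excluded from 'processing' ...
  }
}
```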

Perfetto Traces

This does have the ability to produce a trace file during development:

```sh
export POWERSYNC_TRACE_FILE=trace.json
node lib/entry.js start -r sync -c powersync.yaml
```

This produces a trace file in the Chromium Trace Event format, which can be viewed at https://ui.perfetto.dev/.
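
For reference, a trace file in this format is a JSON array of events; a "complete" event (`"ph": "X"`) looks roughly like this, with `ts` and `dur` in microseconds (the values below are illustrative):

```json
[
  { "name": "storage", "cat": "replication", "ph": "X",
    "ts": 1000, "dur": 31000, "pid": 1, "tid": 1 }
]
```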

[image: Perfetto trace view]

Right now, this is mostly useful to visualize where the aggregated timings come from. In the future we can expose traces in other formats in addition to / instead of Perfetto.

Alternatives considered

  1. Manual timings like in MongoDB replication instrumentation #615: this takes effort to maintain, and it's difficult to ensure the timings stay accurate.
  2. performance.measure:
    1. Unlike in browsers, this is not actually reported anywhere by default in NodeJS - it's only available via an API.
    2. More effort to use directly.
    3. Entries stay in memory until manually cleared, so it's not suitable for a long-running production process (see the sketch after this list).
    4. In the future we can "export" our own traces to this.
  3. OpenTelemetry tracing:
    1. This has the advantage of providing off-the-shelf instrumentation for many built-in libraries such as MongoDB.
    2. However, it feels too "heavy" for this use case - we're not using distributed traces anywhere, and it seems trickier to get the aggregated timings out for our logs.
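
For context on alternative 2, here's a minimal sketch of what using performance.measure directly in Node.js would look like. The span name and marks are illustrative, not from the actual codebase:

```ts
import { performance, PerformanceObserver } from 'node:perf_hooks';

// Unlike in browsers, Node.js reports nothing by default:
// an explicit PerformanceObserver is required to see measurements.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(`${entry.name}: ${entry.duration}ms`);
  }
});
observer.observe({ entryTypes: ['measure'] });

performance.mark('evaluate-start');
// ... evaluate sync queries ...
performance.measure('evaluate', 'evaluate-start');

// Marks and measures accumulate in memory until explicitly cleared,
// which is why this is awkward for a long-running replication process.
performance.clearMarks();
performance.clearMeasures();
```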

@changeset-bot (bot) commented Apr 28, 2026

🦋 Changeset detected

Latest commit: e44ba9f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 12 packages
| Name | Type |
| ---- | ---- |
| @powersync/service-module-postgres-storage | Patch |
| @powersync/service-module-mongodb-storage | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-core | Patch |
| @powersync/service-schema | Patch |
| @powersync/service-module-mssql | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres | Patch |
| @powersync/service-image | Patch |
| @powersync/service-core-tests | Patch |
| @powersync/service-module-core | Patch |
| test-client | Patch |

@rkistner rkistner marked this pull request as ready for review April 28, 2026 10:33
@rkistner rkistner requested a review from stevensJourney April 28, 2026 10:33
@stevensJourney (Collaborator) left a comment:

This looks good to me. Using Disposables for spans looks really neat.

@rkistner rkistner merged commit 7c7b525 into main May 4, 2026
74 of 75 checks passed
@rkistner rkistner deleted the change-stream-tracing branch May 4, 2026 11:30