[MongoDB] Direct BSON Buffer -> JSON conversion#599
Conversation
simolus3 left a comment:
This also looks good to me, the only potential issue I see are UTF-16 surrogate pairs. It might make sense to add tests for those.
I didn't check the test cases and the benchmark in detail, but the implementation makes sense to me.
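A test along the lines suggested would check that astral-plane characters (surrogate pairs in JS's UTF-16 strings, 4-byte sequences in BSON's UTF-8) survive the conversion intact. A hypothetical sketch, not from this PR:

```typescript
// Hypothetical test sketch for the surrogate-pair concern (not the PR's tests):
// BSON stores strings as UTF-8, while JS strings are UTF-16, so characters
// above U+FFFF become surrogate pairs that must not be split or mangled.
const samples = ['\u{1F600}', 'a\u{1F600}b', '\u{10FFFF}'];
for (const s of samples) {
  // UTF-8 bytes as they would appear inside a BSON string element.
  const utf8 = Buffer.from(s, 'utf8');
  const decoded = utf8.toString('utf8');
  if (decoded !== s) throw new Error(`surrogate pair mangled: ${s}`);
  // The JSON output must also round-trip the pair intact.
  if (JSON.parse(JSON.stringify(decoded)) !== s) throw new Error(`JSON round-trip failed: ${s}`);
}
console.log('surrogate pairs ok');
```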
Force-pushed 6d7e475 to b68bea1
Force-pushed d02a866 to 57e73c3
Merges upstream main which includes PR #591 (raw change streams) and PR #599 (direct BSON Buffer -> JSON conversion). Auth fix conflicts (types.ts, config.test.ts) resolved — both sides had the same fix, upstream also added database name decoding. ChangeStream.ts has 11 unresolved conflicts — PR #591 replaced the MongoDB driver ChangeStream with a custom RawChangeStream using raw aggregate + getMore. Our Cosmos DB changes need to be re-applied to the new code structure. Resolved in the next commit.
resolve: ChangeStream.ts merge conflicts for raw change streams

Re-applied all Cosmos DB changes to the new raw change stream code structure from PR #591. The raw aggregate approach is better for Cosmos DB: no lazy ChangeStream init, explicit cursor management, and the $changeStream stage is built directly in the pipeline.

Changes applied to the new structure:
- detectCosmosDb() calls in getSnapshotLsn, initReplication, streamChangesInternal
- getEventTimestamp() adapted to the ProjectedChangeStreamDocument type
- Sentinel checkpoint with BSON.deserialize for fullDocument (raw Buffer)
- Pipeline guards: skip $changeStreamSplitLargeEvent and showExpandedEvents
- Cluster-level aggregate (admin db + allChangesForCluster) when isCosmosDb
- startAtOperationTime fix (startAfter != null)
- Keepalive guard for Cosmos DB resume tokens
- .lte() dedup guard skipped on Cosmos DB
- wallTime tracking for replication lag
- Added a changeset for @powersync/service-module-mongodb (minor)

Verified: 59/59 standard MongoDB tests pass. The Cosmos DB cluster is currently down, so those tests are blocked by a TLS timeout. A code audit of RawChangeStream.ts found no compatibility issues: the cursor ID type is handled by BigInt; postBatchResumeToken needs empirical verification when the cluster is back.
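For illustration, the pipeline guards and start options described in this commit message could be assembled roughly as in the sketch below. The names (`RawStreamOptions`, `buildChangeStreamPipeline`) are mine; this is not the actual RawChangeStream.ts code.

```typescript
// Illustrative sketch only, not the actual RawChangeStream.ts implementation.
// Shows how an explicit $changeStream stage might be built for the raw
// aggregate approach, with the Cosmos DB guards mentioned above.
type Document = Record<string, unknown>;

interface RawStreamOptions {
  isCosmosDb: boolean;
  startAfter?: Document; // resume token when resuming an existing stream
  startAtOperationTime?: unknown; // BSON timestamp
}

function buildChangeStreamPipeline(opts: RawStreamOptions): Document[] {
  const stage: Document = { fullDocument: 'updateLookup' };
  if (opts.startAfter != null) {
    // startAtOperationTime must not be combined with startAfter.
    stage.startAfter = opts.startAfter;
  } else if (opts.startAtOperationTime != null) {
    stage.startAtOperationTime = opts.startAtOperationTime;
  }
  if (opts.isCosmosDb) {
    // Cluster-level stream: aggregate on the admin database with
    // allChangesForCluster; Cosmos DB rejects showExpandedEvents.
    stage.allChangesForCluster = true;
  } else {
    stage.showExpandedEvents = true;
  }
  const pipeline: Document[] = [{ $changeStream: stage }];
  if (!opts.isCosmosDb) {
    // Cosmos DB does not support splitting large events.
    pipeline.push({ $changeStreamSplitLargeEvent: {} });
  }
  return pipeline;
}
```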
Builds on #598.
This provides an alternative implementation for `rawToSqliteRow`. For nested documents and arrays, this converts from BSON Buffer -> JSON Buffer -> string, without an intermediate `bson.deserialize` or `JSON(Big).stringify`. This can significantly reduce allocations and improve throughput for large nested documents or arrays.

Initial micro-benchmarks comparing the new approach to the old one:
End-to-end benchmarks compared to main (excluding #598 and #591). This was tested using documents of 100KB+ for the initial snapshot, and small updates to documents of 2MB+ for the change stream test. It uses a local bucket storage database on an NVMe disk, which largely removes the typical bucket storage overhead, focusing instead on CPU and memory overhead.
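As an illustration of the kind of timing comparison involved, a minimal harness might look like the following. This is a sketch of mine, not the benchmark code from this PR.

```typescript
// Hypothetical micro-benchmark harness sketch; not the PR's benchmark.
// Times a function over repeated iterations and reports per-op cost.
function bench(name: string, fn: () => void, iterations = 10_000): number {
  // Warm up so the JIT has optimized the hot path before measuring.
  for (let i = 0; i < 100; i++) fn();
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${name}: ${((elapsedMs / iterations) * 1000).toFixed(3)}us/op`);
  return elapsedMs;
}

// Example: time JSON.stringify of a nested object, a stand-in for the old
// deserialize-then-stringify path.
const doc = { nested: { values: Array.from({ length: 1000 }, (_, i) => i) } };
bench('JSON.stringify', () => { JSON.stringify(doc); });
```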
Implementation
The implementation uses a custom BSON parser. For each top-level value converted to JSON, we write the results into a Buffer, then convert that buffer to a string. In my early benchmarks, this was faster than using direct string concatenation.
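To make the approach concrete, here is a heavily simplified sketch of mine, not the PR's parser: it handles only int32, double, and string elements, uses `JSON.stringify` for string escaping, and collects string fragments instead of writing into the output Buffer described above.

```typescript
// Heavily simplified direct BSON -> JSON sketch (illustration only; the real
// parser covers all BSON types and writes bytes into a Buffer).
function bsonToJson(buf: Buffer): string {
  const out: string[] = [];
  out.push('{');
  let pos = 4; // skip the leading int32 document length
  let first = true;
  while (buf[pos] !== 0x00) { // 0x00 terminates the document
    const type = buf[pos++];
    // The element name is a NUL-terminated cstring.
    const nameEnd = buf.indexOf(0x00, pos);
    const name = buf.toString('utf8', pos, nameEnd);
    pos = nameEnd + 1;
    if (!first) out.push(',');
    first = false;
    out.push(JSON.stringify(name), ':');
    switch (type) {
      case 0x01: // double: 8 bytes little-endian
        out.push(String(buf.readDoubleLE(pos)));
        pos += 8;
        break;
      case 0x02: { // string: int32 byte length (incl. trailing NUL) + bytes
        const len = buf.readInt32LE(pos);
        pos += 4;
        out.push(JSON.stringify(buf.toString('utf8', pos, pos + len - 1)));
        pos += len;
        break;
      }
      case 0x10: // int32
        out.push(String(buf.readInt32LE(pos)));
        pos += 4;
        break;
      default:
        throw new Error(`type 0x${type.toString(16)} not handled in this sketch`);
    }
  }
  out.push('}');
  return out.join('');
}

// Demo: hand-serialize the BSON document {"a": 1, "b": "hi"}.
const i32 = (n: number) => { const b = Buffer.alloc(4); b.writeInt32LE(n); return b; };
const elements = Buffer.concat([
  Buffer.from([0x10]), Buffer.from('a\0'), i32(1),
  Buffer.from([0x02]), Buffer.from('b\0'), i32(3), Buffer.from('hi\0')
]);
const doc = Buffer.concat([i32(elements.length + 5), elements, Buffer.from([0x00])]);
console.log(bsonToJson(doc)); // prints {"a":1,"b":"hi"}
```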
I attempted to optimize the common cases as much as possible. The more esoteric types, like regular expressions, DBPointer, etc., are supported, but not specifically optimized for performance.
The implementation was largely done using AI-assisted development (Codex), but with lots of manual effort to direct, review, and test it.
Since there are many edge cases, this relies on an extensive test suite to check for correctness, including matching the old implementation's output for the most part.
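One class of edge case such a suite needs to cover is 64-bit integer precision, which is presumably why `JSON(Big).stringify` was used on the old path. This is my example, not necessarily one of the PR's tests:

```typescript
// Hypothetical edge-case check (not from the PR's suite): BSON int64 values
// beyond Number.MAX_SAFE_INTEGER must not be rounded when written to JSON.
const raw = Buffer.alloc(8);
raw.writeBigInt64LE(9007199254740993n); // 2^53 + 1: not representable as a double
const viaNumber = String(Number(raw.readBigInt64LE(0))); // lossy float path
const viaBigInt = raw.readBigInt64LE(0).toString();      // exact path
console.log(viaNumber); // '9007199254740992' (rounded!)
console.log(viaBigInt); // '9007199254740993'
```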
The old implementation is still kept around for:
Copying from the jsdocs:
This attempts to match the behavior of `bson.deserialize -> constructAfterRecord -> applyRowContext` for the most part, with some intentional differences:

General principles followed:
Future optimizations
With these changes, CPU should be much less of a bottleneck for replicating from MongoDB. If we do need to optimize it further, there are some options: