Add raw bson types for zero-copy handling of bson wire data. #136

jcdyer · 2019-12-20T19:54:49Z

Fixes #133

Implements a RawBsonDocBuf to hold a Vec<u8> of bson data, and a RawBson<'a> for a borrowed version of the same, along with various other types for supporting individual values and specific types within a bson type.
Implements lossless, but fallible conversion to the existing structured Bson type (fallible because well-formedness checking is not exhaustive at object creation time).
Implements serde deserialization to Bson and custom structs, including allowing custom structs to handle particular bson binary subtypes, and special handling for object ids, UTC datetimes, and other types that don't line up with serde's data model.
Provides impressive speed gains over deserializing directly to the structured Bson type, and for deserializing to custom structs.
Where RawBson is slower (repeated random access to document members), speed of converting RawBson to Bson (without serde) is comparable to the existing direct decode from bytes to Bson (sometimes slightly faster, sometimes slightly slower).
Structured Bson is still needed for mutable access and constructing new bson documents.
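
To illustrate what "zero-copy" means here, the following is a minimal sketch of a borrowed view over BSON bytes. It is not the code from this PR: `RawDocView`, `RawValue`, `ViewError`, and `le_bytes` are hypothetical names, and only four element types are handled. Construction checks just the length header and the trailing NUL; everything else is validated when a field is looked up.

```rust
/// Hypothetical borrowed view over one BSON document; not the PR's actual type.
#[derive(Debug, Clone, Copy)]
pub struct RawDocView<'a> {
    data: &'a [u8],
}

/// Values borrow from the original buffer, so strings are never copied.
#[derive(Debug, PartialEq)]
pub enum RawValue<'a> {
    Double(f64),
    Str(&'a str),
    Int32(i32),
    Int64(i64),
}

#[derive(Debug, PartialEq)]
pub enum ViewError {
    Malformed,
    UnsupportedType(u8),
}

/// Copies the first N bytes into a fixed-size array, or reports truncation.
fn le_bytes<const N: usize>(bytes: &[u8]) -> Result<[u8; N], ViewError> {
    let mut out = [0u8; N];
    out.copy_from_slice(bytes.get(..N).ok_or(ViewError::Malformed)?);
    Ok(out)
}

/// Decodes one value and reports how many bytes it occupied.
fn read_value<'a>(tag: u8, body: &'a [u8]) -> Result<(RawValue<'a>, usize), ViewError> {
    match tag {
        0x01 => Ok((RawValue::Double(f64::from_le_bytes(le_bytes(body)?)), 8)),
        0x02 => {
            // int32 length (including the trailing NUL), then the bytes.
            let len = i32::from_le_bytes(le_bytes(body)?);
            if len < 1 {
                return Err(ViewError::Malformed);
            }
            let end = 4 + len as usize;
            let bytes = body.get(4..end).ok_or(ViewError::Malformed)?;
            let (&last, text) = bytes.split_last().ok_or(ViewError::Malformed)?;
            if last != 0 {
                return Err(ViewError::Malformed);
            }
            let s = std::str::from_utf8(text).map_err(|_| ViewError::Malformed)?;
            Ok((RawValue::Str(s), end))
        }
        0x10 => Ok((RawValue::Int32(i32::from_le_bytes(le_bytes(body)?)), 4)),
        0x12 => Ok((RawValue::Int64(i64::from_le_bytes(le_bytes(body)?)), 8)),
        other => Err(ViewError::UnsupportedType(other)),
    }
}

impl<'a> RawDocView<'a> {
    /// Checks only the length header and the trailing NUL byte.
    pub fn new(data: &'a [u8]) -> Result<Self, ViewError> {
        let declared = i32::from_le_bytes(le_bytes(data)?);
        if declared < 5 || declared as usize != data.len() || data[data.len() - 1] != 0 {
            return Err(ViewError::Malformed);
        }
        Ok(RawDocView { data })
    }

    /// Linear scan over the elements; malformed data surfaces only here.
    pub fn get(&self, key: &str) -> Result<Option<RawValue<'a>>, ViewError> {
        let data: &'a [u8] = self.data;
        // Elements live between the 4-byte header and the trailing NUL.
        let mut rest = &data[4..data.len() - 1];
        while !rest.is_empty() {
            let tag = rest[0];
            let name_end = rest[1..]
                .iter()
                .position(|&b| b == 0)
                .ok_or(ViewError::Malformed)?;
            let name = &rest[1..1 + name_end];
            let body = &rest[name_end + 2..];
            let (value, used) = read_value(tag, body)?;
            if name == key.as_bytes() {
                return Ok(Some(value));
            }
            rest = &body[used..];
        }
        Ok(None)
    }
}
```

Because `get` walks the elements from the start on every call, a single lookup or an in-order pass is cheap, while repeated random access pays the scan each time.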

Caveats:

There may be some rough edges around handling serde deserialization of obscure types.
I haven't tried to handle Uuid and OldUuid types properly for different input formats. It just returns the raw bytes it got. This is apparently an unexpectedly tricky piece of Bson history, having to do with endianness, and which parts of a UUID count as a discrete value.
RawBson handles BinarySubtype::OldBinary differently than the existing Bson type, as the spec seems to indicate that the first four bytes should be treated as a second length specifier, but the Bson type doesn't do this.
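
For reference, here is a rough sketch of the OldBinary behaviour described above, under the reading of the spec that subtype 0x02 payloads begin with a second int32 length. The function names `binary_payload` and `read_len` are made up for this example and are not part of the PR.

```rust
/// Reads a little-endian int32 length prefix; rejects negative values.
fn read_len(bytes: &[u8]) -> Option<usize> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(bytes.get(..4)?);
    let n = i32::from_le_bytes(raw);
    if n < 0 {
        None
    } else {
        Some(n as usize)
    }
}

/// Hypothetical helper: takes the value bytes of a binary element
/// (int32 length, subtype byte, payload) and returns the subtype and data.
/// For subtype 0x02 the inner length prefix is checked and stripped.
pub fn binary_payload(value: &[u8]) -> Option<(u8, &[u8])> {
    let outer_len = read_len(value)?;
    let subtype = *value.get(4)?;
    let payload = value.get(5..5usize.checked_add(outer_len)?)?;
    if subtype != 0x02 {
        return Some((subtype, payload));
    }
    // OldBinary: the payload is itself an int32 length followed by the data.
    let inner_len = read_len(payload)?;
    let inner = payload.get(4..)?;
    if inner.len() != inner_len {
        return None;
    }
    Some((subtype, inner))
}
```

A reader that skips the inner length would instead hand back the four prefix bytes as part of the data, which is the difference from the existing Bson type noted above.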

Looking forward to your feedback.

saghm · 2020-02-10T16:38:33Z

I've filed https://jira.mongodb.org/browse/RUST-284 to track this as well as the corresponding issue.


jcdyer · 2020-02-11T01:21:19Z

Thanks! As requested, I'll find some time this week to separate this PR into multiple parts, to ease the burden on reviewers.

I'm envisioning three parts:

The basic RawBson types, which will implement TryInto conversion into Bson, and method-based access to fields (get_f32(), as_object_id(), etc.)
Serde deserialization into strictly defined types.
Serde deserialization into Bson.

I think those also go in order roughly from least invasive to most invasive, so the first one should be relatively safe to merge and experiment with, before merging the others.

saghm · 2020-02-11T16:15:04Z

I had a similar but slightly different idea of how to split it up:

The RawBson and RawDocument types (same as your first part)
Serde serialization/deserialization of scalar (i.e. non-recursive) types
Deserialization for the non-scalar types (raw forms of Document, Array, and Bson itself)
Serialization for the non-scalar types

I agree that we can do most of these changes without needing to mess with the existing code too much. In particular, I think we still want to keep the existing Document and Bson APIs intact, but just offer raw versions as well for improved performance, as well as conversions between the raw and non-raw types. Assuming everything goes well and the raw types do prove to be faster (which I expect will be the case), we could change the driver to receive/return raw types so that the user gets the faster versions by default but can still opt-into the traditional Document type by just calling a method to convert what they send over and receive from the driver.

saghm · 2020-02-11T19:26:43Z

Apologies for the CI failures; we just updated the repository to check for clippy lints and merged a change to fix/whitelist all of them. I think if you rebase with master, everything should turn green.

jcdyer · 2020-02-13T19:06:27Z

I pulled back all the serde related changes.

One thing I'd like to clean up a little more is the error handling. It would be nice to unify the return types of the field accessors (.get_<type>() and .as_<type>()) with those of OrderedDocument and Bson, but the raw fields have a couple more error conditions. I'd prefer to add these to the same error type that the OrderedDocument::get_* and Bson::as_* methods return, but while the OrderedDocument methods return a Result, the Bson methods return an Option.

So we could:

1. Update Bson::as_* to return a Result (the OrderedDocument::get_* methods were changed from Option to Result not too long ago), and consolidate on the RawBsonError type (renaming it to something more appropriate).
2. Leave the Bson::as_* return type as Option<T>, but consolidate the error types of OrderedDocument::get_*, RawBson::as_* and RawBsonDoc::get_*.
3. Leave all the return types alone.

The reasons for having the extra error variants for raw documents are as follows:

First, the raw types can reveal errors with malformed bson at query time, whereas that cost is paid at instantiation time with the parsed types, so the RawError type has a MalformedDocument error.

Second, and to my mind more important, the raw type can return a Utf8 error if mongo returns a string that is not valid UTF-8. This is something I've seen in production mongo databases, and often there is no way to recover (that I've found) with other drivers. IIRC, the go driver crashes during fetching, and the python driver also crashes unless you pass a poorly documented unicode_decode_error_handler argument to the client. The RawBson types, by contrast, don't error unless you try to access the bad bytes (as long as the string has a valid length), and if you do access them, the RawBsonError type actually returns the bytes of the invalid string, so you can easily inspect them, write them out as binary, or do whatever you want with them.
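
A small sketch of that second point, assuming a string value laid out as an int32 length, the bytes, and a trailing NUL. `RawStrError` and `as_str` are illustrative names only, not the PR's API; the idea is simply that the UTF-8 error variant carries the offending bytes back to the caller.

```rust
/// Hypothetical error type: the Utf8 variant hands back the invalid bytes.
#[derive(Debug, PartialEq)]
pub enum RawStrError<'a> {
    Malformed,
    Utf8 { bytes: &'a [u8] },
}

/// Decodes a BSON string value lazily; UTF-8 is only checked on access.
pub fn as_str(value: &[u8]) -> Result<&str, RawStrError<'_>> {
    let mut raw_len = [0u8; 4];
    raw_len.copy_from_slice(value.get(..4).ok_or(RawStrError::Malformed)?);
    let len = i32::from_le_bytes(raw_len);
    if len < 1 {
        return Err(RawStrError::Malformed);
    }
    // The declared length includes the trailing NUL byte.
    let body = value
        .get(4..4 + len as usize)
        .ok_or(RawStrError::Malformed)?;
    let (&last, text) = body.split_last().ok_or(RawStrError::Malformed)?;
    if last != 0 {
        return Err(RawStrError::Malformed);
    }
    // On invalid UTF-8, return the raw bytes so the caller can inspect them.
    std::str::from_utf8(text).map_err(|_| RawStrError::Utf8 { bytes: text })
}
```

A caller that hits the `Utf8` variant can log the bytes, store them as binary, or attempt a lossy decode, instead of losing the whole document.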

Unfortunately, either Option 1 or 2 would be a semver incompatible change. It would be fairly trivial in the case of Option 2, but Option 1 would leave the library feeling overall more consistent in the end.

jcdyer · 2020-02-13T19:33:56Z

One benefit of unifying the return types is that the following expressions would be interchangeable, which seems desirable to me.

let oid1 = rawdoc.get("_id").and_then(RawBson::as_object_id);
let oid2 = rawdoc.get_object_id("_id");
let oid3 = OrderedDocument::from(rawdoc).get("_id").and_then(Bson::as_object_id);
let oid4 = OrderedDocument::from(rawdoc).get_object_id("_id");
let oid5 = rawdoc.get("_id").map(Bson::from).and_then(Bson::as_object_id);

jcdyer · 2020-02-13T19:48:31Z

> Assuming everything goes well and the raw types do prove to be faster

If you run the included benchmarks you should see that repeated random access on the raw type eventually gets slower than the parsed type, as you would expect (because access to a given element takes linear time), but in-order iteration and direct access to a single element are significantly faster.

> can still opt-into the traditional Document type by just calling a method to convert what they send over and receive from the driver.

rawdoc.into() will do this, and performs comparably well to the current way of constructing OrderedDocuments. :).

Note that one important use of the parsed type is constructing bson documents from scratch. The raw type isn't well suited to this, so methods like Collection::find should take either an OrderedDocument or, probably better, a T: Into<RawBsonDoc>.

jcdyer · 2020-02-13T19:56:37Z

Oops. I just saw that the From<RawBsonDoc> for OrderedDocument implementation panics on invalid input. I can change that to implement TryFrom<RawBsonDoc, Error=RawValueAccessError> instead.
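
As a sketch of the shape such a fallible conversion could take (with the error named through the `Error` associated type), here is a toy version. `RawDoc`, `ParsedDoc`, and `AccessError` are stand-ins invented for this example, and only int32 fields are supported, so it should not be read as the actual implementation.

```rust
use std::convert::TryFrom;

/// Stand-in for a borrowed raw document (hypothetical).
pub struct RawDoc<'a>(pub &'a [u8]);

/// Stand-in for a parsed, owned document holding only int32 fields.
#[derive(Debug, PartialEq)]
pub struct ParsedDoc(pub Vec<(String, i32)>);

#[derive(Debug, PartialEq)]
pub enum AccessError {
    Malformed,
    UnsupportedType(u8),
}

impl<'a> TryFrom<RawDoc<'a>> for ParsedDoc {
    type Error = AccessError;

    /// Returns an error on bad input instead of panicking.
    fn try_from(raw: RawDoc<'a>) -> Result<Self, Self::Error> {
        let data = raw.0;
        if data.len() < 5 || data[data.len() - 1] != 0 {
            return Err(AccessError::Malformed);
        }
        let mut header = [0u8; 4];
        header.copy_from_slice(&data[..4]);
        let declared = i32::from_le_bytes(header);
        if declared < 0 || declared as usize != data.len() {
            return Err(AccessError::Malformed);
        }
        let mut rest = &data[4..data.len() - 1];
        let mut fields = Vec::new();
        while let Some((&tag, after_tag)) = rest.split_first() {
            if tag != 0x10 {
                return Err(AccessError::UnsupportedType(tag));
            }
            // Key is a NUL-terminated string, followed by a 4-byte value.
            let nul = after_tag
                .iter()
                .position(|&b| b == 0)
                .ok_or(AccessError::Malformed)?;
            let name = std::str::from_utf8(&after_tag[..nul])
                .map_err(|_| AccessError::Malformed)?;
            let value = after_tag
                .get(nul + 1..nul + 5)
                .ok_or(AccessError::Malformed)?;
            let mut raw_i32 = [0u8; 4];
            raw_i32.copy_from_slice(value);
            fields.push((name.to_owned(), i32::from_le_bytes(raw_i32)));
            rest = &after_tag[nul + 5..];
        }
        Ok(ParsedDoc(fields))
    }
}
```

With this shape, callers write `ParsedDoc::try_from(raw)?` (or `raw.try_into()?`) and decide for themselves how to handle malformed input.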

jcdyer · 2021-01-10T21:45:26Z

I've published this work as a standalone crate (https://github.com/jcdyer/rawbson / https://crates.io/crates/rawbson). Whenever you get ready to incorporate this into bson-rust, let me know, and we can figure out the best way to integrate it. For now, I'll close this PR.

univerz referenced this pull request in mongodb/mongo-rust-driver Jan 10, 2020

RUST-202 Implement basic I/O functionality for connections

7399a60

jcdyer and others added 28 commits February 10, 2020 20:13

Add handling for raw bson types

7bd6f77

Align api with Bson api.

6e4e6ba

Flesh out tests for more types.

580cab4

Add serde deserialization of raw types.

4dd9f7f

Extract sub-modules from de.rs

f3e6c81

Accommodate intellij macro inference

906eed0

Clean up

d556885

use struct deserialization for object id and binary types instead of …

cce95e4

…map for better validation and flexibility.

fmt

3a44b40

fmt

2f6d31a

Add i128 handling to deserializer.

ad9164d

Clean up number handling

9db86e6

work on decimal128 implementation.

84450ff

Return Results from rawbson accessor methods.

2aacbd5

Fix tests and cleanup

40d6172

Get object ID compiling. Deserialization to bson still doesn't work.

6b1cba8

Add code to handle object IDs when deserializing to Bson.

711a596

Breaks the test to deserialize to a HashMap<String, String>, but why is that a valid bson representation? I think we need a stricter separation between bson and extended json format. Serde shouldn't try to paper over that.

Fix to_bson from $oid

4a1cb9d

cleanup

1227ba6

Add benchmarks.

a648fa3

clean up benches

cf2e4eb

Make criterion a dev-dependency

6cc7634

Handle errors more robustly.

4e369ea

Add pub use declarations for raw decoder methods.

24249bd

Add RawBsonDocBuf implementation for carrying around owned data.

4cbea43

Add from_rawdoc_buf decoder

714c9c3

Fix imports

c3c3a17

Allow (more) allocation in error path

52869b1

jcdyer added 13 commits February 10, 2020 20:13

Remove extra iteration.

ab4e183

Build benchmarks (but don't run them

7106aa4

Add javascript with scope deserialization.

21479fb

Got failing test.

e2c9080

Fix js_with_scope implementation

c38bfa6

Resolve compiler warnings

6d88fcb

Add regexp handling.

ae0bb6c

Preliminary datetime & timestamp handling

e5fcb00

Fix test of deserializing to chrono datetime in a struct

ea8ec11

Sort out binary subtype handling

36f6476

Fix struct handling of binary output.

d1ed9f8

Clean up lints and fmt

75d7586

fix lint

f19c29f

jcdyer force-pushed the rawbson branch from 612e219 to f19c29f Compare February 11, 2020 01:13

jcdyer added 4 commits February 13, 2020 13:31

Back out changes from decoder pieces

51985f5

Back out raw bson serde work.

8339c3c

Fix lint issue in src/ordered.rs

e2f7af4

Undo changes to unrelated test file.

97c7c36

jcdyer force-pushed the rawbson branch from 5b3df24 to 97c7c36 Compare March 13, 2020 15:11

saghm mentioned this pull request Jul 27, 2020

Improve performance by allow unordored HashMap #205

Closed

agolin95 added the tracked-in-jira Ticket filed in Mongo's Jira system label Nov 30, 2020

jcdyer closed this Jan 10, 2021
