Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add raw bson types for zero-copy handling of bson wire data. #136

Closed
wants to merge 46 commits into from

Conversation

jcdyer
Copy link
Contributor

@jcdyer jcdyer commented Dec 20, 2019

Fixes #133

  • Implements a RawBsonDocBuf to hold a Vec<u8> of bson data, and a RawBson<'a> for a borrowed version of the same, along with various other types for supporting individual values and specific types within a bson type.
  • Implements lossless, but fallible conversion to the existing structured Bson type (fallible because well-formedness checking is not exhaustive at object creation time).
  • Implements serde deserialization to Bson and custom structs, including allowing custom structs to handle particular bson binary subtypes, and special handling for object ids utc datetimes, and other types that don't line up with serde's data model.
  • Provides impressive speed gains over deserializing directly to to the structured Bson type, and for deserializing to custom structs.
  • Where RawBson is slower (repeated random access to document members), speed of converting RawBson to Bson (without serde) is comparable to the existing direct decode from bytes to Bson (sometimes slightly faster, sometimes slightly slower).
  • Structured Bson is still needed for mutable access and constructing new bson documents.

Caveats:

  • There may be some rough edges around handling serde deserialization of obscure types.
  • I haven't tried to handle Uuid and OldUuid types properly for different input formats. It just returns the raw bytes it got. This is apparently an unexpectedly tricky piece of Bson history, having to do with endianness, and which parts of a UUID count as a discrete value.
  • RawBson handles BinarySubtype::OldBinary differently than the existing Bson type, as the spec seems to indicate that the first four bytes should be treated as a second length specifier, but the Bson type doesn't do this.

Looking forward to your feedback.

univerz referenced this pull request in mongodb/mongo-rust-driver Jan 10, 2020
@saghm
Copy link
Contributor

saghm commented Feb 10, 2020

I've filed https://jira.mongodb.org/browse/RUST-284 to track this as well as the corresponding issue

@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 11, 2020

Thanks! As requested, I'll find some time this week to separate this PR into multiple parts, to ease the burden on reviewers.

I'm envisioning three parts:

  1. The basic RawBson types, which will implement TryInto conversion into Bson, and method-based access to fields (get_f32(), as_object_id(), etc.)
  2. Serde deserialization into strictly defined types.
  3. Serde deserialization into Bson.

I think those also go in order roughly from least invasive to most invasive, so the first one should be relatively safe to merge and experiment with, before merging the others.

@saghm
Copy link
Contributor

saghm commented Feb 11, 2020

I had a similar but slightly different idea of how to split it up:

  1. The RawBson and RawDocument types (same as your first part)
  2. Serde serialization/deserialization of scalar (i.e. non-recursive) types
  3. Deserialization for the non-scalar types (raw forms of Document, Array, and Bson itself)
  4. Serialization for the non-scalar types

I agree that we can do most of these changes without needing to mess with the existing code too much. In particular, I think we still want to keep the existing Document and Bson APIs intact, but just offer raw versions as well for improved performance, as well as conversions between the raw and non-raw types. Assuming everything goes well and the raw types do prove to be faster (which I expect will be the case), we could change the driver to receive/return raw types so that the user gets the faster versions by default but can still opt-into the traditional Document type by just calling a method to convert what they send over and receive from the driver.

@saghm
Copy link
Contributor

saghm commented Feb 11, 2020

Apologies for the CI failures; we just updated the repository to check for clippy lints and merged a change to fix/whitelist all of them. I think if you rebase with master, everything should turn green.

@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 13, 2020

I pulled back all the serde related changes.

One thing I'd like to clean up a little more is the error handling. It would be nice to unify the return types of the field accessors (.get_<type>() and .as_<type>())with those of OrderedDocument and Bson, but the raw fields have a couple more error conditions. I'd prefer to add these to the same error type that the OrderedDocument::get_* and Bson::as_* methods return, but while the OrderedDocument methods return a Result, the Bson methods return an Option.

So we could:

  1. Update Bson::as_* to return a Result (the OrderedDocument::get_* methods were changed from Option to Result not too long ago), and consolidate on the RawBsonError type (renaming it to something more appropriate).
  2. Leave the Bson::as_* return type as Option<T>, but consolidate the error types of OrderedDocument::get_*, RawBson::as_* and RawBsonDoc::get_*.
  3. Leave all the return types alone.

The reason for having the extra Error Variants for raw documents are as follows:

First, the raw types can reveal errors with malformed bson at query time, whereas that cost is paid at instantiation time with the parsed types, so the RawError type has a MalformedDocument error.

Second, and to my mind more importantly, is that the raw type can return a Utf8 error, if mongo returns a string that is not valid UTF-8. This is something I've seen in production mongo databases, and often there is no way to recover (that I've found) with other drivers. IIRC, the go driver crashes during fetching, and the python driver also crashes unless you pass a poorly documented unicode_decode_error_handler argument to the client. The RawBson types, by contrast, don't error unless you try to access the bad bytes (as long as the string has a valid length), and if you do access them, the RawBsonError type actually returns the bytes of the invalid string, so you can easily inspect them, write them out as binary, or whatever you want with them.

Unfortunately, either Option 1 or 2 would be a semver incompatible change. It would be fairly trivial in the case of Option 2, but Option 1 would leave the library feeling overall more consistent in the end.

@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 13, 2020

One benefit of unifying the return types is that the following expressions would be interchangeable, which seems desireable to me.

let oid1 = rawdoc.get("_id").and_then(RawBson::as_object_id);
let oid2 = rawdoc.get_object_id("_id");
let oid3 = OrderedDocument::from(rawdoc).get("_id").and_then(Bson::as_object_id);
let oid4 = OrderedDocument::from(rawdoc).get_object_id("_id");
let oid5 = rawdoc.get("_id").map(Bson::from).and_then(Bson::as_object_id);

@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 13, 2020

Assuming everything goes well and the raw types do prove to be faster

If you run the included benchmarks you should see that repeated random access on the raw type eventually gets slower than the parsed type, as you would expect (because access to a given element takes linear time), but in-order iteration and direct access to a single element are significantly faster.

can still opt-into the traditional Document type by just calling a method to convert what they send over and receive from the driver.

rawdoc.into() will do this, and performs comparably well to the current way of constructing OrderedDocuments. :).

Note that one important use of the parsed type is constructing bson documents from scratch. The raw type isn't well suited to this, so methods like Collection::find should take either an OrderedDocument or probably better, a T: Into<RawBsonDoc>.

@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 13, 2020

Oops. I just saw that the From<RawBsonDoc> for OrderedDocument implementation panics on invalid input. I can change that to implement TryFrom<RawBsonDoc, Error=RawValueAccessError> instead.

@jcdyer
Copy link
Contributor Author

jcdyer commented Jan 10, 2021

I've published this work as a standalone crate (https://github.com/jcdyer/rawbson / https://crates.io/crates/rawbson). Whenever you get ready to incorporate this into bson-rust, let me know, and we can figure out the best way to integrate it. For now, I'll close this PR.

@jcdyer jcdyer closed this Jan 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tracked-in-jira Ticket filed in Mongo's Jira system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support RawBson objects.
3 participants