Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize byte objects directly #152

Closed
wants to merge 9 commits into from

Conversation

aspin
Copy link

@aspin aspin commented Dec 25, 2020

Closes #95.

I'm not very familiar with PyO3 so let me know if this look ok!

@ijl
Copy link
Owner

ijl commented Dec 29, 2020

Thank you for working on this. I think there are some questions about how this should work. The first use case given was writing base64 bytes objects without converting. That's something like orjson.dumps({"key": b"dmFsdWU=", option=orjson.OPT_SERIALIZE_BYTES) yielding b'{"key":"dmFsdWU="). This implementation doesn't support that. I know another use case is to serialize ASCII byte strings that are dictionary keys, which would require the same behavior. I see your use case in the linked issue that requires it not be formatted as a string and I wonder whether you've benchmarked it now that you have this implementation and seen a difference? Either way another thing to look at is more testing, in particular with invalid JSON and invalid UTF8. It's at the caller's risk to give that but it needs to be tested to not cause a crash.

@aspin
Copy link
Author

aspin commented Dec 29, 2020

Ah, I see. So the first issue wants the bytes to be serialized directly as a string rather than to raw JSON. That seems like it'll require a separate option to distinguish the behavior. Yeah, I'll do a bit a of benchmarking to check out it out, and do some testing of invalid JSON.

@aspin
Copy link
Author

aspin commented Dec 29, 2020

Ok, I've added support for the other use case as OPT_SERIALIZE_BYTES_AS_STRING and renamed my example use case as OPT_SERIALIZE_BYTES_AS_JSON. Let me know how that looks.

Added a test case for invalidly formatted JSON, which seems to work without any changes. Same with UTF8 –– I'm not really familiar with creating invalid UTF8, so I just picked a string that .decode("utf-8") fails on.

Benchmarking:

10000 samples, cached size = 744 bytes, total message size = 827 bytes

baseline 0.3442249298095703s
cold_cache 0.42847299575805664s
hot_cache 0.11748290061950684s

@ijl
Copy link
Owner

ijl commented Jan 3, 2021

Looking at this it's really not possible to have both options as-is. What happens if I specify option=OPT_SERIALIZE_BYTES_AS_JSON | OPT_SERIALIZE_BYTES_AS_STRING? The behavior is an implementation detail of how the comparisons are ordered and in my opinion is too confusing and unreliable to be made part of the API.

I think the way forward for OPT_SERIALIZE_BYTES_AS_JSON would be a separate type provided by orjson, like python-rapidjson's RawJSON. With an option enabled you can type check an object's ob_type to match the id of the orjson-provided class unambiguously. That's probably a lot more expensive and maybe a bit much for your interest. Sorry about that.

I don't expect you want to spend time implementing the bytes as ASCII str case because it's not helpful to your use. If you do, I would want to support bytes as dict keys and don't understand why the implementation in json/src/ser.rs paid for going through serialize_element etc rather than the simpler version you have, if you could explain? Having another patch against serde-json is undesirable and probably they had a reason for that implementation but I'm not familiar.

@aspin
Copy link
Author

aspin commented Jan 4, 2021

I suppose the other simple alternative is to just throw an exception when both options are specified, since that's a unresolvable logical request. Do you have any objections to that? Happy to spend sometime creating another type if so.

Yeah, so the existing behavior serialized bytes as an array of byte codes (for example b'{"a":"a","b":1}' => [123,34,97,34,58,34,97,34,44,34,98,34,58,49,125]. orjson never hit that section of code since it rejected bytes and never put it into the serde data model. I'm not really sure as to the exact motivation, except that the serde-json developers probably thought that that was the appropriate output if the user was supplying bytes as an input in their JSON. I think the underlying assumptions is that in most cases bytes are not so closely linked to strings as they are in orjson.

@ijl
Copy link
Owner

ijl commented Jan 12, 2021

Thanks for explaining the serde-json change. I think it's avoidable to allow the user to be in a position where it's possible to specify conflicting options and being hard to misuse is better. To look at it in terms of docs, something like:

Fragment

orjson.Fragment is a class that wraps byte strings of already-serialized JSON. This can be useful when proxying valid JSON.

>>> import orjson
>>> orjson.dumps({"data": orjson.Fragment(b"[1,2]")})
b'{"data":[1,2}'

This outputs invalid JSON if the input is invalid. The contents are passed through unmodified with no checks.

>>> import orjson
>>> orjson.dumps([orjson.Fragment(b"zxc]"), ])
b'[zxc]]'
>>> orjson.loads(_)
orjson.JSONDecodeError: ...

@ijl
Copy link
Owner

ijl commented Jan 22, 2021

I'll close because it's inactive but feel free to reopen.

@ijl ijl closed this Jan 22, 2021
@aspin
Copy link
Author

aspin commented Jan 22, 2021

Sounds good. Been busy lately, but I'll get around to it soonish.

@aspin
Copy link
Author

aspin commented Jan 26, 2021

What reference for PyO3 did you use for writing this library? I've been looking at https://pyo3.rs/v0.13.1/class.html and checked out the docs for v0.12, but the patterns described here seem pretty different from what's in this library. I'm not quite sure how to define a Rust struct for use in Python using just Cython calls.

@ijl
Copy link
Owner

ijl commented Jan 26, 2021

You would need to use pyo3/src/ffi directly. Examples in or tutorials for C projects would apply then. pyo3's wrappers are convenient but would be much slower than the current implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Serialise bytes objects directly?
2 participants