
I'm interested in this project, so I'd like to share some of my knowledge #1

Open
goccy opened this issue Aug 11, 2021 · 7 comments

@goccy

goccy commented Aug 11, 2021

Hello. I'm the author of a Go JSON library, goccy/go-json.
I commented because this is a very interesting project!

I have read through the contents of the README at this time.

I'm developing the above JSON library with the goal of making it more convenient and faster while maintaining compatibility with encoding/json.
Therefore, I think that it has a similar concept to this project, and I'm very much looking forward to the future of this project.

So, I would like to share some knowledge (mainly on the performance side) gained while developing goccy/go-json. Is it okay to write it here?
The README mentions not using unsafe, so I understand the policy differences from the library I'm developing.
With that in mind, I would like to share what I know in the hope that some of it proves useful.

Also, I know a lot about other third-party JSON libraries' implementations,
so I could also share each library's approach to performance improvement if you want.

@mvdan
Collaborator

mvdan commented Aug 12, 2021

Hello, and thanks for reaching out :)

You're right that this library has the same design principles as encoding/json, including no unsafe, but also no code generation. We also don't want this library to be very heavy in terms of code size.

And, just like the standard library, it prioritises correctness and API usability over performance. The default behavior is a bit slower for the sake of extra sanity checks, for example, which can be disabled by the user.

Over the past year we've mainly worried about the library's design and actually getting it to work, so optimizations haven't been the biggest worry. The design should still allow for good performance, but I would not expect its current version to be among the fastest json libraries today.

With all that in mind, what kind of knowledge would you like to share, or how would you like to contribute? I believe we are already aware of how other fast json libraries work, at a high level, but perhaps there's some specific detail of one of them that would benefit our design or implementation.

Something else you could try is measuring the performance of this library, and looking into the cases where it's significantly slower than you anticipated. That's something we will transition more towards, as we get more confident that the current design and API are decent.

@goccy
Author

goccy commented Aug 12, 2021

Thank you for your reply! :)

I also want the standard library to have a simple implementation.
On top of that, I would like to contribute if there are any points that can be improved in terms of design or implementation.
Regarding features, I imagine you have already thoroughly discussed and selected them from among the many feature requests.
Therefore, I wanted to provide useful information on the implementation side, in non-feature areas such as performance.

At this time, I'm not sure which features you're planning to offer as options.
As an example, though, here are some features that would be worth offering as options, primarily for performance reasons.

(First) encoding/json always sorts keys when encoding a map value. Many performance-focused libraries support encoding without sorting as an option, so you might want to consider it.
(Second) When decoding into a struct field, encoding/json behaves as "last wins" (it overwrites with whatever it finds later), and there is no way to notice a duplicated key. An option to change this behavior to "first wins" would also be a performance advantage and may be worth considering.

As mentioned above, I was hoping to propose improvements for the problems encoding/json has, comparing against other libraries.
I couldn't think of a better place to share this, so I wrote it here.

Please let me know if there is a more suitable place.
Also, if you're not currently open to accepting ideas and discussions, I'll refrain from commenting.

@dsnet
Collaborator

dsnet commented Aug 12, 2021

Hi @goccy, thanks for offering your knowledge!

At this time, I'm not sure what features you're trying to implement as an option,

A recent example is the decision to reject duplicate names by default. The presence of duplicate entries is almost always a bug and is a potential security hole. There's an obvious performance cost to this, but users can opt out by setting RejectDuplicateNames to false. Even though we plan on enabling it by default, there are a number of tricks that we can play to reduce the performance overhead. For example, when unmarshaling into a struct, we always convert a JSON object name into a Go struct field index. Assuming that most structs have fewer than 64 fields, we could use a uint64 bitmap to track which names we saw before.
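A rough sketch of that bitmap idea (the function name and shape here are hypothetical, not the actual implementation):

```go
package main

import "fmt"

// hasDuplicate is a hypothetical sketch: once a JSON object name has been
// resolved to a struct field index (assumed < 64), a uint64 bitmap records
// which indices were already seen, so detecting a duplicate costs a single
// AND/OR instead of a map lookup or per-name allocation.
func hasDuplicate(fieldIndices []int) bool {
	var seen uint64
	for _, idx := range fieldIndices {
		bit := uint64(1) << uint(idx)
		if seen&bit != 0 {
			return true // this field index was already decoded
		}
		seen |= bit
	}
	return false
}

func main() {
	fmt.Println(hasDuplicate([]int{0, 3, 7}))    // false
	fmt.Println(hasDuplicate([]int{0, 3, 3, 7})) // true
}
```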

(First) encoding/json always sorts when encoding a map value

We currently do not plan on sorting map keys anymore for several reasons:

  1. The performance hit (as you noted). Most usages do not actually care about the order. For cases where people care about stable serialization, there's the RawValue.Canonicalize method.
  2. We are aiming for purely streaming encoding/decoding. Sorting goes against that desire, since a stream fundamentally cannot be sorted.

(Second) In decoding the struct field, ...

Can you elaborate more on the relationship between performance and merging into an existing field?

@goccy
Author

goccy commented Aug 13, 2021

@dsnet Thank you for the answer!

A recent example is the decision to reject duplicate names by default. The presence of duplicate entries is almost always a bug and is a potential security hole.

We currently do not plan on sorting map keys anymore for several reasons

OK, I see. Thank you.

when unmarshaling into a struct, we always convert a JSON object name into a Go struct field index. Assuming that most structs have fewer than 64 fields, we could use a uint64 bitmap to track which names we saw before.

I think this is also good.
First, some of the optimizations used by goccy/go-json are listed at https://github.com/goccy/go-json#how-it-works .
Please take a look if you like; the implementation mentioned here is close to one of them.
I'm in favor of your implementation, as I already know this approach performs well.

Regarding the implementation, what are your thoughts on improving performance by caching the encoding and decoding process, keyed by the address of the type information?
Also, what are your thoughts on using a memory pool on the library side, such as buffer reuse via sync.Pool ( https://github.com/goccy/go-json#buffer-reuse )?

@dsnet
Collaborator

dsnet commented Aug 13, 2021

First, some of the optimizations used by goccy/go-json are listed at https://github.com/goccy/go-json#how-it-works .

Thank you for suggesting these, but unfortunately we cannot use most of them, since they rely on unsafe one way or another.

We do pre-process marshal/unmarshal functionality for a given type. It's as close to your typeToEncoder map as one can get without using unsafe.

Also, what are your thoughts on how to use the memory pool on the library side, such as buffer reuse using sync.Pool ( https://github.com/goccy/go-json#buffer-reuse ) ?

If possible, we'd like to avoid sync.Pool, but I suspect that we'll end up using them in a few places throughout the module.

That said, safety is a high priority for us. Any use of sync.Pool cannot leak buffers to the user. An example of this is that we can't pool the buffer used when calling io.Reader.Read since we cannot prove that the buffer doesn't escape through a user's implementation of the Read method.

Also, CPU performance isn't the only priority either. Use of sync.Pool tends to reduce CPU utilization by the GC, but sometimes at the expense of a lot more memory. Improper use of sync.Pool can pin arbitrary amounts of memory (see golang/go#27735).

For some cases, we would probably provide an explicit API for reuse (e.g., some Reset method). Thus, more control over buffer reuse can be manually done by the user.
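A minimal sketch of library-side buffer reuse that sidesteps the pinning concern above, assuming a hypothetical encodeWithPool helper and a 64 KiB cap (both made up for the example): oversized buffers are simply dropped instead of being returned to the pool.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool holds scratch buffers for reuse between encode calls.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// maxPooledSize caps the buffers kept for reuse; anything larger is left to
// the GC so one huge document cannot pin a huge buffer forever (golang/go#27735).
const maxPooledSize = 1 << 16 // 64 KiB

func encodeWithPool(v string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	fmt.Fprintf(buf, "%q", v)
	// Copy the result out before returning the buffer, so the pooled
	// buffer never leaks to the caller.
	out := buf.String()
	if buf.Cap() <= maxPooledSize {
		bufPool.Put(buf)
	}
	return out
}

func main() {
	fmt.Println(encodeWithPool("hello")) // "hello"
}
```

Note how the string is copied out before the buffer is pooled, matching the safety constraint described above: a pooled buffer must never escape to the user.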

@goccy
Author

goccy commented Aug 13, 2021

Thank you for your answer. I understand your plan.
At this time, it seems I have no information to share that doesn't involve proposing new features.
I look forward to the success of this project!

Thank you.

@dsnet
Collaborator

dsnet commented Sep 22, 2023

@goccy. I recently updated the benchmarks to include correctness tests and I noticed that goccy/go-json non-deterministically fails on some of them.

I'm using v0.10.2.

See https://github.com/go-json-experiment/jsonbench#correctness .
Some of the failures are really scary, like nil pointer dereferences, out-of-bounds accesses, and even a few fatal GC runtime errors.

You can reproduce the failures by doing:

```shell
while true; do go test -run=Roundtrip///GoJSON; done
```
