proposal: encoding/json: Opt-in for true streaming support #33714
I have long wanted proper streaming support in the
In a nutshell: The library implicitly guarantees that marshaling will never write an incomplete JSON object due to an error, and that during unmarshaling, it will never pass an incomplete JSON message to
Work toward this has been done on a couple of occasions, but abandoned or stalled for various reasons. See https://go-review.googlesource.com/c/go/+/13818/ and https://go-review.googlesource.com/c/go/+/135595
See also my related post on golang-nuts: https://groups.google.com/d/msg/golang-nuts/ABD4fTkP4Nc/bliIAAAeAQAJ
The problem to be solved
Dealing with large JSON structures is inefficient, due to the internal buffering done by
When encoding, even with
The same problem occurs in reverse--when reading a large JSON object: you cannot begin processing the result until the entire result is received.
A naïve solution
I believe a simple solution (simple from the perspective of a consumer of the library--the internal changes are not so simple) would be to add two interfaces:
During (un)marshaling, where
With this change, and the requisite internal changes, it would be possible to begin streaming large JSON data to a server immediately, from within a
The drawback is that it violates the above-mentioned promise of complete reads and writes, even when an error occurs.
Making it Opt-in
To accommodate this requirement, I believe it would be possible to expose the streaming functionality only with the
The default behavior, even when a type implements one of the new
Enabling streaming with the
CLs 13818 and 135595 can inform this part of the discussion. I've also done some digging in the
A large number of internal changes will be necessary to allow for this. I started playing around with a few internals, and I believe this is doable, but will mean a lot of code churn, so will need to be done carefully, in small steps with good code review.
As an exercise, I have successfully rewritten
An open question is how these changes might impact performance. My benchmarks after changing
Once the internals are rewritten to support streams, it's just a matter of doing the internal buffering at the appropriate place, such as at API boundaries (i.e. in
To be clear, I am interested in working on this. I’m not just trying to throw out a “nice to have, now would somebody do this for me?” type of proposal. But I want to make sure I fully understand the history and context of this situation before I start too far down this rabbit hole.
I'm curious to hear the opinions of others who have been around longer. Perhaps such a proposal was already discussed (and possibly rejected?) in greater length than I can find in the above linked tickets. If so, please point me to the relevant conversation(s).
I am aware of several third-party libraries that offer some support like this, but most have various drawbacks (relying on code generation, or overly complex APIs). I would love to see this kind of support in the standard library.
If this general direction is approved, I think the first step is to break it into smaller parts that can be accomplished incrementally. I have given this thought, but so as not to jump the gun too much, will withhold my thoughts for a while, to allow proper discussion.
And one last aside: CL 13818 also added support for marshaling channels. That may or may not be a good idea (my personal feeling: probably not), but that can be addressed separately.
Thanks for filing this issue. There have indeed been previous discussions around the topic, but they've all been in separate places. The most recent discussion I remember is https://go-review.googlesource.com/c/go/+/135595, which included a working implementation for the encoder, and benchmark numbers. Edit: just realised you link it above as well.
I assume that this proposal is mainly driven by performance. If that's the case, what's the expected win from such API changes and internal refactors? It's hard to make a decision without experimental numbers. For example, if the wins on the current benchmarks are within a few percent, I'd say it's not worth the extra complexity and complex rewrite.
I'd also say that you should look at a recent master, or at least a 1.13 tag, when experimenting with changes. For example, the
Yes--improving performance, and reducing the cumbersome code that works around the current limitations. As an example, I've written some pretty ugly code using the Tokenizer interface, to read large JSON responses from CouchDB.
The above linked code provides a CouchDB analog to the
This exact benefit is tricky to measure accurately with a Go benchmark suite.
That said, I expect there is room for some easily-measured performance gains. I'll try to put together some benchmarks, and add them to this issue.
Good suggestion, and of course for any serious testing, I will do that.
Fair enough. With the encoder, if one wants to stream lots of elements, it's been suggested before to do something like:
I understand that this is harder to do with the decoder, as you have to then deal with the tokenizer, like you did. So it seems like your "large JSON" problem is more about decoding than encoding - is that correct?
I think both problems are worth solving. I don't know which is "bigger". Probably for my own use case, decoding is more painful (if only because working with the Tokenizer is more cumbersome). It seems historically more people have complained about the decoding case, too.
Where the encoding problem becomes cumbersome is when your "write
To provide a real-world example (again, from CouchDB), to upload a file attachment, you include the following value in your JSON document*:
In this scenario, the ideal situation would probably be to read the content of the files directly from disk, and stream to the network, rather than buffering internally. The "
I hope that all makes sense :)
*This isn't the only way to upload attachments--there are methods that don't require bloating your payload with base64, but I think this example still illustrates the point.
This isn't the most interesting benchmark yet, but it's what I could throw together quickly, based on my previous experimentation: flimzy#1
This rewrites the
I'm not sure how informative this would be, given that the other libraries (that I know of) take a vastly different approach, making any benchmarks against them an apples/oranges comparison. For example, json-iterator exposes special functions for every data type, to avoid reflection. The other leading third-party json libraries use code generation and/or don't support streaming.
This would be informative. A complete set of back-of-napkin statistics from
That depends on the project's needs. Obviously some people feel that the different approaches are useful, or the libraries wouldn't be used.
I don't think the standard library is likely to adopt the techniques used by those libraries (and for good reason), so I'm not sure what value such benchmarks provide to this discussion. If you're just curious about benchmarks, most of these projects provide them. See here, for example.
I have some more benchmarks to share. Still nothing earth-shattering, but building on the previous work I mentioned above with streaming the
Before (standard implementation):
After (streaming implementation):