proposal: encoding/json: add a new DecodeRaw API #18821
I've run into some performance issues when decoding large JSON documents and have a proposal for adding a new API to the encoding/json package. I'm willing to write the code and tests and go through the review process, but before I do that I'd like to get approval that this new API would be accepted.
As I mentioned, I'm currently using the Decoder in encoding/json to parse a large JSON document, which in my case contains a JSON list of objects. The problem I have is that the Decode function on the Decoder does two things: it first reads the bytes of the next object in the list, and then it unmarshals that JSON into a Go object. The main consumer of CPU, however, is unmarshaling the data. To test this I wrote the following program.
The program reads the first 500000 objects from the list and then exits, discarding each object after reading it. What I found was that 35.36s of the 43.51s (81.26%) spent running the program went to unmarshaling data.
For large JSON documents we can only have a single reader, so we are limited by the speed at which that one reader can process the document, even if we have more CPUs available to us.
My proposal would be to add a new API to the Decoder, which I will call DecodeRaw. DecodeRaw will read the next JSON object just like Decode does, but instead of unmarshaled data it will return the bytes of that object to the user, who can then unmarshal them whenever they deem appropriate. This would allow developers to have a single reader reading the large JSON document and multiple goroutines unmarshaling the data, taking advantage of all the CPUs on their machines when reading large JSON files. My initial proposal is that the DecodeRaw function would look something like below. Note that this is not my final proposal; it is just meant to drive discussion.
So my question is: does the community think this would be a useful function to add to the Decoder?
I'm going to close this since I've verified that using RawMessage solves this problem, so there is no need to add a new API. One thing I'm planning to look into a little more is why a fair amount of time is still spent in the unmarshal call when using RawMessage; the new timings, for example, show 2.54s of 5.76s spent in unmarshal. It seems like there may be some optimization possible here, since we should just be able to pass the bytes back to the user. I'll investigate this more, and if I find anything I will open a separate issue.
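For anyone landing here with the same problem, the verified approach is roughly the following sketch: a single reader decodes each list element into a json.RawMessage and hands the bytes to worker goroutines, which call json.Unmarshal in parallel. The `record` type and worker count here are illustrative, not from the original:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

// record is an illustrative element type.
type record struct {
	Value int `json:"value"`
}

// parallelSum reads a JSON array with a single reader, fanning the
// raw bytes of each element out to workers that unmarshal them in
// parallel, and returns the sum of their Value fields.
func parallelSum(data string, workers int) int64 {
	raws := make(chan json.RawMessage, workers)
	var total int64

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range raws {
				var r record
				// The CPU-heavy unmarshal now runs in parallel.
				if err := json.Unmarshal(raw, &r); err == nil {
					atomic.AddInt64(&total, int64(r.Value))
				}
			}
		}()
	}

	// Single reader: decoding into RawMessage only captures the
	// bytes of each element, which is comparatively cheap.
	dec := json.NewDecoder(strings.NewReader(data))
	dec.Token() // consume the opening '['
	for dec.More() {
		var raw json.RawMessage
		if err := dec.Decode(&raw); err != nil {
			break
		}
		raws <- raw
	}
	close(raws)
	wg.Wait()
	return total
}

func main() {
	sum := parallelSum(`[{"value":1},{"value":2},{"value":3}]`, 4)
	fmt.Println("sum:", sum) // prints "sum: 6"
}
```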