Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: encoding/json: add a new DecodeRaw API #18821

Closed
mikewied opened this issue Jan 27, 2017 · 4 comments

Comments

Projects
None yet
4 participants
@mikewied
Copy link

commented Jan 27, 2017

I've run into some performance issues when decoding large json documents and have a proposal for adding a new API to encoding/json package. I'm willing to write the code, tests, go through the review process etc, but before I do that I'd like to get approval that this new API would be accepted.

As I mentioned I'm currently using the Decoder in encoding/json to parse a large json documents which in my case containing a json list of objects. The problem that I have is that the Decode function in the Decoder does two things. It first reads the bytes of the next object in the list and then unmarshals the json into a golang object. The main consumer of CPU however is unmarshaling the data. To test this I wrote the following program.

func ReadListMinimal(path string) error {
	f, err := os.Create("/tmp/pprof")
	if err != nil {
		fmt.Printf("could not create CPU profile: %v", err)
	}

	if err := pprof.StartCPUProfile(f); err != nil {
		fmt.Printf("could not start CPU profile: %v", err)
	}
	defer pprof.StopCPUProfile()

	docsRead := 0
	file, err := os.Open(path)
	if err != nil {
		return err
	}

	decoder := NewDecoder(file)

	// Read opening bracket
	if _, err := decoder.Token(); err != nil {
		return fmt.Errorf("expected a JSON list of objects")
	}


	item := &QueuedItem{nil, nil, nil}
	for decoder.More() {
		docsRead++
		err = decoder.DecodeRaw(&item.bytes)
		if err != nil {
			return fmt.Errorf("expected a JSON list of objects")
		}

		if docsRead > 500000 {
			return nil
		}
	}

	// Read closing bracket
	if _, err := decoder.Token(); err != nil {
		return fmt.Errorf("expected a JSON list of objects")
	}


	return file.Close()
}

The program reads the first 500000 objects from the list and then exits. It discards the objects after reading them. What I found was that 35.36s of 43.51s (81.26%) of the time spent running the program are spent unmarshaling data.

For large json objects we can only have a single reader so we limited by the speed that we can process this large json object even if we have more CPU's available to us.

My proposal would be to add a new API to the Decoder which I will call DecodeRaw. DecodeRaw will read the next json object just like Decode does, but DecodeRaw will return the bytes of that object back to the user instead of the unmarshaled data. The user can then unmarshal the bytes when they deem appropriate. What this would allow developers to do is to have a single reader reading the large json document. The could then create multiple goroutines to unmarshal the data. This would allow developers to take advantage of all CPU's on their machines for reading large JSON files. My initial proposal is that the DecodeRaw function would look something like below. Note that this function is not my final proposal. It is just meant to drive discussion.

func (dec *Decoder) DecodeRaw(bytes *[]byte) error {
	if dec.err != nil {
		return dec.err
	}

	if err := dec.tokenPrepareForDecode(); err != nil {
		return err
	}

	if !dec.tokenValueAllowed() {
		return &SyntaxError{msg: "not at beginning of value"}
	}

	// Read whole value into buffer.
	n, err := dec.readValue()
	if err != nil {
		return err
	}

	*bytes = dec.buf[dec.scanp : dec.scanp+n]
	dec.scanp += n

	// fixup token streaming state
	dec.tokenValueEnd()

	return err
}

So my question is whether the community thinks this would be a useful function to add to the Decoder?

@ALTree ALTree changed the title Request to add a new DecodeRaw API to encoding/json Decoder proposal: encoding/json: add a new DecodeRaw API Jan 27, 2017

@ALTree ALTree added the Proposal label Jan 27, 2017

@ALTree ALTree added this to the Proposal milestone Jan 27, 2017

@vcabbage

This comment has been minimized.

Copy link
Member

commented Jan 28, 2017

Would decoding into json.RawMessage work for your purposes?

Here's a modified version json.Decoder.Decode's example: https://play.golang.org/p/--kgxo7k_J

@mikewied

This comment has been minimized.

Copy link
Author

commented Jan 28, 2017

That sounds like it would solve my issue and would remove the need to add a new API. Ill try this out over the weekend and close this issue if it works.

@mikewied

This comment has been minimized.

Copy link
Author

commented Jan 30, 2017

I'm going to close this since I've verified that using RawMessage will solve this problem and as a result there is no need to add a new API. One thing that I am planning on looking into a little bit more is why there is still a fair amount of time spent in the unmarshal call when using RawMessage. The new timings for example are 2.54s/5.76s being spent in unmarshal. It seems like there may be some sort of optimization that can be added here since we should just be able to pass the bytes back to the user. I'll investigate this more and if I find anything I will open a separate issue.

@mikewied mikewied closed this Jan 30, 2017

@mikewied

This comment has been minimized.

Copy link
Author

commented Jan 30, 2017

Also, @vcabbage thanks for pointing out the use of RawMessage for this case.

@golang golang locked and limited conversation to collaborators Jan 30, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.