
proposal: encoding/json: garbage-free reading of tokens #40128

Open
rogpeppe opened this issue Jul 9, 2020 · 0 comments

As @bradfitz noted in the reviews of the original API, the Decoder.Token API is a garbage factory. Although, as @rsc noted at the time, "a clean API ... is more important here. I expect people to use it to get to the position they want in the stream and then call Decode", the inefficiency is a problem in practice for anyone who wishes to use the encoding/json tokenizer as the basis for some other kind of decoder.

Dave Cheney's "Building a high performance JSON parser" details some of the issues involved. He concludes that the interface-based nature of json.Token is a fundamental obstacle. I like the current interface-based API, but it does indeed make it impossible to return arbitrary tokens without creating garbage. Dave suggests a new, somewhat more complex Scanner API that is also not backward compatible with the current API in encoding/json.

I propose instead that the following method be added to the encoding/json package:

// TokenBytes is like Token, except that for strings and numbers it returns
// a static Token value with the actual data payload in the []byte return
// value, which is only valid until the next call to Token, TokenBytes, or
// Decode. For strings, the returned Token will be ""; for a number, it will
// be Number("0"); for all other kinds of token, the Token will be returned
// as by the Token method and the []byte value will be nil.
//
// This is more efficient than using Token because it avoids the
// allocations required by that API.
func (dec *Decoder) TokenBytes() (Token, []byte, error)
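
As a usage sketch (countStrings is hypothetical, and the code assumes the proposed method exists), a caller can scan a stream and inspect string payloads without per-token allocations:

import (
	"encoding/json"
	"io"
)

// countStrings walks a JSON stream and counts string tokens without
// boxing each one. The data slice is only valid until the next call,
// so any inspection must happen (or the bytes be copied) inside the loop.
func countStrings(r io.Reader) (int, error) {
	dec := json.NewDecoder(r)
	n := 0
	for {
		tok, data, err := dec.TokenBytes()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if _, ok := tok.(string); ok {
			_ = data // the string's bytes, borrowed from the decoder
			n++
		}
	}
}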

Token can be implemented in terms of TokenBytes as follows:

func (dec *Decoder) Token() (Token, error) {
	tok, data, err := dec.TokenBytes()
	if err != nil || data == nil {
		return tok, err
	}
	switch tok {
	case "":
		return string(data), nil
	case Number("0"):
		return Number(data), nil
	}
	panic("unreachable")
}
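
Only the two sentinel cases need handling in the switch: every other kind of token comes back from TokenBytes with a nil data slice and is returned directly by the early return above.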

Discussion

This proposal relies on the observation that the Decoder.Token API only generates garbage for two kinds of token: numbers and strings. For all other token types, no garbage need be generated, since small values (json.Delim and bool) do not incur an allocation when boxed in an interface.
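
This is easy to confirm with testing.AllocsPerRun; the following sketch (the package-level sink exists only to stop the compiler optimizing the conversions away) reports zero allocations for the small values and a non-zero count for the string:

import (
	"encoding/json"
	"fmt"
	"testing"
)

var sink json.Token // forces the boxed values to escape

var payload = []byte("hello")

func main() {
	// Delims and bools are backed by static runtime data when boxed,
	// so these assignments do not allocate.
	small := testing.AllocsPerRun(1000, func() {
		sink = json.Delim('{')
		sink = true
	})
	// A string built from decoder-owned bytes must be copied and boxed.
	str := testing.AllocsPerRun(1000, func() {
		sink = string(payload)
	})
	fmt.Println(small, str) // 0 for the small values, non-zero for the string
}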

It maintains the current API as-is. Users can opt in to the new API if they require efficiency, at some risk of incorrectness (the caller could hold onto the data slice after the next call to Decode). The cognitive overhead of TokenBytes is arguably low because of its similarity to the existing API.

If this proposal is accepted, an Encoder.EncodeTokenBytes could easily be added to provide garbage-free streaming JSON generation too.
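
One possible shape, mirroring TokenBytes (a sketch only, not part of the proposal text):

// EncodeTokenBytes would be the writing counterpart of TokenBytes: tok
// selects the kind of token, and for strings and numbers the payload is
// supplied in data, so callers can emit tokens without boxing each value.
func (enc *Encoder) EncodeTokenBytes(tok Token, data []byte) error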

@gopherbot gopherbot added this to the Proposal milestone Jul 9, 2020
@gopherbot gopherbot added the Proposal label Jul 9, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Jul 9, 2020