
encoding/json: add Encoder.EncodeToken method #40127

Open

rogpeppe opened this issue Jul 9, 2020 · 12 comments

Comments

@rogpeppe (Contributor) commented Jul 9, 2020

Currently there is a way to decode JSON token by token, but the JSON package does not support encoding tokens the same way. It has been stated that writing the tokens yourself is straightforward enough (citation needed), but in practice it is not that easy.

Here's some code that streams JSON from input to output: https://play.golang.org/p/H6Xl_twRIyC. It's at least 50 lines of code to do the basic streaming functionality, it's easy to get wrong by forgetting to put a colon or a comma in the right place, and if you want indentation or HTML escaping, you have to do it yourself.
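To make the pain concrete, here is a condensed, self-contained sketch (my own code, not the linked playground program) of the bookkeeping needed to re-emit Decoder.Token output by hand. The frame stack and the sep and done helpers exist purely to place commas and colons correctly:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// copyJSON re-encodes the JSON value in src token by token, writing all
// commas and colons by hand. This is the bookkeeping an EncodeToken
// method would make unnecessary.
func copyJSON(src string) (string, error) {
	type frame struct {
		isObj bool // true for {...}, false for [...]
		n     int  // number of completed members so far
		key   bool // object frames only: next token is a key
	}
	dec := json.NewDecoder(strings.NewReader(src))
	dec.UseNumber() // preserve number formatting
	var out strings.Builder
	var stack []frame

	// sep writes any separator needed before the next token.
	sep := func() {
		if len(stack) == 0 {
			return
		}
		top := &stack[len(stack)-1]
		switch {
		case top.isObj && top.key && top.n > 0:
			out.WriteString(",")
		case top.isObj && !top.key:
			out.WriteString(":")
		case !top.isObj && top.n > 0:
			out.WriteString(",")
		}
	}
	// done records that a key or value has just been completed.
	done := func() {
		if len(stack) == 0 {
			return
		}
		top := &stack[len(stack)-1]
		if top.isObj {
			if top.key {
				top.key = false // just wrote a key; value comes next
				return
			}
			top.key = true
		}
		top.n++
	}

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case json.Delim:
			switch t {
			case '{', '[':
				sep()
				out.WriteString(t.String())
				stack = append(stack, frame{isObj: t == '{', key: t == '{'})
			case '}', ']':
				// Decoder.Token guarantees matched delimiters, so the
				// stack is never empty here.
				out.WriteString(t.String())
				stack = stack[:len(stack)-1]
				done() // the closed value completes a member of its parent
			}
		default:
			sep()
			b, err := json.Marshal(tok) // strings, Numbers, bools, null
			if err != nil {
				return "", err
			}
			out.Write(b)
			done()
		}
	}
}

func main() {
	got, err := copyJSON(`{"a":[1,2,{"b":null}],"c":true}`)
	fmt.Println(got, err)
}
```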

I propose that we add an EncodeToken method to json.Encoder:

// EncodeToken encodes a single JSON token (of the kind returned by the
// Decoder.Token method) to the encoder. The order of tokens must result
// in valid JSON syntax: delimiters [ ] { } must be properly nested and
// matched, and object keys must be strings (but see the note on using
// Encode below).
//
// EncodeToken returns an error if the token cannot be encoded or is
// misplaced, for example because there's a delimiter mismatch, a
// non-string token is passed where an object key is expected, a string
// contains invalid UTF-8, or a Number isn’t formatted correctly.
//
// Note that it’s OK to mix calls to EncodeToken with calls to Encode.
// In particular, it is valid to call Encode when an object key is
// expected; in this case, the value is subject to the same conversion
// rules as for map keys.
func (enc *Encoder) EncodeToken(tok Token) error

There would be no way to produce syntactically invalid output (other than truncating the output by not completing the encoded value). The Encoder would keep track of the current state in a stack internally.

An example: if we wanted to stream a large array of JSON objects, we could do:

enc.EncodeToken(json.Delim('['))
for _, arrayItem := range arrayItems {
    enc.Encode(arrayItem)
}
enc.EncodeToken(json.Delim(']'))

A slightly more comprehensive example is here

The code to stream a set of JSON values would become considerably simpler: https://play.golang.org/p/Wec5wepCYbE

Completeness validation

It might be useful to provide callers with a final check that their encoded value is in fact complete (that is, no final closing tokens have been omitted). To do that, the following method could be provided:

// ClosingToken returns the token necessary to close the current encoding object or array.
// It returns either Delim('}'), Delim(']'), or nil if there is no current object or array.
func (enc *Encoder) ClosingToken() Token

This method makes it straightforward to check that the object is complete (if enc.ClosingToken() == nil), but it could also be used for sanity checks during development, for better errors, or to implement a Flush method that automatically closes all open objects and arrays:

for enc.ClosingToken() != nil {
   enc.EncodeToken(enc.ClosingToken())
}

Discussion

As with the current Decoder.Token implementation, Encoder.EncodeToken will not be particularly efficient, as it requires passing strings and numbers as strings inside an interface value. I think this can be considered an orthogonal issue: a solution to the Decoder.Token garbage problem could also be applied to Encoder.EncodeToken. The symmetry between Decoder.Token and Encoder.EncodeToken is a useful property that makes this API easier to explain and use, and shouldn’t be discarded lightly, in my view.

Once Encoder.Encode is invoked, there’s no way to return to streaming mode, for example to encode a large array within a MarshalJSON method. This is a limitation of the json.Marshaler interface, and could potentially be addressed by a fix in that area.

It would almost be possible to implement this without adding a method at all by changing the Encode method to recognize Delim. However, it’s currently valid to pass a Delim value to encode (it encodes as a number), so this would break backward compatibility. Also, it’s arguably cleaner to have a separation between the low level token API and the higher level Encode API.

Related proposals

#33714

@rsc (Contributor) commented Oct 28, 2020

EncodeToken seems like a nice analog of Decoder.Token.

I'm less convinced about ClosingToken. If the program doesn't know what syntax comes next, what business does it have encoding JSON token by token?

@rsc rsc moved this from Incoming to Active in Proposals Oct 28, 2020
@rogpeppe (Contributor, Author) commented Oct 29, 2020

Thanks for the review!

I'm less convinced about ClosingToken. If the program doesn't know what syntax comes next, what business does it have encoding JSON token by token?

I'm not deeply attached to ClosingToken either. I was trying to think through what might happen if this API is used across modules. The program should always produce valid syntax, but perhaps some component has produced an invalid token and not checked the error. My thought is that it would be nice to have some assurance that the encoded JSON is valid without parsing the whole thing again. Initially I thought of just a Complete() bool method, but an error message produced because that returns false wouldn't be very informative; it would be better to say something about how the value is incomplete. That's where the idea for ClosingToken came from: it provides that information, would be trivial to implement, and might have other uses too.

@dsnet (Member) commented Nov 2, 2020

What's the expected flushing behavior for this feature? Do we expect that every call to Encoder.EncodeToken results in at least one Write call to the underlying io.Writer?

I see several reasonable semantics:

  1. Every call to Encoder.EncodeToken results in the data being written immediately to the underlying io.Writer. However, this semantic would have significant performance implications.
  2. Encoder.EncodeToken only flushes to the underlying io.Writer whenever the internal buffer gets sufficiently full. In order to ensure the data is fully flushed, we provide an Encoder.Flush method.
  3. Encoder.EncodeToken flushes to the underlying io.Writer whenever the internal buffer gets sufficiently full OR a top-level JSON value has been completely encoded. For example, encoding the following sequence of tokens: [, "foo", 1, null, ] would only flush with the final ], which marks the end of the top-level JSON array.

In my opinion, option 3 is the most reasonable. It requires no new API and has decent performance. Tracking whether the top-level value is complete is logic the Encoder has to do anyway.

@ChrisHines (Contributor) commented Nov 2, 2020

I also vote for option 3. For structured logging, where each log record is typically a top-level JSON object, flushing on each object has nice properties.

@rogpeppe (Contributor, Author) commented Nov 3, 2020

I also agree that option 3 seems like the right approach, although it might also be useful to provide an explicit Flush method.

@rsc (Contributor) commented Nov 4, 2020

It seems like people are generally in favor of this (if accepted, it would be for Go 1.17). Do I have that right?

@dsnet (Member) commented Nov 11, 2020

I'm okay with this API, but I'd like to sanity check the behavior for Numbers.

The documentation currently says:

EncodeToken returns an error if ... a Number isn’t formatted correctly.

Since there are no Number constructors in the json package, formatting the number is left up to the user. The most obvious function to use is strconv.FormatFloat. I think it is an important property that, if strconv.FormatFloat is used, the formatted number is guaranteed to be a valid JSON number.

It seems to me that:

  • The e, E, f, g, and G format flags will always produce output that is a valid JSON number. (Technically, Go's grammar for a floating-point number is more expressive than JSON's grammar, but it seems that strconv.FormatFloat is implemented in a way that is compatible; e.g. no leading zeros and at least one digit after the decimal point).
  • The only exceptions are NaN and Infinity, but I think it's acceptable for those values to result in an encode error, as they already do (i.e., json.Marshal(math.NaN()) reports an error today).

Is that correct?

@rogpeppe (Contributor, Author) commented Nov 11, 2020

@dsnet That's a good call. I think you're right, but it would be good to have definite assurance here (and we could mention FormatFloat in the docs too).

@rsc (Contributor) commented Nov 11, 2020

Yes, strconv.FormatFloat produces valid JSON for finite numbers, except perhaps if you use prec=0 with 'f'.

@rsc (Contributor) commented Nov 11, 2020

Based on the discussion above, this seems like a likely accept.

@rsc rsc moved this from Active to Likely Accept in Proposals Nov 11, 2020
@rsc (Contributor) commented Nov 18, 2020

No change in consensus, so accepted.

@rsc rsc moved this from Likely Accept to Accepted in Proposals Nov 19, 2020
@rsc rsc changed the title from "proposal: encoding/json: add Encoder.EncodeToken method" to "encoding/json: add Encoder.EncodeToken method" Nov 19, 2020
@rsc rsc modified the milestones: Proposal, Backlog Nov 19, 2020
@dividinglimits commented Feb 5, 2021

Is there an appetite to change the function name to prevent stutter? I see that Encoder.Token() (for consistency with Decoder.Token()) is not possible, as that name is already defined and exported. I do not have a suggestion for a contextually appropriate alternative at this time.
