Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/json: Decoder.Token does not return an error for incomplete JSON #69449

Open
gazerro opened this issue Sep 13, 2024 · 8 comments
Open
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@gazerro
Copy link
Contributor

gazerro commented Sep 13, 2024

When dealing with incomplete JSON, the (*Decoder).Token method returns io.EOF at the end of the input instead of providing an appropriate error. For example, the following code does not produce an error:

https://go.dev/play/p/HHEwVkRCs7a

dec := json.NewDecoder(strings.NewReader(`[123,`))
for {
	_, err := dec.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
}

According to the documentation, Token guarantees that the delimiters [ ] { } it returns are properly nested and matched. However, in this example, [ is not properly nested and matched.

@seankhliao seankhliao changed the title json: Token does not return an error for incomplete JSON encoding/json: Decoder.Token does not return an error for incomplete JSON Sep 13, 2024
@timothy-king timothy-king added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 13, 2024
@timothy-king timothy-king added this to the Backlog milestone Sep 13, 2024
@timothy-king
Copy link
Contributor

CC @rsc, @dsnet, @bradfitz, @mvdan.

@adonovan
Copy link
Member

adonovan commented Sep 13, 2024

It's not obvious to me that this is a bug. The doc comments state that Token returns (nil, EOF) at the end of the stream, which it does; and that Token guarantees that the delimiters it returns are properly nested and matched, which it does too, to the extent that you'll never see a mismatched or premature close bracket. The actual error in this case is that we ran out of input (for which EOF is appropriate), not that there was something wrong with the input we received, so it would be strange for it to report a syntax error, or to synthesize close bracket tokens that weren't present.

What would you have it do?

@dsnet
Copy link
Member

dsnet commented Sep 14, 2024

doc comments state that Token returns (nil, EOF) at the end of the stream

The question comes down to: what is a stream?

The docs for json.Decoder.Token defines a "stream" as:

The input stream consists of basic JSON values —bool, string, number, and null—along with delimiters [ ] { } of type Delim to mark the start and end of arrays and objects.

This is unfortunately problematic. The term "value" according to RFC 8259 implies a complete object or array (i.e., including the paired terminating ']' or '}'), but the prose in the latter half of the sentence is describing what RFC 8259 would actually call a "token".

Essentially:

  • If Decoder is reading a stream of JSON values, then io.ErrUnexpectedEOF is correct.
  • If Decoder is reading a stream of JSON tokens, then io.EOF is correct.

The "github.com/go-json-experiment/json" module reports io.ErrUnexpectedEOF since it defines jsontext.Decoder as decoding a stream of top-level "values". In all my own usages of jsontext.Decoder, this is the semantic I would have wanted. While I'm parsing a JSON value token-by-token, I'm often still operating under the assumption that the entire value is valid.

@gazerro
Copy link
Contributor Author

gazerro commented Sep 14, 2024

The same example with XML returns the error XML syntax error on line 1: unexpected EOF:

https://go.dev/play/p/wCvkSOlqeH0

dec := xml.NewDecoder(strings.NewReader(`<values>123`))
for {
	_, err := dec.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
}

Therefore, I notice an inconsistency in the behavior of Decoder.Token methods between encoding/json and encoding/xml. This inconsistency doesn't seem justified by the differences between JSON and XML.

I believe the correct behavior would be to return an io.ErrUnexpectedEOF error when encountering incomplete input.

@adonovan
Copy link
Member

adonovan commented Sep 14, 2024

I believe the correct behavior would be to return an io.ErrUnexpectedEOF error when encountering incomplete input.

ErrUnexpectedEOF is appropriate for the case when the EOF occurs in the middle of a token, such as an unclosed string literal, or in the middle of "null", "true", or "false"; and indeed that's what the decoder does. But Token promises to deliver tokens, and a missing close bracket is not a token-decoding error.

The input stream consists of basic JSON values —bool, string, number, and null—along with delimiters...

@dsnet: This is unfortunately problematic. The term "value" according to RFC 8259 implies a complete object or array (i.e., including the paired terminating ']' or '}'), but the prose in the latter half of the sentence is describing what RFC 8259 would actually call a "token".

The quoted sentence says "basic values", and immediately defines them as the elementary, single-token values; I don't think one can argue that value is meant here in its general sense. So Token returns a stream of "basic values and delimiters".

@dsnet
Copy link
Member

dsnet commented Sep 14, 2024

immediately defines them as the elementary, single-token values

If this were true, then I would expect the parser to be handling the tokens as just a plain sequence of JSON tokens (i.e., only validating for the grammar for a JSON "token", but not the grammar for a JSON "value"). However, that's not quite how it behaves. Consider the following:

dec := json.NewDecoder(strings.NewReader(`{ "hello" }`))
for {
	tok, err := dec.Token()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(tok)
}

which prints:

{
hello
invalid character '}' after object key

The fact that we see invalid character '}' after object key implies that it's using a push-down state-machine to validate whether the sequence of tokens is valid according to the grammar for a JSON "value". The difference between io.EOF or io.ErrUnexpectedEOF at the very end comes down to whether the push-down state machine has an empty stack or not.

@gazerro
Copy link
Contributor Author

gazerro commented Sep 14, 2024

In Go 1.5 and earlier, the behavior of Decoder.Token in the encoding/xml package was identical to the current behavior in the encoding/json package. Proposal #11405 recommended changing this behavior, and the change was introduced via CL https://golang.org/cl/14315 in Go 1.6.

Given that the Decoder.Token method was added to the encoding/json package around the same time as the modification to encoding/xml, it seems like an unfortunate mistake that the behavior in encoding/json was not updated to reflect this change.

@adonovan
Copy link
Member

The fact that we see invalid character '}' after object key implies that it's using a push-down state-machine to validate whether the sequence of tokens is valid according to the grammar for a JSON "value". The difference between io.EOF or io.ErrUnexpectedEOF at the very end comes down to whether the push-down state machine has an empty stack or not.

Hmm, good point. Both the (documented) no-mismatched-paren rule and the (undocumented) object grammar enforcement you just mentioned are evidence that Token actually intends to return only valid, complete sequences of tokens, and so returning ErrUnexpectedEOF if the sequence is incomplete seems reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants