Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/json: add line number to SyntaxError #43513

Open
AlexanderYastrebov opened this issue Jan 5, 2021 · 17 comments
Open

encoding/json: add line number to SyntaxError #43513

AlexanderYastrebov opened this issue Jan 5, 2021 · 17 comments

Comments

@AlexanderYastrebov
Copy link
Contributor

@AlexanderYastrebov AlexanderYastrebov commented Jan 5, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.6 linux/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

func main() {
	dec := json.NewDecoder(os.Stdin)
	for {
		var m map[string]interface{}
		if err := dec.Decode(&m); err == io.EOF {
			break
		} else if err != nil {
			fmt.Printf("%#v\n", err)
			break
		}
	}
}
$ echo -e '{"x":\ny}' | go run jsonline.go 
&json.SyntaxError{msg:"invalid character 'y' looking for beginning of value", Offset:7}

At the moment SyntaxError provides an offset of the character that caused parsing error.
The proposal is add the line number as well.

I could not find such an issue therefore opened this one, please close if it was already resolved.

@gopherbot gopherbot added this to the Proposal milestone Jan 5, 2021
@mvdan
Copy link
Member

@mvdan mvdan commented Jan 5, 2021

Could you expand a bit on why the line number would help? Naively, an offset is enough to work out the line number if you were parsing a chunk of bytes, a buffer, or a file. Just count the number of newlines up to the offset.

If you were parsing a stream of data, then I'm not convinced that a line number would actually help anyone. A line number is usually helpful for a human to go look at the input in the right place, but that is not possible if the stream isn't saved somewhere.

Keeping track of extra information like line numbers also has an inherent cost, so I want to make sure the arguments in favor are solid.

@AlexanderYastrebov
Copy link
Contributor Author

@AlexanderYastrebov AlexanderYastrebov commented Jan 5, 2021

I agree, it should be possible to count lines having input at hand and without input it does not makes much sense.
I thought about it as of convenience feature but (after I've attempted to implement it) I am not sure its worth the cost of implementing/maintaining.

The proposal is coming from the usecase where parsing module does not have input at hand but bubbles up the error kubernetes/kubernetes#97651

@mvdan
Copy link
Member

@mvdan mvdan commented Jan 5, 2021

encoding/json's syntax errors aren't very end-user-friendly, so it's entirely expected that a user-facing piece of software would wrap them or decorate them to better show the user what's going on. Calculating the line number if there is a syntax error should take just a couple of lines of code, so I agree it's not worth adding straight to SyntaxError.

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 6, 2021

It would be nice to have a type LineCountReader in io. It's easy to write yourself, but seems like a common utility.

@AlexanderYastrebov
Copy link
Contributor Author

@AlexanderYastrebov AlexanderYastrebov commented Jan 6, 2021

@networkimprov As far as I can tell line counter would not help in this case due to buffering around here

func (dec *Decoder) refill() error {
// Make room to read more into the buffer.
// First slide down data already consumed.
if dec.scanp > 0 {
dec.scanned += int64(dec.scanp)
n := copy(dec.buf, dec.buf[dec.scanp:])
dec.buf = dec.buf[:n]
dec.scanp = 0
}
// Grow buffer if not large enough.
const minRead = 512
if cap(dec.buf)-len(dec.buf) < minRead {
newBuf := make([]byte, len(dec.buf), 2*cap(dec.buf)+minRead)
copy(newBuf, dec.buf)
dec.buf = newBuf
}
// Read. Delay error for next iteration (after scan).
n, err := dec.r.Read(dec.buf[len(dec.buf):cap(dec.buf)])
dec.buf = dec.buf[0 : len(dec.buf)+n]
return err
}

i.e. the parsing error might not necessary happen after last new line read.

@mvdan
Copy link
Member

@mvdan mvdan commented Jan 6, 2021

Indeed, most parsers/decoders read in big chunks, because doing one read for every single byte consumed would be incredibly expensive. So a io.Reader layer like @networkimprov suggests won't work.

@mvdan mvdan changed the title proposal: encoding/json: add line number to SyntaxError encoding/json: add line number to SyntaxError Jan 6, 2021
@mvdan mvdan removed this from the Proposal milestone Jan 6, 2021
@mvdan
Copy link
Member

@mvdan mvdan commented Jan 6, 2021

I don't think this needs the attention from the proposal review committee. They already have a backlog of hundreds of proposals as it is. We can make this decision amongst the encoding/json maintainers; CC @dsnet @rsc @bradfitz.

My opinion is still that we should reject this idea. If anyone has thoughts on a helper API to "prettify" json errors for end users, perhaps we could consider that, but one could always write that in a few lines of simple Go code.

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 6, 2021

Well two possible APIs are

dc := json.NewDecoder(fd)

dc.SetOpts(&json.DecodeOpts{...}) // applies to all Decode()s
err := dc.Decode(&v)

err := dc.DecodeOpts(&v, &json.DecodeOpts{...})

I'd use this, as my apps import long (hundreds of lines) test sequences from JSON files.

@AlexanderYastrebov
Copy link
Contributor Author

@AlexanderYastrebov AlexanderYastrebov commented Jan 6, 2021

Line counting without input at hand may be implemented like:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"sort"
)

type (
	lineCounter struct {
		r       io.Reader
		d       *json.Decoder
		bytes   int64
		lines   int
		offsets []offset
	}

	offset struct {
		index  int
		offset int64
	}
)

func (c *lineCounter) Read(data []byte) (n int, err error) {
	// remove newline offsets less than input offset
	i := c.search(c.d.InputOffset())
	c.offsets = c.offsets[i:]

	n, err = c.r.Read(data)
	for i, b := range data[:n] {
		if b == '\n' {
			c.offsets = append(c.offsets, offset{c.lines, c.bytes + int64(i)})
			c.lines++
		}
	}
	c.bytes += int64(n)
	return
}

func (c *lineCounter) newlinesBefore(offset int64) int {
	if len(c.offsets) == 0 {
		return c.lines
	}
	i := c.search(offset)
	if i == len(c.offsets) {
		return c.lines
	}
	return c.offsets[i].index
}

func (c *lineCounter) search(offset int64) int {
	return sort.Search(len(c.offsets), func(i int) bool { return c.offsets[i].offset >= offset })
}

func main() {
	c := &lineCounter{r: os.Stdin}
	dec := json.NewDecoder(c)
	c.d = dec

	for {
		var m map[string]interface{}
		if err := dec.Decode(&m); err == io.EOF {
			break
		} else if se, ok := err.(*json.SyntaxError); ok {
			fmt.Printf("%#v\n", err)
			//fmt.Printf("%#v\n", c)
			fmt.Printf("Line: %d\n", c.newlinesBefore(se.Offset)+1)
			break
		}
	}
}

Not trivial but does not require Decoder change.

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 6, 2021

I updated my previous comment to suggest encoding/json type DecodeOpts struct {...}

@dsnet
Copy link
Member

@dsnet dsnet commented Jan 6, 2021

Line counting (given the entire input buffer as jsonData) from the offset is as simple as:

line := 1 + strings.Count(jsonData[:offset], "\n")
column := 1 + offset - (strings.LastIndex(jsonData[:offset], "\n") + len("\n"))

Given it's simplicity, it calls into question whether we must provide this feature. The main advantage of having encoding/json handling it is that it would work for io.Reader inputs as well, but it doesn't seem that difficult to copy the entire input to a []byte so that you have access to the original input after the fact.

Stepping back, I personally, have no problem with the concept of adding line:column information and even explored an implementation last month where that's provided [1]. However, as @mvdan mentioned earlier, the issue is the performance cost it imparts onto the Decoder implementation. It's one thing to advance through all whitespace characters, it's another to also count the number of newlines that occurs. I don't remember the numbers, but there was a non-trivial performance cost for the common case where the line:column information is not even used.

dc.CountLines(true)

I don't think another API endpoint just for more verbose errors justifies its addition. Also, it's not clear to me that this fully avoids any performance slow down. I've been regularly surprised by how simply checking a flag for whether some branch should be taken itself incurs a performance slow down.

Given the strong desire that we avoid slowing down decoding in the common case, I propose two possible approaches:

  1. We add an example (or helper function) with the snippet above that shows how easy it is to manually compute line:column information.
  2. Provide line and column information in SyntaxErrors only if the entire []byte input is provided since it's trivial to compute this after the fact. However, we don't do it for inputs reading from an io.Reader since we want to avoid buffering the entire input. However, I'm not entirely fond of the inconsistent behavior between []byte and io.Reader inputs.

[1] This is in reference to an entirely new Decoder implementation that is an order of magnitude faster than the current one. See my comment elsewhere. Today's implementation is slow enough that the cost of line-counting is probably overshadowed by other inefficiencies. However, the existence of other inefficiencies today should not be justification for adding a performance cost that can never be removed since it leaks to the API.

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 6, 2021

When reading a file or network message, you almost always use json.Decoder, not json.Unmarshal. It would be crazy in many cases to save the whole stream to a buffer in case of error.

I don't think a performance cost is unreasonable when the feature is triggered by an API. When you really can't afford the cost, you simply don't incur it. Note that I suggested an extensible API which can accommodate other decoding options. Maybe accept_comments? :-)

@dsnet
Copy link
Member

@dsnet dsnet commented Jan 6, 2021

When you really can't afford the cost, you simply don't incur it.

That's not quite true. We still need to check a flag for whether the feature is enabled or not and checking the flag itself is a non-zero cost.

Note that I suggested an extensible API which can accommodate other decoding options.

This is outside the scope of this specific issue and has been explored quite a bit in other issues related to encoding/json. It's likely that the json API will evolve towards an options struct, but there's a lot of other issues that need to be considered simultaneously.

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 7, 2021

Worst case, you check the flag once per Decode(), to select a parsing function, no?

Is there an issue proposing a decoding options type? If not, this could be turned into one.

What other decoding options have been requested to date?

@mvdan
Copy link
Member

@mvdan mvdan commented Jan 7, 2021

  1. We add an example (or helper function) with the snippet above that shows how easy it is to manually compute line:column information.

I agree with this solution, if we want a solution.

What other decoding options have been requested to date?

https://github.com/golang/go/issues?q=is%3Aissue+is%3Aopen+encoding+json+in%3Atitle+label%3AProposal is a good start.

Is there an issue proposing a decoding options type? If not, this could be turned into one.

Personally, I don't think a GitHub thread is a good place to bikeshed rethinking the encoding/json options API. I think this thread should stay on topic.

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 7, 2021

The only solution suitable for long json streams is the one Alexander suggested here: #43513 (comment). It's not very easy.

Perhaps we should tag this as ProposalHold; to be considered when a DecodeOptions type emerges.

@dsnet
Copy link
Member

@dsnet dsnet commented Jan 8, 2021

The only solution suitable for long json streams

Providing line:column information for large inputs is a weak use case.

The primary purpose of line:column position information is for human consumption. By nature, line:column information is a poor mechanism for programmatically identifying the location of an error since it fundamentally requires O(n) processing to count the number of preceding lines. In contrast, a byte offset is always O(1) since it identifies exactly where in the input a problem occurred and it's suitable for machine consumption as you can always accurately convert it to a line:column position. Thus, line:column information is generally only useful for inputs that are only a few KiBs in size. Even many text editors today start choking when trying to open a text file that are a few MiB in size.

Worst case, you check the flag once per Decode(), to select a parsing function, no?

That technique assumes that there is a specialized parsing function for each possible combination of features. Since combinations are fundamentally O(n!), this incurs significant implementation complexity (and associated binary bloat). As someone who would potentially be responsible for maintaining encoding/json for the long-run, I'm fairly opposed to over-specialization.

Perhaps we should tag this as ProposalHold; to be considered when a DecodeOptions type emerges.

I don't believe that is necessary. It presumes that an API just to opt into logic for line-number tabulation is the direction we want to head. Personally, I don't think an option just to opt-into line:column tabulation is justified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants