Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: encoding/json: add ability to unmarshal with offset information #16433

Closed
ajeddeloh opened this issue Jul 19, 2016 · 8 comments

Comments

Projects
None yet
6 participants
@ajeddeloh
Copy link

commented Jul 19, 2016

Problem

Currently there is no way to get offset information when unmarshalling json; once the json has been unmarshalled into an interface{}, the offsets which the values were derived from cannot be obtained.

This is needed to do detailed validation error reporting, such as when json is used as a configuration language and the source of the error should be reported with the error. For simple cases, such as when the json is invalid due to syntax errors or mismatched types, json.SyntaxError and json.UnmarshalTypeError will report the offset with the error. However for more complex cases, such as when the json is first completely unmarshalled then validated, there is no mechanism for getting the offset from which the unmarshalled value came.

Proposed solution:

Add a new type to encoding/json which can be unmarshalled into like an interface{} but also includes offset information.

type Node struct {
        Start    int  // The offset this value starts at
        End      int // The offset this value ends at
        KeyStart int // If this value is part of struct, the offset its key starts at
        KeyEnd   int // If this value is part of a struct, the offset its key ends at
        Value    interface{} // The value to unmarshal into, just as with unmarshalling to interface{}. 
}

The Value field would be:

bool, for JSON booleans
float64, for JSON numbers
string, for JSON strings
[]Node, for JSON arrays
map[string]Node, for JSON objects
nil for JSON null

I have written an implementation here:
ajeddeloh@60514ff

This change does not affect unmarshalling of other types, since it only takes affect when unmarshalling to a json.Node.

@ianlancetaylor ianlancetaylor added this to the Proposal milestone Jul 20, 2016

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2016

It seems to me this will only be useful if you litter your data structure with fields of type json.Node. That seems inconvenient. Useful for error checks but not the way you want your data to be represented in the long term.

Another possibility would be to define a new method, along the lines of UnmarshalJSON, that takes an offset as an argument. That lets the user of the package choose how to handle the offset.

Either way it's not clear to me this is worth the additional complexity in the standard library package.

@ajeddeloh

This comment has been minimized.

Copy link
Author

commented Jul 20, 2016

It seems to me this will only be useful if you litter your data structure with fields of type json.Node.

My intention was to have this an interface analogous to unmarshalling to an interface{}. The way I've been using it is to unmarshal twice, once into a struct with json tags and once into a json.Node, effectively producing two trees. If you walk down both trees in sync you can simultaneously access both the unmarshalled data for validation and the offsets from the Node tree. A nice side effect of this is you can access the original string that was decoded and use it to generate warnings (e.g. if you have a field representing file permissions and the user put a value in the wrong base like 0644 instead of 420, the decimal equivalent). Furthermore, because all the json gets decoded into the Node, you can check for things like unused fields.

Another possibility would be to have the Value field of the Node structure be the struct you want to unmarshal into, and also add a Children field (being either nil for leaves, map[string]Node for objects or []Node for arrays), but I wanted to get feedback on the simpler case first before implementing it. It's also unclear what the interface for this would look like? A separate UnmarshalWithOffsets function? Unmarshalling to a Node with a struct already set for Value?

Another possibility would be to define a new method, along the lines of UnmarshalJSON, that takes an offset as an argument. That lets the user of the package choose how to handle the offset.

This would require every type to implement the interface, which would lead to a huge amount of boilerplate code. I'm also not sure I follow how this would be used? I'm imagining something like:

type foo Foo
func (f *Foo) UnmarshalJSONOffset(data []byte, offset int) error {
    // record offset somewhere for later use
    foo f2 := foo(*f)
    if err := json.Unmarshal(data, f2); err != nil {
        return err
    }
    *f = Foo(f2)
}

But the problem is that when json.Unmarshal gets called the offsets it uses will be relative to the start of the data passed to UnmarshalJSONOffset, not the start of the entire json blob. I suppose you could add something like json.UnmarsalWithBaseOffset which could automatically add the offset for you, but that seems somewhat messy.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2016

Over in #16426 minux suggested a Validator interface that could accept offset information, which might also help here.

I don't understand your suggestion of unmarshaling twice. If you do that at top level you will just get the top level offset. If you do that at every level then you can keep track of the offset yourself. How would your suggestion work?

@ajeddeloh

This comment has been minimized.

Copy link
Author

commented Jul 21, 2016

If you do that at top level you will just get the top level offset.

When unmarshalling to a Node, its children (if any) are also Nodes (either via a map[string]Node or []Node). Thus every node in the tree is a Node and has offsets; you don't get just the top level offset.

I think I understand your idea better now. It would only save relative offsets for each element and leave computing absolute offsets to the user. Correct? I'm still of the opinion this would need a lot of boilerplate code and would force the user to either embed the offset information in their types or use a global to record the offsets.

Over in #16426 minux suggested a Validator interface that could accept offset information, which might also help here.

I like the Validator idea, but because it needs to validated while being unmarshalled, it cannot check for conflicts if two sections of the json depend on each other (i.e. one references the other). Because json.Unmarshal only returns one error[], it would be impossible to allow multiple validation errors to be returned. This also means you can't return warnings, since json parsing could stop after hitting the first error. Finally, it doesn't allow for the possibility of detecting extra keys. Providing an interface to get offset information from unmarshalling gives users more flexibility.

@pciet

This comment has been minimized.

Copy link
Contributor

commented Jul 22, 2016

Offset into strings for JSON is good debug or recovery information potentially.

{ "keyA": "blah", "keyB": "blah", (<- repeated many times) keyABCDEFGH": "blah", "keyABCEFGI": "blah", <- so on }

A bug: you get this instance of JSON once every two weeks with that missing quote. This would be a nightmare to track down with bad or no logging assuming a large dataset - getting the exact malformed JSON point makes knowing "hey a client sent this bad huge data set, they have a bug" provable.

The proposal is a superset of this though, an arbitrary matching of string character index to JSON data symbols for processing beyond just signaling error handling.

"If you look at the original JSON string, each map and slice in the unmarshaled struct and each of their elements and their values point to each of these character indexes" I think is what is being asked for.

A solution could be a method that holds onto the work by returning an independent map:

type JSONSymbol string
type StringOffset int64
func UnmarshalWithOffset(data []byte, v interface{}) (map[JSONSymbol]StringOffset, error)

Then you iterate through the map or directly lookup the JSON symbols you care about. The documentation would have to define the JSONSymbol encoding. Ordering could be part of the JSONSymbol encoding.

This is a specialized use of JSON. I think it would not be too difficult to write a third party implementation to do the work a second time (which leaves the standard library as is):

func OffsetsForUnmarshal(jsondata string, valid YourUnmarshalType) ([]string, map[string]int64)

Determining last character for an object is looking up next object first character, minus one.

@ajeddeloh

This comment has been minimized.

Copy link
Author

commented Jul 22, 2016

I experimented with a map solution as well, except it mapped from the addresses of the unmarshalled data to their offsets instead of mapping from strings to offsets. This avoids having to specify a special encoding for the JSONSymbol. I think this proposal is cleaner and more similar to the existing api, which is why I chose to propose it instead.

@robpike

This comment has been minimized.

Copy link
Contributor

commented Sep 12, 2016

This seems largely about validation, which had a proposal that was rejected (#16426), so it seems there is little need for this. Also, as you say, errors already do deliver offset information.

@robpike robpike closed this Sep 12, 2016

@crawford

This comment has been minimized.

Copy link

commented Sep 16, 2016

While I don't disagree that validation might be out of scope for the core library, I do think that this mechanism (or something similar) is necessary. To take the example of validation - to create that external validation package would require either support from the core JSON package or an alternate implementation of the unmarshaller. It seems largely a waste of time (and almost certainly error-prone) to rewrite the JSON unmarshaller just for the purpose of exposing more information to the consumer. Modifying the core library to expose this information, on the other hand, is a relatively small change and enables external packages to greatly extend the functionality.

@golang golang locked and limited conversation to collaborators Sep 16, 2017

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.