
encoding/json: revert "reuse values when decoding map elements" #39149

Closed
dsnet opened this issue May 19, 2020 · 20 comments
@dsnet (Member) commented May 19, 2020

Regression testing showed that roughly 50 targets would break with https://golang.org/cl/179337, which was submitted to address #31924.

Personally, I think https://golang.org/cl/179337 brings consistency between the handling of structs and maps, and I appreciate that. However, the old behavior is technically specified, as the documentation for maps says:

> To unmarshal a JSON object into a map, ... Unmarshal then stores key-value pairs from the JSON object into the map.

The documentation uses the term "stores", which matches the behavior of go1.14 and before. On the other hand, https://golang.org/cl/179337 (for go1.15) has functionally changed the behavior of Unmarshal such that it "merges key-value pairs from the JSON object into the map".

While ~50 targets isn't a huge number in the grand scheme of things, it is definitely on the higher end of what is usually acceptable for changes. If the existing user code were incorrectly using encoding/json, there would be more justification for the Go standard library to break it. However, that does not seem to be the case here, and I think we're bound by the Go1 compatibility agreement to preserve the old behavior.
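To make the "stores" versus "merges" distinction concrete, here is a minimal sketch (the Server type and its fields are invented for illustration) of the go1.14 behavior: struct fields merge with existing values, while a map element is replaced wholesale.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Server is a hypothetical type used only for this sketch.
type Server struct {
	Host string
	Port int
}

// decodeIntoStruct: fields absent from the JSON keep their old values
// (merge semantics for structs).
func decodeIntoStruct() Server {
	s := Server{Host: "localhost", Port: 8080}
	json.Unmarshal([]byte(`{"Port":9090}`), &s)
	return s
}

// decodeIntoMap: under go1.14 semantics the pre-existing element is
// discarded and a fresh zero value is stored (replace semantics for maps).
func decodeIntoMap() Server {
	m := map[string]Server{"a": {Host: "localhost", Port: 8080}}
	json.Unmarshal([]byte(`{"a":{"Port":9090}}`), &m)
	return m["a"]
}

func main() {
	fmt.Printf("struct: %+v\n", decodeIntoStruct()) // Host survives the decode
	fmt.Printf("map:    %+v\n", decodeIntoMap())    // Host is reset to ""
}
```

Under CL 179337, the map case would instead reuse the existing element and match the struct case.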

/cc @dmitshur @mvdan @ianlancetaylor

@seebs (Contributor) commented May 20, 2020

I was going to concede this (perhaps grudgingly), but my argument for it convinced me the other way again.

When describing how decode works with various types, the documentation mostly avoids saying "stores". For instance:

> To unmarshal JSON into a struct, Unmarshal matches incoming object keys to the keys used by Marshal (either the struct field name or its tag), preferring an exact match but also accepting a case-insensitive match. By default, object keys which don't have a corresponding struct field are ignored (see Decoder.DisallowUnknownFields for an alternative).

This doesn't actually explicitly state what happens when the keys match the keys used by Marshal, but we can reasonably infer that it stores them.

Of the specific types, only the map says that Unmarshal "stores" keys. And you draw a reasonable distinction between "store" and "merge".

But wait! What else, if anything, uses the word "stores"?

> Unmarshal parses the JSON-encoded data and stores the result in the value pointed to by v. If v is nil or not a pointer, Unmarshal returns an InvalidUnmarshalError.

Which is to say: The behavior currently described for each of these types is what Unmarshal means by "stores".

If we are "storing" a value into a map, with Unmarshal, we should be doing to it the same thing we would have done had it been the top-level parameter to Unmarshal. Which is to say that, for a pointer to a struct, we should do what we would do with that normally: Store the value into the thing-pointed-to (or allocate a new zero-valued one if the pointer is currently nil), and then follow the rules given for storing into a struct -- which overwrites explicitly provided fields, but won't zero out previously-existing non-zero fields.

So I think that if existing users expected "store" to mean "overwrite deleting the old contents" for a pointer-to-a-struct, they were wrong, and if they expected it to mean that only if the value was in a map, they were... still probably wrong, because unmarshal is recursive, so "storing" is still governed by the definition the function provides for what it means to "store" json values in existing values.
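The recursive reading of "store" sketched above matches what Unmarshal already does for a pointer field inside a struct; a minimal example with invented types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Inner and Outer are invented types for this sketch.
type Inner struct {
	A, B int
}

type Outer struct {
	Name string
	In   *Inner
}

// decodePointerField shows the recursive "store" rules for a struct
// containing a non-nil pointer: Unmarshal reuses the pointed-to value,
// overwriting only the fields the JSON explicitly provides.
func decodePointerField() Outer {
	o := Outer{Name: "kept", In: &Inner{A: 1}}
	json.Unmarshal([]byte(`{"In":{"B":2}}`), &o)
	return o
}

func main() {
	o := decodePointerField()
	fmt.Printf("%s %+v\n", o.Name, *o.In) // Name and In.A survive; In.B is set
}
```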

@dsnet (Member, author) commented May 20, 2020

Interesting line of reasoning: deriving a different definition of "store" than one would typically expect, since the word is already used elsewhere in the documentation to mean some form of merging. I'm inclined to agree with you and would prefer consistent behavior for structs and maps.

However, this doesn't change the fact that there's a decent amount of code that depends on the old behavior. There's a pretty high bar for changes that may break user code. I've been involved with Go release regression testing for about 4.5 years, and this change is up there for the highest number of breakages caused (that I can remember).

The other instance that I can remember that had more regression failures was the monotonic time change, which broke many targets relying on reflect.DeepEqual or == to compare timestamps. However, that change was justified because it provided indisputable benefit to Go users, and time.Time is documented as expecting users to use the Equal method. Thus, breaking all of those users was justified, since they were already doing something contrary to what is explicitly documented.

The arguments made for the monotonic time change don't quite apply here. First, while it is nice to unify the behavior of structs and maps, and this change benefits some users, it's neither a critical feature nor a critical bug fix. Second, the fact that there's even confusion and debate about the proper behavior means that it's unfair to tell affected users that they were holding it wrong on go1.14 and below, and thus break them.

@mvdan (Member) commented May 20, 2020

Thanks @dsnet for bringing this up. Could you clarify what those 50 cases look like? Why are they relying on json to delete data, when the package almost never does that?

I agree with @seebs that I've never read "stores" as "overwrites" or "replaces"; I personally think it's pretty clear from this line at the top of the API, like he said:

> Unmarshal parses the JSON-encoded data and stores the result in the value pointed to by v.

If we conclude that "store" is ambiguous, then we should conclude that most "merge" versus "replace" behavior is an implementation detail and not properly documented, which means we basically need to modify the docs to reflect Hyrum's law :)

I never imagined so many internal Google targets would break with this change, though. I think they reasonably fall under "made incorrect assumptions from the docs", so I do think they should be fixed. Whether or not 50 is too many is a bit subjective. For example, out of how many in total using json? You mention your experience over the past five years of releases, but I imagine the number of targets has also been increasing.

@as (Contributor) commented May 20, 2020

> If we are "storing" a value into a map, with Unmarshal, we should be doing to it the same thing we would have done had it been the top-level parameter to Unmarshal. Which is to say that, for a pointer to a struct, we should do what we would do with that normally: Store the value into the thing-pointed-to (or allocate a new zero-valued one if the pointer is currently nil), and then follow the rules given for storing into a struct -- which overwrites explicitly provided fields, but won't zero out previously-existing non-zero fields.

This is the reason I created #31924. We were pre-filling a large struct with default values and then calling Unmarshal to complete the structure after placing it into the map, only to find that the mapped result had a new structure take its place.

I suspect this was the first bug report because this particular issue is difficult to pin on Go itself. When you first encounter it, you tend to think it's a mistake somewhere in your code, especially when you look at the behavior of struct pointers directly. I ran into it twice before really investigating and finding out it was an inconsistency in the language itself.
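The failure mode from #31924 can be reproduced in a few lines (the Settings type and its fields are invented; the behavior shown is go1.14's, which the revert restores): the carefully pre-filled map element is silently replaced by a fresh zero value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Settings stands in for the "large struct with default values"
// described above; its fields are invented for the sketch.
type Settings struct {
	Retries int
	Verbose bool
}

// decodeWithDefaults pre-fills a map element with defaults, then calls
// Unmarshal to complete it. Under go1.14 semantics the pre-filled
// *Settings is replaced by a freshly allocated zero value.
func decodeWithDefaults(data []byte) Settings {
	m := map[string]*Settings{
		"svc": {Retries: 3, Verbose: true}, // defaults we hoped to keep
	}
	json.Unmarshal(data, &m)
	return *m["svc"]
}

func main() {
	// The JSON only mentions Verbose, yet Retries also falls back to 0.
	fmt.Printf("%+v\n", decodeWithDefaults([]byte(`{"svc":{"Verbose":true}}`)))
}
```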

@gopherbot commented May 20, 2020

Change https://golang.org/cl/234559 mentions this issue: Revert "encoding/json: reuse values when decoding map elements"

@mvdan (Member) commented May 20, 2020

I've created a revert CL, just for the sake of being ready to move quickly before the beta is out.

If we decide to go ahead with the revert, I think two follow-up steps should be done:

  1. Clarify the documentation about which cases merge with existing values (e.g. struct fields) versus which replace existing values (e.g. map elements)
  2. File a separate issue about making this element of the API more consistent in a potential future redesign of the encoding/json API
@dsnet (Member, author) commented May 21, 2020

> Could you clarify what those 50 cases look like? Why are they relying on json to delete data, when the package almost never does that?

The user code is calling json.Unmarshal on a partially populated object and depending on structs having merge semantics but maps having replacement semantics. Given that many users at Google have a background with protobufs, maybe (pure speculation) this expectation stems from protobufs, since they follow merge semantics for singular messages but replacement semantics for map entries.

As far as I can tell, no one seems to be depending on merge semantics due to duplicate keys in an object. If that were the case, then there would be greater justification to break them since RFC 7159 says that "when the names within an object are not unique, the behavior of software that receives such an object is unpredictable".
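For reference, the duplicate-key asymmetry can be observed directly (a sketch with invented types; RFC 7159 calls such documents unpredictable): encoding/json decodes key-value pairs in order, so duplicate keys merge into a struct field but clobber a map element under go1.14 semantics.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Point and Holder are invented for this sketch.
type Point struct{ X, Y int }
type Holder struct{ P Point }

// duplicateIntoStruct: the field P is decoded into twice, in place,
// so the two partial objects merge.
func duplicateIntoStruct() Point {
	var h Holder
	json.Unmarshal([]byte(`{"P":{"X":1},"P":{"Y":2}}`), &h)
	return h.P
}

// duplicateIntoMap: each occurrence of the key stores a fresh zero
// value (go1.14 semantics), so X from the first pair is lost.
func duplicateIntoMap() Point {
	m := map[string]Point{}
	json.Unmarshal([]byte(`{"p":{"X":1},"p":{"Y":2}}`), &m)
	return m["p"]
}

func main() {
	fmt.Println(duplicateIntoStruct()) // {1 2}
	fmt.Println(duplicateIntoMap())    // {0 2}
}
```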

> Whether or not 50 is too many is a bit subjective. For example, out of how many in total using json? You mention your experience over the past five years of releases, but I imagine the number of targets has also been increasing.

Definitely agree it's subjective, hence this issue to debate this :)

I'm not at liberty to reveal exactly how many usages of encoding/json exist, but I can say that encoding/json is not widely used at Google, since most services use protobufs (unsurprisingly). So if X% of all Go code at Google experiences issues with this change, then I suspect that a proportion greater than X% outside Google may experience this problem, since encoding/json seems to be much more widely used outside Google.

For some reference, I gave a talk back in 2017 about the regression testing that we do. Some things to note:

  • There's an increase of about 70 failures on 2016-10-20. I don't remember exactly what these were, but they were probably caused by several different CLs or a single CL that was sufficiently problematic and got rolled back.
  • In general, more than a dozen regression failures from a single CL is alarming.
  • While this is from 2017, and there is certainly more Go code at Google, ~50 failures from a single CL is still very notable.

> potential future redesign of the encoding/json API

+1, would love to see more consistent behavior in some hypothetical future v2 encoding/json.

@as (Contributor) commented May 21, 2020

I think it's a shame that 50 internal failures can justify solidifying such an inconsistency in the API. I propose that we somehow evaluate whether external open source code relies on this behavior, because that would give us perspective on whether this is the result of a ubiquitous misunderstanding that many users have come to rely on, or just the product of a few internal users at Google writing code in the same way.

@dsnet (Member, author) commented May 21, 2020

> I think it's a shame that 50 internal failures can justify solidifying such an inconsistency

If you watched my talk, this has nothing to do with Google vs outside Google.

The Go project has a high standard of trying not to break people's code (regardless of what company it belongs to). It is intended to be a language platform that people can depend on and we need to judiciously make informed decisions that evaluate the costs of making a change if any breakages are known (which is definitely the case here). Sometimes you break people if it is justified, sometimes you don't and bite your own tongue.

Google's suite of tests happens to be a decent representation of all the Go code out there in the world. We know it's not completely representative, but it's far better than nothing and pure speculation. Google funds the Go project, and it's a blessing to have the resources to even perform such regression testing. However, that regression testing is unsurprisingly geared towards Google's internal codebases, since we're leveraging existing platforms for practical reasons.

> I propose that we somehow evaluate whether external open source code relies on this behavior because

If you have access to a massive suite of open-source projects and can run the tip toolchain on millions of packages, I'd love to see that data. Otherwise, it's not helpful to object to concrete data that's been presented by proposing something that's nice in theory but currently infeasible in practice.

@as (Contributor) commented May 22, 2020

The position of consistency is always easier to defend than correctness, especially in a language that wants to be backwards-compatible. However, I believe that baking in compatibility based on very open assumptions should be justified on firmer grounds, especially when the behavior is seen by many users as a bug and good arguments have been presented for fixing it.

We are making a decision based on the consistency of the entire encoding/json package vs. 50 test failures from dubious packages at Google. The latter affects 50 targets; the former affects the entire package existentially and applies to all future users of the package. Should consumers' interpretation of a feature dictate its underlying implementation compatibility?

I don't know, because I can't see the packages that failed, or understand why they failed in the first place. If some other organization complained about 50 test failures, we would certainly want to see the code that led us to that point. Which is why I believe this actually is a Google vs. outside-Google thing: if some other company had these test failures, we would want to see their code and determine whether their use case was compelling enough to justify retaining a bug in the standard library.

@dsnet (Member, author) commented May 22, 2020

> The position of consistency is always easier to defend than correctness,

I disagree that correctness is well-defined here. I certainly agree that the change is more consistent, and I have stated support for it multiple times in this issue already. However, consistency != correctness, especially when the documentation has not been clear on this point.

> 50 test failures from dubious packages at Google.

It's evident that you don't trust Google. Can you provide evidence that this isn't a problem in open source? My stance is the same regardless of where the failures are occurring (Google or not). I've already stated that I'm more concerned about the overall Go ecosystem.

> affects the entire package existentially and applies to all future users of the package.

You're preaching to the choir. I maintain many widely used packages, and it would be incredibly convenient if I could arbitrarily break users for the sake of a better future, especially when they are depending on undefined behavior. However, sometimes your hands are tied if a sufficient number of users do things contrary to what you desire.

> Should consumers' interpretation of a feature dictate its underlying implementation compatibility?

Yes. What users do with a library should (for better or worse) have an effect on what changes are made by the authors of that library. Certainly you may have a different stance, but the Go project holds to a high standard of not breaking users who are not doing something clearly wrong. The encoding/json package is part of the Go project, so it adheres to that standard. In this particular case, there is no simple workaround that manually preserves the old semantics in light of this change, so the existing user code would be broken in a very inconvenient way.

If you don't care what your users do with your library, then release it at v0 and make changes as you see fit. In fact, the Go standard library did exactly that for many years, from 2009 to 2012. We're at v1 now, and we need to take backwards compatibility seriously.

@as (Contributor) commented May 22, 2020

> It's evident that you don't trust Google

This is not what I was trying to convey, but based on this I feel the discussion has become unproductive, so I will leave it to others to discuss the topic instead.

@seebs (Contributor) commented May 22, 2020

I don't think correctness is clearly defined, but I think that the interpretation of store-as-recursive is clearly better.

It's not really a question of whether it's Google, specifically; the problem is that while I am now aware that there exist 50 packages which are affected, I don't know the denominator for that numerator, and I don't have even one example I can look at. So I don't know what kind of breakage this is. Maybe it could be fixed by changing maps to be maps of foo instead of maps of *foo. Maybe it actually doesn't affect real code, and it just affects the exact DeepEqual output being checked by a test case. I have no way of knowing.

It seems like it would help a lot with evaluating things like this if at least some of the test cases were available to look at. Ideally, I'd want to talk to someone whose code was affected and find out whether it was intentional, whether they expected the behavior, etcetera.

If nothing else, perhaps this would be a good place to put in a deprecation note. Go has previously done release notes with "we're going to change X for consistency" type things and that seems to have worked okay.

@toothrot (Contributor) commented May 26, 2020

Hello! This is one of the few remaining issues blocking the Beta release of Go 1.15. We'll need to make a decision on this in the next week in order to keep our release on schedule.

@mvdan (Member) commented May 26, 2020

I share what @seebs said in #39149 (comment). It's hard for me to properly understand what kind of regressions we're seeing, if I can't see what the code was doing or why.

Having said that, like @dsnet says, we just don't have a large corpus of third party Go tests using encoding/json to counter-balance this breakage, or to argue that we have any evidence of our own. Proving that this regression isn't a problem in open source is a question that's hard to answer well, but we don't even have a partial answer today.

I think at best we're evenly split on whether or not this breakage is OK given the Go1 compatibility guarantee. Plus, the beta is just a few days away. So I think we should revert, even if that feels like wasted effort on my part. The revert CL is ready, so if @dsnet wants to give it a review, it can be merged very soon.

Once merged, I'll do the two follow-up steps outlined in #39149 (comment). Those will still help improve the json package in the long term, at least.

@andybons (Member) commented May 26, 2020

As a note, reverting this change does not mean this discussion has to end.

Thanks for handling the revert and follow-up items, @mvdan.

@mvdan (Member) commented May 26, 2020

Indeed, we could try again for 1.16, though I don't think that's a good idea unless such a "corpus of json tests" is constructed.

@mvdan (Member) commented May 28, 2020

A couple of days have elapsed, and there are only two working days left until the beta, so I'll go ahead and merge. Thanks @andybons for the review.

If someone strongly feels about giving this another try in 1.16, please file a new issue.

@dmitshur (Member) commented May 28, 2020

We've looked over this issue in a release meeting. Based on the discussion above, though it was a close call in either direction, the decision we're making is to proceed with a rollback, which is implemented in CL 234559. I'll update the issue state to NeedsFix to reflect that.

Thank you for your diligent work on this, @mvdan, and for the original report, @dsnet. The encoding/json package is in a mature state, so, while unfortunate, it can happen that we learn more about an attempted improvement only after the implementation work has already been done. Hopefully this will help inform any future attempts.

@dmitshur dmitshur added NeedsFix and removed NeedsDecision labels May 28, 2020
@gopherbot gopherbot closed this in 107ebb1 May 28, 2020
@mvdan (Member) commented May 28, 2020

Huh, comment race condition :)
