Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: encoding/json: memoize strings during decode? #32779

Open
rsc opened this issue Jun 25, 2019 · 17 comments · May be fixed by #33416
Labels
Milestone

Comments

@rsc
Copy link
Contributor

@rsc rsc commented Jun 25, 2019

Part of the motivation for #32593 is the observation in json-iterator/go#376 that when you json decode, many of the same strings appear over and over in the JSON and get copied and reconstructed in the result over and over as well.

We do not want to make the JSON coalesce all the allocated strings into one giant buffer, because then holding on to just one of them holds onto the entire thing.

But it might make sense to add to the decoder state a simple map[[]byte]string and remember which specific byte slices we've already converted to string and reuse those allocated strings instead of doing the same conversions again in a particular Decode or Unmarshal operation.

It's unclear to me whether the map in a Decoder should persist between Decode operations. Probably? That will affect users who put Decoders in a pool or something like that, though. (Solution: don't put decoders in pools.)

@bserdar

This comment has been minimized.

Copy link

@bserdar bserdar commented Jun 25, 2019

A while ago I wrote some code to test this exact same thing with the json decoder, but could not detect any significant difference between the memory usage of the decoder with memoized strings and the regular json decoder, which made me think maybe this is already done somewhere? Apparently not. I can try to find out my test code...

@bserdar

This comment has been minimized.

Copy link

@bserdar bserdar commented Jun 25, 2019

Here's a link to my test code that does this for decoding json into interface{}:

https://github.com/bserdar/jsdecodertest

My initial comment was wrong, it does make a difference to store strings in a map. My test code stores only field keys in a map because those are the strings that are likely to repeat.

The json/ package is a copy of the stdlib encoding/json package modified to include a map[string]string in the decoder state.

@martisch

This comment has been minimized.

Copy link
Member

@martisch martisch commented Jun 26, 2019

Nit: I assume the proposal meant to use a map[string]string with map[string([]byte)] lookup as map[[]byte]string is not a supported map type.

@mvdan

This comment has been minimized.

Copy link
Member

@mvdan mvdan commented Jun 26, 2019

Also cc @josharian, who had a very similar idea a while ago. That was briefly discussed in #5160.

As a thought, we probably don't want to reuse all strings. For example, in a map the keys probably tend to repeat much more often than the values.

Also, do we want to make this configurable? Someone who writes a program likely knows whether they'll read somewhat static/repeated data or not.

bserdar added a commit to bserdar/go that referenced this issue Aug 1, 2019
The existing decoder creates a new string for every key.
With this change, decoder keeps keys in a map[string]string
and uses a single copy for repeated keys.

Fixes golang#32779
@bserdar bserdar referenced a pull request that will close this issue Aug 1, 2019
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Aug 1, 2019

Change https://golang.org/cl/188500 mentions this issue: encoding/json: memoize keys during decode

@rsc

This comment has been minimized.

Copy link
Contributor Author

@rsc rsc commented Aug 6, 2019

@mvdan, are you concerned about the overhead of the map lookup in your Jun 26 comment? From the outside there can be no possible downside to reusing all strings that are constructed.

bserdar added a commit to bserdar/go that referenced this issue Aug 15, 2019
The existing decoder creates a new string for every key.
With this change, decoder keeps keys in a map[string]string
and uses a single copy for repeated keys.

Fixes golang#32779
@bserdar

This comment has been minimized.

Copy link

@bserdar bserdar commented Aug 27, 2019

The CL mentioned above has an implementation of this for JSON decoder. It introduces some overhead when decoding to an interface{}, because that's the only case where the keys read during decoding will be kept after decoding is done.

However, I think there is a better way to do this. What I propose is to write an encoding/LiteralStore type (a map[string]string) containing a Get(string) string method to retrieve the shared instance of the string. Then this can be used from the decoders in encoding/xml and encoding/json. To use the LiteralStore, the caller has to assign an instance to the JSON/XML decoder. This way, the caller has the option to use a literal store for large input where it makes a difference, use the store for multiple docs, or avoid it completely.

What do you think?

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Aug 28, 2019

This is one of the conventional use-cases for weak references in GC'd languages.

Ideally, we should reuse strings that remain reachable from the last Decode invocation, and let the collector clean up any strings that have since become completely unreachable. That way, the frequently-used strings will remain in the map, but the infrequent strings can still be collected (so that the map doesn't grow without bound if the same Decoder is used on many different inputs over time).

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Aug 28, 2019

Alternatives that do not require weak references include:

  • Make the interning behavior an explicit opt-in using a Decoder method (perhaps func (*Decoder) InternStrings(func([]byte)bool)).
  • Make the interning behavior an explicit annotation on the struct field tag.
    • But this gets awkward to describe the behavior for map keys vs. map values.
@bserdar

This comment has been minimized.

Copy link

@bserdar bserdar commented Aug 28, 2019

@bcmills, I propose to make the interning behavior explicit opt-in using a Decoder field instead of a method. After reading #32593, I believe one way to make this acceptable in terms of performance and usage is this:

  • Add a strings/LiteralStore type (a map[string]string), with an Add(string) string method that will return a shared copy for string
  • Add Decoder.KeyLiterals field of type LiteralStore to both json and xml decoders.
  • Use the KeyLiterals if it is non-nil during decoding to store keys

This implementation does not change the existing behavior and makes the use of a literal store explicit opt-in.

A literal store like this is only useful for larger json/xml input, and only if it is manually decoded, or unmarshaled into an interface{}. If the input is unmarshaled/decoded into a struct, then there is no need to keep the keys as they will be lost.

Re: weak references, afaik there is no way to implement this using finalizers, am I wrong?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 28, 2019

@bserdar I don't think there is any way to do this with finalizers because existing users will be using string values, not pointers, so there is nothing to hang a finalizer on. For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.

In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.

@bserdar

This comment has been minimized.

Copy link

@bserdar bserdar commented Aug 28, 2019

@ianlancetaylor thanks for the clarification.

If there is interest in this, I can modify https://golang.org/cl/188500 to implement this. I already have some wrapper code around json/xml decoder to do this because reusing just the key strings makes a lot of difference for large docs.

cc @mvdan

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Aug 29, 2019

For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.

No? We would need some sort of language-supported “weak string” type, but then only the json.Decoder would hold a weak reference — the decoded messages would use a regular string so that the data doesn't disappear out from beneath them.

(But that assumes a language change that seems pretty unlikely to ever happen.)

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Aug 29, 2019

In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.

Ah, I think I see what you're saying. If we could control the types in use, then we could add a pointer indirection, and require users to keep the pointer alive in order to retain the string value in the decoder's cache.

But then you can't use those pointers in the same way as ordinary strings: they can't be map keys, they won't work with the normal built-in string operations, and they can't be passed to the strings package without falling back to manual memory management (via runtime.KeepAlive).

It may be true that they are technically equivalent in power, but they are not equivalently usable.

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Aug 29, 2019

@bserdar, strings.LiteralStore seems unfortunate given that we're also discussing adding generics to the language, because there is a much more general “interning table” API that can be used with any type that supports hashing and equality.

The general form could have an API like:

package intern

type Table(type T HashEqualer) […]

func (t Table(T)) Get(x T) (canonicalX T) {
	[…]
}

Such an API is also useful for building things like abstract syntax trees, where it can be useful to compress trees by coalescing occurrences of a common subexpression to the same canonical instance.

(A long time ago I maintained a server that evaluated large numbers of boolean expressions for ad targeting, and it used a similar approach — to great effect — to compress the targeting expressions, including in our wire protocol.)

@bserdar

This comment has been minimized.

Copy link

@bserdar bserdar commented Aug 29, 2019

@bcmills, interning table would be nice, but it won't be available any time soon. Until that becomes available, we can keep interning internal to the Decoder, but expose an InternKeys(bool) method for explicit opt-in. Or, implement the strings.LiteralStore, and change it to use the intern package when generics are done.

bserdar added a commit to bserdar/go that referenced this issue Sep 7, 2019
The existing decoder creates a new string for every key.
For large JSON input where keys are repeated, we can avoid
this by interning keys. This change adds a Decoder.InternKeys
function that causes the decoder to store keys in a
map[string]string and uses a single copy for repeated keys.
The default behavior is not to intern keys.

This change only affects Decoder when decoding to an interface{}
or, when tokenizing the input. It does not affect Unmarshal, or
decoding to a struct.

Fixes golang#32779
bserdar added a commit to bserdar/go that referenced this issue Sep 11, 2019
The existing decoder creates a new string for every key.
For large JSON input where keys are repeated, we can avoid
this by interning keys. This change adds a Decoder.InternKeys
function that causes the decoder to store keys in a
map[string]string and uses a single copy for repeated keys.
The default behavior is not to intern keys.

This change only affects Decoder when decoding to an interface{}
or, when tokenizing the input. It does not affect Unmarshal, or
decoding to a struct.

Fixes golang#32779
@bserdar

This comment has been minimized.

Copy link

@bserdar bserdar commented Oct 2, 2019

xml.Decoder can benefit from interning strings as well. What is the opinion about that? Decoder.InternKeys() would be the explicit opt-in for interning keys during decode. Something similar to this: https://go-review.googlesource.com/c/go/+/188500

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.