
proposal: encoding/binary: cache dataSize across invocations of Read and Write #34471

Closed
lmb opened this issue Sep 23, 2019 · 13 comments

@lmb
Contributor

lmb commented Sep 23, 2019

As of current tip (dfbc9c83a910c79cb3cc34dbfaed3c436e1b6ecb), using binary.Read to repeatedly unmarshal a struct duplicates a lot of work. This is because the size of the struct is re-computed on every invocation, which incurs a lot of reflection work.

You can observe this by looking at a CPU profile of the ReadStruct benchmark from the standard library:

(pprof) top binary.Read -cum
Active filters:
   focus=binary.Read
Showing nodes accounting for 460ms, 43.40% of 1060ms total
Showing top 10 nodes out of 48
      flat  flat%   sum%        cum   cum%
         0     0%     0%      990ms 93.40%  std/encoding/binary.BenchmarkReadStruct
      20ms  1.89%  1.89%      990ms 93.40%  std/encoding/binary.Read
         0     0%  1.89%      990ms 93.40%  testing.(*B).launch
         0     0%  1.89%      990ms 93.40%  testing.(*B).runN
         0     0%  1.89%      640ms 60.38%  std/encoding/binary.dataSize
      90ms  8.49% 10.38%      640ms 60.38%  std/encoding/binary.sizeof
      20ms  1.89% 12.26%      490ms 46.23%  reflect.(*rtype).Field
      60ms  5.66% 17.92%      360ms 33.96%  reflect.(*structType).Field
     130ms 12.26% 30.19%      300ms 28.30%  std/encoding/binary.(*decoder).value
     140ms 13.21% 43.40%      250ms 23.58%  runtime.mallocgc

Here, binary.dataSize accounts for almost 60% of runtime, even though the size of a type is fixed and does not change between invocations.

I've written a patch which caches the result of binary.dataSize in a sync.Map, with the following effect on the benchmark:

name                    old time/op    new time/op     delta
ReadStruct-8              1.06µs ± 1%     0.39µs ± 6%   -62.87%  (p=0.000 n=16+16)

name                    old speed      new speed       delta
ReadStruct-8            70.6MB/s ± 1%  190.4MB/s ± 6%  +169.74%  (p=0.000 n=16+16)

name                    old alloc/op   new alloc/op    delta
ReadStruct-8                200B ± 0%        80B ± 0%   -60.00%  (p=0.000 n=16+16)

name                    old allocs/op  new allocs/op   delta
ReadStruct-8                16.0 ± 0%        1.0 ± 0%   -93.75%  (p=0.000 n=16+16)

As you can see, we pretty much recoup the 60% seen in the CPU profile. I think Write may see a similar speed-up, but I haven't checked. However, the patch has a problem: the cache can grow without bound, and there is no way to reset it except by restarting the program.

I think the correct solution to this would be to introduce something like an Encoder / Decoder to binary which "owns" the cache. But it seems like this might go against the spirit of the package. Looking at https://godoc.org/?q=binary, there are many programs that would benefit from a faster Read / Write, so I'd like to explore this option nonetheless.

@gopherbot gopherbot added this to the Proposal milestone Sep 23, 2019
@gopherbot

Change https://golang.org/cl/199539 mentions this issue: encoding/binary: memoize the result of dataSize

@lmb
Contributor Author

lmb commented Oct 7, 2019

The above is my straw man CL, hopefully it can spur some discussion.

@ianlancetaylor
Contributor

Adding a map to the default encoder will mean that some programs will permanently burn memory that they would prefer not to lose. I'm not sure that's the best choice.

@lmb
Contributor Author

lmb commented Oct 16, 2019

@ianlancetaylor @bradfitz can either of you take another look?

@ianlancetaylor
Contributor

Looking at https://golang.org/cl/199539, it seems that NewReader and NewWriter are unnecessary. The map can be lazily initialized if it is needed. So the proposal is to add two new types to encoding/binary: Reader and Writer. They have Read and Write methods, respectively, but they are not instances of io.Reader and io.Writer.

type Reader struct { ... }

func (r *Reader) Read(rd io.Reader, order ByteOrder, data interface{}) error

type Writer struct { ... }

func (w *Writer) Write(wr io.Writer, order ByteOrder, data interface{}) error

Whether the existing Read and Write functions use Reader and Writer is an implementation detail.

It seems odd to have to pass in the io.Reader and ByteOrder each time. Presumably they are going to be the same for each use of a Reader or Writer. Also, Read and Write methods normally implement io.Reader and io.Writer, but these can't, so perhaps we should follow encoding/json and use Encode and Decode. That suggests:

type Reader ...

func NewReader(io.Reader, ByteOrder) *Reader

func (r *Reader) Decode(interface{}) error

type Writer ...

func NewWriter(io.Writer, ByteOrder) *Writer

func (w *Writer) Encode(interface{}) error

@lmb
Contributor Author

lmb commented Oct 18, 2019

They have Read and Write methods, respectively, but they are not instances of io.Reader and io.Writer.

Yes, that's true. I figured it would be better to stick with the terminology of binary with Read and Write. Looking at json and gob I would have preferred Decoder and Encoder but those already exist unexported in the code.

It seems odd to have to pass in the io.Reader and ByteOrder each time. Presumably they are going to be the same for each use of a Reader or Writer.

This isn't true for the two use cases I have in mind.

First, I'm reading perfEventHeader from multiple distinct ring buffers. With your proposed API I'd have to do one of the following:

  • Have one binary.Reader and one "swappableReader" which just forwards to another io.Reader. Not nice, but minimal overhead.
  • Have one binary.Reader per ring buffer. Nicer than the above; the added overhead of another map, etc. is probably negligible.
    The perf improvement for this case is ~8%.

Second, I use binary to decode the keys and values of a Map object exposed by the Linux kernel. The object is usually long lived, and key and value map to a single Go type. Users will repeatedly do lookups / insertions into the map. Here, caching dataSize results in ~15% speed up for a moderately complicated struct.

I think those improvements are nice, but as always they look better in the microbenchmarks than in practice. I also think that the second use case would benefit from being able to avoid io.Reader entirely. That, combined with the fact that I can't come up with a nice API, tells me that my requirements might be non-standard, and that just forking binary for my own purposes is easier. I'm happy to withdraw this proposal if you agree.

@ianlancetaylor
Contributor

You are saying that you vary the io.Reader? But you presumably don't vary the ByteOrder?

If you want to continue with this proposal, do you want to propose a different API?

@rsc
Contributor

rsc commented Oct 21, 2019

Please no. encoding/binary can be made arbitrarily more complex in an effort to make it "fast". It was never intended to be fast, only convenient.

@bcmills
Member

bcmills commented Oct 22, 2019

Adding a map to the default encoder will mean that some programs will permanently burn memory that they would prefer not to lose.

It was never intended to be fast, only convenient.

The size of the map is presumably proportional to the number of distinct types passed to encoding/binary. That set has a fixed and (generally) small upper bound, except for the types synthesized by the reflect package.

Is it possible to detect types synthesized by the reflect package? If so, we could omit those from the map, and then I'm not sure that a global map would really be so bad — it would avoid the need for changes in the package API, and programs that really care about that level of memory overhead can always marshal through a more restricted set of types (or avoid encoding/binary entirely).

@lmb
Contributor Author

lmb commented Oct 23, 2019

You are saying that you vary the io.Reader? But you presumably don't vary the ByteOrder?

That's correct.

Please no. encoding/binary can be made arbitrarily more complex in an effort to make it "fast". It was never intended to be fast, only convenient.

Agreed. In fact the package is so convenient that there isn't much else out there, see the godoc.org link I referenced earlier. My thinking is that the ecosystem benefits if we can get a substantial speed up for a very commonly used package.

The size of the map is presumably proportional to the number of distinct types passed to encoding/binary. That set has a fixed and (generally) small upper bound, except for the types synthesized by the reflect package.

I think that is correct. I was looking at reflect to figure out how heavy a reflect.Type is, and it turns out that the package already has global caches just like the one I propose here:

@bradfitz
Contributor

Adding a sync.Map caching just the struct sizes makes a big difference:

bradfitz@go:~/go/src/encoding/binary$ go test -v -run=XXX -bench=ReadStruct$ -count=5
goos: linux
goarch: amd64
pkg: encoding/binary
BenchmarkReadStruct
BenchmarkReadStruct-4            2036982               588 ns/op         127.49 MB/s
BenchmarkReadStruct-4            2030828               593 ns/op         126.44 MB/s
BenchmarkReadStruct-4            2032056               595 ns/op         126.04 MB/s
BenchmarkReadStruct-4            2031474               593 ns/op         126.41 MB/s
BenchmarkReadStruct-4            2042948               588 ns/op         127.60 MB/s
PASS    
ok      encoding/binary 9.016s
bradfitz@go:~/go/src/encoding/binary$ go test -v -run=XXX -bench=ReadStruct$ -count=5
goos: linux
goarch: amd64
pkg: encoding/binary
BenchmarkReadStruct
BenchmarkReadStruct-4             741402              1523 ns/op          49.24 MB/s
BenchmarkReadStruct-4             807487              1517 ns/op          49.45 MB/s
BenchmarkReadStruct-4             816326              1523 ns/op          49.23 MB/s
BenchmarkReadStruct-4             796476              1518 ns/op          49.42 MB/s
BenchmarkReadStruct-4             804229              1519 ns/op          49.37 MB/s
PASS
ok      encoding/binary 6.120s
diff --git a/src/encoding/binary/binary.go b/src/encoding/binary/binary.go
index 8c2d1d9da4..8b1e97cb2d 100644
--- a/src/encoding/binary/binary.go
+++ b/src/encoding/binary/binary.go
@@ -26,6 +26,7 @@ import (
        "io"
        "math"
        "reflect"
+       "sync"
 )
 
 // A ByteOrder specifies how to convert byte sequences into
@@ -377,6 +378,8 @@ func dataSize(v reflect.Value) int {
        return sizeof(v.Type())
 }
 
+var structSize sync.Map // reflect.Type -> size
+
 // sizeof returns the size >= 0 of variables for the given type or -1 if the type is not acceptable.
 func sizeof(t reflect.Type) int {
        switch t.Kind() {
@@ -386,6 +389,9 @@ func sizeof(t reflect.Type) int {
                }
 
        case reflect.Struct:
+               if size, ok := structSize.Load(t); ok {
+                       return size.(int)
+               }
                sum := 0
                for i, n := 0, t.NumField(); i < n; i++ {
                        s := sizeof(t.Field(i).Type)
@@ -394,6 +400,9 @@ func sizeof(t reflect.Type) int {
                        }
                        sum += s
                }
+               if t.Name() != "" { // don't cache reflect-created types
+                       structSize.Store(t, sum)
+               }
                return sum
 
        case reflect.Bool,

I'm fine with that.

It's a bounded number of structs.

@lmb
Contributor Author

lmb commented Oct 24, 2019

I've updated the CL based on your code @bradfitz, with a little twist: we only cache the result of dataSize, which should mean that there are fewer items in the cache and that the sync.Map lookup is not in the recursive sizeof.

@rsc
Contributor

rsc commented Oct 30, 2019

Given that the code is so small and the memory footprint is bounded, I withdraw my objection to adding this cache.

Does anyone else still object? Otherwise it seems like this is a likely accept.

Leaving open a week for final comments.
