Investigate header decoding performance #1587

Closed
dfawley opened this issue Oct 17, 2017 · 7 comments
Labels: P1, Type: Performance (performance improvements: CPU, network, memory, etc.)

dfawley (Member) commented Oct 17, 2017

cockroachdb/cockroach#17370 (comment)

@petermattis found that disabling header decoding can double overall throughput in his benchmarks. This is surprising, and should be investigated.

dfawley added the P1 and Type: Performance labels Oct 17, 2017
dfawley (Member, Author) commented Oct 26, 2017

Actually, this wasn't doubling performance; it was more like a ~30% slowdown with header decoding enabled. Also, I initially misunderstood what was meant by not decoding the headers, because I assumed the http2 library always did some decoding on its side. Now that I understand decoding can be disabled entirely, this result seems reasonable.

dfawley closed this as completed Oct 26, 2017
petermattis (Contributor) commented

@dfawley I just re-tested this and the slowdown I see is 45%. Flipped around, not decoding headers provides an 84% improvement in throughput. I haven't yet looked into what part of header decoding is causing this.

The only difference between the two tests (using pinger) is:

diff --git a/y.go b/y.go
index 1bfb580..b2ab044 100644
--- a/y.go
+++ b/y.go
@@ -36,7 +36,7 @@ func newYServer(conn net.Conn) *yServerConn {
        y.wr = bufio.NewWriter(conn)
        y.fr = http2.NewFramer(y.wr, y.rd)
        y.fr.SetReuseFrames()
-       y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
+       // y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
        y.sender.cond.L = &y.sender.Mutex
        return y
 }
@@ -185,7 +185,7 @@ func newYClient(conn net.Conn) *yClientConn {
        y.wr = bufio.NewWriter(conn)
        y.fr = http2.NewFramer(y.wr, y.rd)
        y.fr.SetReuseFrames()
-       y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
+       // y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
        y.sender.cond.L = &y.sender.Mutex
        y.receiver.streamID = 1
        y.receiver.pending = make(map[uint32]*yPending)

With header decoding:

~ ./pinger -c 200 -n 1 -p 1000 -d 10s -t y localhost:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 232403.8     89.3      6.3      5.8      6.6      9.4

Without header decoding:

~ ./pinger -c 200 -n 1 -p 1000 -d 10s -t y localhost:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 427379.2    164.3      2.2      3.4      4.1      8.9
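
For context, here is a minimal read-loop sketch (mine, not from the thread; readLoop and decodeHeaders are illustrative names) of what that one-line diff toggles: with ReadMetaHeaders set, the Framer decodes HPACK itself and ReadFrame returns a fully decoded *http2.MetaHeadersFrame; with it unset, ReadFrame returns the raw *http2.HeadersFrame and the header block stays as opaque HPACK bytes.

package sketch

import (
	"bufio"
	"fmt"
	"net"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/hpack"
)

func readLoop(conn net.Conn, decodeHeaders bool) error {
	fr := http2.NewFramer(bufio.NewWriter(conn), bufio.NewReader(conn))
	fr.SetReuseFrames()
	if decodeHeaders {
		// This is the line the diff comments out: per HEADERS frame it
		// costs a MetaHeadersFrame allocation, a Fields slice, and the
		// HPACK decode itself.
		fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
	}
	for {
		f, err := fr.ReadFrame()
		if err != nil {
			return err
		}
		switch f := f.(type) {
		case *http2.MetaHeadersFrame: // only seen when ReadMetaHeaders != nil
			for _, hf := range f.Fields {
				fmt.Printf("%s: %s\n", hf.Name, hf.Value)
			}
		case *http2.HeadersFrame: // raw, undecoded header block fragment
			_ = f.HeaderBlockFragment()
		}
	}
}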

petermattis (Contributor) commented

Ok, I took a quick look and it appears that ~50% of the performance delta is due to allocations of the MetaHeadersFrame and MetaHeadersFrame.Fields. It looks feasible to reuse the MetaHeadersFrame similarly to how DataFrames are reused, and likewise for MetaHeadersFrame.Fields. Unfortunately, those changes don't move the performance needle at all for gRPC.
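
A rough sketch of that reuse idea, under the assumption that the real change would live inside x/net/http2's Framer next to the existing frame cache (metaFrameCache and its methods are hypothetical names, not the library's): keep one MetaHeadersFrame and one Fields backing array alive and reset them per frame instead of allocating fresh ones.

package sketch

import (
	"golang.org/x/net/http2"
	"golang.org/x/net/http2/hpack"
)

type metaFrameCache struct {
	mh     http2.MetaHeadersFrame
	fields []hpack.HeaderField
	dec    *hpack.Decoder
}

func newMetaFrameCache() *metaFrameCache {
	c := &metaFrameCache{}
	// The emit func appends into the cached backing array rather than
	// into a slice allocated per frame.
	c.dec = hpack.NewDecoder(4096, func(f hpack.HeaderField) {
		c.fields = append(c.fields, f)
	})
	return c
}

// decode reuses the cached frame and slice. (CONTINUATION handling is
// omitted; the real readMetaFrame also enforces header-list size limits.)
func (c *metaFrameCache) decode(hf *http2.HeadersFrame) (*http2.MetaHeadersFrame, error) {
	c.fields = c.fields[:0] // keep capacity from previous frames
	if _, err := c.dec.Write(hf.HeaderBlockFragment()); err != nil {
		return nil, err
	}
	if err := c.dec.Close(); err != nil {
		return nil, err
	}
	c.mh.HeadersFrame = hf
	c.mh.Fields = c.fields
	return &c.mh, nil
}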

MakMukhi (Contributor) commented

~50% of the performance delta is due to allocations of the MetaHeadersFrame and MetaHeadersFrame.Fields

How are you getting that number? I ran a CPU profile on your benchmarks (btw, CPU profiling there is a little buggy, at least for y; if you've fixed your local branch, can you push the updates?), and it seems to me that most of readMetaFrame's time is going into growing slices (presumably for MetaHeadersFrame.Fields). Perhaps reusing MetaHeadersFrames might come in handy. But we've got to keep in mind that hpack encoding and decoding are inherently expensive. If we don't pass along a decoder, then we'll have to do this decoding ourselves.
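
To illustrate the slice-growth cost (a toy measurement of my own, not the profile referenced above, and the 12-field frame size is an arbitrary assumption): appending a frame's fields into a fresh slice re-allocates several times as append doubles capacity, while a reused backing array allocates nothing in steady state.

package sketch

import (
	"fmt"
	"testing"

	"golang.org/x/net/http2/hpack"
)

func ExampleFieldsGrowth() {
	frame := make([]hpack.HeaderField, 12) // ~12 fields per HEADERS frame (illustrative)

	fresh := testing.AllocsPerRun(1000, func() {
		var fields []hpack.HeaderField
		for _, f := range frame {
			fields = append(fields, f) // grows: caps 1, 2, 4, 8, 16
		}
	})

	var reused []hpack.HeaderField
	reused = append(reused, frame...) // warm the backing array once
	warm := testing.AllocsPerRun(1000, func() {
		reused = reused[:0] // keep capacity across "frames"
		for _, f := range frame {
			reused = append(reused, f)
		}
	})

	fmt.Println(fresh, warm) // e.g. 5 0
}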

Unfortunately, those changes don't move the performance needle at all for gRPC.

Did you make a prototype change already?

The implementation of readMetaFrame may be sub-optimal, but I haven't taken a closer look at the net/http internals.

Our current focus is on reducing contention for high-concurrency use cases. Vetting net/http's implementation for performance seems like a long shot right now. @dfawley, what do you think?

petermattis (Contributor) commented

~50% of the performance delta is due to allocations of the MetaHeadersFrame and MetaHeadersFrame.Fields

How are you getting that number?

I added caching of MetaHeadersFrame and MetaHeadersFrame.Fields to golang.org/x/net/http2.Framer. That eliminated ~50% of the performance delta. A bit more can be achieved by some restructuring of Framer.readMetaFrame. Right now it performs a number of small allocations so that variables can be captured by the emit function.
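
For illustration, a hedged sketch of that restructuring, assuming the per-frame state can be hoisted onto a long-lived struct so the hpack callback is a bound method rather than a freshly built closure whose captured variables escape to the heap on every HEADERS frame (metaReader and its fields are hypothetical names, not the actual x/net/http2 internals).

package sketch

import "golang.org/x/net/http2/hpack"

type metaReader struct {
	fields    []hpack.HeaderField
	remaining uint32 // header-list budget for the current frame
}

// emit is the hpack callback. All of its state lives on r, so nothing
// is captured (and heap-allocated) per frame.
func (r *metaReader) emit(f hpack.HeaderField) {
	if f.Size() <= r.remaining {
		r.remaining -= f.Size()
		r.fields = append(r.fields, f)
	}
}

// reset prepares the reader for the next frame and points the shared
// decoder at the bound method instead of a per-frame closure.
func (r *metaReader) reset(dec *hpack.Decoder, maxHeaderListSize uint32) {
	r.fields = r.fields[:0]
	r.remaining = maxHeaderListSize
	dec.SetEmitFunc(r.emit)
}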

Concretely (and these numbers differ from the ones above because they were gathered between 2 VMs in GCE):

old-decode-headers (the current golang.org/x/net/http2 code)
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 168991.7     64.9      7.3      7.9      8.4     10.0

new-decode-headers (the changes mentioned above)
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 256298.9     98.5      4.5      5.2      5.8     12.6

no-decode-headers (send headers, but don't set Framer.ReadMetaHeaders)
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 336653.7    129.4      2.6      4.2      5.0      7.1

Unfortunately, those changes don't move the performance needle at all for gRPC.

Did you make a prototype change already?

Yes.

dfawley (Member, Author) commented Oct 27, 2017

@petermattis Are those latest numbers with gRPC or your x/y implementations?

I'm assuming the latter because of this comment:

Unfortunately, those changes don't move the performance needle at all for gRPC.

So you're saying optimizations in header decoding don't help gRPC at all? If so, that probably means the bottleneck is in the sending path right now.

Maybe we can turn this into an issue/PR against http2. Reusing header frames seems like a useful feature for everyone.

petermattis (Contributor) commented

@petermattis Are those latest numbers with gRPC or your x/y implementations?

These numbers were with my y protocol implementation (which uses http2 framing).

Unfortunately, those changes don't move the performance needle at all for gRPC.

So you're saying optimizations in header decoding don't help gRPC at all? If so, that probably means the bottleneck is in the sending path right now.

Yes, these optimizations do not help gRPC at all right now. I would guess that the gRPC bottleneck is somewhere in the sending path, though it is also possible something is happening on the receiving path.

Maybe we can turn this into an issue/PR against http2. Reusing header frames seems like a useful feature for everyone.

Sounds good to me.

lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018