Investigate header decoding performance #1587

Closed
dfawley opened this issue Oct 17, 2017 · 7 comments
Labels: P1, Type: Performance (performance improvements: CPU, network, memory, etc.)

dfawley (Member) commented Oct 17, 2017

cockroachdb/cockroach#17370 (comment)

@petermattis found that disabling header decoding can double overall throughput in his benchmarks. This is surprising, and should be investigated.

dfawley added the P1 and Type: Performance labels Oct 17, 2017
dfawley (Member, Author) commented Oct 26, 2017

Actually, this wasn't doubling performance; it was more like a ~30% slowdown with header decoding enabled. Also, I initially misunderstood what was meant by not decoding the headers, because I assumed the http2 library always did some decoding on its side. Now that I understand decoding can be disabled entirely, this result seems reasonable.

dfawley closed this as completed Oct 26, 2017
petermattis (Contributor) commented

@dfawley I just re-tested this and the slowdown I see is 45%. Flipped around, not decoding headers provides an 84% improvement in throughput. I haven't yet looked into what part of header decoding is causing this.

The only difference between the two tests (using pinger) is:

diff --git a/y.go b/y.go
index 1bfb580..b2ab044 100644
--- a/y.go
+++ b/y.go
@@ -36,7 +36,7 @@ func newYServer(conn net.Conn) *yServerConn {
        y.wr = bufio.NewWriter(conn)
        y.fr = http2.NewFramer(y.wr, y.rd)
        y.fr.SetReuseFrames()
-       y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
+       // y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
        y.sender.cond.L = &y.sender.Mutex
        return y
 }
@@ -185,7 +185,7 @@ func newYClient(conn net.Conn) *yClientConn {
        y.wr = bufio.NewWriter(conn)
        y.fr = http2.NewFramer(y.wr, y.rd)
        y.fr.SetReuseFrames()
-       y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
+       // y.fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
        y.sender.cond.L = &y.sender.Mutex
        y.receiver.streamID = 1
        y.receiver.pending = make(map[uint32]*yPending)

With header decoding:

~ ./pinger -c 200 -n 1 -p 1000 -d 10s -t y localhost:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 232403.8     89.3      6.3      5.8      6.6      9.4

Without header decoding:

~ ./pinger -c 200 -n 1 -p 1000 -d 10s -t y localhost:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 427379.2    164.3      2.2      3.4      4.1      8.9
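
For context, here is a minimal read-loop sketch (mine, not from the thread; readLoop and decodeHeaders are illustrative names) of what that one-line diff toggles: with ReadMetaHeaders set, the Framer decodes HPACK itself and ReadFrame returns a fully decoded *http2.MetaHeadersFrame; with it unset, ReadFrame returns the raw *http2.HeadersFrame and the header block stays as opaque HPACK bytes.

package sketch

import (
	"bufio"
	"fmt"
	"net"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/hpack"
)

func readLoop(conn net.Conn, decodeHeaders bool) error {
	fr := http2.NewFramer(bufio.NewWriter(conn), bufio.NewReader(conn))
	fr.SetReuseFrames()
	if decodeHeaders {
		// This is the line the diff comments out: per HEADERS frame it
		// costs a MetaHeadersFrame allocation, a Fields slice, and the
		// HPACK decode itself.
		fr.ReadMetaHeaders = hpack.NewDecoder(4096, nil)
	}
	for {
		f, err := fr.ReadFrame()
		if err != nil {
			return err
		}
		switch f := f.(type) {
		case *http2.MetaHeadersFrame: // only seen when ReadMetaHeaders != nil
			for _, hf := range f.Fields {
				fmt.Printf("%s: %s\n", hf.Name, hf.Value)
			}
		case *http2.HeadersFrame: // raw, undecoded header block fragment
			_ = f.HeaderBlockFragment()
		}
	}
}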

petermattis (Contributor) commented

Ok, I took a quick look and it appears that ~50% of the performance delta is due to allocations of the MetaHeadersFrame and MetaHeadersFrame.Fields. It looks feasible to reuse the MetaHeadersFrame similarly to how DataFrames are reused, and likewise for MetaHeadersFrame.Fields. Unfortunately, those changes don't move the performance needle at all for gRPC.
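
A rough sketch of that reuse idea, under the assumption that the real change would live inside x/net/http2's Framer next to the existing frame cache (metaFrameCache and its methods are hypothetical names, not the library's): keep one MetaHeadersFrame and one Fields backing array alive and reset them per frame instead of allocating fresh ones.

package sketch

import (
	"golang.org/x/net/http2"
	"golang.org/x/net/http2/hpack"
)

type metaFrameCache struct {
	mh     http2.MetaHeadersFrame
	fields []hpack.HeaderField
	dec    *hpack.Decoder
}

func newMetaFrameCache() *metaFrameCache {
	c := &metaFrameCache{}
	// The emit func appends into the cached backing array rather than
	// into a slice allocated per frame.
	c.dec = hpack.NewDecoder(4096, func(f hpack.HeaderField) {
		c.fields = append(c.fields, f)
	})
	return c
}

// decode reuses the cached frame and slice. (CONTINUATION handling is
// omitted; the real readMetaFrame also enforces header-list size limits.)
func (c *metaFrameCache) decode(hf *http2.HeadersFrame) (*http2.MetaHeadersFrame, error) {
	c.fields = c.fields[:0] // keep capacity from previous frames
	if _, err := c.dec.Write(hf.HeaderBlockFragment()); err != nil {
		return nil, err
	}
	if err := c.dec.Close(); err != nil {
		return nil, err
	}
	c.mh.HeadersFrame = hf
	c.mh.Fields = c.fields
	return &c.mh, nil
}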

MakMukhi (Contributor) commented

~50% of the performance delta is due to allocations of the MetaHeadersFrame and MetaHeadersFrame.Fields

How are you getting that number? I ran a CPU profile on your benchmarks (btw, CPU profiling there is a little buggy, at least for y; if you've fixed your local branch, can you push the updates?), and it seems to me that most of readMetaFrame's time is going into growing slices (presumably for MetaHeadersFrame.Fields). Perhaps reusing MetaHeadersFrames might come in handy. But we've got to keep in mind that hpack encoding and decoding are inherently expensive. If we don't pass along a decoder, then we'll have to do this decoding ourselves.
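
To illustrate the slice-growth cost (a toy measurement of my own, not the profile referenced above, and the 12-field frame size is an arbitrary assumption): appending a frame's fields into a fresh slice re-allocates several times as append doubles capacity, while a reused backing array allocates nothing in steady state.

package sketch

import (
	"fmt"
	"testing"

	"golang.org/x/net/http2/hpack"
)

func ExampleFieldsGrowth() {
	frame := make([]hpack.HeaderField, 12) // ~12 fields per HEADERS frame (illustrative)

	fresh := testing.AllocsPerRun(1000, func() {
		var fields []hpack.HeaderField
		for _, f := range frame {
			fields = append(fields, f) // grows: caps 1, 2, 4, 8, 16
		}
	})

	var reused []hpack.HeaderField
	reused = append(reused, frame...) // warm the backing array once
	warm := testing.AllocsPerRun(1000, func() {
		reused = reused[:0] // keep capacity across "frames"
		for _, f := range frame {
			reused = append(reused, f)
		}
	})

	fmt.Println(fresh, warm) // e.g. 5 0
}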

Unfortunately, those changes don't move the performance needle at all for gRPC.

Did you make a prototype change already?

The implementation of readMetaFrame may be sub-optimal, but I haven't taken a closer look at the net/http internals.

Our current focus is on reducing contention for high-concurrency use cases. Vetting net/http's implementation for performance seems like a long shot right now. @dfawley, what do you think?

petermattis (Contributor) commented

~50% of the performance delta is due to allocations of the MetaHeadersFrame and MetaHeadersFrame.Fields

How are you getting that number?

I added caching of MetaHeadersFrame and MetaHeadersFrame.Fields to golang.org/x/net/http2.Framer. That eliminated ~50% of the performance delta. A bit more can be achieved by some restructuring of Framer.readMetaFrame. Right now it performs a number of small allocations so that variables can be captured by the emit function.
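
For illustration, a hedged sketch of that restructuring, assuming the per-frame state can be hoisted onto a long-lived struct so the hpack callback is a bound method rather than a freshly built closure whose captured variables escape to the heap on every HEADERS frame (metaReader and its fields are hypothetical names, not the actual x/net/http2 internals).

package sketch

import "golang.org/x/net/http2/hpack"

type metaReader struct {
	fields    []hpack.HeaderField
	remaining uint32 // header-list budget for the current frame
}

// emit is the hpack callback. All of its state lives on r, so nothing
// is captured (and heap-allocated) per frame.
func (r *metaReader) emit(f hpack.HeaderField) {
	if f.Size() <= r.remaining {
		r.remaining -= f.Size()
		r.fields = append(r.fields, f)
	}
}

// reset prepares the reader for the next frame and points the shared
// decoder at the bound method instead of a per-frame closure.
func (r *metaReader) reset(dec *hpack.Decoder, maxHeaderListSize uint32) {
	r.fields = r.fields[:0]
	r.remaining = maxHeaderListSize
	dec.SetEmitFunc(r.emit)
}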

Concretely (and these numbers differ from the ones above because they were gathered between 2 VMs in GCE):

old-decode-headers (the current golang.org/x/net/http2 code)
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 168991.7     64.9      7.3      7.9      8.4     10.0

new-decode-headers (the changes mentioned above)
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 256298.9     98.5      4.5      5.2      5.8     12.6

no-decode-headers (send headers, but don't set Framer.ReadMetaHeaders)
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 336653.7    129.4      2.6      4.2      5.0      7.1

Unfortunately, those changes don't move the performance needle at all for gRPC.

Did you make a prototype change already?

Yes.

dfawley (Member, Author) commented Oct 27, 2017

@petermattis Are those latest numbers with gRPC or your x/y implementations?

I'm assuming the latter because of this comment:

Unfortunately, those changes don't move the performance needle at all for gRPC.

So you're saying optimizations in header decoding don't help gRPC at all? If so, that probably means the bottleneck is in the sending path right now.

Maybe we can turn this into an issue/PR against http2. Reusing header frames seems like a useful feature for everyone.

petermattis (Contributor) commented

@petermattis Are those latest numbers with gRPC or your x/y implementations?

These numbers were with my y protocol implementation (which uses http2 framing).

Unfortunately, those changes don't move the performance needle at all for gRPC.

So you're saying optimizations in header decoding don't help gRPC at all? If so, that probably means the bottleneck is in the sending path right now.

Yes, these optimizations do not help gRPC at all right now. I would guess that the gRPC bottleneck is somewhere in the sending path, though it is also possible something is happening on the receiving path.

Maybe we can turn this into an issue/PR against http2. Reusing header frames seems like a useful feature for everyone.

Sounds good to me.

lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018