net/textproto: reduce allocations in ReadMIMEHeader #37809
I wonder if it's possible to make use of […]
We're not adding unsafe usage to textproto.
With or without […], I tried a slightly more complex implementation that would cover requests with an arbitrary number of header values, but it added no gain on my training set of real-life requests/responses.
@bradfitz would it be accepted if hidden by the use of […]? EDIT: scratch that, this can't be implemented with […]
@bradfitz would you be open to adding the […]?
@rs, could you write all of the values into the […]?
That could work, yes, but it would make the code slightly more complex. I would personally prefer the […]
Another option (which would also complicate the code) is to have a fixed-size, stack-allocated buffer for pending values to record and for the bytes involved. When a new k/v pair is available, append its bytes to the byte buffer and its indices within the byte buffer to the k/v pairs buffer. When either buffer is too full, convert the byte buffer to a string and add slices of that string to the map. This could all be wrapped up in methods on a separate type, for sanity. (This is similar to the strings.Builder idea, but without the extra strings.Builder layer, since a plain ol' byte slice should suffice.) Also, if the keys are commonly re-used, string interning might help.
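The fixed-buffer idea above might look roughly like the following. This is only an illustrative sketch of the technique, not the proposed patch; the `headerBuf` type and its methods are invented names:

```go
package main

import "fmt"

// headerBuf batches pending key/value pairs in a single byte
// slice so that one string allocation can back many map entries.
// Sketch of the approach described above, not code from the CL.
type headerBuf struct {
	buf   []byte   // raw bytes of all pending keys and values
	spans [][4]int // (keyStart, keyEnd, valStart, valEnd) into buf
}

// add records one k/v pair without allocating any strings yet.
func (h *headerBuf) add(key, value []byte) {
	ks := len(h.buf)
	h.buf = append(h.buf, key...)
	ke := len(h.buf)
	h.buf = append(h.buf, value...)
	ve := len(h.buf)
	h.spans = append(h.spans, [4]int{ks, ke, ke, ve})
}

// flush converts the byte buffer to a string once, adds
// substrings of that string to the map, and resets both buffers.
func (h *headerBuf) flush(m map[string][]string) {
	s := string(h.buf) // single allocation for all pending pairs
	for _, sp := range h.spans {
		k, v := s[sp[0]:sp[1]], s[sp[2]:sp[3]]
		m[k] = append(m[k], v)
	}
	h.buf = h.buf[:0]
	h.spans = h.spans[:0]
}

func main() {
	m := map[string][]string{}
	var h headerBuf
	h.add([]byte("Content-Type"), []byte("text/html"))
	h.add([]byte("Accept"), []byte("*/*"))
	h.flush(m)
	fmt.Println(m["Content-Type"][0], m["Accept"][0])
}
```

In a real implementation `flush` would be called whenever either buffer fills (and once at end of headers), so the per-pair cost amortizes to a few large allocations per request.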
Good point: you would still pay for one extra copy, but it's better than nothing. String interning might work for some very common header values; I guess it could save a couple of allocations per request with common content types and accept header values, for instance. Before spending more time on this, I'd love to hear from @bradfitz about the […]
@rs, […]
Agreed, but it would still be safe, and you would still save allocations by creating those substrings from a common buffer.
How are you asserting that this allocation cannot be garbage collected? |
My friend found this: https://rover.rocks/golang-memory-leak. Probably it's gin-gonic/gin's […]
Do you have any statistics about what the header values are? If there are a handful of very common values, perhaps we could hard-code some strings to re-use instead of allocate. (Like string interning, but hard-coded.)
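Hard-coded interning as suggested above could be as simple as a lookup that returns a pre-allocated string for known values. A hypothetical sketch follows; the specific values listed are placeholders, and a real change would pick them from measurements:

```go
package main

import "fmt"

// commonValues is an illustrative, hard-coded list of header
// values assumed to be frequent; a real patch would choose these
// from statistics on production traffic.
var commonValues = []string{
	"*/*",
	"gzip, deflate",
	"keep-alive",
	"text/html",
	"application/json",
}

// internValue returns a shared, pre-allocated string when the
// bytes match a known common value, and falls back to a fresh
// allocation otherwise. Comparing string(b) == s does not
// allocate in Go, so the fast path is allocation-free.
func internValue(b []byte) string {
	for _, s := range commonValues {
		if string(b) == s {
			return s // reuse the hard-coded string
		}
	}
	return string(b) // uncommon value: allocate as usual
}

func main() {
	fmt.Println(internValue([]byte("keep-alive")))
	fmt.Println(internValue([]byte("no-cache")))
}
```

A linear scan over a handful of candidates is usually cheaper than a map lookup at this size; with a longer list, a map or a switch on length would make sense.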
PoC for golang#37809:

```
name                             old time/op    new time/op    delta
ReadMIMEHeader/client_headers-4  2.38µs ± 9%    2.26µs ±11%    -5.35%   (p=0.000 n=26+29)
ReadMIMEHeader/server_headers-4  2.10µs ± 8%    1.95µs ± 9%    -7.28%   (p=0.000 n=28+26)
Uncommon-4                       549ns ± 7%     522ns ±10%     -4.99%   (p=0.000 n=29+28)

name                             old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-4  1.53kB ± 0%    0.95kB ± 1%    -38.41%  (p=0.000 n=29+28)
ReadMIMEHeader/server_headers-4  1.10kB ± 0%    0.92kB ± 0%    -16.51%  (p=0.000 n=30+30)
Uncommon-4                       468B ± 0%      435B ± 0%      -7.05%   (p=0.000 n=30+27)

name                             old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-4  14.0 ± 0%      3.0 ± 0%       -78.57%  (p=0.000 n=30+30)
ReadMIMEHeader/server_headers-4  14.0 ± 0%      3.0 ± 0%       -78.57%  (p=0.000 n=30+30)
Uncommon-4                       5.00 ± 0%      3.00 ± 0%      -40.00%  (p=0.000 n=30+30)
```
Spent a bit of time on this and, unfortunately, there does not seem to be a lot that can be done. The only approach I found that worked, given the constraints above, was interning (under the assumption that headers are not mostly unique), as that reduced both allocations and CPU usage, so this may be a good candidate for some form of string interning (not exposed to users).

Benchmarks (note: machine was noisy, need to run again on a dedicated one).

I have pushed the PoC here. If we think this may be useful, I can try to refine it properly (the current interning logic is really crude). If you have internal benchmarks, it would be useful to run them on that PoC to get a better sense of whether there are workloads where this approach falls short. While I haven't tested this on our production workloads yet, I can share that on one of our highest-traffic services, ReadMIMEHeader accounts for ~2% of CPU time and ~2.5% of total allocations by bytes.
PoC for golang#37809.

net/textproto:

```
name                             old time/op    new time/op    delta
ReadMIMEHeader/client_headers-4  2.38µs ± 9%    2.26µs ±11%    -5.35%   (p=0.000 n=26+29)
ReadMIMEHeader/server_headers-4  2.10µs ± 8%    1.95µs ± 9%    -7.28%   (p=0.000 n=28+26)
Uncommon-4                       549ns ± 7%     522ns ±10%     -4.99%   (p=0.000 n=29+28)

name                             old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-4  1.53kB ± 0%    0.95kB ± 1%    -38.41%  (p=0.000 n=29+28)
ReadMIMEHeader/server_headers-4  1.10kB ± 0%    0.92kB ± 0%    -16.51%  (p=0.000 n=30+30)
Uncommon-4                       468B ± 0%      435B ± 0%      -7.05%   (p=0.000 n=30+27)

name                             old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-4  14.0 ± 0%      3.0 ± 0%       -78.57%  (p=0.000 n=30+30)
ReadMIMEHeader/server_headers-4  14.0 ± 0%      3.0 ± 0%       -78.57%  (p=0.000 n=30+30)
Uncommon-4                       5.00 ± 0%      3.00 ± 0%      -40.00%  (p=0.000 n=30+30)
```

net/http:

```
name                               old time/op    new time/op    delta
CookieString-4                     653ns ± 9%     819ns ±35%     +25.42%  (p=0.000 n=26+28)
ReadSetCookies-4                   2.40µs ± 9%    2.96µs ±27%    +23.31%  (p=0.000 n=26+28)
ReadCookies-4                      2.85µs ± 9%    3.71µs ±52%    +30.13%  (p=0.000 n=25+29)
HeaderWriteSubset-4                418ns ± 2%     431ns ±10%     +3.17%   (p=0.000 n=27+25)
CopyValues-4                       1.22µs ± 5%    1.19µs ±11%    -2.48%   (p=0.000 n=25+28)
ServerMatch-4                      18.7ns ± 2%    18.8ns ± 8%    ~        (p=0.113 n=28+26)
ReadRequestChrome-4                2.81µs ±19%    2.45µs ±14%    -12.95%  (p=0.000 n=28+26)
ReadRequestCurl-4                  1.36µs ± 6%    1.26µs ± 7%    -6.72%   (p=0.000 n=26+27)
ReadRequestApachebench-4           1.38µs ±12%    1.30µs ±18%    -5.32%   (p=0.000 n=27+30)
ReadRequestSiege-4                 1.77µs ± 8%    1.62µs ± 6%    -8.67%   (p=0.000 n=26+28)
ReadRequestWrk-4                   1.05µs ±13%    0.96µs ± 5%    -8.53%   (p=0.000 n=27+24)
FileAndServer_1KB/NoTLS-4          133µs ± 6%     126µs ±10%     -5.04%   (p=0.000 n=29+29)
FileAndServer_1KB/TLS-4            142µs ± 7%     135µs ± 6%     -4.87%   (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4         11.0ms ± 4%    10.5ms ± 6%    -4.82%   (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4           21.7ms ±10%    21.2ms ±10%    -2.48%   (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4         41.7ms ± 5%    41.5ms ± 5%    ~        (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4           82.9ms ±12%    81.0ms ±12%    ~        (p=0.096 n=29+29)
ServeMux-4                         50.6µs ± 7%    50.1µs ± 7%    ~        (p=0.118 n=26+28)
ServeMux_SkipServe-4               23.0µs ± 4%    23.1µs ± 5%    ~        (p=0.796 n=27+26)
ClientServer-4                     95.7µs ± 7%    92.9µs ± 7%    -2.91%   (p=0.001 n=30+28)
ClientServerParallel4-4            37.5µs ±24%    36.5µs ±47%    ~        (p=0.124 n=29+29)
ClientServerParallel64-4           122µs ±49%     108µs ±58%     -11.15%  (p=0.005 n=26+29)
Server-4                           169µs ±43%     165µs ±48%     ~        (p=0.618 n=29+25)
Client-4                           147µs ±25%     136µs ±17%     -7.57%   (p=0.005 n=30+24)
ServerFakeConnNoKeepAlive-4        23.0µs ±61%    20.8µs ±38%    ~        (p=0.109 n=30+30)
ServerFakeConnWithKeepAlive-4      11.6µs ±21%    11.7µs ±39%    ~        (p=0.686 n=27+30)
ServerFakeConnWithKeepAliveLite-4  7.74µs ±30%    7.71µs ±23%    ~        (p=0.879 n=26+29)
ServerHandlerTypeLen-4             8.96µs ±44%    9.52µs ±46%    ~        (p=0.138 n=30+30)
ServerHandlerNoLen-4               7.92µs ±33%    7.56µs ±29%    ~        (p=0.376 n=25+26)
ServerHandlerNoType-4              7.70µs ±33%    8.70µs ±51%    +13.01%  (p=0.036 n=28+30)
ServerHandlerNoHeader-4            5.52µs ±22%    6.10µs ±29%    +10.49%  (p=0.009 n=28+30)
ServerHijack-4                     16.2µs ± 4%    16.4µs ± 7%    ~        (p=0.580 n=26+28)
CloseNotifier-4                    137µs ± 9%     143µs ± 9%     +4.34%   (p=0.000 n=26+27)
ResponseStatusLine-4               24.5ns ±13%    24.7ns ± 5%    +0.70%   (p=0.033 n=28+27)

name                               old alloc/op   new alloc/op   delta
CookieString-4                     176B ± 0%      176B ± 0%      ~        (all equal)
ReadSetCookies-4                   1.01kB ± 0%    1.01kB ± 0%    ~        (all equal)
ReadCookies-4                      1.84kB ± 0%    1.84kB ± 0%    ~        (all equal)
HeaderWriteSubset-4                0.00B          0.00B          ~        (all equal)
CopyValues-4                       736B ± 0%      736B ± 0%      ~        (all equal)
ReadRequestChrome-4                1.83kB ± 0%    1.33kB ± 1%    -27.52%  (p=0.000 n=30+30)
ReadRequestCurl-4                  912B ± 0%      887B ± 0%      -2.73%   (p=0.000 n=30+29)
ReadRequestApachebench-4           916B ± 0%      887B ± 0%      -3.14%   (p=0.000 n=30+29)
ReadRequestSiege-4                 1.00kB ± 0%    0.92kB ± 0%    -8.05%   (p=0.000 n=30+27)
ReadRequestWrk-4                   864B ± 0%      855B ± 0%      -1.04%   (p=0.000 n=30+22)
ServeMux-4                         17.3kB ± 0%    17.3kB ± 0%    ~        (all equal)
ServeMux_SkipServe-4               0.00B          0.00B          ~        (all equal)
ClientServer-4                     5.01kB ± 0%    5.01kB ± 1%    ~        (p=0.194 n=29+30)
ClientServerParallel4-4            6.74kB ±18%    6.69kB ±17%    ~        (p=0.976 n=30+30)
ClientServerParallel64-4           16.4kB ±24%    16.1kB ±22%    ~        (p=0.640 n=29+28)
Server-4                           2.32kB ± 1%    2.36kB ± 3%    +1.82%   (p=0.000 n=29+28)
Client-4                           3.50kB ± 0%    3.53kB ± 2%    +0.83%   (p=0.000 n=26+30)
ServerFakeConnNoKeepAlive-4        4.75kB ± 0%    4.56kB ± 1%    -4.18%   (p=0.000 n=26+30)
ServerFakeConnWithKeepAlive-4      2.48kB ± 0%    2.24kB ± 0%    -9.59%   (p=0.000 n=30+29)
ServerFakeConnWithKeepAliveLite-4  1.35kB ± 0%    1.36kB ± 0%    +0.73%   (p=0.000 n=30+27)
ServerHandlerTypeLen-4             2.16kB ± 0%    2.19kB ± 0%    +1.31%   (p=0.000 n=30+30)
ServerHandlerNoLen-4               2.13kB ± 0%    2.15kB ± 0%    +1.09%   (p=0.000 n=30+29)
ServerHandlerNoType-4              2.13kB ± 0%    2.16kB ± 0%    +1.28%   (p=0.000 n=30+29)
ServerHandlerNoHeader-4            1.35kB ± 0%    1.36kB ± 0%    +0.93%   (p=0.000 n=27+28)
ServerHijack-4                     16.2kB ± 0%    16.4kB ± 0%    +0.93%   (p=0.000 n=28+30)
CloseNotifier-4                    3.36kB ± 1%    3.48kB ± 2%    +3.40%   (p=0.000 n=28+30)
ResponseStatusLine-4               0.00B          0.00B          ~        (all equal)

name                               old allocs/op  new allocs/op  delta
CookieString-4                     1.00 ± 0%      1.00 ± 0%      ~        (all equal)
ReadSetCookies-4                   15.0 ± 0%      15.0 ± 0%      ~        (all equal)
ReadCookies-4                      11.0 ± 0%      11.0 ± 0%      ~        (all equal)
HeaderWriteSubset-4                0.00           0.00           ~        (all equal)
CopyValues-4                       11.0 ± 0%      11.0 ± 0%      ~        (all equal)
ReadRequestChrome-4                14.0 ± 0%      6.0 ± 0%       -57.14%  (p=0.000 n=30+30)
ReadRequestCurl-4                  9.00 ± 0%      6.00 ± 0%      -33.33%  (p=0.000 n=30+30)
ReadRequestApachebench-4           9.00 ± 0%      6.00 ± 0%      -33.33%  (p=0.000 n=30+30)
ReadRequestSiege-4                 11.0 ± 0%      6.0 ± 0%       -45.45%  (p=0.000 n=30+30)
ReadRequestWrk-4                   7.00 ± 0%      6.00 ± 0%      -14.29%  (p=0.000 n=30+30)
ServeMux-4                         360 ± 0%       360 ± 0%       ~        (all equal)
ServeMux_SkipServe-4               0.00           0.00           ~        (all equal)
ClientServer-4                     59.0 ± 0%      53.0 ± 0%      -10.17%  (p=0.000 n=30+30)
ClientServerParallel4-4            65.4 ± 9%      59.2 ± 9%      -9.48%   (p=0.000 n=30+30)
ClientServerParallel64-4           100 ±19%       96 ± 9%        -4.49%   (p=0.003 n=30+28)
Server-4                           21.0 ± 0%      18.0 ± 0%      -14.29%  (p=0.000 n=30+30)
Client-4                           42.0 ± 0%      39.0 ± 0%      -7.14%   (p=0.000 n=30+30)
ServerFakeConnNoKeepAlive-4        53.0 ± 0%      47.0 ± 0%      -11.32%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAlive-4      23.0 ± 0%      17.0 ± 0%      -26.09%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAliveLite-4  13.0 ± 0%      12.0 ± 0%      -7.69%   (p=0.000 n=30+30)
ServerHandlerTypeLen-4             20.0 ± 0%      19.0 ± 0%      -5.00%   (p=0.000 n=30+30)
ServerHandlerNoLen-4               18.0 ± 0%      17.0 ± 0%      -5.56%   (p=0.000 n=30+30)
ServerHandlerNoType-4              19.0 ± 0%      18.0 ± 0%      -5.26%   (p=0.000 n=30+30)
ServerHandlerNoHeader-4            13.0 ± 0%      12.0 ± 0%      -7.69%   (p=0.000 n=30+30)
ServerHijack-4                     51.0 ± 0%      50.0 ± 0%      -1.96%   (p=0.000 n=30+30)
CloseNotifier-4                    51.0 ± 0%      49.0 ± 0%      -3.92%   (p=0.000 n=30+30)
ResponseStatusLine-4               0.00           0.00           ~        (all equal)

name                               old speed      new speed      delta
ReadRequestChrome-4                218MB/s ±16%   249MB/s ±13%   +14.15%  (p=0.000 n=28+27)
ReadRequestCurl-4                  57.6MB/s ± 5%  61.4MB/s ±13%  +6.72%   (p=0.000 n=26+28)
ReadRequestApachebench-4           59.7MB/s ±11%  63.1MB/s ±15%  +5.79%   (p=0.000 n=27+30)
ReadRequestSiege-4                 85.1MB/s ± 8%  93.5MB/s ± 5%  +9.82%   (p=0.000 n=27+28)
ReadRequestWrk-4                   38.2MB/s ±11%  41.6MB/s ± 5%  +8.89%   (p=0.000 n=27+25)
FileAndServer_1KB/NoTLS-4          7.71MB/s ± 6%  8.10MB/s ± 9%  +5.10%   (p=0.000 n=29+30)
FileAndServer_1KB/TLS-4            7.21MB/s ± 8%  7.57MB/s ± 5%  +5.05%   (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4         1.53GB/s ± 4%  1.61GB/s ± 6%  +5.11%   (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4           775MB/s ± 9%   794MB/s ± 9%   +2.52%   (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4         1.61GB/s ± 5%  1.62GB/s ± 5%  ~        (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4           808MB/s ±13%   826MB/s ±10%   ~        (p=0.098 n=30+28)
```
What version of Go are you using (`go version`)?

What did you do?

For our use case (an HTTP proxy), `textproto.ReadMIMEHeader` is in the top 3 in terms of allocations. To ease pressure on the GC, we could pre-allocate 512 bytes to be used for small header values. This can halve the number of allocations for this function.

The patch is pretty small:

Please tell me if it's worth submitting a PR with this change.
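The patch itself is not shown above, so here is only a rough, hypothetical sketch of the pre-allocation idea: grow one buffer up front and slice all small header values out of a single string. The function name, the use of strings.Builder, and applying the 512-byte size this way are assumptions, not the actual patch:

```go
package main

import (
	"fmt"
	"strings"
)

// sliceValues copies all header values into one strings.Builder
// that is grown once, then returns substrings of the single
// resulting string: one up-front allocation backs every small
// value instead of one allocation per value. Illustrative sketch
// of the idea described in the issue, not the real patch.
func sliceValues(values [][]byte) []string {
	var b strings.Builder
	b.Grow(512) // pre-allocate, per the size suggested in the issue
	offs := make([]int, 0, len(values)+1)
	for _, v := range values {
		offs = append(offs, b.Len())
		b.Write(v)
	}
	offs = append(offs, b.Len())
	s := b.String() // one conversion for all values
	out := make([]string, len(values))
	for i := range values {
		out[i] = s[offs[i]:offs[i+1]]
	}
	return out
}

func main() {
	vals := sliceValues([][]byte{[]byte("gzip"), []byte("text/plain")})
	fmt.Println(vals[0], vals[1])
}
```

If the accumulated values exceed 512 bytes, the Builder simply reallocates, so large headers still work; they just lose the single-allocation benefit.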