Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/textproto: reduce allocations in ReadMIMEHeader #37809

Open
rs opened this issue Mar 11, 2020 · 20 comments
Open

net/textproto: reduce allocations in ReadMIMEHeader #37809

rs opened this issue Mar 11, 2020 · 20 comments
Labels
NeedsDecision
Milestone

Comments

@rs
Copy link
Contributor

@rs rs commented Mar 11, 2020

What version of Go are you using (go version)?

1.14

What did you do?

For our use-case (a HTTP proxy) textproto.ReadMIMEHeader is in the top 3 in term of allocations. In order to ease the GC, we could pre-allocate 512 bytes to be used for small header values. This can reduce by half the number of allocations for this function.

name                              old time/op    new time/op    delta
ReadMIMEHeader/client_headers-16    2.30µs ± 5%    2.11µs ± 2%   -8.21%  (p=0.000 n=10+10)
ReadMIMEHeader/server_headers-16    1.94µs ± 3%    1.85µs ± 3%   -4.56%  (p=0.000 n=10+10)

name                              old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-16    1.53kB ± 0%    1.69kB ± 0%  +10.48%  (p=0.000 n=10+10)
ReadMIMEHeader/server_headers-16    1.09kB ± 0%    1.44kB ± 0%  +32.32%  (p=0.000 n=10+10)

name                              old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-16      14.0 ± 0%       6.0 ± 0%  -57.14%  (p=0.000 n=10+10)
ReadMIMEHeader/server_headers-16      14.0 ± 0%       6.0 ± 0%  -57.14%  (p=0.000 n=10+10)

The patch is pretty small:

index d26e981ae4..6126a6685c 100644
--- a/src/net/textproto/reader.go
+++ b/src/net/textproto/reader.go
@@ -13,6 +13,7 @@ import (
        "strconv"
        "strings"
        "sync"
+       "unsafe"
 )

 // A Reader implements convenience methods for reading requests
@@ -502,6 +503,12 @@ func (r *Reader) ReadMIMEHeader() (MIMEHeader, error) {
                return m, ProtocolError("malformed MIME header initial line: " + string(line))
        }

+       // Create a pre-allocated byte slice for all header values to save
+       // allocations for the first 512 bytes of values, a size that fit typical
+       // small header values, larger ones will get their own allocation.
+       const valuesPreAllocSize = 1 << 9
+       valuesPreAlloc := make([]byte, 0, valuesPreAllocSize)
+
        for {
                kv, err := r.readContinuedLineSlice(mustHaveFieldNameColon)
                if len(kv) == 0 {
@@ -527,7 +534,17 @@ func (r *Reader) ReadMIMEHeader() (MIMEHeader, error) {
                for i < len(kv) && (kv[i] == ' ' || kv[i] == '\t') {
                        i++
                }
-               value := string(kv[i:])
+
+               // Try to fit the value in the pre-allocated buffer to save allocations.
+               var value string
+               if len(kv[i:]) <= valuesPreAllocSize-len(valuesPreAlloc) {
+                       off := len(valuesPreAlloc)
+                       valuesPreAlloc = append(valuesPreAlloc, kv[i:]...)
+                       v := valuesPreAlloc[off:]
+                       value = *(*string)(unsafe.Pointer(&v))
+               } else {
+                       value = string(kv[i:])
+               }

                vv := m[key]
                if vv == nil && len(strs) > 0 {

Please tell me if it's worth submitting a PR with this change.

@toothrot toothrot added the NeedsDecision label Mar 11, 2020
@toothrot toothrot added this to the Backlog milestone Mar 11, 2020
@toothrot
Copy link
Contributor

@toothrot toothrot commented Mar 11, 2020

/cc @bradfitz @rsc

@smasher164
Copy link
Member

@smasher164 smasher164 commented Mar 12, 2020

I wonder if it's possible to make use of strings.Builder and pre-allocating with Grow(valuesPreAllocSize), instead of depending on unsafe and only using the buffer when the value fits inside it.

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Mar 12, 2020

We're not adding unsafe usage to textproto.

@rs
Copy link
Contributor Author

@rs rs commented Mar 12, 2020

With or without strings.Builder, this trick won't work if you grow the slice, as a copy of the backing array will be performed, which would be a waste: the strings created before the grow would still point on the old backing array, copying data for nothing and wasting the first half of the new grown backing array.

I tried a slightly more complex implementation that would cover requests with arbitrary number header values, but it added no gain from my training set of real-life requests/responses.

We could still use strings.Builder to hide the use of unsafe, but I'm not sure to see the point.

@rs
Copy link
Contributor Author

@rs rs commented Mar 12, 2020

We're not adding unsafe usage to textproto.

@bradfitz would it be accepted if hidden by the use of strings.Builder?

EDIT: scratch that, this can't be implemented with strings.Buffer. So I guess it's a no-go.

@rs
Copy link
Contributor Author

@rs rs commented Mar 12, 2020

@bradfitz would you be open to adding the Slice(off, len int) string method to strings.Builder to allow implementation of such tricks without the need of unsafe?

@bcmills
Copy link
Member

@bcmills bcmills commented Mar 12, 2020

@rs, could you write all of the values into the strings.Builder and record the offsets, then convert all of the offsets to slices of the final string result?

@rs
Copy link
Contributor Author

@rs rs commented Mar 12, 2020

That could work yes, but it would make the code slightly more complex. I would personally prefer the Slice method on strings.Builder, but I guess it will require some more convincing :)

@josharian
Copy link
Contributor

@josharian josharian commented Mar 12, 2020

Another option (which would also complicate the code) is to have a fixed size, stack-allocated buffer for pending values to record and for the bytes involved. When a new kv pair is available, append the bytes to the bytes buffer and indices in the bytes buffer to the k/v pairs buffer. When either buffer is too full, convert the byte buffer to a string and add slices of that string to the map. This could all be wrapped up methods in a separate type, for sanity. (This is similar to the strings.Builder idea, but without the extra strings.Builder layer, since a plain ol' byte slice should suffice.)

Also, if the keys are commonly re-used, string interning might help.

@rs
Copy link
Contributor Author

@rs rs commented Mar 12, 2020

Good point, you would still pay for one extra copy, but it's better than nothing. String interning might work for some very common header values, I guess it could save a couple alloc per request with common content types and accept header values for instance.

Before spending more time on this, I'd love to hear from @bradfitz about the strings.Builder.Slice idea and if not, what is the probability to having more complex implementation as suggested above accepted.

@bcmills
Copy link
Member

@bcmills bcmills commented Mar 12, 2020

@rs, strings.Builder.Slice seems too prone to misuse: much too easy to accidentally create multiple underlying strings (due to growth) when the user really only intended to retain slices of a single big string.

@rs
Copy link
Contributor Author

@rs rs commented Mar 12, 2020

Agreed but it would still be safe and you would still save allocations by creating those substring from a common buffer.

@kokizzu
Copy link

@kokizzu kokizzu commented Jan 29, 2021

Possibly related, this allocation grows to 200MB over a day or a week and cannot be reclaimed by GC:
image

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jan 29, 2021

Possibly related, this allocation grows to 200MB over a week and cannot be reclaimed by GC:

How are you asserting that this allocation cannot be garbage collected?

@kokizzu
Copy link

@kokizzu kokizzu commented Jan 29, 2021

i run github.com/google/gops on the service:

$ gops pprof-heap 1
Profile dump saved to: /tmp/heap_profile363561508
$ gops gc 1 # force GC
$ gops pprof-heap 1
Profile dump saved to: /tmp/heap_profile269530696

copy both heap_profile and compare, it doesn't decrease '__')
image


File: phoenix
Type: inuse_space
Time: Jan 29, 2021 at 4:10pm (WIB)
Showing nodes accounting for 296.06MB, 96.41% of 307.09MB total
Dropped 81 nodes (cum <= 1.54MB)
      flat  flat%   sum%        cum   cum%
  216.36MB 70.46% 70.46%   236.86MB 77.13%  net/textproto.(*Reader).ReadMIMEHeader
   20.50MB  6.68% 77.13%    20.50MB  6.68%  net/textproto.canonicalMIMEHeaderKey
   16.63MB  5.42% 82.55%    16.63MB  5.42%  github.com/gorilla/context.Set
    9.50MB  3.09% 85.64%       13MB  4.23%  net/http.(*Request).WithContext
    6.50MB  2.12% 87.76%     7.50MB  2.44%  context.WithCancel
       5MB  1.63% 89.39%        5MB  1.63%  context.(*cancelCtx).Done
       5MB  1.63% 91.02%        5MB  1.63%  context.WithValue
       4MB  1.30% 92.32%        4MB  1.30%  syscall.anyToSockaddr
    3.50MB  1.14% 93.46%     3.50MB  1.14%  net/http.cloneURL
    2.50MB  0.81% 94.27%     2.50MB  0.81%  net/http.(*Server).newConn
    2.50MB  0.81% 95.09%     2.50MB  0.81%  net.newFD
    2.01MB  0.65% 95.74%     2.01MB  0.65%  bufio.NewReaderSize
    1.55MB   0.5% 96.24%     1.55MB   0.5%  regexp.(*bitState).reset
    0.50MB  0.16% 96.41%   294.09MB 95.77%  net/http.(*conn).serve
         0     0% 96.41%     2.01MB  0.65%  bufio.NewReader
         0     0% 96.41%     5.03MB  1.64%  github.com/kokizzu/phoenix/handler.Exclusive.Get
         0     0% 96.41%     5.03MB  1.64%  github.com/kokizzu/phoenix/handler.GetHandler
         0     0% 96.41%     2.55MB  0.83%  github.com/kokizzu/phoenix/router/authentication.JWT
         0     0% 96.41%    25.22MB  8.21%  github.com/kokizzu/phoenix/router/middleware.Exclusive.func1
         0     0% 96.41%    25.22MB  8.21%  github.com/go-chi/chi.(*Mux).Mount.func1
         0     0% 96.41%    41.22MB 13.42%  github.com/go-chi/chi.(*Mux).ServeHTTP
         0     0% 96.41%    25.22MB  8.21%  github.com/go-chi/chi.(*Mux).routeHTTP
         0     0% 96.41%    39.72MB 12.93%  github.com/go-chi/chi/middleware.Recoverer.func1
         0     0% 96.41%    39.72MB 12.93%  github.com/go-chi/chi/middleware.RequestID.func1
         0     0% 96.41%     2.55MB  0.83%  go.kokizzu.io/apinizer/v8/authorization.NewJWTAuthorization
         0     0% 96.41%     2.55MB  0.83%  go.kokizzu.io/apinizer/v8/authorization.getToken
         0     0% 96.41%       10MB  3.26%  main.main.func1
         0     0% 96.41%     7.50MB  2.44%  net.(*TCPListener).Accept
         0     0% 96.41%     7.50MB  2.44%  net.(*TCPListener).accept
         0     0% 96.41%     7.50MB  2.44%  net.(*netFD).accept
         0     0% 96.41%       10MB  3.26%  net/http.(*Server).ListenAndServe
         0     0% 96.41%       10MB  3.26%  net/http.(*Server).Serve
         0     0% 96.41%   243.36MB 79.25%  net/http.(*conn).readRequest
         0     0% 96.41%    39.72MB 12.93%  net/http.HandlerFunc.ServeHTTP
         0     0% 96.41%     2.01MB  0.65%  net/http.newBufioReader
         0     0% 96.41%   239.36MB 77.94%  net/http.readRequest
         0     0% 96.41%    41.22MB 13.42%  net/http.serverHandler.ServeHTTP
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).MatchString
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).backtrack
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).doExecute
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).doMatch
         0     0% 96.41%     1.55MB   0.5%  regexp.MatchString
         0     0% 96.41%        2MB  0.65%  sync.(*Once).Do
         0     0% 96.41%        2MB  0.65%  sync.(*Once).doSlow
         0     0% 96.41%     2.50MB  0.81%  syscall.Getsockname

or this caused by something else?

@kokizzu
Copy link

@kokizzu kokizzu commented Jan 29, 2021

my friend found it https://rover.rocks/golang-memory-leak '__') probably it's gin-gonic/gin's context.Set that did the cache trashing that holds the allocation

@s-tokutake
Copy link

@s-tokutake s-tokutake commented Apr 10, 2021

We are facing the same problem as above. However, we does not use gin-gonic/gin.

image

@josharian
Copy link
Contributor

@josharian josharian commented Apr 11, 2021

Do you have any statistics about what the header values are? If there are a handful of very common values, perhaps we could hard-code some strings to re-use instead of allocate. (Like string interning, but hard-coded.)

@jonaz
Copy link

@jonaz jonaz commented Mar 25, 2022

We are also seeing a majority of the memory being used in ReadMIMEHeader in a simple proxy application. I have yet to investigate if clients send unusually large headers (for example many cookies).
2022-03-25-075812_2024x1120_scrot

Note that this is compiled with go 1.13. I've recompiled using go 1.17 now and will check if there are any improvements.

CAFxX added a commit to CAFxX/go that referenced this issue Apr 11, 2022
PoC for golang#37809

name                             old time/op    new time/op    delta
ReadMIMEHeader/client_headers-4    2.38µs ± 9%    2.26µs ±11%   -5.35%  (p=0.000 n=26+29)
ReadMIMEHeader/server_headers-4    2.10µs ± 8%    1.95µs ± 9%   -7.28%  (p=0.000 n=28+26)
Uncommon-4                          549ns ± 7%     522ns ±10%   -4.99%  (p=0.000 n=29+28)

name                             old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-4    1.53kB ± 0%    0.95kB ± 1%  -38.41%  (p=0.000 n=29+28)
ReadMIMEHeader/server_headers-4    1.10kB ± 0%    0.92kB ± 0%  -16.51%  (p=0.000 n=30+30)
Uncommon-4                           468B ± 0%      435B ± 0%   -7.05%  (p=0.000 n=30+27)

name                             old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
ReadMIMEHeader/server_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
Uncommon-4                           5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=30+30)
@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented Apr 11, 2022

Spent a bit on this and, unfortunately, there does not seem to be a lot that can be done. The only approach I found that worked1, given the constraints above, was interning (under the assumption that headers are not mostly unique) as that reduced both allocations and CPU usage, so this may be a good candidate for some form of string interning (not exposed to users):

benchmarks

note: machine was noisy, need to run again on a dedicated one

net/textproto:

name                             old time/op    new time/op    delta
ReadMIMEHeader/client_headers-4    2.38µs ± 9%    2.26µs ±11%   -5.35%  (p=0.000 n=26+29)
ReadMIMEHeader/server_headers-4    2.10µs ± 8%    1.95µs ± 9%   -7.28%  (p=0.000 n=28+26)
Uncommon-4                          549ns ± 7%     522ns ±10%   -4.99%  (p=0.000 n=29+28)

name                             old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-4    1.53kB ± 0%    0.95kB ± 1%  -38.41%  (p=0.000 n=29+28)
ReadMIMEHeader/server_headers-4    1.10kB ± 0%    0.92kB ± 0%  -16.51%  (p=0.000 n=30+30)
Uncommon-4                           468B ± 0%      435B ± 0%   -7.05%  (p=0.000 n=30+27)

name                             old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
ReadMIMEHeader/server_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
Uncommon-4                           5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=30+30)

net/http:

name                               old time/op    new time/op    delta
CookieString-4                        653ns ± 9%     819ns ±35%  +25.42%  (p=0.000 n=26+28)
ReadSetCookies-4                     2.40µs ± 9%    2.96µs ±27%  +23.31%  (p=0.000 n=26+28)
ReadCookies-4                        2.85µs ± 9%    3.71µs ±52%  +30.13%  (p=0.000 n=25+29)
HeaderWriteSubset-4                   418ns ± 2%     431ns ±10%   +3.17%  (p=0.000 n=27+25)
CopyValues-4                         1.22µs ± 5%    1.19µs ±11%   -2.48%  (p=0.000 n=25+28)
ServerMatch-4                        18.7ns ± 2%    18.8ns ± 8%     ~     (p=0.113 n=28+26)
ReadRequestChrome-4                  2.81µs ±19%    2.45µs ±14%  -12.95%  (p=0.000 n=28+26)
ReadRequestCurl-4                    1.36µs ± 6%    1.26µs ± 7%   -6.72%  (p=0.000 n=26+27)
ReadRequestApachebench-4             1.38µs ±12%    1.30µs ±18%   -5.32%  (p=0.000 n=27+30)
ReadRequestSiege-4                   1.77µs ± 8%    1.62µs ± 6%   -8.67%  (p=0.000 n=26+28)
ReadRequestWrk-4                     1.05µs ±13%    0.96µs ± 5%   -8.53%  (p=0.000 n=27+24)
FileAndServer_1KB/NoTLS-4             133µs ± 6%     126µs ±10%   -5.04%  (p=0.000 n=29+29)
FileAndServer_1KB/TLS-4               142µs ± 7%     135µs ± 6%   -4.87%  (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4           11.0ms ± 4%    10.5ms ± 6%   -4.82%  (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4             21.7ms ±10%    21.2ms ±10%   -2.48%  (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4           41.7ms ± 5%    41.5ms ± 5%     ~     (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4             82.9ms ±12%    81.0ms ±12%     ~     (p=0.096 n=29+29)
ServeMux-4                           50.6µs ± 7%    50.1µs ± 7%     ~     (p=0.118 n=26+28)
ServeMux_SkipServe-4                 23.0µs ± 4%    23.1µs ± 5%     ~     (p=0.796 n=27+26)
ClientServer-4                       95.7µs ± 7%    92.9µs ± 7%   -2.91%  (p=0.001 n=30+28)
ClientServerParallel4-4              37.5µs ±24%    36.5µs ±47%     ~     (p=0.124 n=29+29)
ClientServerParallel64-4              122µs ±49%     108µs ±58%  -11.15%  (p=0.005 n=26+29)
Server-4                              169µs ±43%     165µs ±48%     ~     (p=0.618 n=29+25)
Client-4                              147µs ±25%     136µs ±17%   -7.57%  (p=0.005 n=30+24)
ServerFakeConnNoKeepAlive-4          23.0µs ±61%    20.8µs ±38%     ~     (p=0.109 n=30+30)
ServerFakeConnWithKeepAlive-4        11.6µs ±21%    11.7µs ±39%     ~     (p=0.686 n=27+30)
ServerFakeConnWithKeepAliveLite-4    7.74µs ±30%    7.71µs ±23%     ~     (p=0.879 n=26+29)
ServerHandlerTypeLen-4               8.96µs ±44%    9.52µs ±46%     ~     (p=0.138 n=30+30)
ServerHandlerNoLen-4                 7.92µs ±33%    7.56µs ±29%     ~     (p=0.376 n=25+26)
ServerHandlerNoType-4                7.70µs ±33%    8.70µs ±51%  +13.01%  (p=0.036 n=28+30)
ServerHandlerNoHeader-4              5.52µs ±22%    6.10µs ±29%  +10.49%  (p=0.009 n=28+30)
ServerHijack-4                       16.2µs ± 4%    16.4µs ± 7%     ~     (p=0.580 n=26+28)
CloseNotifier-4                       137µs ± 9%     143µs ± 9%   +4.34%  (p=0.000 n=26+27)
ResponseStatusLine-4                 24.5ns ±13%    24.7ns ± 5%   +0.70%  (p=0.033 n=28+27)

name                               old alloc/op   new alloc/op   delta
CookieString-4                         176B ± 0%      176B ± 0%     ~     (all equal)
ReadSetCookies-4                     1.01kB ± 0%    1.01kB ± 0%     ~     (all equal)
ReadCookies-4                        1.84kB ± 0%    1.84kB ± 0%     ~     (all equal)
HeaderWriteSubset-4                   0.00B          0.00B          ~     (all equal)
CopyValues-4                           736B ± 0%      736B ± 0%     ~     (all equal)
ReadRequestChrome-4                  1.83kB ± 0%    1.33kB ± 1%  -27.52%  (p=0.000 n=30+30)
ReadRequestCurl-4                      912B ± 0%      887B ± 0%   -2.73%  (p=0.000 n=30+29)
ReadRequestApachebench-4               916B ± 0%      887B ± 0%   -3.14%  (p=0.000 n=30+29)
ReadRequestSiege-4                   1.00kB ± 0%    0.92kB ± 0%   -8.05%  (p=0.000 n=30+27)
ReadRequestWrk-4                       864B ± 0%      855B ± 0%   -1.04%  (p=0.000 n=30+22)
ServeMux-4                           17.3kB ± 0%    17.3kB ± 0%     ~     (all equal)
ServeMux_SkipServe-4                  0.00B          0.00B          ~     (all equal)
ClientServer-4                       5.01kB ± 0%    5.01kB ± 1%     ~     (p=0.194 n=29+30)
ClientServerParallel4-4              6.74kB ±18%    6.69kB ±17%     ~     (p=0.976 n=30+30)
ClientServerParallel64-4             16.4kB ±24%    16.1kB ±22%     ~     (p=0.640 n=29+28)
Server-4                             2.32kB ± 1%    2.36kB ± 3%   +1.82%  (p=0.000 n=29+28)
Client-4                             3.50kB ± 0%    3.53kB ± 2%   +0.83%  (p=0.000 n=26+30)
ServerFakeConnNoKeepAlive-4          4.75kB ± 0%    4.56kB ± 1%   -4.18%  (p=0.000 n=26+30)
ServerFakeConnWithKeepAlive-4        2.48kB ± 0%    2.24kB ± 0%   -9.59%  (p=0.000 n=30+29)
ServerFakeConnWithKeepAliveLite-4    1.35kB ± 0%    1.36kB ± 0%   +0.73%  (p=0.000 n=30+27)
ServerHandlerTypeLen-4               2.16kB ± 0%    2.19kB ± 0%   +1.31%  (p=0.000 n=30+30)
ServerHandlerNoLen-4                 2.13kB ± 0%    2.15kB ± 0%   +1.09%  (p=0.000 n=30+29)
ServerHandlerNoType-4                2.13kB ± 0%    2.16kB ± 0%   +1.28%  (p=0.000 n=30+29)
ServerHandlerNoHeader-4              1.35kB ± 0%    1.36kB ± 0%   +0.93%  (p=0.000 n=27+28)
ServerHijack-4                       16.2kB ± 0%    16.4kB ± 0%   +0.93%  (p=0.000 n=28+30)
CloseNotifier-4                      3.36kB ± 1%    3.48kB ± 2%   +3.40%  (p=0.000 n=28+30)
ResponseStatusLine-4                  0.00B          0.00B          ~     (all equal)

name                               old allocs/op  new allocs/op  delta
CookieString-4                         1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadSetCookies-4                       15.0 ± 0%      15.0 ± 0%     ~     (all equal)
ReadCookies-4                          11.0 ± 0%      11.0 ± 0%     ~     (all equal)
HeaderWriteSubset-4                    0.00           0.00          ~     (all equal)
CopyValues-4                           11.0 ± 0%      11.0 ± 0%     ~     (all equal)
ReadRequestChrome-4                    14.0 ± 0%       6.0 ± 0%  -57.14%  (p=0.000 n=30+30)
ReadRequestCurl-4                      9.00 ± 0%      6.00 ± 0%  -33.33%  (p=0.000 n=30+30)
ReadRequestApachebench-4               9.00 ± 0%      6.00 ± 0%  -33.33%  (p=0.000 n=30+30)
ReadRequestSiege-4                     11.0 ± 0%       6.0 ± 0%  -45.45%  (p=0.000 n=30+30)
ReadRequestWrk-4                       7.00 ± 0%      6.00 ± 0%  -14.29%  (p=0.000 n=30+30)
ServeMux-4                              360 ± 0%       360 ± 0%     ~     (all equal)
ServeMux_SkipServe-4                   0.00           0.00          ~     (all equal)
ClientServer-4                         59.0 ± 0%      53.0 ± 0%  -10.17%  (p=0.000 n=30+30)
ClientServerParallel4-4                65.4 ± 9%      59.2 ± 9%   -9.48%  (p=0.000 n=30+30)
ClientServerParallel64-4                100 ±19%        96 ± 9%   -4.49%  (p=0.003 n=30+28)
Server-4                               21.0 ± 0%      18.0 ± 0%  -14.29%  (p=0.000 n=30+30)
Client-4                               42.0 ± 0%      39.0 ± 0%   -7.14%  (p=0.000 n=30+30)
ServerFakeConnNoKeepAlive-4            53.0 ± 0%      47.0 ± 0%  -11.32%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAlive-4          23.0 ± 0%      17.0 ± 0%  -26.09%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAliveLite-4      13.0 ± 0%      12.0 ± 0%   -7.69%  (p=0.000 n=30+30)
ServerHandlerTypeLen-4                 20.0 ± 0%      19.0 ± 0%   -5.00%  (p=0.000 n=30+30)
ServerHandlerNoLen-4                   18.0 ± 0%      17.0 ± 0%   -5.56%  (p=0.000 n=30+30)
ServerHandlerNoType-4                  19.0 ± 0%      18.0 ± 0%   -5.26%  (p=0.000 n=30+30)
ServerHandlerNoHeader-4                13.0 ± 0%      12.0 ± 0%   -7.69%  (p=0.000 n=30+30)
ServerHijack-4                         51.0 ± 0%      50.0 ± 0%   -1.96%  (p=0.000 n=30+30)
CloseNotifier-4                        51.0 ± 0%      49.0 ± 0%   -3.92%  (p=0.000 n=30+30)
ResponseStatusLine-4                   0.00           0.00          ~     (all equal)

name                               old speed      new speed      delta
ReadRequestChrome-4                 218MB/s ±16%   249MB/s ±13%  +14.15%  (p=0.000 n=28+27)
ReadRequestCurl-4                  57.6MB/s ± 5%  61.4MB/s ±13%   +6.72%  (p=0.000 n=26+28)
ReadRequestApachebench-4           59.7MB/s ±11%  63.1MB/s ±15%   +5.79%  (p=0.000 n=27+30)
ReadRequestSiege-4                 85.1MB/s ± 8%  93.5MB/s ± 5%   +9.82%  (p=0.000 n=27+28)
ReadRequestWrk-4                   38.2MB/s ±11%  41.6MB/s ± 5%   +8.89%  (p=0.000 n=27+25)
FileAndServer_1KB/NoTLS-4          7.71MB/s ± 6%  8.10MB/s ± 9%   +5.10%  (p=0.000 n=29+30)
FileAndServer_1KB/TLS-4            7.21MB/s ± 8%  7.57MB/s ± 5%   +5.05%  (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4         1.53GB/s ± 4%  1.61GB/s ± 6%   +5.11%  (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4            775MB/s ± 9%   794MB/s ± 9%   +2.52%  (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4         1.61GB/s ± 5%  1.62GB/s ± 5%     ~     (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4            808MB/s ±13%   826MB/s ±10%     ~     (p=0.098 n=30+28)

I have pushed the poc here. If we think this may be useful I can try to refine it properly (the current interning logic is really crude).

If you have internal benchmarks it would be useful to run them on that poc to get a better sense of whether there are some workloads where this approach falls short.

While I haven't tested this on our production workloads yet, I can share that on one of our highest-traffic services ReadMIMEHeader accounts for ~2% of CPU time and ~2.5% of total allocations by bytes.

Footnotes

  1. other attempts included adding more headers to the list of known headers, some form of single string allocation, and other similar micro-optimizations

CAFxX added a commit to CAFxX/go that referenced this issue Apr 11, 2022
PoC for golang#37809

name                             old time/op    new time/op    delta
ReadMIMEHeader/client_headers-4    2.38µs ± 9%    2.26µs ±11%   -5.35%  (p=0.000 n=26+29)
ReadMIMEHeader/server_headers-4    2.10µs ± 8%    1.95µs ± 9%   -7.28%  (p=0.000 n=28+26)
Uncommon-4                          549ns ± 7%     522ns ±10%   -4.99%  (p=0.000 n=29+28)

name                             old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-4    1.53kB ± 0%    0.95kB ± 1%  -38.41%  (p=0.000 n=29+28)
ReadMIMEHeader/server_headers-4    1.10kB ± 0%    0.92kB ± 0%  -16.51%  (p=0.000 n=30+30)
Uncommon-4                           468B ± 0%      435B ± 0%   -7.05%  (p=0.000 n=30+27)

name                             old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
ReadMIMEHeader/server_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
Uncommon-4                           5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=30+30)
CAFxX added a commit to CAFxX/go that referenced this issue Apr 11, 2022
PoC for golang#37809

net/textproto:

name                             old time/op    new time/op    delta
ReadMIMEHeader/client_headers-4    2.38µs ± 9%    2.26µs ±11%   -5.35%  (p=0.000 n=26+29)
ReadMIMEHeader/server_headers-4    2.10µs ± 8%    1.95µs ± 9%   -7.28%  (p=0.000 n=28+26)
Uncommon-4                          549ns ± 7%     522ns ±10%   -4.99%  (p=0.000 n=29+28)

name                             old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-4    1.53kB ± 0%    0.95kB ± 1%  -38.41%  (p=0.000 n=29+28)
ReadMIMEHeader/server_headers-4    1.10kB ± 0%    0.92kB ± 0%  -16.51%  (p=0.000 n=30+30)
Uncommon-4                           468B ± 0%      435B ± 0%   -7.05%  (p=0.000 n=30+27)

name                             old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
ReadMIMEHeader/server_headers-4      14.0 ± 0%       3.0 ± 0%  -78.57%  (p=0.000 n=30+30)
Uncommon-4                           5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=30+30)

net/http:

name                               old time/op    new time/op    delta
CookieString-4                        653ns ± 9%     819ns ±35%  +25.42%  (p=0.000 n=26+28)
ReadSetCookies-4                     2.40µs ± 9%    2.96µs ±27%  +23.31%  (p=0.000 n=26+28)
ReadCookies-4                        2.85µs ± 9%    3.71µs ±52%  +30.13%  (p=0.000 n=25+29)
HeaderWriteSubset-4                   418ns ± 2%     431ns ±10%   +3.17%  (p=0.000 n=27+25)
CopyValues-4                         1.22µs ± 5%    1.19µs ±11%   -2.48%  (p=0.000 n=25+28)
ServerMatch-4                        18.7ns ± 2%    18.8ns ± 8%     ~     (p=0.113 n=28+26)
ReadRequestChrome-4                  2.81µs ±19%    2.45µs ±14%  -12.95%  (p=0.000 n=28+26)
ReadRequestCurl-4                    1.36µs ± 6%    1.26µs ± 7%   -6.72%  (p=0.000 n=26+27)
ReadRequestApachebench-4             1.38µs ±12%    1.30µs ±18%   -5.32%  (p=0.000 n=27+30)
ReadRequestSiege-4                   1.77µs ± 8%    1.62µs ± 6%   -8.67%  (p=0.000 n=26+28)
ReadRequestWrk-4                     1.05µs ±13%    0.96µs ± 5%   -8.53%  (p=0.000 n=27+24)
FileAndServer_1KB/NoTLS-4             133µs ± 6%     126µs ±10%   -5.04%  (p=0.000 n=29+29)
FileAndServer_1KB/TLS-4               142µs ± 7%     135µs ± 6%   -4.87%  (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4           11.0ms ± 4%    10.5ms ± 6%   -4.82%  (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4             21.7ms ±10%    21.2ms ±10%   -2.48%  (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4           41.7ms ± 5%    41.5ms ± 5%     ~     (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4             82.9ms ±12%    81.0ms ±12%     ~     (p=0.096 n=29+29)
ServeMux-4                           50.6µs ± 7%    50.1µs ± 7%     ~     (p=0.118 n=26+28)
ServeMux_SkipServe-4                 23.0µs ± 4%    23.1µs ± 5%     ~     (p=0.796 n=27+26)
ClientServer-4                       95.7µs ± 7%    92.9µs ± 7%   -2.91%  (p=0.001 n=30+28)
ClientServerParallel4-4              37.5µs ±24%    36.5µs ±47%     ~     (p=0.124 n=29+29)
ClientServerParallel64-4              122µs ±49%     108µs ±58%  -11.15%  (p=0.005 n=26+29)
Server-4                              169µs ±43%     165µs ±48%     ~     (p=0.618 n=29+25)
Client-4                              147µs ±25%     136µs ±17%   -7.57%  (p=0.005 n=30+24)
ServerFakeConnNoKeepAlive-4          23.0µs ±61%    20.8µs ±38%     ~     (p=0.109 n=30+30)
ServerFakeConnWithKeepAlive-4        11.6µs ±21%    11.7µs ±39%     ~     (p=0.686 n=27+30)
ServerFakeConnWithKeepAliveLite-4    7.74µs ±30%    7.71µs ±23%     ~     (p=0.879 n=26+29)
ServerHandlerTypeLen-4               8.96µs ±44%    9.52µs ±46%     ~     (p=0.138 n=30+30)
ServerHandlerNoLen-4                 7.92µs ±33%    7.56µs ±29%     ~     (p=0.376 n=25+26)
ServerHandlerNoType-4                7.70µs ±33%    8.70µs ±51%  +13.01%  (p=0.036 n=28+30)
ServerHandlerNoHeader-4              5.52µs ±22%    6.10µs ±29%  +10.49%  (p=0.009 n=28+30)
ServerHijack-4                       16.2µs ± 4%    16.4µs ± 7%     ~     (p=0.580 n=26+28)
CloseNotifier-4                       137µs ± 9%     143µs ± 9%   +4.34%  (p=0.000 n=26+27)
ResponseStatusLine-4                 24.5ns ±13%    24.7ns ± 5%   +0.70%  (p=0.033 n=28+27)

name                               old alloc/op   new alloc/op   delta
CookieString-4                         176B ± 0%      176B ± 0%     ~     (all equal)
ReadSetCookies-4                     1.01kB ± 0%    1.01kB ± 0%     ~     (all equal)
ReadCookies-4                        1.84kB ± 0%    1.84kB ± 0%     ~     (all equal)
HeaderWriteSubset-4                   0.00B          0.00B          ~     (all equal)
CopyValues-4                           736B ± 0%      736B ± 0%     ~     (all equal)
ReadRequestChrome-4                  1.83kB ± 0%    1.33kB ± 1%  -27.52%  (p=0.000 n=30+30)
ReadRequestCurl-4                      912B ± 0%      887B ± 0%   -2.73%  (p=0.000 n=30+29)
ReadRequestApachebench-4               916B ± 0%      887B ± 0%   -3.14%  (p=0.000 n=30+29)
ReadRequestSiege-4                   1.00kB ± 0%    0.92kB ± 0%   -8.05%  (p=0.000 n=30+27)
ReadRequestWrk-4                       864B ± 0%      855B ± 0%   -1.04%  (p=0.000 n=30+22)
ServeMux-4                           17.3kB ± 0%    17.3kB ± 0%     ~     (all equal)
ServeMux_SkipServe-4                  0.00B          0.00B          ~     (all equal)
ClientServer-4                       5.01kB ± 0%    5.01kB ± 1%     ~     (p=0.194 n=29+30)
ClientServerParallel4-4              6.74kB ±18%    6.69kB ±17%     ~     (p=0.976 n=30+30)
ClientServerParallel64-4             16.4kB ±24%    16.1kB ±22%     ~     (p=0.640 n=29+28)
Server-4                             2.32kB ± 1%    2.36kB ± 3%   +1.82%  (p=0.000 n=29+28)
Client-4                             3.50kB ± 0%    3.53kB ± 2%   +0.83%  (p=0.000 n=26+30)
ServerFakeConnNoKeepAlive-4          4.75kB ± 0%    4.56kB ± 1%   -4.18%  (p=0.000 n=26+30)
ServerFakeConnWithKeepAlive-4        2.48kB ± 0%    2.24kB ± 0%   -9.59%  (p=0.000 n=30+29)
ServerFakeConnWithKeepAliveLite-4    1.35kB ± 0%    1.36kB ± 0%   +0.73%  (p=0.000 n=30+27)
ServerHandlerTypeLen-4               2.16kB ± 0%    2.19kB ± 0%   +1.31%  (p=0.000 n=30+30)
ServerHandlerNoLen-4                 2.13kB ± 0%    2.15kB ± 0%   +1.09%  (p=0.000 n=30+29)
ServerHandlerNoType-4                2.13kB ± 0%    2.16kB ± 0%   +1.28%  (p=0.000 n=30+29)
ServerHandlerNoHeader-4              1.35kB ± 0%    1.36kB ± 0%   +0.93%  (p=0.000 n=27+28)
ServerHijack-4                       16.2kB ± 0%    16.4kB ± 0%   +0.93%  (p=0.000 n=28+30)
CloseNotifier-4                      3.36kB ± 1%    3.48kB ± 2%   +3.40%  (p=0.000 n=28+30)
ResponseStatusLine-4                  0.00B          0.00B          ~     (all equal)

name                               old allocs/op  new allocs/op  delta
CookieString-4                         1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadSetCookies-4                       15.0 ± 0%      15.0 ± 0%     ~     (all equal)
ReadCookies-4                          11.0 ± 0%      11.0 ± 0%     ~     (all equal)
HeaderWriteSubset-4                    0.00           0.00          ~     (all equal)
CopyValues-4                           11.0 ± 0%      11.0 ± 0%     ~     (all equal)
ReadRequestChrome-4                    14.0 ± 0%       6.0 ± 0%  -57.14%  (p=0.000 n=30+30)
ReadRequestCurl-4                      9.00 ± 0%      6.00 ± 0%  -33.33%  (p=0.000 n=30+30)
ReadRequestApachebench-4               9.00 ± 0%      6.00 ± 0%  -33.33%  (p=0.000 n=30+30)
ReadRequestSiege-4                     11.0 ± 0%       6.0 ± 0%  -45.45%  (p=0.000 n=30+30)
ReadRequestWrk-4                       7.00 ± 0%      6.00 ± 0%  -14.29%  (p=0.000 n=30+30)
ServeMux-4                              360 ± 0%       360 ± 0%     ~     (all equal)
ServeMux_SkipServe-4                   0.00           0.00          ~     (all equal)
ClientServer-4                         59.0 ± 0%      53.0 ± 0%  -10.17%  (p=0.000 n=30+30)
ClientServerParallel4-4                65.4 ± 9%      59.2 ± 9%   -9.48%  (p=0.000 n=30+30)
ClientServerParallel64-4                100 ±19%        96 ± 9%   -4.49%  (p=0.003 n=30+28)
Server-4                               21.0 ± 0%      18.0 ± 0%  -14.29%  (p=0.000 n=30+30)
Client-4                               42.0 ± 0%      39.0 ± 0%   -7.14%  (p=0.000 n=30+30)
ServerFakeConnNoKeepAlive-4            53.0 ± 0%      47.0 ± 0%  -11.32%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAlive-4          23.0 ± 0%      17.0 ± 0%  -26.09%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAliveLite-4      13.0 ± 0%      12.0 ± 0%   -7.69%  (p=0.000 n=30+30)
ServerHandlerTypeLen-4                 20.0 ± 0%      19.0 ± 0%   -5.00%  (p=0.000 n=30+30)
ServerHandlerNoLen-4                   18.0 ± 0%      17.0 ± 0%   -5.56%  (p=0.000 n=30+30)
ServerHandlerNoType-4                  19.0 ± 0%      18.0 ± 0%   -5.26%  (p=0.000 n=30+30)
ServerHandlerNoHeader-4                13.0 ± 0%      12.0 ± 0%   -7.69%  (p=0.000 n=30+30)
ServerHijack-4                         51.0 ± 0%      50.0 ± 0%   -1.96%  (p=0.000 n=30+30)
CloseNotifier-4                        51.0 ± 0%      49.0 ± 0%   -3.92%  (p=0.000 n=30+30)
ResponseStatusLine-4                   0.00           0.00          ~     (all equal)

name                               old speed      new speed      delta
ReadRequestChrome-4                 218MB/s ±16%   249MB/s ±13%  +14.15%  (p=0.000 n=28+27)
ReadRequestCurl-4                  57.6MB/s ± 5%  61.4MB/s ±13%   +6.72%  (p=0.000 n=26+28)
ReadRequestApachebench-4           59.7MB/s ±11%  63.1MB/s ±15%   +5.79%  (p=0.000 n=27+30)
ReadRequestSiege-4                 85.1MB/s ± 8%  93.5MB/s ± 5%   +9.82%  (p=0.000 n=27+28)
ReadRequestWrk-4                   38.2MB/s ±11%  41.6MB/s ± 5%   +8.89%  (p=0.000 n=27+25)
FileAndServer_1KB/NoTLS-4          7.71MB/s ± 6%  8.10MB/s ± 9%   +5.10%  (p=0.000 n=29+30)
FileAndServer_1KB/TLS-4            7.21MB/s ± 8%  7.57MB/s ± 5%   +5.05%  (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4         1.53GB/s ± 4%  1.61GB/s ± 6%   +5.11%  (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4            775MB/s ± 9%   794MB/s ± 9%   +2.52%  (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4         1.61GB/s ± 5%  1.62GB/s ± 5%     ~     (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4            808MB/s ±13%   826MB/s ±10%     ~     (p=0.098 n=30+28)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsDecision
Projects
None yet
Development

No branches or pull requests