
crypto/tls: web server leaking memory at block.reserve #28654

Closed

zemirco opened this issue Nov 8, 2018 · 10 comments

@zemirco commented Nov 8, 2018

What version of Go are you using (go version)?

$ go version
go version go1.11.1 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build714069567=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I've got a web server. The server is directly exposed to the internet. There is no proxy in front of it. Here is a minimal version.

r := chi.NewRouter()

tlsConfig := &tls.Config{
    PreferServerCipherSuites: true,
    MinVersion:               tls.VersionTLS12,
    CurvePreferences: []tls.CurveID{
        tls.CurveP256,
        tls.X25519,
    },
    CipherSuites: []uint16{
        tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
        tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
    },
}

s := &http.Server{
    ReadTimeout:  5 * time.Second,
    WriteTimeout: 10 * time.Second,
    IdleTimeout:  120 * time.Second,
    Handler:      r,
    TLSConfig:    tlsConfig,
}

// redirect http to https
redirect := &http.Server{
    ReadTimeout:  5 * time.Second,
    WriteTimeout: 10 * time.Second,
    IdleTimeout:  120 * time.Second,
    Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Connection", "close")
        url := "https://" + r.Host + r.URL.String()
        http.Redirect(w, r, url, http.StatusMovedPermanently)
    }),
}
go func() {
    log.Fatal(redirect.ListenAndServe())
}()

log.Fatal(s.ListenAndServeTLS(certFile, keyFile))

The web server runs inside Docker.

$ docker version
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:38 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:45 2018
  OS/Arch:      linux/amd64
  Experimental: false

What did you expect to see?

I expected to see more or less steady memory consumption.

What did you see instead?

I saw a steady growth in memory usage. Here is a screenshot for the last 30 days.

[screenshot from 2018-11-08 08-47-50: memory usage over the last 30 days]

Restarts happened between October 21 and October 28, so during that period the memory didn't have time to grow as much. Before October 21 you can see the pattern: memory grows to a certain point and then drops sharply.

And here is one for the last 6 hours.

[screenshot from 2018-11-08 08-47-29: memory usage over the last 6 hours]

The time span between 3 and 7 am is the most interesting. After that I tried various things and killed the server. The server doesn't get a lot of traffic at the moment. I think the almost perfect linear growth is due to an AWS Health Check.

[screenshot from 2018-11-08 08-54-48]

Every 30 seconds AWS sends a request to my server to make sure it is still running.

Here is the output of top5 from pprof.

Type: inuse_space
Time: Nov 7, 2018 at 10:31am (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 289.50MB, 79.70% of 363.24MB total
Dropped 90 nodes (cum <= 1.82MB)
Showing top 5 nodes out of 88
      flat  flat%   sum%        cum   cum%
  238.98MB 65.79% 65.79%   238.98MB 65.79%  crypto/tls.(*block).reserve
   20.02MB  5.51% 71.30%    20.02MB  5.51%  crypto/tls.Server
   11.50MB  3.17% 74.47%    11.50MB  3.17%  crypto/aes.newCipher
   10.50MB  2.89% 77.36%    10.50MB  2.89%  crypto/aes.(*aesCipherGCM).NewGCM

A visual representation:

[svg: pprof graph]

The exact lines in the code:

[image: annotated source lines]
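For context, the numbers above come from an inuse_space heap profile. A minimal sketch of how such a profile can be exposed and collected with net/http/pprof, assuming a separate private port (the port and the wiring are placeholders; the actual setup may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a private port next to the real server.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the real HTTPS server here, as in the snippet above ...
	select {}
}

The in-use heap can then be inspected with go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap.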

Before opening an issue here I asked on Stack Overflow: https://stackoverflow.com/questions/53189316/golang-web-server-leaking-memory-at-crypto-tls-block-reserve. We couldn't find a solution, but several people told me to open an issue. Another user (https://serverfault.com/users/126632/michael-hampton) even said he sees the same issue.

Any ideas?

If you need further information please let me know.

@ysmolsky changed the title from "web server leaking memory at crypto/tls.(*block).reserve" to "crypto/tls: web server leaking memory at block.reserve" on Nov 8, 2018
@ysmolsky (Member) commented Nov 8, 2018

@FiloSottile, I cannot find this code at tip; can you look at this issue?

@ysmolsky (Member) commented Nov 8, 2018

@zemirco can you try to reproduce this on the tip version of Go (this requires building Go from source)? Or can you put together a self-contained program that reproduces the problem so we can run it?

@zemirco (Author) commented Nov 8, 2018

Thank you for looking into this. I can only reproduce this issue in production. I haven't seen it on my local machine. That makes it hard to debug. You need a real domain with a valid certificate to make TLS work.

Is there a Dockerfile which builds Go from scratch that I could use instead of the one from Docker Hub that I'm currently using?

I upgraded Go inside Docker to 1.11.2-alpine3.8 and will post an updated graph in a few hours.

@bradfitz (Contributor) commented Nov 8, 2018

I assume you're just accumulating HTTP keep-alive connections and each one is retaining some memory.

How many file descriptors do you have open? Paste the output of lsof -p $PID.

And include the output of running with the environment variable GODEBUG=gctrace=1 set.

Also, you write:

Here is a minimal version.

r := chi.NewRouter()

Your first line of the "minimal version" includes some unspecified package.

Do you have a complete standalone repro, ideally without using external packages? But if you need the external packages, can you at least specify their imports & versions?
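For illustration, a standalone variant with the router stripped out could look roughly like the sketch below; the address, the trivial handler, and the certificate file names are placeholders, not anything from the original program.

package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder handler instead of the chi router.
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	tlsConfig := &tls.Config{
		PreferServerCipherSuites: true,
		MinVersion:               tls.VersionTLS12,
	}

	s := &http.Server{
		Addr:         ":443",
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 10 * time.Second,
		IdleTimeout:  120 * time.Second,
		Handler:      mux,
		TLSConfig:    tlsConfig,
	}

	// cert.pem and key.pem are placeholders for a real certificate/key pair.
	log.Fatal(s.ListenAndServeTLS("cert.pem", "key.pem"))
}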

@zemirco (Author) commented Nov 8, 2018

Here is the current graph showing the last 6 hours.

[screenshot 2018-11-08 at 22 50 33: memory usage over the last 6 hours]

Still seeing the linear growth. Here is the current top3 from pprof.

(pprof) top3
Showing nodes accounting for 113.12MB, 69.06% of 163.80MB total
Dropped 17 nodes (cum <= 0.82MB)
Showing top 3 nodes out of 104
      flat  flat%   sum%        cum   cum%
   97.61MB 59.59% 59.59%    97.61MB 59.59%  crypto/tls.(*block).reserve
       8MB  4.89% 64.48%        8MB  4.89%  net/textproto.(*Reader).ReadMIMEHeader
    7.50MB  4.58% 69.06%     7.50MB  4.58%  crypto/aes.(*aesCipherGCM).NewGCM

Output of lsof -p $PID.

COMMAND    PID USER   FD      TYPE DEVICE SIZE/OFF     NODE NAME
tomahawk 11518 root  cwd       DIR   0,48     4096  3109115 /go/src/github.com/zemirco/tomahawk
tomahawk 11518 root  rtd       DIR   0,48     4096  3106256 /
tomahawk 11518 root  txt       REG   0,48 22721273  3107144 /go/bin/tomahawk
tomahawk 11518 root  mem       REG  252,1           3107144 /go/bin/tomahawk (stat: No such file or directory)
tomahawk 11518 root  mem       REG  252,1           4419813 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
tomahawk 11518 root    0r     FIFO   0,12      0t0 26164278 pipe
tomahawk 11518 root    1w     FIFO   0,12      0t0 26164279 pipe
tomahawk 11518 root    2w     FIFO   0,12      0t0 26164280 pipe
tomahawk 11518 root    3u     sock    0,9      0t0 26164672 protocol: TCP
tomahawk 11518 root    4u  a_inode   0,13        0     9012 [eventpoll]
tomahawk 11518 root    5u     sock    0,9      0t0 26164731 protocol: TCP
tomahawk 11518 root    6u     sock    0,9      0t0 26164738 protocol: TCPv6
tomahawk 11518 root    7u     sock    0,9      0t0 26164745 protocol: TCPv6
tomahawk 11518 root    8u     sock    0,9      0t0 26164747 protocol: TCPv6
tomahawk 11518 root   10u     sock    0,9      0t0 26165341 protocol: TCP

I'm using chi for routing (https://github.com/go-chi/chi). Here is the version from my Gopkg.lock:

[[projects]]
  name = "github.com/go-chi/chi"
  packages = ["."]
  revision = "b5294d10673813fac8558e7f47242bc9e61b4c25"
  version = "v3.3.3"

I'm also using https://github.com/felixge/httpsnoop to capture HTTP-related metrics. It's in the critical path, as you can see in the SVG above. Here is the exact version:

[[projects]]
  name = "github.com/felixge/httpsnoop"
  packages = ["."]
  revision = "eadd4fad6aac69ae62379194fe0219f3dbc80fd3"
  version = "v1.0.0"

Output with GODEBUG=gctrace=1.

2018-11-08T22:44:16.590031102Z gc 1 @0.213s 1%: 0.020+10+0.013 ms clock, 0.020+0.15/2.6/0+0.013 ms cpu, 4->4->2 MB, 5 MB goal, 1 P
2018-11-08T22:44:16.683122273Z gc 2 @0.309s 1%: 0.022+7.8+0.010 ms clock, 0.022+0.78/1.8/1.5+0.010 ms cpu, 4->5->4 MB, 5 MB goal, 1 P
2018-11-08T22:44:57.655815041Z gc 3 @41.282s 0%: 0.009+7.0+0.011 ms clock, 0.009+0.38/0.94/3.9+0.011 ms cpu, 7->7->5 MB, 8 MB goal, 1 P
2018-11-08T22:46:22.032369159Z gc 4 @125.660s 0%: 0.081+5.9+0.012 ms clock, 0.081+0.18/0.16/5.5+0.012 ms cpu, 9->9->5 MB, 10 MB goal, 1 P
2018-11-08T22:46:46.630903631Z scvg0: inuse: 7, idle: 55, sys: 63, released: 0, consumed: 63 (MB)
2018-11-08T22:47:57.002212617Z gc 5 @220.626s 0%: 0.011+9.6+0.016 ms clock, 0.011+0.15/1.2/5.7+0.016 ms cpu, 10->10->6 MB, 11 MB goal, 1 P
2018-11-08T22:49:16.719639149Z scvg1: inuse: 10, idle: 52, sys: 63, released: 0, consumed: 63 (MB)
2018-11-08T22:49:43.557174096Z gc 6 @327.182s 0%: 0.013+9.1+0.010 ms clock, 0.013+0.96/1.1/5.5+0.010 ms cpu, 12->12->7 MB, 13 MB goal, 1 P
2018-11-08T22:51:43.678884679Z GC forced
2018-11-08T22:51:43.686286771Z gc 7 @447.310s 0%: 0.012+9.9+0.015 ms clock, 0.012+0/0.084/9.8+0.015 ms cpu, 14->14->7 MB, 15 MB goal, 1 P
2018-11-08T22:51:47.835089661Z scvg2: inuse: 9, idle: 53, sys: 63, released: 0, consumed: 63 (MB)
2018-11-08T22:53:26.227003208Z gc 8 @549.853s 0%: 0.011+7.6+0.013 ms clock, 0.011+0.33/1.1/5.9+0.013 ms cpu, 15->15->8 MB, 16 MB goal, 1 P
2018-11-08T22:54:18.389598926Z scvg3: inuse: 12, idle: 50, sys: 63, released: 0, consumed: 63 (MB)
2018-11-08T22:55:29.390862211Z GC forced
2018-11-08T22:55:29.399517412Z gc 9 @673.022s 0%: 0.023+10+0.012 ms clock, 0.023+0.12/1.5/5.9+0.012 ms cpu, 15->15->9 MB, 17 MB goal, 1 P
2018-11-08T22:56:49.913746798Z scvg4: inuse: 15, idle: 48, sys: 63, released: 0, consumed: 63 (MB)
2018-11-08T22:57:33.112418381Z GC forced
2018-11-08T22:57:33.122372223Z gc 10 @796.746s 0%: 0.011+9.7+0.014 ms clock, 0.011+0/0.092/9.6+0.014 ms cpu, 16->16->10 MB, 19 MB goal, 1 P
2018-11-08T22:59:20.102230021Z scvg5: inuse: 17, idle: 46, sys: 63, released: 0, consumed: 63 (MB)
2018-11-08T22:59:34.544470054Z GC forced
2018-11-08T22:59:34.554512955Z gc 11 @918.176s 0%: 0.011+12+0.013 ms clock, 0.011+0.10/1.3/7.5+0.013 ms cpu, 17->17->11 MB, 21 MB goal, 1 P
2018-11-08T23:01:35.394720036Z GC forced
2018-11-08T23:01:35.405141003Z gc 12 @1039.026s 0%: 0.011+12+0.010 ms clock, 0.011+0.094/1.6/7.6+0.010 ms cpu, 18->18->12 MB, 23 MB goal, 1 P
2018-11-08T23:01:51.896557407Z scvg6: inuse: 15, idle: 48, sys: 63, released: 0, consumed: 63 (MB)
@FiloSottile (Member) commented Nov 8, 2018

The *block structure is indeed gone as of https://golang.org/cl/142818. It was a simple freelist of reusable buffers, and I do not remember finding anything that would suggest there was a way for them to grow indefinitely.

They are held by the tls.Conn, though, so I suspect your issue here is that something up that stack is not letting go of connections.
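For illustration, the kind of code that would do this is anything that keeps a reference to the request or its connection after the handler returns. A hypothetical sketch of a middleware that caches *http.Request values in a package-level map and never deletes them:

package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

// Hypothetical leak: every request is stored in a package-level map and the
// entries are never deleted. Each retained *http.Request keeps its body reader
// reachable, and through it the server connection (including the tls.Conn and
// its buffers), so the heap grows with every connection served.
var (
	mu   sync.Mutex
	seen = map[*http.Request]time.Time{}
)

func track(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		seen[r] = time.Now()
		mu.Unlock()
		next.ServeHTTP(w, r)
	})
}

func main() {
	handler := track(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	log.Fatal(http.ListenAndServe(":8080", handler))
}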

@error10 commented Nov 13, 2018

I'm also seeing this apparent memory leak on a production server, which serves HTTP/2 over TLS directly to the Internet without being behind a proxy like nginx.

It takes some time to get close to exhausting the 2GB of RAM its containing VM is allowed, but when that happens, the pprof top output looks roughly like this:

(pprof) top
Showing nodes accounting for 1489.40MB, 77.63% of 1918.47MB total
Dropped 114 nodes (cum <= 9.59MB)
Showing top 10 nodes out of 111
      flat  flat%   sum%        cum   cum%
  820.06MB 42.75% 42.75%   820.06MB 42.75%  crypto/tls.(*block).reserve
  142.57MB  7.43% 50.18%   142.57MB  7.43%  vendor/golang_org/x/net/http2/hpack.(*headerFieldTable).addEntry
  103.09MB  5.37% 55.55%   103.09MB  5.37%  crypto/tls.Server
   79.02MB  4.12% 59.67%   247.04MB 12.88%  net/http.(*http2Server).ServeConn
   71.03MB  3.70% 63.37%    71.03MB  3.70%  bytes.makeSlice
   66.58MB  3.47% 66.84%    66.58MB  3.47%  myproject/vendor/github.com/gorilla/context.Set
   65.01MB  3.39% 70.23%    65.01MB  3.39%  net/http.(*Request).WithContext
   59.53MB  3.10% 73.33%   378.80MB 19.74%  crypto/tls.(*Conn).readHandshake
   56.01MB  2.92% 76.25%    56.01MB  2.92%  net/textproto.(*Reader).ReadMIMEHeader
   26.50MB  1.38% 77.63%    26.50MB  1.38%  context.WithValue

I tried the minimal example in the net/http docs and could not reproduce this. Its memory usage remained constant even after close to a million HTTP/2 requests, while my app goes OOM after a tiny fraction of that. Something else is going on.

In my case I'm routing with gorilla/mux and using justinas/alice for middleware handling. I'm going to try to put together a minimal example to reproduce the problem but it will take some time. At this point I do suspect the trouble might not be in the standard library after all...

@zemirco (Author) commented Nov 13, 2018

It looks like you're also using the gorilla packages. I know there is a memory leak in https://github.com/gorilla/context when it is combined with http.Request.WithContext. Even if you don't use github.com/gorilla/context directly, you might be pulling it in through other packages that have it as a dependency; in my case it's https://github.com/gorilla/sessions and https://github.com/boj/redistore. I tried everything from the docs: neither context.ClearHandler nor calling context.Clear() before creating a new context with context.WithValue() helps. Here is another article that mentions the same problem: https://leining.org/posts/gorilla-sessions-memory-leak/.
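A rough sketch of the pattern behind that leak, as I understand it (simplified; the middleware and handler below are made up for illustration, not the real gorilla code): gorilla/context keys its data by the *http.Request pointer, and WithContext returns a new request with a new pointer, so entries stored under a derived request are never removed by ClearHandler.

package main

import (
	"context"
	"log"
	"net/http"

	gcontext "github.com/gorilla/context"
)

type ctxKey struct{}

// withValue stands in for any middleware that derives a new request via
// WithContext; r2 is a different pointer from the r that ClearHandler saw.
func withValue(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r2 := r.WithContext(context.WithValue(r.Context(), ctxKey{}, "v"))
		next.ServeHTTP(w, r2)
	})
}

// handler stands in for code (e.g. a session store) that stores data in
// gorilla/context keyed by the request pointer it receives, i.e. r2.
func handler(w http.ResponseWriter, r *http.Request) {
	gcontext.Set(r, "session", "data")
}

func main() {
	// ClearHandler is outermost, so it only clears entries keyed by the
	// original pointer; whatever was stored under the derived r2 stays in
	// gorilla/context's map and keeps the request reachable.
	http.Handle("/", gcontext.ClearHandler(withValue(http.HandlerFunc(handler))))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

If that is what is happening, every request would leave one entry behind, which would match the steady growth in the graphs above.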

Maybe @elithrar could help?

There is a discussion about releasing new versions of the gorilla packages. The maintainers don't want to break compatibility with existing code, which is a good thing. By now we have Go modules, which could help release a new version without breaking old import paths.

@agnivade added this to the Unplanned milestone Nov 22, 2018
@zemirco (Author) commented Dec 19, 2018

We can close this issue. With commit gorilla/sessions@12bd476 the memory leak is gone. It had nothing to do with the standard library.

Here is the memory usage from the last 24 hours. As you can see, it remains flat after the update, which happened just before 6 pm.

[screenshot 2018-12-19 at 06 44 37: memory usage over the last 24 hours]

@zemirco closed this Dec 19, 2018
@kostix commented Dec 19, 2018

@zemirco I think it's worth updating your SO question, or maybe adding a comment to it.

(Glad you've got this issue sorted!)

@ravilr mentioned this issue May 9, 2019

@golang locked and limited conversation to collaborators Dec 19, 2019