Prometheus can leak goroutines when scraping targets over unreliable connection #5394

Closed
BenoitKnecht opened this Issue Mar 21, 2019 · 6 comments

BenoitKnecht commented Mar 21, 2019

Bug Report

I noticed that my Prometheus server is leaking goroutines, and that the goroutine count correlates with the number of established TCP connections:

[graph: goroutines and established TCP connections over time]

The TCP connections in question all go to the same host, which is on the other end of a pretty unreliable VPN tunnel.

This is the top of the goroutine profile for this Prometheus server:

goroutine profile: total 6062
1638 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6c7bb7 0x6c948d 0x6d374e 0x45c0c1
#       0x6c7bb6        net/http.(*persistConn).addTLS+0x1a6            /usr/local/go/src/net/http/transport.go:1177
#       0x6c948c        net/http.(*Transport).dialConn+0x15ac           /usr/local/go/src/net/http/transport.go:1250
#       0x6d374d        net/http.(*Transport).getConn.func4+0x6d        /usr/local/go/src/net/http/transport.go:999

1637 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x60a2e9 0x60a7fd 0x60ce0a 0x60fa6a 0x60f757 0x60e481 0x6d3852 0x45c0c1
#       0x428aa5        internal/poll.runtime_pollWait+0x65                     /usr/local/go/src/runtime/netpoll.go:173
#       0x4963f9        internal/poll.(*pollDesc).wait+0x99                     /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
#       0x49650c        internal/poll.(*pollDesc).waitRead+0x3c                 /usr/local/go/src/internal/poll/fd_poll_runtime.go:90
#       0x497258        internal/poll.(*FD).Read+0x178                          /usr/local/go/src/internal/poll/fd_unix.go:169
#       0x534b1e        net.(*netFD).Read+0x4e                                  /usr/local/go/src/net/fd_unix.go:202
#       0x548297        net.(*conn).Read+0x67                                   /usr/local/go/src/net/net.go:177
#       0x60a2e8        crypto/tls.(*block).readFromUntil+0x88                  /usr/local/go/src/crypto/tls/conn.go:492
#       0x60a7fc        crypto/tls.(*Conn).readRecord+0xdc                      /usr/local/go/src/crypto/tls/conn.go:593
#       0x60ce09        crypto/tls.(*Conn).readHandshake+0x99                   /usr/local/go/src/crypto/tls/conn.go:955
#       0x60fa69        crypto/tls.(*clientHandshakeState).handshake+0xa9       /usr/local/go/src/crypto/tls/handshake_client.go:191
#       0x60f756        crypto/tls.(*Conn).clientHandshake+0x396                /usr/local/go/src/crypto/tls/handshake_client.go:168
#       0x60e480        crypto/tls.(*Conn).Handshake+0xf0                       /usr/local/go/src/crypto/tls/conn.go:1272
#       0x6d3851        net/http.(*persistConn).addTLS.func2+0x41               /usr/local/go/src/net/http/transport.go:1171

1636 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6d354a 0x45c0c1
#       0x6d3549        net/http.(*Transport).getConn.func2.1+0x49      /usr/local/go/src/net/http/transport.go:965
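
For reference, the profile above comes straight from the pprof endpoint that Prometheus exposes under /debug/pprof/. A minimal Go sketch for fetching it (the listen address is an assumption; adjust it to your deployment):

package main

import (
        "io"
        "net/http"
        "os"
)

func main() {
        // Fetch the symbolized goroutine profile (?debug=1) from a running
        // Prometheus server; localhost:9090 is just the default listen address.
        resp, err := http.Get("http://localhost:9090/debug/pprof/goroutine?debug=1")
        if err != nil {
                panic(err)
        }
        defer resp.Body.Close()
        io.Copy(os.Stdout, resp.Body)
}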

The relevant part of net/http.(*persistConn).addTLS() is:

var timer *time.Timer // for canceling TLS handshake
if d := pconn.t.TLSHandshakeTimeout; d != 0 {
        timer = time.AfterFunc(d, func() {
                errc <- tlsHandshakeTimeoutError{}
        })      
}
go func() {
        if trace != nil && trace.TLSHandshakeStart != nil {
                trace.TLSHandshakeStart()
        }       
        err := tlsConn.Handshake()
        if timer != nil {
                timer.Stop()
        }       
        errc <- err
}()
if err := <-errc; err != nil {
        plainConn.Close()
        if trace != nil && trace.TLSHandshakeDone != nil {
                trace.TLSHandshakeDone(tls.ConnectionState{}, err)
        }       
        return err
}

As you can see, if TLSHandshakeTimeout is zero no timer is ever armed, so when the handshake never completes (for instance because the connection silently drops mid-handshake), both the handshake goroutine and the goroutine waiting on errc hang forever. And in Prometheus, the HTTP client's transport comes from github.com/prometheus/common/config.NewRoundTripperFromConfig(), where TLSHandshakeTimeout isn't explicitly set, so it defaults to zero:

var rt http.RoundTripper = &http.Transport{
        Proxy:               http.ProxyURL(cfg.ProxyURL.URL),
        MaxIdleConns:        20000,
        MaxIdleConnsPerHost: 1000, // see https://github.com/golang/go/issues/13801
        DisableKeepAlives:   false,
        TLSClientConfig:     tlsConfig,
        DisableCompression:  true,
        // 5 minutes is typically above the maximum sane scrape interval. So we can
        // use keepalive for all configurations. 
        IdleConnTimeout: 5 * time.Minute,
        DialContext: conntrack.NewDialContextFunc(
                conntrack.DialWithTracing(),
                conntrack.DialWithName(name),
        ),
}
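
For what it's worth, the leak is easy to reproduce outside of Prometheus: point a client whose transport has TLSHandshakeTimeout left at zero at a server that accepts TCP connections but never answers the TLS handshake (which is effectively what the scrape target behind the flaky VPN looks like), and cancel each request after a short timeout, the way a scrape would. A self-contained sketch (addresses, paths and timings are made up):

package main

import (
        "context"
        "crypto/tls"
        "fmt"
        "net"
        "net/http"
        "runtime"
        "time"
)

func main() {
        // "Black hole" server: accepts TCP connections and then stays silent,
        // so the client's TLS handshake never gets a single byte back.
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
                panic(err)
        }
        go func() {
                var held []net.Conn
                for {
                        conn, err := ln.Accept()
                        if err != nil {
                                return
                        }
                        held = append(held, conn) // keep every connection open forever
                }
        }()

        client := &http.Client{
                Transport: &http.Transport{
                        TLSClientConfig:     &tls.Config{InsecureSkipVerify: true},
                        TLSHandshakeTimeout: 0, // the problematic default
                },
        }

        for i := 0; i < 10; i++ {
                // Each iteration stands in for one scrape that gives up after its
                // scrape timeout; the transport's dial and handshake keep running.
                ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
                req, _ := http.NewRequest("GET", "https://"+ln.Addr().String()+"/metrics", nil)
                _, err := client.Do(req.WithContext(ctx))
                cancel()
                fmt.Printf("scrape %d: %v, goroutines: %d\n", i, err, runtime.NumGoroutine())
        }
        // The goroutine count keeps growing and never comes back down; the leaked
        // goroutines sit in the same addTLS / Handshake frames as in the profile above.
}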

I tried setting TLSHandshakeTimeout to 10 seconds, and ran the patched and unpatched versions of Prometheus with the same configuration.

Here are the goroutine and TCP session graphs for the patched version:

[graph: goroutines and established TCP sessions over time, patched version]

and for the unpatched version:

[graph: goroutines and established TCP sessions over time, unpatched version]

And here is the top of the goroutine profile for the patched version:

goroutine profile: total 1227
566 @ 0x42e0cb 0x43e0f6 0x17b0172 0x45c0c1
#	0x17b0171	github.com/prometheus/prometheus/scrape.(*scrapeLoop).run+0x821	/home/knecht/dev/git/prometheus/prometheus/scrape/scrape.go:886

219 @ 0x42e0cb 0x43e0f6 0x6cc2d3 0x45c0c1
#	0x6cc2d2	net/http.(*persistConn).writeLoop+0x112	/usr/lib/go-1.11/src/net/http/transport.go:1885

143 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x6ca1c5 0x5b4dcf 0x5b4f2f 0x6cad12 0x45c0c1
#	0x428aa5	internal/poll.runtime_pollWait+0x65	/usr/lib/go-1.11/src/runtime/netpoll.go:173
#	0x4963f9	internal/poll.(*pollDesc).wait+0x99	/usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:85
#	0x49650c	internal/poll.(*pollDesc).waitRead+0x3c	/usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:90
#	0x497258	internal/poll.(*FD).Read+0x178		/usr/lib/go-1.11/src/internal/poll/fd_unix.go:169
#	0x534b1e	net.(*netFD).Read+0x4e			/usr/lib/go-1.11/src/net/fd_unix.go:202
#	0x548297	net.(*conn).Read+0x67			/usr/lib/go-1.11/src/net/net.go:177
#	0x6ca1c4	net/http.(*persistConn).Read+0x74	/usr/lib/go-1.11/src/net/http/transport.go:1497
#	0x5b4dce	bufio.(*Reader).fill+0x10e		/usr/lib/go-1.11/src/bufio/bufio.go:100
#	0x5b4f2e	bufio.(*Reader).Peek+0x3e		/usr/lib/go-1.11/src/bufio/bufio.go:132
#	0x6cad11	net/http.(*persistConn).readLoop+0x1a1	/usr/lib/go-1.11/src/net/http/transport.go:1645

I'm going to submit a PR for github.com/prometheus/common to set TLSHandshakeTimeout to the same value as net/http.DefaultTransport.
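
Concretely, the fix is a one-field change to the transport shown above, giving the handshake the same 10-second timeout as http.DefaultTransport. Roughly (this repeats the snippet from NewRoundTripperFromConfig(), so cfg, tlsConfig, name and conntrack come from the surrounding function; the exact change is in the PR):

var rt http.RoundTripper = &http.Transport{
        Proxy:               http.ProxyURL(cfg.ProxyURL.URL),
        MaxIdleConns:        20000,
        MaxIdleConnsPerHost: 1000, // see https://github.com/golang/go/issues/13801
        DisableKeepAlives:   false,
        TLSClientConfig:     tlsConfig,
        DisableCompression:  true,
        // 5 minutes is typically above the maximum sane scrape interval. So we can
        // use keepalive for all configurations.
        IdleConnTimeout: 5 * time.Minute,
        // New: same value as http.DefaultTransport, so a handshake that never
        // completes is abandoned after 10 seconds instead of hanging forever.
        TLSHandshakeTimeout: 10 * time.Second,
        DialContext: conntrack.NewDialContextFunc(
                conntrack.DialWithTracing(),
                conntrack.DialWithName(name),
        ),
}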

Other Prometheus components may be affected. I've seen the same issue in the blackbox_exporter at least.

Environment

  • System information:
Linux 4.15.0-30-generic x86_64
  • Prometheus version:
prometheus, version 2.8.0 (branch: HEAD, revision: 59369491cfdfe8dcb325723d6d28a837887a07b9)
  build user:       root@4c4d5c29b71f
  build date:       20190312-07:46:58
  go version:       go1.11.5

brian-brazil commented Mar 21, 2019

Great, we've been trying to track this down for a long time!

BenoitKnecht added a commit to BenoitKnecht/common that referenced this issue Mar 21, 2019

config: set TLSHandshakeTimeout in HTTP transport
If omitted (i.e. zero), TLS handshakes never timeout, which means TCP
connections (and their associated goroutines) can stay alive forever.

Also set `ExpectContinueTimeout`, so that both have the same value as in
`http.DefaultTransport`.

See prometheus/prometheus#5394.

Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>

brian-brazil added a commit to prometheus/common that referenced this issue Mar 21, 2019

config: set TLSHandshakeTimeout in HTTP transport (#179)
If omitted (i.e. zero), TLS handshakes never timeout, which means TCP
connections (and their associated goroutines) can stay alive forever.

Also set `ExpectContinueTimeout`, so that both have the same value as in
`http.DefaultTransport`.

See prometheus/prometheus#5394.

Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>

cstyan commented Apr 9, 2019

seems fixed in prometheus/common#179?


simonpasquier commented Apr 9, 2019

This needs a new tag for prometheus/common and an update of the dependency in this repo.

brian-brazil commented Apr 9, 2019

BenoitKnecht commented Apr 10, 2019

As I mentioned, the blackbox_exporter can run into the same issue and should also be made to depend on the updated prometheus/common version.

Should I open an issue for that in prometheus/blackbox_exporter, or will you go through all the Prometheus projects and update that dependency anyway?

brian-brazil commented Apr 10, 2019

Blackbox exporter is more complicated to do, as we can't use the current version of common.
