Consul connection leak #3096

Closed
christoe opened this Issue Aug 21, 2017 · 9 comments

christoe commented Aug 21, 2017

What did you do?
We're running Prometheus 1.6.3 with Consul SD config.

What did you expect to see?
Prometheus running forever

What did you see instead? Under which circumstances?
We're quite rapidly running out of file descriptors. The leak comes in the form of connections to Consul (running locally, so connections to localhost:8500). Currently we need to restart several times per week due to fd exhaustion. We first saw it on 1.3, upgraded to 1.4 and 1.5.2 and still had the issue, and now we're running 1.6.3.

Here's node_filefd_allocated on the machine in question, rate(1h):
[graph: hourly rate of node_filefd_allocated on the affected machine]
On the graph above it's easy to see that the fd allocation rate goes up until it reaches the limit and causes Prometheus to stop scraping (the blanks in between the peaks). Older graphs with the Prometheus/node fd correlation are here: #1873 (comment)

We see some correlation between configuration reloads and the rate of allocation; frequent reloads seem to make the problem worse. We use consul-template to generate the Prometheus configuration file and have had issues with flapping services causing forced reloads. We've implemented better handling of flapping, but the rate at which new services appear in the environment still causes a couple of reloads per day, so we'll never get rid of the reloads entirely.

Environment

  • System information:

    Linux 3.10.0-327.28.3.el7.x86_64 x86_64

  • Prometheus version:

    prometheus, version 1.6.3 (branch: master, revision: c580b60)
    build user: root@a6410e65f5c7
    build date: 20170522-09:15:06
    go version: go1.8.1

brian-brazil commented Aug 21, 2017

Can you confirm that process_open_fds for Prometheus follows the same pattern?

christoe commented Aug 21, 2017

@brian-brazil: Yes. Here's the same graph with process_open_fds included:
[graph: same graph with the rate of process_open_fds included]

brian-brazil commented Aug 21, 2017

process_open_fds is a gauge; can you graph it without the rate?

christoe commented Aug 21, 2017

Here you go:
[graph: process_open_fds without rate()]

cstyan commented Oct 2, 2017

I've been able to reproduce this with a pretty basic setup: Prometheus and Consul running via docker-compose, Consul with no services registered, and Prometheus configured to scrape Consul and itself. I then ran curl -X POST <prom_URL>/-/reload every 5 minutes. For every Prometheus config reload, process_open_fds increases by 2 and doesn't seem to decrease at all.

Looking into the cause.
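To illustrate the mechanism outside Prometheus, here is a minimal standalone sketch (hypothetical code, not Prometheus source) that simulates config reloads: each iteration builds a fresh http.Client with its own Transport, makes one keep-alive request to an assumed local Consul agent on localhost:8500 (using the /v1/agent/self endpoint as an example), and keeps the old client referenced the way a stale watch goroutine would. On Linux the open fd count grows by roughly one per simulated reload:

// leakdemo.go - hypothetical reproduction sketch, not Prometheus code.
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// openFDs counts this process's open file descriptors (Linux only).
func openFDs() int {
    entries, _ := os.ReadDir("/proc/self/fd")
    return len(entries)
}

func main() {
    var leaked []*http.Client // stands in for references held by stale watch goroutines
    for i := 0; i < 20; i++ { // 20 simulated config reloads
        c := &http.Client{Transport: &http.Transport{}} // no Timeout, keep-alives enabled
        resp, err := c.Get("http://localhost:8500/v1/agent/self") // assumes a local Consul agent
        if err == nil {
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close() // connection returns to this client's idle pool; the fd stays open
        }
        leaked = append(leaked, c)
        fmt.Printf("reload %d: open fds = %d\n", i+1, openFDs())
    }
}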

cstyan commented Oct 3, 2017

@fabxc I think you were right; the http.Client not having a timeout set seems to be the issue. With the following diff:

diff --git a/discovery/consul/consul.go b/discovery/consul/consul.go
index fbb7808..3008b12 100644
--- a/discovery/consul/consul.go
+++ b/discovery/consul/consul.go
@@ -102,7 +102,7 @@ func NewDiscovery(conf *config.ConsulSDConfig, logger log.Logger) (*Discovery, e
                return nil, err
        }
        transport := &http.Transport{TLSClientConfig: tls}
-       wrapper := &http.Client{Transport: transport}
+       wrapper := &http.Client{Transport: transport, Timeout: 15 * time.Second}
 
        clientConf := &consul.Config{
                Address:    conf.Server,

and the previous curl command (curl -X POST <prom_URL>/-/reload) running every 5 seconds, I got a graph of process_open_fds that looked like this:
[graph: process_open_fds leveling off with the timeout applied]

So unlike before, where the number of open fds increased continuously, the graph shows the count stabilizing at some point: the fds opened via reloads and the fds closed periodically via the http.Client timeout balance out enough to keep the total from growing.
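For context, here is a rough, self-contained sketch of how a client like the one in the diff ends up wired into the Consul API client. This is an illustration assuming the github.com/hashicorp/consul/api package (whose Config has an HttpClient field), not a copy of the Prometheus source; the package and helper name are made up:

package consulsd // hypothetical package, for illustration only

import (
    "crypto/tls"
    "net/http"
    "time"

    consul "github.com/hashicorp/consul/api"
)

// newConsulClient builds a Consul API client whose HTTP calls are bounded by
// a 15s timeout, mirroring the diff above: even if a watch goroutine is
// abandoned, an in-flight request cannot hold a connection open forever.
func newConsulClient(server string, tlsConfig *tls.Config) (*consul.Client, error) {
    transport := &http.Transport{TLSClientConfig: tlsConfig}
    wrapper := &http.Client{
        Transport: transport,
        Timeout:   15 * time.Second,
    }
    return consul.NewClient(&consul.Config{
        Address:    server,
        HttpClient: wrapper,
    })
}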

I'm hoping someone might have insight into why, without the timeout, I was seeing the count of goroutines running consul.Run and consulService.watch go up. As far as I can tell, on config reload we properly tell the existing TargetSets for Consul to end via ts.cancel() in TargetManager.reload(). If the goroutines watching those TargetSets should be ending, shouldn't the consul.Client and http.Client be garbage collected? I'm wondering whether a timeout for the client we pass to Consul is the right solution, or whether that would just end up hiding the actual issue.
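One plausible explanation, shown here as a sketch rather than the actual watch code (assuming the github.com/hashicorp/consul/api package): the watch goroutine only notices cancellation between blocking Consul queries, and with no Timeout on the underlying http.Client a blocking query can hang indefinitely, pinning the goroutine and everything it references:

package consulsd // hypothetical package, for illustration only

import (
    "context"

    consul "github.com/hashicorp/consul/api"
)

// watchServices shows why ts.cancel() alone may not free the client: the
// blocking catalog query below has no way to observe ctx, so if it never
// returns (no http.Client Timeout), the goroutine, its consul.Client and the
// underlying connection are all kept alive.
func watchServices(ctx context.Context, client *consul.Client) {
    var lastIndex uint64
    for {
        services, meta, err := client.Catalog().Services(&consul.QueryOptions{
            WaitIndex: lastIndex, // long poll: blocks until the catalog changes
        })
        if err != nil {
            continue // a real watcher would back off here
        }
        lastIndex = meta.LastIndex
        _ = services // a real watcher would diff and emit target groups

        select {
        case <-ctx.Done():
            return // cancellation is only seen once the query has returned
        default:
        }
    }
}

If that is the mechanism, the 15s timeout bounds how long any single query can pin a stale client, which would explain the stabilized graph even if the stale goroutines themselves are the underlying issue.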

cstyan commented Oct 3, 2017

Note that disabling keep-alives for the same http.Client, i.e. transport := &http.Transport{TLSClientConfig: tls, DisableKeepAlives: true}, had no effect.

cstyan commented Oct 29, 2017

This can probably be marked as fixed.

cc: @brian-brazil @fabxc @christoe

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
