Too many open files (established connections to same nodes) #1873
Comments
@mberhault Thanks for the great report, and wow, that does look like a connection leak. Strange! Maybe one more thing that could help: could you grab the contents of …?
Done. And I also learned that you can now change the ulimits of running processes (see …). Please find the raw output attached; the biggest goroutine counts are for: …
With the increased ulimit, prometheus is responsive again and reachable at http://monitoring.gce.cockroachdb.com:9090, feel free to poke around.
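For reference, a minimal sketch of changing the limit of an already running process with prlimit from util-linux; the pidof lookup and the 8192 value are illustrative placeholders, not the numbers used in this thread.

```sh
# Raise the open-file limit (soft:hard) of the running process; needs root.
sudo prlimit --pid "$(pidof prometheus)" --nofile=8192:8192

# Verify that the new limit is in effect.
grep 'open files' /proc/"$(pidof prometheus)"/limits
```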
Any updates on this?
Wild guesses:
- This seems to be related to TLS connections. They are still leaking, but they come in bursts, irregularly but ~once per day on average. Goroutines and open fds leak at the same time: http://monitoring.gce.cockroachdb.com:9090/graph#%5B%7B%22range_input%22%3A%221w%22%2C%22end_input%22%3A%22%22%2C%22step_input%22%3A%22%22%2C%22stacked%22%3A%22%22%2C%22expr%22%3A%22go_goroutines%22%2C%22tab%22%3A0%7D%2C%7B%22range_input%22%3A%221w%22%2C%22end_input%22%3A%22%22%2C%22step_input%22%3A%22%22%2C%22stacked%22%3A%22%22%2C%22expr%22%3A%22process_open_fds%22%2C%22tab%22%3A0%7D%5D
- Even wilder guess: this happens if a TLS scrape times out, possibly because the timeout cancelation via the context is not propagated throughout the stack. (Works for plain HTTP, but not for HTTPS.)

Needs more investigation, though. @fabxc might be qualified.
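A quick way to watch both counters together on the affected host, sketched under the assumption that Prometheus listens on localhost:9090 without TLS/auth and exposes the usual /debug/pprof endpoints:

```sh
# Snapshot the two gauges that leak together.
curl -s http://localhost:9090/metrics | grep -E '^(go_goroutines|process_open_fds) '

# Full goroutine dump, useful for spotting scrape loops stuck in TLS handshakes.
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt
```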
beorn7 added the kind/bug and priority/P1 labels on Aug 22, 2016
Sorry for still not having a solution here. Everybody seems terribly busy... However, I have seen a slightly similar leak on one of our Prometheus servers using TLS for scraping, and I couldn't reproduce the leak with Prometheus 1.1.2+. Could you upgrade your server and see if the leak is still happening? (This is based on the theory that the leak happens somewhere in the HTTP-TLS stack, and that bug got fixed in the library.)
This is still happening on 1.1.3. Similar setup as before (insecure TLS); lsof lists a few hundred open FDs each for a few different nodes.
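A sketch of how such per-target counts can be collected; the pidof lookup is an assumption about the deployment, and the output columns are those of stock lsof.

```sh
# Count ESTABLISHED TCP connections per remote endpoint for the Prometheus process.
lsof -nP -a -p "$(pidof prometheus)" -iTCP -sTCP:ESTABLISHED \
  | awk 'NR>1 {split($9, a, "->"); print a[2]}' | sort | uniq -c | sort -rn
```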
I now can see it again on 1.2.1. This needs careful investigation… ;-/
beorn7 self-assigned this on Nov 2, 2016
And now I have upgraded to 1.2.2 (with Go 1.7.3 and a lot of vendoring updates, i.e. used libraries have changed quite a bit), and miraculously, the server only uses a third of the file handles and a lot less RAM. @mberhault at your convenience, could you try out 1.2.2, too?
After running 1.2.2 for a while, it has now stabilized at ~650 fds; 1.2.1 needed ~1100 in the same setup. So something has definitely been improved in either Go 1.7.3 or one of the other dependencies we vendor.
I've upgraded to 1.2.3. Will check back in after a little while.
mberhault referenced this issue on Nov 13, 2016: /debug/pprof endpoints return 404 when using -web.route-prefix #2183 (closed)
Thanks @mberhault for the investigation. By now, our own server has decided to eat more fds again. So while the various updates seemed to have reduced the baseline hunger for fds, something is still leaking.
yonkeltron commented on Dec 29, 2016
I can confirm that this still happens with 1.3.1 (build be47695), and I am not using any TLS functionality. Allocated file descriptors keep increasing steadily until Prometheus becomes unresponsive or gets restarted manually every 45 minutes or so. In the following plot (Grafana, running on the same host), you can see that allocated FDs steadily increase (fairly rapidly, as this shows the last hour). The steep drop-offs correspond to manual restarts. The data source is the official node exporter, querying for … When I run a …
When I check the logs, I see lots of errors which look like this: …
Some additional info: …
I stand ready to provide any additional information which could help with diagnostics or otherwise contribute positively. Thanks!
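To cross-check the exporter graph, the same number can be read straight from the kernel; the pidof lookup assumes a single Prometheus process on the host.

```sh
# Current count of open file descriptors for the Prometheus process,
# independent of any exporter metric.
ls /proc/"$(pidof prometheus)"/fd | wc -l
```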
yonkeltron commented on Dec 30, 2016
Some additional details: We just upgraded to Prometheus 1.4.1 (build 2a89e87) and the issue still occurs. However, we noticed that it only happens when the Grafana dashboard is open. When we close the dashboard, the increase ceases and the allocated count remains flat.
That sounds like Grafana establishes query connections at a faster rate than they can be answered, so connections pile up due to that? Maybe it does not cancel outstanding queries before sending a new refresh.
Make sure you're not running Grafana 4.0.0; it has a known issue in that regard.
Is this still happening?
korovkin commented on Feb 28, 2017
I am unable to start prometheus due to this error: …
korovkin commented on Feb 28, 2017
What is the right way to recover here?
You need to grant the process more file handles (with …). Your error message happens during crash recovery. This issue is about leaking file descriptors during normal operation, though.
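Two common ways of granting more file handles, sketched here; the 8192 value and the systemd unit name are illustrative assumptions, not recommendations from this thread.

```sh
# If Prometheus is launched from a shell, raise the soft limit in that shell first.
ulimit -n 8192

# If it runs under systemd, set the limit in the unit (or an override) instead:
#   [Service]
#   LimitNOFILE=8192
# then reload and restart:
#   systemctl daemon-reload && systemctl restart prometheus
```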
korovkin commented on Mar 1, 2017
Understood. What is the advised limit to set?
Prometheus does not use an excessive number of files (barring this bug, where some people occasionally see difficult-to-reproduce fd leaks). Crash recovery might be a bit more demanding, but I would assume you are running with a very low file limit and/or a very large Prometheus data store.
Would you mind sharing the output of …?
Unfortunately, during crash recovery, …
korovkin commented on Mar 2, 2017
So apparently I am running a pretty old version (apt-get on Ubuntu): …
I have placed aside the data that caused the issue: …
Right now, after I cleaned the old data set: …
Please let me know what else I can provide.
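For the record, a quick way to confirm which version a distro package actually installed; the single-dash flag spelling is the one used by the Prometheus 1.x binaries, and the package name is an assumption.

```sh
# Version of the binary on PATH.
prometheus -version

# What the Debian/Ubuntu package manager thinks is installed.
dpkg -l | grep -i prometheus
```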
Definitely upgrade to the newest version. Crash recovery might indeed open too many files in the old version, but many bugs have been fixed in the meantime.
korovkin commented on Mar 2, 2017
Got it, thank you very much for taking a look!
korovkin referenced this issue on Mar 6, 2017: Crash recovery should deal better with corrupt data for individual series in checkpoint file #2475 (closed)
korovkin commented on Mar 6, 2017
Please take a look at a related issue: …
mjansing commented on Mar 20, 2017
Any updates on this? I'm also running into this issue with static_config and a custom tls_config ca-file.
mjansing commented on Mar 28, 2017
I tried to dig a bit more into this issue. I use Prometheus 1.5.2 to scrape five hosts. Each host serves node_exporter metrics behind an nginx reverse proxy using a self-signed certificate (protected via basic auth). After about two days Prometheus becomes unresponsive once it reaches the max open fd limit: …
It looks like Prometheus keeps connections to some of these hosts open until the fd limit is reached: …
I often get noisy alerts because Prometheus sometimes cannot scrape data from these hosts, and then I see …
Nginx error logs from the scrape target (log level switched to debug) during a curl request: …
So there is a problem during the SSL handshake, and it is specific to my Prometheus hosts (Ubuntu 16.04). I don't have any problems when I run a local Prometheus instance or use curl on my local dev machine (macOS). To me it looks like Prometheus keeps connections open if one of these strange SSL handshake errors occurs. The handshake problem during the curl request seems to be a GnuTLS problem when the target sits behind a reverse proxy: http://askubuntu.com/questions/186847/error-gnutls-handshake-failed-when-connecting-to-https-servers. So maybe there are two problems: …
Could this be a problem in Prometheus' underlying HTTP/TLS stack, like the GnuTLS one? Hope my small investigation helps.
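To separate a Go/Prometheus problem from a GnuTLS/curl one, the handshake can be exercised from the Prometheus host with a different TLS stack; target.example.com and user:pass are placeholders for one of the nginx-proxied, basic-auth-protected hosts described above.

```sh
# Handshake via OpenSSL rather than the GnuTLS-linked curl shipped on Ubuntu.
openssl s_client -connect target.example.com:443 -servername target.example.com </dev/null

# One scrape with certificate verification disabled, mirroring an insecure tls_config.
curl -skf -u user:pass https://target.example.com/metrics >/dev/null && echo "scrape ok"
```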
Thanks @mjansing, this is very helpful research. AFAIK, Prometheus is currently not supposed to use persistent connections a.k.a. "keep-alive" (not even via TLS, where the handshake is much more expensive and persistent connections are thus more common). Cf. #2498 for more discussion of the general topic. However, here it looks like Prometheus keeps a TLS connection open if it doesn't end in a regularly completed scrape but in a timeout. Perhaps the Go HTTP stack uses persistent connections by default with TLS, and our code somehow doesn't honor that and opens a completely new connection for the next scrape rather than reusing the still open one.

About the issue of becoming unresponsive when using up all allowed FDs: I wouldn't consider this a bug. Prometheus is opening FDs all the time (especially the embedded LevelDB). If that's not possible, all kinds of things break (and in the case of LevelDB not only in our own code but also in used libraries).
mjansing commented on Apr 3, 2017
@beorn7: I was able to solve my problem. A "faulty" (!?!) network configuration (and not Prometheus itself) led to our issues. Our Ubuntu-based Prometheus host had an MTU value of 64000. After setting the MTU manually to 1500 via a post-up hook, all problems went away and everything works fine: …
Maybe this also helps @mberhault, @yonkeltron and @korovkin.
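For anyone hitting the same thing, checking and resetting the MTU looks roughly like this; eth0 is a placeholder for the actual interface, and the post-up line mirrors the hook described above.

```sh
# Inspect the current MTU and reset it to the standard 1500 at runtime.
ip link show eth0 | grep -o 'mtu [0-9]*'
sudo ip link set dev eth0 mtu 1500

# On ifupdown-based Ubuntu setups, persist it with a post-up hook in
# /etc/network/interfaces:
#   post-up /sbin/ip link set dev eth0 mtu 1500
```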
carlpett commented on Apr 22, 2017
We're seeing this as well, though in our case the leak comes in the form of connections to Consul (running locally, so connections to localhost:8500). Currently we need to restart due to fd exhaustion once or twice per week. We first saw it on 1.3, upgraded to 1.4 and still had the issue, and now we're running 1.5.2. Here's … We see some correlation between configuration reloads and the rate of allocation; frequent reloads seem to make the problem worse. We use consul-template to generate the Prometheus configuration file, and have had issues with flapping services causing forced reloads. We've implemented some better handling of flapping, but just the rate of new services appearing in the environment causes a couple of reloads per day, so we'll never get rid of the reloads entirely. Any other data I can share that would be enlightening?
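A sketch of how the Consul-side leak can be quantified; localhost:8500 is the default Consul HTTP port mentioned above, and the pidof lookup is an assumption about the deployment.

```sh
# Count connections from the Prometheus process to the local Consul agent.
lsof -nP -a -p "$(pidof prometheus)" -iTCP@127.0.0.1:8500 | tail -n +2 | wc -l
```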
There are several different issues being discussed here, so it's very hard to follow what is and isn't fixed. If there are remaining issues, please open new separate bugs for them.
brian-brazil closed this on Jul 14, 2017
lock bot commented on Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
mberhault commented on Aug 7, 2016 (edited)

What did you do?
Left prometheus running against ~20 targets using DNS discovery.
What did you expect to see?
Happy prometheus.
What did you see instead? Under which circumstances?
Unhappy prometheus: after about 2 weeks of prometheus running just fine, prometheus is unresponsive on its web port and is logging lots of "too many open files" errors.
Environment
GCE debian instance with 4vCPUs, 15GiB ram.
(probably not relevant, but it is running, so here it is anyway)
My prometheus config is as follows: … The salient details are dns_sd_config using 'A' records, and tls_config with insecure-skip-verify. The DNS records have not changed since prometheus started.
Please find the entire prometheus log attached:
prometheus.stderr.gz
Additional details
A quick lsof shows many established connections along the lines of: …
Counting by target address, I get the following counts: …
Those addresses match targets that have been flapping for a while, with sometimes hours of down-time between restarts.
netstat confirms the connections are established: …
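A roughly equivalent check with ss from iproute2, counting established connections per remote peer for the prometheus process; the original netstat output itself is not preserved above.

```sh
# Established TCP connections per peer, limited to the prometheus process.
ss -tnp | grep ESTAB | grep '"prometheus"' | awk '{print $5}' | sort | uniq -c | sort -rn
```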
Further debugging
I am currently leaving prometheus running in case there is some extra debugging information I can provide. Please let me know.