Regression in open file descriptors between 742efb453 and 6664b77f (file_sd_configs) #1036
Comments
How often are you reloading (not restarting) your Prometheus server? A description of your file SD setup would also help a lot.
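For clarity, a reload here means sending SIGHUP to the running process so Prometheus re-reads its configuration without restarting. A minimal sketch, assuming exactly one prometheus process on the host:

```sh
# Reload (not restart): SIGHUP makes Prometheus re-read its configuration.
kill -HUP "$(pgrep -o prometheus)"
```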
If your current file SD setup works in 0.15.1, chances are that this is also fixed by #1040.
We have #1040 running in our test environment and will report back after it's soaked for a while. We send a HUP every time a change is made to our service metadata (usually additions); that happens fewer than 10 times a day. We generate the individual yaml files (the ones that contain the target-specific IP addresses) around every 45 seconds. We should actually start generating them only when they change (versus every time regardless). Thanks!
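Since the per-target YAML files are currently rewritten every ~45 seconds whether or not anything changed, a small wrapper like the one below would only touch the watched file when the generated content differs, and would replace it atomically. This is only a sketch: `generate-targets` and the output path are made-up placeholders.

```sh
#!/bin/sh
set -e
out=/etc/prometheus/targets/app.yml   # made-up path to a file_sd target file
tmp=$(mktemp "$out.XXXXXX")           # temp file in the same directory so mv is atomic
generate-targets > "$tmp"             # placeholder for whatever produces the YAML target list
if cmp -s "$tmp" "$out"; then
  rm -f "$tmp"                        # nothing changed: leave the watched file alone
else
  chmod 644 "$tmp"                    # mktemp creates 0600; make it readable for Prometheus
  mv "$tmp" "$out"                    # changed: atomic replace, one clean inotify event
fi
```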
@jmcfarlane On a server where this happens, could you run:

for pid in $(pgrep prometheus | head -n1); do ls -l /proc/$pid/fd; done

Or, if that contains too much sensitive info:

for pid in $(pgrep prometheus | head -n1); do ls -l /proc/$pid/fd | grep anon_inode; done

I want to verify whether the leaked FDs are really inotify FDs or something else (like piling-up timing-out target connections). The only leak I was able to reproduce locally so far (including SIGHUP-ing and many small target group files) was with many timing-out targets that would be replaced by the SD constantly, but each old one took a while to get its scraper shut down because the HTTP timeout for its currently running scrape needed to finish first. On a server that has this problem, how many targets do you see on its status page, and do they have any unusual health states indicated there?
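If the full listing is too large to share, a rough way to boil it down is to tally the open FDs by what they point at (a sketch; assumes a single prometheus process):

```sh
pid=$(pgrep -o prometheus)
# Group FD targets into sockets, anon inodes (inotify, epoll, ...), pipes, and plain files.
ls -l "/proc/$pid/fd" | tail -n +2 \
  | awk '{print $NF}' \
  | sed -e 's/:\[.*//' -e 's#^/.*#file#' \
  | sort | uniq -c | sort -rn
```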
The output is probably too much to paste here (1.2 MB). Here's some summary info:
The number of yaml files being watched by inotify (some of them contain empty target lists):
Endpoints described by sd_name/inotify:
Total target listing (aka including the static targets):
Total number of hosts being polled by prometheus:
The global scrape interval is 15s:
But the node and app metrics are defined with a 5s scrape interval. The server itself is an m3.large and the EBS volume holding the metrics is currently 777 GB. Our production environment has more services, more hosts, and larger disk usage (and we don't see this issue there). Let me know what else I can provide; I know you guys are in a different time zone than me, so I'm trying to provide a good smattering of detail up front. Thanks again for all the help!
Wow, thanks for that level of detail! So while you see only 187 inotify FDs and 146 targets on the status page, there are 15082 open sockets. It looks like the problem isn't with inotify / file SD, but with network connections (likely scrapes) leaking. To find out more about those open connections, the following would be interesting:
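Judging from the follow-up (netstat output and a ~12 MB dump from a debug endpoint), the requested information can be gathered roughly like this. This is a sketch: the port and the pprof path assume the standard Go net/http/pprof handler is mounted at /debug/pprof on the default 9090 port.

```sh
# Which remote endpoints do the open TCP connections belong to, and in which state are they?
netstat -tanp 2>/dev/null | grep prometheus | awk '{print $5, $6}' | sort | uniq -c | sort -rn | head -20

# Full goroutine dump (one stack trace per goroutine) from the debug endpoint.
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt
```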
Lots of connections shown via netstat:
The debug endpoint returned 12 megs of data:
Definitely looks like it's managing to hold onto connections :)
Are you using the default 10s scrape timeout?
Those details are a few comments up, but the global scrape_interval is 15s.
Okay, so you're having some timeouts anyway. Could you post the full stack trace of one of the most common of those 47828 goroutines?
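Assuming the goroutine dump was saved to a file (goroutines.txt as sketched earlier), one way to pull out one complete stack of the dominant kind is to print the first blank-line-separated block that matches a pattern; the pattern below is only a placeholder to substitute with whatever function shows up most often:

```sh
PATTERN='net/http'   # placeholder pattern
# Paragraph mode (RS=''): each goroutine's stack is one record; print the first match in full.
awk -v RS='' -v pat="$PATTERN" '$0 ~ pat {print; exit}' goroutines.txt
```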
A few of these:
Lots of these:
Bunches of these:
Here's a rough summary of what's referenced the most:
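A summary like that can be produced by counting each goroutine's top stack frame, i.e. the line right after every `goroutine N [state]:` header (a sketch, again assuming the dump sits in goroutines.txt):

```sh
# Drop the "goroutine N [...]:" headers and grep's "--" group separators, strip the
# argument list from each top frame, then count which call sites dominate.
grep -A1 '^goroutine [0-9]' goroutines.txt \
  | grep -v -e '^goroutine [0-9]' -e '^--$' \
  | sed 's/([^)]*)$//' \
  | sort | uniq -c | sort -rn | head -20
```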
Looks like the two goroutines spawned at the end of
@brian-brazil Oh, interesting observation. That could be it. @jmcfarlane Could you try running this branch and see if the problem persists? https://github.com/prometheus/prometheus/tree/close-connections
Thanks for your patience and help in getting this debugged!
Sure thing, I'll get it loaded up now. Will provide data in the morning after it's had some hours to soak.
So far the rate of increase looks about the same:
Will see in the morning after it's had time to settle (8k was the previous ulimit, which we currently do not reach or exceed in production).
That's probably not the problem then, as we'd expect to see ~1k file descriptors based on previous information.
Hmm, maybe we need to do the same in
Right, it makes much more sense to close them there (and only there), since every scraper will be shut down before being replaced, but not every shutdown is followed by a replacement. Updated the branch: https://github.com/prometheus/prometheus/commits/close-connections
Still done in a hurry, but if this doesn't fix it, we'll investigate more properly in the upcoming week.
The growth continued; I'm pulling the updated branch now. Regarding the reload question earlier, we reload whenever service metadata changes. It so happens that no metadata changed overnight, so this last period saw zero config reloads.
Indeed, after 7 hours f140761 shows the same rate of increase.
@brian-brazil @juliusv No. Our transport is initialized to not keep connections alive, and the underlying TCP connection times out after the configured
Also, not all our round trippers are
@jmcfarlane, are you scraping with basic auth or bearer tokens?
dan-cleinmark commented Aug 31, 2015
@fabxc, I'll jump in since @jmcfarlane is on PST. We're not using basic auth or bearer tokens.
Sorry I was so busy today. We're going to revert the binary in our test env back to 742efb4. The idea is to re-verify the assertion that the binary is what introduced this behavior (versus some other variable as yet unaccounted for).
I can report back again in the morning, but the result is already clear. I stopped prometheus and copied the 742efb4 binary into
I'm confident that last line will continue to remain flat (as it does in prod and 2 other environments). As surprising as it seems, the data suggests there is some sort of regression inside prometheus (i.e., outside of our configuration or usage). Has anyone else seen behavior like this (would be really interesting if it's isolated to us)?
@jmcfarlane @dan-cleinmark I've tried again to reproduce this with different combinations of target reachability (reachable, timing out, or immediate connection refusal), replacing them via the file SD frequently, and I haven't been able to reproduce this at HEAD or at 6664b77 yet. I'm on Ubuntu Linux 14.04.3 LTS, by the way. The only thing I've managed to do is increase the number of open FDs by the number of targets which are currently timing out (obviously), but that number also stays stable across target reloads and so on. So unless we manage to reproduce this on our end somehow, could you try pinning this down to the first commit that introduced this issue (e.g.
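The commit hunt that follows is a textbook fit for git bisect. A rough sketch of that workflow (the build command and the deploy-and-observe step are assumptions, not a record of what was actually run):

```sh
git bisect start
git bisect bad  6664b77     # first revision observed leaking file descriptors
git bisect good 742efb453   # last revision known to stay flat

# At each step git checks out a candidate commit: build it, run it against the
# real config for a while, and watch whether the open-FD count keeps climbing.
make build                  # assumption: however the binary is normally built
git bisect good             # or: git bisect bad, depending on what was observed
# Repeat until git names the first bad commit, then:
git bisect reset
```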
I'll bisect it today; maybe we'll get lucky :)
\o/ Thanks!
I'm finished with the bisect. Here's a graph of the open file descriptors during the attempt:
I'm looking through the diff now, but it appears this started here:
Here was the actual bisect itself:
Wait, I think I started with the wrong commit ^...
Indeed I used the wrong hash with the bisect. I tried again with:
I'm looking at these diffs now. The attempts were tracked here, where the green marks indicate the commits that did not exhibit the descriptor leak:
From the perspective of this issue, the last known "good build" is 7a6d12a (it's running right now and is not leaking).
11a577f would be reasonable, as it touches the ingestion code.
@jmcfarlane Finally, how did you build Prometheus? Via
I answered my question myself by looking at your bisect log (it mentions
We still have trouble reproducing this at all, unfortunately.
@jmcfarlane Also, #1070 looks very much like it could have been the cause of your FD leak. Want to give that a try too? (or just the PR that @fabxc mentioned above, since it will include that too once it's done).
Will do; I'm at home with a fever, so probably not today though :/
Aww, get better soon!
This one is closed via #1070 as far as I am concerned.
fabxc closed this Sep 21, 2015
jmcfarlane commented Aug 27, 2015
After upgrading to 6664b77 in our test environment we noticed an ever-increasing (linear) growth of open file descriptors:
https://s3.amazonaws.com/uploads.hipchat.com/15035/57761/dgmeVFGbiAHueDV/Screenshot%20from%202015-08-27%2013%3A23%3A37.png
Any information we could provide to help troubleshoot this?
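For tracking the symptom itself, the FD count can be watched from the outside with a small loop like the one below, or graphed from Prometheus's own metrics (the process_open_fds series on its /metrics endpoint), assuming the server scrapes itself. A minimal outside-view sketch:

```sh
# Log a timestamped open-FD count once a minute; linear growth (or flatness
# after a fix) shows up immediately when the numbers are plotted.
pid=$(pgrep -o prometheus)
while sleep 60; do
  printf '%s %s\n' "$(date -u +%FT%TZ)" "$(ls "/proc/$pid/fd" | wc -l)"
done
```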