
Regression in open file descriptors between 742efb453 and 6664b77f (file_sd_configs) #1036

Closed
jmcfarlane opened this Issue Aug 27, 2015 · 40 comments

@jmcfarlane commented Aug 27, 2015

After upgrading to 6664b77 in our test environment, we noticed an ever-increasing (linear) growth in the number of open file descriptors:

https://s3.amazonaws.com/uploads.hipchat.com/15035/57761/dgmeVFGbiAHueDV/Screenshot%20from%202015-08-27%2013%3A23%3A37.png

Any information we could provide to help troubleshoot this?

@fabxc (Member) commented Aug 28, 2015

How often are you reloading (not restarting) your Prometheus server?

A description of how your file SD setup works would also help a lot.

@fabxc (Member) commented Aug 28, 2015

If your current file SD setup works in 0.15.1, chances are that this is also fixed by #1040.
Could you check that?

@jmcfarlane (Author) commented Aug 28, 2015

We have #1040 running in our test environment and will report back after it has soaked for a while. We send a HUP every time a change is made to our service metadata (usually additions), which is on the order of fewer than 10 times a day. We regenerate the individual YAML files (the ones that contain the target-specific IP addresses) around every 45 seconds. We should really only regenerate them when they change (versus every time, regardless).

Thanks!

@jmcfarlane (Author) commented Aug 28, 2015

Unfortunately, #1040 does not seem to have affected/improved the slope of the increase:

https://s3.amazonaws.com/uploads.hipchat.com/15035/57761/2yfpnRfkWqHZUuO/Screenshot%20from%202015-08-28%2012%3A37%3A46.png

We're running 742efb4 in staging and production and do not see this behavior.

@juliusv (Member) commented Aug 29, 2015

@jmcfarlane On a server where this happens, could you run:

for pid in $(pgrep prometheus | head -n1); do ls -l /proc/$pid/fd; done

Or if that contains too much sensitive info:

for pid in $(pgrep prometheus | head -n1); do ls -l /proc/$pid/fd | grep anon_inode; done

I want to verify whether the leaked FDs are really inotify FDs or something else (such as connections to timing-out targets piling up).

The only leak I was able to reproduce locally so far (including SIGHUP-ing and many small target group files) was with many timing-out targets that would be replaced by the SD constantly, but each old one took a while to get its scraper shut down because the HTTP timeout for its currently running scrape needed to finish first.

On a server that has this problem, how many targets do you see on its status page, and do they have any unusual health states indicated there?

@jmcfarlane (Author) commented Aug 29, 2015

The output is probably too much to paste here (1.2 MB). Here's some summary info:

$ cat prom.out | grep -v socket | wc -l
803
$ cat prom.out | grep socket | wc -l
15082
$ cat prom.out | grep -v socket | wc -l
803
$ cat prom.out | grep '.ldb' | wc -l
33
$ cat prom.out | grep inotify | wc -l
187
$ wc -l prom.out 
15885 prom.out

The number of YAML files being watched by inotify (some of them contain empty target lists):

$ ls -l /etc/prometheus/sd/ | wc -l
195

Endpoints described by sd_name/inotify:

  • Number of unique hosts being polled by prometheus: 105
  • Number of tcp endpoints being polled by prometheus: 216 (node + jetty endpoint)

Total target listing (aka including the static targets):

$ curl -s http://...:9090/ | grep job_header | wc -l
111

Total number of hosts being polled by prometheus:

$ curl -s http://...:9090/ | grep state_indicator | wc -l
146

The global scrape interval is 15s:

global:
  evaluation_interval: 15s
  labels: {env: qa1}
  scrape_interval: 15s

But the node and app metrics are defined with a 5s scrape interval. The server itself is an m3.large, and the EBS volume holding the metrics is currently 777 GB. Our production environment has more services, more hosts, and larger disk usage (and we don't see this issue there).

Lemme know what else I can provide. I know you guys are in a different time zone than me, so I'm trying to provide a smattering of detail up front. Thanks again for all the help!

@juliusv (Member) commented Aug 29, 2015

Wow, thanks for that level of detail! So while you see only 187 inotify FDs and 146 targets on the status page, there are 15082 open sockets. It looks like the problem isn't with inotify / file SD, but with network connections (likely scrapes) leaking.

To find out more about those open connections, the following would be interesting:

  • the output of netstat -tenp | grep prometheus
  • the contents of http://yourprometheus:9090/debug/pprof/goroutine?debug=2 (will tell us if there are e.g. many scraper goroutines hanging somewhere)
  • are the targets all scraped fine (expression sum(up) / count(up) should be 1)?
  • expression sort_desc(scrape_duration_seconds): what's the worst scrape duration here?
@jmcfarlane (Author) commented Aug 29, 2015

Lots of connections shown via netstat:

$ wc -l prom.out2
23812 prom.out2

$ cat prom.out2 | awk '{print $6}' | sort | uniq -c
  23809 ESTABLISHED
      3 SYN_SENT

The debug endpoint returned 12 megs of data:

$ grep goroutine prom.out3 | awk '{print $1, $3}' | sort | uniq -c
    412 goroutine [chan
     25 goroutine [IO
     25 goroutine [runnable]:
      1 goroutine [running]:
  47828 goroutine [select,
    466 goroutine [select]:
      1 goroutine [semacquire,
      2 goroutine [syscall,
     75 goroutine [syscall]:

$ grep minutes prom.out3 | sort -k 4 -n | tail -n1
goroutine 99 [select, 1273 minutes]:

Definitely looks like it's managing to hold onto connections :)

  • sum(up) / count(up) returned: 0.8333333333333334
  • sort_desc(scrape_duration_seconds) slowest was: 10.332087977
@brian-brazil (Member) commented Aug 29, 2015

Are you using the default 10s scrape timeout?

@jmcfarlane (Author) commented Aug 29, 2015

Those details are a few comments up, but the global scrape_interval is 15s, and the node_exporter and application metrics are set to 5s.

@brian-brazil (Member) commented Aug 29, 2015

Okay, so you're having some timeouts anyway. Could you post the full stack trace for one of the most common of those 47828 goroutines?

@jmcfarlane (Author) commented Aug 29, 2015

A few of these:

goroutine 100 [select, 1273 minutes]:
github.com/syndtr/goleveldb/leveldb.(*DB).jWriter(0xc2080322c0)
        /home/ubuntu/prometheus/.build/root/gopath/src/github.com/syndtr/goleveldb/leveldb/db_write.go:37 +0x197
created by github.com/syndtr/goleveldb/leveldb.openDB
        /home/ubuntu/prometheus/.build/root/gopath/src/github.com/syndtr/goleveldb/leveldb/db.go:148 +0xa03

Lots of these:

goroutine 166 [select, 1272 minutes]:
github.com/prometheus/prometheus/retrieval.func·006(0xc208581680)
        /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:142 +0x17a
created by github.com/prometheus/prometheus/retrieval.(*TargetManager).Run
        /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:152 +0x6b5

Bunches of these:

goroutine 29751404 [select, 1050 minutes]:
net/http.(*persistConn).writeLoop(0xc26f5c5970)
        /usr/local/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:661 +0xcbc

Here's a rough summary of what's referenced the most:

$ grep -E -o '[^ ]+\.go:[0-9]+' prom.out3 | sort | uniq -c | sort -n | tail -n 35
     20     /usr/local/go/src/bufio/bufio.go:97
     20     /usr/local/go/src/net/fd_unix.go:242
     20     /usr/local/go/src/net/net.go:121
     21     /usr/local/go/src/net/fd_poll_runtime.go:89
     25     /usr/local/go/src/net/fd_poll_runtime.go:84
     28     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/target.go:370
     28     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/target.go:450
     28     /usr/local/go/src/net/http/client.go:142
     28     /usr/local/go/src/net/http/client.go:174
     28     /usr/local/go/src/net/http/client.go:219
     28     /usr/local/go/src/net/http/client.go:367
     75     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/discovery/file.go:129
     75     /home/ubuntu/prometheus/.build/root/gopath/src/gopkg.in/fsnotify.v1/inotify.go:179
     75     /home/ubuntu/prometheus/.build/root/gopath/src/gopkg.in/fsnotify.v1/inotify_poller.go:85
     75     /usr/local/go/src/syscall/zsyscall_linux_amd64.go:376
     76     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:142
     76     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:394
     76     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:84
     98     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/target.go:348
    112     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/discovery/file.go:147
    112     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/discovery/file.go:200
    112     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:147
    112     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:402
    112     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:85
    112     /home/ubuntu/prometheus/.build/root/gopath/src/gopkg.in/fsnotify.v1/inotify.go:253
    126     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:309
    187     /home/ubuntu/prometheus/.build/root/gopath/src/gopkg.in/fsnotify.v1/inotify.go:58
    188     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:133
    188     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:152
    188     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:391
    188     /home/ubuntu/prometheus/.build/root/gopath/src/github.com/prometheus/prometheus/retrieval/targetmanager.go:95
  23841     /usr/local/go/src/net/http/transport.go:928
  23855     /usr/local/go/src/net/http/transport.go:660
  23855     /usr/local/go/src/net/http/transport.go:661
  23855     /usr/local/go/src/net/http/transport.go:945
@brian-brazil (Member) commented Aug 29, 2015

Looks like the two goroutines spawned at the end of Transport.dialConn are being left hanging around. On our side I suspect https://github.com/prometheus/prometheus/blob/master/retrieval/target.go#L211 as we don't call CancelRequest or CloseIdleConnections on the old client.
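For illustration, a minimal sketch of the missing cleanup being described here, assuming Go's net/http; the type and field names are illustrative stand-ins, not the actual retrieval code:

package sketch

import "net/http"

// target is an illustrative stand-in for the per-target scrape state.
type target struct {
        httpClient *http.Client
}

// applyNewClient swaps in a freshly built client. Without the
// CloseIdleConnections call on the old transport, the readLoop/writeLoop
// goroutines started by Transport.dialConn (transport.go:928/:945 in the
// traces above) and their sockets linger until the remote end closes them.
func (t *target) applyNewClient(c *http.Client) {
        old := t.httpClient
        t.httpClient = c
        if old == nil {
                return
        }
        if tr, ok := old.Transport.(*http.Transport); ok {
                tr.CloseIdleConnections()
        }
}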

@juliusv (Member) commented Aug 30, 2015

@brian-brazil Oh, interesting observation. That could be it.

@jmcfarlane Could you try running this branch and see if the problem persists? https://github.com/prometheus/prometheus/tree/close-connections Thanks for your patience and help in getting this debugged!

@jmcfarlane (Author) commented Aug 30, 2015

Sure thing, I'll get it loaded up now. Will provide data in the morning after it's had some hours to soak.

@jmcfarlane (Author) commented Aug 30, 2015

So far the rate of increase looks about the same:

https://s3.amazonaws.com/uploads.hipchat.com/15035/57761/8BRBmu7bhgfcJhN/Screenshot%20from%202015-08-29%2023%3A19%3A22.png

Will see in the morning after it's had time to settle (8k was the previous ulimit, which we currently do not reach or exceed in production).

@brian-brazil (Member) commented Aug 30, 2015

That's probably not the problem then, as we'd expect to see ~1k file descriptors based on previous information.

@brian-brazil (Member) commented Aug 30, 2015

Hmm, maybe we need to do the same in StopScraper too?

@juliusv (Member) commented Aug 30, 2015

Right, it makes much more sense to close them there (and only there), since every scraper will be shut down before being replaced, but not every shutdown is followed by a replacement.

Updated the branch: https://github.com/prometheus/prometheus/commits/close-connections

Still done in a hurry, but if this doesn't fix it, we'll investigate more properly in the upcoming week.
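A short sketch of the updated approach, again with illustrative names rather than the real scraper type: the cleanup moves into the scraper's stop path, so it runs on every shutdown whether or not a replacement follows.

package sketch

import "net/http"

// scraper is an illustrative stand-in for a per-target scrape loop.
type scraper struct {
        client *http.Client
        done   chan struct{}
}

// stopScraper signals the scrape loop to exit and then releases any pooled
// connections held by this target's client.
func (s *scraper) stopScraper() {
        close(s.done) // the scrape loop selects on s.done and returns

        if tr, ok := s.client.Transport.(*http.Transport); ok {
                tr.CloseIdleConnections()
        }
}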

@jmcfarlane (Author) commented Aug 30, 2015

The growth continued; I'm pulling the updated branch now. Regarding the reload question earlier: we reload whenever service metadata changes. It so happens that this did not occur during the night, meaning this last period saw zero config reloads.

@jmcfarlane (Author) commented Aug 30, 2015

Indeed after 7 hours f140761 shows the same rate of increase.

@fabxc (Member) commented Aug 31, 2015

@brian-brazil @juliusv No. Our transport is initialized to not keep connections alive, and the underlying TCP connection times out after the configured scrape_timeout (legacy solution). Thus, neither CloseIdleConnections nor CancelRequest is relevant to us.
This also doesn't add up with the fact that the problem doesn't exist in 0.15.1, as nothing changed in that regard.

Also, not all of our round trippers are http.Transports, so this fix would panic depending on the config.
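For reference, a rough sketch of the transport behaviour described above (assuming Go's net/http; the construction and names are illustrative, not the actual Prometheus helper): with keep-alives disabled and a deadline on every connection, idle connections should not be able to pile up in the first place.

package sketch

import (
        "net"
        "net/http"
        "time"
)

// newScrapeClient approximates the described behaviour: keep-alives are
// disabled and the TCP connection carries a deadline tied to the scrape
// timeout, so even a hung scrape releases its socket eventually.
func newScrapeClient(timeout time.Duration) *http.Client {
        return &http.Client{
                Transport: &http.Transport{
                        DisableKeepAlives: true,
                        Dial: func(network, addr string) (net.Conn, error) {
                                c, err := net.DialTimeout(network, addr, timeout)
                                if err != nil {
                                        return nil, err
                                }
                                // Deadline on the whole connection (the "legacy solution").
                                if err := c.SetDeadline(time.Now().Add(timeout)); err != nil {
                                        c.Close()
                                        return nil, err
                                }
                                return c, nil
                        },
                },
        }
}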

@jmcfarlane are you scraping with basic auth or bearer tokens?

@dan-cleinmark commented Aug 31, 2015

@fabxc - I'll jump in since @jmcfarlane is on PST - we're not using basic auth or bearer tokens.

@jmcfarlane (Author) commented Sep 1, 2015

Sorry I was so busy today. We're going to revert the binary in our test env back to 742efb4. The idea is to re-verify the assertion that the binary is what introduced this behavior (versus some other variable as yet unaccounted for).

@jmcfarlane (Author) commented Sep 1, 2015

I can report back again in the morning, but the result is already clear. I stopped prometheus, copied the 742efb4 binary into /usr/local/bin and started prometheus again:

https://s3.amazonaws.com/uploads.hipchat.com/15035/57761/8DA6CXYM8rwNmb3/Screenshot%20from%202015-08-31%2021%3A55%3A48.png

I'm confident that last line will continue to remain flat (as it does in prod and 2 other environments). As surprising as it seems, the data suggests there is some sort of regression inside Prometheus (i.e. outside of our configuration or usage). Has anyone else seen behavior like this? It would be really interesting if it's isolated to us.

@juliusv (Member) commented Sep 1, 2015

@jmcfarlane @dan-cleinmark I've tried again to reproduce this with different combinations of target reachability (reachable, timing out, or immediate connection refusal), replacing them via the file SD frequently, and I haven't been able to reproduce this at HEAD or at 6664b77 yet. I'm on Ubuntu Linux 14.04.3 LTS by the way.

The only thing I've managed to do is increase the number of open FDs by the number of targets which are currently timing out (obviously), but that number also stays stable across target reloads and so on.

So unless we manage to reproduce this on our end somehow, could you try pinning this down to the first commit that introduced this issue (e.g. via git bisect)? The other option would be to trim down your test configuration further and further to see where the problem appears/disappears, and maybe you could even arrive at something that is shareable and reproducible for us... I guess there's no way for us to get our hands on your test system ourselves :)

@jmcfarlane (Author) commented Sep 1, 2015

I'll bisect it today, maybe we'll get lucky :)

@juliusv (Member) commented Sep 1, 2015

\o/ Thanks!

@jmcfarlane (Author) commented Sep 2, 2015

I'm finished with the bisect. Here's a graph of the open file descriptors during the attempt:

https://s3.amazonaws.com/uploads.hipchat.com/15035/57761/g2dERMCm9jJlBfM/Screenshot%20from%202015-09-01%2017%3A52%3A10.png

I'm looking through the diff now, but it appears this started here:

ubuntu@lucid-ci-x13xc7o6:~/prometheus$ git bisect bad
995d3b831d2ed57454c8dddce370e270dd035a74 is the first bad commit
commit 995d3b831d2ed57454c8dddce370e270dd035a74
Author: Julius Volz <julius@soundcloud.com>
Date:   Mon Aug 24 15:07:27 2015 +0200

    Fix most golint warnings.

    This is with `golint -min_confidence=0.5`.

    I left several lint warnings untouched because they were either
    incorrect or I felt it was better not to change them at the moment.

:040000 040000 8edf84608f50f6c1df8b815dc557c96ffe73d7cd f303f1217232f784358a03e1aecda06ad79188b6 M  cmd
:040000 040000 a35e7bbc8e68b226ec9a464432bce6bf463a860c 7e510e38168445ad974b841f4c7100efee4d68df M  config
:040000 040000 fad31d8cad1b47ce1a11210dba2ce7f4a1f5fcb6 a4a3f53a3d26d517d6bd9b4e147d02cc8420db3b M  promql
:040000 040000 300457887346ca5a353a669f64ad3459e76fd51f c040d269a4fcc7340997b6f955c3a305799bc962 M  retrieval
:040000 040000 5c55920ac9a08d7e3a759233fd525e03759faaa7 699d2fceb320037736d15ac34c54c8ea2cc03b91 M  rules
:040000 040000 99a1402f376c2395486c36e40edf734b2e2b66e8 a20dc41af1d32e851b1d4e77f126603f9bf10cfe M  storage
:040000 040000 0848ad7b812e7fa21f1f671ad4355614c7cecea6 d587ccb7189ac0613448423efcdea7d1b8bb2a53 M  template
:040000 040000 be069162f6804078bc9f8bcead0dd63f277f34f6 102665d37e629cc092a9b5ecde3fb00092911fc1 M  util
:040000 040000 f32fc616e00cba02e50708b405ce3f51d8b966de a4b6447f70d2e5da8894f606115c5850c69cb061 M  web

Here's the actual bisect itself:

    9  git bisect start HEAD 6664b77f363f2d77a552733cb2d1bd367051189f --
   10  make build
   11  scp prometheus jmcfarlane@...:
   12  git log -1
   13  git bisect bad
   14  make build
   15  git log -1
   16  mv prometheus prometheus.744d5d5a7aff6d734ed6cbc9e767c361c41250a5
   17  scp prometheus.744d5d5a7aff6d734ed6cbc9e767c361c41250a5 jmcfarlane@...:
   18  git bisect bad
   19  make build
   20  git log -1
   21  mv prometheus prometheus.78f5a93ea6d1edca15811f25e435e0d489187d0b
   22  scp prometheus.78f5a93ea6d1edca15811f25e435e0d489187d0b jmcfarlane@...:
   23  git bisect bad
   24  make build
   25  git log -1
   26  mv prometheus prometheus.995d3b831d2ed57454c8dddce370e270dd035a74
   27  scp prometheus.995d3b831d2ed57454c8dddce370e270dd035a74 jmcfarlane@...:
   28  git bisect bad
   29  git diff 995d3b831d2ed57454c8dddce370e270dd035a74^..995d3b831d2ed57454c8dddce370e270dd035a74
@jmcfarlane (Author) commented Sep 2, 2015

Wait, I think I started with the wrong commit ^...

@jmcfarlane (Author) commented Sep 2, 2015

Indeed I used the wrong hash with the bisect. I tried again with: git bisect start HEAD 742efb453 -- and wound up with:

The first bad commit could be any of:
306e8468a0e9294d301711c187e215a7ea7b1fc8 (Does not compile?)
f237b0e2daca8c6cdbf90a46d3cd3fd54c587d3b (Does not compile?)
11a577fcd054066f1e4ab3da6f2d8b38132d97b6
We cannot bisect more!

I'm looking at these diffs now.

The attempts were tracked here, where the green marks indicate the commits that did not exhibit the descriptor leak:

https://s3.amazonaws.com/uploads.hipchat.com/15035/57761/D6lTgeMcX0Jw46d/Screenshot%20from%202015-09-01%2022%3A02%3A19.png

From the perspective of this issue the last known "good build" is 7a6d12a (it's running right now and is not leaking).

@fabxc (Member) commented Sep 2, 2015

11a577f would be reasonable, as it touches the ingestion code.
The others are massive renamings and vendoring. I cannot immediately see anything in them that would cause a connection leak, though. We are calling resp.Body.Close(), which should be the only thing that matters in terms of connection closing.
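For context, a reduced sketch of the scrape path being referred to, keeping only the part that matters for descriptor hygiene (function and parameter names are illustrative, not the actual retrieval code):

package sketch

import (
        "io"
        "io/ioutil"
        "net/http"
)

// scrapeOnce fetches a target's metrics and guarantees the response body is
// closed, which is what releases the underlying socket (or returns it to the
// pool when keep-alives are enabled).
func scrapeOnce(client *http.Client, url string, decode func(io.Reader) error) error {
        resp, err := client.Get(url)
        if err != nil {
                return err
        }
        defer resp.Body.Close()

        if err := decode(resp.Body); err != nil {
                // Drain any remainder so the transport can cleanly reuse or
                // tear down the connection.
                io.Copy(ioutil.Discard, resp.Body)
                return err
        }
        return nil
}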

@juliusv (Member) commented Sep 2, 2015

@jmcfarlane Finally, how did you build Prometheus? Via make, or manually with a local Go version (make uses Go 1.4.2)?

@juliusv (Member) commented Sep 2, 2015

I answered my own question by looking at your bisect log (it mentions make build).

@fabxc (Member) commented Sep 9, 2015

We still have trouble reproducing this at all, unfortunately.
I did some significant refactoring around the same code today: #1064
You might want to run it in your test setup. Assuming it works (it's still WIP, but it seems to work fine for me), it might make the problem disappear for a reason as unknown as the cause.

@juliusv (Member) commented Sep 11, 2015

@jmcfarlane Also, #1070 looks very much like it could have been the cause of your FD leak. Want to give that a try too? (or just the PR that @fabxc mentioned above, since it will include that too once it's done).

@jmcfarlane (Author) commented Sep 11, 2015

Will do, am at home with a fever so probably not today though :/


@juliusv (Member) commented Sep 11, 2015

Aww, get better soon!

@fabxc (Member) commented Sep 21, 2015

This one is closed via #1070 as far as I am concerned.

fabxc closed this Sep 21, 2015

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
