Too many open files (established connections to same nodes) #1873

Closed
mberhault opened this Issue Aug 7, 2016 · 36 comments

mberhault commented Aug 7, 2016

What did you do?
Left prometheus running against ~20 targets using DNS discovery.

What did you expect to see?
Happy prometheus.

What did you see instead? Under which circumstances?
Unhappy prometheus: after about two weeks of running just fine, prometheus is now unresponsive on its web port and is logging lots of "too many open files" errors.

Environment
GCE Debian instance with 4 vCPUs, 15 GiB RAM.

  • System information:
$ uname -a
Linux monitoring 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
  • Prometheus version:
$ ./prometheus/prometheus -version
prometheus, version 1.0.1 (branch: master, revision: be40190)
  build user:       root@e881b289ce76
  build date:       20160722-19:54:46
  go version:       go1.6.2
  • Alertmanager version:
    (probably not relevant, but it is running, so here it is anyway)
$ ./alertmanager/alertmanager -version
alertmanager, version 0.3.0 (branch: master, revision: 97eca6d)
  build user:       root@b5b16edd7db9
  build date:       20160708-10:52:58
  go version:       go1.6.2
  • Prometheus configuration file:
    My prometheus config is as follows. The salient details are: dns_sd_configs using 'A' records, and tls_config with insecure_skip_verify. The DNS records have not changed since prometheus started.
global:
  scrape_interval: 10s

rule_files:
- "alerts.rules"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'cockroach'
    # metrics_path defaults to '/metrics'
    metrics_path: '/_status/vars'
    # scheme defaults to 'http'.
    scheme: 'https'
    # Insecure for now. TODO(mberhault): use self-signed CA.
    tls_config:
      insecure_skip_verify: true

    # Resolve DNS to find hosts.
    # *NOTE* this requires up-to-date DNS.
    dns_sd_configs:
    - names: ['beta.gce.cockroachdb.com',
              'gamma.gce.cockroachdb.com',
              'rho.gce.cockroachdb.com']
      type: 'A'
      port: 8080
    # Create a "cluster" label using the beginning of the dns name (eg: 'beta') as the value.
    # *NOTE* this requires accurate DNS names.
    relabel_configs:
    - action: replace
      source_labels: [__meta_dns_name]
      target_label: cluster
      regex: ([^.]+)\.gce\.cockroachdb\.com
      replacement: $1

Additional details
A quick lsof shows many established connections along the lines of:

prometheu 2364 root 2037u  IPv4 2981837      0t0     TCP monitoring.c.cockroach-shared.internal:56735->72.157.196.104.bc.googleusercontent.com:http-alt (ESTABLISHED)

Counting by target address, I get the following counts:

$ sudo lsof -p 2364|awk '{print $9}'|cut -d'>' -f2|sort|uniq -c|sort -n|tail
      3 pipe
     14 82.54.196.104.bc.googleusercontent.com:http-alt
     67 126.150.196.104.bc.googleusercontent.com:http-alt
    161 63.18.196.104.bc.googleusercontent.com:http-alt
    172 89.153.196.104.bc.googleusercontent.com:http-alt
    185 72.157.196.104.bc.googleusercontent.com:http-alt
    221 49.102.196.104.bc.googleusercontent.com:http-alt
    245 29.141.196.104.bc.googleusercontent.com:http-alt
    290 144.153.196.104.bc.googleusercontent.com:http-alt
    659 29.164.196.104.bc.googleusercontent.com:http-alt

Those addresses match targets that have been flapping for a while, with sometimes hours of down-time between restarts.

netstat confirms the connections are established:

# Small snippet of :8080 targets using netstat -tpen
tcp        0      0 10.142.0.32:51342       104.196.164.29:8080     ESTABLISHED 0          317662      -
tcp        0      0 10.142.0.32:59371       104.196.164.29:8080     ESTABLISHED 0          2696638     -
tcp        0      0 10.142.0.32:36329       104.196.153.144:8080    ESTABLISHED 0          2810646     -
tcp        0      0 10.142.0.32:54494       104.196.153.89:8080     ESTABLISHED 0          2898911     -

Further debugging
I am currently leaving prometheus running in case there is some extra debugging information I can provide. Please let me know.

juliusv (Member) commented Aug 7, 2016

@mberhault Thanks for the great report, and wow, that does look like a connection leak. Strange!

Maybe one more thing that could help: could you grab the contents of /debug/pprof/goroutine?debug=2? I'm curious whether the target scraping goroutines that spawned the leaked connections are still around.
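
For reference, the dump that goroutine?debug=2 returns over HTTP can also be written by a Go process itself via runtime/pprof, which is handy when the HTTP listener is already starved of file descriptors. A minimal, self-contained sketch; the SIGUSR1 trigger is purely illustrative and not something Prometheus provides:

// Dump all goroutine stacks (the same format as /debug/pprof/goroutine?debug=2)
// to stderr whenever the process receives SIGUSR1.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			// debug=2 prints the full stack of every goroutine, including
			// how long each one has been blocked.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
	select {} // stand-in for the real program's work
}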

mberhault (Author) commented Aug 8, 2016

Done. I also learned that you can now change the ulimits of running processes (see prlimit). I increased the soft limit for open files to 4096 just to be able to hit the debug endpoints.
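
As an aside on the limits: prlimit adjusts the limits of an already running process from the outside; a Go program can do the same for itself with Getrlimit/Setrlimit. A rough, Linux-only sketch, not code that Prometheus actually contains:

// Raise this process's soft limit for open files up to its hard limit.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", lim.Cur, lim.Max)

	// Only the soft limit can be raised without extra privileges; raising the
	// hard limit requires CAP_SYS_RESOURCE.
	lim.Cur = lim.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
}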

Please find attached the raw goroutine?debug=2 output and the same output run through panicparse:
goroutines.pp.txt
goroutines.raw.txt

The biggest goroutine counts are for:

2017: chan receive [4617~18774 minutes] [Created by http.(*Transport).getConn.func2 @ transport.go:699]
    http       transport.go:695        (*Transport).getConn.func2.1(*, #34, #8)

2017: chan receive [4617~18774 minutes] [Created by http.(*Transport).getConn @ transport.go:708]
    http       transport.go:834        (*Transport).dialConn(#34, 0, *, 0x5, *, *, *, 0, 0)
    http       transport.go:706        (*Transport).getConn.func4(#34, 0, *, 0x5, *, *, *)

1898: IO wait [4617~18774 minutes] [Created by http.(*Transport).dialConn @ transport.go:833]
    net        netpoll.go:160          runtime_pollWait(*, 0x72, *)
    net        fd_poll_runtime.go:73   (*pollDesc).Wait(*, 0x72, 0, 0)
    net        fd_poll_runtime.go:78   (*pollDesc).WaitRead(*, 0, 0)
    net        fd_unix.go:250          (*netFD).Read(*, *, 0x400, 0x400, 0, #22551, #11)
    net        net.go:172              (*conn).Read(*, *, 0x400, 0x400, 0, 0, 0)
    tls        conn.go:460             (*block).readFromUntil(*, #22553, *, 0x5, 0, 0)
    tls        conn.go:562             (*Conn).readRecord(*, 0x16, 0, 0)
    tls        conn.go:800             (*Conn).readHandshake(*, 0, 0, 0, 0)
    tls        handshake_client.go:149 (*Conn).clientHandshake(*, 0, 0)
    tls        conn.go:1033            (*Conn).Handshake(*, 0, 0)
    http       transport.go:828        (*Transport).dialConn.func3(*, 0, *)

With the increased ulimit, prometheus is responsive again, and reachable at http://monitoring.gce.cockroachdb.com:9090, feel free to poke around.

mberhault (Author) commented Aug 15, 2016

Any updates on this?

beorn7 (Member) commented Aug 22, 2016

Wild guesses:

This seems to be related to TLS connections. They are still leaking, but they come in bursts, irregularly but ~once per day on average. Goroutines and open fd's leak at the same time: http://monitoring.gce.cockroachdb.com:9090/graph#%5B%7B%22range_input%22%3A%221w%22%2C%22end_input%22%3A%22%22%2C%22step_input%22%3A%22%22%2C%22stacked%22%3A%22%22%2C%22expr%22%3A%22go_goroutines%22%2C%22tab%22%3A0%7D%2C%7B%22range_input%22%3A%221w%22%2C%22end_input%22%3A%22%22%2C%22step_input%22%3A%22%22%2C%22stacked%22%3A%22%22%2C%22expr%22%3A%22process_open_fds%22%2C%22tab%22%3A0%7D%5D

Even wilder guess: This happens if a TLS scrape times out, possibly because the time-out cancelation via the context is not propagated throughout the stack. (Works for normal HTTP, but not for HTTPS.)

Needs more investigation, though. @fabxc might be qualified.
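
To make the guess concrete: the goroutine dump above shows goroutines parked in tls.(*Conn).Handshake for thousands of minutes, which is what an unbounded handshake looks like from the client side. A minimal sketch of the kind of per-phase limits that would prevent a stuck handshake from pinning a file descriptor indefinitely; the field values are illustrative, not what Prometheus shipped with at the time:

// An HTTPS client whose connect, TLS handshake, response header, and overall
// request durations are all bounded.
package main

import (
	"net"
	"net/http"
	"time"
)

func newScrapeClient() *http.Client {
	transport := &http.Transport{
		// Bound the TCP connect and the TLS handshake separately; a handshake
		// with no deadline is what leaves a goroutine (and its fd) stuck.
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 10 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   15 * time.Second, // overall cap per request, including the body
	}
}

func main() {
	resp, err := newScrapeClient().Get("https://example.com/metrics") // placeholder target
	if err != nil {
		return
	}
	resp.Body.Close()
}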

beorn7 (Member) commented Sep 28, 2016

Sorry for still not having a solution here. Everybody seems terribly busy...

However, I have seen a slightly similar leak on one of our Prometheus servers using TLS for scraping, and I couldn't reproduce the leak with Prometheus 1.1.2+. Could you upgrade your server and see if the leak is still happening? (This is based on the theory that the leak happens somewhere in the HTTP-TLS stack, and that bug got fixed in the library.)

mberhault (Author) commented Oct 14, 2016

This is still happening on 1.1.3.
It is really only noticeable when hosts are flapping.

Similar setup as before (insecure TLS): lsof lists a few hundred open FDs each for a few different nodes.

beorn7 (Member) commented Oct 14, 2016

I can now see it again on 1.2.1. This needs careful investigation… ;-/

beorn7 self-assigned this Nov 2, 2016

beorn7 (Member) commented Nov 3, 2016

And now I have upgraded to 1.2.2 (with Go 1.7.3 and a lot of vendoring updates, i.e. the vendored libraries have changed quite a bit), and miraculously, the server only uses a third of the file handles and a lot less RAM.

@mberhault at your convenience, could you try out 1.2.2, too?

beorn7 (Member) commented Nov 9, 2016

After running 1.2.2 for a while, it has now stabilized at ~650 fds; 1.2.1 needed ~1100 in the same setup. So something has definitely been improved in either Go 1.7.3 or one of the other dependencies we vendor.

mberhault (Author) commented Nov 9, 2016

I've upgraded to 1.2.3. Will check back in after a little while.

mberhault (Author) commented Nov 13, 2016

We've been on 1.2.3 for 4 days now. It was fine for a while, but checking back in, we're seeing occasional jumps in the open file descriptor count:
[screenshot taken 2016-11-13 10:40 AM: graph of open file descriptors over time]

The marker at the line above indicates our upgrade from 1.1.3 to 1.2.3.

A quick lsof right now shows ~200 established connections to the same two hosts:

$ sudo lsof -p $(pidof prometheus) | grep TCP | cut -d'>' -f2 | sort -n | uniq -c| sort -n
      1 40.84.24.17:http-alt (ESTABLISHED)
      1 localhost:43364 (ESTABLISHED)
      1 localhost:45071 (ESTABLISHED)
      1 localhost:46088 (ESTABLISHED)
      1 localhost:56506 (ESTABLISHED)
      1 prometheu 22160 root   26u  IPv4 46510232      0t0      TCP localhost:9090 (LISTEN)
      4 204.163.196.104.acc.googleusercontent.com:http-alt (ESTABLISHED)
      4 238.23.196.104.acc.googleusercontent.com:http-alt (ESTABLISHED)
      7 104.46.103.35:http-alt (ESTABLISHED)
      7 248.8.196.104.acc.googleusercontent.com:http-alt (ESTABLISHED)
     97 49.102.196.104.acc.googleusercontent.com:http-alt (ESTABLISHED)
    169 154.165.196.104.acc.googleusercontent.com:http-alt (ESTABLISHED)

The first increase (around 2016-11-09T18:38) matches node 49.102.196.104.acc.googleusercontent.com going down, although it came back about 6 hours later and has been up and reachable by prometheus since (except for a restart or two).
The second increase (around 2016-11-13T01:21) matches the node 154.165.196.104.acc.googleusercontent.com going down. That node is still down.

I was going to grab a goroutine profile, but /debug/pprof/goroutine returns 404. It could be that it's no longer baked in, but more likely it's because we're now running prometheus with --web.external-url=.../prometheus. I tried both /prometheus/debug/pprof/goroutine and /debug/pprof/goroutine through localhost (we're behind oauth2_proxy now), to no avail. This might need to be filed as a separate issue.

beorn7 (Member) commented Nov 14, 2016

Thanks @mberhault for the investigation. By now, our own server has decided to eat more fds again. So while the various updates seem to have reduced the baseline hunger for fds, something is still leaking.

yonkeltron commented Dec 29, 2016

I can confirm that this still happens with 1.3.1 (build be47695) and I am not using any TLS functionality. Allocated file descriptors keep increasing steadily until Prometheus becomes unresponsive or it gets restarted manually every 45 minutes or so.

In the following plot (Grafana, running on the same host), you can see that allocated FDs steadily increase (fairly rapidly, as this shows the last hour). The steep drop-offs correspond to manual restarts. The data source is the official node exporter, queried for node_filefd_allocated.

[screenshot taken 2016-12-29 4:15 PM: Grafana graph of node_filefd_allocated over the last hour]

When I run sudo lsof | grep prometheus, I get tons of entries like the following snippet, taken from the end of the output. Not sure whether this constitutes normal behavior.

prometheu 14307 14315     prometheus  767u     IPv6             251020       0t0        TCP localhost:54321->localhost:32936 (ESTABLISHED)
prometheu 14307 14315     prometheus  768u     IPv6             250997       0t0        TCP localhost:54321->localhost:32934 (ESTABLISHED)
prometheu 14307 14315     prometheus  769u     IPv6             251021       0t0        TCP localhost:54321->localhost:32937 (ESTABLISHED)
prometheu 14307 14315     prometheus  770u     IPv6             251022       0t0        TCP localhost:54321->localhost:32938 (ESTABLISHED)
prometheu 14307 14315     prometheus  771u     IPv6             251023       0t0        TCP localhost:54321->localhost:32939 (ESTABLISHED)
prometheu 14307 14315     prometheus  772u     IPv6             251972       0t0        TCP localhost:54321->localhost:33066 (ESTABLISHED)
prometheu 14307 14315     prometheus  773u     IPv6             251478       0t0        TCP localhost:54321->localhost:32994 (ESTABLISHED)
prometheu 14307 14315     prometheus  774u     IPv6             251479       0t0        TCP localhost:54321->localhost:32995 (ESTABLISHED)
prometheu 14307 14315     prometheus  775u     IPv6             251480       0t0        TCP localhost:54321->localhost:32996 (ESTABLISHED)
prometheu 14307 14315     prometheus  776u     IPv6             251481       0t0        TCP localhost:54321->localhost:32997 (ESTABLISHED)
prometheu 14307 14315     prometheus  777u     IPv6             251482       0t0        TCP localhost:54321->localhost:32998 (ESTABLISHED)
prometheu 14307 14315     prometheus  778u     IPv6             251483       0t0        TCP localhost:54321->localhost:32999 (ESTABLISHED)
prometheu 14307 14315     prometheus  779u     IPv6             251973       0t0        TCP localhost:54321->localhost:33067 (ESTABLISHED)
prometheu 14307 14315     prometheus  780u     IPv6             251992       0t0        TCP localhost:54321->localhost:33068 (ESTABLISHED)
prometheu 14307 14315     prometheus  781u     IPv6             251993       0t0        TCP localhost:54321->localhost:33069 (ESTABLISHED)
prometheu 14307 14315     prometheus  782u     IPv6             251994       0t0        TCP localhost:54321->localhost:33070 (ESTABLISHED)
prometheu 14307 14315     prometheus  783u     IPv6             251995       0t0        TCP localhost:54321->localhost:33071 (ESTABLISHED)

When I check the logs, I see lots of errors that look like this:

...snip...
Dec 29 14:32:36 HOSTNAME prometheus[714]: time="2016-12-29T14:32:36-05:00" level=warning msg="Series quarantined." fingerprint=5c58d0b83771436e metric=http_response_size_bytes{handler="prometheus", instance="omitted", job="prometheus", quantile="0.99"} reason="open data/5c/58d0b83771436e.db: too many open files" source="storage.go:1646"
Dec 29 14:32:36 HOSTNAME prometheus[714]: time="2016-12-29T14:32:36-05:00" level=error msg="Error indexing label pair to fingerprints batch: open data/labelpair_to_fingerprints/000011.log: too many open files" source="persistence.go:1258"
Dec 29 14:32:37 HOSTNAME prometheus[714]: time="2016-12-29T14:32:37-05:00" level=error msg="http: Accept error: accept tcp [::]:54321: accept4: too many open files; retrying in 1s\n" source="<autogenerated>:38"
...snip...

Some additional info:

$ uname -a
Linux HOSTNAME 3.10.0-327.4.4.el7.x86_64 #1 SMP Thu Dec 17 15:51:24 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

$ ps ax | grep prometheus
  750 ?        Ssl    1:06 /opt/prometheus-exporter/node_exporter
15973 ?        Ssl    0:18 /opt/prometheus/prometheus -config.file=prometheus.yml -alertmanager.url http://hostname-omitted:9093 -web.listen-address 0.0.0.0:54321 -storage.local.retention=24h0m0s

I stand ready to provide any additional information which could help with diagnostics or otherwise contribute positively. Thanks!

yonkeltron commented Dec 30, 2016

Some additional details:

We just upgraded to Prometheus 1.4.1 (build 2a89e87) and the issue still occurs. However, we noticed that it only happens when the Grafana dashboard is open. When we close the dashboard, the increase ceases and the allocated count remains flat.

juliusv (Member) commented Dec 31, 2016

That sounds like Grafana establishes query connections at a faster rate than they can be answered, so connections pile up. Maybe it does not cancel outstanding queries before sending a new refresh.
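
A minimal sketch of the refresh pattern being suggested here, in which each new refresh cancels the previous in-flight query instead of letting slow queries stack up connections; the query URL and interval are placeholders, not Grafana's actual implementation:

// Poll a query endpoint on a ticker, cancelling the previous request before
// issuing the next one.
package main

import (
	"context"
	"net/http"
	"time"
)

func main() {
	const queryURL = "http://localhost:9090/api/v1/query?query=up" // placeholder query
	client := &http.Client{Timeout: 30 * time.Second}

	cancelPrev := func() {} // no-op until the first query is in flight
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		cancelPrev() // abandon the previous refresh instead of letting it pile up

		ctx, cancel := context.WithCancel(context.Background())
		cancelPrev = cancel

		req, err := http.NewRequest("GET", queryURL, nil)
		if err != nil {
			continue
		}
		go func() {
			resp, err := client.Do(req.WithContext(ctx))
			if err != nil {
				return
			}
			resp.Body.Close() // drain/close so the connection can be reused or freed
		}()
	}
}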

brian-brazil (Member) commented Dec 31, 2016

Make sure you're not running Grafana 4.0.0; it has a known issue in that regard.

brian-brazil (Member) commented Feb 13, 2017

Is this still happening?

korovkin commented Feb 28, 2017

I am unable to start Prometheus due to this error:

tail -f /var/log/prometheus/prometheus.log
time="2017-02-28T14:57:44-08:00" level=warning msg="Truncating file /var/lib/prometheus/metrics/31/9286a11209929d.db to exactly 15 chunks, trimming 837 extraneous bytes." source="crashrecovery.go:220"
time="2017-02-28T14:57:44-08:00" level=error msg="Could not open file /var/lib/prometheus/metrics/31/9286a11209929d.db: open /var/lib/prometheus/metrics/31/9286a11209929d.db: too many open files" source="crashrecovery.go:223"
time="2017-02-28T14:57:44-08:00" level=warning msg="Truncating file /var/lib/prometheus/metrics/31/414caab031fc3f.db to exactly 68 chunks, trimming 994 extraneous bytes." source="crashrecovery.go:220"
time="2017-02-28T14:57:44-08:00" level=error msg="Could not open file /var/lib/prometheus/metrics/31/414caab031fc3f.db: open /var/lib/prometheus/metrics/31/414caab031fc3f.db: too many open files" source="crashrecovery.go:223"
time="2017-02-28T14:57:44-08:00" level=warning msg="Truncating file /var/lib/prometheus/metrics/31/08eee5531e9789.db to exactly 60 chunks, trimming 89 extraneous bytes." source="crashrecovery.go:220"
time="2017-02-28T14:57:44-08:00" level=error msg="Could not open file /var/lib/prometheus/metrics/31/08eee5531e9789.db: open /var/lib/prometheus/metrics/31/08eee5531e9789.db: too many open files" source="crashrecovery.go:223"
time="2017-02-28T14:57:44-08:00" level=warning msg="Fingerprint 31297a859d249cc6 assumed archived but couldn't be found in archived index." source="crashrecovery.go:349"
time="2017-02-28T14:57:44-08:00" level=warning msg="Fingerprint 310e4cc9a9f79988 assumed archived but couldn't be found in archived index." source="crashrecovery.go:349"
time="2017-02-28T14:57:44-08:00" level=warning msg="Fingerprint 3181bda1121b6e75 assumed archived but couldn't be found in archived index." source="crashrecovery.go:349"
time="2017-02-28T14:57:44-08:00" level=error msg="Error opening memory series storage: open /var/lib/prometheus/metrics/32: too many open files" source="main.go:137"

korovkin commented Feb 28, 2017

What is the right way to recover here?

beorn7 (Member) commented Mar 1, 2017

You need to grant the process more file handles (with ulimit, see man page).

Your error message happens during crash recovery. This issue is about leaking file descriptors during normal operation, though.

korovkin commented Mar 1, 2017

Understood. What is the advised limit to set?
As a side note: why is Prometheus trying to open so many files at once? Is it not reasonable to build a mechanism that ensures the number of open files is capped by a static limit?

beorn7 (Member) commented Mar 1, 2017

Prometheus does not use an excessive number of files (barring this bug, where some people occasionally see difficult-to-reproduce fd leaks). Crash recovery might be a bit more demanding, but I would assume you are running with a very low file limit and/or a very large Prometheus data store.

brian-brazil (Member) commented Mar 1, 2017

Would you mind sharing the output of ulimit -n in the context Prometheus runs in? This is also available as process_max_fds on Prometheus's /metrics. It will help tell whether you have an unusually low limit, or whether there is indeed some issue lurking in the code somewhere.

beorn7 (Member) commented Mar 1, 2017

Unfortunately, during crash recovery, /metrics doesn't work yet.

lsof can still help to see how many files (and which) are actually open.

korovkin commented Mar 2, 2017

So apparently I am running a pretty old version (installed via apt-get on Ubuntu):

 # which prometheus
/usr/bin/prometheus
 # prometheus --version
prometheus, version 0.16.2+ds (branch: debian/sid, revision: 0.16.2+ds-1ubuntu1)
  build user:       pkg-go-maintainers@lists.alioth.debian.org
  build date:       20160408-04:15:29
  go version:       go1.6

I have set aside the data that caused the issue:

 # sudo du -sh ~/metrics/
7.5G    /home/korovkin/metrics/

 # sudo find ~/metrics/ -type f | wc -l
139167

# ls -la metrics/ | tail -n 10
drwx------   2 prometheus     24576 Feb 28 14:12 fb
drwx------   2 prometheus     24576 Feb 28 13:15 fc
drwx------   2 prometheus     20480 Feb 28 14:04 fd
drwx------   2 prometheus     20480 Feb 28 14:04 fe
drwx------   2 prometheus     20480 Feb 28 13:21 ff
-rw-r-----   1 prometheus 290127872 Feb 28 14:20 heads.db
drwxr-xr-x   2 prometheus      4096 Feb 28 15:04 labelname_to_labelvalues
drwxr-xr-x   2 prometheus      4096 Feb 28 15:04 labelpair_to_fingerprints
drwx------  63 prometheus      4096 Feb 28 15:05 orphaned
-rw-r--r--   1 prometheus         2 Feb 20 20:38 VERSION

Right now, after cleaning out the old data set:

 # sudo lsof | grep prometheus | wc -l
1946
 # sudo du -sh  /var/lib/prometheus/metrics
1.1G    /var/lib/prometheus/metrics
 # sudo find  /var/lib/prometheus/metrics -type f | wc -l
33249


 # sudo prlimit --pid `pidof prometheus`
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                   1000000 unlimited blocks
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited blocks
LOCKS      max number of file locks held      unlimited unlimited
MEMLOCK    max locked-in-memory address space     65536     65536 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files                4000      4096
NPROC      max number of processes               191867    191867
RSS        max resident set size              unlimited unlimited pages
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         191867    191867
STACK      max stack size                       8388608 unlimited bytes


Please let me know what else I can provide.

korovkin commented Mar 2, 2017

# ulimit -u
191867

# sudo -H -u prometheus bash -c 'ulimit -u'
191867

beorn7 (Member) commented Mar 2, 2017

Definitely upgrade to the newest version. Crash recovery might indeed open too many files in the old version, but many bugs have been fixed in the meantime.

korovkin commented Mar 2, 2017

Got it, thank you very much for taking a look!

korovkin commented Mar 6, 2017

Please take a look at a related issue: #2475

mjansing commented Mar 20, 2017

Any updates on this? I'm also running into this issue with static_config and a custom tls_config ca-file.

mjansing commented Mar 28, 2017

I tried to dig a bit more into this issue. I use Prometheus 1.5.2 to scrape five hosts. Each host serves node_exporter metrics behind an nginx reverse proxy using a self-signed certificate (protected via basic auth). After 2 days Prometheus becomes unresponsive, having reached the max open fd limit:

$ ulimit -n => 1024
$ ls -l /proc/$PROMETHEUS_PID/fd | wc -l => 1025

It looks like Prometheus keeps connections to some of these hosts open until the fd limit is reached:

$ lsof -p $PROMETHEUS_PID | awk '{print $9}'| cut -d'>' -f2 | sort | uniq -c | sort -n | tail
      1 /var/lib/prometheus/data/labelpair_to_fingerprints/000003.log
      1 /var/lib/prometheus/data/labelpair_to_fingerprints/LOCK
      1 /var/lib/prometheus/data/labelpair_to_fingerprints/LOG
      1 /var/lib/prometheus/data/labelpair_to_fingerprints/MANIFEST-000004
      1 NAME
      1 [eventpoll]
      2 type=STREAM
  206 ip213-244-XXX-XX.XXXXXX.com:https
  274 ip185-48-XXX-XXX.XXXXXX.com:https
  513 ip213-244-XXX-XX.XXXXXX.com:https
$ lsof -p $PROMETHEUS_PID
...snip...
prometheu 56128 root   71u     IPv4            6247120      0t0     TCP monitoring.XXXXX.XXXX.de:53984->ip185-48-XXX-XXX.XXXXXX.com:https (ESTABLISHED)
prometheu 56128 root   72u     IPv4            6249575      0t0     TCP monitoring.XXXXX.XXXX.de:54124->ip185-48-XXX-XXX.XXXXXX.com:https (ESTABLISHED)
prometheu 56128 root   73u     IPv4            6246813      0t0     TCP monitoring.XXXXX.XXXX.de:57626->ip213-244-XXX-XX.XXXXXX.com:https (ESTABLISHED)
prometheu 56128 root   74u     IPv4            6249438      0t0     TCP monitoring.XXXXX.XXXX.de:54090->ip185-48-XXX-XXX.XXXXXX.com:https (ESTABLISHED)
prometheu 56128 root   75u     IPv4            6247208      0t0     TCP monitoring.XXXXX.XXXX.de:54052->ip185-48-XXX-XXX.XXXXXX.com:https (ESTABLISHED)
prometheu 56128 root   76u     IPv4            6247175      0t0     TCP monitoring.XXXXX.XXXX.de:54018->ip185-48-XXX-XXX.XXXXXX.com:https (ESTABLISHED)
prometheu 56128 root   77u     IPv4            6249728      0t0     TCP monitoring.XXXXX.XXXX.de:54158->ip185-48-XXX-XXX.XXXXXX.com:https (ESTABLISHED)
prometheu 56128 root   78u     IPv4            6249839      0t0     TCP monitoring.XXXXX.XXXX.de:54192->ip185-48-XXX-XXX.XXXXXX.com:https (ESTABLISHED)
...snip...

I often get noisy alerts because Prometheus sometimes cannot scrape data from these hosts, and then I see "context deadline exceeded" errors on the target status page. While these errors occur, I cannot even scrape the hosts via curl:

$ curl -u $USER:$PW --cacert /path/to/cert.crt https://myhost.XXXXX.XXXX.de/metrics -vvv
*   Trying 213.244.XXX.XX...
* Connected to XXXXX.XXXXX.XXX.de (213.244.XXX.XX) port 443 (#0)
* found 1 certificates in /path/to/cert.crt
* found 692 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* gnutls_handshake() failed: Error in the pull function.
* Closing connection 0
curl: (35) gnutls_handshake() failed: Error in the pull function.

Nginx error logs from the scrape target (log level switched to debug) during the curl request:

2017/03/23 14:20:34 [debug] 826#0: accept on 0.0.0.0:443, ready: 0
2017/03/23 14:20:34 [debug] 827#0: accept on 0.0.0.0:443, ready: 0
2017/03/23 14:20:34 [debug] 826#0: posix_memalign: 00000000022FAC00:512 @16
2017/03/23 14:20:34 [debug] 827#0: accept() not ready (11: Resource temporarily unavailable)
2017/03/23 14:20:34 [debug] 826#0: *18874 accept: 78.137.XXX.XXX:51722 fd:4
2017/03/23 14:20:34 [debug] 826#0: *18874 event timer add: 4: 60000:1490278894566
2017/03/23 14:20:34 [debug] 826#0: *18874 reusable connection: 1
2017/03/23 14:20:34 [debug] 826#0: *18874 epoll add event: fd:4 op:1 ev:80002001
2017/03/23 14:20:34 [debug] 826#0: *18874 http check ssl handshake
2017/03/23 14:20:34 [debug] 826#0: *18874 http recv(): 1
2017/03/23 14:20:34 [debug] 826#0: *18874 https ssl handshake: 0x16
2017/03/23 14:20:34 [debug] 826#0: *18874 SSL server name: "XXXXX.XXXXX.XXXX.de"
2017/03/23 14:20:34 [debug] 826#0: *18874 SSL ALPN supported by client: http/1.1
2017/03/23 14:20:34 [debug] 826#0: *18874 SSL ALPN selected: http/1.1
2017/03/23 14:20:34 [debug] 826#0: *18874 SSL_do_handshake: -1
2017/03/23 14:20:34 [debug] 826#0: *18874 SSL_get_error: 2
2017/03/23 14:20:34 [debug] 826#0: *18874 reusable connection: 0
2017/03/23 14:21:34 [debug] 826#0: *18874 event timer del: 4: 1490278894566
2017/03/23 14:21:34 [debug] 826#0: *18874 SSL handshake handler: 0
2017/03/23 14:21:34 [info] 826#0: *18874 client timed out (110: Connection timed out) while SSL handshaking, client: 78.137.XXX.XX, server: 0.0.0.0:443
2017/03/23 14:21:34 [debug] 826#0: *18874 close http connection: 4
2017/03/23 14:21:34 [debug] 826#0: *18874 reusable connection: 0
2017/03/23 14:21:34 [debug] 826#0: *18874 free: 00000000022FAC00, unused: 131

So there is a problem during the SSL handshake, and it is specific to my Prometheus hosts (Ubuntu 16.04). I don't have any problems when I run a local Prometheus instance or use curl on my local dev machine (macOS).

To me it looks like Prometheus keeps connections open if one of these strange SSL handshake errors occurs. The SSL handshake problem during the curl request seems to be a GnuTLS problem when the target sits behind a reverse proxy: http://askubuntu.com/questions/186847/error-gnutls-handshake-failed-when-connecting-to-https-servers.

So there might be two problems:

  1. Prometheus becomes unresponsive after using up all available fds (this issue)
  2. Failing scrape requests, which result in many open connections to the same nodes

Could this be a problem in Prometheus's underlying HTTP/TLS stack, similar to the GnuTLS one?

Hope my small investigation helps.

beorn7 (Member) commented Apr 2, 2017

Thanks @mjansing. This is very helpful research.

AFAIK, Prometheus is currently not supposed to use persistent connections, a.k.a. "keep-alive" (not even via TLS, where the handshake is much more expensive and persistent connections are therefore more common). Cf. #2498 for more discussion of the general topic.

However, here it looks like Prometheus keeps a TLS connection open if it doesn't end in a regularly completed scrape but in a timeout. Perhaps the Go HTTP stack uses persistent connections by default with TLS, and our code somehow doesn't honor that and opens a completely new connection for the next scrape rather than reusing the one that is still open.

About the issue of becoming unresponsive when all allowed FDs are in use: I wouldn't consider this a bug. Prometheus opens FDs all the time (especially for the embedded LevelDB). If that's not possible, all kinds of things break (and in the case of LevelDB, not only in our own code but also in the libraries we use).
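
For context, the transport behaviour discussed above maps onto a few fields of Go's net/http Transport. A minimal sketch of the two options, with illustrative values rather than Prometheus's actual settings:

// Two ways to keep idle HTTPS connections from accumulating on the client.
package main

import (
	"net/http"
	"time"
)

// Option 1: close every connection after its request, so a flapping target can
// never accumulate idle sockets (at the cost of a fresh TLS handshake each time).
func noKeepAlive() *http.Transport {
	return &http.Transport{DisableKeepAlives: true}
}

// Option 2: reuse connections, but cap how many idle ones may linger per host
// and for how long before they are closed.
func boundedKeepAlive() *http.Transport {
	return &http.Transport{
		MaxIdleConnsPerHost: 1,
		IdleConnTimeout:     90 * time.Second,
	}
}

func main() {
	_ = &http.Client{Transport: noKeepAlive()}
	_ = &http.Client{Transport: boundedKeepAlive()}
}

Disabling keep-alives trades extra TLS handshakes for a guarantee that idle sockets cannot pile up; the bounded variant keeps handshake cost down but relies on the idle-connection reaper to close leftovers.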

mjansing commented Apr 3, 2017

@beorn7: I was able to solve my problem. A "faulty" (!?!) network configuration (and not Prometheus itself) led to our issues. Our Ubuntu-based Prometheus host had an MTU of 64000. After manually setting the MTU to 1500 via a post-up hook, all problems went away and everything works fine:

# The primary network interface
allow-hotplug eth0
iface eth0 inet dhcp
post-up /sbin/ifconfig eth0 mtu 1500 # manually set mtu

Maybe this also helps @mberhault, @yonkeltron and @korovkin.

carlpett commented Apr 22, 2017

We're seeing this as well, though in our case the leak comes in the form of connections to Consul (running locally, so connections to localhost:8500). Currently we need to restart due to fd exhaustion once or twice per week. We first saw it on 1.3, upgraded to 1.4 and still had the issue, and now we're running 1.5.2.

Here are node_filefd_allocated and Prometheus's own process_open_fds, as rate(1h) and as total allocations:

[screenshot: graphs of node_filefd_allocated and process_open_fds]

We see some correlation between configuration reloads and the rate of allocation; frequent reloads seem to make the problem worse. We use consul-template to generate the Prometheus configuration file and have had issues with flapping services causing forced reloads. We've implemented better handling of flapping, but the rate of new services appearing in the environment alone causes a couple of reloads per day, so we'll never get rid of the reloads entirely.

Any other data I can share that would be enlightening?

brian-brazil (Member) commented Jul 14, 2017

There are several different issues being discussed here, so it's very hard to follow what is and isn't fixed. If there are remaining issues, please open new, separate bugs for them.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
