
Error too many open files and file already closed - Prometheus 2.0 #3563

Closed
zarkoc opened this issue Dec 8, 2017 · 5 comments

Comments


zarkoc commented Dec 8, 2017

Hi all,

I'm running Prometheus 2.0 and scraping AWS targets. We have around 2000 targets with node_exporter that are scraped. We have only a couple of relabeling options in the config; everything else is at its default.
The same configuration works as expected on Prometheus 1.8.
Are there some tweaks that need to be done on Prometheus 2.0 (as there were for Prometheus 1.x) for the storage engine to be able to consume this many targets?

The disk where the data is stored isn't full; it's only 6% full, with around 500GB free.

We get the following errors in the Prometheus logs:

Dec 08 11:26:47 prom-server.xxx.com systemd[1]: Started Prometheus.
Dec 08 11:26:47 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:26:47.419039583Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
Dec 08 11:26:47 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:26:47.419087944Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
Dec 08 11:26:47 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:26:47.419104976Z caller=main.go:217 host_details="(Linux 4.4.0-1041-aws #50-Ubuntu SMP Wed Nov 15 22:18:17 UTC 2017 x86_64 prom-server.xxx.com (none))"
Dec 08 11:26:47 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:26:47.421872234Z caller=web.go:380 component=web msg="Start listening for connections" address=localhost:9090
Dec 08 11:26:47 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:26:47.4220576Z caller=main.go:314 msg="Starting TSDB"
Dec 08 11:26:47 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:26:47.422069562Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
Dec 08 11:27:07 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:27:07.190183909Z caller=main.go:326 msg="TSDB started"
Dec 08 11:27:07 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:27:07.190257034Z caller=main.go:394 msg="Loading configuration file" filename=prometheus.yml
Dec 08 11:27:07 prom-server.xxx.com prometheus[27413]: level=info ts=2017-12-08T11:27:07.191331865Z caller=main.go:371 msg="Server is ready to receive requests."
Dec 08 11:28:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:28:07.19059331Z caller=db.go:255 component=tsdb msg="retention cutoff failed" err="open directory: open /data/storage/prometheus/data: too many open files"
Dec 08 11:28:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:28:07.19065022Z caller=db.go:260 component=tsdb msg="compaction failed" err="plan compaction: open /data/storage/prometheus/data: too many open files"
Dec 08 11:28:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:28:07.474258221Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.eu-central-1.amazonaws.com/: dial tcp: lookup ec2.eu-central-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:28:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:28:07.489219937Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: lookup ec2.us-east-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:28:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:28:07.495240394Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: lookup ec2.us-east-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:28:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:28:07.513110232Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: lookup ec2.us-east-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:29:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:29:07.454424535Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: lookup ec2.us-east-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:29:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:29:07.493904975Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: lookup ec2.us-east-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:29:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:29:07.556286813Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: lookup ec2.us-east-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:29:07 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:29:07.583265194Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.eu-central-1.amazonaws.com/: dial tcp: lookup ec2.eu-central-1.amazonaws.com on 10.1.1.10:53: dial udp 10.1.1.10:53: socket: too many open files"
Dec 08 11:29:09 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:29:09.191216888Z caller=db.go:255 component=tsdb msg="retention cutoff failed" err="open directory: open /data/storage/prometheus/data: too many open files"
Dec 08 11:29:09 prom-server.xxx.com prometheus[27413]: level=error ts=2017-12-08T11:29:09.191287616Z caller=db.go:260 component=tsdb msg="compaction failed" err="plan compaction: open /data/storage/prometheus/data: too many open files"

....

Dec 08 11:34:09 prom-server.xxx.com prometheus[27413]: level=warn ts=2017-12-08T11:34:09.95928083Z caller=scrape.go:673 component="target manager" scrape_pool=aws target=http://10.100.100.21:9100/metrics msg="append failed" err="WAL log samples: log series: write /data/storage/prometheus/data/wal/000402: file already closed"
@brian-brazil
Contributor

Try increasing the ulimit for open files.
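Since the logs above show Prometheus running under systemd, the effective file-descriptor limit typically comes from the unit file, not from the login shell's ulimit. A minimal drop-in sketch, assuming the unit is named prometheus.service (the path and the value 65536 are illustrative, not prescriptive):

```ini
# /etc/systemd/system/prometheus.service.d/limits.conf
# (hypothetical drop-in path -- adjust to your actual unit name)
[Service]
LimitNOFILE=65536
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service; the limit actually applied to the running process can be checked in /proc/&lt;pid&gt;/limits.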

It makes more sense to ask questions like this on the prometheus-users mailing list rather than in a GitHub issue. On the mailing list, more people are available to potentially respond to your question, and the whole community can benefit from the answers provided.


kelein commented Feb 7, 2018

Feb 07 10:54:05 level=info  ts=2018-02-07T02:54:05.460754789Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1517947200000 maxt=1517954400000
Feb 07 10:54:05 level=error ts=2018-02-07T02:54:05.461007543Z caller=compact.go:397 component=tsdb msg="removed tmp folder after failed compaction" err="open /data/prometheus/01C5Q2YFMM61BNGJRB3HTYXJH4.tmp: too many open files"
Feb 07 10:54:05 level=error ts=2018-02-07T02:54:05.461037107Z caller=db.go:265 component=tsdb msg="compaction failed" err="persist head block: open chunk writer: open /data/prometheus/01C5Q2YFMM61BNGJRB3HTYXJH4.tmp/chunks: too many open files"

@andreasnuesslein

Hey @brian-brazil and folks :)

We're having the same problem running into a bunch of these:

Feb 20 18:19:47 pro prometheus[2897]: level=info ts=2018-02-20T18:19:47.316282438Z caller=main.go:421 msg="Scrape manager stopped"
Feb 20 18:19:47 pro prometheus[2897]: level=error ts=2018-02-20T18:19:47.329011989Z caller=main.go:522 msg="Error stopping storage" err="write /var/app/prometheus/data/wal/006383: file already closed"

It happens only within minutes/hours of restarting Prometheus, not after days.
I've been browsing the GitHub bug tracker as well as the Google group without finding my exact problem. I've updated to Prometheus 2.2.0-rc.0 to rule out the Alertmanager problem I've been reading about.
I've also tried heavily increasing the ulimit, but according to process_open_fds it's only consuming about 800.
I've tried fiddling with sysctl variables with no luck so far, although I have to admit I'm not exactly sure about the implications of tcp_tw_reuse, tcp_tw_recycle, and similar.
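One way to cross-check the process_open_fds reading from outside Prometheus is to count the entries under /proc/&lt;pid&gt;/fd and compare them against the soft limit. A minimal sketch on Linux, using the current shell ($$) as a stand-in for the Prometheus PID:

```shell
# Count file descriptors currently held by a process; in practice,
# substitute the Prometheus PID (e.g. from `pgrep prometheus`) for $$.
open=$(ls /proc/$$/fd | wc -l)

# Soft limit on open files for this shell; the service itself may run
# with a different limit, visible in /proc/<pid>/limits.
limit=$(ulimit -Sn)

echo "open=$open limit=$limit"
```

If the steady-state count sits far below the limit, as in your case, the exhaustion may come from short-lived spikes (e.g. during compaction or scrape bursts) rather than a constant leak.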

Any help would be highly appreciated.

heap.gz

Thanks!

@andreasnuesslein

I haven't changed anything, but now it seems to be fine... I don't know what happened. 😕

Please ignore the above. If it reappears, I shall open a new issue.


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019