Too many open sockets and fds on reload #3873

Closed
andreasnuesslein opened this issue Feb 21, 2018 · 4 comments

andreasnuesslein commented Feb 21, 2018

What did you do?
Reloaded Prometheus instead of restarting it (systemctl reload prometheus).

What did you see instead? Under which circumstances?
For days I have been seeing these errors constantly:
err="write /var/app/prometheus/data/wal/006383: file already closed"
msg="Error sending alert" err="Post http://localhost:9093/api/v1/alerts: dial tcp 127.0.0.1:9093: socket: too many open files"
Get XX/metrics: dial tcp: lookup XX on 8.8.8.8:53: dial udp 8.8.8.8:53: socket: too many open files
Prometheus would run smoothly for a few hours and then, bam, a gazillion of those errors.
I've been looking through the issues here on GitHub and on Google Groups and tried a few of the fixes mentioned there.
I think I finally found the problem: for some reason, reloading Prometheus instead of restarting it seems to completely mess up the sockets and file descriptors.
Example:

Feb 21 09:44:11 pro systemd[1]: Reloading prometheus.service.
Feb 21 09:44:11 pro prometheus[15362]: level=info ts=2018-02-21T09:44:11.520027611Z caller=main.go:588 msg="Loading configuration file" filename=/var/app/prometheus/current/prometheus.yml
Feb 21 09:44:11 pro systemd[1]: Reloaded prometheus.service.
Feb 21 09:44:27 pro prometheus[15362]: level=error ts=2018-02-21T09:44:27.065893388Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="open /var/app/prometheus/current/alerts_node.yml: too many open files"
Feb 21 09:44:27 pro prometheus[15362]: level=error ts=2018-02-21T09:44:27.065949001Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
Feb 21 09:44:27 pro prometheus[15362]: level=error ts=2018-02-21T09:44:27.065961525Z caller=main.go:449 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/var/app/prometheus/current/prometheus.yml)"
Feb 21 09:44:32 pro prometheus[15362]: level=error ts=2018-02-21T09:44:32.066462836Z caller=notifier.go:454 component=notifier alertmanager=http://localhost:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://localhost:9093/api/v1/alerts: dial tcp 127.0.0.1:9093: socket: too many open files"
...
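
The "too many open files" text in these log lines is the message for the EMFILE error, which the kernel returns once a process reaches its soft limit on open file descriptors. The following is a minimal Python sketch of that mechanism, not anything taken from Prometheus itself. The function name `open_until_limit` and the limit of 64 are illustrative assumptions.

```python
import errno
import os
import resource


def open_until_limit(soft_limit=64):
    """Lower the soft RLIMIT_NOFILE, then open files until the kernel refuses.

    Returns (number_of_files_opened, errno_of_the_failure).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may never exceed the hard limit.
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    fds = []
    try:
        while True:
            fds.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return len(fds), exc.errno
    finally:
        # Clean up and restore the original limit for the rest of the process.
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    opened, err = open_until_limit(64)
    print(opened, errno.errorcode.get(err), os.strerror(err))
```

Descriptors that are already open (stdin, stdout, stderr and so on) count against the limit, so the number of files opened is normally a little below the limit itself.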

I noticed #3446, but I had already raised the ulimit to insanely high values, so that's not it. The count from ls /proc/<>/fd |wc was usually around 800 anyway.
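
A ulimit set in an interactive shell applies to that shell and its children, not to a service started by systemd, so it can help to compare the descriptor count with the limits the running process actually has. The sketch below reads both from /proc on Linux. The helper name `fd_usage` is an assumption for illustration, and the PID has to be supplied by the caller.

```python
import os


def fd_usage(pid):
    """Return (open_fds, soft_limit, hard_limit) for a process on Linux.

    The limits are returned as strings because /proc may report "unlimited".
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns after the three-word name: soft limit, hard limit, units.
                soft, hard = line.split()[3:5]
                return open_fds, soft, hard
    raise RuntimeError("no 'Max open files' row found in /proc/<pid>/limits")


if __name__ == "__main__":
    # Inspect this script's own process; pass the Prometheus PID instead
    # (reading another user's process may require root).
    print(fd_usage(os.getpid()))
```

A count of around 800 is fine against a very high limit, but leaves little headroom if the effective soft limit turns out to be 1024.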

Environment

  • System information:

    Linux 4.4.0-112-generic x86_64

  • Prometheus version:

    prometheus, version 2.2.0-rc.0 (branch: HEAD, revision: 1fe05d4)
    build user: root@f7abb25edc70
    build date: 20180213-11:40:47
    go version: go1.9.2

  • Alertmanager version:

    alertmanager, version 0.14.0 (branch: HEAD, revision: 30af4d051b37ce817ea7e35b56c57a0e2ec9dbb0)
    build user: root@37b6a49ebba9
    build date: 20180213-08:16:42
    go version: go1.9.2

Obviously I'm not reloading Prometheus anymore :)

Member

simonpasquier commented Feb 21, 2018

Can you share the complete Prometheus logs after restarting? There should be a line showing the actual fd limits.

Author

andreasnuesslein commented Feb 21, 2018

@simonpasquier damn, you're right, it only says 1024/4096. Why?! :D I tried ulimit as well as prlimit...

Feb 21 10:25:44 pro systemd[1]: Started prometheus.service.
Feb 21 10:25:44 pro prometheus[5728]: level=info ts=2018-02-21T10:25:44.773059174Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.2.0-rc.0, branch=HEAD, revision=1fe05d40e4b2f4f7479048b1cc3c42865eb73bab)"
Feb 21 10:25:44 pro prometheus[5728]: level=info ts=2018-02-21T10:25:44.773092754Z caller=main.go:226 build_context="(go=go1.9.2, user=root@f7abb25edc70, date=20180213-11:40:47)"
Feb 21 10:25:44 pro prometheus[5728]: level=info ts=2018-02-21T10:25:44.773104642Z caller=main.go:227 host_details="(Linux 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 pro remerge.io)"
Feb 21 10:25:44 pro prometheus[5728]: level=info ts=2018-02-21T10:25:44.773114687Z caller=main.go:228 fd_limits="(soft=1024, hard=4096)"
Feb 21 10:25:44 pro prometheus[5728]: level=info ts=2018-02-21T10:25:44.774940097Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
Feb 21 10:25:44 pro prometheus[5728]: level=info ts=2018-02-21T10:25:44.774935596Z caller=main.go:502 msg="Starting TSDB ..."

Author

andreasnuesslein commented Feb 21, 2018

Thanks @simonpasquier!
I finally seem to have solved this by updating the systemd unit file. I now have 10240/10240, and so far reload has worked.
I'm going to close the issue and will reopen it if the problem persists.
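
The thread does not show the actual unit file change. One common way to get soft and hard limits of 10240 for a systemd service is the `LimitNOFILE=` directive, for example in a drop-in file. The sketch below assumes that approach; the drop-in path is hypothetical.

```ini
# /etc/systemd/system/prometheus.service.d/limits.conf  (hypothetical drop-in path)
[Service]
# A single value sets both the soft and the hard limit.
LimitNOFILE=10240
```

After editing, run systemctl daemon-reload and then restart (not reload) the service, since resource limits are applied when the process is started. The fd_limits line in the startup log should then show the new values.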

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
