Prometheus 2.0.0 - Panic on reload due to fd limit. #3446

Closed
SuperQ opened this Issue Nov 9, 2017 · 13 comments


SuperQ commented Nov 9, 2017

What did you do?

SIGHUP Prometheus

What did you expect to see?

Safe reload

What did you see instead? Under which circumstances?

With the open-file ulimit set to a maximum of 4096, I got a SIGSEGV. See the trace below.

Environment

  • System information:

    Linux 4.4.0-66-generic x86_64

  • Prometheus version:

prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
  build user:       root@615b82cb36b6
  build date:       20171108-07:11:59
  go version:       go1.9.2
  • Logs:
2017-11-09_09:09:34.13670 level=info ts=2017-11-09T09:09:34.136652399Z caller=main.go:394 msg="Loading configuration file" filename=/opt/prometheus/prometheus/prometheus.yml
2017-11-09_09:10:48.47802 level=error ts=2017-11-09T09:10:48.477955224Z caller=file.go:121 component="target manager" discovery=file msg="Error adding file watcher" err="too many open files"
2017-11-09_09:10:48.48135 panic: runtime error: invalid memory address or nil pointer dereference
2017-11-09_09:10:48.48137 [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0xab0f82]
2017-11-09_09:10:48.48137
2017-11-09_09:10:48.48137 goroutine 5666790 [running]:
2017-11-09_09:10:48.48138 github.com/prometheus/prometheus/vendor/gopkg.in/fsnotify%2ev1.(*Watcher).isClosed(0x0, 0xc441b9ad78)
2017-11-09_09:10:48.48138       /go/src/github.com/prometheus/prometheus/vendor/gopkg.in/fsnotify.v1/inotify.go:67 +0x22
2017-11-09_09:10:48.48138 github.com/prometheus/prometheus/vendor/gopkg.in/fsnotify%2ev1.(*Watcher).Close(0x0, 0x1c34528, 0xc4e85b8280)
2017-11-09_09:10:48.48138       /go/src/github.com/prometheus/prometheus/vendor/gopkg.in/fsnotify.v1/inotify.go:76 +0x2f
2017-11-09_09:10:48.48139 github.com/prometheus/prometheus/discovery/file.(*Discovery).stop(0xc4e85b8280)
2017-11-09_09:10:48.48140       /go/src/github.com/prometheus/prometheus/discovery/file/file.go:184 +0x1d1
2017-11-09_09:10:48.48141 github.com/prometheus/prometheus/discovery/file.(*Discovery).Run(0xc4e85b8280, 0x289cb40, 0xc44fcd9c80, 0xc469513320)
2017-11-09_09:10:48.48141       /go/src/github.com/prometheus/prometheus/discovery/file/file.go:122 +0x6e8
2017-11-09_09:10:48.48141 created by github.com/prometheus/prometheus/discovery.(*TargetSet).updateProviders
2017-11-09_09:10:48.48141       /go/src/github.com/prometheus/prometheus/discovery/discovery.go:250 +0x26f

simonpasquier commented Nov 13, 2017

I think I understand the cause of the problem after looking at the code: defer d.stop() should be moved after d.watcher has been initialized to a non-nil value in the Run() method. Alternatively the stop() method could return early if d.watcher is nil.

However, I'm unable to trigger the bug to verify my hypothesis. Do you have any pointers on how to reproduce it?
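
For what it's worth, here is a minimal, self-contained sketch of that second option (returning early from stop() when the watcher is nil). The discovery type, field, and method names below are simplified stand-ins inferred from the stack trace above, not the actual discovery/file code:

package main

import (
    "fmt"

    "gopkg.in/fsnotify.v1"
)

// discovery loosely mirrors the shape of file.Discovery from the trace:
// it owns an fsnotify watcher that never gets assigned if NewWatcher()
// fails, e.g. with "too many open files".
type discovery struct {
    watcher *fsnotify.Watcher
}

func (d *discovery) run() {
    defer d.stop() // deferred before the watcher exists, as in the report

    w, err := fsnotify.NewWatcher()
    if err != nil {
        fmt.Println("error adding file watcher:", err)
        return // d.watcher is still nil on this path
    }
    d.watcher = w
    // ... watch files and send target group updates ...
}

func (d *discovery) stop() {
    // Without this guard, Close() is called on a nil *Watcher and the
    // program panics with a nil pointer dereference, matching the trace.
    if d.watcher == nil {
        return
    }
    d.watcher.Close()
}

func main() {
    (&discovery{}).run()
}

Moving defer d.stop() below the watcher initialization would work just as well, since the early-error path has nothing to clean up.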

grobie commented Nov 14, 2017

@simonpasquier maybe try using ulimit -n to reduce the file descriptor limit.

SuperQ commented Nov 14, 2017

You can adjust the ulimit of the running Prometheus with prlimit.

Try this:

# Count the fds Prometheus currently has open and set the limit slightly
# above that, so the next reload runs out of descriptors.
prom_pid="$(pidof prometheus)"
new_limit="$(($(ls "/proc/${prom_pid}/fd" | wc -l) + 50))"
prlimit --nofile="${new_limit}:${new_limit}" --pid "${prom_pid}"

brian-brazil commented Mar 8, 2018

Are you still seeing this?

theonlydoo commented May 29, 2018

I do, with this systemd service unit:

[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
MemoryAccounting=true
MemoryLimit=24G
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
User=prometheus
Restart=always
RestartSec=2
EnvironmentFile=/etc/default/prometheus
ExecStart=/usr/bin/prometheus  $ARGS
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill $MAINPID
TimeoutStopSec=300

[Install]
WantedBy=multi-user.target

It also triggers a crash sometimes, at reload.

I should also add that some targets are not scraped as a consequence.

$ prometheus --version
prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10

One diagnostic element: I usually delete every target file before regenerating a new set.

simonpasquier commented Aug 1, 2018

@theonlydoo can you share the logs when it crashes?

theonlydoo commented Aug 1, 2018

juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766264436Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766259599Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766266961Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766303485Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766319538Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766332945Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766355581Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
juil. 31 14:59:33 prometheus prometheus[29940]: level=error ts=2018-07-31T12:59:33.766350559Z caller=file.go:223 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"

The config pattern is:

  • trash every file_sd config file
  • recreate each one
  • reload

My guess is that the daemon doesn't release the file descriptors of the deleted files.
Also seen on:

prometheus, version 2.3.1 (branch: HEAD, revision: 188ca45bd85ce843071e768d855722a9d9dabe03)
  build user:       root@82ef94f1b8f7
  build date:       20180619-15:56:22
  go version:       go1.10.3

simonpasquier commented Aug 1, 2018

@theonlydoo to be clear, Prometheus is hitting the fd limit but it doesn't panic. Can you share the start of the logs too? There should be a line looking like fd_limits="(soft=xxx, hard=xxx)"

theonlydoo commented Aug 1, 2018

Yep, it doesn't crash, it just stops scraping my targets, which is arguably worse because it's hard to notice without a dashboard in front of me :-)

level=info ts=2018-07-31T13:07:43.559204052Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"

my systemd file sets those limits:

[Service]
MemoryAccounting=true
MemoryLimit=30G
MemoryHigh=24G
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
User=prometheus
Restart=always
RestartSec=2
EnvironmentFile=/etc/default/prometheus
ExecStart=/usr/bin/prometheus  $ARGS
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill $MAINPID
TimeoutStopSec=300

brian-brazil commented Aug 1, 2018

That's running into the inotify limit.
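
If it helps to confirm that, the same error can be reproduced outside Prometheus by creating fsnotify watchers in a loop until inotify_init fails. This is only a rough sketch, assuming the vendored gopkg.in/fsnotify.v1 package; with the common default of fs.inotify.max_user_instances=128 it fails long before a 1048576 fd limit is reached:

package main

import (
    "fmt"

    "gopkg.in/fsnotify.v1"
)

func main() {
    var watchers []*fsnotify.Watcher
    for {
        w, err := fsnotify.NewWatcher()
        if err != nil {
            // With default inotify limits this typically reports
            // "too many open files" after roughly 128 watchers, i.e.
            // well below the process fd limit shown in the logs above.
            fmt.Printf("NewWatcher failed after %d watchers: %v\n", len(watchers), err)
            break
        }
        watchers = append(watchers, w)
    }
    for _, w := range watchers {
        w.Close()
    }
}

That would explain seeing "too many open files" even though the fd_limits line in the logs is nowhere near exhausted.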

simonpasquier commented Sep 12, 2018

@theonlydoo specifically you need to increase max_user_instances and/or max_user_watches. See https://linux.die.net/man/7/inotify

Closing this issue since the panic doesn't seem to happen anymore.

theonlydoo commented Sep 12, 2018

Done, I'll keep you posted on this issue if it ever comes back (I'm testing on 2 different versions).

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
