alerts not being fired to any receiver #1681

Closed
Dnile opened this Issue May 27, 2016 · 12 comments

Dnile commented May 27, 2016

ERRO[2942] Error sending 7 alerts: context deadline exceeded source=notifier.go:188
appears in the logs whenever alerts should be fired.

prometheus version:
prometheus, version 0.19.1 (branch: master, revision: 500a494)
build user: root@dfc6307dc40d
build date: 20160526-01:42:25
go version: go1.6.2

alertmanager version:
alertmanager, version 0.1.1 (branch: release-0.1, revision: 0e541bf)
build user: root@8c44a0677215
build date: 20160323-10:10:18
go version: go1.5.3

logs:
https://gist.github.com/Dnile/e299f7ca20c8f77aa4d0c92a2158c8a2

Member

juliusv commented May 27, 2016

So the Alertmanager is out of file descriptors. It'd be interesting to find out what the FDs are used for:

  • actual files (unlikely)
  • incoming connections (from Prometheus or other AM clients)
  • outgoing connections (to notification mechanisms)

Could you run and gist (with sensitive info removed) the following commands on your AM host:

  • sudo ls -l /proc/$(pidof alertmanager)/fd
  • sudo netstat -tpen | grep alertmanager
  • lsof -c alertmanager
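
All three commands answer the same question: what kinds of descriptors is the process holding? A minimal Python sketch of that tally follows, assuming a Linux host with /proc mounted. The helper names `classify_fd_target` and `count_fd_kinds` are made up for illustration and are not part of any Prometheus tooling.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"  # e.g. epoll or eventfd handles
    if target.startswith("/"):
        return "file"
    return "other"


def count_fd_kinds(pid: int) -> Counter:
    """Tally the open descriptors of `pid` by category (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were iterating
        kinds[classify_fd_target(target)] += 1
    return kinds


# Usage (typically as root, with the PID taken from `pidof alertmanager`):
#   print(count_fd_kinds(12345))
```

A count dominated by "socket" would point at connections rather than actual files, which is the distinction the list above is after.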

Author

Dnile commented May 27, 2016

Thanks for the swift response!
The output of the three commands is attached here:
https://gist.github.com/Dnile/029c0b782067660cb8656a5e6c2ead6f

Also, AM and Prometheus run on the same box, if that makes any difference.

Member

juliusv commented May 27, 2016

Interesting, I'm wondering about all those open sockets that netstat doesn't show. Can you try sudo netstat -pen | grep alertmanager?

Author

Dnile commented May 27, 2016

Member

juliusv commented May 27, 2016

Those are only 13 connections though. The issue is with the many other sockets that lsof shows, but for which it says "can't identify protocol". sudo netstat -pen | grep alertmanager should also show Unix sockets, in case that's what they are, so now I'm not sure why netstat doesn't show these at all. Or did something change and they aren't in the lsof output anymore either?
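
One way to pin down which sockets are the unaccounted ones is to compare the socket inodes the process holds against the inodes listed in the kernel's /proc/net tables. The sketch below is only an illustration under Linux assumptions: the function names are hypothetical, and the column indexes reflect the usual layout of /proc/net/tcp, udp and unix after splitting each row on whitespace. It only isolates the inode numbers; it does not explain why a socket is missing, and sockets of families outside these tables would also show up as unlisted.

```python
import re


def socket_inodes_from_fd_targets(targets):
    """Extract inode numbers from /proc/<pid>/fd link targets like 'socket:[123]'."""
    inodes = set()
    for target in targets:
        match = re.fullmatch(r"socket:\[(\d+)\]", target)
        if match:
            inodes.add(int(match.group(1)))
    return inodes


def inodes_from_proc_net(text, inode_column):
    """Collect the inode column from a /proc/net/* table, skipping the header row."""
    inodes = set()
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) > inode_column and fields[inode_column].isdigit():
            inodes.add(int(fields[inode_column]))
    return inodes


# Inode column index per table, after splitting each row on whitespace.
PROC_NET_TABLES = {
    "/proc/net/tcp": 9,
    "/proc/net/tcp6": 9,
    "/proc/net/udp": 9,
    "/proc/net/udp6": 9,
    "/proc/net/unix": 6,
}


def unlisted_socket_inodes(fd_targets, tables):
    """Return socket inodes held by the process that none of the tables list.

    `tables` is an iterable of (table_text, inode_column) pairs.
    """
    known = set()
    for text, column in tables:
        known |= inodes_from_proc_net(text, column)
    return socket_inodes_from_fd_targets(fd_targets) - known


# Usage on a live host (some tables may be absent, e.g. with IPv6 disabled):
#   tables = [(open(path).read(), col) for path, col in PROC_NET_TABLES.items()]
#   targets = [os.readlink(f"/proc/{pid}/fd/{n}") for n in os.listdir(f"/proc/{pid}/fd")]
#   print(unlisted_socket_inodes(targets, tables))
```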

Author

Dnile commented May 27, 2016

No, they're still there, but netstat doesn't show them.

Member

juliusv commented May 27, 2016

That's strange. http://serverfault.com/questions/153983/sockets-found-by-lsof-but-not-by-netstat suggests it could be half-open connections, but I'm not sure where they could come from in Alertmanager.

Other things that could be interesting: a goroutine dump and strace output from Alertmanager.

Also, if you restart Alertmanager, how fast do those many sockets reappear? Gradually over time or all at once?
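
To tell "gradually" from "all at once", the descriptor count can be sampled at a fixed interval right after the restart. The following is a rough sketch, assuming Linux and /proc; `count_open_fds`, `sample_counts` and `largest_jump` are illustrative names, and the sampling interval is arbitrary.

```python
import os
import time


def count_open_fds(pid):
    """Number of descriptors currently open in `pid` (Linux, needs /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_counts(read_count, samples, interval_s, sleep=time.sleep):
    """Call `read_count` `samples` times, `interval_s` seconds apart.

    Returns a list of (elapsed_seconds, count) pairs.
    """
    start = time.monotonic()
    readings = []
    for i in range(samples):
        if i:
            sleep(interval_s)
        readings.append((time.monotonic() - start, read_count()))
    return readings


def largest_jump(readings):
    """Biggest increase between consecutive readings; 0 if the count never grows."""
    counts = [count for _, count in readings]
    return max([b - a for a, b in zip(counts, counts[1:])] + [0])


# Usage: one reading per second for a minute after restarting the process.
#   readings = sample_counts(lambda: count_open_fds(pid), samples=60, interval_s=1.0)
#   print(readings, largest_jump(readings))
```

A single large jump suggests the sockets appear in one burst; a steady climb suggests a slow leak.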

Author

Dnile commented May 29, 2016

goroutine dump:
https://gist.github.com/Dnile/3a40214d95bcab5774c9032707426adc

strace files:
https://github.com/Dnile/prom-logs

After restarting AM, all of those sockets return as soon as Prometheus tries to "fire" active alerts.

Author

Dnile commented May 30, 2016

Also, I just noticed some errors like this in the AM log: ERRO[102279] api error: {server_error {14 14 unable to open database file}} source=api.go:418
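
The text "unable to open database file" together with code 14 matches SQLite's SQLITE_CANTOPEN result. Assuming the AM store in this version is SQLite-backed (the thread does not say so), the sketch below reproduces the same message with Python's sqlite3 module by pointing it at a directory that does not exist. The path and the helper name `try_open` are hypothetical. A process that has run out of file descriptors would plausibly fail in the same way, but that case is not demonstrated here.

```python
import os
import sqlite3
import tempfile


def try_open(path):
    """Return None on success, or SQLite's error message if the open fails."""
    try:
        conn = sqlite3.connect(path)
    except sqlite3.OperationalError as exc:
        return str(exc)
    conn.close()
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical path: the parent directory is never created.
    missing = os.path.join(tmp, "no-such-dir", "am.db")
    error = try_open(missing)

print(error)
```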

Author

Dnile commented May 31, 2016

It seems the issue was a corrupt AM DB. I removed it and restarted the service, and everything is back to normal.

Member

fabxc commented Jun 1, 2016

Thanks for keeping us updated. Closing this as DB issues are covered elsewhere.

fabxc closed this Jun 1, 2016


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
