Impact: munin alerts were non-functional from May 17 until Aug 11
Detection: unexpected alert flood from ooni-munin to slack, noticed by @hellais
Timeline UTC:
17 May 03:30: notify_slack_munin deployed at munin.ooni.io
11 Aug 17:22: darkk deploys dom0-defaults to all:!no_passwd from #136 without filters
11 Aug 17:25: alert flood starts (there are a couple of boxes in warning state, and munin alerts on every issue at every tick)
11 Aug 18:22: hellais disabled an integration in #ooni-bots channel: ooni-munin
12 Aug 07:00: incident published
What went well:
it was quite easy to silence munin alerting :-)
What went wrong:
notify_slack_munin required curl, which was not installed, so it had been broken since initial setup
darkk did not notice the alert flood, as it is not relayed to IRC, and went AFK half an hour later
an innocent-looking apt-get install curl changed the behavior of a running system
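The missing-curl failure above stayed silent for months. A minimal sketch of a preflight check that would surface such a gap at deploy time follows. It is not taken from notify_slack_munin; the function names `missing_dependencies` and `preflight` are hypothetical, and it assumes the notifier's external commands are known in advance.

```python
import shutil
import sys


def missing_dependencies(commands):
    """Return the commands from `commands` that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def preflight(commands, report=sys.stderr.write):
    """Report each missing command and return True only if all are present."""
    missing = missing_dependencies(commands)
    for cmd in missing:
        report(f"notifier dependency missing: {cmd}\n")
    return not missing


if __name__ == "__main__":
    # Assumption: curl is the only external command the notifier needs.
    sys.exit(0 if preflight(["curl"]) else 1)
```

Run as a deploy step, a non-zero exit code would stop the rollout instead of leaving a notifier that fails quietly on every tick.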
What could be done to prevent relapse and decrease impact:
general rule: avoid making changes to live systems if you're going AFK soon :)
another one: test alerting (e.g. by lowering thresholds) while deploying it
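The threshold-lowering test can be illustrated with a small sketch. It only models the idea and is not munin's own logic or configuration syntax; `classify`, `smoke_test_alert`, the `send` callback and the numbers used are all hypothetical.

```python
def classify(value, warning, critical):
    """Return 'critical', 'warning' or 'ok' for a value against upper thresholds."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def smoke_test_alert(value, warning, critical, send):
    """Classify `value` and pass a message to `send` unless the state is 'ok'."""
    state = classify(value, warning, critical)
    if state != "ok":
        send(f"{state}: value {value} (warning={warning}, critical={critical})")
    return state


if __name__ == "__main__":
    # With production-like thresholds a healthy box stays silent ...
    smoke_test_alert(40, warning=90, critical=95, send=print)
    # ... so the warning threshold is lowered temporarily to force one alert.
    smoke_test_alert(40, warning=10, critical=95, send=print)
```

If the forced alert never reaches the channel, the notification path is broken and can be fixed before the deploy is considered done. The thresholds should be restored afterwards.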
None. This ticket was written down for historical and "knowledge share" purposes. It had no action points.
WRT munin's fate: it has been deprecated, but that node was re-used for the tor test-helper. It should be re-deployed with a clean and up-to-date Debian, but it's tricky to do that within current GH limitations.