Suppressing alerts programmatically #2673

candlerb · 2017-09-03T08:52:08Z

I got alerts overnight of high disk utilisation and backlog from about 1am to 4am last night.

It was more than 1 hour later that I looked at them, but I think I have found the culprit: it's the monthly mdadm RAID scrub, which starts at 00:57 on the first Sunday of the month.

root@wrn-mon2:~# cat /etc/cron.d/mdadm
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <madduck@madduck.net>
# distributed under the terms of the Artistic Licence 2.0
#

# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi

What I would like to do is have some way to automatically silence/suppress the alert here. That is, modify the cronjob to:

1. temporarily silence specific alerts
2. do the checkarray
3. unsilence those alerts
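
The three steps above could be wrapped in a small helper, sketched here in shell. Note that silence_alarms and unsilence_alarms are hypothetical placeholders (nothing in netdata is assumed to provide them) that only echo what they would do, and run_silenced is likewise just an illustrative name.

```shell
#!/bin/sh
# Sketch only. silence_alarms and unsilence_alarms are hypothetical
# placeholders: swap in whatever silencing mechanism is actually available.

silence_alarms() {
    echo "silence: $*"      # placeholder, does not talk to netdata
}

unsilence_alarms() {
    echo "unsilence: $*"    # placeholder, does not talk to netdata
}

# run_silenced ALARMS COMMAND [ARGS...]
# Silences the named alarms, runs COMMAND, then unsilences them again
# whether COMMAND succeeded or not, and returns COMMAND's exit status.
run_silenced() {
    alarms=$1
    shift
    silence_alarms "$alarms"
    status=0
    "$@" || status=$?
    unsilence_alarms "$alarms"
    return "$status"
}
```

In the cron job this might replace the direct checkarray call, for example: run_silenced "disk_util.sda.10min_disk_utilization disk_util.sdb.10min_disk_utilization" /usr/share/mdadm/checkarray --cron --all --idle --quiet. One caveat, as far as I understand checkarray: it only starts the scrub and then returns, so a real wrapper would probably also have to wait for the check to finish before unsilencing.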

So while disks.conf defines the template 10min_disk_utilization, I just want to disable the specific alarm instances disk_util.sda.10min_disk_utilization and disk_util.sdb.10min_disk_utilization.

I see that it should be possible:

netdata supports overriding templates with alarms. For example, when a template is defined for a set of charts, an alarm with exactly the same name attached to the same chart the template matches, will have higher precedence (i.e. netdata will use the alarm on this chart and prevent the template from being applied to it).

As far as I can see, the health API only allows querying alarms, not controlling them, so I think I have to drop a file under /opt/netdata/etc/netdata/health.d/ and then send a SIGUSR2. Is that correct?

What order are these files read in - e.g. alphabetically - and does it matter here? That is, if netdata reads an 'alarm' definition before the corresponding 'template', will it still do the right thing?

And what's the best way to make a "null alarm" which does nothing, to override an alarm from a template?

Thanks,

Brian.

ktsaou · 2017-09-03T12:13:11Z

This is a nice idea. I could probably provide a method for this to be done over a unix socket file...

What order are these files read in - e.g. alphabetically - and does it matter here? That is, if netdata reads an 'alarm' definition before the corresponding 'template', will it still do the right thing?

alarms are always applied before templates, no matter how they are read.

However, I have now checked the system calls, and it seems that if overlapping templates or alarms are given (i.e. two alarms with the same name, or two templates with the same name), the result is random.

I will try to fix this...

candlerb · 2017-09-03T12:39:56Z

Thanks.

Thinking about it a bit more, I probably want to suppress the entire template rather than the individual alarms - that saves me having to configure which disks are part of the RAID array, because probably they all are.

dev-zero · 2018-10-15T11:59:53Z

While I would really like this feature for things other than disk I/O as well, couldn't the solution here instead be that netdata explicitly reads /sys/block/md*/md/sync_action first and, if it says check, disables the warning?
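
A minimal sketch of that check in shell, to show what is meant. md_check_running is a hypothetical helper name, and the optional directory argument exists only so the function can be tried against a fake tree. This is an illustration of the idea, not how netdata itself works.

```shell
#!/bin/sh
# md_check_running [SYSFS_BLOCK_DIR]
# Returns 0 if any md array under the given directory (default /sys/block)
# currently reports "check" in md/sync_action, and 1 otherwise.
md_check_running() {
    base=${1:-/sys/block}
    for f in "$base"/md*/md/sync_action; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        action=
        # read returns non-zero without a trailing newline; the value is still set.
        read -r action < "$f" || true
        if [ "$action" = "check" ]; then
            return 0
        fi
    done
    return 1
}
```

A health script could then do something like: if md_check_running; then skip the disk utilisation warning; fi. Other sync_action values such as idle or resync are treated as "no scrub running" here, which may or may not be what one wants.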

cakrit · 2018-11-23T20:12:54Z

Related to #3187

cakrit · 2018-12-13T09:36:34Z

I have been working on this one instead of #3187, which adds more complexity.
netdata will be listening for health commands on a dedicated, configurable port. The feature will be disabled by default, so that people already running netdata aren't surprised with another port being opened after an update. I'm also adding a required "secret" string, the expected value of which will be configured in netdata.conf, to make it a bit more secure.

The API will support silencing/enabling all health checks, but also selective alarms and templates, for specific hosts, families, charts and contexts, with various combinations of the criteria that may make sense. I expect it will be finished next week.
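
For illustration, here is roughly how a cron job might call such an API from shell. The endpoint path /api/v1/manage/health, the cmd parameter, the alarm= selector and the X-Auth-Token header are my recollection of how the feature looks in released netdata; none of them are stated in this thread, so treat them as assumptions and check the documentation for your version. health_cmd_url and health_cmd are hypothetical helper names.

```shell
#!/bin/sh
# health_cmd_url BASE_URL CMD [SELECTORS]
# Prints the URL for a health command. The path and parameter names are
# assumptions, not taken from this thread.
health_cmd_url() {
    base=$1
    # Commands such as "SILENCE ALL" contain a space, so encode it.
    cmd=$(printf '%s' "$2" | sed 's/ /%20/g')
    url="$base/api/v1/manage/health?cmd=$cmd"
    if [ -n "${3:-}" ]; then
        url="$url&$3"
    fi
    printf '%s\n' "$url"
}

# health_cmd TOKEN BASE_URL CMD [SELECTORS]
# Sends the command with curl, passing the secret in a header
# (header name assumed).
health_cmd() {
    token=$1
    shift
    curl --silent --show-error \
         --header "X-Auth-Token: $token" \
         "$(health_cmd_url "$@")"
}
```

A scrub wrapper might then run health_cmd "$TOKEN" http://localhost:19999 SILENCE "alarm=10min_disk_utilization" before the check and health_cmd "$TOKEN" http://localhost:19999 RESET afterwards, assuming those command names match the installed version.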

cakrit · 2018-12-19T12:23:50Z

Moving to next sprint due to #5017

##### Summary
fixes #2673 fixes #2149 fixes #5017 fixes #3830 fixes #3187 fixes #5154

Implements a command API for health which will accept commands via a socket to selectively suppress health checks.
Allows different ports to accept different request types (streaming, dashboard, api, registry, netdata.conf, badges, management).
Removes support for multi-threaded and single-threaded web servers.

##### Component Name
health, daemon

ktsaou added the enhancement label Sep 16, 2017

paulfantom added area/health and removed enhancement labels Sep 22, 2018

cakrit self-assigned this Nov 23, 2018

cakrit added the feature request New features label Nov 23, 2018

cakrit added the priority/medium label Nov 23, 2018

paulfantom mentioned this issue Nov 24, 2018

Defining and displaying recurring tasks #3600

Closed

cakrit assigned gmosx and Ferroin Nov 25, 2018

cakrit added area/web area/daemon labels Nov 25, 2018

This was referenced Nov 29, 2018

UI for editing alarms #2296

Closed

dismiss alarms from the dashboard #1142

Closed

cakrit added this to the v1.12-rc1 milestone Dec 7, 2018

ktsaou mentioned this issue Dec 7, 2018

health monitoring enhancements #809

Closed

14 tasks

This was referenced Dec 11, 2018

Port ACLs, Management API and Health commands #4969

Merged

maintenance time and silence time #3187

Closed

cakrit unassigned gmosx and Ferroin Dec 17, 2018

cakrit modified the milestones: v1.12-rc1, v1.12-rc2 Dec 19, 2018

cakrit removed this from the v1.12-rc2 milestone Jan 3, 2019

cakrit added this to the v1.12 milestone Jan 3, 2019

cakrit closed this as completed in #4969 Jan 15, 2019
