Suppressing alerts programmatically #2673

Closed
candlerb opened this issue Sep 3, 2017 · 6 comments · Fixed by #4969

Comments

@candlerb
Contributor

candlerb commented Sep 3, 2017

I got alerts for high disk utilisation and backlog from about 1am to 4am last night.

It was more than 1 hour later that I looked at them, but I think I have found the culprit: it's the monthly mdadm RAID scrub, which starts at 00:57 on the first Sunday of the month.

root@wrn-mon2:~# cat /etc/cron.d/mdadm
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <madduck@madduck.net>
# distributed under the terms of the Artistic Licence 2.0
#

# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi

What I would like to do is have some way to automatically silence/suppress the alert here. That is, modify the cronjob to:

  • temporarily silence specific alerts
  • do the checkarray
  • unsilence those alerts
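
The three steps above can be sketched as a small wrapper. This is only an illustration: `silence_alerts`, `unsilence_alerts` and `run_silenced` are hypothetical names, and the two hooks are placeholders that just print a marker, since the mechanism for silencing is exactly what this issue asks about.

```shell
# Hypothetical hooks: replace the bodies with whatever mechanism actually
# silences and restores the alerts. Here they only record that they ran.
silence_alerts()   { echo "silence"; }
unsilence_alerts() { echo "unsilence"; }

# Run a command with alerts silenced, and restore them afterwards even if
# the command fails. Returns the exit status of the wrapped command.
run_silenced() {
    silence_alerts || return 1
    status=0
    "$@" || status=$?
    unsilence_alerts
    return "$status"
}
```

With something like this in place, the cron entry would call `run_silenced /usr/share/mdadm/checkarray --cron --all --idle --quiet` instead of calling `checkarray` directly. The sketch does not handle the wrapper itself being killed mid-run; a `trap` would be needed for that.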

So while disks.conf defines the template 10min_disk_utilization, I just want to disable the specific alarm instances disk_util.sda.10min_disk_utilization and disk_util.sdb.10min_disk_utilization.

I see that it should be possible:

netdata supports overriding templates with alarms. For example, when a template is defined for a set of charts, an alarm with exactly the same name attached to the same chart the template matches, will have higher precedence (i.e. netdata will use the alarm on this chart and prevent the template from being applied to it).

As far as I can see, the health API only allows querying alarms, not controlling them, so I think I have to drop a file under /opt/netdata/etc/netdata/health.d/ and then send a SIGUSR2. Is that correct?
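
A minimal sketch of the "drop a file, then send SIGUSR2" approach described above. The function names and the `RELOAD_CMD` variable are hypothetical, and the content of the override file is deliberately left to the caller, because the right "null alarm" definition is still an open question in this post.

```shell
# Command used to ask netdata to reload its health configuration.
# Overridable so the functions can be tried without a running netdata.
RELOAD_CMD=${RELOAD_CMD:-killall -USR2 netdata}

# Copy an override file into the given health.d directory, then reload.
install_override() {
    src=$1
    health_dir=$2
    cp "$src" "$health_dir/" || return 1
    $RELOAD_CMD
}

# Remove the override again (by file name) and reload, restoring the
# template-defined alarms.
remove_override() {
    name=$1
    health_dir=$2
    rm -f "$health_dir/$name" || return 1
    $RELOAD_CMD
}
```

For the installation in this post, `health_dir` would be /opt/netdata/etc/netdata/health.d.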

What order are these files read in - e.g. alphabetically - and does it matter here? That is, if netdata reads an 'alarm' definition before the corresponding 'template', will it still do the right thing?

And what's the best way to make a "null alarm" which does nothing, to override an alarm from a template?

Thanks,

Brian.

@ktsaou
Member

ktsaou commented Sep 3, 2017

This is a nice idea. I could probably provide a method for this to be done over a unix socket file...

What order are these files read in - e.g. alphabetically - and does it matter here? That is, if netdata reads an 'alarm' definition before the corresponding 'template', will it still do the right thing?

alarms are always applied before templates, no matter how they are read.

I just checked the system calls, though, and it seems that if overlapping templates or alarms are given (i.e. 2 alarms with the same name, or 2 templates with the same name), the result is random.

I will try to fix this...

@candlerb
Contributor Author

candlerb commented Sep 3, 2017

Thanks.

Thinking about it a bit more, I probably want to suppress the entire template rather than the individual alarms - that saves me having to configure which disks are part of the RAID array, because probably they all are.

@dev-zero

While I would really like this feature for things other than disk I/O as well, couldn't the solution here instead be for netdata to read /sys/block/md*/md/sync_action first and, if it says check, disable the warning?
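
The sysfs test suggested here can be sketched in shell as follows. `md_check_running` is a hypothetical helper, not something netdata provides; the optional root argument exists only so the function can be tried against a fake directory tree.

```shell
# Succeeds (exit 0) if any MD array under the given sysfs root reports a
# "check" pass in its sync_action file. The root defaults to /sys.
md_check_running() {
    sysroot=${1:-/sys}
    for f in "$sysroot"/block/md*/md/sync_action; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        action=$(cat "$f")
        if [ "$action" = "check" ]; then
            return 0
        fi
    done
    return 1
}
```

A notification script could call `md_check_running` and skip the disk utilisation warning while it succeeds. Other sync_action values, such as resync or recover, are not treated as a reason to suppress anything in this sketch.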

@cakrit cakrit self-assigned this Nov 23, 2018
@cakrit cakrit added the feature request New features label Nov 23, 2018
@cakrit
Contributor

cakrit commented Nov 23, 2018

Related to #3187

@cakrit
Contributor

cakrit commented Dec 13, 2018

I have been working on this one instead of #3187, which adds more complexity.
netdata will be listening for health commands on a dedicated, configurable port. The feature will be disabled by default, so that people already running netdata aren't surprised with another port being opened after an update. I'm also adding a required "secret" string, the expected value of which will be configured in netdata.conf, to make it a bit more secure.

The API will support silencing/enabling all health checks, but also selective alarms and templates, for specific hosts, families, charts and contexts, with various combinations of the criteria that may make sense. I expect it will be finished next week.
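
To make the selector idea concrete, here is a rough sketch of building such a request. Everything specific in it is an assumption drawn from later Netdata documentation rather than from this thread: the `/api/v1/manage/health` path, `cmd` values such as `SILENCE` and `RESET`, and the selector names `alarm` and `chart`. `health_cmd_url` is a hypothetical helper. Check all of these against the version you run.

```shell
# Build a health-management URL from a base URL, a command and any number
# of key=value selectors. ASSUMPTION: path, command names and selector
# names follow later Netdata documentation, not this thread.
health_cmd_url() {
    base=$1
    # Encode spaces so that commands like "SILENCE ALL" stay valid in a URL.
    cmd=$(printf '%s' "$2" | sed 's/ /%20/g')
    shift 2
    url="$base/api/v1/manage/health?cmd=$cmd"
    for selector in "$@"; do
        url="$url&$selector"
    done
    printf '%s\n' "$url"
}
```

For the alarms from the original report, `health_cmd_url http://localhost:19999 SILENCE alarm=10min_disk_utilization chart=disk_util.sda` would produce a URL to pass to an HTTP client before the scrub, with a `RESET` request afterwards. The secret mentioned above would also have to accompany the request; how it is passed is not stated in this thread.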

@cakrit cakrit unassigned gmosx and Ferroin Dec 17, 2018
@cakrit
Contributor

cakrit commented Dec 19, 2018

Moving to next sprint due to #5017

@cakrit cakrit modified the milestones: v1.12-rc1, v1.12-rc2 Dec 19, 2018
@cakrit cakrit removed this from the v1.12-rc2 milestone Jan 3, 2019
@cakrit cakrit added this to the v1.12 milestone Jan 3, 2019
cakrit added a commit that referenced this issue Jan 15, 2019
##### Summary
fixes #2673 
fixes #2149
fixes #5017 
fixes #3830 
fixes #3187 
fixes #5154

Implements a command API for health which will accept commands via a socket to selectively suppress health checks. 

Allows different ports to accept different request types (streaming, dashboard, api, registry, netdata.conf, badges, management)

Removes support for multi-threaded and single-threaded web servers.

##### Component Name
health, daemon
kiku-jw pushed a commit to kiku-jw/netdata that referenced this issue Mar 4, 2019