
[health] Dispatch some alarms into health.log instead of debug #7576

Closed
wants to merge 6 commits

Conversation

@Saruspete (Contributor) commented Dec 18, 2019

Summary

Fixes: #6822
Goal: provide more detail on "alarm script" execution (#6822)

This script can be used to send simple notifications (with the features provided
by default), but can also be used to notify more complex systems.
When netdata registers an alarm but the upstream alerting system doesn't
receive the alarm/clear event, it's hard to pinpoint where the alert
was dropped.

Issues where this would be useful:

New configuration options in the [health] section control the logging. By default, all are off, which matches the current behavior (messages go only to the debug log).

[health]
	# log alarm exec = no
	# log alarm exec errors = no
	# log alarm silenced = no
	# log alarm skipped = no
	# log alarm clear = no
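
For illustration, a minimal sketch of how such switches could be read at startup, assuming netdata's config_get_boolean() helper from libnetdata; the health_log_* variable names are illustrative, not necessarily the ones in the patch:

// Hedged sketch: read the new [health] logging switches once at startup.
// config_get_boolean() is netdata's existing config helper; everything
// else here (names, defaults) is illustrative.
static int health_log_exec        = 0;
static int health_log_exec_errors = 0;
static int health_log_silenced    = 0;
static int health_log_skipped     = 0;
static int health_log_clear       = 0;

static void health_read_log_config(void) {
    health_log_exec        = config_get_boolean("health", "log alarm exec", 0);
    health_log_exec_errors = config_get_boolean("health", "log alarm exec errors", 0);
    health_log_silenced    = config_get_boolean("health", "log alarm silenced", 0);
    health_log_skipped     = config_get_boolean("health", "log alarm skipped", 0);
    health_log_clear       = config_get_boolean("health", "log alarm clear", 0);
}

With all defaults at 0, the behavior stays exactly as it is today unless a user flips a switch in netdata.conf.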

More log lines should be added to cover the different paths alerts can take (repeating alerts, transient ones, etc.).

Component Name

health + libnetdata/logs

Additional Information

Sample output (with all options = yes):

[ Startup ]
2019-12-18 15:28:57: Health not sending notification for alarm 'net_packets.wlp4s0.10s_received_packets_storm' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'net_packets.wwp0s20f0u6.10s_received_packets_storm' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'net_packets.virbr0.10s_received_packets_storm' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'ipv4.tcphandshake.10s_ipv4_tcp_resets_received' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'ipv4.tcphandshake.10s_ipv4_tcp_resets_sent' status CLEAR (it has no-clear-notification enabled)
[ Filling /tmp with dd if=/dev/urandom of=/tmp/coin bs=1M count=15k ]
2019-12-18 15:30:57: executing command 'exec /usr/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'munin' '1572513246' '1572509068' '43' '1576679457' 'out_of_disk_space_time' 'disk_space._tmp' '/tmp' 'CRITICAL' 'CLEAR' '0' 'nan' '68@/usr/lib/netdata/conf.d/health.d/disks.conf' '120' '0' 'hours' 'estimated time the disk will run out of space, if the system continues to add data with the rate of the last hour' '0h' 'undefined' '$this > 0 and $this < (($status == $CRITICAL) ? (24) : (2))' '[ $this = 0.0370937 ] [ $this = 0.0370937 ] [ $status = 1 ] [ $CRITICAL = 4 ] ' '1' '1''
2019-12-18 15:30:57: done executing command in 0.437300 s - returned with code 1
2019-12-18 15:31:57: executing command 'exec /usr/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'munin' '1572513245' '1572509066' '38' '1576679457' 'disk_space_usage' 'disk_space._tmp' '/tmp' 'WARNING' 'CLEAR' '96' '45' '12@/usr/lib/netdata/conf.d/health.d/disks.conf' '120' '0' '%' 'current disk space usage' '96.4%' '44.9%' '$this > (($status >= $WARNING ) ? (80) : (90))' '[ $this = 96.4439625 ] [ $status = 3 ] [ $WARNING = 3 ] ' '1' '1''
2019-12-18 15:31:57: done executing command in 0.625700 s - returned with code 1
[ Emptying /tmp/coin ]
[ Filling again with dd if=/dev/urandom of=/tmp/coin bs=1M count=15k ]
2019-12-18 15:34:57: Health not sending again notification for alarm 'disk_space._tmp.out_of_disk_space_time' status CRITICAL

@squash-labs (bot) commented Dec 18, 2019

Manage this branch in Squash

Test this branch here: https://saruspetehealth-logs-s3w9a.squash.io

[review thread on health/health.c (outdated, resolved)]
@Saruspete (Contributor, Author) commented:

I've reworked the script and am now testing it in production.
One request for the (much needed) spawn system: is there a way to get the execution time of the commands, instead of just the timestamp? That would make it easier to analyze the submit/start/stop times.
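
For what it's worth, a minimal sketch of one way to get a duration rather than just a timestamp, assuming a submit-time field can be stashed alongside the queued command; the cmd struct and field names are hypothetical:

// Hedged sketch: record a monotonic timestamp when the command is
// queued, and subtract it in the completion callback. netdata has a
// similar helper in libnetdata/clocks; plain clock_gettime() works too.
#include <time.h>

static inline unsigned long long monotonic_usec_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000ULL
         + (unsigned long long)(ts.tv_nsec / 1000);
}

// On submit (hypothetical cmd struct):
//     cmd->submit_usec = monotonic_usec_now();
// In the completion callback:
//     unsigned long long took = monotonic_usec_now() - cmd->submit_usec;
//     // log: "done executing command in %0.6f s", (double)took / 1000000.0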

@Saruspete (Contributor, Author) commented Dec 8, 2020

When all logs are enabled, I got:

2020-12-07 23:54:04: Health not sending notification for first initialization of alarm 'net_packets.eno1.outbound_packets_dropped_ratio' status CLEAR
2020-12-07 23:54:04: Queued command 'exec /opt/netdata/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'bur' '1607381691' '1607379112' '34' '1607381644' 'inbound_packets_dropped_ratio' 'net_packets.eno1' 'eno1' 'CLEAR' 'UNINITIALIZED' '0' 'nan' '83@/opt/netdata/lib/netdata/conf.d/health.d/net.conf' '9' '0' '%' 'the ratio of inbound dropped packets vs the total number of received packets of the network interface, during the last 10 minutes' '0%' '-' '$this >= 0.1' '[ $this = 0 ] ' '2' '0'' as id '1'
2020-12-07 23:54:04: Health not sending notification for first initialization of alarm 'net.eno1.1m_sent_traffic_overflow' status CLEAR
2020-12-07 23:54:04: Health not sending notification for first initialization of alarm 'net.eno1.1m_received_traffic_overflow' status CLEAR
[ ... ]
2020-12-07 23:54:04: done executing command '1' - returned with code 1 at 1607381644
2020-12-07 23:55:04: Health alarm 'net_packets.eno1.inbound_packets_dropped_ratio' = 13.420868 - changed status from CLEAR to CRITICAL
2020-12-07 23:55:04: Queued command 'exec /opt/netdata/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'bur' '1607381709' '1607379112' '35' '1607381704' 'inbound_packets_dropped_ratio' 'net_packets.eno1' 'eno1' 'CRITICAL' 'CLEAR' '13' '0' '83@/opt/netdata/lib/netdata/conf.d/health.d/net.conf' '60' '0' '%' 'the ratio of inbound dropped packets vs the total number of received packets of the network interface, during the last 10 minutes' '13.4%' '0%' '$this >= 2' '[ $this = 13.4208677 ] ' '2' '1'' as id '2'
[ ... ]
2020-12-07 23:55:04: Health not sending notification for first initialization of alarm 'disk_inodes._.disk_inode_usage' status CLEAR
2020-12-07 23:55:04: Health not sending again notification for alarm 'disk_space._.disk_space_usage' status WARNING
2020-12-07 23:55:05: done executing command '2' - returned with code 0 at 1607381704
2020-12-08 00:46:04: Health alarm 'disk_space._.disk_space_usage' = 78.718152 - changed status from WARNING to CLEAR
2020-12-08 00:46:04: Queued command 'exec /opt/netdata/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'bur' '1607381728' '1607379073' '31' '1607383864' 'disk_space_usage' 'disk_space._' '/' 'CLEAR' 'WARNING' '79' '80' '12@/opt/netdata/lib/netdata/conf.d/health.d/disks.conf' '2220' '2220' '%' 'current disk space usage' '78.7%' '80.4%' '$this > (($status >= $WARNING ) ? (80) : (90))' '[ $this = 77.7332227 ] [ $status = 1 ] [ $WARNING = 3 ] ' '1' '1'' as id '3'
2020-12-08 00:46:04: done executing command '3' - returned with code 0 at 1607384764
2020-12-08 01:16:04: Health alarm 'net_packets.eno1.inbound_packets_dropped_ratio' = 5.265169 - changed status from WARNING to CRITICAL
2020-12-08 01:16:04: Health not sending again notification for alarm 'net_packets.eno1.inbound_packets_dropped_ratio' status CRITICAL

There's a bit of a storm of "first initialization" messages at startup, but after that only the useful messages from state changes and alarm execution remain.

@ilyam8 (Member) commented Jan 28, 2021

@Saruspete hi. What are your plans for this PR? Are you still interested in finishing it?

@Saruspete (Contributor, Author) commented Jan 28, 2021

Hello,

it has been running in production under 1.26 and 1.28 for about 2 months; I was waiting to check that it didn't generate any issues, and it looks fine so far.
I would have liked a way to get the precise time spent running the alarm script, but that would have required changing the alarm_entry struct, which I wasn't fond of.

So far, looks good to me. I'll remove the WIP tag.

@Saruspete changed the title from "[WIP][health] Dispatch some alarms into health.log instead of debug" to "[health] Dispatch some alarms into health.log instead of debug" on Jan 28, 2021
@Saruspete (Contributor, Author) commented:

A thing just hit me: I removed the mutex that was copy/pasted from access.log, since the health logs are not in the webserver path. However, alerts are handled by libuv. Are these subject to multithreading, so that writes to health.log need a mutex too?
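
If a lock turns out to be needed, a minimal sketch of serializing the writes with a pthread mutex; health_log_fp and health_log_write() are illustrative names, not the ones from the patch:

// Hedged sketch: one mutex around every write to health.log, in case
// the libuv callbacks can run on more than one thread.
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static FILE *health_log_fp = NULL;  // opened elsewhere at startup
static pthread_mutex_t health_log_lock = PTHREAD_MUTEX_INITIALIZER;

static void health_log_write(const char *fmt, ...) {
    if(!health_log_fp) return;
    va_list args;
    va_start(args, fmt);
    pthread_mutex_lock(&health_log_lock);   // one writer at a time
    vfprintf(health_log_fp, fmt, args);
    fputc('\n', health_log_fp);
    fflush(health_log_fp);                  // keep lines intact across threads
    pthread_mutex_unlock(&health_log_lock);
    va_end(args);
}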

@thiagoftsm (Contributor) commented:

> Hello @thiagoftsm
> Sorry for the late reply too ^^
> I always forget to check with these strict flags. Is there any standard "make check" or "make build-strict"-like target to automate this?

I do not remember; we have a make check, but I am not sure it would do exactly what you want. @Ferroin can you help us here?

> fwiw, I tried a simple make with
>
> CFLAGS="-O1 -ggdb -Wall -Wextra -fsanitize=address -static-libasan -fno-omit-frame-pointer -Wformat-signedness -fstack-protector-all -DNETDATA_INTERNAL_CHECKS=1 -D_FORTIFY_SOURCE=2 -DNETDATA_VERIFY_LOCKS=1" ./configure --prefix=/home/netdata --exec-prefix=/home/netdata/usr --with-user=adrien
>
> and I'm getting errors about README.md:
>
> make[4]: Entering directory '/home/adrien/devel/sys-monitoring/netdata.saruspete/database/engine/global_uuid_map'
> make[4]: *** No rule to make target 'README.md', needed by 'all-am'.  Stop.
>
> and while doing only "make netdata" I got:
>
> make: *** No rule to make target 'database/engine/metadata_log/metadatalog.c', needed by 'database/engine/metadata_log/metadatalog.o'.  Stop.

About these last errors: I compiled the latest master and I am not getting them. Please, can you try to rebase your PR?

@Saruspete (Contributor, Author) commented:

It should already be rebased (done during my cleanup):

$ git describe
v1.29.3-161-ge8cba5c1a

@thiagoftsm (Contributor) commented:

> A thing just hit me: I removed the mutex that was copy/pasted from access.log, since the health logs are not in the webserver path. However, alerts are handled by libuv. Are these subject to multithreading, so that writes to health.log need a mutex too?

Firstly, sorry for the delay.

I think you are doing an awesome job here, and this log could probably one day be exported to our dashboard or to a file. My suggestion is to create a mutex for it, because it will be necessary in the future.

@thiagoftsm (Contributor) commented:

> Once again, I f***ed up my git and also managed to lose the changes. Once again, sorry for the noise. (And if you have hints on how to avoid that in the future, I'll take them!)
>
> I rewrote the files from the version that has been in my production for a few months.

I am not a specialist in git. @ilyam8 do you have a tip for us?

@thiagoftsm (Contributor) commented:

Hello @Saruspete ,

I tested your PR just now and saw that the compilation error is fixed, thank you!
Now I want to give a suggestion to improve your PR. I used the following alarm:

 alarm: dev_dim_template
    on: system.cpu
    os: linux
lookup: sum -3s at 0 every 3 percentage foreach *
 units: %
 every: 1s
  warn: $this > 1
  crit: $this > 4

As you can see, the alarm is useless in the real world, but it was good enough for me to examine the details of your PR. Firstly, take a look at my error.log:

bash-5.1# grep telegram /var/log/netdata/error.log 
2021-03-23 22:12:38: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_idle is CRITICAL to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_user is CRITICAL to '-3'
2021-03-23 22:12:41: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is CRITICAL to '-3'
2021-03-23 22:12:41: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_iowait is WARNING to '-3'
2021-03-23 22:12:44: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_iowait is CLEAR to '-3'
2021-03-23 22:12:50: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
2021-03-23 22:12:53: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is CRITICAL to '-3'
2021-03-23 22:13:04: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'

and now the health.log:

bash-5.1# cat /var/log/netdata/health.log 
2021-03-23 22:12:39: done executing command '1' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '2' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '3' - returned with code 0 at 1616537557
2021-03-23 22:12:41: done executing command '4' - returned with code 0 at 1616537560
2021-03-23 22:12:41: done executing command '5' - returned with code 0 at 1616537560
2021-03-23 22:12:44: done executing command '6' - returned with code 0 at 1616537563
2021-03-23 22:12:50: done executing command '7' - returned with code 0 at 1616537568
2021-03-23 22:12:53: done executing command '8' - returned with code 0 at 1616537572
2021-03-23 22:13:04: done executing command '9' - returned with code 0 at 1616537581

I am missing a unique identifier shared between the two files that would allow us to correlate the data; as you can see, the timestamps do not match. Please, can you add the alarm name to health.log?

In my view, we could also move the information that currently goes to error.log into health.log, because it is also related to notifications. @ilyam8 and @vlvkobal do you have objections to this idea? In this case, I also suggest adding the alarm name to the current message.

@Saruspete (Contributor, Author) commented Mar 23, 2021

You should have a "Queued command %s as id X" line before it, due to the async way libuv works (https://github.com/netdata/netdata/pull/7576/files#diff-69a1925e841976fa93ef3331a8c955d5500c1a8225ebcbeb56fa2dd8397e75fbR379).

The log you're seeing is enabled by HEALTH_ENTRY_FLAG_EXEC_RUN or HEALTH_ENTRY_FLAG_EXEC_FAILED (by default, these logs are not enabled; all settings are off). However, it shouldn't show "returned with code 0": either it worked fine and you should also have the "Queued" log entry (if log alarm exec is on), or it failed and you shouldn't see code 0. I will look into that.
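
For context, a self-contained sketch of the pairing described above; the two flag names are the ones mentioned in this thread, but their values, the struct layout, and the logging are illustrative stand-ins for netdata's ALARM_ENTRY:

// Hedged sketch: a 'done executing' line is gated by the same switch
// that produced the 'Queued command' line, so successes always pair up;
// failures go through the separate errors switch.
#include <stdint.h>
#include <stdio.h>

#define HEALTH_ENTRY_FLAG_EXEC_RUN    0x01u  // illustrative values
#define HEALTH_ENTRY_FLAG_EXEC_FAILED 0x02u

struct alarm_entry_view { uint32_t unique_id; uint32_t flags; int exec_code; };

static void log_exec_result(const struct alarm_entry_view *ae,
                            int log_exec, int log_exec_errors) {
    if(ae->flags & HEALTH_ENTRY_FLAG_EXEC_FAILED) {
        if(log_exec_errors)
            printf("failed executing command '%u' - returned with code %d\n",
                   (unsigned)ae->unique_id, ae->exec_code);
    }
    else if(ae->flags & HEALTH_ENTRY_FLAG_EXEC_RUN) {
        if(log_exec)  // the same switch that logged the 'Queued command' line
            printf("done executing command '%u' - returned with code %d\n",
                   (unsigned)ae->unique_id, ae->exec_code);
    }
}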

Also, I couldn't find a way to get the executed command from the ae pointer (it doesn't seem to be saved in it). Do you have any idea here? Should we copy the command_to_run string into the alarm? That way we could skip the "Queued" message entirely and have the command, time consumed, return code, etc. in one line.

@thiagoftsm (Contributor) commented:

> Also, I couldn't find a way to get the executed command from the ae pointer (it doesn't seem to be saved in it). Do you have any idea here? Should we copy the command_to_run string into the alarm? That way we could skip the "Queued" message entirely and have the command, time consumed, return code, etc. in one line.

You are right, we do not store the command_to_run inside the alarm entry; this variable is given to our spawn server, which stores the command inside its queue. For this case you will need to copy it.

@Saruspete (Contributor, Author) commented:

OK, there's no point storing the command long-term, so I don't want to change the alarm_entry struct for that.
I would rather use a fixed-size array or similar to keep the "serial" => "command" mapping for a fixed amount of time (e.g. 30 seconds), so that logs and transfers can gather the information if needed.
Do you have any ideas or insights on how to implement that efficiently within netdata?
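
One cheap way to do it, sketched under the assumption that serials increase monotonically: a fixed-size table indexed by serial modulo its size, so old entries are overwritten as new commands arrive and nothing needs a timer. All names here are illustrative:

// Hedged sketch: fixed-size "serial" => "command" map with natural aging.
// A real version would share the health-log mutex (or its own), since the
// queueing path and the completion callback may run on different threads.
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CMD_MAP_SIZE 256   // must comfortably exceed commands in flight

struct cmd_map_entry {
    uint64_t serial;       // 0 means the slot is empty
    char *command;
};

static struct cmd_map_entry cmd_map[CMD_MAP_SIZE];

static void cmd_map_put(uint64_t serial, const char *command) {
    struct cmd_map_entry *e = &cmd_map[serial % CMD_MAP_SIZE];
    free(e->command);              // drop whatever aged out of this slot
    e->serial  = serial;
    e->command = strdup(command);
}

static const char *cmd_map_get(uint64_t serial) {
    struct cmd_map_entry *e = &cmd_map[serial % CMD_MAP_SIZE];
    return (e->serial == serial) ? e->command : NULL;  // NULL if overwritten
}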

@thiagoftsm (Contributor) commented:

> OK, there's no point storing the command long-term, so I don't want to change the alarm_entry struct for that.
> I would rather use a fixed-size array or similar to keep the "serial" => "command" mapping for a fixed amount of time (e.g. 30 seconds), so that logs and transfers can gather the information if needed.
> Do you have any ideas or insights on how to implement that efficiently within netdata?

I do not think a hash will help us, because we dispatch the command with different arguments, and among the arguments is a timestamp. I think we could use the following fields to help us with the identification:

  • exec: the command that Netdata will run
  • unique_id: the unique id of the alarm entry on the host
  • chart: the chart name
  • family: the chart family
  • registry_hostname: the computer's hostname

Together, these fields will let us create a unique identification.
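
For illustration, a sketch of composing such a correlation key from those fields, so an error.log line and a health.log line can be matched; the function and format are hypothetical:

// Hedged sketch: build a stable key from the fields listed above.
#include <stdint.h>
#include <stdio.h>

static void alarm_event_key(char *buf, size_t len,
                            const char *registry_hostname, const char *chart,
                            const char *family, const char *exec,
                            uint32_t unique_id) {
    snprintf(buf, len, "%s/%s/%s/%s/%u",
             registry_hostname, chart, family, exec, (unsigned)unique_id);
}

// e.g. "hades/system.cpu/cpu/alarm-notify.sh/35", printed in both logs.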

@ktsaou (Member) commented Jun 29, 2022

@thiagoftsm do we need this work to be merged? If yes, please update the PR and merge it. If not, let's close it.

@thiagoftsm (Contributor) commented:

> @thiagoftsm do we need this work to be merged? If yes, please update the PR and merge it. If not, let's close it.

After a quick talk with our team, I rebased the PR and we are now proceeding with the next steps to merge it.

@Dim-P (Contributor) left a comment:

Also update [logs] section options to reflect the new health.log configuration option?
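
For illustration, a hypothetical netdata.conf snippet; the exact option name, and whether it lives in the [logs] section, depends on what the PR actually registers:

[logs]
	# debug  = /var/log/netdata/debug.log
	# error  = /var/log/netdata/error.log
	# access = /var/log/netdata/access.log
	# health = /var/log/netdata/health.log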

[review thread on daemon/main.c (outdated, resolved)]
@thiagoftsm requested a review from Dim-P on July 1, 2022 09:26
@vlvkobal previously approved these changes on July 1, 2022
@thiagoftsm previously approved these changes on July 1, 2022
@vlvkobal dismissed stale reviews from thiagoftsm and themself via 52178e2 on July 1, 2022 12:25
@vlvkobal (Contributor) commented Jul 1, 2022

Closing this because we are planning to redesign log messages in general and we'll take care of the health logs as well.

@vlvkobal closed this on Jul 1, 2022
Successfully merging this pull request may close these issues:

[health] Log standard message when executing health_script