
[health] Dispatch some alarms into health.log instead of debug #7576

Closed
wants to merge 6 commits

Conversation

@Saruspete (Contributor) commented Dec 18, 2019

Summary

Fixes: #6822
Goal: provide more detail on "alarm script" execution (#6822)

This script can be used to send simple notifications (with the features provided
by default), but can also be used to notify more complex systems.
When netdata registers an alarm but the upstream alerting system doesn't
receive the alarm/clear event, it's hard to pinpoint where the alert
was dropped.

Issues where this would be useful:

New configuration options in the [health] section control the logging. By default, all are off, which matches the current behavior (messages go only to the debug log).

[health]
	# log alarm exec = no
	# log alarm exec errors = no
	# log alarm silenced = no
	# log alarm skipped = no
	# log alarm clear = no
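
For illustration, a minimal sketch of how such switches could be read at startup, assuming netdata's config_get_boolean() helper from libnetdata; the health_log_* variable names are illustrative, not necessarily the ones in the patch:

// Hedged sketch: read the new [health] logging switches once at startup.
// config_get_boolean() is netdata's existing config helper; everything
// else here (names, defaults) is illustrative.
static int health_log_exec        = 0;
static int health_log_exec_errors = 0;
static int health_log_silenced    = 0;
static int health_log_skipped     = 0;
static int health_log_clear       = 0;

static void health_read_log_config(void) {
    health_log_exec        = config_get_boolean("health", "log alarm exec", 0);
    health_log_exec_errors = config_get_boolean("health", "log alarm exec errors", 0);
    health_log_silenced    = config_get_boolean("health", "log alarm silenced", 0);
    health_log_skipped     = config_get_boolean("health", "log alarm skipped", 0);
    health_log_clear       = config_get_boolean("health", "log alarm clear", 0);
}

With all defaults at 0, the behavior stays exactly as it is today unless a user flips a switch in netdata.conf.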

More log lines should be added to cover the different paths alerts can take (repeating alerts, transient ones, etc.).

Component Name

health + libnetdata/logs

Additional Information

Sample output (with all options = yes):

[ Startup ]
2019-12-18 15:28:57: Health not sending notification for alarm 'net_packets.wlp4s0.10s_received_packets_storm' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'net_packets.wwp0s20f0u6.10s_received_packets_storm' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'net_packets.virbr0.10s_received_packets_storm' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'ipv4.tcphandshake.10s_ipv4_tcp_resets_received' status CLEAR (it has no-clear-notification enabled)
2019-12-18 15:28:57: Health not sending notification for alarm 'ipv4.tcphandshake.10s_ipv4_tcp_resets_sent' status CLEAR (it has no-clear-notification enabled)
[ Filling /tmp with dd if=/dev/urandom of=/tmp/coin bs=1M count=15k ]
2019-12-18 15:30:57: executing command 'exec /usr/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'munin' '1572513246' '1572509068' '43' '1576679457' 'out_of_disk_space_time' 'disk_space._tmp' '/tmp' 'CRITICAL' 'CLEAR' '0' 'nan' '68@/usr/lib/netdata/conf.d/health.d/disks.conf' '120' '0' 'hours' 'estimated time the disk will run out of space, if the system continues to add data with the rate of the last hour' '0h' 'undefined' '$this > 0 and $this < (($status == $CRITICAL) ? (24) : (2))' '[ $this = 0.0370937 ] [ $this = 0.0370937 ] [ $status = 1 ] [ $CRITICAL = 4 ] ' '1' '1''
2019-12-18 15:30:57: done executing command in 0.437300 s - returned with code 1
2019-12-18 15:31:57: executing command 'exec /usr/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'munin' '1572513245' '1572509066' '38' '1576679457' 'disk_space_usage' 'disk_space._tmp' '/tmp' 'WARNING' 'CLEAR' '96' '45' '12@/usr/lib/netdata/conf.d/health.d/disks.conf' '120' '0' '%' 'current disk space usage' '96.4%' '44.9%' '$this > (($status >= $WARNING ) ? (80) : (90))' '[ $this = 96.4439625 ] [ $status = 3 ] [ $WARNING = 3 ] ' '1' '1''
2019-12-18 15:31:57: done executing command in 0.625700 s - returned with code 1
[ Emptying /tmp/coin ]
[ Filling again with dd if=/dev/urandom of=/tmp/coin bs=1M count=15k ]
2019-12-18 15:34:57: Health not sending again notification for alarm 'disk_space._tmp.out_of_disk_space_time' status CRITICAL

@squash-labs (bot) commented Dec 18, 2019

Manage this branch in Squash

Test this branch here: https://saruspetehealth-logs-s3w9a.squash.io

[review thread on health/health.c (outdated, resolved)]
@Saruspete (Contributor, Author) commented:

I've reworked the script and am now testing it in production.
One request for the (much needed) spawn system: is there a way to get the execution time of the commands, instead of just the timestamp? That would make it easier to analyze the submit/start/stop times.
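
For what it's worth, a minimal sketch of one way to get a duration rather than just a timestamp, assuming a submit-time field can be stashed alongside the queued command; the cmd struct and field names are hypothetical:

// Hedged sketch: record a monotonic timestamp when the command is
// queued, and subtract it in the completion callback. netdata has a
// similar helper in libnetdata/clocks; plain clock_gettime() works too.
#include <time.h>

static inline unsigned long long monotonic_usec_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000ULL
         + (unsigned long long)(ts.tv_nsec / 1000);
}

// On submit (hypothetical cmd struct):
//     cmd->submit_usec = monotonic_usec_now();
// In the completion callback:
//     unsigned long long took = monotonic_usec_now() - cmd->submit_usec;
//     // log: "done executing command in %0.6f s", (double)took / 1000000.0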

@Saruspete (Contributor, Author) commented Dec 8, 2020

When all logs are enabled, I got:

2020-12-07 23:54:04: Health not sending notification for first initialization of alarm 'net_packets.eno1.outbound_packets_dropped_ratio' status CLEAR
2020-12-07 23:54:04: Queued command 'exec /opt/netdata/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'bur' '1607381691' '1607379112' '34' '1607381644' 'inbound_packets_dropped_ratio' 'net_packets.eno1' 'eno1' 'CLEAR' 'UNINITIALIZED' '0' 'nan' '83@/opt/netdata/lib/netdata/conf.d/health.d/net.conf' '9' '0' '%' 'the ratio of inbound dropped packets vs the total number of received packets of the network interface, during the last 10 minutes' '0%' '-' '$this >= 0.1' '[ $this = 0 ] ' '2' '0'' as id '1'
2020-12-07 23:54:04: Health not sending notification for first initialization of alarm 'net.eno1.1m_sent_traffic_overflow' status CLEAR
2020-12-07 23:54:04: Health not sending notification for first initialization of alarm 'net.eno1.1m_received_traffic_overflow' status CLEAR
[ ... ]
2020-12-07 23:54:04: done executing command '1' - returned with code 1 at 1607381644
2020-12-07 23:55:04: Health alarm 'net_packets.eno1.inbound_packets_dropped_ratio' = 13.420868 - changed status from CLEAR to CRITICAL
2020-12-07 23:55:04: Queued command 'exec /opt/netdata/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'bur' '1607381709' '1607379112' '35' '1607381704' 'inbound_packets_dropped_ratio' 'net_packets.eno1' 'eno1' 'CRITICAL' 'CLEAR' '13' '0' '83@/opt/netdata/lib/netdata/conf.d/health.d/net.conf' '60' '0' '%' 'the ratio of inbound dropped packets vs the total number of received packets of the network interface, during the last 10 minutes' '13.4%' '0%' '$this >= 2' '[ $this = 13.4208677 ] ' '2' '1'' as id '2'
[ ... ]
2020-12-07 23:55:04: Health not sending notification for first initialization of alarm 'disk_inodes._.disk_inode_usage' status CLEAR
2020-12-07 23:55:04: Health not sending again notification for alarm 'disk_space._.disk_space_usage' status WARNING
2020-12-07 23:55:05: done executing command '2' - returned with code 0 at 1607381704
2020-12-08 00:46:04: Health alarm 'disk_space._.disk_space_usage' = 78.718152 - changed status from WARNING to CLEAR
2020-12-08 00:46:04: Queued command 'exec /opt/netdata/libexec/netdata/plugins.d/alarm-notify.sh 'sysadmin' 'bur' '1607381728' '1607379073' '31' '1607383864' 'disk_space_usage' 'disk_space._' '/' 'CLEAR' 'WARNING' '79' '80' '12@/opt/netdata/lib/netdata/conf.d/health.d/disks.conf' '2220' '2220' '%' 'current disk space usage' '78.7%' '80.4%' '$this > (($status >= $WARNING ) ? (80) : (90))' '[ $this = 77.7332227 ] [ $status = 1 ] [ $WARNING = 3 ] ' '1' '1'' as id '3'
2020-12-08 00:46:04: done executing command '3' - returned with code 0 at 1607384764
2020-12-08 01:16:04: Health alarm 'net_packets.eno1.inbound_packets_dropped_ratio' = 5.265169 - changed status from WARNING to CRITICAL
2020-12-08 01:16:04: Health not sending again notification for alarm 'net_packets.eno1.inbound_packets_dropped_ratio' status CRITICAL

There's a bit of a storm of "first initialization" messages at startup, but after that only the useful messages from state changes and alarm execution remain.

@ilyam8 (Member) commented Jan 28, 2021

@Saruspete hi. What are your plans for this PR? Are you still interested in finishing it?

@Saruspete (Contributor, Author) commented Jan 28, 2021

Hello,

it has been running in production under 1.26 and 1.28 for about 2 months; I was waiting to check that it didn't generate any issues, and it looks fine so far.
I would have liked a way to get the precise time spent running the alarm script, but that would have required changing the alarm_entry struct, which I wasn't fond of.

So far, looks good to me. I'll remove the WIP tag.

@Saruspete changed the title from "[WIP][health] Dispatch some alarms into health.log instead of debug" to "[health] Dispatch some alarms into health.log instead of debug" on Jan 28, 2021
@Saruspete (Contributor, Author) commented:

A thing just hit me: I removed the mutex that was copy/pasted from access.log, since the health logs are not in the webserver path. However, alerts are handled by libuv. Are these subject to multithreading, so that writes to health.log need a mutex too?
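
If a lock turns out to be needed, a minimal sketch of serializing the writes with a pthread mutex; health_log_fp and health_log_write() are illustrative names, not the ones from the patch:

// Hedged sketch: one mutex around every write to health.log, in case
// the libuv callbacks can run on more than one thread.
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static FILE *health_log_fp = NULL;  // opened elsewhere at startup
static pthread_mutex_t health_log_lock = PTHREAD_MUTEX_INITIALIZER;

static void health_log_write(const char *fmt, ...) {
    if(!health_log_fp) return;
    va_list args;
    va_start(args, fmt);
    pthread_mutex_lock(&health_log_lock);   // one writer at a time
    vfprintf(health_log_fp, fmt, args);
    fputc('\n', health_log_fp);
    fflush(health_log_fp);                  // keep lines intact across threads
    pthread_mutex_unlock(&health_log_lock);
    va_end(args);
}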

@thiagoftsm (Contributor) commented:

> Hello @thiagoftsm
> Sorry for the late reply too ^^
> I always forget to check with these strict flags. Is there any standard "make check" or "make build-strict"-like target to automate this?

I do not remember; we have a make check, but I am not sure it would do exactly what you want. @Ferroin can you help us here?

> fwiw, I tried a simple make with
>
> CFLAGS="-O1 -ggdb -Wall -Wextra -fsanitize=address -static-libasan -fno-omit-frame-pointer -Wformat-signedness -fstack-protector-all -DNETDATA_INTERNAL_CHECKS=1 -D_FORTIFY_SOURCE=2 -DNETDATA_VERIFY_LOCKS=1" ./configure --prefix=/home/netdata --exec-prefix=/home/netdata/usr --with-user=adrien
>
> and I'm getting errors about README.md:
>
> make[4]: Entering directory '/home/adrien/devel/sys-monitoring/netdata.saruspete/database/engine/global_uuid_map'
> make[4]: *** No rule to make target 'README.md', needed by 'all-am'.  Stop.
>
> and while doing only "make netdata" I got:
>
> make: *** No rule to make target 'database/engine/metadata_log/metadatalog.c', needed by 'database/engine/metadata_log/metadatalog.o'.  Stop.

About these last errors: I compiled the latest master and I am not getting them. Please, can you try to rebase your PR?

@Saruspete (Contributor, Author) commented:

It should already be rebased (done during my cleanup):

$ git describe
v1.29.3-161-ge8cba5c1a

@thiagoftsm (Contributor) commented:

> A thing just hit me: I removed the mutex that was copy/pasted from access.log, since the health logs are not in the webserver path. However, alerts are handled by libuv. Are these subject to multithreading, so that writes to health.log need a mutex too?

Firstly, sorry for the delay.

I think you are doing an awesome job here, and this log could probably one day be exported to our dashboard or to a file. My suggestion is to create a mutex for it, because it will be necessary in the future.

@thiagoftsm (Contributor) commented:

> Once again, I f***ed up my git and also managed to lose the changes. Once again, sorry for the noise. (And if you have hints on how to avoid that in the future, I'll take them!)
>
> I rewrote the files from the version that has been in my production for a few months.

I am not a specialist in git. @ilyam8 do you have a tip for us?

@thiagoftsm (Contributor) commented:

Hello @Saruspete ,

I tested your PR just now and saw that the compilation error is fixed, thank you!
Now I want to give a suggestion to improve your PR. I used the following alarm:

 alarm: dev_dim_template
    on: system.cpu
    os: linux
lookup: sum -3s at 0 every 3 percentage foreach *
 units: %
 every: 1s
  warn: $this > 1
  crit: $this > 4

As you can see, the alarm is useless in the real world, but it was good enough for me to examine the details of your PR. Firstly, take a look at my error.log:

bash-5.1# grep telegram /var/log/netdata/error.log 
2021-03-23 22:12:38: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_idle is CRITICAL to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_user is CRITICAL to '-3'
2021-03-23 22:12:41: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is CRITICAL to '-3'
2021-03-23 22:12:41: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_iowait is WARNING to '-3'
2021-03-23 22:12:44: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_iowait is CLEAR to '-3'
2021-03-23 22:12:50: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
2021-03-23 22:12:53: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is CRITICAL to '-3'
2021-03-23 22:13:04: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'

and now the health.log:

bash-5.1# cat /var/log/netdata/health.log 
2021-03-23 22:12:39: done executing command '1' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '2' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '3' - returned with code 0 at 1616537557
2021-03-23 22:12:41: done executing command '4' - returned with code 0 at 1616537560
2021-03-23 22:12:41: done executing command '5' - returned with code 0 at 1616537560
2021-03-23 22:12:44: done executing command '6' - returned with code 0 at 1616537563
2021-03-23 22:12:50: done executing command '7' - returned with code 0 at 1616537568
2021-03-23 22:12:53: done executing command '8' - returned with code 0 at 1616537572
2021-03-23 22:13:04: done executing command '9' - returned with code 0 at 1616537581

I am missing a unique identifier shared between the two files that would allow us to correlate the data; as you can see, the timestamps do not match. Please, can you add the alarm name to health.log?

In my view, we could also move the information that currently goes to error.log into health.log, because it is also related to notifications. @ilyam8 and @vlvkobal do you have objections to this idea? In this case, I also suggest adding the alarm name to the current message.

@Saruspete (Contributor, Author) commented Mar 23, 2021

You should have a "Queued command %s as id X" line before it, due to the async way libuv works (https://github.com/netdata/netdata/pull/7576/files#diff-69a1925e841976fa93ef3331a8c955d5500c1a8225ebcbeb56fa2dd8397e75fbR379).

The log you're seeing is enabled by HEALTH_ENTRY_FLAG_EXEC_RUN or HEALTH_ENTRY_FLAG_EXEC_FAILED (by default, these logs are not enabled; all settings are off). However, it shouldn't show "returned with code 0": either it worked fine and you should also have the "Queued" log entry (if log alarm exec is on), or it failed and you shouldn't see code 0. I will look into that.
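
For context, a self-contained sketch of the pairing described above; the two flag names are the ones mentioned in this thread, but their values, the struct layout, and the logging are illustrative stand-ins for netdata's ALARM_ENTRY:

// Hedged sketch: a 'done executing' line is gated by the same switch
// that produced the 'Queued command' line, so successes always pair up;
// failures go through the separate errors switch.
#include <stdint.h>
#include <stdio.h>

#define HEALTH_ENTRY_FLAG_EXEC_RUN    0x01u  // illustrative values
#define HEALTH_ENTRY_FLAG_EXEC_FAILED 0x02u

struct alarm_entry_view { uint32_t unique_id; uint32_t flags; int exec_code; };

static void log_exec_result(const struct alarm_entry_view *ae,
                            int log_exec, int log_exec_errors) {
    if(ae->flags & HEALTH_ENTRY_FLAG_EXEC_FAILED) {
        if(log_exec_errors)
            printf("failed executing command '%u' - returned with code %d\n",
                   (unsigned)ae->unique_id, ae->exec_code);
    }
    else if(ae->flags & HEALTH_ENTRY_FLAG_EXEC_RUN) {
        if(log_exec)  // the same switch that logged the 'Queued command' line
            printf("done executing command '%u' - returned with code %d\n",
                   (unsigned)ae->unique_id, ae->exec_code);
    }
}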

Also, I couldn't find a way to get the executed command from the ae pointer (it doesn't seem to be saved in it). Do you have any idea here? Should we copy the command_to_run string into the alarm? That way we could skip the "Queued" message entirely and have the command, time consumed, return code, etc. in one line.

@thiagoftsm (Contributor) commented:

> Also, I couldn't find a way to get the executed command from the ae pointer (it doesn't seem to be saved in it). Do you have any idea here? Should we copy the command_to_run string into the alarm? That way we could skip the "Queued" message entirely and have the command, time consumed, return code, etc. in one line.

You are right, we do not store the command_to_run inside the alarm entry; this variable is given to our spawn server, which stores the command inside its queue. For this case you will need to copy it.

@Saruspete (Contributor, Author) commented:

OK, there's no point storing the command long-term, so I don't want to change the alarm_entry struct for that.
I would rather use a fixed-size array or similar to keep the "serial" => "command" mapping for a fixed amount of time (e.g. 30 seconds), so that logs and transfers can gather the information if needed.
Do you have any ideas or insights on how to implement that efficiently within netdata?
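
One cheap way to do it, sketched under the assumption that serials increase monotonically: a fixed-size table indexed by serial modulo its size, so old entries are overwritten as new commands arrive and nothing needs a timer. All names here are illustrative:

// Hedged sketch: fixed-size "serial" => "command" map with natural aging.
// A real version would share the health-log mutex (or its own), since the
// queueing path and the completion callback may run on different threads.
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CMD_MAP_SIZE 256   // must comfortably exceed commands in flight

struct cmd_map_entry {
    uint64_t serial;       // 0 means the slot is empty
    char *command;
};

static struct cmd_map_entry cmd_map[CMD_MAP_SIZE];

static void cmd_map_put(uint64_t serial, const char *command) {
    struct cmd_map_entry *e = &cmd_map[serial % CMD_MAP_SIZE];
    free(e->command);              // drop whatever aged out of this slot
    e->serial  = serial;
    e->command = strdup(command);
}

static const char *cmd_map_get(uint64_t serial) {
    struct cmd_map_entry *e = &cmd_map[serial % CMD_MAP_SIZE];
    return (e->serial == serial) ? e->command : NULL;  // NULL if overwritten
}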

@thiagoftsm (Contributor) commented:

> OK, there's no point storing the command long-term, so I don't want to change the alarm_entry struct for that.
> I would rather use a fixed-size array or similar to keep the "serial" => "command" mapping for a fixed amount of time (e.g. 30 seconds), so that logs and transfers can gather the information if needed.
> Do you have any ideas or insights on how to implement that efficiently within netdata?

I do not think a hash will help us, because we dispatch the command with different arguments, and among the arguments is a timestamp. I think we could use the following fields to help us with the identification:

  • exec: the command that Netdata will run
  • unique_id: the unique id of the alarm entry on the host
  • chart: the chart name
  • family: the chart family
  • registry_hostname: the computer's hostname

Together, these fields will let us create a unique identification.
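
For illustration, a sketch of composing such a correlation key from those fields, so an error.log line and a health.log line can be matched; the function and format are hypothetical:

// Hedged sketch: build a stable key from the fields listed above.
#include <stdint.h>
#include <stdio.h>

static void alarm_event_key(char *buf, size_t len,
                            const char *registry_hostname, const char *chart,
                            const char *family, const char *exec,
                            uint32_t unique_id) {
    snprintf(buf, len, "%s/%s/%s/%s/%u",
             registry_hostname, chart, family, exec, (unsigned)unique_id);
}

// e.g. "hades/system.cpu/cpu/alarm-notify.sh/35", printed in both logs.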

@ktsaou (Member) commented Jun 29, 2022

@thiagoftsm do we need this work to be merged? If yes, please update the PR and merge it. If not, let's close it.

@thiagoftsm (Contributor) commented:

> @thiagoftsm do we need this work to be merged? If yes, please update the PR and merge it. If not, let's close it.

After a quick talk with our team, I rebased the PR and we are now proceeding with the next steps to merge it.

@Dim-P (Contributor) left a comment:

Also update [logs] section options to reflect the new health.log configuration option?
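
For illustration, a hypothetical netdata.conf snippet; the exact option name, and whether it lives in the [logs] section, depends on what the PR actually registers:

[logs]
	# debug  = /var/log/netdata/debug.log
	# error  = /var/log/netdata/error.log
	# access = /var/log/netdata/access.log
	# health = /var/log/netdata/health.log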

[review thread on daemon/main.c (outdated, resolved)]
@thiagoftsm requested a review from Dim-P on July 1, 2022 09:26
@vlvkobal previously approved these changes on July 1, 2022
@thiagoftsm previously approved these changes on July 1, 2022
@vlvkobal dismissed stale reviews from thiagoftsm and themself via 52178e2 on July 1, 2022 12:25
@vlvkobal (Contributor) commented Jul 1, 2022

Closing this because we are planning to redesign log messages in general and we'll take care of the health logs as well.

@vlvkobal closed this on Jul 1, 2022
Successfully merging this pull request may close these issues:

[health] Log standard message when executing health_script