Change offset(nan) and sync(0) to WARNING #7

rodrigogansobarbieri · 2020-08-14T19:28:56Z

Due to network instability in some environments,
the NTP algorithm may temporarily discard all
servers/peers, triggering the Nagios alert
"CRITICAL: offset is out of range (nan)".

There are two problems with that alert:

1) The message is misleading, as offset=nan does
not mean anything. The worst metric selection
hides the real alert "CRITICAL: No sync peer selected".
Therefore, the worst metric selection list is reversed,
as when "No sync peer selected" happens,
"offset is out of range (nan)" always happens as well.

2) The problem is usually transient and does not
require immediate attention. It may go away with
network workload changes, or may persist until the
root cause of the issue is investigated. Given that it
is of "CRITICAL" level, it obfuscates other much more
critical alerts from NTPmon and also Nagios. Therefore,
the WARNING level is more appropriate.

Closes #5
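
To illustrate the first problem, here is a minimal sketch of how a NaN offset can end up masking the sync alert. It is not NTPmon's actual code: the thresholds, the `classify_offset` and `classify_sync` helpers, and the tie-breaking rule in this `worst_metric` are all assumptions made for illustration. It relies only on the fact that any comparison with NaN is false, so a NaN offset falls through every threshold check, and on the assumption that ties between equally severe metrics go to whichever metric is listed first.

```python
import math

# Nagios-style severity levels, ordered from best to worst.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_offset(offset, warn=0.01, crit=0.05):
    """Hypothetical threshold check (thresholds are illustrative).

    Every comparison with NaN is False, so a NaN offset skips both
    branches and lands on CRITICAL.
    """
    if abs(offset) < warn:
        return OK, f"offset is within range ({offset})"
    if abs(offset) < crit:
        return WARNING, f"offset is high ({offset})"
    return CRITICAL, f"offset is out of range ({offset})"


def classify_sync(sync):
    """Hypothetical check: sync is 1 when a sync peer is selected, else 0."""
    if sync >= 1:
        return OK, "sync peer selected"
    return CRITICAL, "No sync peer selected"


def worst_metric(results, order):
    """Return the worst (level, message) pair.

    Assumption: on a tie, the metric listed earliest in `order` wins,
    because a later metric must be strictly worse to replace it.
    """
    worst = None
    for name in order:
        level, message = results[name]
        if worst is None or level > worst[0]:
            worst = (level, message)
    return worst


# No peers selected: the offset is NaN and sync is 0, so both are CRITICAL.
results = {
    "offset": classify_offset(math.nan),
    "sync": classify_sync(0),
}

original = worst_metric(results, ["offset", "sync"])
flipped = worst_metric(results, ["sync", "offset"])
print(original[1])
print(flipped[1])
```

Under these assumptions, both metrics are CRITICAL at the same time, so only the list order decides which message the operator sees. Checking sync before offset surfaces "No sync peer selected" instead of the NaN offset message.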

paulgear

@rodrigogansobarbieri If it's OK with you, I'm going to close this PR now. I've fixed the NaN problem in a different way, and the downgrading of sync failure from CRITICAL to WARNING is not a change which I'm willing to consider now, since the whole point of NTP is to sync the clock. I think the reversal of the list in worst_metric() is worth thinking about a bit further, and I've created issue #11 to track this.

rodrigogansobarbieri force-pushed the issue5 branch from cdc70b5 to dfedbc8 on August 14, 2020 19:30

rodrigogansobarbieri force-pushed the issue5 branch from dfedbc8 to c628a66 on August 14, 2020 20:06

paulgear mentioned this pull request Oct 5, 2020

Do not allow NaN to be produced as a metric #10

Merged

paulgear reviewed Oct 13, 2020

rodrigogansobarbieri closed this Oct 19, 2020
