enable `ml_1min_node_ar` as a default alert #14687

andrewm4894 · 2023-03-08T15:06:27Z

Summary

The PR enables a node level (warning only) anomaly rate based alert by default.

Adds this ml based alert in health.d/ml.conf:

# node level anomaly rate
# https://learn.netdata.cloud/docs/agent/ml#node-anomaly-rate
# if node level anomaly rate is above 1% then warning (pick your own threshold that works best via trial and error).
template: ml_1min_node_ar
      on: anomaly_detection.anomaly_rate
      os: *
   hosts: *
  lookup: average -1m of anomaly_rate
    calc: $this
   units: %
   every: 30s
    warn: $this > 1
    info: rolling 1min node level anomaly rate
      to: silent

Test Plan

Dogfood internally on a wide range of systems to "get a feel" for this alert once we are ready.

Additional Information

For users: How does this change affect me?

- A new alert called `ml_1min_node_ar` will be created.
- It will raise a warning when the node anomaly rate, averaged over the last minute, is above 1%.
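
As a rough illustration of the behaviour described above, here is a minimal Python sketch of a rolling 1 minute average with a warning threshold of 1%. It is not how the Netdata health engine is implemented; the function name, the one-sample-per-second assumption, and the status strings are illustrative only, and the real alert is evaluated every 30s rather than on every sample.

```python
from collections import deque


def rolling_average_alert(samples, window=60, warn_threshold=1.0):
    """Yield (average, status) for each anomaly-rate sample (in percent).

    Assumes one sample per second, so window=60 approximates `average -1m`.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` samples
    for value in samples:
        recent.append(value)
        average = sum(recent) / len(recent)
        # Mirrors `warn: $this > 1`: strictly above the threshold.
        status = "WARNING" if average > warn_threshold else "CLEAR"
        yield average, status


# One bad second barely moves the 1 minute average.
spike = [0.0] * 59 + [30.0]
# A full minute at 2% does trip the warning.
sustained = [2.0] * 60

print(list(rolling_average_alert(spike))[-1])      # (0.5, 'CLEAR')
print(list(rolling_average_alert(sustained))[-1])  # (2.0, 'WARNING')
```

The point of averaging over a minute is that short, isolated spikes do not raise the alert, while a sustained rise does.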

andrewm4894 · 2023-09-11T14:46:34Z

@ktsaou fyi - i think we are ready to add this 1 ml based default alert that will trigger a warning if node anomaly rate goes over 1% for 1 minute or more.

I will finalize and make PR ready for review.

andrewm4894 · 2023-09-11T16:08:56Z

Here is an example of it working as expected.

andrewm4894 · 2023-09-11T16:14:18Z

@ilyam8 qq - when the alert gets triggered I see it as `ml_1min_node_ar_anomaly_rate`. Is this a combination of `<template_name>_<dimension_name>`? Wondering if I can/should re-jig my lookup in some way to make the alert instance name a little cleaner.

ilyam8 · 2023-09-11T16:39:06Z

@andrewm4894 it has <dimension_name> because you use foreach, I am not sure we need that.

andrewm4894 · 2023-09-11T16:46:24Z

> @andrewm4894 it has <dimension_name> because you use foreach, I am not sure we need that.

Cool yeah thought that, I'll use for instead of foreach

Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>

ilyam8 · 2023-09-12T07:14:36Z

@andrewm4894 how did you decide on the value (1%)? It was 5% before. Do you think 1% AR is worth user attention? If so, why?

andrewm4894 · 2023-09-12T13:20:08Z

It is a bit cleaner now that I am using `of` instead of `foreach`.

andrewm4894 · 2023-09-12T13:35:33Z

> @andrewm4894 how did you decide on the value (1%)? It was 5% before. Do you think 1% AR is worth user attention? If so, why?

dimension anomaly score threshold = 0.99

So under the hood for each model 1% is the threshold and so if the node itself is consistently above 1% that means a lot of the random 1%'s that might happen are now happening around the same time.

5% was a somewhat higher threshold, chosen to reduce noise. But ever since the PR that updated the defaults to ~24 hours of training out of the box and to use more models in scoring, a node anomaly rate of 1% averaged over a minute is a significant signal that at least 1% of your metrics are anomalous, so it might be worth looking into to see whether it is something you care about.
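
To make the "1% of your metrics" point concrete, here is a small Python sketch that treats the node anomaly rate as the percentage of dimensions flagged anomalous at one point in time. The function name and the example node (2000 dimensions, 30 anomalous) are made up for illustration; this is not the agent's actual code.

```python
def node_anomaly_rate(anomaly_bits):
    """Percentage of dimensions flagged anomalous at one point in time.

    `anomaly_bits` holds one 0/1 flag per dimension (1 = anomalous).
    """
    if not anomaly_bits:
        return 0.0
    return 100.0 * sum(anomaly_bits) / len(anomaly_bits)


# Hypothetical node with 2000 dimensions, 30 of them anomalous right now.
bits = [1] * 30 + [0] * 1970
rate = node_anomaly_rate(bits)

print(rate)      # 1.5
print(rate > 1)  # True: above the 1% warning threshold
```

If each model flags roughly 1% of its own values on its own, a node that stays above 1% for a whole minute suggests those flags are clustering in time rather than being scattered noise.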

Have been dogfooding it on various configs in ML demo room:

Basically, with the new defaults on typical installs, getting a 5% node AR for a window of 1 min or more is very rare, so 5% is a little too conservative.

1% is reasonable enough and easy enough to explain and reason about too as a default.

For sure, though, it will be very useful to see what the wider community thinks too, as there is no one perfect value for this threshold, so we just have to pick one that on balance may be useful.

"Contamination Factor" is basically what the above param is, so 1% has a nice symmetry with that and is about as good a theoretical motivation as you can have, but it is subjective and ultimately needs to be a bit empirical.

At some stage it would be cool for the agent to learn this threshold itself dynamically once the node reaches some sort of equilibrium, when all models are trained and things are mostly normal. It could then adjust the threshold based on the observed node anomaly rate. But this is a bit too complex for now and would have extra overhead.

So the tl;dr is:

- some domain expertise.
- some internal dogfooding.
- least worst value given we have to pick one.
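
For what it's worth, the "agent learns the threshold itself" idea mentioned above could look something like the Python sketch below. This is purely hypothetical and not something the agent does today: the function name, the nearest-rank percentile, and the 1% floor are all assumptions made for illustration.

```python
import math


def learned_threshold(history, percentile=99, floor=1.0):
    """Pick a warning threshold from observed node anomaly rates (in percent).

    Uses a nearest-rank percentile of the history and never goes below
    `floor`, so a very quiet node still keeps the 1% default.
    """
    if not history:
        return floor
    ordered = sorted(history)
    # Nearest-rank percentile: 1-based rank into the sorted values.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return max(floor, ordered[rank - 1])


quiet_node = [0.1] * 100
noisy_node = [0.2] * 98 + [3.0, 4.0]

print(learned_threshold(quiet_node))  # 1.0 (falls back to the floor)
print(learned_threshold(noisy_node))  # 3.0
```

A real version would also need to decide when the node has reached equilibrium and how often to recompute, which is where the extra complexity and overhead come in.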

ilyam8 · 2023-09-12T15:26:22Z

@andrewm4894 is it possible that daily updates will trigger it? Did you check it?

andrewm4894 · 2023-09-12T15:42:34Z

> @andrewm4894 is it possible that daily updates will trigger it? Did you check it?

Yeah, it should not. The first nightly update might trigger it, but that will then be picked up and trained on, so the next update the following day should not be enough to trigger it. We exclude the netdata charts, but updating naturally has some side effects on normal charts, and there is not really any way to handle that unless we were to use the health management API as part of the netdata updater or something, but I think that's a different issue.

Or we could try to disable it before the updater runs and re-enable it afterwards, and just handle it that way: https://learn.netdata.cloud/docs/alerting/health-api-calls#disable-or-silence-specific-alerts
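
A rough Python sketch of that "silence before, reset after" idea is below. The endpoint path, the `cmd` and `alarm` parameters, and the `X-Auth-Token` header follow the linked health API docs as I understand them, so please verify against the docs before relying on it. The helper names, base URL, and token are hypothetical, and nothing here is part of the netdata updater.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def health_command_url(base_url, cmd, alarm=None):
    """Build a health management API URL (path and params assumed from the docs)."""
    params = {"cmd": cmd}
    if alarm is not None:
        params["alarm"] = alarm  # restrict the command to one alert name
    return f"{base_url}/api/v1/manage/health?{urlencode(params)}"


def send_health_command(base_url, token, cmd, alarm=None):
    """Send the command to a running agent and return the response body."""
    request = Request(
        health_command_url(base_url, cmd, alarm),
        headers={"X-Auth-Token": token},  # management API token of the agent
    )
    with urlopen(request, timeout=10) as response:
        return response.read().decode()


# Intended flow around an update (not executed here, needs a running agent):
#   send_health_command(base, token, "SILENCE", alarm="ml_1min_node_ar")
#   ... run the updater ...
#   send_health_command(base, token, "RESET")
print(health_command_url("http://localhost:19999", "SILENCE", "ml_1min_node_ar"))
```

`SILENCE` keeps the alert evaluating but suppresses notifications, which seems the lighter-touch option here; `RESET` clears all silencers and disablers afterwards.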

MrZammler · 2023-09-13T10:00:45Z

Ok, testing this will let you know.

MrZammler · 2023-09-13T10:27:10Z

@andrewm4894 Can we add class, component and type? Not sure what to put there, but a suggestion might be class: Utilization or Workload, component: ML, and type: System, maybe?

andrewm4894 · 2023-09-13T11:31:41Z

adding guide for netdata assistant here: https://github.com/netdata/cloud-netdata-assistant/pull/12

ilyam8 · 2023-09-13T12:33:22Z

health/health.d/ml.conf

+       os: *
+    hosts: *
+   lookup: average -1m of anomaly_rate
+     calc: $this


nit-pick: calc is not needed (it does nothing).

andrewm4894 · 2023-09-13T13:46:33Z

will do a quick little blog post to announce too and add some context for sharing with users.

andrewm4894 · 2023-09-13T16:39:08Z

cross linking the blog post as fyi: https://blog.netdata.cloud/our-first-ml-based-anomaly-alert/

MrZammler · 2023-09-14T08:49:12Z

Hello :-)

andrewm4894 · 2023-09-14T09:43:00Z

> Hello :-)

:) ok now let's see tomorrow

andrewm4894 · 2023-09-14T09:44:28Z

@MrZammler can you add me or share some more images of the alerts, the anomalies results, etc.?

Assuming it's a Netdata update or something else?

enable ml_1min_node_ar as a default alert

67829c5

github-actions bot added the area/health label Mar 8, 2023

andrewm4894 added the area/ml Machine Learning Related Issues label Mar 8, 2023

andrewm4894 self-assigned this Mar 8, 2023

andrewm4894 added the initiative Label used to clearly distinguish tickets related to initiatives label Mar 24, 2023

andrewm4894 mentioned this pull request May 5, 2023

Integrate AR into Alerts netdata/netdata-cloud#802

Open

Merge branch 'netdata:master' into add-1min-node-ar-alert

11f5c29

github-actions bot removed the area/ml Machine Learning Related Issues label Sep 11, 2023

andrewm4894 added 2 commits September 11, 2023 15:15

Merge branch 'netdata:master' into add-1min-node-ar-alert

550d7ec

simplify alert config

efab0c5

andrewm4894 requested a review from ilyam8 September 11, 2023 14:46

andrewm4894 marked this pull request as ready for review September 11, 2023 16:06

andrewm4894 requested review from thiagoftsm and MrZammler as code owners September 11, 2023 16:06

andrewm4894 added the area/ml Machine Learning Related Issues label Sep 11, 2023

andrewm4894 requested review from vkalintiris and ktsaou September 11, 2023 16:10

ilyam8 reviewed Sep 11, 2023

ilyam8 reviewed Sep 11, 2023

Use for instead of foreach

66a5146

github-actions bot removed the area/ml Machine Learning Related Issues label Sep 11, 2023

andrewm4894 and others added 2 commits September 11, 2023 21:59

Set to as silent

a252929

Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>

spacing

39233c7

andrewm4894 mentioned this pull request Sep 11, 2023

Add 1min_node_anomaly_rate alert template #15012

Closed

andrewm4894 added 2 commits September 12, 2023 13:02

"of" instead of "for"

7737dca

Merge branch 'netdata:master' into add-1min-node-ar-alert

eedf817

add class, component and type

8596cf2

MrZammler approved these changes Sep 13, 2023

ilyam8 approved these changes Sep 13, 2023

ilyam8 merged commit 92515e4 into netdata:master Sep 13, 2023
128 checks passed

andrewm4894 deleted the add-1min-node-ar-alert branch September 13, 2023 13:46

andrewm4894 added the area/ml Machine Learning Related Issues label Sep 13, 2023

andrewm4894 restored the add-1min-node-ar-alert branch September 13, 2023 14:33

andrewm4894 deleted the add-1min-node-ar-alert branch September 14, 2023 16:00

andrewm4894 mentioned this pull request Sep 14, 2023

extend ml default training from ~24 to ~48 hours #15971

Merged
