enable ml_1min_node_ar as a default alert #14687

Merged
merged 10 commits into from Sep 13, 2023

Conversation

andrewm4894
Contributor

@andrewm4894 andrewm4894 commented Mar 8, 2023

Summary

The PR enables a node level (warning only) anomaly rate based alert by default.

Adds this ml based alert in health.d/ml.conf:

# node level anomaly rate
# https://learn.netdata.cloud/docs/agent/ml#node-anomaly-rate
# if node level anomaly rate is above 1% then warning (pick your own threshold that works best via trial and error).
template: ml_1min_node_ar
      on: anomaly_detection.anomaly_rate
      os: *
   hosts: *
  lookup: average -1m of anomaly_rate
    calc: $this
   units: %
   every: 30s
    warn: $this > 1
    info: rolling 1min node level anomaly rate
      to: silent
Test Plan

Dogfood internally on a wide range of systems to "get a feel" for this alert once we are ready.

Additional Information
For users: How does this change affect me?

  • A new alert called `ml_1min_node_ar` will be created.
  • It will raise a warning if the node anomaly rate, averaged over the last minute, is above 1%.
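As a rough illustration of what the template evaluates, here is a minimal Python sketch of a trailing one-minute average compared against the 1% warning threshold. It is not Netdata code: the function name, the per-second sample list, and the per-sample evaluation are assumptions made for the example (the real alert is evaluated by the agent every 30 seconds).

```python
from collections import deque

WARN_THRESHOLD_PCT = 1.0  # mirrors `warn: $this > 1`
WINDOW_SECONDS = 60       # mirrors `lookup: average -1m`


def rolling_average_warnings(samples, window=WINDOW_SECONDS, threshold=WARN_THRESHOLD_PCT):
    """Return one boolean per sample: True when the trailing average exceeds the threshold.

    `samples` is a hypothetical list of per-second node anomaly rates, in percent.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` samples
    flags = []
    for value in samples:
        recent.append(value)
        average = sum(recent) / len(recent)
        flags.append(average > threshold)
    return flags


# A short 10-second spike to 5% versus a sustained 2% anomaly rate.
spike = rolling_average_warnings([0.0] * 60 + [5.0] * 10 + [0.0] * 60)
sustained = rolling_average_warnings([0.0] * 60 + [2.0] * 60)
```

Under these assumptions the 10-second spike never warns, because the one-minute average peaks at about 0.83%, while the sustained 2% rate starts warning once 31 of the 60 seconds in the window are at 2%.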

@andrewm4894 andrewm4894 added the area/ml Machine Learning Related Issues label Mar 8, 2023
@andrewm4894 andrewm4894 self-assigned this Mar 8, 2023
@andrewm4894 andrewm4894 added the initiative Label used to clearly distinguish tickets related to initiatives label Mar 24, 2023
@github-actions github-actions bot removed the area/ml Machine Learning Related Issues label Sep 11, 2023
@andrewm4894
Contributor Author

@ktsaou fyi - i think we are ready to add this 1 ml based default alert that will trigger a warning if node anomaly rate goes over 1% for 1 minute or more.

I will finalize and make PR ready for review.

@andrewm4894 andrewm4894 marked this pull request as ready for review September 11, 2023 16:06
@andrewm4894
Contributor Author

Here is an example of it working as expected:

image

image

@andrewm4894 andrewm4894 added the area/ml Machine Learning Related Issues label Sep 11, 2023
@andrewm4894
Contributor Author

@ilyam8 qq - when the alert gets triggered I see it as `ml_1min_node_ar_anomaly_rate`. Is this a combination of <template_name>_<dimension_name>? Wondering if I can/should re-jig my lookup in some way to make the alert instance name a little cleaner.

@ilyam8
Member

ilyam8 commented Sep 11, 2023

@andrewm4894 it has <dimension_name> because you use foreach, I am not sure we need that.

@andrewm4894
Contributor Author

> @andrewm4894 it has <dimension_name> because you use foreach, I am not sure we need that.

Cool, yeah, I thought that. I'll use `of` instead of `foreach`.

@github-actions github-actions bot removed the area/ml Machine Learning Related Issues label Sep 11, 2023
andrewm4894 and others added 2 commits September 11, 2023 21:59
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
@ilyam8
Member

ilyam8 commented Sep 12, 2023

@andrewm4894 how did you decide on the value (1%)? It was 5% before. Do you think 1% AR is worth user attention? If so, why?

@andrewm4894
Contributor Author

It is a bit cleaner now that I am using `of` instead of `foreach`:

image

@andrewm4894
Contributor Author

> @andrewm4894 how did you decide on the value (1%)? It was 5% before. Do you think 1% AR is worth user attention? If so, why?

dimension anomaly score threshold = 0.99

So under the hood, 1% is the threshold for each model, so if the node itself is consistently above 1%, a lot of the random 1%'s that might happen are now happening around the same time.

5% was a somewhat higher threshold, chosen to try to have less noise. But since this PR updated the defaults to ~24 hours of training out of the box and to use more models in scoring, getting a 1% node anomaly rate averaged over a minute is a significant signal that at least 1% of your metrics are anomalous, so it might be worth looking into to see whether it is something you care about or not.
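To make the "at least 1% of your metrics are anomalous" reading concrete, here is a minimal Python sketch that treats the node anomaly rate as the percentage of dimensions flagged anomalous at one point in time. The function name and the 0/1 input list are hypothetical, and the definition itself is an assumption for illustration rather than the agent's actual implementation.

```python
def node_anomaly_rate(anomaly_bits):
    """Percentage of dimensions flagged anomalous at a single point in time.

    `anomaly_bits` is a hypothetical list with one 0/1 entry per dimension.
    """
    if not anomaly_bits:
        return 0.0
    return 100.0 * sum(anomaly_bits) / len(anomaly_bits)


# On a node with 2000 dimensions, 20 anomalous dimensions give 1%,
# and it takes 100 anomalous dimensions to reach the old 5% threshold.
one_percent = node_anomaly_rate([1] * 20 + [0] * 1980)
five_percent = node_anomaly_rate([1] * 100 + [0] * 1900)
```

The 2000-dimension node is an invented example size; the point is only that 5% requires five times as many simultaneously anomalous dimensions as 1%.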

Have been dogfooding it on various configs in ML demo room:
image

Basically, with the new defaults on typical installs, getting 5% node AR for a window of 1 min or more is very rare, so 5% is a little too conservative.

1% is reasonable enough and easy enough to explain and reason about too as a default.

For sure, though, it will be very useful to see what the wider community thinks too, as there is no one perfect value for this threshold, so we just have to pick one that on balance may be useful.

"Contamination Factor" is basically what the above param is, so 1% has a nice symmetry with that and is about as good a theoretical motivation as you can have, but it is subjective and ultimately needs to be a bit empirical.

At some stage it would be cool for the agent to actually learn this threshold itself dynamically once the node reaches some sort of equilibrium, when all models are trained and things are mostly normal. It could then adjust the threshold based on the observed node anomaly rate. But this is a bit too complex for now and would have extra overhead.
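One way such a learned threshold could look is sketched below in Python. This is purely hypothetical and not something the agent does: it takes a history of observed one-minute node anomaly rates and picks a high percentile of that history, never going below the 1% default. The function name, the nearest-rank percentile method, and the 99th-percentile choice are all assumptions for illustration.

```python
import math


def learned_threshold(observed_rates, floor_pct=1.0, percentile=99):
    """Pick a warning threshold from past node anomaly rates (in percent).

    Hypothetical sketch: uses the nearest-rank percentile of the observed
    history and never returns less than `floor_pct`.
    """
    if not observed_rates:
        return floor_pct
    ordered = sorted(observed_rates)
    # Nearest-rank method: the smallest value with at least `percentile`%
    # of the observations at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return max(floor_pct, ordered[rank - 1])


# A quiet node keeps the 1% default; a naturally noisy node gets a higher threshold.
quiet_node = learned_threshold([0.1] * 100)
noisy_node = learned_threshold([0.5] * 98 + [3.0] * 2)
```

With the invented histories above, the quiet node stays at the 1.0 floor and the noisy node ends up at 3.0.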

So the tl;dr is:

  • some domain expertise.
  • some internal dogfooding.
  • least worst value given we have to pick one.

@ilyam8
Member

ilyam8 commented Sep 12, 2023

@andrewm4894 is it possible that daily updates will trigger it? Did you check it?

@andrewm4894
Contributor Author

andrewm4894 commented Sep 12, 2023

> @andrewm4894 is it possible that daily updates will trigger it? Did you check it?

Yeah, it should not. The first nightly update might trigger it, but that will then be picked up and trained on, so the next update the following day should not be enough to trigger it. We exclude the netdata charts, but updating naturally has some side effects on normal charts, and there is not really any way to handle that unless we were to use the health management API as part of the netdata updater or something, but I think that's a different issue.

Or we could try to disable it before the updater runs and re-enable it afterwards, and just handle it that way: https://learn.netdata.cloud/docs/alerting/health-api-calls#disable-or-silence-specific-alerts
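A minimal Python sketch of that idea follows, building (but not sending) the two health management API requests that would silence this alert before an update and reset it afterwards. The endpoint path, the `cmd` and `alarm` parameters, and the `X-Auth-Token` header are written from memory of the linked documentation and should be treated as assumptions to verify there; the helper name, base URL, and token value are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def health_command_request(cmd, alarm=None, base_url="http://localhost:19999", token="CHANGE_ME"):
    """Build a request for the agent's health management API without sending it.

    Assumptions to check against the linked docs: the path, the `cmd` and
    `alarm` query parameters, and the `X-Auth-Token` header.
    """
    params = {"cmd": cmd}
    if alarm:
        params["alarm"] = alarm
    url = f"{base_url}/api/v1/manage/health?{urlencode(params, quote_via=quote)}"
    return Request(url, headers={"X-Auth-Token": token})


# Silence only this alert before the updater runs, then reset afterwards.
before_update = health_command_request("SILENCE", alarm="ml_1min_node_ar")
after_update = health_command_request("RESET")
```

Passing either request to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can be read and adapted without a running agent.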

@MrZammler
Contributor

Ok, testing this; will let you know.

@MrZammler
Contributor

@andrewm4894 Can we add class, component and type? Not sure what to put there, but a suggestion might be class: Utilization or Workload, component: ML, and type: System, maybe?

@andrewm4894
Contributor Author

Adding a guide for the Netdata assistant here: https://github.com/netdata/cloud-netdata-assistant/pull/12

os: *
hosts: *
lookup: average -1m of anomaly_rate
calc: $this
Member

nit-pick: calc is not needed (it does nothing).

@ilyam8 ilyam8 merged commit 92515e4 into netdata:master Sep 13, 2023
128 checks passed
@andrewm4894 andrewm4894 deleted the add-1min-node-ar-alert branch September 13, 2023 13:46
@andrewm4894
Contributor Author

Will do a quick little blog post to announce it too, and add some context for sharing with users.

@andrewm4894 andrewm4894 added the area/ml Machine Learning Related Issues label Sep 13, 2023
@andrewm4894 andrewm4894 restored the add-1min-node-ar-alert branch September 13, 2023 14:33
@andrewm4894
Contributor Author

cross linking the blog post as fyi: https://blog.netdata.cloud/our-first-ml-based-anomaly-alert/

@MrZammler
Contributor

Hello :-)

image

@andrewm4894
Contributor Author

> Hello :-)

image

:) ok now let's see tomorrow

@andrewm4894
Contributor Author

@MrZammler can you add me, or share some more images of the alerts and the anomaly results, etc.?

Assuming it's a Netdata update or something else?

Labels
area/health area/ml Machine Learning Related Issues initiative Label used to clearly distinguish tickets related to initiatives

3 participants