enable ml_1min_node_ar as a default alert #14687
Conversation
@ktsaou fyi - I think we are ready to add this one ML based default alert, which will trigger a warning if the node anomaly rate goes over 1% for 1 minute or more. I will finalize and make the PR ready for review.
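For context, a minimal sketch of what such an entry in `health.d/ml.conf` could look like. The chart id, `every` interval, and `info` text here are assumptions for illustration, not the final config from the PR:

```
# hypothetical sketch of a node anomaly rate alert entry
template: ml_1min_node_ar
      on: anomaly_detection.anomaly_rate
  lookup: average -1m of anomaly_rate
   units: %
   every: 30s
    warn: $this > 1
    info: rolling 1 minute node level anomaly rate
```

The `lookup` line averages the node anomaly rate over the last minute, and the `warn` expression fires when that average crosses 1%.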
@ilyam8 qq - when the alert gets triggered i see it as
@andrewm4894 it has
Cool, yeah I thought that. I'll use `for` instead of `foreach`.
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
@andrewm4894 how did you decide on the value (1%)? It was 5% before. Do you think 1% AR is worth user attention? If so, why?
So under the hood 1% is the threshold for each model, so if the node itself is consistently above 1%, it means a lot of the random 1%'s that might happen individually are now happening around the same time. 5% was a higher threshold intended to reduce noise, but that was before the PR that updated the defaults to ~24 hours of training out of the box and more models used in scoring. A node anomaly rate of 1% averaged over a minute is a significant signal that at least 1% of your metrics are anomalous, so it might be worth looking into to see if it is something you care about or not. I have been dogfooding it on various configs in the ML demo room.

Basically, with the new defaults on typical installs, actually getting 5% node AR for a window of 1 minute or more is very rare, so 5% is a little too conservative. 1% is reasonable and easy enough to explain and reason about as a default. For sure it will be very useful to see what the wider community thinks too, as there is no one perfect value for this threshold; we just have to pick one that on balance may be useful.

The parameter above is basically a "contamination factor", so 1% has nice symmetry with that and about as good a theoretical motivation as you can have, but it is subjective and ultimately needs to be a bit empirical. At some stage it would be cool for the agent to actually learn this threshold dynamically once the node reaches some sort of equilibrium (all models trained and things mostly normal) - it could then adjust the threshold based on the observed node anomaly rate. But this is a bit too complex for now and would add extra overhead.

So the tl;dr is:
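As a toy illustration of the arithmetic behind the threshold (the numbers here are invented, not from the PR): the node anomaly rate is just the share of dimensions flagged anomalous at a point in time, and the alert looks at its 1-minute average.

```shell
# Toy arithmetic: node anomaly rate = 100 * anomalous dimensions / total dimensions.
# The dimension counts below are made up for illustration.
anomalous=3
total=200
node_ar=$(awk -v a="$anomalous" -v t="$total" 'BEGIN { printf "%.2f", 100 * a / t }')
echo "node anomaly rate: ${node_ar}%"   # prints "node anomaly rate: 1.50%"
```

With a 1% per-model contamination factor, a handful of metrics going anomalous together is enough to push this over the 1% node-level default, which is exactly the "coincidence" signal described above.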
@andrewm4894 is it possible that daily updates will trigger it? Did you check it?
Yeah, it should not. The first nightly update might trigger it, but that will then be picked up and trained on, so the next update the following day should not be enough to trigger it. We exclude the netdata charts, but updates just naturally have some side effects on normal charts, and there is not really any way to handle that unless we actually used the health management API as part of the netdata updater or something - but I think that's a different issue. Or we could try to disable it before the updater and re-enable it after, and just handle it that way: https://learn.netdata.cloud/docs/alerting/health-api-calls#disable-or-silence-specific-alerts
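A sketch of that disable/re-enable idea using the health management API from the linked docs. The URL, token file path, and use of a helper that only echoes the commands (so you can review before piping to `sh`) are assumptions about a default install, not part of the PR:

```shell
# Sketch: compose health management API calls to pause this alert around an update.
# Assumes a default install; the API key path is the documented default location.
NETDATA="http://localhost:19999"
TOKEN_FILE="/var/lib/netdata/netdata.api.key"

health_cmd() {
  # Echo the curl invocation rather than running it; pipe to sh to execute.
  echo "curl -s '${NETDATA}/api/v1/manage/health?$1' -H \"X-Auth-Token: \$(cat ${TOKEN_FILE})\""
}

health_cmd "cmd=DISABLE&alarm=ml_1min_node_ar"   # before running the updater
health_cmd "cmd=RESET"                           # after the update completes
```

`cmd=RESET` re-enables health checks globally, which matches the "re-enable it after the updater" idea above.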
Ok, testing this - will let you know.
@andrewm4894 Can we add
adding guide for netdata assistant here: https://github.com/netdata/cloud-netdata-assistant/pull/12
os: *
hosts: *
lookup: average -1m of anomaly_rate
calc: $this
nit-pick: `calc` is not needed (it does nothing).
Will do a quick little blog post to announce it too and add some context for sharing with users.
cross linking the blog post as fyi: https://blog.netdata.cloud/our-first-ml-based-anomaly-alert/
@MrZammler can you add me or share some more images of the alerts and any anomaly results etc.? Is it a Netdata update, or something else?
Summary
The PR enables a node level (warning only) anomaly rate based alert by default. It adds this ML based alert in health.d/ml.conf.
Test Plan
Dogfood internally on a wide range of systems to "get a feel" for this alert once we are ready.
Additional Information
For users: How does this change affect me?
- A new alert called `ml_1min_node_ar` will be created.
- It will be a warning alert if the node anomaly rate is 1% or more for 1 minute or more.
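To make the trigger condition concrete, a toy calculation (the per-sample anomaly rate values below are invented): the alert averages the node anomaly rate over the last minute and warns when that average reaches 1% or more.

```shell
# Toy check of the trigger condition; the AR samples over the lookup window are invented.
samples="0.4 0.9 2.1 1.6 0.8 1.4"   # node anomaly rate (%) sampled across the last minute
avg=$(echo "$samples" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.2f", s / NF }')
echo "1-minute average AR: ${avg}%"
if awk -v a="$avg" 'BEGIN { exit !(a >= 1) }'; then
  echo "ml_1min_node_ar: WARNING"
else
  echo "ml_1min_node_ar: CLEAR"
fi
```

Here the average works out to 1.20%, so the toy alert lands in the warning state even though several individual samples were below 1% - a brief spike alone is not enough; the 1-minute average has to cross the threshold.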