Track anomaly rates with DBEngine. #12083
ml/Config.cc
@@ -22,7 +22,7 @@ static T clamp(const T& Value, const T& Min, const T& Max) {
 void Config::readMLConfig(void) {
     const char *ConfigSectionML = CONFIG_SECTION_ML;

-    bool EnableAnomalyDetection = config_get_boolean(ConfigSectionML, "enabled", false);
+    bool EnableAnomalyDetection = config_get_boolean(ConfigSectionML, "enabled", true);
Most of the changes in this file are meant to make testing easy, and they will be reverted prior to merging the PR. ML is enabled and detection/prediction runs every minute, just for the system.cpu chart. The anomaly rates are updated every 5 seconds.
This pull request introduces 1 alert when merging df780a7 into 4d9ccc1 - view on LGTM.com new alerts:
My concern, as we discussed briefly: if the anomaly rate chart is not streamed to a (claimed) parent and the child is not claimed, I don't know if it's possible for the cloud to query the parent for this data on behalf of the child. The routing, as described by the cloud backend, is that when they want to get data for a child through a parent, they issue a query like '/node/XXXXXXXX/api/v1/something' on the parent, but that works because the data is stored on the parent, so it can reply for it. @stelfrag / @underhood, maybe I'm wrong on this? It's kind of like alerts, i.e. they can get alerts for a child, but only if the parent has calculated them (not the ones calculated by the child).
We have two charts:
feb5025 to 0d99169
This pull request introduces 1 alert when merging 0d99169 into a6d3de2 - view on LGTM.com new alerts:
had it segfault:
above comments
Can you post the request so that I can take a look?
It was a cloud request when I scrolled to the anomaly chart, which failed to load. I don't have the exact request anymore, but I can try to reproduce it.
That makes me even more curious about what happened. Can you reproduce this? On cloud, we only show the host's "anomaly rate" chart (i.e. not the "anomaly rates" chart described here: #12083 (comment)).
seems to crash on this:
Steps to reproduce:
my comment above would explain the crash @underhood mentions since e.g. on one of my nodes it's
Let's say that you have a child that runs ML for itself. The child will create a number of Suppose now that the child streams to the parent. The parent will be aware of the child's If we don't have a unique name for the newly created charts on the parent, then there will be a collision of chart names. That is, a parent will receive the child's This PR addresses this by creating the I don't know what is the proper way to handle this on the cloud-side, ie. how the cloud aggregates these charts in the overview section.
@vkalintiris would we be able to remove the chart renaming to unique names from this PR and handle it separately? My understanding is that doing so will not break anything, and then we can tackle the parent/child and renaming work separately. I will think about it, but I can see lots of potential unintended consequences of adding the suffix on the agent side, and it just does not feel like an elegant solution to me. Ideally, cloud should be able to figure it all out, since it should know what it needs about each node instance. If it needs to know more to handle this specific case, then we should add what it needs to the ml-info endpoint instead of breaking the naming conventions of charts.
Could the parent not just handle this and just overwrite the
This pull request introduces 1 alert when merging b9d493f into e7102bd - view on LGTM.com new alerts:
this is looking good to me now and seems to be working as expected |
b9d493f to f013af4
This pull request introduces 1 alert when merging f013af4 into 4d07506 - view on LGTM.com new alerts:
Hello @vkalintiris, please take a look at this LGTM report.
For the ACLK part, I think it's ok. I got the chart in
This commit adds support for tracking anomaly rates with DBEngine. We do so by creating a single chart with id "anomaly_detection.anomaly_rates" for each trainable/predictable host, which is responsible for tracking the anomaly rate of each dimension that we train/predict for that host.

The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly rates charts. We use this flag to:

- Disable exposing the anomaly rates charts through the functionality in backends/, exporting/ and streaming/.
- Skip generation of configuration options for the name, algorithm, multiplier, divisor of each dimension in an anomaly rates chart.
- Skip the creation of health variables for anomaly rates dimensions.
- Skip the chart/dim queue of ACLK.
- Post-process the RRDR result of an anomaly rates chart, so that we can return a sorted, trimmed number of anomalous dimensions.

In a child/parent configuration where both the child and the parent run ML for the child, we want to be able to stream the rest of the ML-related charts to the parent. To be able to do this without any chart name collisions, the charts are now created on localhost and their IDs and titles have the node's machine_guid and hostname as a suffix, respectively.
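The post-processing step mentioned in the commit message (returning a sorted, trimmed number of anomalous dimensions) can be sketched roughly as follows. This is only an illustration and not the code from this PR: the `DimRate` alias, the `top_anomalous` function, and the rule of dropping dimensions with a zero rate are all assumptions, and the real implementation works on Netdata's RRDR result structure rather than on a vector of pairs.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: dimension name plus its anomaly rate (percent).
using DimRate = std::pair<std::string, double>;

// Hypothetical helper: keep only dimensions with a non-zero anomaly rate,
// order them from most to least anomalous, and return at most max_dims of them.
std::vector<DimRate> top_anomalous(std::vector<DimRate> dims, std::size_t max_dims) {
    // Assumption: dimensions that were never anomalous are not worth returning.
    dims.erase(std::remove_if(dims.begin(), dims.end(),
                              [](const DimRate &d) { return d.second <= 0.0; }),
               dims.end());

    // Sort by descending rate; stable_sort preserves input order between ties.
    std::stable_sort(dims.begin(), dims.end(),
                     [](const DimRate &a, const DimRate &b) { return a.second > b.second; });

    // Trim to the requested number of dimensions.
    if (dims.size() > max_dims)
        dims.erase(dims.begin() + static_cast<std::ptrdiff_t>(max_dims), dims.end());

    return dims;
}
```

Calling `top_anomalous` with a handful of dimensions and `max_dims = 2` returns the two dimensions with the highest rates, which keeps the query response small even when a host has thousands of dimensions.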
The reverted changes were meant for local testing only. This commit restores the default values that we want to have when someone runs anomaly detection on their node.
f013af4 to b8c0b21
@vkalintiris any sense in trying to include #12212 in this branch at this stage, or can it wait?
@andrewm4894 I believe the reviewers are pretty much okay to accept this after addressing the latest comments. @thiagoftsm @MrZammler @underhood: what do you think?
I agree 100%.
Lgtm
Summary
This commit adds support for tracking anomaly rates with DBEngine. We
do so by creating a single chart with id "anomaly_detection.anomaly_rates" for
each trainable/predictable host, which is responsible for tracking the anomaly
rate of each dimension that we train/predict for that host.
The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly
rates charts. We use this flag to:
- Disable exposing the anomaly rates charts through the functionality in backends/, exporting/ and streaming/.
- Skip generation of configuration options for the name, algorithm, multiplier, divisor of each dimension in an anomaly rates chart.
- Skip the creation of health variables for anomaly rates dimensions.
- Skip the chart/dim queue of ACLK.
- Post-process the RRDR result of an anomaly rates chart, so that we can return a sorted, trimmed number of anomalous dimensions.
In a child/parent configuration where both the child and the parent run
ML for the child, we want to be able to stream the rest of the ML-related
charts to the parent. To be able to do this without any chart name collisions,
the charts are now created on localhost and their IDs and titles have the node's
machine_guid and hostname as a suffix, respectively.
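To illustrate how a machine_guid suffix avoids chart name collisions on a parent, here is a minimal sketch. The helper names, the `_` separator, the " for " wording in the title, and the base ID used in the usage note are hypothetical; the summary only states that IDs carry the machine_guid and titles carry the hostname as a suffix, not the exact format.

```cpp
#include <string>

// Hypothetical helper: derive a per-node chart ID by appending the node's
// machine_guid. The "_" separator is an assumption made for this sketch.
std::string unique_chart_id(const std::string &base_id, const std::string &machine_guid) {
    return base_id + "_" + machine_guid;
}

// Hypothetical helper: derive a per-node chart title by appending the
// hostname. The " for " wording is an assumption made for this sketch.
std::string unique_chart_title(const std::string &base_title, const std::string &hostname) {
    return base_title + " for " + hostname;
}
```

With helpers like these, a child and a parent that both create a chart from the same base ID, say a made-up `anomaly_detection.example_chart`, end up with different IDs because their machine_guids differ, so the streamed chart no longer clashes with the one created locally on the parent.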
Test Plan