Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track anomaly rates with DBEngine. #12083

Merged
merged 7 commits into from Feb 24, 2022

Conversation

vkalintiris
Copy link
Contributor

Summary

This commit adds support for tracking anomaly rates with DBEngine. We
do so by creating a single chart with id "anomaly_detection.anomaly_rates" for
each trainable/predictable host, which is responsible for tracking the anomaly
rate of each dimension that we train/predict for that host.

The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly
rates charts. We use this flag to:

- Disable exposing the anomaly rates charts through the functionality
  in backends/, exporting/ and streaming/.
- Skip generation of configuration options for the name, algorithm,
  multiplier, divisor of each dimension in an anomaly rates chart.
- Skip the creation of health variables for anomaly rates dimensions.
- Skip the chart/dim queue of ACLK.
- Post-process the RRDR result of an anomaly rates chart, so that we can
  return a sorted, trimmed number of anomalous dimensions.

In a child/parent configuration where both the child and the parent run
ML for the child, we want to be able to stream the rest of the ML-related
charts to the parent. To be able to do this without any chart name collisions,
the charts are now created on localhost and their IDs and titles have the node's
machine_guid and hostname as a suffix, respectively.

Test Plan
  • Local & CI builds.
  • Staging once merged.

ml/Config.cc Outdated
@@ -22,7 +22,7 @@ static T clamp(const T& Value, const T& Min, const T& Max) {
void Config::readMLConfig(void) {
const char *ConfigSectionML = CONFIG_SECTION_ML;

bool EnableAnomalyDetection = config_get_boolean(ConfigSectionML, "enabled", false);
bool EnableAnomalyDetection = config_get_boolean(ConfigSectionML, "enabled", true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the changes in this file are meant to make testing easy and they will be reverted prior to merging the PR. ML is enabled and detection/prediction runs every minute just for the system.cpu chart. The anomaly rates are updated every 5 seconds.

@lgtm-com
Copy link

lgtm-com bot commented Feb 3, 2022

This pull request introduces 1 alert when merging df780a7 into 4d9ccc1 - view on LGTM.com

new alerts:

  • 1 for Declaration hides parameter

@MrZammler
Copy link
Contributor

My concern as we discussed briefly. If the anomaly rate chart is not streamed to a parent (that is claimed) and the child is not claimed, I don't know if it's possible for the cloud to query for this data the parent on behalf of the child.

The routing as referred by the cloud backend, is that when they want to get data for a child through a parent, they issue a query like '/node/XXXXXXXX/api/v1/something' on the parent, but those data are stored on the parent and it can reply for those. @stelfrag / @underhood if I'm wrong on this maybe?

It's kinda like alerts, i.e. they can get alerts for a child, but only if the parent has calculated them (not the ones calculated by the child).

@MrZammler
Copy link
Contributor

MrZammler commented Feb 8, 2022

Hi Vasili!

We seem at first glance, to be sending the chart to the cloud (this is via the new arch - just a single claimed node):

image

I think we'd like not to transmit to cloud, right? Will check for a way, will let you know.

@vkalintiris
Copy link
Contributor Author

We seem at first glance, to be sending the chart to the cloud (this is via the new arch - just a single claimed node):

We have two charts:

  • "anomaly_detection.anomaly_rate": contains 1 dim with the anomaly rate of the host, ie. num_anomalous_dims / num_total_dims.
  • "anomaly_detection.anomaly_rates": contains many dimensions, ie. one for each trainable dimension.

@lgtm-com
Copy link

lgtm-com bot commented Feb 14, 2022

This pull request introduces 1 alert when merging 0d99169 into a6d3de2 - view on LGTM.com

new alerts:

  • 1 for Declaration hides parameter

@underhood
Copy link
Contributor

had it seqfault:

#0  0x000000000044f59c in rrdset2anything_api_v1 (st=st@entry=0x9c1b2c0, wb=0x6ae7df0, dimensions=dimensions@entry=0x0, format=format@entry=0, points=points@entry=432, after=after@entry=-900, before=0, group_method=1, group_time=0, 
    options=547, latest_timestamp=0x7f59605c17f0, context_param_list=0x6c6d550, chart_label_key=0x0, max_anomaly_rates=0) at web/api/formatters/rrd2json.c:232
#1  0x0000000000456ab8 in web_client_api_request_v1_data (host=0x2efe130, w=0xb9e7e40, url=<optimized out>) at web/api/web_api_v1.c:614
#2  0x0000000000457f01 in web_client_api_request_v1 (host=host@entry=0x2efe130, w=w@entry=0xb9e7e40, url=url@entry=0xb98c0f2 "data") at web/api/web_api_v1.c:1320

Copy link
Contributor

@underhood underhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above comments

@vkalintiris
Copy link
Contributor Author

had it seqfault:

Can you post the request so that I can take a look?

@underhood
Copy link
Contributor

Was a cloud request when i scrolled to the anomaly chart which failed to load. Dont have anymore the exact request but i can try reproduce.

@vkalintiris
Copy link
Contributor Author

Was a cloud request when i scrolled to the anomaly chart which failed to load. Dont have anymore the exact request but i can try reproduce.

That makes me even more curious about what happened, can you reproduce this? On cloud, we only show the host's "anomaly rate" chart (ie. not the "anomaly rates" as described here #12083 (comment)).

@underhood
Copy link
Contributor

underhood commented Feb 14, 2022

seems to crash on this:

0x63f3a60 "GET /node/e41348ff-f917-4a58-9ee6-189891f9ec37/api/v1/data?v2=&chart=anomaly_detection.dimensions&format=json&options=jsonwrap%7Cnonzero%7Cflip%7Cms&_=1644856250370&points=432&group=average&gtime=0&after=-900 HTTP/1.1\r\nHost: cloud-agent-data-ctrl-service.app\r\nAccept-Encoding: gzip\r\nMqtt-Broker-Addr: vernemq-1.vernemq-headless.infra:1883\r\nNetdata-Request-Id: 8YCwXP0cSO-2772813\r\nUser-Agent: Go-http-client/2.0\r\n\r\n"

here
Screenshot from 2022-02-14 17-33-19

Steps to reproduce:

  1. did clean build from this PR/branch
  2. netdata -D
  3. wait agent is connected to staging
  4. go to single node view of this node - all charts load
  5. click anomaly group shortcut in menu on the right side
  6. fireworks (agent crashed, on cloud chart never loads, timeouts and node is updated to be offline)

@andrewm4894
Copy link
Contributor

andrewm4894 commented Feb 14, 2022

i think the custom suffix on chart names that this adds breaks the overview screen on cloud:

image

in cloud i then end up with a new chart per suffix (as opposed to them just all flowing into the same composite chart):

image

@andrewm4894
Copy link
Contributor

my comment above would explain the crash @underhood mentions since chart=anomaly_detection.dimensions does not exist but chart=anomaly_detection.dimensions_<some random number> would exist.

e.g. on one of my nodes it's anomaly_detection.dimensions_ba9c3782-8da5-11ec-9aee-42010a8e003c

image

@vkalintiris
Copy link
Contributor Author

@andrewm4894 & @juacker

Let's say that you have a child that runs ML for itself. The child will create a number of anomaly_detection.* charts.

Suppose now that the child streams to the parent. The parent will be aware of the child's anomaly_detection.* charts. If the parent has been configured to run ML for its children, then it needs to create a new set of anomaly_detection.* charts (one set for each child) that provide ML-related information about the child's training/prediction.

If we don't have a unique name for the newly created charts on the parent, then there will be a collision of chart names. That is, a parent will receive the child's anomaly_detection.dimension chart through streaming, and then try to add its own data to the same anomaly_detection.dimension chart if it runs ML for the child.

This PR addresses this by creating the anomaly_detection.* charts with a machine GUID on the localhost. This means that the child will have create a anomaly_detection.dimensions_<machine_guid_of_child> chart and stream it to the parent. The chart will be accessible from the parent for web requests that specify child's host. If the parent has been configured to run ML for its children, a new chart will be created on localhost with the same name, ie. anomaly_detection.dimensions_<machine_guid_of_child> and it will be accessible directly from the parent without specifying the child's host`.

I don't know what is the proper way to handle this on the cloud-side, ie. how the cloud aggregates these charts in the overview section.

@andrewm4894
Copy link
Contributor

@vkalintiris would we be able to remove the chart renaming to unique from this PR and handle it separately. My understanding is that doing so will not break anything and then we can tackle the parent/child and renaming stuff separate maybe. Will think about it but i can see lots of potential unintended consequences of adding the suffix on the agent side and it just does not feel like an elegant solution to me.

Ideally maybe cloud should be able to figure it all out would be my hope since it should know what it needs about each node instance. If it needs to know more to handle this specific case then we should add what it needs to the ml-info endpoint instead of sort of breaking the naming conventions of charts.

@andrewm4894
Copy link
Contributor

If we don't have a unique name for the newly created charts on the parent, then there will be a collision of chart names. That is, a parent will receive the child's anomaly_detection.dimension chart through streaming, and then try to add its own data to the same anomaly_detection.dimension chart if it runs ML for the child.

Could the parent not just handle this and just overwrite the anomaly_detection. charts for now. Such that if you configure ml on a parent then it will be those anomaly_detection. charts based on the parent ml config that you see.

@lgtm-com
Copy link

lgtm-com bot commented Feb 15, 2022

This pull request introduces 1 alert when merging b9d493f into e7102bd - view on LGTM.com

new alerts:

  • 1 for Declaration hides parameter

@andrewm4894
Copy link
Contributor

andrewm4894 commented Feb 17, 2022

this is looking good to me now and seems to be working as expected

@lgtm-com
Copy link

lgtm-com bot commented Feb 22, 2022

This pull request introduces 1 alert when merging f013af4 into 4d07506 - view on LGTM.com

new alerts:

  • 1 for Declaration hides parameter

andrewm4894
andrewm4894 previously approved these changes Feb 22, 2022
@thiagoftsm
Copy link
Contributor

Hello @vkalintiris ,

Please, take a look at this LGTM report.

@MrZammler
Copy link
Contributor

For the ACLK part, I think it's ok. I got the chart in http://localhost:19999/api/v1/data?chart=anomaly_detection.anomaly_rates, but there's no sign of it in messages to the cloud.

This commit adds support for tracking anomaly rates with DBEngine. We
do so by creating a single chart with id "anomaly_detection.anomaly_rates" for
each trainable/predictable host, which is responsible for tracking the anomaly
rate of each dimension that we train/predict for that host.

The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly
rates charts. We use this flag to:

    - Disable exposing the anomaly rates charts through the functionality
      in backends/, exporting/ and streaming/.
    - Skip generation of configuration options for the name, algorithm,
      multiplier, divisor of each dimension in an anomaly rates chart.
    - Skip the creation of health variables for anomaly rates dimensions.
    - Skip the chart/dim queue of ACLK.
    - Post-process the RRDR result of an anomaly rates chart, so that we can
      return a sorted, trimmed number of anomalous dimensions.

In a child/parent configuration where both the child and the parent run
ML for the child, we want to be able to stream the rest of the ML-related
charts to the parent. To be able to do this without any chart name collisions,
the charts are now created on localhost and their IDs and titles have the node's
machine_guid and hostname as a suffix, respectively.
The reverted changes where meant for local testing only. This commit
restores the default values that we want to have when someone runs
anomaly detection on their node.
@vkalintiris
Copy link
Contributor Author

I forgot to add an empty commit after rebasing the branch. These are the only new changes I added based on review feedback:
changes

@andrewm4894
Copy link
Contributor

@vkalintiris any sense in trying to include #12212 in this branch at this stage? or can wait?

@vkalintiris
Copy link
Contributor Author

@andrewm4894 I believe the reviewers are pretty much okay to accept this after addressing the latest comments. @thiagoftsm @MrZammler @underhood: what do you think?

@thiagoftsm
Copy link
Contributor

@andrewm4894 I believe the reviewers are pretty much okay to accept this after addressing the latest comments. @thiagoftsm @MrZammler @underhood: what do you think?

I agree 100%.
In time, I already tested this PR and it is working as expected.

Copy link
Contributor

@andrewm4894 andrewm4894 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@vkalintiris vkalintiris merged commit 69ea17d into netdata:master Feb 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants