Track anomaly rates with DBEngine. #12083

vkalintiris · 2022-02-03T15:52:23Z

Summary

This commit adds support for tracking anomaly rates with DBEngine. We
do so by creating a single chart with id "anomaly_detection.anomaly_rates" for
each trainable/predictable host, which is responsible for tracking the anomaly
rate of each dimension that we train/predict for that host.

The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly
rates charts. We use this flag to:

- Disable exposing the anomaly rates charts through the functionality
  in backends/, exporting/ and streaming/.
- Skip generation of configuration options for the name, algorithm,
  multiplier, divisor of each dimension in an anomaly rates chart.
- Skip the creation of health variables for anomaly rates dimensions.
- Skip the chart/dim queue of ACLK.
- Post-process the RRDR result of an anomaly rates chart, so that we can
  return a sorted, trimmed number of anomalous dimensions.

In a child/parent configuration where both the child and the parent run
ML for the child, we want to be able to stream the rest of the ML-related
charts to the parent. To be able to do this without any chart name collisions,
the charts are now created on localhost and their IDs and titles have the node's
machine_guid and hostname as a suffix, respectively.

Test Plan

Local & CI builds.
Staging once merged.

vkalintiris · 2022-02-03T16:13:14Z

ml/Config.cc

@@ -22,7 +22,7 @@ static T clamp(const T& Value, const T& Min, const T& Max) {
 void Config::readMLConfig(void) {
    const char *ConfigSectionML = CONFIG_SECTION_ML;

-    bool EnableAnomalyDetection = config_get_boolean(ConfigSectionML, "enabled", false);
+    bool EnableAnomalyDetection = config_get_boolean(ConfigSectionML, "enabled", true);


Most of the changes in this file are meant to make testing easy and they will be reverted prior to merging the PR. ML is enabled and detection/prediction runs every minute just for the system.cpu chart. The anomaly rates are updated every 5 seconds.

lgtm-com · 2022-02-03T16:53:25Z

This pull request introduces 1 alert when merging df780a7 into 4d9ccc1 - view on LGTM.com

new alerts:

1 for Declaration hides parameter

MrZammler · 2022-02-04T09:30:38Z

My concern as we discussed briefly. If the anomaly rate chart is not streamed to a parent (that is claimed) and the child is not claimed, I don't know if it's possible for the cloud to query for this data the parent on behalf of the child.

The routing as referred by the cloud backend, is that when they want to get data for a child through a parent, they issue a query like '/node/XXXXXXXX/api/v1/something' on the parent, but those data are stored on the parent and it can reply for those. @stelfrag / @underhood if I'm wrong on this maybe?

It's kinda like alerts, i.e. they can get alerts for a child, but only if the parent has calculated them (not the ones calculated by the child).

MrZammler · 2022-02-08T14:46:34Z

Hi Vasili!

We seem at first glance, to be sending the chart to the cloud (this is via the new arch - just a single claimed node):

I think we'd like not to transmit to cloud, right? Will check for a way, will let you know.

vkalintiris · 2022-02-08T15:51:27Z

We seem at first glance, to be sending the chart to the cloud (this is via the new arch - just a single claimed node):

We have two charts:

"anomaly_detection.anomaly_rate": contains 1 dim with the anomaly rate of the host, ie. num_anomalous_dims / num_total_dims.
"anomaly_detection.anomaly_rates": contains many dimensions, ie. one for each trainable dimension.

lgtm-com · 2022-02-14T11:07:04Z

This pull request introduces 1 alert when merging 0d99169 into a6d3de2 - view on LGTM.com

new alerts:

1 for Declaration hides parameter

underhood · 2022-02-14T15:26:33Z

had it seqfault:

#0  0x000000000044f59c in rrdset2anything_api_v1 (st=st@entry=0x9c1b2c0, wb=0x6ae7df0, dimensions=dimensions@entry=0x0, format=format@entry=0, points=points@entry=432, after=after@entry=-900, before=0, group_method=1, group_time=0, 
    options=547, latest_timestamp=0x7f59605c17f0, context_param_list=0x6c6d550, chart_label_key=0x0, max_anomaly_rates=0) at web/api/formatters/rrd2json.c:232
#1  0x0000000000456ab8 in web_client_api_request_v1_data (host=0x2efe130, w=0xb9e7e40, url=<optimized out>) at web/api/web_api_v1.c:614
#2  0x0000000000457f01 in web_client_api_request_v1 (host=host@entry=0x2efe130, w=w@entry=0xb9e7e40, url=url@entry=0xb98c0f2 "data") at web/api/web_api_v1.c:1320

underhood

above comments

vkalintiris · 2022-02-14T16:03:01Z

had it seqfault:

Can you post the request so that I can take a look?

underhood · 2022-02-14T16:22:24Z

Was a cloud request when i scrolled to the anomaly chart which failed to load. Dont have anymore the exact request but i can try reproduce.

vkalintiris · 2022-02-14T16:28:35Z

Was a cloud request when i scrolled to the anomaly chart which failed to load. Dont have anymore the exact request but i can try reproduce.

That makes me even more curious about what happened, can you reproduce this? On cloud, we only show the host's "anomaly rate" chart (ie. not the "anomaly rates" as described here #12083 (comment)).

underhood · 2022-02-14T16:34:46Z

seems to crash on this:

0x63f3a60 "GET /node/e41348ff-f917-4a58-9ee6-189891f9ec37/api/v1/data?v2=&chart=anomaly_detection.dimensions&format=json&options=jsonwrap%7Cnonzero%7Cflip%7Cms&_=1644856250370&points=432&group=average&gtime=0&after=-900 HTTP/1.1\r\nHost: cloud-agent-data-ctrl-service.app\r\nAccept-Encoding: gzip\r\nMqtt-Broker-Addr: vernemq-1.vernemq-headless.infra:1883\r\nNetdata-Request-Id: 8YCwXP0cSO-2772813\r\nUser-Agent: Go-http-client/2.0\r\n\r\n"

here

Steps to reproduce:

did clean build from this PR/branch
netdata -D
wait agent is connected to staging
go to single node view of this node - all charts load
click anomaly group shortcut in menu on the right side
fireworks (agent crashed, on cloud chart never loads, timeouts and node is updated to be offline)

andrewm4894 · 2022-02-14T17:08:27Z

i think the custom suffix on chart names that this adds breaks the overview screen on cloud:

in cloud i then end up with a new chart per suffix (as opposed to them just all flowing into the same composite chart):

andrewm4894 · 2022-02-14T17:11:17Z

my comment above would explain the crash @underhood mentions since chart=anomaly_detection.dimensions does not exist but chart=anomaly_detection.dimensions_<some random number> would exist.

e.g. on one of my nodes it's anomaly_detection.dimensions_ba9c3782-8da5-11ec-9aee-42010a8e003c

vkalintiris · 2022-02-15T09:43:32Z

@andrewm4894 & @juacker

Let's say that you have a child that runs ML for itself. The child will create a number of anomaly_detection.* charts.

Suppose now that the child streams to the parent. The parent will be aware of the child's anomaly_detection.* charts. If the parent has been configured to run ML for its children, then it needs to create a new set of anomaly_detection.* charts (one set for each child) that provide ML-related information about the child's training/prediction.

If we don't have a unique name for the newly created charts on the parent, then there will be a collision of chart names. That is, a parent will receive the child's anomaly_detection.dimension chart through streaming, and then try to add its own data to the same anomaly_detection.dimension chart if it runs ML for the child.

This PR addresses this by creating the anomaly_detection.* charts with a machine GUID on the localhost. This means that the child will have create a anomaly_detection.dimensions_<machine_guid_of_child> chart and stream it to the parent. The chart will be accessible from the parent for web requests that specify child's host. If the parent has been configured to run ML for its children, a new chart will be created on localhost with the same name, ie. anomaly_detection.dimensions_<machine_guid_of_child> and it will be accessible directly from the parent without specifying the child's host`.

I don't know what is the proper way to handle this on the cloud-side, ie. how the cloud aggregates these charts in the overview section.

andrewm4894 · 2022-02-15T09:48:37Z

@vkalintiris would we be able to remove the chart renaming to unique from this PR and handle it separately. My understanding is that doing so will not break anything and then we can tackle the parent/child and renaming stuff separate maybe. Will think about it but i can see lots of potential unintended consequences of adding the suffix on the agent side and it just does not feel like an elegant solution to me.

Ideally maybe cloud should be able to figure it all out would be my hope since it should know what it needs about each node instance. If it needs to know more to handle this specific case then we should add what it needs to the ml-info endpoint instead of sort of breaking the naming conventions of charts.

andrewm4894 · 2022-02-15T09:51:25Z

If we don't have a unique name for the newly created charts on the parent, then there will be a collision of chart names. That is, a parent will receive the child's anomaly_detection.dimension chart through streaming, and then try to add its own data to the same anomaly_detection.dimension chart if it runs ML for the child.

Could the parent not just handle this and just overwrite the anomaly_detection. charts for now. Such that if you configure ml on a parent then it will be those anomaly_detection. charts based on the parent ml config that you see.

lgtm-com · 2022-02-15T14:09:47Z

This pull request introduces 1 alert when merging b9d493f into e7102bd - view on LGTM.com

new alerts:

1 for Declaration hides parameter

andrewm4894 · 2022-02-17T09:51:58Z

this is looking good to me now and seems to be working as expected

lgtm-com · 2022-02-22T10:15:50Z

This pull request introduces 1 alert when merging f013af4 into 4d07506 - view on LGTM.com

new alerts:

1 for Declaration hides parameter

thiagoftsm · 2022-02-22T14:50:44Z

Hello @vkalintiris ,

Please, take a look at this LGTM report.

MrZammler · 2022-02-22T15:17:17Z

For the ACLK part, I think it's ok. I got the chart in http://localhost:19999/api/v1/data?chart=anomaly_detection.anomaly_rates, but there's no sign of it in messages to the cloud.

This commit adds support for tracking anomaly rates with DBEngine. We do so by creating a single chart with id "anomaly_detection.anomaly_rates" for each trainable/predictable host, which is responsible for tracking the anomaly rate of each dimension that we train/predict for that host. The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly rates charts. We use this flag to: - Disable exposing the anomaly rates charts through the functionality in backends/, exporting/ and streaming/. - Skip generation of configuration options for the name, algorithm, multiplier, divisor of each dimension in an anomaly rates chart. - Skip the creation of health variables for anomaly rates dimensions. - Skip the chart/dim queue of ACLK. - Post-process the RRDR result of an anomaly rates chart, so that we can return a sorted, trimmed number of anomalous dimensions. In a child/parent configuration where both the child and the parent run ML for the child, we want to be able to stream the rest of the ML-related charts to the parent. To be able to do this without any chart name collisions, the charts are now created on localhost and their IDs and titles have the node's machine_guid and hostname as a suffix, respectively.

The reverted changes where meant for local testing only. This commit restores the default values that we want to have when someone runs anomaly detection on their node.

vkalintiris · 2022-02-23T09:55:39Z

I forgot to add an empty commit after rebasing the branch. These are the only new changes I added based on review feedback:

andrewm4894 · 2022-02-23T14:48:36Z

@vkalintiris any sense in trying to include #12212 in this branch at this stage? or can wait?

vkalintiris · 2022-02-23T14:50:58Z

@andrewm4894 I believe the reviewers are pretty much okay to accept this after addressing the latest comments. @thiagoftsm @MrZammler @underhood: what do you think?

thiagoftsm · 2022-02-23T17:39:33Z

@andrewm4894 I believe the reviewers are pretty much okay to accept this after addressing the latest comments. @thiagoftsm @MrZammler @underhood: what do you think?

I agree 100%.
In time, I already tested this PR and it is working as expected.

andrewm4894

Lgtm

vkalintiris added the area/ml Machine Learning Related Issues label Feb 3, 2022

vkalintiris requested review from MrZammler, andrewm4894 and underhood February 3, 2022 15:52

vkalintiris requested review from siamaktavakoli, thiagoftsm and vlvkobal as code owners February 3, 2022 15:52

github-actions bot added area/backends area/database area/exporting area/streaming area/web labels Feb 3, 2022

vkalintiris removed the request for review from siamaktavakoli February 3, 2022 15:54

vkalintiris commented Feb 3, 2022

View reviewed changes

vkalintiris force-pushed the db-engine-anomaly-rates branch from feb5025 to 0d99169 Compare February 14, 2022 10:37

underhood suggested changes Feb 14, 2022

View reviewed changes

vkalintiris force-pushed the db-engine-anomaly-rates branch from b9d493f to f013af4 Compare February 22, 2022 09:46

vkalintiris requested a review from underhood February 22, 2022 11:15

andrewm4894 previously approved these changes Feb 22, 2022

View reviewed changes

vkalintiris added 7 commits February 23, 2022 11:35

Fix exporting_engine tests.

8166da6

Restore default ML configuration.

3534eab

The reverted changes where meant for local testing only. This commit restores the default values that we want to have when someone runs anomaly detection on their node.

Set context for anomaly_detection.* charts.

2cc3917

Check for anomaly rates chart only with a valid pointer.

c142764

Remove duplicate code.

238d995

Use a more descriptive name for id/title pair variable

b8c0b21

vkalintiris dismissed andrewm4894’s stale review via b8c0b21 February 23, 2022 09:53

vkalintiris force-pushed the db-engine-anomaly-rates branch from f013af4 to b8c0b21 Compare February 23, 2022 09:53

andrewm4894 approved these changes Feb 23, 2022

View reviewed changes

thiagoftsm approved these changes Feb 23, 2022

View reviewed changes

MrZammler approved these changes Feb 24, 2022

View reviewed changes

underhood approved these changes Feb 24, 2022

View reviewed changes

vkalintiris merged commit 69ea17d into netdata:master Feb 24, 2022

vkalintiris mentioned this pull request Feb 24, 2022

Skip training of constant metrics. #12212

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Track anomaly rates with DBEngine. #12083

Track anomaly rates with DBEngine. #12083

vkalintiris commented Feb 3, 2022

vkalintiris Feb 3, 2022

lgtm-com bot commented Feb 3, 2022

MrZammler commented Feb 4, 2022

MrZammler commented Feb 8, 2022 •

edited

vkalintiris commented Feb 8, 2022

lgtm-com bot commented Feb 14, 2022

underhood commented Feb 14, 2022

underhood left a comment

vkalintiris commented Feb 14, 2022

underhood commented Feb 14, 2022

vkalintiris commented Feb 14, 2022

underhood commented Feb 14, 2022 •

edited

andrewm4894 commented Feb 14, 2022 •

edited

andrewm4894 commented Feb 14, 2022

vkalintiris commented Feb 15, 2022

andrewm4894 commented Feb 15, 2022

andrewm4894 commented Feb 15, 2022

lgtm-com bot commented Feb 15, 2022

andrewm4894 commented Feb 17, 2022 •

edited

lgtm-com bot commented Feb 22, 2022

thiagoftsm commented Feb 22, 2022

MrZammler commented Feb 22, 2022

vkalintiris commented Feb 23, 2022

andrewm4894 commented Feb 23, 2022

vkalintiris commented Feb 23, 2022

thiagoftsm commented Feb 23, 2022

andrewm4894 left a comment

Track anomaly rates with DBEngine. #12083

Track anomaly rates with DBEngine. #12083

Conversation

vkalintiris commented Feb 3, 2022

Summary

Test Plan

vkalintiris Feb 3, 2022

Choose a reason for hiding this comment

lgtm-com bot commented Feb 3, 2022

MrZammler commented Feb 4, 2022

MrZammler commented Feb 8, 2022 • edited

vkalintiris commented Feb 8, 2022

lgtm-com bot commented Feb 14, 2022

underhood commented Feb 14, 2022

underhood left a comment

Choose a reason for hiding this comment

vkalintiris commented Feb 14, 2022

underhood commented Feb 14, 2022

vkalintiris commented Feb 14, 2022

underhood commented Feb 14, 2022 • edited

andrewm4894 commented Feb 14, 2022 • edited

andrewm4894 commented Feb 14, 2022

vkalintiris commented Feb 15, 2022

andrewm4894 commented Feb 15, 2022

andrewm4894 commented Feb 15, 2022

lgtm-com bot commented Feb 15, 2022

andrewm4894 commented Feb 17, 2022 • edited

lgtm-com bot commented Feb 22, 2022

thiagoftsm commented Feb 22, 2022

MrZammler commented Feb 22, 2022

vkalintiris commented Feb 23, 2022

andrewm4894 commented Feb 23, 2022

vkalintiris commented Feb 23, 2022

thiagoftsm commented Feb 23, 2022

andrewm4894 left a comment

Choose a reason for hiding this comment

MrZammler commented Feb 8, 2022 •

edited

underhood commented Feb 14, 2022 •

edited

andrewm4894 commented Feb 14, 2022 •

edited

andrewm4894 commented Feb 17, 2022 •

edited