
Skip training of constant metrics. #12212

Merged
merged 1 commit into netdata:master on Feb 23, 2022

Conversation

vkalintiris
Contributor

Summary

Detect dimensions whose values do not change, and exclude them from
training. This allows us to reduce the number of training operations
by ~40-50%.

Note that we never skip the very first training iteration, because a
dimension's value might start changing at any point in time, and we
need a trained model in order to compute its anomaly score.
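
A minimal sketch of the idea, with hypothetical names (the actual implementation lives in netdata's ML code; this is only an illustration):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the actual netdata code: a dimension whose
// recent values are all identical is considered constant.
static bool is_constant(const std::vector<double> &values) {
    for (std::size_t i = 1; i < values.size(); i++)
        if (values[i] != values[0])
            return false;
    return true;
}

// Decide whether to (re)train the model for a dimension.
static bool should_train(const std::vector<double> &values, bool has_model) {
    // Never skip the very first training iteration: a model must exist
    // so we can compute an anomaly score if the values start changing.
    if (!has_model)
        return true;
    // After that, skip retraining while the values remain constant.
    return !is_constant(values);
}
```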

Test Plan
  • Custom log calls to keep track of which dimensions are skipped vs. trained.
  • @andrewm4894 will run on staging.
  • CI jobs
Additional Information

Resolves #12180

@vkalintiris added the area/ml (Machine Learning Related Issues) label Feb 22, 2022
@andrewm4894
Contributor

I'll test this today on some devml nodes. I was going to ask about extending this idea to flag when a dimension is too unique or lumpy, in some sense, for retraining its model to be worthwhile. But that would probably get too complicated for now, since it means defining some measure of uniqueness. I like the simple implementation here, which I imagine has hardly any cost impact, whereas maintaining some more complicated measure to decide whether to retrain would just add complexity at this point.

So I think this is a great, easy initial optimization that will remove a good chunk of unnecessary computation on the average node, which will always have a large subset of dimensions holding a constant value.

In future we may define some concept of upfront "dimension data validation" checks and use those to decide when to retrain, but for now it makes total sense to start simple like this.
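
Purely as a hypothetical illustration of the kind of "dimension data validation" check mentioned above (names and thresholds are invented), one could gate retraining on the coefficient of variation of the recent values:

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch only: flag whether a dimension's recent values
// look "interesting" enough to justify retraining. The thresholds are
// invented for illustration.
static bool worth_retraining(const std::vector<double> &values) {
    if (values.size() < 2)
        return false; // too little data to judge; skip in this sketch

    double sum = 0.0;
    for (double v : values)
        sum += v;
    const double mean = sum / values.size();

    double var = 0.0;
    for (double v : values)
        var += (v - mean) * (v - mean);
    const double stddev = std::sqrt(var / values.size());

    // With a zero mean, fall back to plain variability.
    if (mean == 0.0)
        return stddev != 0.0;

    // Coefficient of variation: skip near-constant (too small) and
    // wildly erratic, "too lumpy" (too large) dimensions.
    const double cv = stddev / std::fabs(mean);
    return cv > 1e-6 && cv < 10.0;
}
```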

@andrewm4894
Contributor

andrewm4894 commented Feb 22, 2022

This does indeed look like it saves a lot of netdata CPU usage on retraining, at least on a node running with default settings and not doing much. How much this generalizes will of course depend on the specific node, but it can only ever help, and potentially quite a lot.

I'll leave my dev node running for a few more hours and follow up tomorrow, but so far it all looks good to me.

[screenshots: netdata CPU usage charts]

@andrewm4894
Contributor

You can clearly see the impact below.

Using this branch (smooth CPU usage after the initial training):

[screenshot: CPU usage chart]

vs. some other nodes with default settings (a small CPU bump is visible at each retraining window):

[screenshots: CPU usage charts]

@vkalintiris
Contributor Author

> You can clearly see the impact below. [quoted screenshots]

Thanks for the feedback @andrewm4894. Out of curiosity, could you share a screenshot of apps.mem.netdata with and without this optimization?

@andrewm4894
Contributor

major improvement in terms of CPU overhead!

@andrewm4894
Contributor

@vkalintiris devml4 is the one with this branch - seems much lower than the others too:
[screenshot: CPU usage across devml nodes]

@vkalintiris merged commit 207a743 into netdata:master on Feb 23, 2022
@ktsaou
Member

ktsaou commented Feb 23, 2022

Great work! You rock @vkalintiris!

@ktsaou
Member

ktsaou commented Feb 23, 2022

Do we have a dedicated chart for the training thread's CPU consumption? If not, I think we should add one. The code has a lot of examples of similar threads.

@vkalintiris
Contributor Author

> Do we have a dedicated chart for the training thread's CPU consumption?

No, we only have the chart for total time spent on training, plus apps.cpu.netdata.

> If not, I think we should add one.

I'll create a PR from a local branch now that #12083 got merged.
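
For context: on Linux, a thread can sample its own CPU time with getrusage(RUSAGE_THREAD), which is the kind of number such a dedicated chart would plot. The sketch below is self-contained and illustrative only; it does not use netdata's internal charting API:

```cpp
#include <sys/resource.h>
#include <cstdio>

// Illustrative only: sample the calling thread's accumulated CPU time.
// RUSAGE_THREAD is Linux-specific (glibc).
static double thread_cpu_seconds() {
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return -1.0;
    const double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    const double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    return user + sys;
}

int main() {
    // A training thread would sample this periodically and feed the
    // delta into a chart.
    std::printf("thread CPU so far: %.6f s\n", thread_cpu_seconds());
    return 0;
}
```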
