
Skip training of constant metrics. #12212

Merged
merged 1 commit into netdata:master on Feb 23, 2022

Conversation

vkalintiris
Contributor

Summary

Detect dimensions whose values do not change, and exclude them from
training. This allows us to reduce the number of training operations
by ~40-50%.

Note that we never skip the very first training iteration, because a
dimension's value might start changing at any point in time, and we
need a trained model in order to compute its anomaly score.
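
A minimal sketch of the idea, with hypothetical names (the actual implementation lives in netdata's ML code; this is only an illustration):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the actual netdata code: a dimension whose
// recent values are all identical is considered constant.
static bool is_constant(const std::vector<double> &values) {
    for (std::size_t i = 1; i < values.size(); i++)
        if (values[i] != values[0])
            return false;
    return true;
}

// Decide whether to (re)train the model for a dimension.
static bool should_train(const std::vector<double> &values, bool has_model) {
    // Never skip the very first training iteration: a model must exist
    // so we can compute an anomaly score if the values start changing.
    if (!has_model)
        return true;
    // After that, skip retraining while the values remain constant.
    return !is_constant(values);
}
```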

Test Plan
  • Custom log calls to keep track of which dimensions are skipped vs. trained.
  • @andrewm4894 will run on staging.
  • CI jobs
Additional Information

Resolves #12180

@vkalintiris added the area/ml (Machine Learning Related Issues) label Feb 22, 2022
@andrewm4894
Contributor

I'll test this today on some devml nodes. I was going to ask about extending this idea to flag when a dimension is too unique or lumpy, in some sense, for retraining its model to be worthwhile. But that would probably get too complicated for now, since it means defining some measure of uniqueness. I like the simple implementation here, which I imagine has hardly any cost impact, whereas maintaining some more complicated measure to decide whether to retrain would just add complexity at this point.

So I think this is a great, easy initial optimization that will remove a good chunk of unnecessary computation on the average node, which will always have a large subset of dimensions holding a constant value.

In future we may define some concept of upfront "dimension data validation" checks and use those to decide when to retrain, but for now it makes total sense to start simple like this.
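
Purely as a hypothetical illustration of the kind of "dimension data validation" check mentioned above (names and thresholds are invented), one could gate retraining on the coefficient of variation of the recent values:

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch only: flag whether a dimension's recent values
// look "interesting" enough to justify retraining. The thresholds are
// invented for illustration.
static bool worth_retraining(const std::vector<double> &values) {
    if (values.size() < 2)
        return false; // too little data to judge; skip in this sketch

    double sum = 0.0;
    for (double v : values)
        sum += v;
    const double mean = sum / values.size();

    double var = 0.0;
    for (double v : values)
        var += (v - mean) * (v - mean);
    const double stddev = std::sqrt(var / values.size());

    // With a zero mean, fall back to plain variability.
    if (mean == 0.0)
        return stddev != 0.0;

    // Coefficient of variation: skip near-constant (too small) and
    // wildly erratic, "too lumpy" (too large) dimensions.
    const double cv = stddev / std::fabs(mean);
    return cv > 1e-6 && cv < 10.0;
}
```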

@andrewm4894
Contributor

andrewm4894 commented Feb 22, 2022

This does indeed look like it saves a lot of netdata CPU usage on retraining, at least on a node running with default settings and not doing much. How much this generalizes will of course depend on the specific node, but it can only ever help, and potentially quite a lot.

I'll leave my dev node running for a few more hours and follow up tomorrow, but so far it all looks good to me.

[screenshots: netdata CPU usage charts]

@andrewm4894
Contributor

You can clearly see the impact below.

Using this branch (smooth CPU usage after the initial training):

[screenshot: CPU usage chart]

vs. some other nodes with default settings (a small CPU bump is visible at each retraining window):

[screenshots: CPU usage charts]

@vkalintiris
Contributor Author

> You can clearly see the impact below. [quoted screenshots]

Thanks for the feedback @andrewm4894. Out of curiosity, could you share a screenshot of apps.mem.netdata with and without this optimization?

@andrewm4894
Contributor

major improvement in terms of CPU overhead!

@andrewm4894
Contributor

@vkalintiris devml4 is the one with this branch - seems much lower than the others too:
[screenshot: CPU usage across devml nodes]

@vkalintiris merged commit 207a743 into netdata:master on Feb 23, 2022
@ktsaou
Member

ktsaou commented Feb 23, 2022

Great work! You rock @vkalintiris!

@ktsaou
Member

ktsaou commented Feb 23, 2022

Do we have a dedicated chart for the training thread's CPU consumption? If not, I think we should add one. The code has a lot of examples of similar threads.

@vkalintiris
Contributor Author

> Do we have a dedicated chart for the training thread's CPU consumption?

No, we only have the chart for total time spent on training, plus apps.cpu.netdata.

> If not, I think we should add one.

I'll create a PR from a local branch now that #12083 got merged.
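
For context: on Linux, a thread can sample its own CPU time with getrusage(RUSAGE_THREAD), which is the kind of number such a dedicated chart would plot. The sketch below is self-contained and illustrative only; it does not use netdata's internal charting API:

```cpp
#include <sys/resource.h>
#include <cstdio>

// Illustrative only: sample the calling thread's accumulated CPU time.
// RUSAGE_THREAD is Linux-specific (glibc).
static double thread_cpu_seconds() {
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return -1.0;
    const double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    const double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    return user + sys;
}

int main() {
    // A training thread would sample this periodically and feed the
    // delta into a chart.
    std::printf("thread CPU so far: %.6f s\n", thread_cpu_seconds());
    return 0;
}
```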
