Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a PrometheusRule for kata monitor #189

Merged
merged 2 commits into from Jun 1, 2022

Conversation

littlejawa
Copy link
Contributor

- Description of the problem which is fixed/What is the use case
We want to provide some information back through telemetry, but there is limitation on how much data can be sent.
As kata-monitor report information per node, any metric we want to expose would be multiplied by the number of nodes in the cluster, making us go above the limit in some situations.

We need to sum information up, and expose total numbers at the cluster level, and not individual per-node information.

- What I did
Adding a PrometheusRule to have prometheus take the number of running shim per node, make the total of it, and expose it as a new metric.
This rule can be extended in the future to add additional metrics as we need.

Before sending metrics to telemetry, we need to sum it up, because sending
individual metric per node is not scaling.
This rule makes Prometheus compute the total of running VMs for the cluster
and expose it as a new metric.

Signed-off-by: Julien Ropé <jrope@redhat.com>
@openshift-ci openshift-ci bot requested review from bpradipt and jensfr May 5, 2022 12:38
@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 5, 2022
@openshift-ci
Copy link

openshift-ci bot commented May 5, 2022

Hi @littlejawa. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@littlejawa
Copy link
Contributor Author

/cc @snir911 FYI

@openshift-ci openshift-ci bot requested a review from snir911 May 5, 2022 12:39
@openshift-ci
Copy link

openshift-ci bot commented May 5, 2022

@littlejawa: GitHub didn't allow me to request PR reviews from the following users: FYI.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @snir911 FYI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Julien Ropé <jrope@redhat.com>
@bpradipt
Copy link
Contributor

bpradipt commented May 9, 2022

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 9, 2022
- name: kata_monitor_rules
rules:
- expr: sum(kata_monitor_running_shim_count)
record: cluster:kata_monitor_running_shim_count:sum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @littlejawa, can you please confirm if this yaml was generated by running make bundle after adding the config/kata-monitor/kata-monitor-prometheus-rules.yaml ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which command generated it, but I found it in my "bundle" folder after building / testing the operator.

Is it wrong?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's not wrong, but I expected the bundle manifest and the config manifest to be exactly the same. So I was curious :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what it is trying to tell me is that I should fix the indentation in the config manifest :-)
Also: it removes the namespace... As it's installed by the operator, maybe I don't need to specify it in the config manifest either?

Note that I did the same (have the namespace specified) in the ServiceMonitor config manifest, and it was removed from the bundle too.

Copy link

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 I'm from @openshift/openshift-team-monitoring
My remark is a bit off-topic but kata_monitor_running_shim_count doesn't follow Prometheus good practices: it looks like it's a gauge metric while the _count suffix is reserved for histograms and summaries (see https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations). It's not a big issue but it would be nice if it could be fixed upstream :)

@littlejawa
Copy link
Contributor Author

wave I'm from @openshift/openshift-team-monitoring My remark is a bit off-topic but kata_monitor_running_shim_count doesn't follow Prometheus good practices: it looks like it's a gauge metric while the _count suffix is reserved for histograms and summaries (see https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations). It's not a big issue but it would be nice if it could be fixed upstream :)

Thanks @simonpasquier - you're right, it's probably a gauge.
This metric comes from a (long) list exported by one of our components. I suspect other items in there would benefit from a rename. I doubt this can be done quickly though - I expect pushbacks if other users of the upstream component are already relying on them.
I'll keep that in mind though.

As this metric is bound to be exported via telemetry, do you think we should rename the record rule (cluster:kata_monitor_running_shim_count:sum)? I feel it's supposed to keep the name of the underlying metrics for consistency, but at least it would appear correct when used in telemetry.
What do you think?

@bpradipt
Copy link
Contributor

wave I'm from @openshift/openshift-team-monitoring My remark is a bit off-topic but kata_monitor_running_shim_count doesn't follow Prometheus good practices: it looks like it's a gauge metric while the _count suffix is reserved for histograms and summaries (see https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations). It's not a big issue but it would be nice if it could be fixed upstream :)

Thanks @simonpasquier - you're right, it's probably a gauge. This metric comes from a (long) list exported by one of our components. I suspect other items in there would benefit from a rename. I doubt this can be done quickly though - I expect pushbacks if other users of the upstream component are already relying on them. I'll keep that in mind though.

As this metric is bound to be exported via telemetry, do you think we should rename the record rule (cluster:kata_monitor_running_shim_count:sum)? I feel it's supposed to keep the name of the underlying metrics for consistency, but at least it would appear correct when used in telemetry. What do you think?

@simonpasquier any comments on the above question from @littlejawa ?

@bpradipt
Copy link
Contributor

/ok-to-test

@openshift-ci
Copy link

openshift-ci bot commented May 30, 2022

@littlejawa: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@simonpasquier
Copy link

sorry for the lag :)

cluster:kata_monitor_running_shim_count:sum is fine by me.

@bpradipt bpradipt merged commit eda5b31 into openshift:master Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants