Adding a PrometheusRule for kata monitor #189

littlejawa · 2022-05-05T12:37:47Z

- Description of the problem which is fixed/What is the use case
We want to report some information through telemetry, but there is a limit on how much data can be sent.
Because kata-monitor reports information per node, any metric we expose is multiplied by the number of nodes in the cluster, which can push us above the limit in some situations.

We need to aggregate the information and expose cluster-level totals rather than individual per-node values.

- What I did
Added a PrometheusRule that makes Prometheus take the number of running shims per node, sum them, and expose the total as a new metric.
This rule can be extended in the future to add more metrics as needed.
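
To illustrate the aggregation described above, here is a minimal Python sketch of what `sum(kata_monitor_running_shim_count)` does conceptually: it drops the per-node labels and adds the values, so telemetry receives one series instead of one per node. The node names, the `instance` label, and the sample values are made up for illustration, and this is not how Prometheus itself is implemented.

```python
# Hypothetical per-node samples of kata_monitor_running_shim_count.
# The "instance" label and all values are assumptions for illustration.
per_node_samples = [
    ({"instance": "worker-0"}, 3.0),
    ({"instance": "worker-1"}, 0.0),
    ({"instance": "worker-2"}, 5.0),
]


def sum_without_labels(samples):
    """Mimic PromQL sum(metric): ignore every label and add the values."""
    return sum(value for _labels, value in samples)


cluster_total = sum_without_labels(per_node_samples)

# One series per node goes in, a single cluster-level series comes out.
print("series before aggregation:", len(per_node_samples))
print("cluster:kata_monitor_running_shim_count:sum =", cluster_total)
```

The number of series sent to telemetry stays at one regardless of how many nodes the cluster has, which is the scaling property the PR is after.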

Before sending metrics to telemetry, we need to aggregate them, because sending an individual metric per node does not scale. This rule makes Prometheus compute the total number of running VMs for the cluster and expose it as a new metric.

Signed-off-by: Julien Ropé <jrope@redhat.com>

openshift-ci · 2022-05-05T12:38:25Z

Hi @littlejawa. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

littlejawa · 2022-05-05T12:39:03Z

/cc @snir911 FYI

openshift-ci · 2022-05-05T12:39:06Z

@littlejawa: GitHub didn't allow me to request PR reviews from the following users: FYI.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @snir911 FYI

bpradipt · 2022-05-09T09:20:53Z

/ok-to-test

bpradipt · 2022-05-09T09:23:22Z

...manifests/prometheus-sandboxed-containers-rules_monitoring.coreos.com_v1_prometheusrule.yaml

+  - name: kata_monitor_rules
+    rules:
+    - expr: sum(kata_monitor_running_shim_count)
+      record: cluster:kata_monitor_running_shim_count:sum


bpradipt · 2022-05-09

Hi @littlejawa, can you please confirm whether this YAML was generated by running make bundle after adding config/kata-monitor/kata-monitor-prometheus-rules.yaml?

littlejawa · 2022-05-09

I'm not sure which command generated it, but I found it in my "bundle" folder after building and testing the operator.

Is it wrong?

bpradipt · 2022-05-09

No, it's not wrong, but I expected the bundle manifest and the config manifest to be exactly the same. So I was curious :-)

littlejawa · 2022-05-09

Maybe what it is trying to tell me is that I should fix the indentation in the config manifest :-)
Also: it removes the namespace... As it's installed by the operator, maybe I don't need to specify it in the config manifest either?

Note that I did the same (specified the namespace) in the ServiceMonitor config manifest, and it was removed from the bundle too.

simonpasquier

👋 I'm from @openshift/openshift-team-monitoring
My remark is a bit off-topic but kata_monitor_running_shim_count doesn't follow Prometheus good practices: it looks like it's a gauge metric while the _count suffix is reserved for histograms and summaries (see https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations). It's not a big issue but it would be nice if it could be fixed upstream :)
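
The naming remark above can be sketched as a small Python check. It assumes the convention, as enforced by Prometheus linting to the best of my understanding, that the `_count`, `_sum`, and `_bucket` suffixes belong to histogram and summary series only. The function name and the alternative metric name `kata_monitor_running_shims` are hypothetical and used only for illustration.

```python
# Suffixes conventionally produced by histograms and summaries.
RESERVED_SUFFIXES = ("_count", "_sum", "_bucket")


def reserved_suffix(metric_name, metric_type):
    """Return the reserved suffix misused by a metric name, or None.

    Histograms and summaries may legitimately expose these suffixes,
    so only other metric types (gauge, counter, ...) are flagged.
    """
    if metric_type in ("histogram", "summary"):
        return None
    for suffix in RESERVED_SUFFIXES:
        if metric_name.endswith(suffix):
            return suffix
    return None


# The metric discussed in this PR behaves like a gauge.
print(reserved_suffix("kata_monitor_running_shim_count", "gauge"))
# A hypothetical alternative name without the reserved suffix.
print(reserved_suffix("kata_monitor_running_shims", "gauge"))
```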

littlejawa · 2022-05-09T13:29:06Z

Thanks @simonpasquier - you're right, it's probably a gauge.
This metric comes from a (long) list exported by one of our components. I suspect other items in there would benefit from a rename. I doubt this can be done quickly though: I expect pushback if other users of the upstream component already rely on them.
I'll keep that in mind though.

As this metric is bound to be exported via telemetry, do you think we should rename the recording rule (cluster:kata_monitor_running_shim_count:sum)? I feel it's supposed to keep the name of the underlying metric for consistency, but renaming it would at least make it appear correct when used in telemetry.
What do you think?
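
For context on the consistency argument, Prometheus recommends naming recording rules in the general form `level:metric:operations`, where the middle part keeps the name of the underlying metric. The Python sketch below splits the rule name from this PR into those three parts; the helper name `split_record_name` is hypothetical.

```python
def split_record_name(record):
    """Split a recording rule name of the form level:metric:operations.

    Raises ValueError if the name does not have exactly three parts.
    """
    level, metric, operations = record.split(":")
    return {"level": level, "metric": metric, "operations": operations}


parts = split_record_name("cluster:kata_monitor_running_shim_count:sum")

# The middle part is the original metric name, kept unchanged.
for key, value in parts.items():
    print(f"{key}: {value}")
```

Renaming only the middle part would break the link between the rule and the metric it aggregates, which is the trade-off raised in the question above.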

bpradipt · 2022-05-25T05:32:12Z

@simonpasquier any comments on the above question from @littlejawa ?

bpradipt · 2022-05-30T07:03:14Z

/ok-to-test

openshift-ci · 2022-05-30T08:45:53Z

@littlejawa: all tests passed!

simonpasquier · 2022-05-30T10:09:25Z

sorry for the lag :)

cluster:kata_monitor_running_shim_count:sum is fine by me.

openshift-ci bot requested review from bpradipt and jensfr May 5, 2022 12:38

openshift-ci bot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) May 5, 2022

openshift-ci bot requested a review from snir911 May 5, 2022 12:39

Adding manifest for PrometheusRule

fb0b2f1

Signed-off-by: Julien Ropé <jrope@redhat.com>

openshift-ci bot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) May 9, 2022

bpradipt reviewed May 9, 2022

simonpasquier reviewed May 9, 2022

bpradipt requested review from simonpasquier and bpradipt May 30, 2022 07:03

snir911 approved these changes May 30, 2022

simonpasquier mentioned this pull request May 31, 2022

Adding a usage metric for Openshift Sandboxed Containers openshift/cluster-monitoring-operator#1662

Merged

bpradipt approved these changes Jun 1, 2022

bpradipt merged commit eda5b31 into openshift:master Jun 1, 2022
