Adding a PrometheusRule for kata monitor #189
Before sending metrics to telemetry, we need to aggregate them, because sending an individual metric per node does not scale. This rule makes Prometheus compute the total number of running VMs for the cluster and expose it as a new metric.
Signed-off-by: Julien Ropé <jrope@redhat.com>
Hi @littlejawa. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test.
/cc @snir911 FYI
@littlejawa: GitHub didn't allow me to request PR reviews from the following users: FYI. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
Signed-off-by: Julien Ropé <jrope@redhat.com>
/ok-to-test
- name: kata_monitor_rules
  rules:
  - expr: sum(kata_monitor_running_shim_count)
    record: cluster:kata_monitor_running_shim_count:sum
Hi @littlejawa, can you please confirm whether this YAML was generated by running `make bundle` after adding `config/kata-monitor/kata-monitor-prometheus-rules.yaml`?
I'm not sure which command generated it, but I found it in my "bundle" folder after building / testing the operator.
Is it wrong?
No, it's not wrong, but I expected the bundle manifest and the config manifest to be exactly the same. So I was curious :-)
Maybe what it is trying to tell me is that I should fix the indentation in the config manifest :-)
Also: it removes the namespace... As it's installed by the operator, maybe I don't need to specify it in the config manifest either?
Note that I did the same (have the namespace specified) in the ServiceMonitor config manifest, and it was removed from the bundle too.
👋 I'm from @openshift/openshift-team-monitoring
My remark is a bit off-topic, but `kata_monitor_running_shim_count` doesn't follow Prometheus good practices: it looks like a gauge metric, while the `_count` suffix is reserved for histograms and summaries (see https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations). It's not a big issue but it would be nice if it could be fixed upstream :)
Thanks @simonpasquier - you're right, it's probably a gauge. As this metric is bound to be exported via telemetry, do you think we should rename the record rule ( |
@simonpasquier any comments on the above question from @littlejawa?
/ok-to-test
@littlejawa: all tests passed!
sorry for the lag :)
- Description of the problem which is fixed/What is the use case
We want to report some information back through telemetry, but there is a limit on how much data can be sent.
As kata-monitor reports information per node, any metric we expose would be multiplied by the number of nodes in the cluster, which could push us above the limit in some situations.
We need to sum the information up and expose cluster-level totals rather than individual per-node values.
- What I did
Added a PrometheusRule so that Prometheus takes the number of running shims per node, sums it, and exposes the total as a new metric.
This rule can be extended in the future to add additional metrics as we need.
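
For readers unfamiliar with the resource, here is a minimal sketch of what a complete PrometheusRule manifest carrying this recording rule might look like. The group name, `expr` and `record` values come from the diff in this PR; the `metadata.name` and `metadata.namespace` values are placeholders assumed for illustration and are not the ones used by the operator.

```yaml
# Sketch only: metadata values below are assumptions, not taken from this PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kata-monitor-rules        # hypothetical name
  namespace: example-namespace    # hypothetical; the review thread notes the bundle drops the namespace
spec:
  groups:
    - name: kata_monitor_rules
      rules:
        # Collapse the per-node series into a single cluster-level series.
        - record: cluster:kata_monitor_running_shim_count:sum
          expr: sum(kata_monitor_running_shim_count)
```

The recorded name follows the Prometheus `level:metric:operations` convention for recording rules (`cluster` as the aggregation level, `sum` as the operation). Further cluster-level metrics could presumably be added as extra entries under `rules`.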