MON-3302: add RHACS telemetry metrics #2062
@stehessel: This pull request references MON-3302 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
Can rox_sensor_secured_nodes ever be different from the number of nodes in a cluster? If not, the nodes capacity is already reported in Telemetry via cluster:node_instance_type_count:sum. And the same goes for cpu capacity with cluster:capacity_cpu_cores:sum.
Instead of 2 metrics per sensor, you could have a single rhacs:telemetry:rox_sensor_info metric with the following labels:
central_id
hosting
install_method
build
sensor_version
sensor_id
Reporting the number of nodes/vcpus per sensor would require a few PromQL joins but nothing too fancy IMHO.
The same remark goes for the central metrics, IIUC:
rox_central_secured_clusters is equal to the sum of clusters running a sensor.
rox_central_secured_nodes is equal to the sum of the nodes reported by all the sensors.
rox_central_secured_vcpu is equal to the sum of the vcpus reported by all the sensors.
If so, reporting 1 info metric from the sensor with (sensor_id, central_id) + 1 info metric from the central with (central_id) is enough to provide the same information.
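To make the suggested join concrete, here is a minimal Python sketch of the many-to-one label join that PromQL would perform between a per-cluster capacity metric and a sensor info metric, followed by a sum per central. It only illustrates the idea and is not the actual recording rule: the join label `cluster_id`, the sample values, and the helper names `join_info` and `sum_by` are all assumptions made for this example.

```python
from collections import defaultdict


def join_info(values, info, on):
    """Attach the labels of the matching info series to each value series.

    `values` is a list of (labels, value) pairs, `info` is a list of label
    dicts, and `on` lists the label names used to match them.
    """
    index = {tuple(labels[k] for k in on): labels for labels in info}
    joined = []
    for labels, value in values:
        match = index.get(tuple(labels[k] for k in on))
        if match is None:
            continue  # series without a matching info series are dropped
        joined.append(({**match, **labels}, value))
    return joined


def sum_by(samples, by):
    """Sum sample values grouped by the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[tuple(labels[k] for k in by)] += value
    return dict(totals)


# Hypothetical info series, one per sensor ("cluster_id" is an assumed join label).
sensor_info = [
    {"cluster_id": "c1", "sensor_id": "s1", "central_id": "A"},
    {"cluster_id": "c2", "sensor_id": "s2", "central_id": "A"},
    {"cluster_id": "c3", "sensor_id": "s3", "central_id": "B"},
]

# Hypothetical per-cluster node counts; cluster c4 runs no sensor.
node_counts = [
    ({"cluster_id": "c1"}, 3.0),
    ({"cluster_id": "c2"}, 5.0),
    ({"cluster_id": "c3"}, 2.0),
    ({"cluster_id": "c4"}, 7.0),
]

per_sensor = join_info(node_counts, sensor_info, on=["cluster_id"])
per_central = sum_by(per_sensor, by=["central_id"])
print(per_central)
```

A join like this can only cover clusters whose capacity metrics are actually present on the left-hand side, so any cluster that does not report them is silently left out of the per-central totals.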
Hi @simonpasquier thanks for the review.
Good question, the values should be the same for sensors running on OpenShift (as long as
For Central it's not the same, because not all secured clusters are Sensors running on OpenShift. There could also be Sensors running on plain k8s, in which case we don't get metrics from telemeter for those clusters. So for Central we definitely need our own
We can do that. In general I'm a bit unsure how to proceed, maybe you can advise in terms of timeline. Originally I was hoping to get the metrics like the recording rules into our ACS 4.2 release. Sadly we are now past the code freeze, which was last week. Any changes like the suggestions above would have to go into the next release, 4.3, in 9 weeks. I'd like to avoid more waiting if possible, i.e. the worst case would be "wait for the ACS 4.3 release, then start the backport of the telemeter config to OCP 4.x versions", because that will take many more months. So with that in mind, what do you think:
|
So I talked with our release team and we can likely still squeeze in the change to the recording rules. If you have time, feel free to take a look at stackrox/stackrox#7673. |
For OCP, it's usually a matter of 1-2 weeks per minor release.
From experience, a temporary solution lasts forever. I'd very much prefer to have a clean implementation from the start. |
Thanks, it clarifies a lot. |
|
@simonpasquier I removed the sensor gauge metrics and the branding label. Can you take another look? Do these test failures need to be debugged? I'm not sure how they are related to my changes. |
Will do. The e2e test failures are flakes. |
I'd recommend adding a rhacs:telemetry:rox_central_info metric similar to rhacs:telemetry:rox_sensor_info and removing the labels other than central_id from the rhacs:telemetry:rox_central_secured_* metrics.
Hi @simonpasquier , I followed your suggestions. Can you please take another look? |
/cc @jan--f |
Thanks for the updated version! I only found one nit in a comment; otherwise I'll defer to @jan--f for a second check.
Having said that, I'm still not sure I understand/agree with the way monitoring is enabled and the fact that service monitors and prometheus rules are dropped in the openshift-monitoring namespaces. In particular for operator-based installations, you would benefit from using the operatorframework.io/cluster-monitoring=true annotation as described in https://github.com/openshift/enhancements/blob/master/enhancements/olm/olm-managed-operator-metrics.md#openshift-monitoring-prometheus-operator-support.
The reason behind this implementation is that ACS supports multiple installation methods (manifest, Helm, OLM operator), and we wanted the monitoring to be consistent for all of them. |
@stehessel: The following tests failed, say
/lgtm |
@stehessel: This pull request references MON-3302 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set. In response to this:
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jan--f, stehessel.
@jan--f I don't think this repo has the @simonpasquier Is there anything else holding up the merge now that @jan--f has approved? |
/skip |
@stehessel this is just prow doing its thing. The PR should merge once the bot is satisfied. |
@jan--f That is great to hear, thanks! |
/cherrypick release-4.14 |
@simonpasquier: new pull request created: #2137 In response to this:
I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302
As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.