Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MON-3302: add RHACS telemetry metrics #2062

Merged

Conversation

stehessel
Copy link
Contributor

@stehessel stehessel commented Aug 6, 2023

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 6, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Aug 6, 2023

@stehessel: This pull request references MON-3302 which is a valid jira issue.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from marioferh and sthaha August 6, 2023 23:39
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Aug 6, 2023

@stehessel: This pull request references MON-3302 which is a valid jira issue.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stehessel stehessel force-pushed the MON-3302/rhacs-telemetry-metrics branch from 92c2f30 to ae7cce8 Compare August 7, 2023 13:32
@stehessel
Copy link
Contributor Author

/retest

1 similar comment
@stehessel
Copy link
Contributor Author

/retest

@stehessel stehessel force-pushed the MON-3302/rhacs-telemetry-metrics branch from 79704c6 to 421420c Compare August 14, 2023 19:53
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can rox_sensor_secured_nodes ever be different from the number of nodes in a cluster? If not, the nodes capacity is already reported in Telemetry via cluster:node_instance_type_count:sum . And the same goes for cpu capacity with cluster:capacity_cpu_cores:sum.

Instead of 2 metrics per sensor, you could have a single rhacs:telemetry:rox_sensor_info metric with the following labels

  • central_id
  • hosting
  • install_method
  • build
  • sensor_version
  • sensor_id

Reporting the number of nodes/vcpus per sensor would require a few PromQL joins but nothing too fancy IMHO.

The same remark goes for the central metrics, IIUC:

  • rox_central_secured_clusters is equal to the sum of clusters running a sensor.
  • rox_central_secured_nodes is equal to the sum of the nodes reported by all the sensors.
  • rox_central_secured_vcpu is equal to the sum of the vcpus reported by all the sensors.

If so, reporting 1 info metric from the sensor with (sensor_id, central_id) + 1 info metric from the central with (central_id) is enough to provide the same information.

Documentation/data-collection.md Outdated Show resolved Hide resolved
@stehessel
Copy link
Contributor Author

stehessel commented Aug 31, 2023

Hi @simonpasquier thanks for the review.

Can rox_sensor_secured_nodes ever be different from the number of nodes in a cluster? If not, the nodes capacity is already reported in Telemetry via cluster:node_instance_type_count:sum . And the same goes for cpu capacity with cluster:capacity_cpu_cores:sum.

Good question, the values should be the same for sensors running on OpenShift (as long as cluster:capacity_cpu_cores:sum is in K8s cpu units and not physical core in case of hyperthreading). One scenario I can imagine where having our own node/cpu count would be useful, is for debugging purposes. For example when a customer reports that the core count shown in ACS UI is different from their OpenShift cluster capacity, we could compare the telemetry metrics for confirmation. But that's an edge case admittedly, if joining the promQL queries work ok for the dashboards + tableu pipeline, we could do that.

The same remark goes for the central metrics, IIUC:

For Central it's not the same, because not all secured clusters are Sensors running on OpenShift. There could also be Sensors running on plain k8s, in which case we don't get metrics from telemeter for those clusters. So for Central we definitely need our own {cluster, nodes, vcpu} metrics.

(suggestion) I'd rather see the filtering on branding="RHACS" being performed in the recording rule. Which would be consistent with the name of metric too (e.g. rhacs:...).

We can do that.

In general I'm a bit unsure how to proceed, maybe you can advise in terms of timeline. Originally I was hoping to get the metrics like the recording rules into our ACS 4.2 release. Sadly we are now past the code freeze, which was last week. Any changes like above suggestions would have to go into the next release 4.3 in 9 weeks. I'd like to avoid more waiting if possible - i.e. worst case would be "wait for ACS 4.3 release; then start the backport of telemeter config to OCP 4.x versions" because that will take many more months. So with that in mind what do you think:

  • How long does the backporting process usually take?
  • How critical are above suggestions (replacing sensor node + vcpu metric with info metric; remove branding from recording rule) for you? Do they justify waiting another release cycle vs just going with what we have now?

@stehessel
Copy link
Contributor Author

So I talked with our release team and we can likely still squeeze in the change to the recording rules. If you have time, feel free to take a look at stackrox/stackrox#7673.

@simonpasquier
Copy link
Contributor

How long does the backporting process usually take?

For OCP, it's usually a matter of 1-2 weeks per minor release.

How critical are above suggestions (replacing sensor node + vcpu metric with info metric; remove branding from recording rule) for you? Do they justify waiting another release cycle vs just going with what we have now?

From experience, a temporary solution lasts forever. I'd very much prefer to have a clean implementation from the start.

@simonpasquier
Copy link
Contributor

For Central it's not the same, ...

Thanks, it clarifies a lot.

@simonpasquier
Copy link
Contributor

as long as cluster:capacity_cpu_cores:sum is in K8s cpu units and not physical core in case of hyperthreading

cluster:capacity_cpu_cores:sum is computed from the kube_node_status_capacity metric which comes from the Kubernetes node status subresource. IIUC this is exactly the same source of information than RHACS:

https://github.com/stackrox/stackrox/blob/54e2a4a2204f55171ce334357a181dc52e85c961/sensor/kubernetes/clustermetrics/cluster_metrics.go#L134-L140

@stehessel
Copy link
Contributor Author

@simonpasquier I removed the sensor gauge metrics and the branding label. Can you take another look? Do these test failures need to be debugged? I'm not sure how they are related to my changes.

@simonpasquier
Copy link
Contributor

Will do. The e2e test failures are flakes.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend to add a rhacs:telemetry:rox_central_info metric similar to rhacs:telemetry:rox_sensor_info and remove the labels other than central_id from the rhacs:telemetry:rox_central_secured_* metrics.

Documentation/data-collection.md Show resolved Hide resolved
Documentation/data-collection.md Show resolved Hide resolved
Documentation/data-collection.md Outdated Show resolved Hide resolved
Documentation/data-collection.md Show resolved Hide resolved
@stehessel
Copy link
Contributor Author

Hi @simonpasquier , I followed your suggestions. Can you please take another look?

@simonpasquier
Copy link
Contributor

/cc @jan--f

@openshift-ci openshift-ci bot requested a review from jan--f September 19, 2023 15:08
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated version! I only found one nit in a comment otherwise I'll defer to @jan--f for a second check.

Having said that, I'm still not sure to understand/agree with the way monitoring is enabled and the fact that service monitors and prometheus rules are dropped in the openshift-monitoring namespaces. In particular for operator-based installations, you would benefit from using the operatorframework.io/cluster-monitoring=true annotation as described in https://github.com/openshift/enhancements/blob/master/enhancements/olm/olm-managed-operator-metrics.md#openshift-monitoring-prometheus-operator-support.

@stehessel
Copy link
Contributor Author

Having said that, I'm still not sure to understand/agree with the way monitoring is enabled and the fact that service monitors and prometheus rules are dropped in the openshift-monitoring namespaces. In particular for operator-based installations, you would benefit from using the operatorframework.io/cluster-monitoring=true annotation as described in https://github.com/openshift/enhancements/blob/master/enhancements/olm/olm-managed-operator-metrics.md#openshift-monitoring-prometheus-operator-support.

The reason behind this implementation is that ACS supports multiple installation methods (manifest, Helm, OLM operator), and we wanted the monitoring to be consistent for all of them.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 19, 2023

@stehessel: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 6e6aff9 link false /test e2e-aws-ovn-single-node
ci/prow/versions 6e6aff9 link false /test versions

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jan--f
Copy link
Contributor

jan--f commented Oct 5, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Oct 5, 2023

@stehessel: This pull request references MON-3302 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, stehessel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Oct 5, 2023

@stehessel: This pull request references MON-3302 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 9d56a3a and 2 for PR HEAD 6e6aff9 in total

@stehessel
Copy link
Contributor Author

@jan--f I don't think this repo has the /lgtm automation, so you need to click Approve manually.

@simonpasquier Is there anything else holding up the merge now that @jan--f has approved?

@jan--f
Copy link
Contributor

jan--f commented Oct 5, 2023

/skip

@jan--f
Copy link
Contributor

jan--f commented Oct 5, 2023

@stehessel this is just prow doing its thing. The PR should merge once the bot is satisfied.

@stehessel
Copy link
Contributor Author

@jan--f That is great to hear, thanks!

@openshift-ci openshift-ci bot merged commit 2d506c7 into openshift:master Oct 5, 2023
13 of 16 checks passed
@simonpasquier
Copy link
Contributor

/cherrypick release-4.14

@openshift-cherrypick-robot

@simonpasquier: new pull request created: #2137

In response to this:

/cherrypick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants