MON-3302: add RHACS telemetry metrics #2062

stehessel · 2023-08-06T23:39:13Z

I added CHANGELOG entry for this change.
No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

openshift-ci-robot · 2023-08-06T23:39:16Z

@stehessel: This pull request references MON-3302 which is a valid jira issue.

In response to this:

I added CHANGELOG entry for this change.

No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-08-06T23:40:29Z

@stehessel: This pull request references MON-3302 which is a valid jira issue.

In response to this:

I added CHANGELOG entry for this change.

No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

stehessel · 2023-08-08T10:32:17Z

/retest

stehessel · 2023-08-08T17:32:45Z

/retest

simonpasquier

Can rox_sensor_secured_nodes ever be different from the number of nodes in a cluster? If not, the nodes capacity is already reported in Telemetry via cluster:node_instance_type_count:sum . And the same goes for cpu capacity with cluster:capacity_cpu_cores:sum.

Instead of 2 metrics per sensor, you could have a single rhacs:telemetry:rox_sensor_info metric with the following labels

central_id
hosting
install_method
build
sensor_version
sensor_id

Reporting the number of nodes/vcpus per sensor would require a few PromQL joins but nothing too fancy IMHO.

The same remark goes for the central metrics, IIUC:

rox_central_secured_clusters is equal to the sum of clusters running a sensor.
rox_central_secured_nodes is equal to the sum of the nodes reported by all the sensors.
rox_central_secured_vcpu is equal to the sum of the vcpus reported by all the sensors.

If so, reporting 1 info metric from the sensor with (sensor_id, central_id) + 1 info metric from the central with (central_id) is enough to provide the same information.

Documentation/data-collection.md

stehessel · 2023-08-31T15:21:30Z

Hi @simonpasquier thanks for the review.

Can rox_sensor_secured_nodes ever be different from the number of nodes in a cluster? If not, the nodes capacity is already reported in Telemetry via cluster:node_instance_type_count:sum . And the same goes for cpu capacity with cluster:capacity_cpu_cores:sum.

Good question, the values should be the same for sensors running on OpenShift (as long as cluster:capacity_cpu_cores:sum is in K8s cpu units and not physical core in case of hyperthreading). One scenario I can imagine where having our own node/cpu count would be useful, is for debugging purposes. For example when a customer reports that the core count shown in ACS UI is different from their OpenShift cluster capacity, we could compare the telemetry metrics for confirmation. But that's an edge case admittedly, if joining the promQL queries work ok for the dashboards + tableu pipeline, we could do that.

The same remark goes for the central metrics, IIUC:

For Central it's not the same, because not all secured clusters are Sensors running on OpenShift. There could also be Sensors running on plain k8s, in which case we don't get metrics from telemeter for those clusters. So for Central we definitely need our own {cluster, nodes, vcpu} metrics.

(suggestion) I'd rather see the filtering on branding="RHACS" being performed in the recording rule. Which would be consistent with the name of metric too (e.g. rhacs:...).

We can do that.

In general I'm a bit unsure how to proceed, maybe you can advise in terms of timeline. Originally I was hoping to get the metrics like the recording rules into our ACS 4.2 release. Sadly we are now past the code freeze, which was last week. Any changes like above suggestions would have to go into the next release 4.3 in 9 weeks. I'd like to avoid more waiting if possible - i.e. worst case would be "wait for ACS 4.3 release; then start the backport of telemeter config to OCP 4.x versions" because that will take many more months. So with that in mind what do you think:

How long does the backporting process usually take?
How critical are above suggestions (replacing sensor node + vcpu metric with info metric; remove branding from recording rule) for you? Do they justify waiting another release cycle vs just going with what we have now?

stehessel · 2023-09-01T15:48:25Z

So I talked with our release team and we can likely still squeeze in the change to the recording rules. If you have time, feel free to take a look at stackrox/stackrox#7673.

simonpasquier · 2023-09-04T10:16:43Z

How long does the backporting process usually take?

For OCP, it's usually a matter of 1-2 weeks per minor release.

How critical are above suggestions (replacing sensor node + vcpu metric with info metric; remove branding from recording rule) for you? Do they justify waiting another release cycle vs just going with what we have now?

From experience, a temporary solution lasts forever. I'd very much prefer to have a clean implementation from the start.

simonpasquier · 2023-09-04T10:17:53Z

For Central it's not the same, ...

Thanks, it clarifies a lot.

simonpasquier · 2023-09-04T10:24:36Z

as long as cluster:capacity_cpu_cores:sum is in K8s cpu units and not physical core in case of hyperthreading

cluster:capacity_cpu_cores:sum is computed from the kube_node_status_capacity metric which comes from the Kubernetes node status subresource. IIUC this is exactly the same source of information than RHACS:

https://github.com/stackrox/stackrox/blob/54e2a4a2204f55171ce334357a181dc52e85c961/sensor/kubernetes/clustermetrics/cluster_metrics.go#L134-L140

stehessel · 2023-09-04T21:04:34Z

@simonpasquier I removed the sensor gauge metrics and the branding label. Can you take another look? Do these test failures need to be debugged? I'm not sure how they are related to my changes.

simonpasquier · 2023-09-06T11:28:45Z

Will do. The e2e test failures are flakes.

simonpasquier

I'd recommend to add a rhacs:telemetry:rox_central_info metric similar to rhacs:telemetry:rox_sensor_info and remove the labels other than central_id from the rhacs:telemetry:rox_central_secured_* metrics.

Documentation/data-collection.md

stehessel · 2023-09-11T11:05:51Z

Hi @simonpasquier , I followed your suggestions. Can you please take another look?

simonpasquier · 2023-09-19T15:08:05Z

/cc @jan--f

simonpasquier

Thanks for the updated version! I only found one nit in a comment otherwise I'll defer to @jan--f for a second check.

Having said that, I'm still not sure to understand/agree with the way monitoring is enabled and the fact that service monitors and prometheus rules are dropped in the openshift-monitoring namespaces. In particular for operator-based installations, you would benefit from using the operatorframework.io/cluster-monitoring=true annotation as described in https://github.com/openshift/enhancements/blob/master/enhancements/olm/olm-managed-operator-metrics.md#openshift-monitoring-prometheus-operator-support.

manifests/0000_50_cluster-monitoring-operator_04-config.yaml

stehessel · 2023-09-19T16:14:42Z

Having said that, I'm still not sure to understand/agree with the way monitoring is enabled and the fact that service monitors and prometheus rules are dropped in the openshift-monitoring namespaces. In particular for operator-based installations, you would benefit from using the operatorframework.io/cluster-monitoring=true annotation as described in https://github.com/openshift/enhancements/blob/master/enhancements/olm/olm-managed-operator-metrics.md#openshift-monitoring-prometheus-operator-support.

The reason behind this implementation is that ACS supports multiple installation methods (manifest, Helm, OLM operator), and we wanted the monitoring to be consistent for all of them.

openshift-ci · 2023-09-19T21:02:27Z

@stehessel: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-single-node	`6e6aff9`	link	false	`/test e2e-aws-ovn-single-node`
ci/prow/versions	`6e6aff9`	link	false	`/test versions`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

jan--f · 2023-10-05T07:29:28Z

/lgtm

openshift-ci-robot · 2023-10-05T07:29:54Z

@stehessel: This pull request references MON-3302 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

I added CHANGELOG entry for this change.

No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-10-05T07:29:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, stehessel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jan--f]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2023-10-05T07:30:03Z

@stehessel: This pull request references MON-3302 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

I added CHANGELOG entry for this change.

No user facing changes, so no entry in CHANGELOG was needed.

I didn't see any other Telemeter metrics in the changelog, let me know if you prefer an entry for it. Related Jira ticket: https://issues.redhat.com/browse/MON-3302

As mentioned in the ticket, if possible we would like to backport this to existing OpenShift versions so we don't have to wait until customers upgrade to OpenShift 4.14.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-10-05T07:56:15Z

/retest-required

Remaining retests: 0 against base HEAD 9d56a3a and 2 for PR HEAD 6e6aff9 in total

stehessel · 2023-10-05T10:51:35Z

@jan--f I don't think this repo has the /lgtm automation, so you need to click Approve manually.

@simonpasquier Is there anything else holding up the merge now that @jan--f has approved?

jan--f · 2023-10-05T11:28:26Z

/skip

jan--f · 2023-10-05T11:29:28Z

@stehessel this is just prow doing its thing. The PR should merge once the bot is satisfied.

stehessel · 2023-10-05T12:36:31Z

@jan--f That is great to hear, thanks!

simonpasquier · 2023-10-31T10:15:48Z

/cherrypick release-4.14

openshift-cherrypick-robot · 2023-10-31T10:17:00Z

@simonpasquier: new pull request created: #2137

In response to this:

/cherrypick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 6, 2023

openshift-ci bot requested review from marioferh and sthaha August 6, 2023 23:39

stehessel force-pushed the MON-3302/rhacs-telemetry-metrics branch from 92c2f30 to ae7cce8 Compare August 7, 2023 13:32

MON-3302: add RHACS telemetry metrics

421420c

stehessel force-pushed the MON-3302/rhacs-telemetry-metrics branch from 79704c6 to 421420c Compare August 14, 2023 19:53

simonpasquier reviewed Aug 31, 2023

View reviewed changes

Documentation/data-collection.md Outdated Show resolved Hide resolved

stehessel mentioned this pull request Sep 1, 2023

ROX-18276: adapt telemetry metrics to Telemeter feedback stackrox/stackrox#7673

Merged

address review comments

d3178c9

stehessel requested a review from simonpasquier September 1, 2023 16:50

simonpasquier reviewed Sep 6, 2023

View reviewed changes

Documentation/data-collection.md Show resolved Hide resolved

Documentation/data-collection.md Show resolved Hide resolved

Documentation/data-collection.md Outdated Show resolved Hide resolved

Documentation/data-collection.md Show resolved Hide resolved

stehessel mentioned this pull request Sep 9, 2023

ROX-18276: Central info metric stackrox/stackrox#7734

Merged

central info metric

baa40ba

stehessel requested a review from simonpasquier September 9, 2023 01:45

openshift-ci bot requested a review from jan--f September 19, 2023 15:08

simonpasquier reviewed Sep 19, 2023

View reviewed changes

manifests/0000_50_cluster-monitoring-operator_04-config.yaml Outdated Show resolved Hide resolved

fix typo in comment

6e6aff9

openshift-ci bot assigned jan--f Oct 5, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2023

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2023

openshift-ci bot merged commit 2d506c7 into openshift:master Oct 5, 2023
13 of 16 checks passed

stehessel mentioned this pull request Oct 5, 2023

Add rhacs telemetry metrics rhobs/configuration#618

Merged

openshift-cherrypick-robot mentioned this pull request Oct 31, 2023

OCPBUGS-22734: [release-4.14] add RHACS telemetry metrics #2137

Merged

This was referenced Nov 2, 2023

OCPBUGS-22842: [release-4.13] add RHACS telemetry metrics #2139

Merged

OCPBUGS-22843: [release-4.12] add RHACS telemetry metrics #2140

Merged

OCPBUGS-22845: [release-4.11] add RHACS telemetry metrics #2141

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MON-3302: add RHACS telemetry metrics #2062

MON-3302: add RHACS telemetry metrics #2062

stehessel commented Aug 6, 2023 •

edited

openshift-ci-robot commented Aug 6, 2023 •

edited by openshift-ci bot

openshift-ci-robot commented Aug 6, 2023 •

edited by openshift-ci bot

stehessel commented Aug 8, 2023

stehessel commented Aug 8, 2023

simonpasquier left a comment

stehessel commented Aug 31, 2023 •

edited

stehessel commented Sep 1, 2023

simonpasquier commented Sep 4, 2023

simonpasquier commented Sep 4, 2023

simonpasquier commented Sep 4, 2023

stehessel commented Sep 4, 2023

simonpasquier commented Sep 6, 2023

simonpasquier left a comment

stehessel commented Sep 11, 2023

simonpasquier commented Sep 19, 2023

simonpasquier left a comment

stehessel commented Sep 19, 2023

openshift-ci bot commented Sep 19, 2023 •

edited

jan--f commented Oct 5, 2023

openshift-ci-robot commented Oct 5, 2023 •

edited by openshift-ci bot

openshift-ci bot commented Oct 5, 2023

openshift-ci-robot commented Oct 5, 2023 •

edited by openshift-ci bot

openshift-ci-robot commented Oct 5, 2023

stehessel commented Oct 5, 2023

jan--f commented Oct 5, 2023

jan--f commented Oct 5, 2023

stehessel commented Oct 5, 2023

simonpasquier commented Oct 31, 2023

openshift-cherrypick-robot commented Oct 31, 2023

MON-3302: add RHACS telemetry metrics #2062

MON-3302: add RHACS telemetry metrics #2062

Conversation

stehessel commented Aug 6, 2023 • edited

openshift-ci-robot commented Aug 6, 2023 • edited by openshift-ci bot

openshift-ci-robot commented Aug 6, 2023 • edited by openshift-ci bot

stehessel commented Aug 8, 2023

stehessel commented Aug 8, 2023

simonpasquier left a comment

Choose a reason for hiding this comment

stehessel commented Aug 31, 2023 • edited

stehessel commented Sep 1, 2023

simonpasquier commented Sep 4, 2023

simonpasquier commented Sep 4, 2023

simonpasquier commented Sep 4, 2023

stehessel commented Sep 4, 2023

simonpasquier commented Sep 6, 2023

simonpasquier left a comment

Choose a reason for hiding this comment

stehessel commented Sep 11, 2023

simonpasquier commented Sep 19, 2023

simonpasquier left a comment

Choose a reason for hiding this comment

stehessel commented Sep 19, 2023

openshift-ci bot commented Sep 19, 2023 • edited

jan--f commented Oct 5, 2023

openshift-ci-robot commented Oct 5, 2023 • edited by openshift-ci bot

openshift-ci bot commented Oct 5, 2023

openshift-ci-robot commented Oct 5, 2023 • edited by openshift-ci bot

openshift-ci-robot commented Oct 5, 2023

stehessel commented Oct 5, 2023

jan--f commented Oct 5, 2023

jan--f commented Oct 5, 2023

stehessel commented Oct 5, 2023

simonpasquier commented Oct 31, 2023

openshift-cherrypick-robot commented Oct 31, 2023

stehessel commented Aug 6, 2023 •

edited

openshift-ci-robot commented Aug 6, 2023 •

edited by openshift-ci bot

openshift-ci-robot commented Aug 6, 2023 •

edited by openshift-ci bot

stehessel commented Aug 31, 2023 •

edited

openshift-ci bot commented Sep 19, 2023 •

edited

openshift-ci-robot commented Oct 5, 2023 •

edited by openshift-ci bot

openshift-ci-robot commented Oct 5, 2023 •

edited by openshift-ci bot