Bug 1996785: [MON-1536]Remove unused rules. #1316

raptorsun · 2021-08-09T16:26:43Z

I added CHANGELOG entry for this change.
No user facing changes, so no entry in CHANGELOG was needed.

The recording rules that are not used in alerts, telemetry metrics, console, or dashboard definitions are removed.

The following rules are removed:

build_error_rate
cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
code:registry_api_request_count:rate:sum
instance:node_cpu:ratio
kube_pod_status_ready:etcd:sum
kube_pod_status_ready:image_registry:sum
namespace:container_spec_cpu_shares:sum
node:node_num_cpu:sum
pod:container_spec_cpu_shares:sum

paulfantom · 2021-08-11T10:42:58Z

jsonnet/patch-rules.libsonnet was moved to https://github.com/openshift/cluster-monitoring-operator/blob/master/jsonnet/utils/sanitize-rules.libsonnet

arajkumar · 2021-08-16T07:03:03Z

@raptorsun I will close https://bugzilla.redhat.com/show_bug.cgi?id=1986033 as a duplicate of this.

raptorsun · 2021-08-16T07:40:19Z

> @raptorsun I will close https://bugzilla.redhat.com/show_bug.cgi?id=1986033 as a duplicate of this.

Yes, please. This PR is removing the duplicated rules mentioned in the Bugzilla record. Thank you :)

simonpasquier · 2021-08-17T13:24:55Z

jsonnet/utils/sanitize-rules.libsonnet

+      { record: 'cluster:capacity_cpu_cores_hyperthread_enabled:sum' },
+      { record: 'cluster:capacity_cpu_sockets_hyperthread_enabled:sum' },
+      { record: 'cluster:container_cpu_usage:ratio' },
+      { record: 'cluster:container_spec_cpu_shares:ratio' },
+      { record: 'cluster:hyperthread_enabled_nodes' },
+      { record: 'cluster:infra_nodes' },
+      { record: 'cluster:master_infra_nodes' },
+      { record: 'cluster:memory_usage:ratio' },
+      { record: 'cluster:node_cpu:ratio' },
+      { record: 'cluster:node_cpu:sum_rate5m' },
+      { record: 'cluster:usage:containers:sum' },
+      { record: 'cluster:usage:ingress_frontend_bytes_in:rate5m:sum' },
+      { record: 'cluster:usage:ingress_frontend_bytes_out:rate5m:sum' },
+      { record: 'cluster:usage:ingress_frontend_connections:sum' },
+      { record: 'cluster:usage:kube_node_ready:avg5m' },
+      { record: 'cluster:usage:kube_schedulable_node_ready_reachable:avg5m' },
+      { record: 'cluster:usage:openshift:ingress_request_error:fraction5m' },
+      { record: 'cluster:usage:openshift:ingress_request_total:irate5m' },
+      { record: 'cluster:usage:openshift:kube_running_pod_ready:avg' },
+      { record: 'cluster:usage:pods:terminal:workload:sum' },
+      { record: 'cluster:usage:resources:sum' },
+      { record: 'cluster:usage:workload:capacity_physical_cpu_core_seconds' },
+      { record: 'cluster:usage:workload:capacity_physical_cpu_cores:max:5m' },
+      { record: 'cluster:usage:workload:capacity_physical_cpu_cores:min:5m' },
+      { record: 'cluster:usage:workload:ingress_request_error:fraction5m' },
+      { record: 'cluster:usage:workload:ingress_request_total:irate5m' },
+      { record: 'cluster:usage:workload:kube_running_pod_ready:avg' },


at least all of these rules are sent to Telemetry so they shouldn't be removed...

Thank you @simonpasquier :)
These rules are not present in the telemetry client's config file, "metrics.yaml".
Are there other config files we should pay attention to?

raptorsun · 2021-08-18T11:48:44Z

/retest

raptorsun · 2021-08-18T12:16:25Z

Need to merge this PR to fix the test ci/prow/versions

simonpasquier

We have recording rules like kube_running_pod_ready that we shouldn't remove because they are used by other recording rules. I think that we've reached the limits of hack/check-rec-rule-usage.sh, it would probably be easier and more accurate to parse rule files in Go to identify these cases.
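A minimal sketch of the Go-based check suggested above, under stated assumptions: the `Rule` type and the `UsedByOtherRules` function are hypothetical names, the second rule's expression is invented for illustration, and a regular-expression scan stands in for a real PromQL parser. It only shows the idea of flagging recording rules that other rules depend on.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Rule is a hypothetical, simplified view of one Prometheus rule.
type Rule struct {
	Record string // recorded metric name; empty for alerting rules
	Expr   string // PromQL expression
}

// metricToken matches anything shaped like a PromQL metric name.
// Assumption: a regexp scan replaces a real PromQL parser, so label
// names and keywords are matched too; they are ignored unless they
// happen to equal a record name.
var metricToken = regexp.MustCompile(`[a-zA-Z_:][a-zA-Z0-9_:]*`)

// UsedByOtherRules returns the sorted names of recording rules that are
// referenced in the expression of at least one other rule.
func UsedByOtherRules(rules []Rule) []string {
	records := map[string]bool{}
	for _, r := range rules {
		if r.Record != "" {
			records[r.Record] = true
		}
	}
	used := map[string]bool{}
	for _, r := range rules {
		for _, tok := range metricToken.FindAllString(r.Expr, -1) {
			// A rule mentioning its own name does not count as a dependency.
			if records[tok] && tok != r.Record {
				used[tok] = true
			}
		}
	}
	out := make([]string, 0, len(used))
	for name := range used {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	rules := []Rule{
		{Record: "node_namespace_pod:kube_pod_info:", Expr: `max by (node, namespace, pod) (kube_pod_info{job="kube-state-metrics",node!=""})`},
		// Hypothetical expression: only the reference to the rule above matters.
		{Record: "cluster:cpu_usage_cores:sum", Expr: `sum(rate(container_cpu_usage_seconds_total[5m]) * on (namespace, pod) group_left (node) node_namespace_pod:kube_pod_info:)`},
		{Record: "instance:node_cpu:ratio", Expr: `sum(rate(node_cpu_seconds_total[5m]))`},
	}
	fmt.Println(UsedByOtherRules(rules)) // prints [node_namespace_pod:kube_pod_info:]
}
```

A production version would more likely walk the parsed PromQL expression tree than rely on token matching, which can produce false positives when a label value happens to equal a record name.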

assets/cluster-monitoring-operator/prometheus-rule.yaml

simonpasquier · 2021-08-19T13:58:53Z

assets/control-plane/prometheus-rule.yaml

-          max by (node, namespace, pod) (
-            label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)")
-        ))
-      record: 'node_namespace_pod:kube_pod_info:'


this recording metric is used by prometheus-adapter

cluster-monitoring-operator/assets/prometheus-adapter/config-map.yaml

Line 19 in 6e48653

node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}

If it hasn't been caught by the e2e tests, it would be worth extending them :)

By "extending them", do you mean removing all rules derived from "node_namespace_pod:kube_pod_info:"? I ran the e2e tests in CMO and they do not raise an error when "record: node_namespace_pod:kube_pod_info:" is removed.

I just found out that this rule is also referenced in record: 'cluster:cpu_usage_cores:sum'. That recording rule is also referenced in the telemetry client config.

I am going to put it back for safety 😅

I meant that our e2e tests should be failing if by accident we remove a rule that is required by prometheus-adapter.

Shall we create some tests in CMO that check the completeness of the rules required by prometheus-adapter? Or should prometheus-adapter provide a tool to check this?

We already have TestPodMetricsPresence and TestNodeMetricsPresence in the operator e2e tests and I would have expected these tests to fail if recorded metrics that are used by prometheus adapter are removed.
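As a complement to the e2e tests named above, a static check could catch the same mistake before a cluster is even started. The sketch below is only an illustration under assumptions: `MissingRecordedMetrics` is a hypothetical helper, the adapter queries are passed in as plain strings, and any token containing a colon is treated as a recorded metric (following the usual `level:metric:operations` naming convention).

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// metricToken matches anything shaped like a PromQL metric name.
// Assumption: a regexp scan replaces a real PromQL parser.
var metricToken = regexp.MustCompile(`[a-zA-Z_:][a-zA-Z0-9_:]*`)

// MissingRecordedMetrics returns, sorted, every recorded metric name that
// appears in queries but is absent from the recorded set.
// Assumption: a token containing ':' is a recorded metric. Subquery ranges
// such as [5m:1m] or label values with colons would be false positives.
func MissingRecordedMetrics(queries []string, recorded map[string]bool) []string {
	seen := map[string]bool{}
	var missing []string
	for _, q := range queries {
		for _, tok := range metricToken.FindAllString(q, -1) {
			if strings.Contains(tok, ":") && !recorded[tok] && !seen[tok] {
				seen[tok] = true
				missing = append(missing, tok)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Query template quoted in the thread from the prometheus-adapter config map.
	queries := []string{"node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}"}
	// Rules that would remain if the recording rule had been removed.
	recorded := map[string]bool{"cluster:cpu_usage_cores:sum": true}
	fmt.Println(MissingRecordedMetrics(queries, recorded)) // prints [node_namespace_pod:kube_pod_info:]
}
```

A test built on this idea would fail as soon as the returned slice is non-empty, independently of whether the e2e queries happen to exercise the removed metric.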

simonpasquier · 2021-08-20T13:33:58Z

jsonnet/utils/sanitize-rules.libsonnet

+  {
+    name: 'openshift-ingress.rules',
+    rules: [
+      { record: 'code:cluster:ingress_http_request_count:rate5m:sum' },


same here

cluster-monitoring-operator/jsonnet/rules.libsonnet

Lines 468 to 471 in 6e48653

{
  expr: 'sum by (code) (rate(haproxy_server_http_responses_total[5m]) > 0)',
  record: 'code:cluster:ingress_http_request_count:rate5m:sum',
},

jsonnet/utils/sanitize-rules.libsonnet

raptorsun · 2021-08-30T09:26:55Z

/retest

simonpasquier · 2021-08-30T12:50:24Z

namespace:container_memory_usage_bytes:sum is used in the operator e2e tests but TBH I'm not sure about the purpose of the test...

cluster-monitoring-operator/test/e2e/main_test.go

Lines 139 to 147 in 5675d32

    
// Once we have the need to test multiple recording rules, we can unite them in
// a single test function.
func TestMemoryUsageRecordingRule(t *testing.T) {
	f.ThanosQuerierClient.WaitForQueryReturnGreaterEqualOne(
		t,
		time.Minute,
		"count(namespace:container_memory_usage_bytes:sum)",
	)
}

raptorsun · 2021-08-30T14:29:30Z

"namespace:container_memory_usage_bytes:sum" has been added back. Thanks, @simonpasquier :)

simonpasquier · 2021-08-31T09:18:33Z

jsonnet/utils/sanitize-rules.libsonnet

+  {
+    name: 'kube-prometheus-node-recording.rules',
+    rules: [
+      { record: 'instance:node_cpu:ratio' },
+    ],
+  },
+  {
+    name: 'node.rules',
+    rules: [
+      { record: 'node:node_num_cpu:sum' },
+    ],
+  },
+  {
+    name: 'openshift-kubernetes.rules',
+    rules: [
+      { record: 'namespace:container_spec_cpu_shares:sum' },
+      { record: 'pod:container_memory_usage_bytes:sum' },
+      { record: 'pod:container_spec_cpu_shares:sum' },
+    ],
+  },


can you add a comment with a link to the BZ so we don't have to look into the Git history to understand why these rules are removed?

RiRa12621 · 2021-08-31T13:36:41Z

jsonnet/rules.libsonnet

@@ -465,10 +465,6 @@ function(params) {
    {
      name: 'openshift-ingress.rules',
      rules: [
-        {
-          expr: 'sum by (code) (rate(haproxy_server_http_responses_total[5m]) > 0)',


Is there a replacement for this implemented in any of the networking edge components?

RiRa12621 · 2021-08-31T13:37:23Z

Generally looks good, just wondering if there's a replacement for the networking edge related rules

raptorsun · 2021-09-01T08:56:30Z

Thank you @RiRa12621 😀
I'm going to check the ingress rule with the network team. It is the only thing left to decide before merging.

raptorsun · 2021-09-01T15:39:50Z

I was not able to find a replacement, so I have put the ingress rule back. @RiRa12621
We are ready to merge this PR now.

RiRa12621 · 2021-09-01T15:57:04Z

lgtm from my end

raptorsun · 2021-09-01T23:29:15Z

/retest

openshift-ci · 2021-09-02T07:51:06Z

@raptorsun: This pull request references Bugzilla bug 1996785, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @juzhao

In response to this:

Bug 1996785: [MON-1536]Remove unused rules.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

juzhao · 2021-09-02T08:41:53Z

/label qe-approved

raptorsun · 2021-09-02T08:51:10Z

/retest

simonpasquier

/lgtm

openshift-ci · 2021-09-02T09:47:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raptorsun, simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [raptorsun,simonpasquier]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2021-09-02T11:30:33Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

raptorsun · 2021-09-02T14:51:36Z

/retest-required

raptorsun · 2021-09-02T19:08:50Z

/retest-required

raptorsun · 2021-09-03T06:16:22Z

/retest-required

raptorsun · 2021-09-03T10:06:46Z

/retest-required

openshift-ci · 2021-09-03T12:42:24Z

@raptorsun: All pull requests linked via external trackers have merged:

openshift/cluster-monitoring-operator#1316

Bugzilla bug 1996785 has been moved to the MODIFIED state.

In response to this:

Bug 1996785: [MON-1536]Remove unused rules.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 9, 2021

openshift-ci bot requested review from bison and sthaha August 9, 2021 16:26

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2021

raptorsun force-pushed the feature/MON-1536 branch from b97a109 to 1f6a0ba Compare August 10, 2021 10:15

openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2021

raptorsun force-pushed the feature/MON-1536 branch from 1f6a0ba to 807af40 Compare August 11, 2021 08:54

openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2021

openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2021

raptorsun force-pushed the feature/MON-1536 branch from 807af40 to 7ad39e0 Compare August 11, 2021 13:17

openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2021

raptorsun force-pushed the feature/MON-1536 branch from 7ad39e0 to 5d91096 Compare August 12, 2021 15:16

simonpasquier reviewed Aug 17, 2021

raptorsun force-pushed the feature/MON-1536 branch from 5d91096 to 8212607 Compare August 17, 2021 16:08

raptorsun changed the title ~~[WIP] [MON-1536] Remove unused rules.~~ [MON-1536] Remove unused rules. Aug 18, 2021

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2021

simonpasquier reviewed Aug 19, 2021

update the script checking unused rules

141f775

raptorsun force-pushed the feature/MON-1536 branch from 8212607 to 113f80c Compare August 20, 2021 13:27

simonpasquier reviewed Aug 20, 2021
