Bug 1996785: [MON-1536]Remove unused rules. #1316
@raptorsun I will close https://bugzilla.redhat.com/show_bug.cgi?id=1986033 as a duplicate of this.
Yes, please. This PR is removing the duplicated rules mentioned in the Bugzilla record. Thank you :)
{ record: 'cluster:capacity_cpu_cores_hyperthread_enabled:sum' },
{ record: 'cluster:capacity_cpu_sockets_hyperthread_enabled:sum' },
{ record: 'cluster:container_cpu_usage:ratio' },
{ record: 'cluster:container_spec_cpu_shares:ratio' },
{ record: 'cluster:hyperthread_enabled_nodes' },
{ record: 'cluster:infra_nodes' },
{ record: 'cluster:master_infra_nodes' },
{ record: 'cluster:memory_usage:ratio' },
{ record: 'cluster:node_cpu:ratio' },
{ record: 'cluster:node_cpu:sum_rate5m' },
{ record: 'cluster:usage:containers:sum' },
{ record: 'cluster:usage:ingress_frontend_bytes_in:rate5m:sum' },
{ record: 'cluster:usage:ingress_frontend_bytes_out:rate5m:sum' },
{ record: 'cluster:usage:ingress_frontend_connections:sum' },
{ record: 'cluster:usage:kube_node_ready:avg5m' },
{ record: 'cluster:usage:kube_schedulable_node_ready_reachable:avg5m' },
{ record: 'cluster:usage:openshift:ingress_request_error:fraction5m' },
{ record: 'cluster:usage:openshift:ingress_request_total:irate5m' },
{ record: 'cluster:usage:openshift:kube_running_pod_ready:avg' },
{ record: 'cluster:usage:pods:terminal:workload:sum' },
{ record: 'cluster:usage:resources:sum' },
{ record: 'cluster:usage:workload:capacity_physical_cpu_core_seconds' },
{ record: 'cluster:usage:workload:capacity_physical_cpu_cores:max:5m' },
{ record: 'cluster:usage:workload:capacity_physical_cpu_cores:min:5m' },
{ record: 'cluster:usage:workload:ingress_request_error:fraction5m' },
{ record: 'cluster:usage:workload:ingress_request_total:irate5m' },
{ record: 'cluster:usage:workload:kube_running_pod_ready:avg' },
at least all of these rules are sent to Telemetry so they shouldn't be removed...
Thank you @simonpasquier :)
These rules are not present in the telemetry client's config file "metrics.yaml".
Are there other config files we should pay attention to?
/retest
Need to merge this PR to fix the test ci/prow/versions
We have recording rules like kube_running_pod_ready that we shouldn't remove because they are used by other recording rules. I think that we've reached the limits of hack/check-rec-rule-usage.sh; it would probably be easier and more accurate to parse rule files in Go to identify these cases.
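To illustrate the idea of detecting recording rules that are consumed by other recording rules, here is a minimal Go sketch. It is not the project's tooling: the `Rule` type, the function names, and the `example:pods_per_node:count` rule are hypothetical, and the identifier matching uses a plain regular expression from the standard library instead of a real PromQL parser, which a production check would probably want.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Rule is a hypothetical, simplified view of a Prometheus rule.
// Record is empty for alerting rules.
type Rule struct {
	Record string
	Expr   string
}

// identRe matches tokens that look like metric names (colons allowed).
// Assumption: a regexp is good enough for a sketch; it also matches
// keywords and label names, which is harmless because tokens are only
// compared against the set of recorded names.
var identRe = regexp.MustCompile(`[a-zA-Z_:][a-zA-Z0-9_:]*`)

// referencedRecords returns the recorded names that appear in the
// expression of at least one other rule.
func referencedRecords(rules []Rule) map[string]bool {
	recorded := make(map[string]bool)
	for _, r := range rules {
		if r.Record != "" {
			recorded[r.Record] = true
		}
	}
	used := make(map[string]bool)
	for _, r := range rules {
		for _, tok := range identRe.FindAllString(r.Expr, -1) {
			if recorded[tok] && tok != r.Record {
				used[tok] = true
			}
		}
	}
	return used
}

// unreferencedRecords lists, in sorted order, the recorded names that no
// other rule expression refers to. These are only candidates for removal:
// telemetry, dashboards, and prometheus-adapter must be checked separately.
func unreferencedRecords(rules []Rule) []string {
	used := referencedRecords(rules)
	seen := make(map[string]bool)
	var out []string
	for _, r := range rules {
		if r.Record != "" && !used[r.Record] && !seen[r.Record] {
			seen[r.Record] = true
			out = append(out, r.Record)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	rules := []Rule{
		{
			Record: "node_namespace_pod:kube_pod_info:",
			Expr:   `max by (node, namespace, pod) (kube_pod_info{job="kube-state-metrics",node!=""})`,
		},
		{
			// Hypothetical rule that depends on the record above.
			Record: "example:pods_per_node:count",
			Expr:   `count by (node) (node_namespace_pod:kube_pod_info:{node!=""})`,
		},
	}
	fmt.Println(unreferencedRecords(rules))
}
```

A rule reported as unreferenced here is only unused by other rules; it may still be needed by consumers outside the rule files.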
max by (node, namespace, pod) (
  label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)")
))
record: 'node_namespace_pod:kube_pod_info:'
this recording metric is used by prometheus-adapter
node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}
If it hasn't been caught by the e2e tests, it would be worth extending them :)
By "extending them", do you mean removing all rules derived from "node_namespace_pod:kube_pod_info:"? I ran the e2e tests in CMO and they did not raise an error after removing the record "node_namespace_pod:kube_pod_info:".
I just found out that this rule is also referenced in record: 'cluster:cpu_usage_cores:sum', and that record is in turn referenced in the telemetry client config.
I am going to put it back for safety 😅
I meant that our e2e tests should be failing if by accident we remove a rule that is required by prometheus-adapter.
Shall we create some tests in CMO that check the completeness of the rules required by prometheus-adapter? Or should prometheus-adapter provide a tool to check this?
We already have TestPodMetricsPresence and TestNodeMetricsPresence in the operator e2e tests and I would have expected these tests to fail if recorded metrics that are used by prometheus adapter are removed.
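As a complement to the e2e tests, a cheaper static guard could compare the recorded series names against the names that prometheus-adapter queries. The Go sketch below is an assumption about how such a check might look, not what TestPodMetricsPresence or TestNodeMetricsPresence do; `missingRecords` and the input lists are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// missingRecords returns, in sorted order, every name in required that is
// not produced by any recording rule listed in recorded.
func missingRecords(recorded, required []string) []string {
	have := make(map[string]bool, len(recorded))
	for _, name := range recorded {
		have[name] = true
	}
	var missing []string
	for _, name := range required {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical inputs: in practice these would be extracted from the
	// generated rule manifests and from the prometheus-adapter config.
	recorded := []string{"cluster:cpu_usage_cores:sum"}
	required := []string{"node_namespace_pod:kube_pod_info:"}

	if missing := missingRecords(recorded, required); len(missing) > 0 {
		fmt.Println("recording rules required by prometheus-adapter are missing:", missing)
	}
}
```

A check like this would only catch names that are listed explicitly; queries built from templates in the adapter config would still need the e2e tests.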
{
  name: 'openshift-ingress.rules',
  rules: [
    { record: 'code:cluster:ingress_http_request_count:rate5m:sum' },
same here
cluster-monitoring-operator/jsonnet/rules.libsonnet
Lines 468 to 471 in 6e48653
{
  expr: 'sum by (code) (rate(haproxy_server_http_responses_total[5m]) > 0)',
  record: 'code:cluster:ingress_http_request_count:rate5m:sum',
},
/retest
cluster-monitoring-operator/test/e2e/main_test.go Lines 139 to 147 in 5675d32
"namespace:container_memory_usage_bytes:sum" has been added back. Thanks, @simonpasquier :)
{
  name: 'kube-prometheus-node-recording.rules',
  rules: [
    { record: 'instance:node_cpu:ratio' },
  ],
},
{
  name: 'node.rules',
  rules: [
    { record: 'node:node_num_cpu:sum' },
  ],
},
{
  name: 'openshift-kubernetes.rules',
  rules: [
    { record: 'namespace:container_spec_cpu_shares:sum' },
    { record: 'pod:container_memory_usage_bytes:sum' },
    { record: 'pod:container_spec_cpu_shares:sum' },
  ],
},
can you add a comment with a link to the BZ so we don't have to look into the Git history to understand why these rules are removed?
jsonnet/rules.libsonnet
@@ -465,10 +465,6 @@ function(params) {
{
  name: 'openshift-ingress.rules',
  rules: [
    {
      expr: 'sum by (code) (rate(haproxy_server_http_responses_total[5m]) > 0)',
Is there a replacement for this implemented in any of the networking edge components?
Generally looks good, just wondering if there's a replacement for the networking edge related rules.
Thank you @RiRa12621 😀
I was not able to find a replacement, so I have put the ingress rule back. @RiRa12621
lgtm from my end
/retest
@raptorsun: This pull request references Bugzilla bug 1996785, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/label qe-approved
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: raptorsun, simonpasquier The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest-required Please review the full test history for this PR and help us cut down flakes.
@raptorsun: All pull requests linked via external trackers have merged: Bugzilla bug 1996785 has been moved to the MODIFIED state. In response to this:
The recording rules that are not used in alerts, telemetry metrics, console, or dashboard definitions are removed.
The following rules are removed: