Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1996785: [MON-1536]Remove unused rules. #1316

Merged
merged 2 commits into from Sep 3, 2021

Conversation

raptorsun
Copy link
Contributor

@raptorsun raptorsun commented Aug 9, 2021

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

The recording rules that are not unused in alerts, telemetry metrics, console, or dashboard definitions are removed.

The following rules are removed:

  • build_error_rate
  • cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
  • code:registry_api_request_count:rate:sum
  • instance:node_cpu:ratio
  • kube_pod_status_ready:etcd:sum
  • kube_pod_status_ready:image_registry:sum
  • namespace:container_spec_cpu_shares:sum
  • node:node_num_cpu:sum
  • pod:container_spec_cpu_shares:sum

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 9, 2021
@openshift-ci openshift-ci bot requested review from bison and sthaha August 9, 2021 16:26
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2021
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2021
@paulfantom
Copy link
Contributor

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2021
@arajkumar
Copy link
Contributor

@raptorsun I will close https://bugzilla.redhat.com/show_bug.cgi?id=1986033 as a duplicate of this.

@raptorsun
Copy link
Contributor Author

@raptorsun I will close https://bugzilla.redhat.com/show_bug.cgi?id=1986033 as a duplicate of this.

Yes, please. This PR is removing the duplicated rules mentioned in the Bugzilla record. Thank you :)

Comment on lines 108 to 134
{ record: 'cluster:capacity_cpu_cores_hyperthread_enabled:sum' },
{ record: 'cluster:capacity_cpu_sockets_hyperthread_enabled:sum' },
{ record: 'cluster:container_cpu_usage:ratio' },
{ record: 'cluster:container_spec_cpu_shares:ratio' },
{ record: 'cluster:hyperthread_enabled_nodes' },
{ record: 'cluster:infra_nodes' },
{ record: 'cluster:master_infra_nodes' },
{ record: 'cluster:memory_usage:ratio' },
{ record: 'cluster:node_cpu:ratio' },
{ record: 'cluster:node_cpu:sum_rate5m' },
{ record: 'cluster:usage:containers:sum' },
{ record: 'cluster:usage:ingress_frontend_bytes_in:rate5m:sum' },
{ record: 'cluster:usage:ingress_frontend_bytes_out:rate5m:sum' },
{ record: 'cluster:usage:ingress_frontend_connections:sum' },
{ record: 'cluster:usage:kube_node_ready:avg5m' },
{ record: 'cluster:usage:kube_schedulable_node_ready_reachable:avg5m' },
{ record: 'cluster:usage:openshift:ingress_request_error:fraction5m' },
{ record: 'cluster:usage:openshift:ingress_request_total:irate5m' },
{ record: 'cluster:usage:openshift:kube_running_pod_ready:avg' },
{ record: 'cluster:usage:pods:terminal:workload:sum' },
{ record: 'cluster:usage:resources:sum' },
{ record: 'cluster:usage:workload:capacity_physical_cpu_core_seconds' },
{ record: 'cluster:usage:workload:capacity_physical_cpu_cores:max:5m' },
{ record: 'cluster:usage:workload:capacity_physical_cpu_cores:min:5m' },
{ record: 'cluster:usage:workload:ingress_request_error:fraction5m' },
{ record: 'cluster:usage:workload:ingress_request_total:irate5m' },
{ record: 'cluster:usage:workload:kube_running_pod_ready:avg' },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least all of these rules are sent to Telemetry so they shouldn't be removed...

Copy link
Contributor Author

@raptorsun raptorsun Aug 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @simonpasquier :)
These rules are not present in the config file "metrics.yaml" for telemetry client.
Is there other config files we should pay attention to?

@raptorsun
Copy link
Contributor Author

/retest

@raptorsun
Copy link
Contributor Author

Need to merge this PR to fix the test ci/prow/versions

@raptorsun raptorsun changed the title [WIP] [MON-1536] Remove unused rules. [MON-1536] Remove unused rules. Aug 18, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2021
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have recording rules like kube_running_pod_ready that we shouldn't remove because they are used by other recording rules. I think that we've reached the limits of hack/check-rec-rule-usage.sh, it would probably be easier and more accurate to parse rule files in Go to identify these cases.

assets/cluster-monitoring-operator/prometheus-rule.yaml Outdated Show resolved Hide resolved
assets/cluster-monitoring-operator/prometheus-rule.yaml Outdated Show resolved Hide resolved
max by (node, namespace, pod) (
label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)")
))
record: 'node_namespace_pod:kube_pod_info:'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this recording metric is used by prometheus-adapter

node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}

If it hasn't been caught by the e2e tests, it would be worth extending them :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "extending them", do you mean remove all rules derived from "node_namespace_pod:kube_pod_info:" ? I have run e2e test in CMO and it does not raise error by removing "record:node_namespace_pod:kube_pod_info:".

I just find out this rule is also reference in record: 'cluster:cpu_usage_cores:sum' This record rule is also reference in telemetry client config.

I am going to put it back for safety 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that our e2e tests should be failing if by accident we remove a rule that is required by prometheus-adapter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we create some tests in CMO checking the completeness of rules required for prometheus-adapter? or Prometheus-adapter shall provide a tool checking this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have TestPodMetricsPresence and TestNodeMetricsPresence in the operator e2e tests and I would have expected these tests to fail if recorded metrics that are used by prometheus adapter are removed.

jsonnet/utils/sanitize-rules.libsonnet Outdated Show resolved Hide resolved
jsonnet/utils/sanitize-rules.libsonnet Outdated Show resolved Hide resolved
jsonnet/utils/sanitize-rules.libsonnet Outdated Show resolved Hide resolved
assets/control-plane/prometheus-rule.yaml Show resolved Hide resolved
{
name: 'openshift-ingress.rules',
rules: [
{ record: 'code:cluster:ingress_http_request_count:rate5m:sum' },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

{
expr: 'sum by (code) (rate(haproxy_server_http_responses_total[5m]) > 0)',
record: 'code:cluster:ingress_http_request_count:rate5m:sum',
},

@raptorsun
Copy link
Contributor Author

/retest

@simonpasquier
Copy link
Contributor

namespace:container_memory_usage_bytes:sum is used in the operator e2e tests but TBH I'm not sure about the purpose of the test...

// Once we have the need to test multiple recording rules, we can unite them in
// a single test function.
func TestMemoryUsageRecordingRule(t *testing.T) {
f.ThanosQuerierClient.WaitForQueryReturnGreaterEqualOne(
t,
time.Minute,
"count(namespace:container_memory_usage_bytes:sum)",
)
}

@raptorsun
Copy link
Contributor Author

"namespace:container_memory_usage_bytes:sum" has been added back. Thanks, @simonpasquier :)

Comment on lines +77 to +98
{
name: 'kube-prometheus-node-recording.rules',
rules: [
{ record: 'instance:node_cpu:ratio' },
],
},
{
name: 'node.rules',
rules: [
{ record: 'node:node_num_cpu:sum' },
],
},
{
name: 'openshift-kubernetes.rules',
rules: [
{ record: 'namespace:container_spec_cpu_shares:sum' },
{ record: 'pod:container_memory_usage_bytes:sum' },
{ record: 'pod:container_spec_cpu_shares:sum' },
],
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment with a link to the BZ so we don't have to look into the Git history to understand why these rules are removed?

@@ -465,10 +465,6 @@ function(params) {
{
name: 'openshift-ingress.rules',
rules: [
{
expr: 'sum by (code) (rate(haproxy_server_http_responses_total[5m]) > 0)',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a replacement for this impemented in any of the networking edge components?

@RiRa12621
Copy link
Contributor

Generally looks good, just wondering if there's a replacement for the networking edge related rules

@raptorsun
Copy link
Contributor Author

Thank you @RiRa12621 😀
I'm going to check the ingress rule with network team. It is the only thing to decide before merging.

@raptorsun
Copy link
Contributor Author

Not able to find out its replacement, I have put the ingress rule back. @RiRa12621
We are ready to merge this PR now.

@RiRa12621
Copy link
Contributor

lgtm from my end

@raptorsun
Copy link
Contributor Author

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 2, 2021

@raptorsun: This pull request references Bugzilla bug 1996785, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @juzhao

In response to this:

Bug 1996785: [MON-1536]Remove unused rules.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 2, 2021

@raptorsun: This pull request references Bugzilla bug 1996785, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @juzhao

In response to this:

Bug 1996785: [MON-1536]Remove unused rules.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@juzhao
Copy link

juzhao commented Sep 2, 2021

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Sep 2, 2021
@raptorsun
Copy link
Contributor Author

/retest

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 2, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raptorsun, simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [raptorsun,simonpasquier]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@raptorsun
Copy link
Contributor Author

/retest-required

3 similar comments
@raptorsun
Copy link
Contributor Author

/retest-required

@raptorsun
Copy link
Contributor Author

/retest-required

@raptorsun
Copy link
Contributor Author

/retest-required

@openshift-merge-robot openshift-merge-robot merged commit 3e3b1e3 into openshift:master Sep 3, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 3, 2021

@raptorsun: All pull requests linked via external trackers have merged:

Bugzilla bug 1996785 has been moved to the MODIFIED state.

In response to this:

Bug 1996785: [MON-1536]Remove unused rules.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants