Bug 1949589: allow high CPU alerts to be firing and pending #26102
Conversation
@deads2k: This pull request references Bugzilla bug 1949589, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, tkashem. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/retest
@deads2k: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
The failures appear independent of the addition of a skip on alerts.
/override ci/prow/e2e-metal-ipi-ovn-ipv6
@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-agnostic-cmd, ci/prow/e2e-aws-csi, ci/prow/e2e-aws-disruptive, ci/prow/e2e-gcp, ci/prow/e2e-gcp-upgrade, ci/prow/e2e-metal-ipi-ovn-ipv6
/override ci/prow/e2e-gcp-disruptive
@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-gcp-disruptive
@deads2k: Some pull requests linked via external trackers have merged; the following pull requests linked via external trackers have not merged. These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh. Bugzilla bug 1949589 has not been moved to the MODIFIED state.
We've allowed these for non-update jobs since 12b022c (allow high CPU alerts to be firing and pending, 2021-04-26, openshift#26102), but they show up in update jobs too. For example, [1] included:

    alert ExtremelyHighIndividualControlPlaneCPU fired for 60 seconds with labels: {instance="ci-op-vjm670pq-1ff06-pn8bq-master-1", severity="critical"}
    alert HighOverallControlPlaneCPU fired for 240 seconds with labels: {severity="warning"}

Searching for recent frequency:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+.*High.*ControlPlaneCPU+fired+for' | grep 'failures match' | sort
    periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 49 runs, 65% failed, 3% of failures match = 2% impact
    periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
    release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

I don't know why this would be GCP-specific, but I am copy/pasting in a Matches block I found elsewhere in the file to limit the exception to GCP.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1417199789052792832
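The exception being discussed can be sketched roughly as follows. This is an illustrative Go sketch only, not the actual openshift/origin code: `alertInterval` and `allowedFiring` are hypothetical names, and the logic (always allow the two high-CPU alerts in non-update jobs, allow them in update jobs only on GCP) is my reading of the commit messages above.

```go
package main

import "fmt"

// alertInterval is a simplified stand-in for the test framework's record of
// an alert observed during a run. The real structure in openshift/origin
// differs; this only illustrates the allow-list idea.
type alertInterval struct {
	Name     string // Prometheus alert name
	Platform string // cloud platform the job ran on, e.g. "gcp"
	Upgrade  bool   // whether the job was an upgrade (update) job
}

// allowedFiring reports whether a firing alert should be tolerated rather
// than fail the job. Hypothetical logic: the high-CPU alerts are allowed
// in all non-update jobs (per 12b022c), and in update jobs only when the
// platform matches GCP (per the follow-up change).
func allowedFiring(a alertInterval) bool {
	switch a.Name {
	case "HighOverallControlPlaneCPU", "ExtremelyHighIndividualControlPlaneCPU":
		if !a.Upgrade {
			return true // allowed for all non-update jobs
		}
		return a.Platform == "gcp" // update-job exception limited to GCP
	}
	return false
}

func main() {
	fmt.Println(allowedFiring(alertInterval{Name: "HighOverallControlPlaneCPU", Platform: "gcp", Upgrade: true}))
	fmt.Println(allowedFiring(alertInterval{Name: "HighOverallControlPlaneCPU", Platform: "aws", Upgrade: true}))
}
```

The Matches-style platform check is what keeps the relaxation narrow: an alert name alone is not enough to skip, so other platforms' update jobs still fail on these alerts and keep signal.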
Our e2e tests run at very high parallelism on relatively small masters, so we see high CPU usage. This is distinct from the customer use-cases around overall cluster size.