Bug 1949589: allow high CPU alerts to be firing and pending #26102
Conversation
@deads2k: This pull request references Bugzilla bug 1949589, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, tkashem. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/retest
@deads2k: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
The failures appear independent of the addition of a skip on alerts.
/override ci/prow/e2e-metal-ipi-ovn-ipv6
@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-agnostic-cmd, ci/prow/e2e-aws-csi, ci/prow/e2e-aws-disruptive, ci/prow/e2e-gcp, ci/prow/e2e-gcp-upgrade, ci/prow/e2e-metal-ipi-ovn-ipv6
/override ci/prow/e2e-gcp-disruptive
@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-gcp-disruptive
@deads2k: Some pull requests linked via external trackers have merged; the following pull requests linked via external trackers have not merged. These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh. Bugzilla bug 1949589 has not been moved to the MODIFIED state.
We've allowed these for non-update jobs since 12b022c (allow high CPU alerts to be firing and pending, 2021-04-26, openshift#26102), but they show up in update jobs too. For example, [1] included:

    alert ExtremelyHighIndividualControlPlaneCPU fired for 60 seconds with labels: {instance="ci-op-vjm670pq-1ff06-pn8bq-master-1", severity="critical"}
    alert HighOverallControlPlaneCPU fired for 240 seconds with labels: {severity="warning"}

Searching for recent frequency:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+.*High.*ControlPlaneCPU+fired+for' | grep 'failures match' | sort
    periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 49 runs, 65% failed, 3% of failures match = 2% impact
    periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
    release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

I don't know why this would be GCP-specific, but I am copy/pasting in a Matches block I found elsewhere in the file to limit the exception to GCP.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1417199789052792832
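The exception being discussed can be sketched roughly as follows. This is an illustrative Go sketch only, not the actual openshift/origin code: `alertInterval` and `allowedFiring` are hypothetical names, and the logic (always allow the two high-CPU alerts in non-update jobs, allow them in update jobs only on GCP) is my reading of the commit messages above.

```go
package main

import "fmt"

// alertInterval is a simplified stand-in for the test framework's record of
// an alert observed during a run. The real structure in openshift/origin
// differs; this only illustrates the allow-list idea.
type alertInterval struct {
	Name     string // Prometheus alert name
	Platform string // cloud platform the job ran on, e.g. "gcp"
	Upgrade  bool   // whether the job was an upgrade (update) job
}

// allowedFiring reports whether a firing alert should be tolerated rather
// than fail the job. Hypothetical logic: the high-CPU alerts are allowed
// in all non-update jobs (per 12b022c), and in update jobs only when the
// platform matches GCP (per the follow-up change).
func allowedFiring(a alertInterval) bool {
	switch a.Name {
	case "HighOverallControlPlaneCPU", "ExtremelyHighIndividualControlPlaneCPU":
		if !a.Upgrade {
			return true // allowed for all non-update jobs
		}
		return a.Platform == "gcp" // update-job exception limited to GCP
	}
	return false
}

func main() {
	fmt.Println(allowedFiring(alertInterval{Name: "HighOverallControlPlaneCPU", Platform: "gcp", Upgrade: true}))
	fmt.Println(allowedFiring(alertInterval{Name: "HighOverallControlPlaneCPU", Platform: "aws", Upgrade: true}))
}
```

The Matches-style platform check is what keeps the relaxation narrow: an alert name alone is not enough to skip, so other platforms' update jobs still fail on these alerts and keep signal.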
Our e2e tests run at very high parallelism on relatively small masters, so we see high CPU usage. This is distinct from the customer use-cases around overall cluster size.