Bug 1991010: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up #638
Fixing a TODO from 02a389b (sync: Temporarily stop checking version in ClusterOperator.Status, 2019-01-23, openshift#98). The semantics of the 'operator' entry have been clear since at least openshift/api@40c55f8085 (config/v1/types_cluster_operator: ClusterOperatorStatus doc wordsmithing, 2019-10-28, openshift/api#501).
ClusterOperatorDown is based on cluster_operator_up, but we also have ClusterOperatorDegraded based on cluster_operator_conditions{condition="Degraded"}. Firing ClusterOperatorDown for operators which are Available=True and Degraded=True confuses users [1]. ClusterOperatorDegraded will also be firing.

With this commit, I'm adjusting cluster_operator_up to only care about Available, to decouple the two alerts and bring them in line with their existing "has not been available" and "has been degraded" descriptions.

However, IsOperatorStatusConditionTrue requires the condition to be present, and cluster_operator_conditions only creates entries when the conditions are present. To guard against the Degraded-unset condition in ClusterOperatorDegraded, I'm covering with an 'or' [2] and 'group by' [3] guard.

So we should have the following cases:

* Available unset or != True: ClusterOperatorDown will fire.
* Available=True, Degraded unset or != False: ClusterOperatorDegraded will fire. Firing on unset is new in this commit. Not firing ClusterOperatorDown here is new in this commit.
* Available=True, Degraded=False: Neither alert fires.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1834551#c0
[2]: https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators
[3]: https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
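The case list above can be sketched as a small truth table in Go. This is an illustration only, not the code in pkg/cvo/metrics or the actual alert rules: the `Alerts` type and `alertsFor` function are hypothetical names, and an absent map key stands in for an unset condition. The description only states that ClusterOperatorDown fires when Available is unset or not True. Evaluating the Degraded check independently in that case is an assumption of this sketch.

```go
package main

import "fmt"

// ConditionStatus models the three values a ClusterOperator condition can take.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Alerts is a hypothetical summary of which of the two alerts would fire.
type Alerts struct {
	Down     bool // ClusterOperatorDown
	Degraded bool // ClusterOperatorDegraded
}

// alertsFor is a hypothetical helper that follows the case list in the
// description. A condition type missing from the map is treated as unset.
// Assumption: the two checks are evaluated independently of each other.
func alertsFor(conditions map[string]ConditionStatus) Alerts {
	available, availableSet := conditions["Available"]
	degraded, degradedSet := conditions["Degraded"]
	return Alerts{
		// Available unset or != True: ClusterOperatorDown fires.
		Down: !availableSet || available != ConditionTrue,
		// Degraded unset or != False: ClusterOperatorDegraded fires.
		Degraded: !degradedSet || degraded != ConditionFalse,
	}
}

func main() {
	cases := []map[string]ConditionStatus{
		{}, // nothing set
		{"Available": ConditionTrue}, // Degraded unset
		{"Available": ConditionTrue, "Degraded": ConditionTrue},
		{"Available": ConditionTrue, "Degraded": ConditionFalse},
	}
	for _, c := range cases {
		fmt.Printf("%v -> %+v\n", c, alertsFor(c))
	}
}
```

The second and third cases show the behavior change the description calls out: an Available=True operator no longer trips ClusterOperatorDown just because Degraded is True or missing.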
@wking: This pull request references Bugzilla bug 1991010, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
Failures should be fixed by openshift/release#20978.

/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1991010 has been moved to the MODIFIED state.
/bugzilla refresh
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1991010 has been moved to the MODIFIED state.
Bringing #550 back to 4.7. By hand, since there are neighbor-line conflicts with #587, which landed first in 4.7, but is backported from changes that landed later in master/4.8.