[release-4.7] Bug 1930876: etcdInsufficientMembers is wrong when etcd is in a pod #1066
The upstream etcd alert is incorrect because it excludes only the instance label; OpenShift runs etcd in a pod, so the pod label must be excluded as well. This change excludes the upstream alert, improves the resiliency of the alert expression, and targets the alert at the expected job for the cluster etcd (job="etcd"). It also updates the description and health text to explain more clearly what insufficient members means, what the consequences are, and some actions to take, and it separates the alert into its own rule group in preparation for moving the alert into the cluster-etcd-operator repo in the future. The alert now includes etcd_server_has_leader == 1 so that an instance left over from a previous quorum is not counted in the majority calculation. This also flags cases where quorum cannot be established because of communication failures between nodes (but not between monitoring and etcd).
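To make the grouping problem concrete, here is a minimal Python sketch of the majority check described above. It is not the alert expression from this PR: the function name, the label values, and the `(n + 1) / 2` threshold are assumptions chosen for illustration, modeled on a typical quorum calculation.

```python
from collections import defaultdict


def firing_groups(series, excluded_labels):
    """Return the label groups in which too few members are healthy.

    series: list of dicts with keys 'labels' (dict), 'up' (bool) and
    'has_leader' (bool). excluded_labels: label names dropped before
    grouping, similar in spirit to a `without (...)` aggregation.
    All names here are hypothetical, for illustration only.
    """
    groups = defaultdict(list)
    for s in series:
        # Group series by every label that is NOT excluded.
        key = tuple(sorted(
            (k, v) for k, v in s["labels"].items() if k not in excluded_labels
        ))
        groups[key].append(s)

    firing = []
    for key, members in groups.items():
        # A member counts toward the majority only if it is reachable
        # AND reports that it has a leader.
        healthy = sum(1 for m in members if m["up"] and m["has_leader"])
        # Assumed threshold: fewer healthy members than a majority.
        if healthy < (len(members) + 1) / 2:
            firing.append(key)
    return firing


def member(i, up=True, has_leader=True):
    # Hypothetical label values for a three-member cluster.
    labels = {"job": "etcd", "instance": f"10.0.0.{i}:9979", "pod": f"etcd-{i}"}
    return {"labels": labels, "up": up, "has_leader": has_leader}


if __name__ == "__main__":
    one_down = [member(1), member(2), member(3, up=False)]
    print(firing_groups(one_down, {"instance"}))
    print(firing_groups(one_down, {"instance", "pod"}))
```

With only `instance` excluded, each pod forms its own group of one, so a single unreachable member makes its group fire even though two of three members are healthy. With both `instance` and `pod` excluded, the three members form one group and nothing fires until a majority is lost or a member stops reporting a leader.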
@openshift-cherrypick-robot: Bugzilla bug 1929944 has been cloned as Bugzilla bug 1930876. Retitling PR to link against new bug. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@openshift-cherrypick-robot: This pull request references Bugzilla bug 1930876, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
In response to this:
/lgtm
cc @s-urbaniak I am going to pick up CHANGELOG in another PR.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hexfusion, openshift-cherrypick-robot, simonpasquier The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes.
@openshift-cherrypick-robot: All pull requests linked via external trackers have merged: Bugzilla bug 1930876 has been moved to the MODIFIED state. In response to this:
This is an automated cherry-pick of #1064
/assign hexfusion