Bug 1734540: etcd should still alert when a member disappears from the endpoints #400
|
@smarterclayton: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Still working on the alert, do not review |
2aed239 to 4ec92a4
|
@smarterclayton: This pull request references a valid Bugzilla bug. In response to this:
|
@hexfusion as per discussion yesterday, this will allow SRE to use the alert vs having to add their own metric (tracking |
|
Ultimately this should end up upstream in https://github.com/etcd-io/etcd/tree/master/Documentation/etcd-mixin. Once we merged it there we only need to update the etcd dependency of the jsonnet configuration. |
|
Is the upstream going to be changed to be consistent with other alert name conventions? |
|
4ec92a4 to e64e1e9
|
Opened etcd-io/etcd#10906 |
|
Upstream is merged etcd-io/etcd#10906 |
…points There is no alert that reports when an etcd quorum member is degraded. Add an alert that reports if any etcd member is down or potentially unreachable. Since the etcd instance may be removed from the endpoints when etcd metrics are collected under a Kubernetes service (due to implementation details of how etcd is run), extend the alert to also capture the failure rate of requests to that cluster.
e64e1e9 to deb3405
|
Updated from latest etcd (I only bumped that dependency, because it picked up a more dramatic set). This includes the new alert |
|
|
/cherrypick release-4.1 |
|
@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you. In response to this:
|
/lgtm |
|
/approve |
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hexfusion, s-urbaniak, smarterclayton, squat The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@smarterclayton: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state. In response to this:
|
@smarterclayton: #400 failed to apply on top of branch "release-4.1": In response to this:

Upon observation there are a number of clusters reporting a single degraded etcd member, but without an alert highlighting that behavior. Our current alerting reports when quorum is lost, but a single member being down is still important because the cluster can no longer tolerate another master failure. Add alert `EtcdMembersDown` that counts the number of suspected down members.

Further, when the etcd metrics are collected under a Kubernetes service, it is possible for the etcd instance to be removed from the endpoints (due to implementation details of how etcd is run), so extend the alert to also capture the failure rate of requests to that cluster above a minimum threshold. This might report a false positive, but a non-zero peer failure rate likely represents a serious failure. Use the max of either the metrics from an instance or the service `up == 0` measurement to get a best guess of the number of down members.
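
To illustrate the "max of two estimates" idea described above, here is a minimal Python sketch. It is not the alert expression from this PR. The function name, the input shapes, and the `min_rate` default of 0.01 are assumptions made for illustration only; the real rule is written in PromQL over Prometheus metrics.

```python
def estimate_down_members(up_by_instance, peer_failure_rate, min_rate=0.01):
    """Best-guess count of down etcd members (illustrative only).

    up_by_instance: {instance: 0 or 1}, the scrape `up` value for each
        member that is still listed in the service endpoints.
    peer_failure_rate: {(source, target): failures_per_second}, the rate of
        failed peer requests as reported by the members that are still up.
    min_rate: hypothetical minimum failure rate; the value is a placeholder.
    """
    # Estimate 1: members still in the endpoints whose scrape reports up == 0.
    down_by_up = sum(1 for value in up_by_instance.values() if value == 0)

    # Estimate 2: distinct peers that at least one member fails to reach at a
    # rate above the threshold. This still works when the failed member has
    # been removed from the endpoints and has no `up` sample at all.
    unreachable_peers = {
        target
        for (_source, target), rate in peer_failure_rate.items()
        if rate > min_rate
    }

    # Take the larger of the two estimates as the best guess.
    return max(down_by_up, len(unreachable_peers))
```

In this sketch, a member that has vanished from the endpoints contributes nothing to the first estimate but is still counted by the second, which is the case the PR title describes.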
QUESTION: I followed the convention for the rest of the alerts, but noticed that etcd alerts are not consistent naming wise (camelCase not CamelCase). Is there a bug / item tracking fixing this?