Bug 1734540: etcd should still alert when a member disappears from the endpoints #400

Merged
1 commit merged into openshift:master on Jul 31, 2019

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Jul 16, 2019

We have observed a number of clusters reporting a single degraded etcd member without any alert highlighting it. Our current alerting fires when quorum is lost, but a single member being down still matters because the cluster can no longer tolerate another master failure. Add an alert, EtcdMembersDown, that counts the number of suspected down members.
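
As a rough illustration of why a single down member matters, the majority-quorum arithmetic can be sketched as below. The function names are hypothetical, and the 3-member cluster size is an assumption chosen for the example, not something stated in this PR.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def remaining_fault_tolerance(members: int, down: int) -> int:
    """How many more members can fail before quorum is lost.

    A negative result means quorum is already lost.
    """
    return (members - down) - quorum_size(members)


if __name__ == "__main__":
    # Assumed 3-member cluster: show the tolerance with 0, 1 and 2 members down.
    for down in range(3):
        print(f"down={down} remaining_tolerance={remaining_fault_tolerance(3, down)}")
```

With three members and one already down, the remaining tolerance is zero: quorum still holds, so a quorum-loss alert stays silent, yet the next failure loses quorum.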

Further, when etcd metrics are collected under a Kubernetes service, an etcd instance can be removed from the endpoints (due to implementation details of how etcd is run), so extend the alert to also capture a failure rate of requests to that cluster above a minimum threshold. This could report a false positive, but a non-zero peer failure rate likely represents a serious failure. Use the max of either the per-instance metrics or the service up == 0 measurement to get a best guess of the number of down members.
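
The "max of either signal" idea can be modelled in plain Python as a minimal sketch. This is not the alert expression from this PR: the function name, the input shapes, the member names and the 0.01 threshold are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def estimate_down_members(
    up_by_instance: Dict[str, int],
    peer_failure_rate: Dict[Tuple[str, str], float],
    min_failure_rate: float = 0.01,  # hypothetical threshold
) -> int:
    """Best guess of down members from two independent signals."""
    # Signal 1: targets still listed in the endpoints but reporting up == 0.
    down_by_up = sum(1 for value in up_by_instance.values() if value == 0)

    # Signal 2: distinct peers that at least one member fails to reach
    # at a rate above the minimum threshold.
    failing_peers = {
        peer
        for (_sender, peer), rate in peer_failure_rate.items()
        if rate > min_failure_rate
    }

    # Either signal alone can undercount, so take the larger of the two.
    return max(down_by_up, len(failing_peers))


if __name__ == "__main__":
    # etcd-2 has vanished from the endpoints, so it has no `up` sample at all,
    # but the surviving members still report failures sending to it.
    up = {"etcd-0": 1, "etcd-1": 1}
    failures = {("etcd-0", "etcd-2"): 0.5, ("etcd-1", "etcd-2"): 0.4}
    print(estimate_down_members(up, failures))
```

In this example the `up` signal alone would report no down members, because the missing instance contributes no sample. The peer failure signal is what catches it.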

QUESTION: I followed the convention for the rest of the alerts, but noticed that etcd alerts are not consistent naming-wise (camelCase, not CamelCase). Is there a bug / item tracking fixing this?

@openshift-ci-robot
Contributor

@smarterclayton: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1730413: etcd should still alert when a member disappears from the endpoints

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jul 16, 2019
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 16, 2019
@smarterclayton smarterclayton changed the title Bug 1730413: etcd should still alert when a member disappears from the endpoints WIP: Bug 1730413: etcd should still alert when a member disappears from the endpoints Jul 16, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 16, 2019
@smarterclayton
Contributor Author

Still working on the alert, do not review

@smarterclayton smarterclayton changed the title WIP: Bug 1730413: etcd should still alert when a member disappears from the endpoints Bug 1730413: etcd should still alert when a member disappears from the endpoints Jul 17, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 17, 2019
@openshift-ci-robot
Contributor

@smarterclayton: This pull request references a valid Bugzilla bug.

In response to this:

Bug 1730413: etcd should still alert when a member disappears from the endpoints


@smarterclayton
Contributor Author

@hexfusion as per discussion yesterday, this will allow SRE to use the alert vs having to add their own metric (tracking up{service="etcd"} by itself was not sufficient as we discovered in prod and in testing)

@metalmatze
Contributor

Ultimately this should end up upstream in https://github.com/etcd-io/etcd/tree/master/Documentation/etcd-mixin. Once it is merged there, we only need to update the etcd dependency of the jsonnet configuration.

@smarterclayton
Contributor Author

Is the upstream going to be changed to be consistent with other alert name conventions?

@smarterclayton smarterclayton changed the title Bug 1730413: etcd should still alert when a member disappears from the endpoints Bug 1731233: etcd should still alert when a member disappears from the endpoints Jul 18, 2019
@openshift-ci-robot
Contributor

@smarterclayton: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1731233: etcd should still alert when a member disappears from the endpoints


@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 18, 2019
@smarterclayton
Contributor Author

Opened etcd-io/etcd#10906

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2019
@hexfusion
Contributor

Upstream is merged etcd-io/etcd#10906

…points

There is no alert that reports when an etcd quorum member is degraded.
Add an alert that reports if any etcd member is down or potentially
unreachable. When etcd metrics are collected under a Kubernetes
service, an etcd instance can be removed from the endpoints (due to
implementation details of how etcd is run), so extend the alert to
also capture the failure rate of requests to that cluster.
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2019
@smarterclayton
Contributor Author

Updated from latest etcd (I only bumped that dependency, because bumping everything picked up a more dramatic set of changes). This includes the new alert etcdMembersDown.

@smarterclayton smarterclayton changed the title Bug 1731233: etcd should still alert when a member disappears from the endpoints Bug 1734540: etcd should still alert when a member disappears from the endpoints Jul 30, 2019
@openshift-ci-robot
Contributor

@smarterclayton: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1734540: etcd should still alert when a member disappears from the endpoints


@smarterclayton
Contributor Author

/cherrypick release-4.1

@openshift-cherrypick-robot

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.1


@smarterclayton
Contributor Author

To reproduce: run oc debug node/MASTER_1, then mv /etc/kubernetes/manifests/etcd-member.yaml /tmp/, and check the dashboard:

image

@hexfusion
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 31, 2019
@s-urbaniak
Contributor

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2019
Contributor

@squat squat left a comment


/lgtm

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, s-urbaniak, smarterclayton, squat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 936e15e into openshift:master Jul 31, 2019
@openshift-ci-robot
Contributor

@smarterclayton: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state.

In response to this:

Bug 1734540: etcd should still alert when a member disappears from the endpoints


@openshift-cherrypick-robot

@smarterclayton: #400 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	assets/prometheus-k8s/rules.yaml
M	jsonnet/jsonnetfile.lock.json
M	pkg/manifests/bindata.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/manifests/bindata.go
CONFLICT (content): Merge conflict in pkg/manifests/bindata.go
Auto-merging jsonnet/jsonnetfile.lock.json
CONFLICT (content): Merge conflict in jsonnet/jsonnetfile.lock.json
Auto-merging assets/prometheus-k8s/rules.yaml
Patch failed at 0001 alerts: etcd should still alert when a member disappears from the endpoints

In response to this:

/cherrypick release-4.1



8 participants