Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add etcd WithRequireLeader option to API watches #89881

Merged
merged 1 commit into from Apr 8, 2020
Merged

Add etcd WithRequireLeader option to API watches #89881

merged 1 commit into from Apr 8, 2020

Conversation

embano1
Copy link
Member

@embano1 embano1 commented Apr 6, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
Watches against etcd in the API server can hang forever if the etcd
cluster loses quorum, e.g. the majority of nodes crashes. This fix
improves responsiveness (detection and reaction time) of API server
watches against etcd in some rare (but still possible) edge cases so
that watches are terminated with "etcdserver: no leader" (ErrNoLeader).

Implementation behavior described by @jingyih:

The etcd server waits until it cannot find a leader for 3 election
timeouts to cancel existing streams. 3 is currently a hard coded
constant. The election timeout defaults to 1000ms.

If the cluster is healthy, when the leader is stopped, the leadership
transfer should be smooth. (leader transfers its leadership before
stopping). If leader is hard killed, other servers will take an election
timeout to realize leader lost and start campaign.

For further details, discussion and validation see
#89488 (comment)
and etcd-io/etcd#8980.

Signed-off-by: Michael Gasch mgasch@vmware.com

Which issue(s) this PR fixes:
Fixes #89488

Does this PR introduce a user-facing change?:

"NONE"

/sig api-machinery

cc/ @dims @jingyih

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/apiserver labels Apr 6, 2020
Copy link
Contributor

@jingyih jingyih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2020
@jingyih
Copy link
Contributor

jingyih commented Apr 6, 2020

cc @wojtek-t @jpbetz

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

Until I better understand the consequences of this change.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Apr 6, 2020
@dims
Copy link
Member

dims commented Apr 7, 2020

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 7, 2020
@dims
Copy link
Member

dims commented Apr 7, 2020

/assign @wojtek-t
/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 7, 2020
Watches against etcd in the API server can hang forever if the etcd
cluster loses quorum, e.g. the majority of nodes crashes. This fix
improves responsiveness (detection and reaction time) of API server
watches against etcd in some rare (but still possible) edge cases so
that watches are terminated with `"etcdserver: no leader"
(ErrNoLeader)`.

Implementation behavior described by jingyih:

```
The etcd server waits until it cannot find a leader for 3 election
timeouts to cancel existing streams. 3 is currently a hard coded
constant. The election timeout defaults to 1000ms.

If the cluster is healthy, when the leader is stopped, the leadership
transfer should be smooth. (leader transfers its leadership before
stopping). If leader is hard killed, other servers will take an election
timeout to realize leader lost and start campaign.
```

For further details, discussion and validation see
#89488 (comment)
and etcd-io/etcd#8980.

Closes: #89488

Signed-off-by: Michael Gasch <mgasch@vmware.com>
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 7, 2020
@wojtek-t
Copy link
Member

wojtek-t commented Apr 8, 2020

/hold cancel
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 8, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, embano1, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2020
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 8, 2020

@embano1: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-node-e2e-containerd 45f4e1a137e378962f44ccd8620417f9e948f873 link /test pull-kubernetes-node-e2e-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 3c3fc80 into kubernetes:master Apr 8, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Apr 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarification on watch when server (or store) is partitioned
6 participants