kubeadm: run MemberAdd/Remove for etcd clients with exp-backoff retry #79677

Merged
merged 1 commit into from Jul 4, 2019

@neolit123
Member

commented Jul 3, 2019

What this PR does / why we need it:
When a new etcd member is added, the etcd cluster can enter a
voting state in which any other member added at the same moment
fails immediately with an error.

Implement exponential backoff retry around the MemberAdd call.

This solves a kubeadm problem when concurrently joining
control-plane nodes with stacked etcd members.

In testing, a few retries spaced milliseconds apart were
sufficient to join a cluster of three control-plane nodes (3xCP)
concurrently.

Apply the same backoff to MemberRemove in case the concurrent
removal of members fails for similar reasons.

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#1646

Special notes for your reviewer:
See the issue linked above for testing details.

Does this PR introduce a user-facing change?:

kubeadm: implement support for concurrent add/remove of stacked etcd members

NOTE: should be backported to 1.15.

/priority critical-urgent
/kind bug
/assign @timothysc @fabriziopandini
/cc @ereslibre @dlipovetsky
@kubernetes/sig-cluster-lifecycle-pr-reviews

var lastError error
var resp *clientv3.MemberAddResponse
err = wait.ExponentialBackoff(addRemoveBackoff, func() (bool, error) {
resp, err = cli.MemberAdd(context.Background(), []string{peerAddrs})

@neolit123

neolit123 Jul 3, 2019

Author Member

please note that etcd does seem to have a number of different "retry clients", but those seem undocumented or at least i couldn't find the docs. https://github.com/etcd-io/etcd/blob/master/clientv3/retry.go#L159-L184

i did not investigate further as this solution:

  • gives us fine grained control over the backoff
  • worked in my tests
@neolit123

Member Author

commented Jul 3, 2019

/hold
for review.

@neolit123

Member Author

commented Jul 3, 2019

/test pull-kubernetes-e2e-kind
just in case, even if not joining concurrently for the time being.

@SataQiu

Member

commented Jul 3, 2019

LGTM 👍

Member

left a comment

Thanks @neolit123

cmd/kubeadm/app/util/etcd/etcd.go
Member

left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 3, 2019
@detiber

Member

commented Jul 3, 2019

lgtm as well

Member

left a comment

/lgtm

Thank you @neolit123, I think I never saw the blackout during node removal though (even on critical etcd shrinks). I'm fine being safe on both though, I might be misremembering :)

Member

left a comment

@neolit123 thanks!
looking forward to replacing this code with etcdadm, but for the meantime this is ok.
/lgtm
/approve

@k8s-ci-robot

Contributor

commented Jul 4, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SataQiu
SataQiu approved these changes Jul 4, 2019
@neolit123

Member Author

commented Jul 4, 2019

@fabriziopandini

looking forward to replacing this code with etcdadm, but for the meantime this is ok.

actually, we are discussing using the same logic in etcdadm, because etcd simply does not support concurrent join to the best of our knowledge.

thanks for the +1s, will send a cherry pick soon.
/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit 7340b63 into kubernetes:master Jul 4, 2019
24 checks passed
cla/linuxfoundation neolit123 authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-e2e-kind Job succeeded.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.
k8s-ci-robot added a commit that referenced this pull request Jul 10, 2019
…677-origin-release-1.15

Automated cherry pick of #79677: kubeadm: run MemberAdd/Remove for etcd clients with