Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubeadm: Remove etcd members from the etcd cluster when reset the nodes #74112

Merged
merged 1 commit into from Feb 22, 2019

Conversation

@pytimer
Copy link
Contributor

pytimer commented Feb 15, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

/kind feature

What this PR does / why we need it:

Add feature that remove etcd members from the etcd cluster when reset the control plane nodes.

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#1211
Ref: kubernetes/kubeadm#1373

Special notes for your reviewer:

Sending a PR to review.

When run kubeadm reset command, kubeadm will get local etcd data path, if path exists and error is nil, the second step is remove this etcd member from the etcd cluster, else nothing to do.

I don't know if there is anything missing, if this workflow wrong, hope point out. Thanks.

Does this PR introduce a user-facing change?:

kubeadm: remove local etcd members from the etcd cluster when kubeadm reset
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Feb 15, 2019

Hi @pytimer. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yagonobre
Copy link
Member

yagonobre left a comment

Hi @pytimer, thanks for your pr.
I added some comments, but in general this looks good.
Also, I guess that we need a release note for this change.

/ok-to-test
/priority important-soon

@@ -144,6 +145,13 @@ func (r *Reset) Run(out io.Writer, client clientset.Interface) error {
fmt.Println("[reset] please manually reset etcd to prevent further issues")
}

if err == nil {

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

Move this to line 143, and remove the first if

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

Done.

@@ -144,6 +145,13 @@ func (r *Reset) Run(out io.Writer, client clientset.Interface) error {
fmt.Println("[reset] please manually reset etcd to prevent further issues")
}

if err == nil {
if err := removeEtcdMember(client); err != nil {
klog.Errorf("[reset] failed to removi etcd member from the cluster: %+v", err)

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

s/removi/remove

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

Done.

return err
}

// notifies the other members of the etcd cluster about the joining member

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

s/joining/removing

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

Done.

// when reset the control plane node.
func RemoveStackedEtcdMemberFromCluster(client clientset.Interface, cfg *kubeadmapi.InitConfiguration) error {
// creates an etcd client that connects to all the local/stacked etcd members
klog.V(1).Info("creating etcd client that connects to etcd pods")

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

Add [etcd] prefix to all logs

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

Done.

defer cli.Close()

// Remove a exist member from the cluster
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

why use 20 here and 10 for GetMemberID, also can you use a single constant for all timeouts?

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

@yagonobre 20 timeout is no special meaning, i modify this 10 seconds, consistent with GetMemberID. I see all functions about etcd not use single constant, so i don't add constant temporarily. If needs, i will add.

}
klog.V(2).Infof("the member id of etcd peer %s is %d", etcdPeerAddress, id)

klog.V(1).Infof("Removing etcd member: %s", etcdPeerAddress)

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

Maybe we can add the MemberID for this log and remove the two klog.V(2), but I'm not sure, Wdyt?

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

I remove the above klog.V(2).

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 15, 2019

/retest

@@ -139,6 +140,10 @@ func (r *Reset) Run(out io.Writer, client clientset.Interface) error {
etcdDataDir, err := getEtcdDataDir(etcdManifestPath, client)
if err == nil {
dirsToClean = append(dirsToClean, etcdDataDir)
if err := removeEtcdMember(client); err != nil {
klog.Errorf("[reset] failed to remove etcd member from the cluster: %+v", err)
return err

This comment has been minimized.

@ereslibre

ereslibre Feb 15, 2019

Member

Since reset is a best effort and we cannot assume anything about the current state of the system I'd say we don't want to return here. Log the error, but let's continue with the reset execution, so we can clean as much as possible.

This comment has been minimized.

@yagonobre

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

Thanks for you response. There's a question that if remove etcd member error, maybe make etcd cluster not health, what to do in this case?

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

IMO you should expect that it can happen if you are removing a control-plane

This comment has been minimized.

@pytimer

pytimer Feb 15, 2019

Author Contributor

If it happens, kubeadm reset workflow continue or aborted? Do we have a method deal with it?

This comment has been minimized.

@yagonobre

yagonobre Feb 15, 2019

Member

it would be nice, if it's a simple message.

This comment has been minimized.

@pytimer

pytimer Feb 16, 2019

Author Contributor

I update the code, and remove return error if something wrong.

This comment has been minimized.

@yagonobre

yagonobre Feb 16, 2019

Member

Thanks! Can you remove the wip and squash your commits?

This comment has been minimized.

@pytimer

pytimer Feb 16, 2019

Author Contributor

I want to add some testing. When testing done, i squash commits and remove the wip, is it ok?

This comment has been minimized.

@yagonobre

yagonobre Feb 16, 2019

Member

Yes, Go ahead!

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Feb 16, 2019

@pytimer pytimer force-pushed the pytimer:kubeadm-reset branch from 93b6459 to 1d9049c Feb 16, 2019

@pytimer pytimer changed the title [WIP] kubeadm: Remove etcd members from the etcd cluster when reset the nodes kubeadm: Remove etcd members from the etcd cluster when reset the nodes Feb 16, 2019

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 21, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 21, 2019

@fabriziopandini @neolit123 I change cmd/kubeadm/app/cmd/BUILD file about bazel according to the testing, hope you can reviews it and if something wrong, point out. Thanks.

@neolit123
Copy link
Member

neolit123 left a comment

found a couple of minor typos in the code comments.

return 0, nil
}

// RemoveMember notifies an existing etcd cluster that remove a exist member

This comment has been minimized.

@neolit123

neolit123 Feb 21, 2019

Member

// RemoveMember notifies an etcd cluster to remove an existing member

}
defer cli.Close()

// Remove a exist member from the cluster

This comment has been minimized.

@neolit123

neolit123 Feb 21, 2019

Member

// Remove an existing member from the cluster

@pytimer pytimer force-pushed the pytimer:kubeadm-reset branch from ff9fda6 to 83f5296 Feb 22, 2019

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 22, 2019

@neolit123 Thanks, i already updated.

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Feb 22, 2019

/retest

1 similar comment
@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Feb 22, 2019

/retest

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 22, 2019

@neolit123 I see the pull-kubernetes-integration testing result shows kubectl run --generator=job/v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead., is it related?

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Feb 22, 2019

is it related?

no. the test is flaking.

@fabriziopandini

This comment has been minimized.

Copy link
Contributor

fabriziopandini commented Feb 22, 2019

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 22, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Feb 22, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, pytimer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 22, 2019

/test pull-kubernetes-e2e-gce

@k8s-ci-robot k8s-ci-robot merged commit 9e53b85 into kubernetes:master Feb 22, 2019

15 of 16 checks passed

pull-kubernetes-e2e-gce Job triggered.
Details
cla/linuxfoundation pytimer authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-godeps Skipped
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped
tide In merge pool.
Details
@@ -146,6 +148,62 @@ type Member struct {
PeerURL string
}

// GetMemberID returns the member ID of the given peer URL
func (c Client) GetMemberID(peerURL string) (uint64, error) {

This comment has been minimized.

@tedyu

tedyu Feb 22, 2019

Contributor

The other func's are declared with (c *Client)
Should these two follow the same pattern ?

This comment has been minimized.

@pytimer

pytimer Feb 23, 2019

Author Contributor

@tedyu Thanks for you point out it.
Now there are different signatures in etcd.go. This PR implement remove local etc member from cluster, so keep function have the same signatures is not in the scope of this PR. I will open an issue to track it. You can attention it.

@miry

This comment has been minimized.

Copy link
Contributor

miry commented Feb 26, 2019

@pytimer is it possible to make it as phase, so i can clean etcd node without do kubeadm reset.

Example:

I have 3 master nodes. Then I terminated first node, and boot with same ip address. So I need to clean etcd member before do join(because there are still old members there).

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Feb 26, 2019

hm, then again, you can also just call ectdctl to do the same?

@miry

This comment has been minimized.

Copy link
Contributor

miry commented Feb 26, 2019

@neolit123 yep. I just want to use one tool, if it has such functionality. Now it looks ugly: https://github.com/jetthoughts/infrastructure/blob/k8s-v1.14/modules/k8s_master/data/master_init.tpl.sh#L33

@tedyu

This comment has been minimized.

Copy link
Contributor

tedyu commented Feb 26, 2019

@neolit123
Did you mean using etcdctl (instead of ectdctl) ?

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Feb 26, 2019

Did you mean using etcdctl (instead of ectdctl)

yes

@neolit123 yep. I just want to use one tool, if it has such functionality

could you please log an k/kubeadm issue about "reset phases", we already have phases for join and init, upgrade is planned, but no plan for release yet.

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 26, 2019

I think like as @neolit123 said, kubeadm maybe need add reset phase workflow to support it. Now you should use etcdctl to do it.

@miry

This comment has been minimized.

Copy link
Contributor

miry commented Feb 26, 2019

It could be also implemented when do join master node with experimental-control-plane. Similar what I am doing: check if it is local cluster and there is no etcd folder, then remove member. or ignore ownself ip when do etcd health check.

anyway thanks for help. will wait also for reset phases.

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Feb 26, 2019

please note that unless we have the ticket for reset phases, we are probably going to forget about it. :)

@pytimer

This comment has been minimized.

Copy link
Contributor Author

pytimer commented Feb 26, 2019

If kubeadm add reset phases workflow, I am so willing to add remove etcd in it.

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Feb 26, 2019

it can be considered as a proposal at this point and also has to be discussed in a kubeadm meeting.
but also, it will not be part of the 1.14 release.

@miry

This comment has been minimized.

Copy link
Contributor

miry commented Feb 26, 2019

@neolit123 thanks. created proposal: kubernetes/kubeadm#1425

@pytimer pytimer deleted the pytimer:kubeadm-reset branch Mar 1, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.