
Unable to add new master/etcd node to cluster #3471

Closed · adoo123 opened this issue Oct 8, 2018 · 22 comments

adoo123 commented Oct 8, 2018

Current:

master1
master2
master3

[etcd]
master1
master2
master3

[kube-node]
node1
node2
node3

After adding master4 and master5 to the master and etcd groups:

master1
master2
master3
master4
master5

[etcd]
master1
master2
master3
master4
master5

[kube-node]
node1
node2
node3

Now I have 3 masters/etcd and 45 nodes. I've already referenced #1122 but couldn't fix it. I extended etcd successfully, but the master extension failed. kubectl shows this error:

Unable to connect to the server: x509: certificate is valid for "new master ip"

And my extend command is:

ansible-playbook -i inventory/mycluster/host.ini cluster.yml -l master1,master2,master3,master4,master5

My Kubernetes cluster version is 1.9.3. How can I fix this?


ykfq commented Apr 1, 2019

The feature for scaling master nodes seems imperfect, but it's possible to scale the etcd cluster separately: just add the etcd nodes under [etcd] and rerun cluster.yml.

@juliohm1978 (Contributor)

I'm facing the same issue with adding new masters. I'm using Kubespray v2.10.x and the reason it fails is that Kubespray does not update the apiserver certificates to add the new master to the SAN list.

You can check your certificate with

openssl x509 -text -noout -in /etc/kubernetes/ssl/apiserver.crt

... and the new master IP and hostname should be listed in the Subject Alternative Name section.

X509v3 Subject Alternative Name: 
                DNS:infra00-lab, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost, DNS:infra00-lab, DNS:lb-apiserver.kubernetes.local, IP Address:10.233.0.1, IP Address:172.31.134.110, IP Address:172.31.134.110, IP Address:10.233.0.1, IP Address:127.0.0.1, IP Address:172.31.134.110

The execution of cluster.yml adds the new master IP and hostname to /etc/kubernetes/kubeadm-config.yaml as expected. It seems, however, that Kubespray is not calling kubeadm to replace the certificate before trying to join the new master node. We fixed this by using kubeadm manually to recreate the certificate.

NOTE: This works for v2.10.x. I never tested it in older versions of Kubespray.

In your first master, recreate the apiserver certificate.

cd /etc/kubernetes/ssl
mv apiserver.crt apiserver.crt.old
mv apiserver.key apiserver.key.old

cd /etc/kubernetes
kubeadm init phase certs apiserver --config kubeadm-config.yaml

If you are doing this after you ended up with a broken master, be sure to run reset.yml using the parameter --limit=<broken_master_hostname> before continuing. If you take the precaution of recreating the certificate before adding the new master node, you won't need this.
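
Spelled out, that reset call looks roughly like this (the inventory path is just an example, and depending on the Kubespray version you may also be prompted for a reset confirmation):

# wipe only the broken master before re-adding it
ansible-playbook -i inventory/mycluster/hosts.ini reset.yml --limit=<broken_master_hostname>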

Run cluster.yml to include the new master node. You should end up with a working cluster.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2019
@ppcololo (Contributor)

is it possible to add master? or replace failed master for new one?


juliohm1978 commented Sep 10, 2019

You should be able to. In the past, we managed to replace all nodes in the cluster: master, etcd and workers. But... there are some missteps you need to be careful to avoid along the way. After a lot of experiments and retries in our lab environment, we came up with a few guidelines.

Adding/replacing a master node

1) Recreate apiserver certs manually to include the new master node in the cert SAN field.

For some reason, Kubespray will not update the apiserver certificate.

Edit /etc/kubernetes/kubeadm-config.yaml, include new host in certSANs list.
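
A quick way to sanity-check that edit before regenerating the cert (the -A window is arbitrary; adjust as needed):

# the new master's hostname and IP should show up in this list
grep -A 10 certSANs /etc/kubernetes/kubeadm-config.yaml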

Use kubeadm to recreate the certs.

cd /etc/kubernetes/ssl
mv apiserver.crt apiserver.crt.old
mv apiserver.key apiserver.key.old

cd /etc/kubernetes
kubeadm init phase certs apiserver --config kubeadm-config.yaml

Check the certificate; the new host needs to be there.

openssl x509 -text -noout -in /etc/kubernetes/ssl/apiserver.crt

2) Run cluster.yml

Add the new host to the inventory and run cluster.yml.
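
For example (the inventory path is illustrative, not from this thread):

ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml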

3) Restart kube-system/nginx-proxy

On all hosts, restart the nginx-proxy pod. This pod is a local proxy for the apiserver. Kubespray will update its static config, but the pod needs to be restarted to pick it up.

# run in every host
docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart

4) Remove old master nodes

If you are replacing a node, remove the old one from the inventory and remove it from the cluster runtime.

kubectl drain --force --ignore-daemonsets --grace-period 300 --timeout 360s --delete-local-data NODE_NAME

kubectl delete node NODE_NAME

After that, the old node can be safely shut down. Also, make sure to restart nginx-proxy on all remaining nodes (step 3).

From any active master that remains in the cluster, re-upload kubeadm-config.yaml

kubeadm config upload from-file --config /etc/kubernetes/kubeadm-config.yaml

Adding/replacing a worker node

This should be the easiest.

1) Add the new node to the inventory.

2) Run upgrade-cluster.yml

You can use --limit=node1 to limit Kubespray to avoid disturbing other nodes in the cluster.
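
For example (the inventory path and node name are placeholders):

ansible-playbook -i inventory/mycluster/hosts.ini upgrade-cluster.yml --limit=node1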

3) Drain the node that will be removed

kubectl drain --force=true --grace-period=10 --ignore-daemonsets=true --timeout=0s --delete-local-data NODE_NAME

4) Run the remove-node.yml playbook

With the old node still in the inventory, run remove-node.yml. You need to pass -e node=NODE_NAME to the playbook to limit the execution to the node being removed.
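
Roughly (the inventory path is a placeholder):

ansible-playbook -i inventory/mycluster/hosts.ini remove-node.yml -e node=NODE_NAME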

5) Remove the node from the inventory

That's it.

Adding/Replacing an etcd node

You need to make sure there is always an odd number of etcd nodes in the cluster, which means this is always either a replace or a scale-up operation: either add two new nodes or remove an old one.

1) Add the new node by running cluster.yml.

Update the inventory and run cluster.yml passing --limit=etcd,kube-master -e ignore_assert_errors=yes.

Run upgrade-cluster.yml also passing --limit=etcd,kube-master -e ignore_assert_errors=yes. This is necessary to update all etcd configuration in the cluster.
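
Spelled out, the two runs look something like this (the inventory path is a placeholder):

# join the new etcd member
ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml --limit=etcd,kube-master -e ignore_assert_errors=yes

# propagate the new etcd member to the rest of the cluster configuration
ansible-playbook -i inventory/mycluster/hosts.ini upgrade-cluster.yml --limit=etcd,kube-master -e ignore_assert_errors=yes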

At this point, you will have an even number of nodes. Everything should still be working, and you should only have problems if the cluster decides to elect a new etcd leader before you remove a node. Even so, running applications should continue to be available.

2) Remove an old etcd node

With the node still in the inventory, run remove-node.yml passing -e node=NODE_NAME as the name of the node that should be removed.

3) Make sure the remaining etcd members have their config updated

In each etcd host that remains in the cluster:

cat /etc/etcd.env | grep ETCD_INITIAL_CLUSTER

Only active etcd members should be in that list.

4) Remove old etcd members from the cluster runtime

Acquire a shell prompt into one of the etcd containers and use etcdctl to remove the old member.

# list all members
etcdctl member list 

# remove old member
etcdctl member remove MEMBER_ID

# careful!!! if you remove a wrong member you will be in trouble

# note: these command lines are actually much bigger, since you need to pass all certificates to etcdctl.
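
For reference, a fuller sketch of those calls, assuming the etcd v3 API and the certificate locations Kubespray normally uses (your endpoint, file names and member ID will differ):

export ETCDCTL_API=3

# list all members (the same --cacert/--cert/--key flags apply to "member remove")
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/member-<hostname>.pem \
  --key=/etc/ssl/etcd/ssl/member-<hostname>-key.pem \
  member list

# then, with the same flags:
#   etcdctl ... member remove MEMBER_ID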

5) Make sure the apiserver config is correctly updated.

In every master node, edit /etc/kubernetes/manifests/kube-apiserver.yaml. Make sure only active etcd nodes are still present in the apiserver command line parameter --etcd-servers=....
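
A quick check (the manifest path is the standard static-pod location used in this thread):

# every URL listed here should point to a live etcd member
grep etcd-servers /etc/kubernetes/manifests/kube-apiserver.yaml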

6) Shut down the old instance

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 10, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


yujunz commented Nov 26, 2019

@qvicksilver (Contributor)

@yujunz Not sure, haven't really tried that use case. Also I'm a bit unsure of the state of that playbook. Haven't had time to add it to CI. But please do try.


holmesb commented Jan 21, 2020

The procedure to add/remove masters belongs in the readme, not hidden away in a comment in this issue.


floryut commented Apr 10, 2020

The procedure to add/remove masters belongs in the readme, not hidden away in a comment in this issue.

To be sure everybody sees this: this was handled in PR #5570, and you can now find it here: https://kubespray.io/#/docs/nodes


maxisam commented Jul 3, 2020

docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart

I think this line doesn't work anymore; there is no k8s_nginx-proxy_nginx-proxy pod.

@olegsidokhmetov

[quotes juliohm1978's add/replace-node guide from above in full]

Hello!
I have an issue with these commands:

quersys@node1:/etc/kubernetes$ sudo kubeadm init phase certs apiserver --config kubeadm-config.yaml
W0810 11:08:48.479307 31818 utils.go:26] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
W0810 11:08:48.479525 31818 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[certs] Using existing apiserver certificate and key on disk

@juliohm1978 (Contributor)

What version of K8s are you using? It's been almost a year since I posted. Did something change in kubeadm since then?

I would start by searching for official instructions on how to renew and recreate certs.

https://kubernetes.io/docs/tasks/administer-cluster/


olegsidokhmetov commented Aug 10, 2020

What version of K8s are you using? It's been almost a year since I posted. Did something change in kubeadm since then?

I would start by searching for official instructions on how to renew and recreate certs.

https://kubernetes.io/docs/tasks/administer-cluster/

quersys@node1:/etc/kubernetes$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

I tried to find information for about 6 hours :(

@juliohm1978 (Contributor)

W0810 11:08:48.479307 31818 utils.go:26] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
W0810 11:08:48.479525 31818 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]

Those look like warnings. Most people seem to ignore them. Are you sure no error messages appear as well? Does it hang and never return? If that's the case, I'd wait for a timeout to hopefully get some actual error messages.


olegsidokhmetov commented Aug 10, 2020 via email

@juliohm1978 (Contributor)

Sounds like a connectivity problem or something that leads to it. If you can provide any further logs and relevant messages, it would be helpful.


olegsidokhmetov commented Aug 11, 2020

Thanks!!!

I have my new node IP in apiserver.crt:

DNS:node1, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost, DNS:node1, DNS:node3, DNS:lb-apiserver.kubernetes.local, DNS:node1.cluster.local, DNS:node3.cluster.local, IP Address:10.233.0.1, IP Address:172.26.1.225, IP Address:172.26.1.225, IP Address:10.233.0.1, IP Address:127.0.0.1, IP Address:172.26.1.225, IP Address:172.26.1.130

but when I run ansible-playbook -i inventory/quersyscluster/hosts.yml cluster.yml, I get a connection "timeout" problem

@juliohm1978 (Contributor)

Please post relevant log messages for more context. At this level, "connection timeout" is a broad error message.


dagorka commented Oct 29, 2020

Hi,
I am interested in replacing the first master (and the others) in a Kubernetes cluster using the Kubespray scripts. Is it possible?

Story:
I have built a k8s cluster using the Kubespray scripts on OpenStack with an old CentOS 7 image. Next I want to upgrade the OS, e.g. from 7.7 to 7.8. I have a newer OS image prepared on OpenStack. I am able to deploy new masters and new workers with the newer OS image, but there is a problem with the first master: I need to delete the whole VM and bring up a new one with the new OS. Did you have a similar problem?

I tried to force master2 to be the first one, but when I run the join task on a new master (e.g. master4), it looks like kubeadm still wants to connect to master1 (6.0.1.57):

kubeadm join --config kubeadm-controlplane.yaml --ignore-preflight-errors=all
W1028 12:37:17.050916    1666 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
        [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
        [WARNING FileExisting-ebtables]: ebtables not found in system path
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get https://6.0.1.57:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s: dial tcp 6.0.1.57:6443: connect: no route to host
To see the stack trace of this error execute with --v=5 or higher

Present, e.g.:
master1, centos7.7
master2, centos7.7
master3, centos7.7

worker1, centos7.7
worker2, centos7.7
worker3, centos7.7

Expected:
master2, centos7.8 - master2 becomes the first one
master3, centos7.8
master4, centos7.8

worker1, centos7.8
worker2, centos7.8
worker3, centos7.8

How did you manage to recreate the first master?

@juliohm1978, maybe you can help?

Thanks!
