
Document node adding/removing/restoring #1122

Closed
rutsky opened this issue Mar 6, 2017 · 28 comments
Comments

@rutsky
Contributor

rutsky commented Mar 6, 2017

How can the following operations be done with Kargo?

  1. Add a new (non-master) node to the cluster.
  2. Remove a (non-master) node from the cluster.
  3. Restore/recreate a node in case of its complete failure (if removing it and adding it again with the same role is not enough).
@mattymo
Contributor

mattymo commented Mar 6, 2017

This can be turned into some work items, but in words, the answers are (a command sketch follows):
1 - A new node can be added by adding it to the inventory and to the kube-node group. You can deploy this node with --limit nodename to skip redeployment of all other nodes. It will get certificates generated for etcd and k8s via delegation.
2 - Remove: just turn it off. If you want to clean it up from the Kubernetes API, you can run kubectl delete node nodename.
3 - Recreate just means updating the inventory to reflect any IP changes. Its old certs will be transferred to the recreated host. The node will enter the Ready state without any manual steps.
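
A minimal command sketch of items 1 and 2 (the inventory path and node name are placeholders, not values from this thread):

# 1 - add the new node to the inventory and the kube-node group, then deploy only that node
ansible-playbook -i inventory/inventory.cfg cluster.yml --limit nodename

# 2 - after shutting the node down, remove it from the Kubernetes API
kubectl delete node nodename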

@ant31
Contributor

ant31 commented Mar 11, 2017

@rutsky would you mind adding a doc describing those actions once you're comfortable with them?

@rutsky
Contributor Author

rutsky commented Mar 13, 2017

@ant31 I can write docs once I've done these steps in practice.
I can't guarantee I'll do this soon, so if anybody else volunteers I'd be happy to review.

@mattymo regarding removal, I would suggest draining the node before shutting down the machine.
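
For example, a minimal drain before shutdown (a sketch; the node name is a placeholder, and --ignore-daemonsets is usually needed because DaemonSet pods cannot be evicted):

kubectl drain nodename --ignore-daemonsets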

@hellwen

hellwen commented Mar 31, 2017

@mattymo I added a node using --limit nodename, but I have two problems.

  1. The new node's name is not configured in the old master's /etc/hosts file:
kubectl logs push-manage-4272813693-nrm7r
Error from server: Get https://node3:10250/containerLogs/prod/push-manage-4272813693-nrm7r/push-manage:  dial tcp: lookup node3 on 10.233.0.2:53: no such host
  2. Pods on the new node cannot access hosts outside of the cluster:
kubectl exec busybox -- ping 192.168.11.207

no response
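
One workaround for the first problem, based on later comments in this thread, is to refresh facts for all hosts and re-run the playbook without --limit so /etc/hosts is regenerated on the existing nodes as well (a sketch; the inventory path is a placeholder):

ansible -i inventory/inventory.cfg all -m setup
ansible-playbook -i inventory/inventory.cfg cluster.yml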

@hellwen

hellwen commented Mar 31, 2017

The second issue I fixed with #1137.

@foxyriver
Contributor

@hellwen when adding a node using --limit nodename, do we need to cache facts about all existing nodes (e.g. in ansible.cfg)? When I add a node using --limit, I get errors saying a parameter was undefined.
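
If fact caching is what's needed, a minimal ansible.cfg sketch would look like the following (the cache path and timeout are arbitrary illustrative values, not Kubespray defaults):

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 7200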

@hbokh

hbokh commented Jun 2, 2017

Please be a little more specific on how to add a new node and IF that's still working.
This is how I ran it, without using kargo-cli, but with ansible-playbook:

$ ansible-playbook -u hbokh -b --become-user=root \
-i /Users/hbokh/.kargo/inventory/inventory.cfg /Users/hbokh/.kargo/cluster.yml \
-e ansible_python_interpreter=/opt/bin/python --limit linux004

I'm seeing this kind of error here too: #788 (comment)

@zouyee
Member

zouyee commented Jun 5, 2017

When a node is removed, don't we need to remove the configuration of the k8s agents, such as kubelet, kube-proxy, etc.?
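
A minimal cleanup sketch for a removed node (assumptions: a systemd-managed kubelet and default Kubespray paths; this is not an official procedure, so back up anything you may still need first):

systemctl stop kubelet && systemctl disable kubelet
# kube-proxy and other components deployed as static pods typically live under /etc/kubernetes
rm -rf /etc/kubernetes /var/lib/kubelet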

@dabealu

dabealu commented Jun 11, 2017

I have a related question: is it possible to add a master/etcd node to an existing cluster via kargo?
i.e. I have an initial cluster:

[kube-master]
node1

[etcd]
node1

and then want to scale master/etcd components:

[kube-master]
node1
node2

[etcd] # two etcd nodes just for example
node1
node2

I've had no success adding a new master node to the inventory and running the cluster.yml or scale.yml playbook.
It's crucial on production clusters: when one of the master/etcd nodes fails, you need to replace it.
It would be great to have some docs about such scenarios, even if manual steps are required (for example, to back up etcd and restore it on a new node).
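
For the backup part mentioned above, a minimal etcd v2-style backup sketch (a sketch only; /var/lib/etcd is the data dir used elsewhere in this thread, and the backup destination is an arbitrary placeholder, assuming etcdctl is available on the etcd host):

etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/lib/etcd-backup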

@dabealu

dabealu commented Jun 12, 2017

I'm going to answer my own question.
Example initial hosts:

[kube-master]
node1

[etcd]
node1

[kube-node]
node1

then we add one more master node2:

[kube-master]
node1
node2

[etcd]
node1
node2

[kube-node]
node1
node2

Run the playbook, limiting it to node2 only.
It will fail at the etcd : Backup etcd v2 data step; that's OK.

ansible-playbook -v -i hosts -l node2 cluster.yml

at node2:

rm -rf /var/lib/etcd/*
systemctl restart etcd

check etcd cluster health:

docker exec -ti etcd2 etcdctl \
  --cert-file /etc/ssl/etcd/ssl/member-node2.pem \
  --key-file /etc/ssl/etcd/ssl/member-node2-key.pem --ca-file /etc/ssl/etcd/ssl/ca.pem \
  --endpoints https://127.0.0.1:2379 cluster-health

run playbook again to finish node2 setup:

ansible-playbook -v -i hosts cluster.yml

check that node2 successfully added to cluster:

kubectl get node

Additionally, here's a short how-to for deleting one of the master nodes,
i.e. we want to delete node1:

# remove node1 from the Kubernetes API (drain it first if it still runs workloads)
kubectl delete node node1

# get the etcd member ID that we want to remove
# we can do this from another node in case node1 is down
docker exec -ti etcd1 etcdctl \
  --cert-file /etc/ssl/etcd/ssl/member-node1.pem \
  --key-file /etc/ssl/etcd/ssl/member-node1-key.pem --ca-file /etc/ssl/etcd/ssl/ca.pem \
  --endpoints https://127.0.0.1:2379 member list
  
# WARNING: preserve etcd quorum!
# i.e. if one of only two nodes is removed,
# you end up with a broken cluster

# remove etcd node
docker exec -ti etcd1 etcdctl \
  --cert-file /etc/ssl/etcd/ssl/member-node1.pem \
  --key-file /etc/ssl/etcd/ssl/member-node1-key.pem --ca-file /etc/ssl/etcd/ssl/ca.pem \
  --endpoints https://127.0.0.1:2379 member remove 1f1045449f5a28cb

# remove node1 from inventory file

# after running cluster playbook (this step isn't required)
# etcd nodes will be renamed
ansible-playbook -v -i hosts cluster.yml

I didn't find docs on these operations, so I decided to share this.

@aponomarenko

There is a blocker issue when adding a new node to a cluster with the latest ansible-playbook --limit option, see #1330 (comment). Can anybody help?

@shadycuz

shadycuz commented Aug 5, 2017

@mattymo Does not work for me =/

ansible-playbook -i inventory/inventory.cfg cluster.yml -u root --limit node02
...
TASK [kubernetes/preinstall : set_fact] ***************************************************************************************************************
Saturday 05 August 2017  09:38:03 -0400 (0:00:00.060)       0:00:17.940 *******
fatal: [node02]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: 'dict object' has no attribute 'ansible_default_ipv4'\n\nThe error appears to have been in '/home/ansible/.kubespray/roles/kubernetes/preinstall/tasks/set_facts.yml': line 14, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- set_fact:\n  ^ here\n"}
        to retry, use: --limit @/home/ansible/.kubespray/cluster.retry

PLAY RECAP ********************************************************************************************************************************************
node02                     : ok=15   changed=2    unreachable=0    failed=1

@quraisyah

quraisyah commented Sep 26, 2017

@dabealu that's great documentation for adding nodes. It worked for me. Thank you :)

The current hosts:

node1
node2

[etcd]
node1
node2

[kube-node]
node1
node2

Then I add a new node, 'node3':

node1
node2
node3

[etcd]
node1
node2
node3

[kube-node]
node1
node2

It was successfully added to the current cluster:

ansible-playbook -v -i inventory/inventory.cfg -l node3 cluster.yml
...
TASK [kubernetes/preinstall : run xfs_growfs] *******************************************************
Tuesday 26 September 2017  10:36:46 +0800 (0:00:00.013)       0:48:14.511 ***** 

PLAY [kube-master[0]] *******************************************************************************
skipping: no hosts matched

PLAY RECAP ******************************************************************************************
node3                      : ok=404  changed=113  unreachable=0    failed=0   

@lebenitza

Thanks @dabealu for the info.
Usually if I run into ansible errors I try to also run:

ansible -i <your-inventory>/inventory.cfg all -m setup --user root

before running

ansible-playbook -i <your-inventory>/inventory.cfg cluster.yml -b -v --private-key=~/.ssh/id_rsa --user root

again. And usually it fixes my errors.

Keep in mind that when you use --limit the /etc/hosts files on the other cluster nodes are not updated, hence errors like this will appear:

kubectl logs push-manage-4272813693-nrm7r
Error from server: Get https://node3:10250/containerLogs/prod/push-manage-4272813693-nrm7r/push-manage:  dial tcp: lookup node3 on 10.233.0.2:53: no such host

as per @hellwen's example.

@tetramin

I was able to add etcd to the existing cluster on an existing node.

For this, I added a node to the etcd group in inventory.cfg.

Current:

node1
node2
node3

[etcd]
node1
node2

[kube-node]
node1
node2
node3

Added node3 to group etcd:

node1
node2
node3

[etcd]
node1
node2
node3

[kube-node]
node1
node2
node3

Then I deleted the certificate files and keys from the first node (for the task: "Check_certs | Set 'gen_certs' to true"):

rm $etcd_cert_dir/node-*

Then I ran the playbook and it completed successfully.
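
For reference, a minimal sketch of that removal, assuming the default Kubespray etcd cert directory /etc/ssl/etcd/ssl (the same path used in the etcdctl commands earlier in this thread):

rm /etc/ssl/etcd/ssl/node-*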

@clkao
Contributor

clkao commented Mar 19, 2018

Thanks to @tetramin for the tip. It appears the gen_cert should be checking for member-* in addition to node-*, for adding existing node to etcd group to work properly.

@chestack

chestack commented Jun 7, 2018

Two problems with adding a master:

  1. Failed to run kubectl commands on the new master, with error:

Unable to connect to the server: x509: certificate is valid for "new master ip"

workaround:

rm -rf /etc/kubernetes/ssl on node-1 to re-generate certificates including the "new master ip"

  2. kubelet on the old master/node failed to talk to the apiserver, with error:

x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-ca")

workaround:

restart kubelet to reload the newly generated certificate files (a condensed sketch of both workarounds follows)
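
A condensed sketch of both workarounds (paths assume a default Kubespray layout; back up /etc/kubernetes/ssl before deleting it, and re-run cluster.yml between the two steps):

# on node-1 (the first master): force cert regeneration so the new master IP is included
rm -rf /etc/kubernetes/ssl
# after re-running cluster.yml, on the old masters/nodes: reload the regenerated certs
systemctl restart kubelet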

@ant31 ant31 added this to the 2.7 milestone Aug 15, 2018
@Atoms Atoms added the lifecycle/stale label Sep 21, 2018
@woopstar woopstar removed this from the 2.7 milestone Sep 28, 2018
@MatthiasLohr

Thanks to @tetramin for the tip. It appears the gen_cert should be checking for member-* in addition to node-*, for adding existing node to etcd group to work properly.

I think this should also cover admin-* (for being able to add master nodes).

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale, lifecycle/rotten labels Apr 11, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jul 10, 2019
@vishveshg

Thanks to @tetramin for the tip. It appears the gen_cert should be checking for member-* in addition to node-*, for adding existing node to etcd group to work properly.

I think this should also cover admin-* (for being able to add master nodes).

Repurposing an existing node to a different role (worker to etcd, or etcd to master, etc.) with Kubespray is very painful. The best approach would be to scale the cluster down with the remove-node.yml playbook, delete the node-* cert of that particular node from the /etc/ssl/etcd/ssl dir of etcd[0], and re-run the cluster.yml playbook to convert an old worker node to an etcd node (rough sketch below).
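
A rough sketch of that sequence (assumptions: remove-node.yml takes the node to remove via -e node=..., node3 and the inventory path are placeholders, and the cert filenames follow the node-<hostname>*.pem pattern seen earlier in this thread):

ansible-playbook -i inventory/inventory.cfg remove-node.yml -e node=node3
# on etcd[0]: drop the stale certs so check_certs regenerates them
rm /etc/ssl/etcd/ssl/node-node3*.pem
ansible-playbook -i inventory/inventory.cfg cluster.yml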

@smimenon

smimenon commented Aug 7, 2019

I tried to scale the master and etcd nodes first. I then tried to scale just the master nodes, and I didn't have any success in either case. I ran cluster.yml after adding the new master nodes and it fails at this task:
TASK [kubernetes/master : kubeadm | Init other uninitialized masters] **********

In the kube-apiserver logs for the new master nodes, there is the following error:
I0806 23:00:10.238113 1 log.go:172] http: TLS handshake error from :43200: remote error: tls: bad certificate

Is there anything specific that needs to be done to scale master nodes, other than updating the inventory and running cluster.yml?

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Sep 6, 2019
@Ankur7890

Is it feasible to add a worker node without Kubespray, by manually installing the kubelet, kube-proxy, docker, and flannel services on the node?

Please suggest.

@vishveshg

Is it feasible to add a worker node without Kubespray, by manually installing the kubelet, kube-proxy, docker, and flannel services on the node?

Please suggest.

Are you looking to run kubeadm manually? Refer to the kubeadm documentation for details.

@Ankur7890

No

Is it feasible to add a worker node without Kubespray, by manually installing the kubelet, kube-proxy, docker, and flannel services on the node?
Please suggest.

Are you looking to run kubeadm manually? Refer to the kubeadm documentation for details.

No, not kubeadm; that can easily be done by joining nodes. I'm interested in a cluster with multiple etcd/apiserver nodes: can we add a node there manually without using Kubespray?

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
