
Document node adding/removing/restoring #1122

Closed
rutsky opened this issue Mar 6, 2017 · 28 comments
Comments

@rutsky
Contributor

rutsky commented Mar 6, 2017

How can the following operations be done with Kargo?

  1. Add a new (non-master) node to the cluster.
  2. Remove a (non-master) node from the cluster.
  3. Restore/recreate a node in case of its complete failure (if removing it and adding it again with the same role is not enough).
@mattymo
Contributor

mattymo commented Mar 6, 2017

This can be turned into some work items, but in words, the answers are (a command sketch follows):
1 - A new node can be added by adding it to the inventory and to the kube-node group. You can deploy this node with --limit nodename to skip redeployment of all other nodes. It will get certificates generated for etcd and k8s via delegation.
2 - Remove: just turn it off. If you want to clean it up from the Kubernetes API, you can run kubectl delete node nodename.
3 - Recreate just means updating the inventory to reflect any IP changes. Its old certs will be transferred to the recreated host. The node will enter the Ready state without any manual steps.
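
A minimal command sketch of items 1 and 2 (the inventory path and node name are placeholders, not values from this thread):

# 1 - add the new node to the inventory and the kube-node group, then deploy only that node
ansible-playbook -i inventory/inventory.cfg cluster.yml --limit nodename

# 2 - after shutting the node down, remove it from the Kubernetes API
kubectl delete node nodename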

@ant31
Contributor

ant31 commented Mar 11, 2017

@rutsky would you mind adding a doc describing those actions once you're comfortable with them?

@rutsky
Contributor Author

rutsky commented Mar 13, 2017

@ant31 I can write docs once I've done these steps in practice.
I can't guarantee I'll do this soon, so if anybody else volunteers I'd be happy to review.

@mattymo regarding removal, I would suggest draining the node before shutting down the machine.
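
For example, a minimal drain before shutdown (a sketch; the node name is a placeholder, and --ignore-daemonsets is usually needed because DaemonSet pods cannot be evicted):

kubectl drain nodename --ignore-daemonsets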

@hellwen

hellwen commented Mar 31, 2017

@mattymo I added a node using --limit nodename, but I have two problems.

  1. The new node's name is not configured in the old master's /etc/hosts file:
kubectl logs push-manage-4272813693-nrm7r
Error from server: Get https://node3:10250/containerLogs/prod/push-manage-4272813693-nrm7r/push-manage:  dial tcp: lookup node3 on 10.233.0.2:53: no such host
  2. Pods on the new node cannot access hosts outside of the cluster:
kubectl exec busybox -- ping 192.168.11.207

no response
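
One workaround for the first problem, based on later comments in this thread, is to refresh facts for all hosts and re-run the playbook without --limit so /etc/hosts is regenerated on the existing nodes as well (a sketch; the inventory path is a placeholder):

ansible -i inventory/inventory.cfg all -m setup
ansible-playbook -i inventory/inventory.cfg cluster.yml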

@hellwen

hellwen commented Mar 31, 2017

The second issue I fixed with #1137.

@foxyriver
Contributor

@hellwen when adding a node using --limit nodename, do we need to cache facts about all existing nodes (e.g. in ansible.cfg)? When I add a node using --limit, I get errors saying a parameter was undefined.
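
If fact caching is what's needed, a minimal ansible.cfg sketch would look like the following (the cache path and timeout are arbitrary illustrative values, not Kubespray defaults):

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 7200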

@hbokh

hbokh commented Jun 2, 2017

Please be a little more specific on how to add a new node and IF that's still working.
This is how I ran it, without using kargo-cli, but with ansible-playbook:

$ ansible-playbook -u hbokh -b --become-user=root \
-i /Users/hbokh/.kargo/inventory/inventory.cfg /Users/hbokh/.kargo/cluster.yml \
-e ansible_python_interpreter=/opt/bin/python --limit linux004

I'm seeing this kind of error here too: #788 (comment)

@zouyee
Member

zouyee commented Jun 5, 2017

When a node is removed, don't we need to remove the configuration of the k8s agents, such as kubelet, kube-proxy, etc.?
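
A minimal cleanup sketch for a removed node (assumptions: a systemd-managed kubelet and default Kubespray paths; this is not an official procedure, so back up anything you may still need first):

systemctl stop kubelet && systemctl disable kubelet
# kube-proxy and other components deployed as static pods typically live under /etc/kubernetes
rm -rf /etc/kubernetes /var/lib/kubelet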

@dabealu

dabealu commented Jun 11, 2017

I have a related question: is it possible to add a master/etcd node to an existing cluster via kargo?
i.e. I have an initial cluster:

[kube-master]
node1

[etcd]
node1

and then want to scale master/etcd components:

[kube-master]
node1
node2

[etcd] # two etcd nodes just for example
node1
node2

I've had no success adding a new master node to the inventory and running the cluster.yml or scale.yml playbook.
It's crucial on production clusters: when one of the master/etcd nodes fails, you need to replace it.
It would be great to have some docs about such scenarios, even if manual steps are required (for example, to back up etcd and restore it on a new node).
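
For the backup part mentioned above, a minimal etcd v2-style backup sketch (a sketch only; /var/lib/etcd is the data dir used elsewhere in this thread, and the backup destination is an arbitrary placeholder, assuming etcdctl is available on the etcd host):

etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/lib/etcd-backup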

@dabealu

dabealu commented Jun 12, 2017

I'm going to answer my own question.
Example initial hosts:

[kube-master]
node1

[etcd]
node1

[kube-node]
node1

then we add one more master node2:

[kube-master]
node1
node2

[etcd]
node1
node2

[kube-node]
node1
node2

Run the playbook, limiting it to node2 only.
It will fail at the etcd : Backup etcd v2 data step; that's OK.

ansible-playbook -v -i hosts -l node2 cluster.yml

at node2:

rm -rf /var/lib/etcd/*
systemctl restart etcd

check etcd cluster health:

docker exec -ti etcd2 etcdctl \
  --cert-file /etc/ssl/etcd/ssl/member-node2.pem \
  --key-file /etc/ssl/etcd/ssl/member-node2-key.pem --ca-file /etc/ssl/etcd/ssl/ca.pem \
  --endpoints https://127.0.0.1:2379 cluster-health

run playbook again to finish node2 setup:

ansible-playbook -v -i hosts cluster.yml

check that node2 successfully added to cluster:

kubectl get node

Additionally, here's a short how-to for deleting one of the master nodes,
i.e. we want to delete node1:

# remove node1 from the Kubernetes API (drain it first if it still runs workloads)
kubectl delete node node1

# get the etcd member ID that we want to remove
# we can do this from another node in case node1 is down
docker exec -ti etcd1 etcdctl \
  --cert-file /etc/ssl/etcd/ssl/member-node1.pem \
  --key-file /etc/ssl/etcd/ssl/member-node1-key.pem --ca-file /etc/ssl/etcd/ssl/ca.pem \
  --endpoints https://127.0.0.1:2379 member list
  
# WARNING: preserve etcd quorum!
# i.e. if one of only two nodes is removed,
# you end up with a broken cluster

# remove etcd node
docker exec -ti etcd1 etcdctl \
  --cert-file /etc/ssl/etcd/ssl/member-node1.pem \
  --key-file /etc/ssl/etcd/ssl/member-node1-key.pem --ca-file /etc/ssl/etcd/ssl/ca.pem \
  --endpoints https://127.0.0.1:2379 member remove 1f1045449f5a28cb

# remove node1 from inventory file

# after running cluster playbook (this step isn't required)
# etcd nodes will be renamed
ansible-playbook -v -i hosts cluster.yml

I didn't find docs on these operations, so I decided to share this.

@aponomarenko

There is a blocker issue when adding a new node to a cluster with the latest ansible-playbook --limit option, see #1330 (comment). Can anybody help?

@shadycuz

shadycuz commented Aug 5, 2017

@mattymo Does not work for me =/

ansible-playbook -i inventory/inventory.cfg cluster.yml -u root --limit node02
...
TASK [kubernetes/preinstall : set_fact] ***************************************************************************************************************
Saturday 05 August 2017  09:38:03 -0400 (0:00:00.060)       0:00:17.940 *******
fatal: [node02]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: 'dict object' has no attribute 'ansible_default_ipv4'\n\nThe error appears to have been in '/home/ansible/.kubespray/roles/kubernetes/preinstall/tasks/set_facts.yml': line 14, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- set_fact:\n  ^ here\n"}
        to retry, use: --limit @/home/ansible/.kubespray/cluster.retry

PLAY RECAP ********************************************************************************************************************************************
node02                     : ok=15   changed=2    unreachable=0    failed=1

@quraisyah

quraisyah commented Sep 26, 2017

@dabealu that's great documentation for adding nodes. It worked for me. Thank you :)

The current hosts:

node1
node2

[etcd]
node1
node2

[kube-node]
node1
node2

Then I add a new node, 'node3':

node1
node2
node3

[etcd]
node1
node2
node3

[kube-node]
node1
node2

It was successfully added to the current cluster:

ansible-playbook -v -i inventory/inventory.cfg -l node3 cluster.yml
...
TASK [kubernetes/preinstall : run xfs_growfs] *******************************************************
Tuesday 26 September 2017  10:36:46 +0800 (0:00:00.013)       0:48:14.511 ***** 

PLAY [kube-master[0]] *******************************************************************************
skipping: no hosts matched

PLAY RECAP ******************************************************************************************
node3                      : ok=404  changed=113  unreachable=0    failed=0   

@lebenitza

Thanks @dabealu for the info.
Usually if I run into ansible errors I try to also run:

ansible -i <your-inventory>/inventory.cfg all -m setup --user root

before running

ansible-playbook -i <your-inventory>/inventory.cfg cluster.yml -b -v --private-key=~/.ssh/id_rsa --user root

again. And usually it fixes my errors.

Keep in mind that when you use --limit the /etc/hosts files on the other cluster nodes are not updated, hence errors like this will appear:

kubectl logs push-manage-4272813693-nrm7r
Error from server: Get https://node3:10250/containerLogs/prod/push-manage-4272813693-nrm7r/push-manage:  dial tcp: lookup node3 on 10.233.0.2:53: no such host

as per @hellwen's example.

@tetramin

I was able to add etcd to the existing cluster on an existing node.

For this, I added a node to the etcd group in inventory.cfg.

Current:

node1
node2
node3

[etcd]
node1
node2

[kube-node]
node1
node2
node3

Added node3 to group etcd:

node1
node2
node3

[etcd]
node1
node2
node3

[kube-node]
node1
node2
node3

Then I deleted the certificate files and keys from the first node (for the task: "Check_certs | Set 'gen_certs' to true"):

rm $etcd_cert_dir/node-*

Then I ran the playbook and it completed successfully.
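
For reference, a minimal sketch of that removal, assuming the default Kubespray etcd cert directory /etc/ssl/etcd/ssl (the same path used in the etcdctl commands earlier in this thread):

rm /etc/ssl/etcd/ssl/node-*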

@clkao
Contributor

clkao commented Mar 19, 2018

Thanks to @tetramin for the tip. It appears the gen_cert should be checking for member-* in addition to node-*, for adding existing node to etcd group to work properly.

@chestack

chestack commented Jun 7, 2018

Two problems with adding a master:

  1. Failed to run kubectl commands on the new master, with error:

Unable to connect to the server: x509: certificate is valid for "new master ip"

workaround:

rm -rf /etc/kubernetes/ssl on node-1 to re-generate certificates including the "new master ip"

  2. kubelet on the old master/node failed to talk to the apiserver, with error:

x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-ca")

workaround:

restart kubelet to reload the newly generated certificate files (a condensed sketch of both workarounds follows)
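
A condensed sketch of both workarounds (paths assume a default Kubespray layout; back up /etc/kubernetes/ssl before deleting it, and re-run cluster.yml between the two steps):

# on node-1 (the first master): force cert regeneration so the new master IP is included
rm -rf /etc/kubernetes/ssl
# after re-running cluster.yml, on the old masters/nodes: reload the regenerated certs
systemctl restart kubelet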

@ant31 ant31 added this to the 2.7 milestone Aug 15, 2018
@Atoms Atoms added the lifecycle/stale label Sep 21, 2018
@woopstar woopstar removed this from the 2.7 milestone Sep 28, 2018
@MatthiasLohr

Thanks to @tetramin for the tip. It appears the gen_cert should be checking for member-* in addition to node-*, for adding existing node to etcd group to work properly.

I think this should also cover admin-* (for being able to add master nodes).

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale, lifecycle/rotten labels Apr 11, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jul 10, 2019
@vishveshg

Thanks to @tetramin for the tip. It appears the gen_cert should be checking for member-* in addition to node-*, for adding existing node to etcd group to work properly.

I think this should also cover admin-* (for being able to add master nodes).

Repurposing an existing node to a different role (worker to etcd, or etcd to master, etc.) with Kubespray is very painful. The best approach would be to scale the cluster down with the remove-node.yml playbook, delete the node-* cert of that particular node from the /etc/ssl/etcd/ssl dir of etcd[0], and re-run the cluster.yml playbook to convert an old worker node to an etcd node (rough sketch below).
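
A rough sketch of that sequence (assumptions: remove-node.yml takes the node to remove via -e node=..., node3 and the inventory path are placeholders, and the cert filenames follow the node-<hostname>*.pem pattern seen earlier in this thread):

ansible-playbook -i inventory/inventory.cfg remove-node.yml -e node=node3
# on etcd[0]: drop the stale certs so check_certs regenerates them
rm /etc/ssl/etcd/ssl/node-node3*.pem
ansible-playbook -i inventory/inventory.cfg cluster.yml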

@smimenon

smimenon commented Aug 7, 2019

I tried to scale the master and etcd nodes first. I then tried to scale just the master nodes, and I didn't have any success in either case. I ran cluster.yml after adding the new master nodes and it fails at this task:
TASK [kubernetes/master : kubeadm | Init other uninitialized masters] **********

In the kube-apiserver logs for the new master nodes, there is the following error:
I0806 23:00:10.238113 1 log.go:172] http: TLS handshake error from :43200: remote error: tls: bad certificate

Is there anything specific that needs to be done to scale master nodes, other than updating the inventory and running cluster.yml?

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Sep 6, 2019
@Ankur7890

Is it feasible to add a worker node without Kubespray, by manually installing the kubelet, kube-proxy, docker, and flannel services on the node?

Please suggest.

@vishveshg

Is it feasible to add a worker node without Kubespray, by manually installing the kubelet, kube-proxy, docker, and flannel services on the node?

Please suggest.

Are you looking to run kubeadm manually? Refer to the kubeadm documentation for details.

@Ankur7890

No

Is it feasible to add a worker node without Kubespray, by manually installing the kubelet, kube-proxy, docker, and flannel services on the node?
Please suggest.

Are you looking to run kubeadm manually? Refer to the kubeadm documentation for details.

No, not kubeadm; that can easily be done by joining nodes. I'm interested in a cluster with multiple etcd/apiserver nodes: can we add a node there manually without using Kubespray?

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
