
Problem when changing control_plane node (upgrade from Debian bullseye to bookworm) #10560

Closed
ccaillet1974 opened this issue Oct 26, 2023 · 18 comments
Labels
kind/bug, lifecycle/rotten

Comments

@ccaillet1974

ccaillet1974 commented Oct 26, 2023

Environment:

  • Bare-metal cluster; upgrading a member from bullseye to bookworm. The following versions are from the kubespray deployment machine.

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 5.10.0-23-amd64 x86_64
    PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
    NAME="Debian GNU/Linux"
    VERSION_ID="11"
    VERSION="11 (bullseye)"
    VERSION_CODENAME=bullseye
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"

  • Version of Ansible (ansible --version):
    ansible [core 2.14.11]
    config file = /homes/totof/myk8s/kubespray-master/ansible.cfg
    configured module search path = ['/homes/totof/myk8s/kubespray-master/library']
    ansible python module location = /usr/local/lib/python3.9/dist-packages/ansible
    ansible collection location = /homes/totof/.ansible/collections:/usr/share/ansible/collections
    executable location = /usr/local/bin/ansible
    python version = 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (/usr/bin/python3)
    jinja version = 3.1.2
    libyaml = True

  • Version of Python (python --version):
    Python 3.9.2

Kubespray version (commit) (git rev-parse --short HEAD):
7dcc22f

Network plugin used:
cilium

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
inventory-variables.txt

Command used to invoke ansible:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K --limit=kube_control_plane cluster.yml

Output of ansible run:

TASK [etcd : Configure | Ensure etcd is running] ***********************************************************************************************************************************************************************************************
ok: [lyo0-k8s-testm02]
ok: [lyo0-k8s-testm01]
fatal: [lyo0-k8s-testm00]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because a timeout was exceeded.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
Thursday 26 October 2023  10:40:33 +0200 (0:01:30.885)       0:09:46.376 ******
Thursday 26 October 2023  10:40:33 +0200 (0:00:00.071)       0:09:46.448 ******
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (1 retries left).

TASK [etcd : Configure | Wait for etcd cluster to be healthy] **********************************************************************************************************************************************************************************
fatal: [lyo0-k8s-testm01]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.042724", "end": "2023-10-26 10:41:24.339102", "msg": "non-zero return code", "rc": 1, "start": "2023-10-26 10:41:19.296378", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-10-26T10:41:24.329966+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000344fc0/10.141.10.65:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-10-26T10:41:24.329966+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000344fc0/10.141.10.65:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************
lyo0-k8s-testm00           : ok=496  changed=47   unreachable=0    failed=1    skipped=494  rescued=0    ignored=1
lyo0-k8s-testm01           : ok=508  changed=9    unreachable=0    failed=1    skipped=574  rescued=0    ignored=0
lyo0-k8s-testm02           : ok=484  changed=10   unreachable=0    failed=0    skipped=501  rescued=0    ignored=0

Thursday 26 October 2023  10:41:24 +0200 (0:00:51.257)       0:10:37.705 ******
===============================================================================
etcd : Configure | Ensure etcd is running ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 90.89s
etcd : Configure | Wait for etcd cluster to be healthy --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 51.26s
etcd : Gen_certs | Write etcd member/admin and kube_control_plane client certs to other etcd nodes ------------------------------------------------------------------------------------------------------------------------------------- 18.48s
download : Download_container | Download image if required ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.25s
etcd : Gen_certs | Write node certs to other etcd nodes -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 12.02s
container-engine/runc : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.76s
download : Download_file | Download item ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.66s
container-engine/containerd : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.66s
etcdctl_etcdutl : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.57s
container-engine/crictl : Download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.57s
container-engine/nerdctl : Download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.50s
container-engine/crictl : Extract_file | Unpacking archive ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.28s
etcd : Gen_certs | Gather etcd member/admin and kube_control_plane client certs from first etcd node ------------------------------------------------------------------------------------------------------------------------------------ 6.18s
etcdctl_etcdutl : Extract_file | Unpacking archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.08s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5.57s
container-engine/nerdctl : Extract_file | Unpacking archive ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.56s
container-engine/containerd : Containerd | Unpack containerd archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.54s
container-engine/validate-container-engine : Populate service facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.50s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.45s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5.15s

Description of the problem :
I ran the following command to remove one control_plane node, which was running bullseye (Debian 11):
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K remove-node.yml -e node=lyo0-k8s-testm00

I had previously changed the host order in the inventory file as described in docs/nodes.md. I then reinstalled the node with bookworm and tried to add it back to the cluster with:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K --limit=kube_control_plane cluster.yml
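
To summarize, the whole sequence was roughly the following (commands condensed from the ones above; inventory path and node name are the ones used on this cluster):

# 1. Reorder the hosts in the inventory as described in docs/nodes.md
# 2. Remove the node:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  remove-node.yml -e node=lyo0-k8s-testm00
# 3. Reinstall the node with Debian bookworm
# 4. Add it back:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  --limit=kube_control_plane cluster.yml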

Here are the etcd log entries on the new control_plane node:

2023-10-26T11:03:32.800850+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.800034+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b is starting a new election at term 1"}
2023-10-26T11:03:32.801193+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801124+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b became pre-candidate at term 1"}
2023-10-26T11:03:32.801382+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.80132+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b received MsgPreVoteResp from 7d81e23d9d41da1b at term 1"}
2023-10-26T11:03:32.801584+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801524+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to c101cbbb43bf28a0 at term 1"}
2023-10-26T11:03:32.801780+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801721+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to eb484ede068d3a18 at term 1"}
2023-10-26T11:03:34.838611+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838023+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839025+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838013+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839093+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838049+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839234+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838979+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}

The command failed at the etcd stage (etcd deployed with the host method) with the message shown in the output above. I have already done this kind of operation, but only when changing hardware for control plane nodes, with the kubespray release-2.22 branch, and everything worked well.
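
(For reference, the health check run by that task can be reproduced by hand from one of the existing etcd nodes. A minimal sketch, assuming kubespray's usual certificate layout under /etc/ssl/etcd/ssl; the certificate file names and the endpoint IP are assumptions taken from this setup and should be adjusted:)

ETCDCTL_API=3 /usr/local/bin/etcdctl \
  --endpoints=https://10.141.10.65:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-lyo0-k8s-testm01.pem \
  --key=/etc/ssl/etcd/ssl/admin-lyo0-k8s-testm01-key.pem \
  endpoint status --cluster
# The same flags with "endpoint health --cluster" reproduce the health part of the check.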

Any assistance in resolving this etcd problem would be appreciated.

@ccaillet1974 added the kind/bug label on Oct 26, 2023
@ccaillet1974
Author

ADDENDUM:

Same problem when trying to add the node on bullseye. For testing, I reinstalled the node with bullseye and hit the same issue when trying to add it back to the control plane with the command described above.

@blackluck

Hello, it's not clear to me: are the master nodes also etcd nodes? If so, maybe you should also include the etcd group in the limit, e.g. '--limit=kube_control_plane,etcd'

@ccaillet1974
Author

ccaillet1974 commented Oct 26, 2023

OK ... I'll test the add with etcd included in the limit, but in my setup the control_plane nodes also run etcd ... I'll report back in about 20 minutes.

EDIT: Same error with etcd included in the limit parameter.

@blackluck

The docs also mention that for etcd nodes you need to set -e ignore_assert_errors=yes
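
For clarity, the full invocation with both suggestions applied would be something like this (a sketch based on the command from the issue description; adjust to your inventory):

ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  --limit=kube_control_plane,etcd -e ignore_assert_errors=yes cluster.yml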

@ccaillet1974
Author

ccaillet1974 commented Oct 26, 2023

Same result with -e ignore_assert_errors=yes

As I said earlier, it seems there is a problem with certificate generation, because I see "tls: bad certificate" in the logs on my new node.

EDIT: this process for replacing a control_plane node worked well with the release-2.22 branch ... maybe a regression?

EDIT 2:

Logs on the other nodes:

Oct 26 16:17:10 lyo0-k8s-testm02 etcd[1375044]: {"level":"warn","ts":"2023-10-26T16:17:10.13056+0200","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.141.10.64:53502","server-name":"","error":"tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}
Oct 26 16:19:08 lyo0-k8s-testm01 etcd[1374901]: {"level":"warn","ts":"2023-10-26T16:19:08.146019+0200","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.141.10.64:60964","server-name":"","error":"tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}

It seems there is a serious problem with certificate generation for the new node (IP 10.141.10.64) :'(
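
One way to confirm that would be to check whether the new node's member certificate chains to the same etcd CA as the one used by the running members. A minimal sketch (the /etc/ssl/etcd/ssl paths and the member-<hostname>.pem naming are assumptions based on kubespray's usual layout):

# On the new node (lyo0-k8s-testm00):
openssl verify -CAfile /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/member-lyo0-k8s-testm00.pem
# On every etcd node, the CA fingerprint should be identical:
openssl x509 -noout -fingerprint -sha256 -in /etc/ssl/etcd/ssl/ca.pem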

@blackluck

Is it possible that you ran it before changing the host order in the inventory as well? That could cause new certs to be generated if the playbook runs against an empty master (because that would be the first node).

@ccaillet1974
Author

ccaillet1974 commented Oct 27, 2023

Yes, I'll test it now.

EDIT: same result when the new node is in first place

TASK [etcd : Configure | Wait for etcd cluster to be healthy] **********************************************************************************************************************************************************************************
fatal: [lyo0-k8s-testm00]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.033871", "end": "2023-10-27 08:56:01.179553", "msg": "non-zero return code", "rc": 1, "start": "2023-10-27 08:55:56.145682", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-10-27T08:56:01.175361+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000388fc0/10.141.10.64:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection closed before server preface received\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-10-27T08:56:01.175361+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000388fc0/10.141.10.64:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection closed before server preface received\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************
lyo0-k8s-testm00           : ok=489  changed=13   unreachable=0    failed=1    skipped=593  rescued=0    ignored=0
lyo0-k8s-testm01           : ok=470  changed=12   unreachable=0    failed=0    skipped=515  rescued=0    ignored=0
lyo0-k8s-testm02           : ok=471  changed=13   unreachable=0    failed=0    skipped=514  rescued=0    ignored=0

Friday 27 October 2023  08:56:01 +0200 (0:00:42.656)       0:08:10.829 ********
===============================================================================
etcd : Configure | Wait for etcd cluster to be healthy --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 42.66s
etcd : Gen_certs | Write etcd member/admin and kube_control_plane client certs to other etcd nodes ------------------------------------------------------------------------------------------------------------------------------------- 16.84s
etcd : Gen_certs | Write node certs to other etcd nodes -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 11.85s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 7.45s
container-engine/containerd : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 7.37s
download : Download_file | Download item ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.90s
container-engine/runc : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.86s
container-engine/nerdctl : Download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.76s
etcdctl_etcdutl : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.73s
container-engine/crictl : Download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.72s
etcdctl_etcdutl : Extract_file | Unpacking archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.39s
container-engine/crictl : Extract_file | Unpacking archive ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.35s
container-engine/validate-container-engine : Populate service facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.90s
container-engine/nerdctl : Extract_file | Unpacking archive ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.73s
etcd : Gen_certs | Gather etcd member/admin and kube_control_plane client certs from first etcd node ------------------------------------------------------------------------------------------------------------------------------------ 5.61s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.40s
container-engine/containerd : Containerd | Unpack containerd archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.70s
container-engine/runc : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.61s
container-engine/containerd : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.49s
etcdctl_etcdutl : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.41s

Here is the syslog on the new node when trying to start etcd:

Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.032429+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b is starting a new election at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033375+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b became pre-candidate at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033601+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b received MsgPreVoteResp from 7d81e23d9d41da1b at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033838+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to c101cbbb43bf28a0 at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.034049+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to eb484ede068d3a18 at term 1"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.067894+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.069697+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.072727+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.074858+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.048835+0200","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"7d81e23d9d41da1b","local-member-attributes":"{Name:etcd1 ClientURLs:[https://10.141.10.64:2379]}","request-path":"/0/members/7d81e23d9d41da1b/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.073383+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.074327+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.074809+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.075396+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: start operation timed out. Terminating.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: Failed with result 'timeout'.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: Failed to start etcd.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: Consumed 18.670s CPU time.

@blackluck

Sorry, I wasn't asking you to do it this way.
I was asking whether you had already run it earlier, because if you try to add an empty new node while it is the first master, kubespray won't find certs on it and treats it somewhat like installing a new cluster: it creates new certs and copies them to the other masters, but probably doesn't restart the components on them. Until they are restarted, they keep the old certs in memory while the new ones sit on the filesystem. When a new master then tries to join, it gets the new certs, which won't match the still-running components.
I would check whether there is a backup of the certs and whether they have changed.
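
A quick way to check that from the deployment machine could be an ad-hoc run across the etcd group. A sketch (certificate location assumed from kubespray's defaults; inventory path from this thread):

ansible -i inventory/test-l2-multi/hosts.yml etcd -b -m shell \
  -a "openssl x509 -noout -fingerprint -sha256 -in /etc/ssl/etcd/ssl/ca.pem && ls -l /etc/ssl/etcd/ssl/"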

@ccaillet1974
Author

As said earlier, the "new node" was produced by removing the node (with remove-node.yml), upgrading it from bullseye to bookworm, and then adding it back to the cluster with the appropriate command.

I had already done this process with kubespray (branch release-2.22): at that time I moved the control_plane nodes of another cluster from VMs to bare-metal servers using the process described in docs/nodes.md, and everything worked perfectly.

And the node IS NOT the first master, because nodes.md describes how to remove/add the first control_plane node and I followed that documentation.

For me there are some problems in the release-2.23 branch ... I haven't tried the master branch.

Actually, I'm working on another solution for upgrading my nodes (sketched below):
1- drain the node
2- upgrade via apt full-upgrade
3- reboot the node
4- uncordon the node
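
A minimal sketch of those steps, assuming kubectl access from the deployment machine and root SSH to the node (node name taken from this thread):

kubectl drain lyo0-k8s-testm00 --ignore-daemonsets --delete-emptydir-data
ssh root@lyo0-k8s-testm00 'apt update && apt full-upgrade -y && reboot'
# wait for the node to come back up and report Ready, then:
kubectl uncordon lyo0-k8s-testm00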

@FingerlessGlov3s

Actually, I'm working on another solution for upgrading my nodes:
1- drain the node
2- upgrade via apt full-upgrade
3- reboot the node
4- uncordon the node

Any update on this? Or did you go about upgrading the OS itself differently?

@ccaillet1974
Author

Hi,

Sorry for the delay :)

Everything is working with this method. I've also tested with all nodes on the same distro version (all on Debian bookworm), and now deleting a control_plane node and re-adding it works.

So maybe the issue is due to the version mismatch between control_plane nodes; my two cents :)

Regards

@FingerlessGlov3s

Hi,

Sorry for the delay :)

Everything is working with this method. I've also tested with all nodes on the same distro version (all on Debian bookworm), and now deleting a control_plane node and re-adding it works.

So maybe the issue is due to the version mismatch between control_plane nodes; my two cents :)

Regards

So what's your process now?

Remove the node, upgrade the OS, then add it back?

@ccaillet1974
Author

ccaillet1974 commented Dec 7, 2023 via email

@VannTen
Contributor

VannTen commented Feb 8, 2024

Possibly related #upgrade #10808

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned on Jul 7, 2024