
Problem when changing control_plane node (upgrade from Debian bullseye to bookworm) #10560

Closed
ccaillet1974 opened this issue Oct 26, 2023 · 18 comments
Labels
kind/bug, lifecycle/rotten

Comments

@ccaillet1974

ccaillet1974 commented Oct 26, 2023

Environment:

  • Bare-metal cluster; upgrading a member from bullseye to bookworm. The following versions are from the kubespray deployment machine.

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 5.10.0-23-amd64 x86_64
    PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
    NAME="Debian GNU/Linux"
    VERSION_ID="11"
    VERSION="11 (bullseye)"
    VERSION_CODENAME=bullseye
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"

  • Version of Ansible (ansible --version):
    ansible [core 2.14.11]
    config file = /homes/totof/myk8s/kubespray-master/ansible.cfg
    configured module search path = ['/homes/totof/myk8s/kubespray-master/library']
    ansible python module location = /usr/local/lib/python3.9/dist-packages/ansible
    ansible collection location = /homes/totof/.ansible/collections:/usr/share/ansible/collections
    executable location = /usr/local/bin/ansible
    python version = 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (/usr/bin/python3)
    jinja version = 3.1.2
    libyaml = True

  • Version of Python (python --version):
    Python 3.9.2

Kubespray version (commit) (git rev-parse --short HEAD):
7dcc22f

Network plugin used:
cilium

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
inventory-variables.txt

Command used to invoke ansible:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K --limit=kube_control_plane cluster.yml

Output of ansible run:

TASK [etcd : Configure | Ensure etcd is running] ***********************************************************************************************************************************************************************************************
ok: [lyo0-k8s-testm02]
ok: [lyo0-k8s-testm01]
fatal: [lyo0-k8s-testm00]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because a timeout was exceeded.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
Thursday 26 October 2023  10:40:33 +0200 (0:01:30.885)       0:09:46.376 ******
Thursday 26 October 2023  10:40:33 +0200 (0:00:00.071)       0:09:46.448 ******
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (1 retries left).

TASK [etcd : Configure | Wait for etcd cluster to be healthy] **********************************************************************************************************************************************************************************
fatal: [lyo0-k8s-testm01]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.042724", "end": "2023-10-26 10:41:24.339102", "msg": "non-zero return code", "rc": 1, "start": "2023-10-26 10:41:19.296378", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-10-26T10:41:24.329966+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000344fc0/10.141.10.65:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-10-26T10:41:24.329966+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000344fc0/10.141.10.65:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************
lyo0-k8s-testm00           : ok=496  changed=47   unreachable=0    failed=1    skipped=494  rescued=0    ignored=1
lyo0-k8s-testm01           : ok=508  changed=9    unreachable=0    failed=1    skipped=574  rescued=0    ignored=0
lyo0-k8s-testm02           : ok=484  changed=10   unreachable=0    failed=0    skipped=501  rescued=0    ignored=0

Thursday 26 October 2023  10:41:24 +0200 (0:00:51.257)       0:10:37.705 ******
===============================================================================
etcd : Configure | Ensure etcd is running ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 90.89s
etcd : Configure | Wait for etcd cluster to be healthy --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 51.26s
etcd : Gen_certs | Write etcd member/admin and kube_control_plane client certs to other etcd nodes ------------------------------------------------------------------------------------------------------------------------------------- 18.48s
download : Download_container | Download image if required ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.25s
etcd : Gen_certs | Write node certs to other etcd nodes -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 12.02s
container-engine/runc : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.76s
download : Download_file | Download item ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.66s
container-engine/containerd : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.66s
etcdctl_etcdutl : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.57s
container-engine/crictl : Download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.57s
container-engine/nerdctl : Download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.50s
container-engine/crictl : Extract_file | Unpacking archive ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.28s
etcd : Gen_certs | Gather etcd member/admin and kube_control_plane client certs from first etcd node ------------------------------------------------------------------------------------------------------------------------------------ 6.18s
etcdctl_etcdutl : Extract_file | Unpacking archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.08s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5.57s
container-engine/nerdctl : Extract_file | Unpacking archive ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.56s
container-engine/containerd : Containerd | Unpack containerd archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.54s
container-engine/validate-container-engine : Populate service facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.50s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.45s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5.15s

Description of the problem :
I ran the following command to remove one control_plane node, which was running bullseye (Debian 11):
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K remove-node.yml -e node=lyo0-k8s-testm00

I had previously changed the host order in the inventory file as described in docs/nodes.md. I then reinstalled the node with bookworm and tried to add it back to the cluster with:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K --limit=kube_control_plane cluster.yml
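
To summarize, the whole sequence was roughly the following (commands condensed from the ones above; inventory path and node name are the ones used on this cluster):

# 1. Reorder the hosts in the inventory as described in docs/nodes.md
# 2. Remove the node:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  remove-node.yml -e node=lyo0-k8s-testm00
# 3. Reinstall the node with Debian bookworm
# 4. Add it back:
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  --limit=kube_control_plane cluster.yml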

Here are the etcd log entries on the new control_plane node:

2023-10-26T11:03:32.800850+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.800034+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b is starting a new election at term 1"}
2023-10-26T11:03:32.801193+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801124+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b became pre-candidate at term 1"}
2023-10-26T11:03:32.801382+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.80132+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b received MsgPreVoteResp from 7d81e23d9d41da1b at term 1"}
2023-10-26T11:03:32.801584+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801524+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to c101cbbb43bf28a0 at term 1"}
2023-10-26T11:03:32.801780+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801721+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to eb484ede068d3a18 at term 1"}
2023-10-26T11:03:34.838611+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838023+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839025+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838013+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839093+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838049+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839234+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838979+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}

The command failed at the etcd stage (etcd deployed with the host method) with the message shown in the output above. I have already done this kind of operation, but only when changing hardware for control plane nodes, with the kubespray release-2.22 branch, and everything worked well.
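
(For reference, the health check run by that task can be reproduced by hand from one of the existing etcd nodes. A minimal sketch, assuming kubespray's usual certificate layout under /etc/ssl/etcd/ssl; the certificate file names and the endpoint IP are assumptions taken from this setup and should be adjusted:)

ETCDCTL_API=3 /usr/local/bin/etcdctl \
  --endpoints=https://10.141.10.65:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-lyo0-k8s-testm01.pem \
  --key=/etc/ssl/etcd/ssl/admin-lyo0-k8s-testm01-key.pem \
  endpoint status --cluster
# The same flags with "endpoint health --cluster" reproduce the health part of the check.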

Any assistance in resolving this etcd problem would be appreciated.

@ccaillet1974 added the kind/bug label on Oct 26, 2023
@ccaillet1974
Author

ADDENDUM:

Same problem when trying to add the node on bullseye. For testing, I reinstalled the node with bullseye and hit the same issue when trying to add it back to the control plane with the command described above.

@blackluck

Hello, it's not clear to me: are the master nodes also etcd nodes? If so, maybe you should also include the etcd group in the limit, e.g. '--limit=kube_control_plane,etcd'

@ccaillet1974
Author

ccaillet1974 commented Oct 26, 2023

OK ... I'll test the add with etcd included in the limit, but in my setup the control_plane nodes also run etcd ... I'll report back in about 20 minutes.

EDIT: Same error with etcd included in the limit parameter.

@blackluck

The docs also mention that for etcd nodes you need to set -e ignore_assert_errors=yes
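
For clarity, the full invocation with both suggestions applied would be something like this (a sketch based on the command from the issue description; adjust to your inventory):

ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  --limit=kube_control_plane,etcd -e ignore_assert_errors=yes cluster.yml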

@ccaillet1974
Author

ccaillet1974 commented Oct 26, 2023

Same result with -e ignore_assert_errors=yes

As I said earlier, it seems there is a problem with certificate generation, because I see "tls: bad certificate" in the logs on my new node.

EDIT: this process for replacing a control_plane node worked well with the release-2.22 branch ... maybe a regression?

EDIT 2:

Logs on the other nodes:

Oct 26 16:17:10 lyo0-k8s-testm02 etcd[1375044]: {"level":"warn","ts":"2023-10-26T16:17:10.13056+0200","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.141.10.64:53502","server-name":"","error":"tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}
Oct 26 16:19:08 lyo0-k8s-testm01 etcd[1374901]: {"level":"warn","ts":"2023-10-26T16:19:08.146019+0200","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.141.10.64:60964","server-name":"","error":"tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}

It seems there is a serious problem with certificate generation for the new node (IP 10.141.10.64) :'(
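
One way to confirm that would be to check whether the new node's member certificate chains to the same etcd CA as the one used by the running members. A minimal sketch (the /etc/ssl/etcd/ssl paths and the member-<hostname>.pem naming are assumptions based on kubespray's usual layout):

# On the new node (lyo0-k8s-testm00):
openssl verify -CAfile /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/member-lyo0-k8s-testm00.pem
# On every etcd node, the CA fingerprint should be identical:
openssl x509 -noout -fingerprint -sha256 -in /etc/ssl/etcd/ssl/ca.pem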

@blackluck

Is it possible that you ran it before changing the host order in the inventory as well? That could cause new certs to be generated if the playbook runs against an empty master (because that would be the first node).

@ccaillet1974
Author

ccaillet1974 commented Oct 27, 2023

Yes, I'll test it now.

EDIT: same result when the new node is in first place

TASK [etcd : Configure | Wait for etcd cluster to be healthy] **********************************************************************************************************************************************************************************
fatal: [lyo0-k8s-testm00]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.033871", "end": "2023-10-27 08:56:01.179553", "msg": "non-zero return code", "rc": 1, "start": "2023-10-27 08:55:56.145682", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-10-27T08:56:01.175361+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000388fc0/10.141.10.64:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection closed before server preface received\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-10-27T08:56:01.175361+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000388fc0/10.141.10.64:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection closed before server preface received\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************
lyo0-k8s-testm00           : ok=489  changed=13   unreachable=0    failed=1    skipped=593  rescued=0    ignored=0
lyo0-k8s-testm01           : ok=470  changed=12   unreachable=0    failed=0    skipped=515  rescued=0    ignored=0
lyo0-k8s-testm02           : ok=471  changed=13   unreachable=0    failed=0    skipped=514  rescued=0    ignored=0

Friday 27 October 2023  08:56:01 +0200 (0:00:42.656)       0:08:10.829 ********
===============================================================================
etcd : Configure | Wait for etcd cluster to be healthy --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 42.66s
etcd : Gen_certs | Write etcd member/admin and kube_control_plane client certs to other etcd nodes ------------------------------------------------------------------------------------------------------------------------------------- 16.84s
etcd : Gen_certs | Write node certs to other etcd nodes -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 11.85s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 7.45s
container-engine/containerd : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 7.37s
download : Download_file | Download item ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.90s
container-engine/runc : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.86s
container-engine/nerdctl : Download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.76s
etcdctl_etcdutl : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.73s
container-engine/crictl : Download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.72s
etcdctl_etcdutl : Extract_file | Unpacking archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.39s
container-engine/crictl : Extract_file | Unpacking archive ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.35s
container-engine/validate-container-engine : Populate service facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.90s
container-engine/nerdctl : Extract_file | Unpacking archive ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.73s
etcd : Gen_certs | Gather etcd member/admin and kube_control_plane client certs from first etcd node ------------------------------------------------------------------------------------------------------------------------------------ 5.61s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.40s
container-engine/containerd : Containerd | Unpack containerd archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.70s
container-engine/runc : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.61s
container-engine/containerd : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.49s
etcdctl_etcdutl : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.41s

Here is the syslog on the new node when trying to start etcd:

Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.032429+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b is starting a new election at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033375+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b became pre-candidate at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033601+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b received MsgPreVoteResp from 7d81e23d9d41da1b at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033838+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to c101cbbb43bf28a0 at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.034049+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to eb484ede068d3a18 at term 1"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.067894+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.069697+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.072727+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.074858+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.048835+0200","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"7d81e23d9d41da1b","local-member-attributes":"{Name:etcd1 ClientURLs:[https://10.141.10.64:2379]}","request-path":"/0/members/7d81e23d9d41da1b/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.073383+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.074327+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.074809+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.075396+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: start operation timed out. Terminating.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: Failed with result 'timeout'.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: Failed to start etcd.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: Consumed 18.670s CPU time.

@blackluck

Sorry, I wasn't asking you to do it this way.
I was asking whether you had already run it earlier, because if you try to add an empty new node while it is the first master, kubespray won't find certs on it and treats it somewhat like installing a new cluster: it creates new certs and copies them to the other masters, but probably doesn't restart the components on them. Until they are restarted, they keep the old certs in memory while the new ones sit on the filesystem. When a new master then tries to join, it gets the new certs, which won't match the still-running components.
I would check whether there is a backup of the certs and whether they have changed.
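
A quick way to check that from the deployment machine could be an ad-hoc run across the etcd group. A sketch (certificate location assumed from kubespray's defaults; inventory path from this thread):

ansible -i inventory/test-l2-multi/hosts.yml etcd -b -m shell \
  -a "openssl x509 -noout -fingerprint -sha256 -in /etc/ssl/etcd/ssl/ca.pem && ls -l /etc/ssl/etcd/ssl/"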

@ccaillet1974
Author

As said earlier, the "new node" was produced by removing the node (with remove-node.yml), upgrading it from bullseye to bookworm, and then adding it back to the cluster with the appropriate command.

I had already done this process with kubespray (branch release-2.22): at that time I moved the control_plane nodes of another cluster from VMs to bare-metal servers using the process described in docs/nodes.md, and everything worked perfectly.

And the node IS NOT the first master, because nodes.md describes how to remove/add the first control_plane node and I followed that documentation.

For me there are some problems in the release-2.23 branch ... I haven't tried the master branch.

Actually, I'm working on another solution for upgrading my nodes (sketched below):
1- drain the node
2- upgrade via apt full-upgrade
3- reboot the node
4- uncordon the node
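
A minimal sketch of those steps, assuming kubectl access from the deployment machine and root SSH to the node (node name taken from this thread):

kubectl drain lyo0-k8s-testm00 --ignore-daemonsets --delete-emptydir-data
ssh root@lyo0-k8s-testm00 'apt update && apt full-upgrade -y && reboot'
# wait for the node to come back up and report Ready, then:
kubectl uncordon lyo0-k8s-testm00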

@FingerlessGlov3s

Actually, I'm working on another solution for upgrading my nodes:
1- drain the node
2- upgrade via apt full-upgrade
3- reboot the node
4- uncordon the node

Any update on this? Or did you go about upgrading the OS itself differently?

@ccaillet1974
Author

Hi,

Sorry for the delay :)

Everything is working with this method. I've also tested with all nodes on the same distro version (all on Debian bookworm), and now deleting a control_plane node and re-adding it works.

So maybe the issue is due to the version mismatch between control_plane nodes; my two cents :)

Regards

@FingerlessGlov3s

Hi,

Sorry for the delay :)

Everything is working with this method. I've also tested with all nodes on the same distro version (all on Debian bookworm), and now deleting a control_plane node and re-adding it works.

So maybe the issue is due to the version mismatch between control_plane nodes; my two cents :)

Regards

So what's your process now?

Remove the node, upgrade the OS, then add it back?

@ccaillet1974
Author

ccaillet1974 commented Dec 7, 2023 via email

@VannTen
Contributor

VannTen commented Feb 8, 2024

Possibly related #upgrade #10808

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned on Jul 7, 2024