[release-1.24] Simultaneously started K3s servers may race to create CA certificates when using external SQL #7224
## Environment Details

Infrastructure:
Node(s) CPU architecture, OS, and version:
Cluster Configuration: 5 server nodes joining simultaneously, backed by a Postgres 14 DB hosted on a t2.micro Ubuntu 22.04 instance.

Config.yaml:
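The actual config.yaml contents were not captured above. As a rough sketch only, a server config for this kind of setup (external Postgres datastore plus a shared join token) might be written out on each node like the following; the endpoint, credentials, and token are placeholder values, not the reporter's real ones:

```bash
# Hypothetical example - the reporter's real config.yaml was not included in the report.
# Drops a minimal K3s server config pointing at an external Postgres datastore.
sudo mkdir -p /etc/rancher/k3s
sudo tee /etc/rancher/k3s/config.yaml > /dev/null <<'EOF'
# placeholder endpoint and credentials - substitute your own
datastore-endpoint: "postgres://k3s:changeme@db.example.com:5432/k3s"
token: "shared-cluster-token"
write-kubeconfig-mode: "0644"
EOF
```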
Unable to reproduce consistently with five server nodes
Results: Thankfully, on v1.24.12+k3s1 I was able to trigger this on the first attempt using an external DB.

$ sudo journalctl -u k3s | grep -i "ecdsa" | grep -i "unable"
This still seems to be an issue on v1.24.13-rc1+k3s1:

$ kgn
We may need an rc2 for k3s to address this race condition on the v1.24 branch. Simultaneous installation across five nodes:

$ sudo INSTALL_K3S_VERSION=v1.24.13-rc1+k3s1 INSTALL_K3S_EXEC=server ./install-k3s.sh
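For reference, one way to get the five installs to start as close to simultaneously as possible (a sketch, assuming passwordless SSH to each node and that the install script is already present there; the hostnames are placeholders) is to fan the command out in parallel:

```bash
# Hypothetical reproduction helper - node names are placeholders.
# Kick off the K3s server install on all five nodes at roughly the same time.
for node in server-1 server-2 server-3 server-4 server-5; do
  ssh "$node" "sudo INSTALL_K3S_VERSION=v1.24.13-rc1+k3s1 INSTALL_K3S_EXEC=server ./install-k3s.sh" &
done
wait
```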
$ set_kubefig // export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

After a couple of minutes (almost exactly 2 minutes 30 seconds) the state seems to resolve and the cluster does begin to report as Ready:

$ kgn
Something is still off with helm deployments in this cluster.
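As a quick way to see what the packaged Helm install jobs are doing (a sketch, not commands from the original report):

```bash
# Sketch only - not commands from the original report.
# K3s's packaged charts are installed by helm-install-* jobs in kube-system;
# checking those jobs/pods and recent events is a reasonable first step.
kubectl -n kube-system get jobs,pods | grep -i helm-install
kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -n 20
```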
Any thoughts or additional opinions here, @brandond?
@VestigeJ please grab the following from all the nodes:
Just documenting: these logs were sent via Slack on Friday morning.
It looks to me like the race condition in question has been resolved: all cluster members have the correct CA certificates, which is what this issue is scoped to fixing. We can see that the server in question found the bootstrap key locked, and properly waited for another server to populate it:

Apr 14 03:14:40 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:40Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Apr 14 03:14:40 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:40Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/0085d5372d5c39a4b3a6b12330b17b21cc76d5faef3e4785cf3a1a85722607b6"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Starting k3s v1.24.13-rc1+k3s1 (3f79b289)"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Configuring postgres database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Database tables and indexes are up to date"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Kine available at unix://kine.sock"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Bootstrap key is locked - waiting for data to be populated by another server"
Apr 14 03:14:44 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:44Z" level=info msg="Reconciling bootstrap data between datastore and disk"
Apr 14 03:14:44 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:44Z" level=debug msg="/var/lib/rancher/k3s/server/cred directory is empty"
Apr 14 03:14:44 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:44Z" level=debug msg="One or more certificate directories do not exist; writing data to disk from datastore"

It looks like there is a bit of thrashing amongst the leader-elected controllers as they all try to take leases at the same time. The database struggles a bit and there are a bunch of "Slow SQL" warnings until things settle out, at which point everything returns to normal.
Eventually things do settle out though, once the apiserver struggles through the slow SQL warnings to finish initializing:
I would consider this validated successfully. You might re-run the test with a more performant datastore (either higher-capacity Postgres, or MySQL/MariaDB) to see if things come up faster, but the issue in question has been resolved.

I am not sure why the helm install pod is stuck; it looks like something went wrong with kube-proxy on that node, as it is unable to reach the in-cluster apiserver endpoint. At this point it does not appear to be related to the issue with the cluster CA certificates that we're trying to validate, but rather something that failed due to the datastore being resource constrained during startup. If that is reproducible, we should track it in a separate issue.
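If that kube-proxy symptom does come back, a starting point for narrowing it down (a sketch, not steps from the original thread; 10.43.0.1 is the default ClusterIP of the kubernetes service in K3s) could be:

```bash
# Sketch of follow-up diagnostics for the suspected kube-proxy problem.
# Confirm the in-cluster apiserver service still maps to healthy endpoints:
kubectl get endpoints kubernetes -o wide
# From the affected node, confirm the service VIP is reachable at all
# (kube-proxy programs this path via iptables/ipvs rules):
curl -sk https://10.43.0.1:443/version
# kube-proxy runs embedded in the k3s process, so its messages are in the unit journal:
sudo journalctl -u k3s | grep -i proxy
```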
Confirming this worked as expected by removing the extra --server arg from the subsequent control plane nodes and solely targeting the database endpoint for cluster joining.
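As an additional spot check (not part of the original thread), one can confirm that every server ended up with identical cluster CA certificates; a sketch assuming SSH access, with placeholder hostnames:

```bash
# Hypothetical verification step - hostnames are placeholders.
# All servers should report identical digests for the cluster CA certificates.
for node in server-1 server-2 server-3 server-4 server-5; do
  echo "== $node =="
  ssh "$node" "sudo sha256sum /var/lib/rancher/k3s/server/tls/server-ca.crt /var/lib/rancher/k3s/server/tls/client-ca.crt"
done
```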