[release-1.24] Simultaneously started K3s servers may race to create CA certificates when using external SQL #7224
## Environment Details

Infrastructure:
Node(s) CPU architecture, OS, and version:
Cluster Configuration: 5 server nodes joining simultaneously, backed by a Postgres 14 DB hosted on a t2.micro Ubuntu 22.04 instance.

Config.yaml:
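The actual config.yaml contents were not captured above. As a rough sketch only, a server config for this kind of setup (external Postgres datastore plus a shared join token) might be written out on each node like the following; the endpoint, credentials, and token are placeholder values, not the reporter's real ones:

```bash
# Hypothetical example - the reporter's real config.yaml was not included in the report.
# Drops a minimal K3s server config pointing at an external Postgres datastore.
sudo mkdir -p /etc/rancher/k3s
sudo tee /etc/rancher/k3s/config.yaml > /dev/null <<'EOF'
# placeholder endpoint and credentials - substitute your own
datastore-endpoint: "postgres://k3s:changeme@db.example.com:5432/k3s"
token: "shared-cluster-token"
write-kubeconfig-mode: "0644"
EOF
```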
Unable to reproduce consistently with five server nodes
Results: Thankfully, on v1.24.12+k3s1 I was able to trigger this on the first attempt using an external DB.

$ sudo journalctl -u k3s | grep -i "ecdsa" | grep -i "unable"
This still seems to be an issue on v1.24.13-rc1+k3s1:

$ kgn
We may need an rc2 for k3s to address this race condition on the v1.24 branch. Simultaneous installation across five nodes:

$ sudo INSTALL_K3S_VERSION=v1.24.13-rc1+k3s1 INSTALL_K3S_EXEC=server ./install-k3s.sh
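For reference, one way to get the five installs to start as close to simultaneously as possible (a sketch, assuming passwordless SSH to each node and that the install script is already present there; the hostnames are placeholders) is to fan the command out in parallel:

```bash
# Hypothetical reproduction helper - node names are placeholders.
# Kick off the K3s server install on all five nodes at roughly the same time.
for node in server-1 server-2 server-3 server-4 server-5; do
  ssh "$node" "sudo INSTALL_K3S_VERSION=v1.24.13-rc1+k3s1 INSTALL_K3S_EXEC=server ./install-k3s.sh" &
done
wait
```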
$ set_kubefig // export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

After a couple of minutes (almost exactly 2 minutes 30 seconds) the state seems to resolve and the cluster does begin to report as Ready:

$ kgn
Something is still off with helm deployments in this cluster.
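As a quick way to see what the packaged Helm install jobs are doing (a sketch, not commands from the original report):

```bash
# Sketch only - not commands from the original report.
# K3s's packaged charts are installed by helm-install-* jobs in kube-system;
# checking those jobs/pods and recent events is a reasonable first step.
kubectl -n kube-system get jobs,pods | grep -i helm-install
kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -n 20
```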
Any thoughts or additional opinions here, @brandond?
@VestigeJ please grab the following from all the nodes:
Just documenting: these logs were sent via Slack on Friday morning.
It looks to me like the race condition in question has been resolved: all cluster members have the correct CA certificates, which is what this issue is scoped to fixing. We can see that the server in question found the bootstrap key locked, and properly waited for another server to populate it:

Apr 14 03:14:40 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:40Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Apr 14 03:14:40 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:40Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/0085d5372d5c39a4b3a6b12330b17b21cc76d5faef3e4785cf3a1a85722607b6"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Starting k3s v1.24.13-rc1+k3s1 (3f79b289)"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Configuring postgres database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Database tables and indexes are up to date"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Kine available at unix://kine.sock"
Apr 14 03:14:43 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:43Z" level=info msg="Bootstrap key is locked - waiting for data to be populated by another server"
Apr 14 03:14:44 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:44Z" level=info msg="Reconciling bootstrap data between datastore and disk"
Apr 14 03:14:44 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:44Z" level=debug msg="/var/lib/rancher/k3s/server/cred directory is empty"
Apr 14 03:14:44 ip-172-31-16-12 k3s[20398]: time="2023-04-14T03:14:44Z" level=debug msg="One or more certificate directories do not exist; writing data to disk from datastore"

It looks like there is a bit of thrashing amongst the leader-elected controllers as they all try to take leases at the same time. The database struggles a bit and there are a bunch of "Slow SQL" warnings until things settle out, at which point everything returns to normal.
Eventually things do settle out though, once the apiserver struggles through the slow SQL warnings to finish initializing:
I would consider this validated successfully. You might re-run the test with a more performant datastore (either higher-capacity Postgres, or MySQL/MariaDB) to see if things come up faster, but the issue in question has been resolved.

I am not sure why the helm install pod is stuck; it looks like something went wrong with kube-proxy on that node, as it is unable to reach the in-cluster apiserver endpoint. At this point it does not appear to be related to the issue with the cluster CA certificates that we're trying to validate, but rather something that failed due to the datastore being resource constrained during startup. If that is reproducible, we should track it in a separate issue.
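If that kube-proxy symptom does come back, a starting point for narrowing it down (a sketch, not steps from the original thread; 10.43.0.1 is the default ClusterIP of the kubernetes service in K3s) could be:

```bash
# Sketch of follow-up diagnostics for the suspected kube-proxy problem.
# Confirm the in-cluster apiserver service still maps to healthy endpoints:
kubectl get endpoints kubernetes -o wide
# From the affected node, confirm the service VIP is reachable at all
# (kube-proxy programs this path via iptables/ipvs rules):
curl -sk https://10.43.0.1:443/version
# kube-proxy runs embedded in the k3s process, so its messages are in the unit journal:
sudo journalctl -u k3s | grep -i proxy
```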
Confirming this worked as expected by removing the extra --server arg from the subsequent control plane nodes and solely targeting the database endpoint for cluster joining.
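As an additional spot check (not part of the original thread), one can confirm that every server ended up with identical cluster CA certificates; a sketch assuming SSH access, with placeholder hostnames:

```bash
# Hypothetical verification step - hostnames are placeholders.
# All servers should report identical digests for the cluster CA certificates.
for node in server-1 server-2 server-3 server-4 server-5; do
  echo "== $node =="
  ssh "$node" "sudo sha256sum /var/lib/rancher/k3s/server/tls/server-ca.crt /var/lib/rancher/k3s/server/tls/client-ca.crt"
done
```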