Fix race condition when joining nodes #72030
What type of PR is this?
What this PR does / why we need it:
To fix this problem, not only wait for the kubelet's kubeconfig file to be
Which issue(s) this PR fixes:
Does this PR introduce a user-facing change?:
Hi @ereslibre. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with
Once the patch is verified, the new status will be reflected by the
I understand the commands that are listed here.
Although we were checking for the kubelet kubeconfig file to be present, the kubelet first writes this file and then the certificates the kubeconfig file refers to. This is a race condition in kubeadm: once we confirm that the kubelet's kubeconfig file is present, we go on to create a clientset from it. However, clientset creation checks that the certificates the kubeconfig file refers to exist on the filesystem. To fix this problem, not only wait for the kubelet's kubeconfig file to be present, but also ensure that we can create a clientset out of it in our polling process, while we wait for the kubelet to have performed the TLS bootstrap.
fabriziopandini left a comment
@ereslibre Thanks! For this PR
However, I would kindly ask for the opinion of @luxas / @timothysc as well, because I think that if there is a flake in the kubelet TLS bootstrap, it might be better to get it fixed in the kubelet as well.
@fabriziopandini As far as I could see from the kubelet bootstrap logic:
This is what I understood from a first inspection of the source code, if I'm wrong please correct me.
I don't think the way the kubelet bootstraps is flaky per se; the problem is that kubeadm waits solely for the kubelet's final kubeconfig to be present, and assumes that the moment it exists we can go on to create a clientset from it (when the kubelet might not yet have written the certificates to disk and created the symlink to the current one). The kubelet always bootstraps just fine; it is kubeadm that aborts execution when trying to create the clientset, so the cri annotation is not created on the new node (and any further operations added after that point in the future would not happen either).
rosti left a comment
Thanks for the proposed solution @ereslibre !
For the sake of completeness, apart from the error itself, here's more evidence about the times at which these files were created:
As you can see
[APPROVALNOTIFIER] This PR is APPROVED
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing