Kubelet does not restart or reregister in response to removed Node API object #71398

Once a kubelet has started up, if its Node API object is removed, the kubelet perpetually attempts and fails to update the status of the now-missing Node object.

I would have expected it to do one of the following:

- exit, so that whatever supervises the kubelet process can restart it, and it re-registers the Node on startup, or
- re-register and recreate the Node object itself

/kind bug
/sig node

Comments
If no one plans to work on it, I'd like to take this.
I'm a little unsure whether we should be re-registering the Node. My guess is that if a Node was deleted, it's usually because the administrator is preparing to terminate the underlying machine. I assume the scenario we're considering here is that the node controller removed it mistakenly? I don't know if there's some other scenario.

We had a bug in kops where a machine deleted from a GCE MIG often comes back with the same name. If the machine came back faster than the node controller could remove the Node, the new machine would assume the previous identity. And - critically - it wouldn't remove any taints, and of course we had cordoned the node prior to shutdown. (In theory this can happen on AWS too, but it's less likely.)

Maybe there's a better fix than the one we did in kops: e.g. the kubelet could determine whether it is the "same machine" on boot and remove its own taints if not (see the sketch after this comment), but I'm not sure. I therefore think this is also a sig-cluster-lifecycle issue. /sig cluster-lifecycle
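A rough sketch of what that "same machine" check could look like, using client-go: a bootstrap step records the machine's identity on the Node and uncordons it and clears taints when the identity changes. The annotation key example.com/machine-id, the reliance on /etc/machine-id, and the blanket taint removal are all assumptions for illustration; this is not what kops or the kubelet actually does.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// machineIDAnnotation is an invented key used only for this sketch.
const machineIDAnnotation = "example.com/machine-id"

func main() {
	nodeName := os.Getenv("NODE_NAME")

	// Identify this machine; /etc/machine-id changes when the underlying
	// host is replaced, even if the Node name stays the same.
	raw, err := os.ReadFile("/etc/machine-id")
	if err != nil {
		panic(err)
	}
	localID := strings.TrimSpace(string(raw))

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	node, err := cs.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	if node.Annotations[machineIDAnnotation] != localID {
		// A different machine has reused this Node name: uncordon, clear the
		// stale taints left by the old machine, and record the new identity.
		node.Spec.Unschedulable = false
		node.Spec.Taints = []corev1.Taint{}
		if node.Annotations == nil {
			node.Annotations = map[string]string{}
		}
		node.Annotations[machineIDAnnotation] = localID
		if _, err := cs.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		fmt.Println("machine identity changed; uncordoned and cleared stale taints")
	}
}
```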
Since a restarted kubelet process would re-register the Node object anyway, the current behavior is fragile at best, and perpetually attempting and failing to report status really seems like a bug.
Wouldn't the proper order be to delete the Node object first, and then terminate the machine?
Deleting the node object first is racy if the kubelet process happens to restart.
As for other scenarios: accidental deletion, or the assumption that a deleted Node object will get re-registered and updated by the kubelet (since that's what the kubelet currently does in some cases).
@liggitt I agree - deleting the Node first breaks if the kubelet restarts (though less pathologically than all your new nodes coming up cordoned, IMO: the node that is about to be terminated gets uncordoned and presumably gets pods scheduled to it that will be short-lived). Terminating the machine and then deleting the Node is also potentially problematic, e.g. if we fail to delete the Node (or get delayed and delete the new Node instead). None of these is likely, but we needed to do something to stop nodes with the same name coming back cordoned.

I do think we should figure out what we want to do here - maybe involving the kubelet checking the instance ID or machine ID. I'm hoping this overlaps with the cluster-api work (cc @roberthbailey), which is why I looped in sig-cluster-lifecycle as well.
I think that @justinsb is saying that both orders are racy (at least in cloud environments) for different reasons.
This is certainly true for certain types of deletion. If you "recreate" a VM that is part of a MIG, it comes back with the same name. Likewise, if you delete and recreate a preemptible VM (on GCE), it comes back with the same name (and a different IP address). Not having deleted the Node object in both of those cases can be problematic if the new kubelet/machine assumes the identity (and any running pods) of the old kubelet/machine. /cc @dchen1107
Since this issue is still open, I presume there was no conclusion on this topic? If there was, could someone post an update here? Personally I like the "exit after a period of time or number of retries" approach @liggitt proposed, with a default of 0 meaning retry forever: that keeps the current behavior while still making it possible to have the node re-registered after some time, once the kubelet restarts, if that is what the admin wants. (A sketch follows this comment.)
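A minimal sketch of that proposal, assuming an invented maxRetries setting (0 preserves today's retry-forever behavior); runStatusLoop and updateNodeStatus are illustrative stand-ins, not real kubelet code:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// runStatusLoop retries node status updates and exits after maxRetries
// consecutive failures, so the service manager can restart the kubelet,
// which then re-registers the Node. maxRetries == 0 means retry forever
// (the current behavior).
func runStatusLoop(maxRetries int, updateNodeStatus func() error) {
	failures := 0
	for {
		if err := updateNodeStatus(); err != nil {
			failures++
			if maxRetries > 0 && failures >= maxRetries {
				fmt.Fprintln(os.Stderr, "node status updates keep failing; exiting so the supervisor restarts us and we re-register")
				os.Exit(1)
			}
		} else {
			failures = 0 // reset the counter on any success
		}
		time.Sleep(10 * time.Second)
	}
}

func main() {
	// Stand-in for the real status-update call, which would hit the API server.
	runStatusLoop(5, func() error { return fmt.Errorf("nodes \"demo\" not found") })
}
```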
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue in response to the /close above. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage accepted
You can set up your cluster with a finalizer on a Node, so that deleting the Node triggers deletion of the underlying machine. For example, Karpenter (a cluster autoscaler) does this. In that case, the kubelet won't need to reregister, because the machine shutdown that Karpenter triggers will stop the kubelet and the running containers. However, if you don't define such a finalizer, the kubelet could / should attempt to reregister. It'd also be nice to have a metric for the number of times the kubelet has seen its Node object deleted (etc.).
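As a rough illustration of the finalizer idea, a sketch using client-go (not Karpenter's actual code; the finalizer name example.com/node-termination is invented, and a real controller would register its own):

```go
package nodeaddon

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeFinalizer is an invented name used only for this sketch.
const nodeFinalizer = "example.com/node-termination"

// addNodeFinalizer marks a Node so that deletion blocks until a controller
// has had a chance to terminate the underlying machine.
func addNodeFinalizer(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, f := range node.Finalizers {
		if f == nodeFinalizer {
			return nil // already present
		}
	}
	node.Finalizers = append(node.Finalizers, nodeFinalizer)
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```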
💭 We could make a per-node addon that sets itself as a finalizer on the associated Node. The addon could detect a pending Node deletion and trigger a graceful kubelet shutdown before finally allowing the Node to be deleted from the API (see the sketch below). If the OS then restarts the kubelet because that's what the sysadmin has configured, I'd be happy for that kubelet to register as a new Node (presumably with the same name) and take things from there.
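Continuing the sketch above, a hypothetical per-node addon could watch its own Node for a pending deletion, stop the kubelet, and then release the finalizer. The nodeFinalizer constant comes from the previous sketch, and systemctl stop kubelet assumes a systemd-managed kubelet; none of this is an existing addon:

```go
package nodeaddon

import (
	"context"
	"fmt"
	"os/exec"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchForDeletion blocks until this node's API object enters deletion
// (DeletionTimestamp set, held open by our finalizer), shuts the kubelet
// down gracefully, then removes the finalizer so the delete completes.
func watchForDeletion(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	w, err := cs.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + nodeName,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		node, ok := ev.Object.(*corev1.Node)
		if !ok || node.DeletionTimestamp == nil {
			continue
		}
		// Deletion is pending: stop the kubelet before letting the object go.
		if err := exec.CommandContext(ctx, "systemctl", "stop", "kubelet").Run(); err != nil {
			return fmt.Errorf("stopping kubelet: %w", err)
		}
		// Drop our finalizer so the Node object can actually be deleted.
		fresh, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		var kept []string
		for _, f := range fresh.Finalizers {
			if f != nodeFinalizer {
				kept = append(kept, f)
			}
		}
		fresh.Finalizers = kept
		_, err = cs.CoreV1().Nodes().Update(ctx, fresh, metav1.UpdateOptions{})
		return err
	}
	return nil
}
```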
This issue has not been updated in over 1 year and should be re-triaged. For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/. /remove-triage accepted
/triage accepted I think this issue is still relevant. We've observed a scenario on EKS in which a Node object was deleted while the underlying instance and its kubelet kept running. If the kubelet were to re-register with the API server in this scenario, everything would be fine. 😄