Kubelet does not restart or reregister in response to removed Node API object #71398

liggitt · 2018-11-24T15:22:36Z

Once a kubelet has started up, if its Node API object is removed, the kubelet perpetually attempts and fails to update the status on the now-missing Node object.

I would have expected it to do one of the following:

exit after a period of time or number of retries
re-register the Node object

/kind bug
/sig node

Pingan2017 · 2018-11-26T02:56:11Z

if no one plan to work on it, i'd like to take this.

justinsb · 2018-11-26T15:13:55Z

I'm a little unsure about whether we should be re-registering the Node. My guess is that if a Node was deleted, it was usually because the administrator is preparing to terminate the underlying machine. I guess the scenario we're thinking about here is that the node controller removed it mistakenly? I don't know if there's some other scenario?

We had a bug in kops where when a machine is deleted from a GCE MIG it often comes back with the same name. So if the machine came back faster than the Node controller could remove the Node, then the new machine would assume the previous identity. And - critically - it wouldn't remove any taints, and of course we had cordoned it prior to shutdown.

(In theory this can happen on AWS also, but it's less likely)

Maybe there's a better fix than the one we did in kops. e.g. maybe the kubelet would know whether it was the "same machine" or not and remove its own taints on boot, but I'm not sure.

I think this is also a sig-cluster-lifecycle issue therefore.

/sig cluster-lifecycle

liggitt · 2018-11-26T15:19:00Z

since if the kubelet process restarted, it would re-register the Node object, the current behavior is fragile at best, and perpetually attempting and failing to report status really seems like a bug.

liggitt · 2018-11-26T15:33:05Z

if a Node was deleted, it was usually because the administrator is preparing to terminate the underlying machine

wouldn't the proper order be to:

terminate the machine (or in a cloud environment, have the cloud provider terminate the instance)
delete the Node object (or in a cloud environment, let the node controller delete the Node object)

Deleting the node object first is racy if the kubelet process happens to restart.

I guess the scenario we're thinking about here is that the node controller removed it mistakenly? I don't know if there's some other scenario?

accidental deletion, or assumption that a deleted Node object will get re-registered and updated by the kubelet (since that's what the kubelet currently does in some cases)

justinsb · 2018-11-26T16:16:11Z

@liggitt I agree - deleting the Node first breaks if the kubelet restarts (but less pathologically than all your new nodes coming up cordoned, IMO - the node that is about to be terminated gets uncordoned and presumably gets pods scheduled to it that will be short-lived).

Terminating the machine and then deleting the Node is also potentially problematic - e.g. if we fail to delete the Node (or get delayed and delete the new Node).

None of these are likely, but we needed to do something to stop nodes with the same Name coming back cordoned.

I do think we should figure out what we want to do here - maybe involving the kubelet checking the instance id or machine id. I'm hoping that this overlaps with the cluster-api work (cc @roberthbailey ), which is why I looped in sig-cluster-lifecycle also.

roberthbailey · 2018-11-27T15:46:59Z

Deleting the node object first is racy if the kubelet process happens to restart.

I think that @justinsb is saying that both orders are racy (at least in cloud environments) for different reasons.

Terminating the machine and then deleting the Node is also potentially problematic - e.g. if we fail to delete the Node (or get delayed and delete the new Node).

This is certainly true for certain types of deletions. If you "recreate" a VM that is part of a MIG it comes back with the same name. Also if you delete and recreate a preembtible VM (on GCE) it will come back with the same name (and a different IP address). Not having deleted the node object in both of those cases can be problematic if the new kubelet / machine assume the identity (and any running pods) from the old kubelet / machine.

/cc @dchen1107

erwbgy · 2019-02-03T19:35:48Z

Since this issue is still open I presume there was no conclusion to this topic? If there was, could someone post an update here?

Personally I like the "exit after a period of time or number of retries" @liggitt proposed, with the default being 0 to retry forever to keep the current behavior, while still making it possible to have the node re-registered after some time when the kubelet restarts, if that is what the admin wants.

fejta-bot · 2019-05-04T19:44:33Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2019-06-03T20:28:31Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2019-07-03T21:16:08Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2019-07-03T21:16:15Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SergeyKanzhelev · 2022-04-06T17:44:54Z

/triage accepted

sftim · 2022-12-20T12:05:00Z

I'm a little unsure about whether we should be re-registering the Node. My guess is that if a Node was deleted, it was usually because the administrator is preparing to terminate the underlying machine. I guess the scenario we're thinking about here is that the node controller removed it mistakenly? I don't know if there's some other scenario?

You can set up your cluster with a finalizer on a node, so that deleting the Node triggers deletion of the underlying computer. For example, Karpenter (a cluster autoscaler) does this.

In that case, the kubelet won't need to reregister because the computer shutdown that Karpenter triggers will stop the kubelet and the running containers.

However, if you don't define such a finalizer, the kubelet could / should attempt to reregister. It'd also be nice to have a metric available for the number of times that the kubelet has seen its node object deleted (etc).

sftim · 2022-12-20T12:08:33Z

💭 we could make a per-node addon that sets set itself as a finalizer for the associated node? The addon could detect pending node deletion, and trigger a graceful kubelet shutdown before finally allowing the node to delete from the API.

If the OS then restarts the kubelet because that's what the sysadmin has configured, I'd be happy for that kubelet to register as a new node (presumably by the same name) and take things from there.

k8s-triage-robot · 2024-01-19T22:05:50Z

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

cartermckinnon · 2024-02-27T19:13:23Z

/triage accepted

I think this issue is still relevant. We've observed a scenario on EKS in which:

A node joins the cluster.
A few minutes later, the aws cloud node lifecycle controller receives an empty response from ec2:DescribeInstances (due to eventual consistency) and determines the instance no longer exists, so it deletes the Node object.
The instance hangs around forever, costing money and doing nothing.

If kubelet were to re-register with the API server in this scenario, everything would be fine. 😄

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 24, 2018

liggitt mentioned this issue Nov 24, 2018

When kubelet synchronizes pod state from API server, should it detect if it is registered to the cluster #68871

Closed

liggitt added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Nov 24, 2018

Pingan2017 mentioned this issue Nov 26, 2018

re-register the Node if Node API object removed #71407

Closed

k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Nov 26, 2018

liggitt mentioned this issue Dec 18, 2018

Node being deleted when network outage encountered #71562

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2019

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 3, 2019

k8s-ci-robot closed this as completed Jul 3, 2019

liggitt mentioned this issue Jul 2, 2020

Automatically re-register node when node is removed from API server #92769

Closed

adammw mentioned this issue Jul 13, 2020

[WIP] Add maxRegistrationAttempts flag to kubelet #93037

Closed

adisky mentioned this issue Jun 24, 2021

Kubelet doesn't attempt to re-register #9085

Open

liggitt reopened this Apr 5, 2022

liggitt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Apr 5, 2022

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 5, 2022

SergeyKanzhelev added this to Triage in SIG Node Bugs Apr 6, 2022

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2022

SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Apr 6, 2022

tzneal mentioned this issue Feb 15, 2023

Kubelet does not try to register again if its Node object is deleted #115761

Closed

ruiwen-zhao mentioned this issue Feb 17, 2023

kubelet should not renew the list if the object was deleted after registration #115760

Closed

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 27, 2024

cartermckinnon linked a pull request Feb 27, 2024 that will close this issue

(kubelet) Re-register with API server when Node is not found #123535

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kubelet does not restart or reregister in response to removed Node API object #71398

Kubelet does not restart or reregister in response to removed Node API object #71398

liggitt commented Nov 24, 2018 •

edited

Pingan2017 commented Nov 26, 2018

justinsb commented Nov 26, 2018

liggitt commented Nov 26, 2018

liggitt commented Nov 26, 2018

justinsb commented Nov 26, 2018

roberthbailey commented Nov 27, 2018

erwbgy commented Feb 3, 2019

fejta-bot commented May 4, 2019

fejta-bot commented Jun 3, 2019

fejta-bot commented Jul 3, 2019

k8s-ci-robot commented Jul 3, 2019

SergeyKanzhelev commented Apr 6, 2022

sftim commented Dec 20, 2022

sftim commented Dec 20, 2022

k8s-triage-robot commented Jan 19, 2024

cartermckinnon commented Feb 27, 2024

Kubelet does not restart or reregister in response to removed Node API object #71398

Kubelet does not restart or reregister in response to removed Node API object #71398

Comments

liggitt commented Nov 24, 2018 • edited

Pingan2017 commented Nov 26, 2018

justinsb commented Nov 26, 2018

liggitt commented Nov 26, 2018

liggitt commented Nov 26, 2018

justinsb commented Nov 26, 2018

roberthbailey commented Nov 27, 2018

erwbgy commented Feb 3, 2019

fejta-bot commented May 4, 2019

fejta-bot commented Jun 3, 2019

fejta-bot commented Jul 3, 2019

k8s-ci-robot commented Jul 3, 2019

SergeyKanzhelev commented Apr 6, 2022

sftim commented Dec 20, 2022

sftim commented Dec 20, 2022

k8s-triage-robot commented Jan 19, 2024

cartermckinnon commented Feb 27, 2024

liggitt commented Nov 24, 2018 •

edited