Propose to taint node "shutdown" condition #58635

Closed
jingxu97 opened this issue Jan 22, 2018 · 58 comments · Fixed by #59323 or #60009
Labels
area/cloudprovider area/node-lifecycle Issues or PRs related to Node lifecycle lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. milestone/incomplete-labels sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@jingxu97
Contributor

jingxu97 commented Jan 22, 2018

Today there is only one node condition to represent whether the kubelet is healthy (i.e., updating node status). The node controller marks a node as "NotReady" after the kubelet stops updating for a while. The kubelet may stop responding because:

  1. The network is disconnected
  2. The node is shut down (stopped) or paused
  3. The kubelet has been killed (cannot respond)

Problem

In many cases, when a node becomes "NotReady", its pods are evicted and new pods are started on other nodes (failover for high availability). However, if the node was shut down, the volumes attached to it remain attached. The volume controller should try to detach those volumes when the pods are killed on the old node, but there is a safety check before detaching that verifies the volume is no longer mounted. This check relies on the VolumesInUse field in the node status, which is updated by the kubelet. When the node is shut down, the kubelet no longer updates this field, so the safety check fails and the controller assumes the volume cannot be detached because mounts may still exist. The new pods then stay pending for a while because the volume cannot be attached to the other node, even though in reality the volume could be detached in this situation. (Note: some cloud providers treat a stopped node as non-existent, so the node controller deletes the node API object. Detach still does not happen, because of the mount safety check. Ref #46442)
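As a rough illustration of that safety check (a minimal sketch, not the actual attach/detach controller code; the helper name is made up), the controller only treats a volume as detachable if it is absent from the node's status.volumesInUse list, which the kubelet stops refreshing once the node is shut down:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// safeToDetach is a hypothetical helper: a volume is treated as safe to
// detach only if it no longer appears in the node's status.volumesInUse
// list, which is maintained by the kubelet.
func safeToDetach(node *v1.Node, volumeName v1.UniqueVolumeName) bool {
	for _, inUse := range node.Status.VolumesInUse {
		if inUse == volumeName {
			// Still reported as mounted (or the kubelet simply stopped
			// reporting), so the controller refuses to detach.
			return false
		}
	}
	return true
}

func main() {
	node := &v1.Node{
		Status: v1.NodeStatus{
			VolumesInUse: []v1.UniqueVolumeName{"kubernetes.io/gce-pd/my-disk"},
		},
	}
	// Prints "false": the shut-down node never clears the entry, so the
	// detach stays blocked even though the mount is actually gone.
	fmt.Println(safeToDetach(node, "kubernetes.io/gce-pd/my-disk"))
}
```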

Proposal

We propose to taint the node with a "Stopped" condition. When the kubelet stops updating, the node controller checks the cloud provider to see whether the node is in a "stopped" state (note: different cloud providers may use different terms for this state, e.g., terminated). When the volume controller receives a node update/delete event, it can check the node's taints and mark the volumes attached to that node as safe to detach (i.e., the volumes can be removed from the VolumesInUse list, which indicates which volumes are mounted). When a detach is then issued, the operation can proceed without delay.
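As a minimal sketch of what the taint and the volume controller's check might look like (the taint key `node.cloudprovider.kubernetes.io/shutdown` and the helper name are placeholders for illustration, since the final name was still under discussion in this thread):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Placeholder key/effect; the final taint name was not decided here.
var shutdownTaint = v1.Taint{
	Key:    "node.cloudprovider.kubernetes.io/shutdown",
	Effect: v1.TaintEffectNoSchedule,
}

// hasShutdownTaint is a hypothetical helper the volume controller could call
// when it receives a node update/delete event.
func hasShutdownTaint(node *v1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == shutdownTaint.Key {
			return true
		}
	}
	return false
}

func main() {
	node := &v1.Node{Spec: v1.NodeSpec{Taints: []v1.Taint{shutdownTaint}}}
	fmt.Println(hasShutdownTaint(node)) // true: volumes can be detached safely
}
```

From the command line this would look roughly like `kubectl taint nodes <node> node.cloudprovider.kubernetes.io/shutdown:NoSchedule`, though in this proposal the node controller would apply the taint automatically.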

To determine the "stopped" condition, I lists a few cloud provider's explanation of stopping an instance.

GCE

  • Stopping an instance causes Compute Engine to send the ACPI Power Off signal to the instance. Modern guest operating systems are configured to perform a clean shutdown before powering off in response to the power off signal. Compute Engine waits a short time for the guest to finish shutting down and then transitions the instance to the TERMINATED state.

  • You can stop an instance temporarily so you can come back to it at a later time. A stopped instance does not incur charges, but all of the resources that are attached to the instance will still be charged. Alternatively, if you are done using an instance, delete the instance and its resources to stop incurring charges.

  • You can permanently delete an instance to remove the instance and the associated resources from your project. If the instance is part of an instance group, the group might try to recreate the instance to maintain a certain group size.

AWS

  • You can only stop an Amazon EBS-backed instance. The instance performs a normal shutdown and stops running; its status changes to stopping and then stopped. Any Amazon EBS volumes remain attached to the instance, and their data persists. In most cases, the instance is migrated to a new underlying host computer when it's started.

  • You can delete your instance when you no longer need it. This is referred to as terminating your instance. When an instance terminates, the data on any instance store volumes associated with that instance is deleted. If your instance is in an Auto Scaling group, the Auto Scaling service marks the stopped instance as unhealthy, and may terminate it and launch a replacement instance.

OpenStack

  • Admins can pause and unpause a Nova compute instance. When an instance is paused, the entire state of the instance is kept in RAM. Pausing an instance will disable access to that instance, but won't free up any of its resources. Another option is to suspend, and then resume, an instance. Like paused OpenStack instances, a suspended instance keeps its current state, but it is written to storage.

  • A third option is to shelve OpenStack instances. A shelved instance is actually shut down, which is not the case for suspended or paused instances. If admins decide they no longer need a shelved instance, they can remove it, which ensures that it doesn't keep any hypervisor-level resources in use.

  • The last option is to stop an instance in Nova, which will disconnect all of its associated resources. This means admins can't restore a stopped instance to its previous state. This option is only useful for OpenStack instances that an organization no longer needs. In all other cases, admins should shelve, suspend or pause the instance.

So basically a "stopped" node means that it is shut down like an operating system shut down situation. All mounts are gone when machine is shut down.

cc @yujuhong @dchen1107 @Random-Liu

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 22, 2018
@jingxu97 jingxu97 added area/node-lifecycle Issues or PRs related to Node lifecycle sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 22, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 22, 2018
@jingxu97 jingxu97 added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Jan 22, 2018
@k82cn
Member

k82cn commented Jan 23, 2018

xref #49798

@yujuhong
Contributor

I talked to @bsalamat and @dchen1107 and I think it's advised to use taints instead of conditions (ref: #57690).

That said, I find the definition of "stopped" a bit vague. @jingxu97, could you define what "stopped" means, whether a node can recover from that state, and, if it can, what we expect the state of the recovered node to be? Also, it would be nice to see how "stopped" would be interpreted by different cloud providers while adhering to that definition.

@zetaab
Member

zetaab commented Jan 24, 2018

@yujuhong "stopped" means, for example, a virtual machine in OpenStack that is stopped (not currently running). This case is not supported at all today: if we stop an instance, the node is deleted from Kubernetes, but its volumes are still attached.

The same thing exists at least in AWS, where the instance state can be "stopped". The node is automatically deleted from the kube cluster, but the volumes are not detached immediately.

@jingxu97 jingxu97 changed the title Propose to add "stopped" condition for NodeConditionType Propose to taint node "stopped" condition Jan 24, 2018
@jingxu97
Contributor Author

@yujuhong @bsalamat and @dchen1107 I updated the proposal based on the feedback. PTAL Thanks!

@yastij
Member

yastij commented Jan 24, 2018

@jingxu97 @yujuhong +1, taints are the way to go, as work is being done to move our internal logic to taints instead of conditions.

Also looping in @augabet, @etiennecoutaud and @jhorwit2 for the wg-cloudprovider side.

@bsalamat
Member

There are some ambiguities here for me. To clarify things and see if my understanding is correct, let's look at the different scenarios one at a time (I use "node" and the node's underlying VM interchangeably here):

  1. Node has network issues and cannot be reached: this is recoverable, and I don't think "stopped" would be the right taint for it. It is, after all, not stopped; it is just not responding to network requests. Our control plane can check with the cloud provider API whether the node's underlying VM is still running.
  2. Kubelet has died or is not responding: this is more or less similar to the case above. I don't think "stopped" is the right taint.
  3. Node is paused: I think a "stopped" (or "paused") taint would be appropriate here.
  4. Node is shut down: IIUC, when a node's underlying VM is terminated, the node is deleted from Kubernetes. So there is nothing to taint, and we must detach the volumes. Am I missing something?

So, among the four cases, one of them looks to me like a candidate for a "paused" taint.

@jingxu97
Contributor Author

@bsalamat Thanks for your comments. For scenarios 3 and 4, I think there is some confusion.
3. Node is paused (sleep/suspend): in this case the OS is not shut down, so the mounts still exist. That means we cannot detach the volumes, and we should not taint the node in this situation.
4. Node is shut down: when a node is shut down, it actually still exists in the cloud provider (e.g., GCE, AWS); its status is terminated/stopped. But as issue #46442 shows, the cloud provider code in Kubernetes handles this situation differently: some providers (GCE) will not delete the node from Kubernetes, while others will. We want to taint the node even if it will be deleted, so that the volume controller can check the node taint and know whether the mounts still exist. When a node is deleted, detach is triggered, but the problem is that we don't know whether mounts still exist, so detach is not issued until some timeout expires. If the node is tainted with "shutdown/stopped", we can be sure. It is fine for the node to be deleted as long as it is tainted first.
(Note: for GCE, if the node is managed by an instance group, the shut-down node will actually be deleted from the cloud provider and a new one will come up to replace it.)

Please let me know if you have questions about it. Thanks!

@bsalamat
Member

@jingxu97 Thanks for the clarification; I understand it better now. So the taint will be applied only in scenario 4, where the node is shut down, because the node is not deleted from the Kubernetes cluster on all cloud providers. This sounds good to me. Maybe we should call the taint "shutdown" then?

@yujuhong
Contributor

I've already seen many different operations like "suspend", "pause", "stop", "shelve"...
Regardless of what the final name of the taint is, I think it's fair to say that the taint should clearly indicate that the node is in a shut-down state (e.g., with its volumes unmounted).

@yastij
Member

yastij commented Jan 26, 2018

@jingxu97 @bsalamat @yujuhong - before moving on with this, we should clarify how to handle quick-reboot use cases, where users rely on the fact that volumes are still attached.

One solution is to add a flag for this so we don't break the expected behaviour.

@jingxu97
Contributor Author

@yastij For the quick-reboot case, I think the following happens:

  1. After the node has been shut down for about half a minute, the node controller asks the cloud provider whether the node still exists. If the node has already restarted by that point, the node controller does nothing (other than updating the node status).
  2. If the node is not yet up when the node controller checks the cloud provider, then today, on some cloud providers (AWS, etc.), the node object is deleted from the API server because the node is not in a running state. Pods may soon be garbage collected because the node no longer exists (again, if the node comes back soon, the pod garbage collector might not delete the pods). When the pods are deleted, the volume controller tries to detach the volumes, but this fails because it assumes the mounts still exist. On other cloud providers (GCE, etc.), the node object is not deleted as long as the node still exists in the cloud provider (its state changes to terminated or similar). After that, depending on the taint and eviction policy, pods may be evicted after some time. When the pods are evicted, the volume detach operation again fails at first, because the controller cannot determine whether the mounts still exist.

Now with a "shutdown" node taint, the scenario above stays the same: if the node reboots very quickly (within about 30 seconds), nothing happens. If the node reboots after a couple of minutes, then depending on the cloud provider and the timing of the reboot, pods may be garbage collected or evicted. With the taint, the detach operation can be triggered immediately, because the controller knows the mounts are gone and it is safe to detach.

So my point is that the taint will not change the current pod-deletion behavior; it only helps the volume controller determine whether it is safe to detach the volumes.

@yastij
Member

yastij commented Jan 26, 2018

@jingxu97 - ok, I'm fine with that. So basically what we have to do is:

  1. Create the shutdown taint
  2. Have the cloud/node controller taint nodes that have been shut down or terminated (rough sketch below)
  3. Make the volume controller look for the shutdown taint and treat the attached volumes as unmounted (safe to detach)

Will this be transparent to CSI, or should it watch for the taint and issue the corresponding call to the plugin?
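A rough sketch of step 2, assuming a cloud-provider hook such as `InstanceShutdownByProviderID` (hypothetical here; the exact hook did not exist at the time) and the placeholder taint key used above:

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

const shutdownTaintKey = "node.cloudprovider.kubernetes.io/shutdown"

// shutdownChecker stands in for whatever cloud-provider hook ends up
// reporting "this instance exists but is powered off" (hypothetical).
type shutdownChecker interface {
	InstanceShutdownByProviderID(ctx context.Context, providerID string) (bool, error)
}

// maybeTaintShutdownNode sketches step 2: when the cloud provider reports the
// instance as shut down, add the shutdown taint before (or instead of)
// deleting the node object.
func maybeTaintShutdownNode(ctx context.Context, cloud shutdownChecker, node *v1.Node) (bool, error) {
	shutdown, err := cloud.InstanceShutdownByProviderID(ctx, node.Spec.ProviderID)
	if err != nil || !shutdown {
		return false, err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == shutdownTaintKey {
			return false, nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    shutdownTaintKey,
		Effect: v1.TaintEffectNoSchedule,
	})
	// The caller would then persist the updated node via the API server.
	return true, nil
}
```

Whether the node object is also deleted afterwards can stay provider-specific, as discussed above; the important part is that the taint lands before any deletion.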

@jingxu97
Contributor Author

This will be transparent to CSI. Only the volume controller needs some changes to check the taint so that the detach is issued without delay.

@yastij
Member

yastij commented Jan 26, 2018

@jingxu97 - SGTM

@jingxu97 jingxu97 added this to the v1.10 milestone Jan 29, 2018
@jingxu97 jingxu97 changed the title Propose to taint node "stopped" condition Propose to taint node "shutdown" condition Feb 1, 2018
@jberkus

jberkus commented Feb 2, 2018

@jingxu97 if this change is going to be in 1.10, please add kind/ and priority/ labels to it. Also, this looks remarkably like a feature which wasn't submitted for feature freeze. Please correct, thanks!

@jingxu97
Contributor Author

jingxu97 commented Feb 2, 2018

We are trying to find some help implementing this. Can I still submit a feature for this? Thanks!

@yastij
Member

yastij commented Feb 2, 2018

I can help with this

@jingxu97
Contributor Author

jingxu97 commented Feb 2, 2018

@yastij that's great! Please let me know if you need anything. As @jberkus mentioned, I think we might need to submit a feature.

@yastij
Member

yastij commented Feb 2, 2018

@jingxu97 - I can submit it tomorrow and start working on a PR. Is it targeting 1.10?

@jingxu97
Contributor Author

jingxu97 commented Feb 2, 2018

Yes, we want to target 1.10.

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 28, 2018
@yastij
Member

yastij commented Dec 28, 2018

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 28, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 28, 2019
@yastij
Member

yastij commented Mar 28, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 28, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2019
@yastij
Member

yastij commented Jun 26, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 24, 2019
@cofyc
Member

cofyc commented Oct 24, 2019

/remove-lifecycle stale

@cofyc
Member

cofyc commented Oct 24, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 24, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
