Propose to taint node "shutdown" condition #58635
Comments
xref #49798
I talked to @bsalamat and @dchen1107, and I think it's advised to use taints instead of conditions (ref: #57690). That said, I find the definition of "stopped" a bit vague. @jingxu97, could you define what "stopped" means, whether a node can recover from the state, and if it can, what we expect the state of the recovered node to be? Also, it'd be nice to see how "stopped" would be interpreted in different cloud providers while adhering to the definition.
@yujuhong "stopped" means, for instance, a virtual machine in OpenStack that is stopped (not currently running). Today this is not supported at all: if we stop an instance, the node is deleted from Kubernetes, but its volumes are still attached. The same thing exists at least in AWS, where the instance state might be "stopped"; the instance is automatically deleted from the kube cluster, but its volumes are not detached immediately.
@yujuhong @bsalamat and @dchen1107 I updated the proposal based on the feedback. PTAL Thanks!
@jingxu97 @yujuhong +1, the way to go is taints, as work is being done to move our internal logic to taints instead of conditions. Also looping in @augabet @etiennecoutaud and @jhorwit2 for the wg-cloudprovider side.
There are some ambiguities here for me. To clarify things and see if my understanding is correct, let's look at different scenarios one at a time (I use node and the underlying VM of the node interchangeably here):
So, among the four cases, one of them looks to me like a candidate for a "paused" taint.
@bsalamat Thanks for your comments. For scenarios 3 and 4, I think there is some confusion. Please let me know if you have questions about it. Thanks!
I've already seen many different operations like "suspend", "pause", "stop", "shelve"...
@yastij For the quick-reboot case, the following will happen, I think.
Now with the node taint "shutdown" state, the above scenario stays the same: if the node is rebooted very quickly (within 30 seconds or so), nothing will happen. If the node is rebooted after a couple of minutes, then depending on the cloud provider and the timing of the reboot, pods might be garbage collected/evicted. With the node taint, the detach operation can be triggered because the controller knows the mounts are gone and it is safe to detach. So my point is that the taint will not change the current behavior of pod deletion; it helps the volume controller determine whether it can safely detach volumes. (A sketch of what applying such a taint could look like follows.)
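For concreteness, a minimal sketch (in Go, against the core/v1 types) of what tainting a shut-down node could look like; the taint key and effect here are assumptions for illustration, not something this proposal has settled:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical taint key for a shut-down node; the real key and effect
// would be decided during implementation.
const taintNodeShutdown = "node.cloudprovider.kubernetes.io/shutdown"

// taintShutdownNode adds the shutdown taint to the node object unless it
// is already present, and reports whether the object was modified.
func taintShutdownNode(node *v1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == taintNodeShutdown {
			return false // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    taintNodeShutdown,
		Effect: v1.TaintEffectNoSchedule,
	})
	return true
}

func main() {
	node := &v1.Node{}
	fmt.Println(taintShutdownNode(node)) // true: taint added
	fmt.Println(taintShutdownNode(node)) // false: already present
}
```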
@jingxu97 - ok, I'm fine with that. So basically what we have to do:
Will this be transparent to CSI, or should it watch for the taint and issue the corresponding call to the plugin?
This will be transparent to CSI. Only the volume controller needs some changes to check the taint so that detach can be issued without delay.
@jingxu97 - SGTM
@jingxu97 if this change is going to be in 1.10, please add kind/ and priority/ labels to it. Also, this looks remarkably like a feature which wasn't submitted for feature freeze. Please correct, thanks!
We are trying to find some help implementing this. Can I still submit a feature for this? Thanks!
I can help with this
@jingxu97 - I can submit it tomorrow and start working on a PR, is it targeting 1.10?
Yes, we want to target 1.10.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
/remove-lifecycle rotten
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Today there is only one node condition to represent whether kubelet is healthy (i.e., updating node status). The node controller will mark a node "NotReady" after kubelet stops updating its status for a while. Kubelet might stop responding for several reasons, e.g., a network issue, a kubelet crash, or the node itself being shut down; the condition alone cannot distinguish these cases.
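For background, here is a minimal sketch of how that single Ready condition is read off a Node object using the core/v1 types; the helper is illustrative, not the node controller's actual code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// isNodeReady reports whether the node's Ready condition is True. A node
// whose kubelet has stopped posting status will eventually carry a Ready
// condition with status Unknown, set by the node controller.
func isNodeReady(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}

func main() {
	node := &v1.Node{
		Status: v1.NodeStatus{
			Conditions: []v1.NodeCondition{
				{Type: v1.NodeReady, Status: v1.ConditionUnknown},
			},
		},
	}
	fmt.Println(isNodeReady(node)) // false: kubelet stopped reporting
}
```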
Problem
In many cases, when a node becomes "NotReady", its pods will be evicted and new pods will be started on other nodes (failover for high availability). However, if the node was shut down, the volumes attached to it remain attached. The volume controller should try to detach those volumes once the pods are killed from the old node, but there is a safety check before detaching that verifies the volume is no longer mounted. This check uses the VolumesInUse field in node status, which is updated by kubelet. When a node is shut down, kubelet can no longer update this field, so the safety check fails and assumes the volume cannot be detached because of existing mounts. The new pods then stay pending for a while because the volume cannot be attached to the other node, even though in reality the volume could be detached in this situation. (Note: some cloud providers treat a stopped node as nonexistent, so the node controller will delete the node API object; but detach still does not happen, because of the mount safety check. Ref #46442)
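To make the safety check concrete, here is a minimal sketch of the mount check against VolumesInUse; the helper name and wiring are illustrative, not the actual attach/detach controller code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// volumeStillMounted mirrors the safety check described above: a volume is
// treated as still mounted if it appears in the node's VolumesInUse status,
// a field that only kubelet updates.
func volumeStillMounted(node *v1.Node, name v1.UniqueVolumeName) bool {
	for _, inUse := range node.Status.VolumesInUse {
		if inUse == name {
			return true // kubelet still reports a mount; detach is blocked
		}
	}
	return false
}

func main() {
	node := &v1.Node{
		Status: v1.NodeStatus{
			VolumesInUse: []v1.UniqueVolumeName{"kubernetes.io/gce-pd/pd-1"},
		},
	}
	// On a shut-down node this entry is stale, but kubelet is no longer
	// around to remove it, so detach stays blocked.
	fmt.Println(volumeStillMounted(node, "kubernetes.io/gce-pd/pd-1")) // true
}
```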
Proposal
We propose to taint the node with a "stopped" condition. When kubelet stops updating node status, the node controller checks the cloud provider to see whether the node is in the "stopped" state (note: different cloud providers might use different terms for this state, e.g., "terminated"). When the volume controller gets a node update/delete event, it can check the node's taints and mark the volumes attached to that node as safe to detach, i.e., remove them from the VolumesInUse list (whose entries indicate volumes that are still mounted). When detach is issued, the operation can then proceed without delay.
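A minimal sketch of the volume-controller side of this proposal, reusing the hypothetical shutdown taint key from the earlier sketch (none of these names are final):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical shutdown taint key, as in the earlier sketch.
const taintNodeShutdown = "node.cloudprovider.kubernetes.io/shutdown"

// nodeIsShutdown reports whether the node carries the proposed shutdown taint.
func nodeIsShutdown(node *v1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == taintNodeShutdown {
			return true
		}
	}
	return false
}

// safeToDetach sketches the proposed decision: detach immediately if the
// node is shut down (all mounts are gone), otherwise honor the existing
// VolumesInUse safety check.
func safeToDetach(node *v1.Node, name v1.UniqueVolumeName) bool {
	if nodeIsShutdown(node) {
		return true
	}
	for _, inUse := range node.Status.VolumesInUse {
		if inUse == name {
			return false
		}
	}
	return true
}

func main() {
	node := &v1.Node{}
	node.Spec.Taints = []v1.Taint{{Key: taintNodeShutdown, Effect: v1.TaintEffectNoSchedule}}
	node.Status.VolumesInUse = []v1.UniqueVolumeName{"kubernetes.io/gce-pd/pd-1"}
	// The stale VolumesInUse entry no longer blocks detach once the
	// shutdown taint is present.
	fmt.Println(safeToDetach(node, "kubernetes.io/gce-pd/pd-1")) // true
}
```

The point is that the taint gives the controller an authoritative signal that a stale VolumesInUse entry can be ignored.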
To determine the "stopped" condition, I list below a few cloud providers' explanations of stopping an instance.
GCE
Stopping an instance causes Compute Engine to send the ACPI Power Off signal to the instance. Modern guest operating systems are configured to perform a clean shutdown before powering off in response to the power off signal. Compute Engine waits a short time for the guest to finish shutting down and then transitions the instance to the TERMINATED state.
You can stop an instance temporarily so you can come back to it at a later time. A stopped instance does not incur charges, but all of the resources that are attached to the instance will still be charged. Alternatively, if you are done using an instance, delete the instance and its resources to stop incurring charges.
You can permanently delete an instance to remove the instance and the associated resources from your project. If the instance is part of an instance group, the group might try to recreate the instance to maintain a certain group size.
AWS
You can only stop an Amazon EBS-backed instance. The instance performs a normal shutdown and stops running; its status changes to stopping and then stopped. Any Amazon EBS volumes remain attached to the instance, and their data persists. In most cases, the instance is migrated to a new underlying host computer when it's started.
You can delete your instance when you no longer need it. This is referred to as terminating your instance. When an instance terminates, the data on any instance store volumes associated with that instance is deleted. If your instance is in an Auto Scaling group, the Auto Scaling service marks the stopped instance as unhealthy, and may terminate it and launch a replacement instance.
OpenStack
Admins can pause and unpause a Nova compute instance. When an instance is paused, the entire state of the instance is kept in RAM. Pausing an instance will disable access to that instance, but won't free up any of its resources. Another option is to suspend, and then resume, an instance. Like paused OpenStack instances, a suspended instance keeps its current state, but it is written to storage.
A third option is to shelve OpenStack instances. A shelved instance is actually shut down, which is not the case for suspended or paused instances. If admins decide they no longer need a shelved instance, they can remove it, which ensures that it doesn't maintain any hypervisor-level resources in use.
The last option is to stop an instance in Nova, which will disconnect all of its associated resources. This means admins can't restore a stopped instance to its previous state. This option is only useful for OpenStack instances that an organization no longer needs. In all other cases, admins should shelve, suspend or pause the instance.
So basically, a "stopped" node is one whose machine has been shut down, as in an operating-system shutdown. All mounts are gone when the machine is shut down.
cc @yujuhong @dchen1107 @Random-Liu