Propose to taint node "shutdown" condition #58635

Closed
jingxu97 opened this issue Jan 22, 2018 · 58 comments · Fixed by #59323 or #60009
Labels
area/cloudprovider area/node-lifecycle Issues or PRs related to Node lifecycle lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. milestone/incomplete-labels sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@jingxu97
Contributor

jingxu97 commented Jan 22, 2018

Today there is only one node condition to represent whether the kubelet is healthy (i.e., updating node status). The node controller marks a node as "NotReady" after the kubelet stops updating for a while. The kubelet may stop responding because:

  1. The network is disconnected
  2. The node is shut down (stopped) or paused
  3. The kubelet has been killed (cannot respond)

Problem

In many cases, when a node becomes "NotReady", its pods are evicted and new pods are started on other nodes (failover for high availability). However, if the node was shut down, the volumes attached to it remain attached. The volume controller should try to detach those volumes when the pods are killed on the old node, but there is a safety check before detaching that verifies the volume is no longer mounted. This check relies on the VolumesInUse field in the node status, which is updated by the kubelet. When the node is shut down, the kubelet no longer updates this field, so the safety check fails and the controller assumes the volume cannot be detached because mounts may still exist. The new pods then stay pending for a while because the volume cannot be attached to the other node, even though in reality the volume could be detached in this situation. (Note: some cloud providers treat a stopped node as non-existent, so the node controller deletes the node API object. Detach still does not happen, because of the mount safety check. Ref #46442)
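As a rough illustration of that safety check (a minimal sketch, not the actual attach/detach controller code; the helper name is made up), the controller only treats a volume as detachable if it is absent from the node's status.volumesInUse list, which the kubelet stops refreshing once the node is shut down:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// safeToDetach is a hypothetical helper: a volume is treated as safe to
// detach only if it no longer appears in the node's status.volumesInUse
// list, which is maintained by the kubelet.
func safeToDetach(node *v1.Node, volumeName v1.UniqueVolumeName) bool {
	for _, inUse := range node.Status.VolumesInUse {
		if inUse == volumeName {
			// Still reported as mounted (or the kubelet simply stopped
			// reporting), so the controller refuses to detach.
			return false
		}
	}
	return true
}

func main() {
	node := &v1.Node{
		Status: v1.NodeStatus{
			VolumesInUse: []v1.UniqueVolumeName{"kubernetes.io/gce-pd/my-disk"},
		},
	}
	// Prints "false": the shut-down node never clears the entry, so the
	// detach stays blocked even though the mount is actually gone.
	fmt.Println(safeToDetach(node, "kubernetes.io/gce-pd/my-disk"))
}
```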

Proposal

We propose to taint the node with a "Stopped" condition. When the kubelet stops updating, the node controller checks the cloud provider to see whether the node is in a "stopped" state (note: different cloud providers may use different terms for this state, e.g., terminated). When the volume controller receives a node update/delete event, it can check the node's taints and mark the volumes attached to that node as safe to detach (i.e., the volumes can be removed from the VolumesInUse list, which indicates which volumes are mounted). When a detach is then issued, the operation can proceed without delay.
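As a minimal sketch of what the taint and the volume controller's check might look like (the taint key `node.cloudprovider.kubernetes.io/shutdown` and the helper name are placeholders for illustration, since the final name was still under discussion in this thread):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Placeholder key/effect; the final taint name was not decided here.
var shutdownTaint = v1.Taint{
	Key:    "node.cloudprovider.kubernetes.io/shutdown",
	Effect: v1.TaintEffectNoSchedule,
}

// hasShutdownTaint is a hypothetical helper the volume controller could call
// when it receives a node update/delete event.
func hasShutdownTaint(node *v1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == shutdownTaint.Key {
			return true
		}
	}
	return false
}

func main() {
	node := &v1.Node{Spec: v1.NodeSpec{Taints: []v1.Taint{shutdownTaint}}}
	fmt.Println(hasShutdownTaint(node)) // true: volumes can be detached safely
}
```

From the command line this would look roughly like `kubectl taint nodes <node> node.cloudprovider.kubernetes.io/shutdown:NoSchedule`, though in this proposal the node controller would apply the taint automatically.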

To determine the "stopped" condition, I lists a few cloud provider's explanation of stopping an instance.

GCE

  • Stopping an instance causes Compute Engine to send the ACPI Power Off signal to the instance. Modern guest operating systems are configured to perform a clean shutdown before powering off in response to the power off signal. Compute Engine waits a short time for the guest to finish shutting down and then transitions the instance to the TERMINATED state.

  • You can stop an instance temporarily so you can come back to it at a later time. A stopped instance does not incur charges, but all of the resources that are attached to the instance will still be charged. Alternatively, if you are done using an instance, delete the instance and its resources to stop incurring charges.

  • You can permanently delete an instance to remove the instance and the associated resources from your project. If the instance is part of an instance group, the group might try to recreate the instance to maintain a certain group size.

AWS

  • You can only stop an Amazon EBS-backed instance. The instance performs a normal shutdown and stops running; its status changes to stopping and then stopped. Any Amazon EBS volumes remain attached to the instance, and their data persists. In most cases, the instance is migrated to a new underlying host computer when it's started.

  • You can delete your instance when you no longer need it. This is referred to as terminating your instance. When an instance terminates, the data on any instance store volumes associated with that instance is deleted. If your instance is in an Auto Scaling group, the Auto Scaling service marks the stopped instance as unhealthy, and may terminate it and launch a replacement instance.

OpenStack

  • Admins can pause and unpause a Nova compute instance. When an instance is paused, the entire state of the instance is kept in RAM. Pausing an instance will disable access to that instance, but won't free up any of its resources. Another option is to suspend, and then resume, an instance. Like paused OpenStack instances, a suspended instance keeps its current state, but it is written to storage.

  • A third option is to shelve OpenStack instances. A shelved instance is actually shut down, which is not the case for suspended or paused instances. If admins decide they no longer need a shelved instance, they can remove it, which ensures that it doesn't keep any hypervisor-level resources in use.

  • The last option is to stop an instance in Nova, which will disconnect all of its associated resources. This means admins can't restore a stopped instance to its previous state. This option is only useful for OpenStack instances that an organization no longer needs. In all other cases, admins should shelve, suspend or pause the instance.

So basically a "stopped" node means that it is shut down like an operating system shut down situation. All mounts are gone when machine is shut down.

cc @yujuhong @dchen1107 @Random-Liu

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 22, 2018
@jingxu97 jingxu97 added area/node-lifecycle Issues or PRs related to Node lifecycle sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 22, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 22, 2018
@jingxu97 jingxu97 added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Jan 22, 2018
@k82cn
Member

k82cn commented Jan 23, 2018

xref #49798

@yujuhong
Contributor

I talked to @bsalamat and @dchen1107 and I think it's advised to use taints instead of conditions (ref: #57690).

That said, I find the definition of "stopped" a bit vague. @jingxu97, could you define what "stopped" means, whether a node can recover from that state, and, if it can, what we expect the state of the recovered node to be? Also, it would be nice to see how "stopped" would be interpreted by different cloud providers while adhering to that definition.

@zetaab
Member

zetaab commented Jan 24, 2018

@yujuhong "stopped" means, for example, a virtual machine in OpenStack that is stopped (not currently running). This case is not supported at all today: if we stop an instance, the node is deleted from Kubernetes, but its volumes are still attached.

The same thing exists at least in AWS, where the instance state can be "stopped". The node is automatically deleted from the kube cluster, but the volumes are not detached immediately.

@jingxu97 jingxu97 changed the title Propose to add "stopped" condition for NodeConditionType Propose to taint node "stopped" condition Jan 24, 2018
@jingxu97
Contributor Author

@yujuhong @bsalamat and @dchen1107 I updated the proposal based on the feedback. PTAL Thanks!

@yastij
Member

yastij commented Jan 24, 2018

@jingxu97 @yujuhong +1, taints are the way to go, as work is being done to move our internal logic to taints instead of conditions.

Also looping in @augabet, @etiennecoutaud and @jhorwit2 for the wg-cloudprovider side.

@bsalamat
Member

There are some ambiguities here for me. To clarify things and see if my understanding is correct, let's look at the different scenarios one at a time (I use "node" and the node's underlying VM interchangeably here):

  1. Node has network issues and cannot be reached: this is recoverable, and I don't think "stopped" would be the right taint for it. It is, after all, not stopped; it is just not responding to network requests. Our control plane can check with the cloud provider API whether the node's underlying VM is still running.
  2. Kubelet has died or is not responding: this is more or less similar to the case above. I don't think "stopped" is the right taint.
  3. Node is paused: I think a "stopped" (or "paused") taint would be appropriate here.
  4. Node is shut down: IIUC, when a node's underlying VM is terminated, the node is deleted from Kubernetes. So there is nothing to taint, and we must detach the volumes. Am I missing something?

So, among the four cases, one of them looks to me like a candidate for a "paused" taint.

@jingxu97
Contributor Author

@bsalamat Thanks for your comments. For scenarios 3 and 4, I think there is some confusion.
3. Node is paused (sleep/suspend): in this case the OS is not shut down, so the mounts still exist. That means we cannot detach the volumes, and we should not taint the node in this situation.
4. Node is shut down: when a node is shut down, it actually still exists in the cloud provider (e.g., GCE, AWS); its status is terminated/stopped. But as issue #46442 shows, the cloud provider code in Kubernetes handles this situation differently: some providers (GCE) will not delete the node from Kubernetes, while others will. We want to taint the node even if it will be deleted, so that the volume controller can check the node taint and know whether the mounts still exist. When a node is deleted, detach is triggered, but the problem is that we don't know whether mounts still exist, so detach is not issued until some timeout expires. If the node is tainted with "shutdown/stopped", we can be sure. It is fine for the node to be deleted as long as it is tainted first.
(Note: for GCE, if the node is managed by an instance group, the shut-down node will actually be deleted from the cloud provider and a new one will come up to replace it.)

Please let me know if you have questions about it. Thanks!

@bsalamat
Member

@jingxu97 Thanks for the clarification; I understand it better now. So the taint will be applied only in scenario 4, where the node is shut down, because the node is not deleted from the Kubernetes cluster on all cloud providers. This sounds good to me. Maybe we should call the taint "shutdown" then?

@yujuhong
Contributor

I've already seen many different operations like "suspend", "pause", "stop", "shelve"...
Regardless of what the final name of the taint is, I think it's fair to say that the taint should clearly indicate that the node is in a shut-down state (e.g., with its volumes unmounted).

@yastij
Member

yastij commented Jan 26, 2018

@jingxu97 @bsalamat @yujuhong - before moving on with this, we should clarify how to handle quick-reboot use cases, where users rely on the fact that volumes are still attached.

One solution is to add a flag for this so we don't break the expected behaviour.

@jingxu97
Contributor Author

@yastij For the quick-reboot case, I think the following happens:

  1. After the node has been shut down for about half a minute, the node controller asks the cloud provider whether the node still exists. If the node has already restarted by that point, the node controller does nothing (other than updating the node status).
  2. If the node is not yet up when the node controller checks the cloud provider, then today, on some cloud providers (AWS, etc.), the node object is deleted from the API server because the node is not in a running state. Pods may soon be garbage collected because the node no longer exists (again, if the node comes back soon, the pod garbage collector might not delete the pods). When the pods are deleted, the volume controller tries to detach the volumes, but this fails because it assumes the mounts still exist. On other cloud providers (GCE, etc.), the node object is not deleted as long as the node still exists in the cloud provider (its state changes to terminated or similar). After that, depending on the taint and eviction policy, pods may be evicted after some time. When the pods are evicted, the volume detach operation again fails at first, because the controller cannot determine whether the mounts still exist.

Now with a "shutdown" node taint, the scenario above stays the same: if the node reboots very quickly (within about 30 seconds), nothing happens. If the node reboots after a couple of minutes, then depending on the cloud provider and the timing of the reboot, pods may be garbage collected or evicted. With the taint, the detach operation can be triggered immediately, because the controller knows the mounts are gone and it is safe to detach.

So my point is that the taint will not change the current pod-deletion behavior; it only helps the volume controller determine whether it is safe to detach the volumes.

@yastij
Member

yastij commented Jan 26, 2018

@jingxu97 - ok, I'm fine with that. So basically what we have to do is:

  1. Create the shutdown taint
  2. Have the cloud/node controller taint nodes that have been shut down or terminated (rough sketch below)
  3. Make the volume controller look for the shutdown taint and treat the attached volumes as unmounted (safe to detach)

Will this be transparent to CSI, or should it watch for the taint and issue the corresponding call to the plugin?
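A rough sketch of step 2, assuming a cloud-provider hook such as `InstanceShutdownByProviderID` (hypothetical here; the exact hook did not exist at the time) and the placeholder taint key used above:

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

const shutdownTaintKey = "node.cloudprovider.kubernetes.io/shutdown"

// shutdownChecker stands in for whatever cloud-provider hook ends up
// reporting "this instance exists but is powered off" (hypothetical).
type shutdownChecker interface {
	InstanceShutdownByProviderID(ctx context.Context, providerID string) (bool, error)
}

// maybeTaintShutdownNode sketches step 2: when the cloud provider reports the
// instance as shut down, add the shutdown taint before (or instead of)
// deleting the node object.
func maybeTaintShutdownNode(ctx context.Context, cloud shutdownChecker, node *v1.Node) (bool, error) {
	shutdown, err := cloud.InstanceShutdownByProviderID(ctx, node.Spec.ProviderID)
	if err != nil || !shutdown {
		return false, err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == shutdownTaintKey {
			return false, nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    shutdownTaintKey,
		Effect: v1.TaintEffectNoSchedule,
	})
	// The caller would then persist the updated node via the API server.
	return true, nil
}
```

Whether the node object is also deleted afterwards can stay provider-specific, as discussed above; the important part is that the taint lands before any deletion.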

@jingxu97
Contributor Author

This will be transparent to CSI. Only the volume controller needs some changes to check the taint so that the detach is issued without delay.

@yastij
Member

yastij commented Jan 26, 2018

@jingxu97 - SGTM

@jingxu97 jingxu97 added this to the v1.10 milestone Jan 29, 2018
@jingxu97 jingxu97 changed the title Propose to taint node "stopped" condition Propose to taint node "shutdown" condition Feb 1, 2018
@jberkus

jberkus commented Feb 2, 2018

@jingxu97 if this change is going to be in 1.10, please add kind/ and priority/ labels to it. Also, this looks remarkably like a feature which wasn't submitted for feature freeze. Please correct, thanks!

@jingxu97
Contributor Author

jingxu97 commented Feb 2, 2018

We are trying to find some help implementing this. Can I still submit a feature for this? Thanks!

@yastij
Member

yastij commented Feb 2, 2018

I can help with this

@jingxu97
Contributor Author

jingxu97 commented Feb 2, 2018

@yastij that's great! Please let me know if you need anything. As @jberkus mentioned, I think we might need to submit a feature.

@yastij
Member

yastij commented Feb 2, 2018

@jingxu97 - I can submit it tomorrow and start working on a PR. Is it targeting 1.10?

@jingxu97
Contributor Author

jingxu97 commented Feb 2, 2018

Yes, we want to target 1.10.

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 28, 2018
@yastij
Member

yastij commented Dec 28, 2018

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 28, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 28, 2019
@yastij
Member

yastij commented Mar 28, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 28, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2019
@yastij
Member

yastij commented Jun 26, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 24, 2019
@cofyc
Member

cofyc commented Oct 24, 2019

/remove-lifecycle stale

@cofyc
Member

cofyc commented Oct 24, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 24, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
