Node reboot prevents volume from being marked as detached normally in GKE #52735

Closed
jingxu97 opened this issue Sep 19, 2017 · 6 comments
Labels
kind/bug, lifecycle/rotten, milestone/removed, sig/storage

Comments

@jingxu97
Contributor

In our current GKE environment, when an instance is stopped, crashes, or is deleted, a new instance is automatically recreated so it can resume its processing tasks. Because the node is deleted from the cloud provider, all volumes attached to the instance are released (detached) automatically by the cloud provider. The Node object is then deleted from the API server by the node controller once it detects that the node no longer exists in the cloud provider. Pods running on that node are considered orphaned and get deleted too. The volume manager has a control loop that periodically checks whether each volume is still attached to the node.
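For illustration, here is a minimal Go sketch of that kind of periodic check, using made-up names rather than the real attach/detach controller code: the controller asks the cloud provider whether each volume is still attached and marks missing ones as detached in its actual state.

```go
package main

import "fmt"

// attachedVolumes stands in for a cloud provider query; for a deleted GCE
// instance every volume comes back as no longer attached.
func attachedVolumes(node string, volumes []string) map[string]bool {
	out := map[string]bool{}
	for _, v := range volumes {
		out[v] = false // the instance is gone, so nothing is attached any more
	}
	return out
}

func main() {
	// actual state: volume -> node it is believed to be attached to
	actualState := map[string]string{"pd-1": "node-a"}

	// one pass of the periodic check: anything the provider no longer
	// reports as attached is marked detached in the actual state
	for vol, stillAttached := range attachedVolumes("node-a", []string{"pd-1"}) {
		if !stillAttached {
			delete(actualState, vol)
			fmt.Println(vol, "marked as detached in actual state")
		}
	}
}
```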

In this scenario, where the node is deleted before its pods are deleted, the current volume manager does not handle things well and causes unnecessary delays or even failures for volume operations. The following sequence of events happens:

  1. The node controller notices that the kubelet is no longer updating the node status, verifies that the node no longer exists in the cloud provider, and deletes the node object. (At this moment the volumes attached to that node are no longer attached, and their mounts are gone as well.)
  2. The attach_detach controller receives the node delete event, but it does not update the desired state because the node still has volumes attached. Around the same time, the reconciler verifies that the volume is no longer attached to the node and marks it as detached in the actual state.
  3. Comparing desired and actual state, the reconciler decides to attach the volume back to the node. Once the new instance is up (which can take only a few seconds), the reconciler attaches the volume back to the node and records it in the actual state as attached and mounted (by default the first attach operation is assumed to be mounted): https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/cache/actual_state_of_world.go#L336
  4. The garbage collector deletes the pods on that node, so the desired state is updated to remove the volumes that should be attached to the node. The reconciler tries to detach, but fails the safety check because it assumes the volume is still mounted (see the sketch after this list).
  5. The pod is recreated and scheduled to a different node (node2); the reconciler fails to attach the volume to node2 because it is still attached to the deleted node.
  6. Only after the 6-minute timeout does the reconciler decide to detach the volume, and finally the volume can be attached to node2.
    Note that around or after step 4 the new node is added back (normally a few minutes after the node is deleted) and the volume is reported as not mounted via the VolumesInUse field. At that stage the reconciler could pass the safety check and trigger the detach. However, it might still be delayed unless the next issue (6) is also solved.
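To make steps 3 and 4 concrete, here is a minimal sketch with made-up types (not the real actual_state_of_world code): the first attach records the volume as mounted by the node, and detach is refused while that flag is set.

```go
package main

import "fmt"

// volumeState is a toy stand-in for an entry in the controller's actual state.
type volumeState struct {
	attachedTo    string
	mountedByNode bool // assumed true on the first attach (step 3)
}

type actualState map[string]*volumeState

// markAttached records a successful attach; note that the volume is also
// assumed to be mounted, even though no kubelet has mounted it yet.
func (a actualState) markAttached(vol, node string) {
	a[vol] = &volumeState{attachedTo: node, mountedByNode: true}
}

// safeToDetach mirrors the safety check in step 4: detaching is refused
// while the volume is still believed to be mounted on the node.
func (a actualState) safeToDetach(vol string) bool {
	s, ok := a[vol]
	return ok && !s.mountedByNode
}

func main() {
	as := actualState{}
	as.markAttached("pd-1", "deleted-node")                 // step 3
	fmt.Println("safe to detach:", as.safeToDetach("pd-1")) // false: step 4 fails, wait for the 6-minute timeout
}
```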

The main problem here is that when the node is deleted before the pods are deleted, the desired state is not updated to reflect this important change, so the reconciler's behavior diverges from what the real world wants. It attaches the volume back to the deleted node, and because of the design this in turn makes it very difficult to detach the volume from the deleted node quickly.

Proposed solution:
A node delete event removes the node's volumes from the desired state. When the node is added back, the controller lists all pods and updates the desired state again. (This covers the case where pods are not garbage collected because the node object is quickly re-added to the API server.) These steps make sure the desired state reflects that the volumes should not be attached to the deleted node. A sketch of this handling is below.
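A minimal sketch of the proposed handling, using illustrative types rather than the real controller code: a node delete event clears the node's entry in the desired state, and a node add event rebuilds it from the pods listed for that node.

```go
package main

import "fmt"

// desiredState maps a node name to the set of volumes that should be attached to it.
type desiredState map[string]map[string]bool

// onNodeDelete drops everything desired for the deleted node, so the
// reconciler no longer tries to re-attach volumes to it.
func (d desiredState) onNodeDelete(node string) {
	delete(d, node)
}

// onNodeAdd rebuilds the node's entry from the pods the API server still
// lists for it, covering the case where the node object is re-created
// before its pods are garbage collected.
func (d desiredState) onNodeAdd(node string, podVolumes map[string][]string) {
	vols := map[string]bool{}
	for _, vs := range podVolumes {
		for _, v := range vs {
			vols[v] = true
		}
	}
	d[node] = vols
}

func main() {
	ds := desiredState{"node-a": {"pd-1": true}}
	ds.onNodeDelete("node-a") // node deleted from the cloud provider
	fmt.Println(ds)           // map[] : nothing should be attached any more

	// node object re-created before the old pods were garbage collected
	ds.onNodeAdd("node-a", map[string][]string{"pod-1": {"pd-1"}})
	fmt.Println(ds) // map[node-a:map[pd-1:true]]
}
```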

@k8s-github-robot added the needs-sig label Sep 19, 2017
@jingxu97 added the sig/storage and kind/bug labels Sep 19, 2017
@k8s-github-robot removed the needs-sig label Sep 19, 2017
@jingxu97 added this to the v1.9 milestone Sep 19, 2017
@gnufied
Member

gnufied commented Sep 22, 2017

Thanks for writing this up, @jingxu97. I also wanted to add that the cloudprovider VerifiedVolumeAttached check is not needed on AWS/EBS, and I think we should either disable it or make it a no-op.

The reason the original check was added is that on some cloud providers shutting down or rebooting a node can cause attached volumes to be detached. That doesn't happen on AWS: only termination causes a volume detach, and in that case it is pointless to try to attach those volumes back, as you wrote above. A sketch of what a no-op check could look like is below.
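To illustrate the suggestion, here is a hypothetical sketch (not the real cloudprovider interface) of making the check a no-op for providers where a restart or shutdown can never detach volumes.

```go
package main

import "fmt"

// provider is a toy description of a cloud provider's detach behavior.
type provider struct {
	name                  string
	detachesOnNodeRestart bool // false for AWS/EBS: only termination detaches volumes
}

// verifyVolumesAttached is where the periodic check would query the cloud;
// making it a no-op for providers that never detach on restart/shutdown is
// the change suggested in this comment.
func verifyVolumesAttached(p provider, node string, volumes []string) {
	if !p.detachesOnNodeRestart {
		return // no-op: a reboot or shutdown cannot silently detach volumes here
	}
	fmt.Printf("querying %s for attachment state of %v on %s\n", p.name, volumes, node)
}

func main() {
	verifyVolumesAttached(provider{name: "aws", detachesOnNodeRestart: false}, "node-a", []string{"vol-1"})
	verifyVolumesAttached(provider{name: "other-cloud", detachesOnNodeRestart: true}, "node-a", []string{"vol-1"})
}
```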

@jingxu97
Contributor Author

jingxu97 commented Oct 6, 2017

priority/important-longterm

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Removed

@jingxu97

Important: This issue was missing labels required for the v1.9 milestone for more than 3 days:

priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 11, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 11, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
