Node reboot prevents volume from being marked as detached normally in GKE #52735

Closed
jingxu97 opened this issue Sep 19, 2017 · 6 comments
Labels
kind/bug, lifecycle/rotten, milestone/removed, sig/storage

Comments

@jingxu97
Contributor

In our current GKE environment, when an instance is stopped, crashes, or is deleted, a new instance is automatically recreated so it can resume its processing tasks. Because the node is deleted from the cloud provider, all volumes attached to the instance are released (detached) automatically by the cloud provider. The Node object is then deleted from the API server by the node controller once it detects that the node no longer exists in the cloud provider. Pods running on that node are considered orphaned and get deleted too. The volume manager has a control loop that periodically checks whether each volume is still attached to the node.
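For illustration, here is a minimal Go sketch of that kind of periodic check, using made-up names rather than the real attach/detach controller code: the controller asks the cloud provider whether each volume is still attached and marks missing ones as detached in its actual state.

```go
package main

import "fmt"

// attachedVolumes stands in for a cloud provider query; for a deleted GCE
// instance every volume comes back as no longer attached.
func attachedVolumes(node string, volumes []string) map[string]bool {
	out := map[string]bool{}
	for _, v := range volumes {
		out[v] = false // the instance is gone, so nothing is attached any more
	}
	return out
}

func main() {
	// actual state: volume -> node it is believed to be attached to
	actualState := map[string]string{"pd-1": "node-a"}

	// one pass of the periodic check: anything the provider no longer
	// reports as attached is marked detached in the actual state
	for vol, stillAttached := range attachedVolumes("node-a", []string{"pd-1"}) {
		if !stillAttached {
			delete(actualState, vol)
			fmt.Println(vol, "marked as detached in actual state")
		}
	}
}
```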

In this scenario, where the node is deleted before its pods are deleted, the current volume manager does not handle things well and causes unnecessary delays or even failures for volume operations. The following sequence of events happens:

  1. The node controller notices that the kubelet is no longer updating the node status, verifies that the node no longer exists in the cloud provider, and deletes the node object. (At this moment the volumes attached to that node are no longer attached, and their mounts are gone as well.)
  2. The attach_detach controller receives the node delete event, but it does not update the desired state because the node still has volumes attached. Around the same time, the reconciler verifies that the volume is no longer attached to the node and marks it as detached in the actual state.
  3. Comparing desired and actual state, the reconciler decides to attach the volume back to the node. Once the new instance is up (which can take only a few seconds), the reconciler attaches the volume back to the node and records it in the actual state as attached and mounted (by default the first attach operation is assumed to be mounted): https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/cache/actual_state_of_world.go#L336
  4. The garbage collector deletes the pods on that node, so the desired state is updated to remove the volumes that should be attached to the node. The reconciler tries to detach, but fails the safety check because it assumes the volume is still mounted (see the sketch after this list).
  5. The pod is recreated and scheduled to a different node (node2); the reconciler fails to attach the volume to node2 because it is still attached to the deleted node.
  6. Only after the 6-minute timeout does the reconciler decide to detach the volume, and finally the volume can be attached to node2.
    Note that around or after step 4 the new node is added back (normally a few minutes after the node is deleted) and the volume is reported as not mounted via the VolumesInUse field. At that stage the reconciler could pass the safety check and trigger the detach. However, it might still be delayed unless the next issue (6) is also solved.
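To make steps 3 and 4 concrete, here is a minimal sketch with made-up types (not the real actual_state_of_world code): the first attach records the volume as mounted by the node, and detach is refused while that flag is set.

```go
package main

import "fmt"

// volumeState is a toy stand-in for an entry in the controller's actual state.
type volumeState struct {
	attachedTo    string
	mountedByNode bool // assumed true on the first attach (step 3)
}

type actualState map[string]*volumeState

// markAttached records a successful attach; note that the volume is also
// assumed to be mounted, even though no kubelet has mounted it yet.
func (a actualState) markAttached(vol, node string) {
	a[vol] = &volumeState{attachedTo: node, mountedByNode: true}
}

// safeToDetach mirrors the safety check in step 4: detaching is refused
// while the volume is still believed to be mounted on the node.
func (a actualState) safeToDetach(vol string) bool {
	s, ok := a[vol]
	return ok && !s.mountedByNode
}

func main() {
	as := actualState{}
	as.markAttached("pd-1", "deleted-node")                 // step 3
	fmt.Println("safe to detach:", as.safeToDetach("pd-1")) // false: step 4 fails, wait for the 6-minute timeout
}
```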

The main problem here is that when the node is deleted before the pods are deleted, the desired state is not updated to reflect this important change, so the reconciler's behavior diverges from what the real world wants. It attaches the volume back to the deleted node, and because of the design this in turn makes it very difficult to detach the volume from the deleted node quickly.

Proposed solution:
A node delete event removes the node's volumes from the desired state. When the node is added back, the controller lists all pods and updates the desired state again. (This covers the case where pods are not garbage collected because the node object is quickly re-added to the API server.) These steps make sure the desired state reflects that the volumes should not be attached to the deleted node. A sketch of this handling is below.
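A minimal sketch of the proposed handling, using illustrative types rather than the real controller code: a node delete event clears the node's entry in the desired state, and a node add event rebuilds it from the pods listed for that node.

```go
package main

import "fmt"

// desiredState maps a node name to the set of volumes that should be attached to it.
type desiredState map[string]map[string]bool

// onNodeDelete drops everything desired for the deleted node, so the
// reconciler no longer tries to re-attach volumes to it.
func (d desiredState) onNodeDelete(node string) {
	delete(d, node)
}

// onNodeAdd rebuilds the node's entry from the pods the API server still
// lists for it, covering the case where the node object is re-created
// before its pods are garbage collected.
func (d desiredState) onNodeAdd(node string, podVolumes map[string][]string) {
	vols := map[string]bool{}
	for _, vs := range podVolumes {
		for _, v := range vs {
			vols[v] = true
		}
	}
	d[node] = vols
}

func main() {
	ds := desiredState{"node-a": {"pd-1": true}}
	ds.onNodeDelete("node-a") // node deleted from the cloud provider
	fmt.Println(ds)           // map[] : nothing should be attached any more

	// node object re-created before the old pods were garbage collected
	ds.onNodeAdd("node-a", map[string][]string{"pod-1": {"pd-1"}})
	fmt.Println(ds) // map[node-a:map[pd-1:true]]
}
```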

@k8s-github-robot added the needs-sig label Sep 19, 2017
@jingxu97 added the sig/storage and kind/bug labels Sep 19, 2017
@k8s-github-robot removed the needs-sig label Sep 19, 2017
@jingxu97 added this to the v1.9 milestone Sep 19, 2017
@gnufied
Member

gnufied commented Sep 22, 2017

Thanks for writing this up, @jingxu97. I also wanted to add that the cloudprovider VerifiedVolumeAttached check is not needed on AWS/EBS, and I think we should either disable it or make it a no-op.

The reason the original check was added is that on some cloud providers shutting down or rebooting a node can cause attached volumes to be detached. That doesn't happen on AWS: only termination causes a volume detach, and in that case it is pointless to try to attach those volumes back, as you wrote above. A sketch of what a no-op check could look like is below.
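To illustrate the suggestion, here is a hypothetical sketch (not the real cloudprovider interface) of making the check a no-op for providers where a restart or shutdown can never detach volumes.

```go
package main

import "fmt"

// provider is a toy description of a cloud provider's detach behavior.
type provider struct {
	name                  string
	detachesOnNodeRestart bool // false for AWS/EBS: only termination detaches volumes
}

// verifyVolumesAttached is where the periodic check would query the cloud;
// making it a no-op for providers that never detach on restart/shutdown is
// the change suggested in this comment.
func verifyVolumesAttached(p provider, node string, volumes []string) {
	if !p.detachesOnNodeRestart {
		return // no-op: a reboot or shutdown cannot silently detach volumes here
	}
	fmt.Printf("querying %s for attachment state of %v on %s\n", p.name, volumes, node)
}

func main() {
	verifyVolumesAttached(provider{name: "aws", detachesOnNodeRestart: false}, "node-a", []string{"vol-1"})
	verifyVolumesAttached(provider{name: "other-cloud", detachesOnNodeRestart: true}, "node-a", []string{"vol-1"})
}
```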

@jingxu97
Contributor Author

jingxu97 commented Oct 6, 2017

priority/important-longterm

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Removed

@jingxu97

Important: This issue was missing labels required for the v1.9 milestone for more than 3 days:

priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 11, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 11, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
