Attach/detach controller does not recover from missed pod deletion #34242

Closed
jsafrane opened this issue Oct 6, 2016 · 17 comments

Assignees
Labels
area/kubelet sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Milestone

Comments

@jsafrane
Member

jsafrane commented Oct 6, 2016

We run OpenShift in a master-slave setup and our master crashes once in a while (for an unrelated reason). When a new master starts, it does not detach volumes that should be detached.

Steps to reproduce on AWS with standard Kubernetes:

  1. run an AWS-aware cluster, hack/local-up-cluster.sh is fine
  2. create several pods that use claims that point to AWS PVs
  3. kill controller-manager process
  4. delete all pods
  5. start a new controller manager

Result: volumes stay attached forever (or at least for the next 30 minutes).
It should also be reproducible on GCE. Shouldn't there be a periodic sync that ensures the controller notices deleted pods? This comment looks scary: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L76

Affected version: kubernetes-1.3.8

@saad-ali @jingxu97 @kubernetes/sig-storage

@jsafrane added the sig/storage label Oct 6, 2016
@jingxu97
Contributor

jingxu97 commented Oct 6, 2016

Jan,

Yes, this is an issue that we haven't addressed: when the master controller restarts, how do we recover the state (which volumes are attached/detached)? If no pods are deleted during that time, the controller can recover by retrieving volume information from the pod objects in the API server, populating the desired state, and letting the reconciler recover the actual state afterwards. The challenge is that if pods are deleted from the API server during that window, the information about which volumes are still attached to the node is lost. One way to get this information is through the cloud provider, but after getting a list of attached volumes from the cloud provider, how do we decide which ones the controller should detach? It would be dangerous to detach volumes that are not supposed to be detached. The other approach is to checkpoint the state from the master controller. This needs more discussion...

Please let me know any suggestions/comments. Thank you!

Best,
Jing


@jsafrane
Member Author

jsafrane commented Oct 7, 2016

We could tag AWS EBS and Cinder volumes on attach with the name(s) of the pod(s) that use them and un-tag them on detach. I am not sure about GCE; there is PD.Description, where we put some JSON when dynamically creating the volume, and perhaps we could update that JSON on attach/detach. I don't know anything about Ceph RBD, which is getting attach/detach support soon.

In addition, on AWS we assign devices /dev/xvdb[a-z], /dev/xvdc[a-z] and so on to Kubernetes volumes, leaving /dev/xvd[a-z] and /dev/xvda[a-z] to the "system".

Still, the safest thing would be to save the attach/detach information somewhere in the API server, either as a separate object or inside Node.Spec or Node.Status.

@jsafrane
Member Author

jsafrane commented Dec 6, 2016

Returning to this bug with the newest kube-controller-manager and kubelet (almost 1.5), I noticed that if I restart the controller-manager, the node retains enough information about attached volumes:

  status:
    volumesAttached:
    - devicePath: /dev/xvdba
      name: kubernetes.io/aws-ebs/aws://us-east-1d/vol-4fc15dde

Could it be enough to detach these volumes when the controller restarts? I know there is a window where the volume is attached but the node status is not written yet; still, it would help in most cases.
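
A minimal sketch of that idea, with simplified stand-in types rather than the real Kubernetes API or attachdetach cache (AttachedVolume, Node, actualState and populateFromNodes here are illustrative only): on startup, walk the listed nodes and record everything from status.volumesAttached as attached, so the reconciler has something to compare the desired state against.

  package main

  import "fmt"

  // Simplified stand-ins for the node status fields shown above; these are
  // not the real Kubernetes API types.
  type AttachedVolume struct {
      Name       string // e.g. "kubernetes.io/aws-ebs/aws://us-east-1d/vol-4fc15dde"
      DevicePath string // e.g. "/dev/xvdba"
  }

  type Node struct {
      Name            string
      VolumesAttached []AttachedVolume // mirrors status.volumesAttached
  }

  // actualState is a stand-in for the controller's actual-state-of-world
  // cache, keyed by node name.
  type actualState map[string][]AttachedVolume

  // populateFromNodes records every volume the nodes still report as
  // attached. A later reconciler pass can then detach anything no remaining
  // pod needs, even if the pod deletion happened while the controller was down.
  func populateFromNodes(nodes []Node, asw actualState) {
      for _, node := range nodes {
          asw[node.Name] = append(asw[node.Name], node.VolumesAttached...)
      }
  }

  func main() {
      asw := actualState{}
      populateFromNodes([]Node{{
          Name: "node-1",
          VolumesAttached: []AttachedVolume{{
              Name:       "kubernetes.io/aws-ebs/aws://us-east-1d/vol-4fc15dde",
              DevicePath: "/dev/xvdba",
          }},
      }}, asw)
      fmt.Printf("%+v\n", asw)
  }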

@jingxu97
Contributor

jingxu97 commented Dec 6, 2016 via email

@rootfs
Contributor

rootfs commented Dec 6, 2016

@jsafrane @jingxu97 @rkouj
Would #37727 help? The controller should re-list the node status and repopulate the cache.

Separately, does the GKE controller upgrade see this issue?

@jingxu97
Contributor

jingxu97 commented Dec 6, 2016 via email

@rootfs
Contributor

rootfs commented Dec 6, 2016

@jingxu97 here is my thought: when the controller master restarts, if it first gets the node status (and thus the attached volumes) before syncing pods, wouldn't the attached volumes still be there by the time a pod is deleted?
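
A tiny illustration of that ordering, using stub functions instead of the real informer-backed listers (attachedVolumesFromNodeStatus and volumesNeededByPods are made up for this sketch): the actual state is recovered from node status before the pod-derived desired state is computed, so a pod deleted during the downtime still shows up as a detach candidate.

  package main

  import "fmt"

  // Stub listers standing in for the informer-backed ones in the controller.
  func attachedVolumesFromNodeStatus() map[string][]string {
      return map[string][]string{
          "node-1": {"kubernetes.io/aws-ebs/aws://us-east-1d/vol-4fc15dde"},
      }
  }

  func volumesNeededByPods() map[string][]string {
      // The pod was deleted while the controller was down, so nothing is needed.
      return map[string][]string{}
  }

  func contains(list []string, s string) bool {
      for _, x := range list {
          if x == s {
              return true
          }
      }
      return false
  }

  func main() {
      // 1. Recover what is attached from node status first.
      attached := attachedVolumesFromNodeStatus()
      // 2. Only then compute what should stay attached from the surviving pods.
      needed := volumesNeededByPods()
      // 3. Anything attached but no longer needed is a detach candidate, even
      //    though the controller never saw the pod deletion event.
      for node, vols := range attached {
          for _, v := range vols {
              if !contains(needed[node], v) {
                  fmt.Printf("would detach %s from %s\n", v, node)
              }
          }
      }
  }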

@jingxu97
Contributor

jingxu97 commented Dec 6, 2016 via email

@jingxu97 jingxu97 added this to the v1.6 milestone Dec 16, 2016
@jingxu97 jingxu97 self-assigned this Dec 16, 2016
@saad-ali
Member

But if the node restarts at the same time, then the information about the attached volumes will be gone because the whole node object is deleted (we plan to revisit this logic about deleting the node object too).

Spoke with @jingxu97 offline. I'm ok with using volumesAttached from node status to populate actual state on controller start.

tsmetana added a commit to tsmetana/kubernetes that referenced this issue Jan 9, 2017
When the attach/detach controller crashes and a pod with attached PV is deleted
afterwards the controller will never detach the pod's attached volumes. To
prevent this the controller should try to recover the state from the nodes
status and figure out which volumes to detach. This requires some changes in the
volume providers too: the only information available from the nodes is the
volume name and the device path. The controller needs to find the correct volume
plugin and reconstruct the volume spec just from the name. This required a small
change in the volume plugin interface too.
@tsmetana
Member

Hello.
I'm trying to fix the problem: you may take a look at the patch. I'm basically pre-populating the DesiredStateOfWorld with information from the pods and the ActualStateOfWorld using the volumesAttached data from the nodes. Since the only things I can get from the node are the volume's unique name and the device path, I had to add a helper interface method to the volume plugins too: I need to get a volume spec, and the existing interface requires a mount path for this (which is rather odd). I also need to find out the plugin name, so I assumed the unique name is always <plugin name>/<volume name>.

I tested this on AWS and it looks to be working as expected: the unused volume gets detached even when the pod was deleted during the controller-manager downtime.
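
As an aside, a small sketch of that naming assumption (splitUniqueName is my own illustrative helper, not the one added in the patch): in-tree plugin names such as kubernetes.io/aws-ebs themselves contain one slash, so the unique name splits after the second segment.

  package main

  import (
      "fmt"
      "strings"
  )

  // splitUniqueName splits a unique volume name of the assumed form
  // "<plugin name>/<volume name>". In-tree plugin names such as
  // "kubernetes.io/aws-ebs" contain one slash, so the plugin name is the
  // first two segments and the volume name is the rest.
  func splitUniqueName(uniqueName string) (pluginName, volumeName string, err error) {
      parts := strings.SplitN(uniqueName, "/", 3)
      if len(parts) != 3 {
          return "", "", fmt.Errorf("unexpected unique volume name %q", uniqueName)
      }
      return parts[0] + "/" + parts[1], parts[2], nil
  }

  func main() {
      plugin, volume, err := splitUniqueName("kubernetes.io/aws-ebs/aws://us-east-1d/vol-4fc15dde")
      if err != nil {
          panic(err)
      }
      fmt.Println(plugin) // kubernetes.io/aws-ebs
      fmt.Println(volume) // aws://us-east-1d/vol-4fc15dde
  }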

@jingxu97
Contributor

@tsmetana thank you for helping with this. I think when a pod is deleted from the API server, some information, such as the volume spec, is not recoverable. But it might be OK to put in some dummy information as long as the information needed for detach is correct.
You are right, the unique name is always <plugin name>/<volume name>.
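
A simplified illustration of that point (VolumeSpec and reconstructForDetach here are stand-ins, not the real volume.Spec API): only the volume identity needs to be recovered; the pod-level fields can stay empty because detach does not use them.

  package main

  import "fmt"

  // VolumeSpec is a simplified stand-in for the spec handed to a volume plugin.
  type VolumeSpec struct {
      PluginName string
      VolumeID   string // e.g. the EBS volume ID; enough for a detach call
      FSType     string // pod-level detail, lost with the deleted pod
      ReadOnly   bool   // pod-level detail, lost with the deleted pod
  }

  // reconstructForDetach builds a minimal spec from the plugin name and volume
  // name recovered from node status; pod-level fields are left at zero values.
  func reconstructForDetach(pluginName, volumeName string) VolumeSpec {
      return VolumeSpec{PluginName: pluginName, VolumeID: volumeName}
  }

  func main() {
      spec := reconstructForDetach("kubernetes.io/aws-ebs", "aws://us-east-1d/vol-4fc15dde")
      fmt.Printf("%+v\n", spec)
  }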

tsmetana added a commit to tsmetana/kubernetes that referenced this issue Jan 19, 2017
tsmetana added a commit to tsmetana/kubernetes that referenced this issue Feb 14, 2017
tsmetana added a commit to tsmetana/kubernetes that referenced this issue Mar 1, 2017
@jsravn
Contributor

jsravn commented Mar 6, 2017

Does this also affect HA controller-manager w/ leader election? Because I observe a similar issue when leaders swap (the new leader doesn't detach some volumes correctly).

@jsravn
Contributor

jsravn commented Mar 6, 2017

I don't think it is, because what I observe is that the pod remains running during the leader election swap. For some reason, when the pod is then deleted hours later, the new volume controller master doesn't detach the volume. From what I gather, the volume manager should be able to handle this case.

@childsb
Contributor

childsb commented Mar 13, 2017

There's a fix for this here:

#39732

It's a large fix, and I'd like sign-off from @saad-ali before merging it.

@saad-ali
Member

saad-ali commented Mar 13, 2017

That's way too large a change to merge to 1.6. We can consider it for post-1.6.

@ethernetdan
Contributor

Too late for v1.6, moving to the v1.7 milestone. If this is incorrect, please correct it.

/cc @kubernetes/release-team

@ethernetdan ethernetdan modified the milestones: v1.7, v1.6 Mar 14, 2017
tsmetana added a commit to tsmetana/kubernetes that referenced this issue Mar 22, 2017
tsmetana added a commit to tsmetana/kubernetes that referenced this issue Apr 13, 2017
wongma7 pushed a commit to wongma7/kubernetes that referenced this issue Apr 17, 2017
tsmetana added a commit to tsmetana/kubernetes that referenced this issue Apr 19, 2017
tsmetana added a commit to tsmetana/kubernetes that referenced this issue Apr 20, 2017
k8s-github-robot pushed a commit that referenced this issue Apr 20, 2017
Automatic merge from submit-queue (batch tested with PRs 44722, 44704, 44681, 44494, 39732)

Fix issue #34242: Attach/detach should recover from a crash

When the attach/detach controller crashes and a pod with attached PV is deleted afterwards the controller will never detach the pod's attached volumes. To prevent this the controller should try to recover the state from the nodes status and figure out which volumes to detach. This requires some changes in the volume providers too: the only information available from the nodes is the volume name and the device path. The controller needs to find the correct volume plugin and reconstruct the volume spec just from the name. This required a small change also in the volume plugin interface.

Fixes Issue #34242.
cc: @jsafrane @jingxu97
jayunit100 pushed a commit to jayunit100/kubernetes that referenced this issue Apr 25, 2017
PiotrProkop pushed a commit to intelsdi-x/kubernetes that referenced this issue May 16, 2017
deads2k pushed a commit to deads2k/kubernetes that referenced this issue May 22, 2017
smarterclayton pushed a commit to smarterclayton/kubernetes that referenced this issue May 23, 2017
deads2k pushed a commit to deads2k/kubernetes that referenced this issue Jun 7, 2017
deads2k pushed a commit to openshift/kubernetes that referenced this issue Jun 7, 2017
@saad-ali
Member

saad-ali commented Jun 7, 2017

#39732 merged for 1.7.
Closing.

@saad-ali saad-ali closed this as completed Jun 7, 2017