AWS: Fix detaching volumes of deleted pods #24861
Conversation
From my experience on the GCE side of things, I'd caution against doing this. The existing code has some ugly races between attach and detach. This means rapid pod creation and deletion can result in funky behavior like two attaches getting triggered back to back followed by a detach. Basically, this whole house of cards is held together with duct tape, and moving one little piece to fix one bug may result in unintentionally (and non-deterministically) breaking a bunch of other scenarios. That said, if you're confident this will fix the issue and not break anything, go for it. Otherwise you may want to wait for v1.3's implementation of #20262, which should address this issue more robustly. I'll let @justinsb review this.
attachmentStatus := ""
for _, attachment := range info.Attachments {
    if attachmentStatus != "" {
        glog.Warning("Found multiple attachments: ", info)
I don't follow this warning... what are the possible values of attachmentStatus?
I just refactored part of waitForAttachmentStatus into a standalone function; I admit I don't understand all the details here. I've never seen this warning printed in my logs.
Yeah, it is probably superfluous with the len(info.Attachments) check, but EBS seems so flaky at times that I wanted to program really defensively. This will only fire if somehow a volume got attached twice (which should be impossible).
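For context, a minimal sketch of the kind of standalone helper being discussed, assuming the aws-sdk-go ec2 types; the helper name is hypothetical and this is a reading of the snippet above, not the PR's exact code:

    package aws

    import (
        "github.com/aws/aws-sdk-go/service/ec2"
        "github.com/golang/glog"
    )

    // volumeAttachmentStatus (hypothetical name) returns the attachment state of
    // an EBS volume ("attaching", "attached", "detaching", "detached") and warns
    // defensively if the volume somehow reports more than one attachment.
    func volumeAttachmentStatus(info *ec2.Volume) string {
        attachmentStatus := ""
        for _, attachment := range info.Attachments {
            if attachmentStatus != "" {
                // Should be impossible: an EBS volume attached more than once.
                glog.Warning("Found multiple attachments: ", info)
            }
            if attachment.State != nil {
                attachmentStatus = *attachment.State
            }
        }
        return attachmentStatus
    }

The warning is redundant with a check on len(info.Attachments), but it costs nothing and documents the single-attachment assumption.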
I am okay with this change overall. Let's see what @justinsb says.
Force-pushed from 2b864c3 to 1a739c4.
Fixed unit tests and pushed.
+1000. I know it's really fragile and I hope my changes will be replaced by a rock-stable attach controller soon. Unfortunately, we do leak attached volumes now and we should fix it even before 1.3.
Force-pushed from 1a739c4 to f3a25f1.
For easier debugging.
When attachDiskAndVerify gives up waiting for a volume to be attached, AWS may still be trying to attach the volume. If the pod that wants the volume attached still exists, kubelet will retry attaching the volume on the next kubelet sync (and everything is all right). However, if the pod is deleted, AWS keeps trying to attach the volume and may eventually succeed. We end up with a volume attached to a node and no pod for it. Nobody will detach the volume. As a fix, when attachDiskAndVerify gives up, just fire DetachVolume to AWS, without waiting for the result. If the associated pod still exists, it will try to attach the volume again soon. If the pod is deleted, AWS will detach the volume (or stop attaching it).
Force-pushed from f3a25f1 to ecd38ad.
GCE e2e build/test passed for commit ecd38ad.
Have we actually seen those leaks in the wild? Do we have logs we can review? I am also super-wary of changing the AWS EBS volumes code, as @saad-ali said. I did a lot of work in 1.2 to get back in sync with the GCE volumes code, and I am wary of breaking sync with it again. If we do this on AWS, why don't we do this on GCE also? That said, there is a loophole in my logic... If you think this is worse than leaking the volume, then maybe we could put the Detach fire into that failure path.
We would essentially fire your detach on exit from AttachDisk where we otherwise would leak the attach. I'm also not entirely sure that a Detach actually cancels an EBS Attach, if it is in fact stuck. What is the status of the new sync logic, @saad-ali?
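A rough sketch of that alternative, under stated assumptions: the attach and detach closures stand in for the real cloud-provider calls, and the function name is hypothetical.

    package ebsfix

    import "log"

    // attachOrCleanUp runs the attach and, if we exit on the path that would
    // otherwise leak the attach, fires the detach from a defer. Whether a Detach
    // actually cancels a stuck EBS Attach is an open question, as noted above.
    func attachOrCleanUp(attach, detach func() error) (err error) {
        defer func() {
            if err != nil {
                // Exiting on the failure path: ask AWS to detach (or stop the
                // still-pending attach) rather than leaking it.
                if derr := detach(); derr != nil {
                    log.Printf("cleanup detach failed: %v", derr)
                }
            }
        }()
        err = attach()
        return
    }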
The new attach/detach controller should be in for v1.3 Code Complete, May 20th.
@saad-ali is right; I'm closing this PR.
When attachDiskAndVerify gives up waiting for a volume to be attached, AWS may still be trying to attach the volume.
If the pod that wants the volume attached still exists, kubelet will retry attaching the volume on the next kubelet sync (and everything is all right).
However, if the pod is deleted, AWS keeps trying to attach the volume and may eventually succeed. We end up with a volume attached to a node and no pod for it. Nobody will detach the volume.
As a fix, just fire DetachVolume to AWS when attachDiskAndVerify gives up, without waiting for the result. If the associated pod still exists, it will try to attach the volume again soon. If the pod is deleted, AWS will detach the volume (or stop attaching it).
Fixes: #24807
There are several tiny patches, with the last one bringing everything together. I tested that it really fixes the referenced issue; still, I am open to suggestions for a better way to fix it.
@justinsb @kubernetes/sig-storage