
Fix issue #34242: Attach/detach should recover from a crash #39732

Merged: 1 commit, Apr 20, 2017

Conversation

tsmetana
Member

When the attach/detach controller crashes and a pod with an attached PV is deleted afterwards, the controller will never detach the pod's attached volumes. To prevent this, the controller should try to recover the state from the node status and figure out which volumes to detach. This requires some changes in the volume providers too: the only information available from the nodes is the volume name and the device path. The controller needs to find the correct volume plugin and reconstruct the volume spec just from the name. This also required a small change in the volume plugin interface.

Fixes Issue #34242.
cc: @jsafrane @jingxu97
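
For readers unfamiliar with the naming convention the recovery relies on, here is a minimal, self-contained Go sketch (not the PR's actual code; the helper name and the example volume ID are made up) of how a unique volume name of the form <pluginName>/<volumeSourceName> can be split back into the plugin name and the plugin-specific volume name:

package main

import (
	"fmt"
	"strings"
)

// splitUniqueVolumeName is a stand-in for the controller's helper: it recovers
// the plugin name and the plugin-specific volume name from a unique volume
// name such as "kubernetes.io/aws-ebs/vol-0123456789abcdef0".
func splitUniqueVolumeName(uniqueName string) (pluginName, volumeName string, err error) {
	// Unique names follow the convention "<vendor>/<plugin>/<volume-source-name>".
	parts := strings.SplitN(uniqueName, "/", 3)
	if len(parts) != 3 {
		return "", "", fmt.Errorf("cannot split unique volume name %q", uniqueName)
	}
	return parts[0] + "/" + parts[1], parts[2], nil
}

func main() {
	plugin, volume, err := splitUniqueVolumeName("kubernetes.io/aws-ebs/vol-0123456789abcdef0")
	if err != nil {
		panic(err)
	}
	fmt.Println(plugin) // kubernetes.io/aws-ebs
	fmt.Println(volume) // vol-0123456789abcdef0
}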

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 11, 2017

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jan 11, 2017
@jsafrane jsafrane added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Jan 11, 2017
@jsafrane
Member

Assigning to @jingxu97, she knows attach/detach logic better

@tsmetana
Member Author

@k8s-bot gci gke e2e test this issue: #23634

}

return nil
}
Member

I think we need some tests to verify that these two methods indeed reconstruct the actual and desired states of the world correctly via API calls.

Member Author

Right. I was just not sure how exactly to test the changes: the biggest problem is that I had to change all the volume plugins, so ideally they should all be tested. But for this part I think I should be able to come up with something. Thanks.

@kerneltime

Are there any automated tests for this? If not, what testing was done?

Looking into the code change, but at first glance the reconstruction of the volume spec might need fixing for vSphere.

Name: volumeName,
VolumeSource: v1.VolumeSource{
VsphereVolume: &v1.VsphereVirtualDiskVolumeSource{
VolumePath: volumeName,


This will need fixing. @BaluDontu

Member Author

OK. Thanks for the response. I tested the patch on AWS; the purpose of the PR was mainly to solicit feedback on the patch, since it's actually my first contribution to Kubernetes.


Let me know if you need us to address this code if you plan to merge.

Member Author

Thanks @kerneltime. I've looked at the plugin code and I really did fill the spec with nonsense, probably in more places than just vSphere... I would definitely appreciate any help: I'm not able to test all the plugins myself. I'm quite sure those that don't need mountPath to reconstruct the volumeSpec are OK, but I might need some guidance in the other cases.

@tsmetana
Member Author

Sorry for the failed builds. I need to rebase (and fix) the patches.

@tsmetana tsmetana force-pushed the issue_34242 branch 2 times, most recently from 2ff3103 to ba0a062 Compare January 23, 2017 13:31
@k8s-github-robot k8s-github-robot assigned saad-ali and unassigned pmorie and jingxu97 Jan 30, 2017
@childsb childsb added this to the 1.6 milestone Jan 31, 2017
@childsb childsb requested a review from saad-ali February 3, 2017 18:50
return nil
}

func (adc *attachDetachController) populateDesiredStateOfWorld() error {
Contributor

A problem here is the race condition between this population and nodeAdd. If a pod is added before its node is added to the desired state, the podAdd will fail.
One solution could be to get the node list from the kubeClient and add all the nodes to the desired state before adding the pods.
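
A toy sketch of the ordering suggested above (all types and method names here are placeholders, not the controller's real API): register every node in the desired state first, then add the pods, so a pod can never reference a node that has not been added yet.

package main

import "fmt"

// Placeholder types for this illustration; they are not the controller's real types.
type Node struct{ Name string }
type Pod struct{ Name, NodeName string }

// desiredState mimics the nodes-then-pods bookkeeping proposed in the comment.
type desiredState struct{ nodes map[string][]string }

func (d *desiredState) AddNode(name string) {
	if _, ok := d.nodes[name]; !ok {
		d.nodes[name] = nil
	}
}

func (d *desiredState) AddPod(pod, node string) error {
	if _, ok := d.nodes[node]; !ok {
		return fmt.Errorf("pod %q references unknown node %q", pod, node)
	}
	d.nodes[node] = append(d.nodes[node], pod)
	return nil
}

func main() {
	dsw := &desiredState{nodes: map[string][]string{}}
	nodes := []Node{{Name: "node-1"}}
	pods := []Pod{{Name: "pod-a", NodeName: "node-1"}}

	// Nodes first, then pods: this ordering avoids the podAdd/nodeAdd race.
	for _, n := range nodes {
		dsw.AddNode(n.Name)
	}
	for _, p := range pods {
		if err := dsw.AddPod(p.Name, p.NodeName); err != nil {
			fmt.Println("populate error:", err)
		}
	}
	fmt.Println(dsw.nodes) // map[node-1:[pod-a]]
}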

Member Author

And the funny part is I found this out myself and circumvented it in the unit tests (so they pass)... Will fix that.

glog.Errorf("could not construct volume spec in plugin %s for volume %s: %v", pluginName, volumeName, err)
continue
}
_, err = adc.actualStateOfWorld.AddVolumeNode(volumeSpec, nodeName, attachedVolume.DevicePath)
Contributor
@jingxu97 commented Feb 3, 2017

One thing about the AddVolumeNode() function: by default it sets mountedByNodeSetCount: 0. Although this should not cause anything dangerous to happen, based on the current logic it might delay the reconciler's detach. I am OK with this, but it might be worth adding a note mentioning it. Or you could get VolumesInUse and update this field.

Member Author

This is actually what I saw in testing: the volume gets detached after some timeout (6 minutes, I think). And after a discussion with @jsafrane, this should be addressed too.

Contributor

Could you capitalize the first letter of the error message, to be consistent?

Name: volName,
VolumeSource: v1.VolumeSource{
AWSElasticBlockStore: &v1.AWSElasticBlockStoreVolumeSource{
VolumeID: volName,
Contributor
@jingxu97 commented Feb 3, 2017

The Name:volName is not the real one. But in our code, this Name is not really used anywhere. Maybe put a note about it here in case this information is needed in the future.

glog.Errorf("could not find plugin named %s for volume %s: %v", pluginName, volumeName, err)
continue
}
volumeSpec, err := plug.ConstructVolumeSpecFromName(volumeName)
Contributor

The ConstructVolumeSpecFromName function fakes the volumeSource information for almost all of the volume plugins. The main issue is that this information will be used by VerifyVolumesAreAttached(). If this information (e.g., the volumeID) is wrong, the function will return wrong information about whether volumes are attached or not.

One way to avoid this is to populate the desired state first, which contains this information, and then find the matching volume to populate the spec information from the desired state. But the volume spec information might not be recoverable if the pod is deleted, so the desired state does not have it either. PR #40239 might help solve this issue, but that PR was reverted. Will follow up with the status.

Member Author

Yes. ConstructVolumeSpecFromName fills the spec with wrong data. It's been noted above for the vSphere plugin. This is the most problematic part of the patch, since I really had no idea how to fix it properly. I'll take a look at the other PR. (Or if somebody comes up with an idea, I'm all ears.) Thanks.

Contributor

It looks to me like PR #40239 would remove the need for recovery in the case where a crash happens before unmounting has succeeded. Because then the pod wouldn't be allowed to be deleted in the first place, and the controller could find it and populate its states as usual.

But if a crash happens after unmounting has already succeeded & before detach happens, then recovery is still needed because the pod would be allowed to be deleted & won't be around to inspect.

So if names from node.Status are not enough for reconstruction then you may have to serialize & store entire volume specs somewhere? Maybe as ugly annotations on the node? I have no idea what the "proper" way to do it is.

Member Author

I went through the plugin code wondering why the volumes get detached in my tests, since I really don't put the correct data into the spec. It seems it's actually by accident: I use the unique volume name from the node and split it into the plugin name and the volume name. The volume name incidentally ends with the same component (path-wise) as the mount path. The Detach() plugin methods usually start by extracting the last component of their first argument, so if that argument is the volume name I have, the result is identical to what it would be with the correct volume source and the detach operation succeeds (@kerneltime -- I think this is the same in the vSphere case).
The question is whether I should document this in the code and rely on this behaviour, or whether some deeper re-design of the attach/detach controller would be required.
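
A small illustration of the coincidence described here, with made-up values rather than any plugin's real code: the recovered volume name and the last path component a Detach() implementation would extract end up being the same string.

package main

import (
	"fmt"
	"path"
	"strings"
)

func main() {
	// A unique volume name as stored in node.Status.VolumesAttached
	// (the exact value is made up for this illustration).
	uniqueName := "kubernetes.io/aws-ebs/vol-0123456789abcdef0"

	// Recovery path: strip the plugin prefix, keep the plugin-specific name.
	parts := strings.SplitN(uniqueName, "/", 3)
	recoveredName := parts[len(parts)-1] // "vol-0123456789abcdef0"

	// The mount-path-like argument Detach() would normally work from
	// (again, a made-up path for illustration).
	mountPath := "/var/lib/kubelet/plugins/aws-ebs/mounts/vol-0123456789abcdef0"

	// Detach() implementations typically reduce their argument to its last
	// path component, so both inputs collapse to the same string.
	fmt.Println(path.Base(mountPath) == path.Base(recoveredName)) // true
}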

Contributor

Sorry for the confusion about the names. You are right. The UniqueVolume name is constructed as pluginName/volumeSourceName. This volumeSourceName comes from plugin.GetVolumeName(), which takes the source.volumeID (or other names) from the different plugins. So when you call ConstructVolumeSpecFromName() as below, the volumeID is actually the volumeName you extract from SplitUniqueName(). But the meta Name: volumeName for the spec is wrong. I checked our code and I didn't find anywhere this meta name is actually used. So, overall, your approach should work.

Member Author

Thanks. I'm actually trying one more thing: storing the string that I'm unable to generate from the volume name in the node status as well. This would be more robust; however, it's an API change (a small one, though).

Contributor

Can you provide more details about your planned API change? I actually don't recommend making an API change in this PR unless it is really necessary, because getting one approved is a very complicated and long process.

Member Author

Right now the node status looks like this:

status:
    volumesAttached:
    - devicePath: /dev/vdb
      name: kubernetes.io/...

So I was thinking of adding one more string there that would contain the part I'm unable to reconstruct from the name (e.g., PDName in the case of gce_pd):

status:
    volumesAttached:
    - devicePath: /dev/vdb
      name: kubernetes.io/plugin/volume_name
      detachVolumeName: whatever_the_plugin_needs

I talked to @jsafrane today and he also thinks getting the API change approved would be quite complicated, and he suggested storing this data in the node annotations, which I suppose would work too...

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2017
@calebamiles calebamiles modified the milestones: v1.6, 1.6 Feb 13, 2017
@jingxu97
Contributor

@tsmetana what is the status of the PR? Are there still some tests failing?

@jingxu97
Contributor

Also, I am thinking this PR might not solve the following issue.

Because the controller deletes the volume from the "VolumesAttached" list before actually performing the detach operation, there is still a chance that we lose the information about the volume while it is still attached. The sequence of events might be:

  1. pod is deleted
  2. controller checks the condition and removes the volume from "VolumesAttached", but has not yet called the "detach" operation
  3. controller crashes
  4. controller restarts; the volume information cannot be found anywhere (neither in the pod nor in VolumesAttached in the node status)

@tsmetana
Member Author

@tsmetana what is the status of the PR? Are there still some tests failing?

Sorry for the late response... The latest commit seems to be working OK. And yes, you're right -- there are more races affecting attach/detach and this patch fixes just one of them.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 18, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 19, 2017
@tsmetana
Member Author

Rebased and squashed, included fixes by @wongma7 (thanks!).

@@ -65,7 +65,7 @@ type ActualStateOfWorld interface {
// returned.
// If no node with the name nodeName exists in list of attached nodes for
// the specified volume, an error is returned.
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool) error
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool, forceUnmount bool) error
Contributor

The function's comment should be amended to include the new forceUnmount parameter, since the comment currently pretends there is only one bool.

Member

+1

@wongma7
Contributor

wongma7 commented Apr 19, 2017

Overall the patch lgtm. The fishiest part is the use of SplitUniqueName during detach, which relies on a kind of unenforced convention, but we've already decided it's fine.

Jing brings up a good issue: VolumesAttached is not, as its name suggests, the source of truth about what volumes are actually attached to the node. So there's still a hole there that needs plugging. IMO it could be fixed in a follow-up PR; this PR has gotten big enough already and fixes the more likely case, where the deletion happens while the controller is down.

I imagine the fix for Jing's issue will have to either change the behavior around VolumesAttached or force us to introduce a new annotation "VolumesActuallyAttached" altogether, in which case we may as well not co-opt VolumesAttached and serialize the volume specs to the new annotation. But again, we can figure that stuff out in a follow-up.

Member

@saad-ali left a comment

A couple of comments. Overall direction looks fine.

nodeName := types.NodeName(node.Name)
for _, attachedVolume := range node.Status.VolumesAttached {
uniqueName := attachedVolume.Name
err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil, nodeName, attachedVolume.DevicePath)
Member

The nil value for volumeSpec scares me, like @gnufied mentioned earlier. But it sounds like (#39732 (comment)) you are confident it won't break anything.

nit: when passing in a value where it is unclear what the parameter is, add the parameter name as an inline comment, e.g.:

err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil /* volumeSpec */, nodeName, attachedVolume.DevicePath)

nit: add a comment explaining why you think it is OK to pass in nil (e.g. #39732 (comment)).

uniqueName := attachedVolume.Name
err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil, nodeName, attachedVolume.DevicePath)
if err != nil {
glog.Errorf("Error adding node to ActualStateOfWorld: %v", err)
Member

The error message does not seem to reflect the action.

continue
}
adc.processVolumesInUse(nodeName, node.Status.VolumesInUse, true /* forceUnmount */)
adc.desiredStateOfWorld.AddNode(types.NodeName(node.Name)) // Needed for DesiredStateOfWorld population
Member

Check if the node is managed by attach/detach controller:

	if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
		// Node specifies annotation indicating it should be managed by attach
		// detach controller. Add it to desired state of world.
		adc.desiredStateOfWorld.AddNode(nodeName)
	}

continue
}
nodeName := types.NodeName(podToAdd.Spec.NodeName)
plugin, err := adc.volumePluginMgr.FindPluginBySpec(volumeSpec)
Member

Use FindAttachablePluginBySpec; we don't care about non-attachable volumes in this controller.

}
nodeName := types.NodeName(podToAdd.Spec.NodeName)
plugin, err := adc.volumePluginMgr.FindPluginBySpec(volumeSpec)
if err != nil {
Member

Check if the returned plugin is nil:

		if err != nil || attachableVolumePlugin == nil {
			glog.V(10).Infof(
				"Skipping volume %q for pod %q/%q: it does not implement attacher interface. err=%v",
				podVolume.Name,
				pod.Namespace,
				pod.Name,
				err)
			continue
		}

@@ -298,7 +403,7 @@ func (adc *attachDetachController) nodeUpdate(oldObj, newObj interface{}) {
// detach controller. Add it to desired state of world.
adc.desiredStateOfWorld.AddNode(nodeName)
}
adc.processVolumesInUse(nodeName, node.Status.VolumesInUse)
adc.processVolumesInUse(nodeName, node.Status.VolumesInUse, false)
Member

+1

@@ -65,7 +65,7 @@ type ActualStateOfWorld interface {
// returned.
// If no node with the name nodeName exists in list of attached nodes for
// the specified volume, an error is returned.
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool) error
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool, forceUnmount bool) error
Member

+1

}

// processVolumesInUse processes the list of volumes marked as "in-use"
// according to the specified Node's Status.VolumesInUse and updates the
// corresponding volume in the actual state of the world to indicate that it is
// mounted.
func (adc *attachDetachController) processVolumesInUse(
nodeName types.NodeName, volumesInUse []v1.UniqueVolumeName) {
nodeName types.NodeName, volumesInUse []v1.UniqueVolumeName, forceUnmount bool) {
Member

Update comment to describe new boolean.

// Get attacher plugin and the volumeName by splitting the volume unique name in case
// there's no VolumeSpec: this happens only on attach/detach controller crash recovery
// when a pod has been deleted during the controller downtime
pluginName, volumeName, err = volumehelper.SplitUniqueName(volumeToDetach.VolumeName)
Member

Agreed, pretty funky; this should be improved in the future.

When the attach/detach controller crashes and a pod with an attached PV is deleted
afterwards, the controller will never detach the pod's attached volumes. To
prevent this, the controller should try to recover the state from the node
status.
@tsmetana
Member Author

Thanks for the comments. I hope I addressed them correctly in the new version (just pushed).

@@ -239,12 +245,136 @@ func (adc *attachDetachController) Run(stopCh <-chan struct{}) {
return
}

err := adc.populateActualStateOfWorld()
Member

There is a controller.WaitForCacheSync() just a few lines above. Does that mean that all the existing objects were already synced through the podAdd() and nodeAdd() callbacks from the shared informers? They may be synced in the wrong order (podAdd before nodeAdd), and these populateActualStateOfWorld() and populateDesiredStateOfWorld() calls will just fix that, right?

Member

I talked to @tsmetana and yes, these new world populators will fix states of the world that could be inconsistent due to bad event ordering.

Member Author

The main issue is that we need to find the volumes attached for pods that were deleted during the controller downtime: this would not be solved by WaitForCacheSync() -- we would still be missing the pod deletion event. And yes, in case the events come in the wrong order, the ASW/DSW populators should help too.
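
A rough sketch of the startup ordering being discussed (the function bodies below are placeholders, not the controller's real code): sync the informer caches, rebuild both states of the world, and only then let the reconciler act.

package main

import "fmt"

// Placeholder steps standing in for the controller's real ones; this only
// illustrates the ordering discussed in the review, not the actual API.
func waitForCacheSync() bool { fmt.Println("informer caches synced"); return true }
func populateActualStateOfWorld() error {
	fmt.Println("actual state rebuilt from node.Status.VolumesAttached")
	return nil
}
func populateDesiredStateOfWorld() error {
	fmt.Println("desired state rebuilt from nodes, then pods")
	return nil
}
func runReconciler() { fmt.Println("reconciler started") }

func main() {
	// 1. Wait until the shared informers have delivered the existing objects.
	if !waitForCacheSync() {
		return
	}
	// 2. Repair any inconsistency left by event ordering or by pods deleted
	//    while the controller was down.
	if err := populateActualStateOfWorld(); err != nil {
		fmt.Println("populate ASW:", err)
	}
	if err := populateDesiredStateOfWorld(); err != nil {
		fmt.Println("populate DSW:", err)
	}
	// 3. Only then let the reconciler attach/detach based on the two states.
	runReconciler()
}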

@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 20, 2017

@tsmetana: The following test(s) failed:

Test name | Commit | Details | Rerun command
Jenkins GKE smoke e2e | 99d461335e8a68f2b356f1580658d9e2d69b45c8 | link | @k8s-bot cvm gke e2e test this
Jenkins GCI GKE smoke e2e | 99d461335e8a68f2b356f1580658d9e2d69b45c8 | link | @k8s-bot gci gke e2e test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jsafrane
Member

@k8s-bot cvm gce e2e test this
These are not attach/detach errors.

@saad-ali
Member

Looks good, thanks for your patience @tsmetana!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, tsmetana

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 44722, 44704, 44681, 44494, 39732)
