
Fix issue #34242: Attach/detach should recover from a crash #39732

Merged: 1 commit, Apr 20, 2017

Conversation

tsmetana
Member

When the attach/detach controller crashes and a pod with an attached PV is deleted afterwards, the controller will never detach the pod's attached volumes. To prevent this, the controller should try to recover the state from the node status and figure out which volumes to detach. This requires some changes in the volume providers too: the only information available from the nodes is the volume name and the device path. The controller needs to find the correct volume plugin and reconstruct the volume spec just from the name. This also required a small change in the volume plugin interface.

Fixes Issue #34242.
cc: @jsafrane @jingxu97
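
For readers unfamiliar with the naming convention the recovery relies on, here is a minimal, self-contained Go sketch (not the PR's actual code; the helper name and the example volume ID are made up) of how a unique volume name of the form <pluginName>/<volumeSourceName> can be split back into the plugin name and the plugin-specific volume name:

package main

import (
	"fmt"
	"strings"
)

// splitUniqueVolumeName is a stand-in for the controller's helper: it recovers
// the plugin name and the plugin-specific volume name from a unique volume
// name such as "kubernetes.io/aws-ebs/vol-0123456789abcdef0".
func splitUniqueVolumeName(uniqueName string) (pluginName, volumeName string, err error) {
	// Unique names follow the convention "<vendor>/<plugin>/<volume-source-name>".
	parts := strings.SplitN(uniqueName, "/", 3)
	if len(parts) != 3 {
		return "", "", fmt.Errorf("cannot split unique volume name %q", uniqueName)
	}
	return parts[0] + "/" + parts[1], parts[2], nil
}

func main() {
	plugin, volume, err := splitUniqueVolumeName("kubernetes.io/aws-ebs/vol-0123456789abcdef0")
	if err != nil {
		panic(err)
	}
	fmt.Println(plugin) // kubernetes.io/aws-ebs
	fmt.Println(volume) // vol-0123456789abcdef0
}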

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 11, 2017

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jan 11, 2017
@jsafrane jsafrane added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Jan 11, 2017
@jsafrane
Member

Assigning to @jingxu97, she knows attach/detach logic better

@tsmetana
Member Author

@k8s-bot gci gke e2e test this issue: #23634

}

return nil
}
Member

I think we need some tests to verify that these two methods indeed reconstruct the actual and desired states of the world correctly via API calls.

Member Author

Right. I was just not sure how exactly to test the changes: the biggest problem is that I had to change all the volume plugins, so ideally they should all be tested. But for this part I think I should be able to come up with something. Thanks.

@kerneltime

Are there any automated tests for this? If not, what testing was done?

Looking into the code change, but at first glance the reconstruction of the volume spec might need fixing for vSphere.

Name: volumeName,
VolumeSource: v1.VolumeSource{
VsphereVolume: &v1.VsphereVirtualDiskVolumeSource{
VolumePath: volumeName,


This will need fixing. @BaluDontu

Member Author

OK. Thanks for the response. I tested the patch on AWS; the purpose of the PR was mainly to solicit feedback on the patch, since it's actually my first contribution to Kubernetes.


Let me know if you need us to address this code if you plan to merge.

Member Author

Thanks @kerneltime. I've looked at the plugin code and I really did fill the spec with nonsense, probably in more places than just vSphere... I would definitely appreciate any help: I'm not able to test all the plugins myself. I'm quite sure those that don't need mountPath to reconstruct the volumeSpec are OK, but I might need some guidance in the other cases.

@tsmetana
Member Author

Sorry for the failed builds. I need to rebase (and fix) the patches.

@tsmetana tsmetana force-pushed the issue_34242 branch 2 times, most recently from 2ff3103 to ba0a062 Compare January 23, 2017 13:31
@k8s-github-robot k8s-github-robot assigned saad-ali and unassigned pmorie and jingxu97 Jan 30, 2017
@childsb childsb added this to the 1.6 milestone Jan 31, 2017
@childsb childsb requested a review from saad-ali February 3, 2017 18:50
return nil
}

func (adc *attachDetachController) populateDesiredStateOfWorld() error {
Contributor

A problem here is the race condition between this population and nodeAdd. If a pod is added before its node is added to the desired state, the podAdd will fail.
One solution could be to get the node list from the kubeClient and add all the nodes to the desired state before adding the pods.
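
A toy sketch of the ordering suggested above (all types and method names here are placeholders, not the controller's real API): register every node in the desired state first, then add the pods, so a pod can never reference a node that has not been added yet.

package main

import "fmt"

// Placeholder types for this illustration; they are not the controller's real types.
type Node struct{ Name string }
type Pod struct{ Name, NodeName string }

// desiredState mimics the nodes-then-pods bookkeeping proposed in the comment.
type desiredState struct{ nodes map[string][]string }

func (d *desiredState) AddNode(name string) {
	if _, ok := d.nodes[name]; !ok {
		d.nodes[name] = nil
	}
}

func (d *desiredState) AddPod(pod, node string) error {
	if _, ok := d.nodes[node]; !ok {
		return fmt.Errorf("pod %q references unknown node %q", pod, node)
	}
	d.nodes[node] = append(d.nodes[node], pod)
	return nil
}

func main() {
	dsw := &desiredState{nodes: map[string][]string{}}
	nodes := []Node{{Name: "node-1"}}
	pods := []Pod{{Name: "pod-a", NodeName: "node-1"}}

	// Nodes first, then pods: this ordering avoids the podAdd/nodeAdd race.
	for _, n := range nodes {
		dsw.AddNode(n.Name)
	}
	for _, p := range pods {
		if err := dsw.AddPod(p.Name, p.NodeName); err != nil {
			fmt.Println("populate error:", err)
		}
	}
	fmt.Println(dsw.nodes) // map[node-1:[pod-a]]
}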

Member Author

And the funny part is I found this out myself and circumvented it in the unit tests (so they pass)... Will fix that.

glog.Errorf("could not construct volume spec in plugin %s for volume %s: %v", pluginName, volumeName, err)
continue
}
_, err = adc.actualStateOfWorld.AddVolumeNode(volumeSpec, nodeName, attachedVolume.DevicePath)
Contributor
@jingxu97 commented Feb 3, 2017

One thing about the AddVolumeNode() function: by default it sets mountedByNodeSetCount: 0. Although this should not cause anything dangerous to happen, based on the current logic it might delay the reconciler's detach. I am OK with this, but it might be worth adding a note mentioning it. Or you could get VolumesInUse and update this field.

Member Author

This is actually what I saw in testing: the volume gets detached after some timeout (6 minutes, I think). And after a discussion with @jsafrane, this should be addressed too.

Contributor

Could you capitalize the first letter of the error message, to be consistent?

Name: volName,
VolumeSource: v1.VolumeSource{
AWSElasticBlockStore: &v1.AWSElasticBlockStoreVolumeSource{
VolumeID: volName,
Contributor
@jingxu97 commented Feb 3, 2017

The Name:volName is not the real one. But in our code, this Name is not really used anywhere. Maybe put a note about it here in case this information is needed in the future.

glog.Errorf("could not find plugin named %s for volume %s: %v", pluginName, volumeName, err)
continue
}
volumeSpec, err := plug.ConstructVolumeSpecFromName(volumeName)
Contributor

The ConstructVolumeSpecFromName function fakes the volumeSource information for almost all of the volume plugins. The main issue is that this information will be used by VerifyVolumesAreAttached(). If this information (e.g., the volumeID) is wrong, the function will return wrong information about whether volumes are attached or not.

One way to avoid this is to populate the desired state first, which contains this information, and then find the matching volume to populate the spec information from the desired state. But the volume spec information might not be recoverable if the pod is deleted, so the desired state does not have it either. PR #40239 might help solve this issue, but that PR was reverted. Will follow up with the status.

Member Author

Yes. ConstructVolumeSpecFromName fills the spec with wrong data. It's been noted above for the vSphere plugin. This is the most problematic part of the patch, since I really had no idea how to fix it properly. I'll take a look at the other PR. (Or if somebody comes up with an idea, I'm all ears.) Thanks.

Contributor

It looks to me like PR #40239 would remove the need for recovery in the case where a crash happens before unmounting has succeeded. Because then the pod wouldn't be allowed to be deleted in the first place, and the controller could find it and populate its states as usual.

But if a crash happens after unmounting has already succeeded & before detach happens, then recovery is still needed because the pod would be allowed to be deleted & won't be around to inspect.

So if names from node.Status are not enough for reconstruction then you may have to serialize & store entire volume specs somewhere? Maybe as ugly annotations on the node? I have no idea what the "proper" way to do it is.

Member Author

I went through the plugin code wondering why the volumes get detached in my tests, since I really don't put the correct data into the spec. It seems it's actually by accident: I use the unique volume name from the node and split it into the plugin name and the volume name. The volume name incidentally ends with the same component (path-wise) as the mount path. The Detach() plugin methods usually start by extracting the last component of their first argument, so if that argument is the volume name I have, the result is identical to what it would be with the correct volume source and the detach operation succeeds (@kerneltime -- I think this is the same in the vSphere case).
The question is whether I should document this in the code and rely on this behaviour, or whether some deeper re-design of the attach/detach controller would be required.
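
A small illustration of the coincidence described here, with made-up values rather than any plugin's real code: the recovered volume name and the last path component a Detach() implementation would extract end up being the same string.

package main

import (
	"fmt"
	"path"
	"strings"
)

func main() {
	// A unique volume name as stored in node.Status.VolumesAttached
	// (the exact value is made up for this illustration).
	uniqueName := "kubernetes.io/aws-ebs/vol-0123456789abcdef0"

	// Recovery path: strip the plugin prefix, keep the plugin-specific name.
	parts := strings.SplitN(uniqueName, "/", 3)
	recoveredName := parts[len(parts)-1] // "vol-0123456789abcdef0"

	// The mount-path-like argument Detach() would normally work from
	// (again, a made-up path for illustration).
	mountPath := "/var/lib/kubelet/plugins/aws-ebs/mounts/vol-0123456789abcdef0"

	// Detach() implementations typically reduce their argument to its last
	// path component, so both inputs collapse to the same string.
	fmt.Println(path.Base(mountPath) == path.Base(recoveredName)) // true
}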

Contributor

Sorry for the confusion about the names. You are right. The UniqueVolume name is constructed as pluginName/volumeSourceName. This volumeSourceName comes from plugin.GetVolumeName(), which takes the source.volumeID (or other names) from the different plugins. So when you call ConstructVolumeSpecFromName() as below, the volumeID is actually the volumeName you extract from SplitUniqueName(). But the meta Name: volumeName for the spec is wrong. I checked our code and I didn't find anywhere this meta name is actually used. So, overall, your approach should work.

Member Author

Thanks. I'm actually trying one more thing: storing the string that I'm unable to generate from the volume name in the node status as well. This would be more robust; however, it's an API change (a small one, though).

Contributor

Can you provide more details about your planned API change? I actually don't recommend making an API change in this PR unless it is really necessary, because getting one approved is a very complicated and long process.

Member Author

Right now the node status looks like this:

status:
    volumesAttached:
    - devicePath: /dev/vdb
      name: kubernetes.io/...

So I was thinking of adding one more string there that would contain the part I'm unable to reconstruct from the name (e.g., PDName in the case of gce_pd):

status:
    volumesAttached:
    - devicePath: /dev/vdb
      name: kubernetes.io/plugin/volume_name
      detachVolumeName: whatever_the_plugin_needs

I talked to @jsafrane today and he also thinks getting the API change approved would be quite complicated, and he suggested storing this data in the node annotations, which I suppose would work too...

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2017
@calebamiles calebamiles modified the milestones: v1.6, 1.6 Feb 13, 2017
@jingxu97
Contributor

@tsmetana what is the status of the PR? Are there still some tests failing?

@jingxu97
Contributor

Also, I am thinking this PR might not solve the following issue.

Because the controller deletes the volume from the "VolumesAttached" list before actually performing the detach operation, there is still a chance that we lose the information about the volume while it is still attached. The sequence of events might be:

  1. pod is deleted
  2. controller checks the condition and removes the volume from "VolumesAttached", but has not yet called the "detach" operation
  3. controller crashes
  4. controller restarts; the volume information cannot be found anywhere (neither in the pod nor in VolumesAttached in the node status)

@tsmetana
Member Author

@tsmetana what is the status of the PR? Are there still some tests failing?

Sorry for the late response... The latest commit seems to be working OK. And yes, you're right -- there are more races affecting attach/detach and this patch fixes just one of them.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 18, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 19, 2017
@tsmetana
Member Author

Rebased and squashed, included fixes by @wongma7 (thanks!).

@@ -65,7 +65,7 @@ type ActualStateOfWorld interface {
// returned.
// If no node with the name nodeName exists in list of attached nodes for
// the specified volume, an error is returned.
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool) error
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool, forceUnmount bool) error
Contributor

The function's comment should be amended to include the new forceUnmount parameter, since the comment currently pretends there is only one bool.

Member

+1

@wongma7
Contributor

wongma7 commented Apr 19, 2017

Overall the patch lgtm. The fishiest part is the use of SplitUniqueName during detach, which relies on a kind of unenforced convention, but we've already decided it's fine.

Jing brings up a good issue: VolumesAttached is not, as its name suggests, the source of truth about what volumes are actually attached to the node. So there's still a hole there that needs plugging. IMO it could be fixed in a follow-up PR; this PR has gotten big enough already and fixes the more likely case, where the deletion happens while the controller is down.

I imagine the fix for Jing's issue will have to either change the behavior around VolumesAttached or force us to introduce a new annotation "VolumesActuallyAttached" altogether, in which case we may as well not co-opt VolumesAttached and serialize the volume specs to the new annotation. But again, we can figure that stuff out in a follow-up.

Member

@saad-ali left a comment

A couple of comments. Overall direction looks fine.

nodeName := types.NodeName(node.Name)
for _, attachedVolume := range node.Status.VolumesAttached {
uniqueName := attachedVolume.Name
err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil, nodeName, attachedVolume.DevicePath)
Member

The nil value for volumeSpec scares me, like @gnufied mentioned earlier. But it sounds like (#39732 (comment)) you are confident it won't break anything.

nit: when passing in a value where it is unclear what the parameter is, add the parameter name as an inline comment, e.g.:

err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil /* volumeSpec */, nodeName, attachedVolume.DevicePath)

nit: add a comment explaining why you think it is OK to pass in nil (e.g. #39732 (comment)).

uniqueName := attachedVolume.Name
err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil, nodeName, attachedVolume.DevicePath)
if err != nil {
glog.Errorf("Error adding node to ActualStateOfWorld: %v", err)
Member

The error message does not seem to reflect the action.

continue
}
adc.processVolumesInUse(nodeName, node.Status.VolumesInUse, true /* forceUnmount */)
adc.desiredStateOfWorld.AddNode(types.NodeName(node.Name)) // Needed for DesiredStateOfWorld population
Member

Check if the node is managed by attach/detach controller:

	if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
		// Node specifies annotation indicating it should be managed by attach
		// detach controller. Add it to desired state of world.
		adc.desiredStateOfWorld.AddNode(nodeName)
	}

continue
}
nodeName := types.NodeName(podToAdd.Spec.NodeName)
plugin, err := adc.volumePluginMgr.FindPluginBySpec(volumeSpec)
Member

Use FindAttachablePluginBySpec; we don't care about non-attachable volumes in this controller.

}
nodeName := types.NodeName(podToAdd.Spec.NodeName)
plugin, err := adc.volumePluginMgr.FindPluginBySpec(volumeSpec)
if err != nil {
Member

Check if the returned plugin is nil:

		if err != nil || attachableVolumePlugin == nil {
			glog.V(10).Infof(
				"Skipping volume %q for pod %q/%q: it does not implement attacher interface. err=%v",
				podVolume.Name,
				pod.Namespace,
				pod.Name,
				err)
			continue
		}

@@ -298,7 +403,7 @@ func (adc *attachDetachController) nodeUpdate(oldObj, newObj interface{}) {
// detach controller. Add it to desired state of world.
adc.desiredStateOfWorld.AddNode(nodeName)
}
adc.processVolumesInUse(nodeName, node.Status.VolumesInUse)
adc.processVolumesInUse(nodeName, node.Status.VolumesInUse, false)
Member

+1

@@ -65,7 +65,7 @@ type ActualStateOfWorld interface {
// returned.
// If no node with the name nodeName exists in list of attached nodes for
// the specified volume, an error is returned.
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool) error
SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool, forceUnmount bool) error
Member

+1

}

// processVolumesInUse processes the list of volumes marked as "in-use"
// according to the specified Node's Status.VolumesInUse and updates the
// corresponding volume in the actual state of the world to indicate that it is
// mounted.
func (adc *attachDetachController) processVolumesInUse(
nodeName types.NodeName, volumesInUse []v1.UniqueVolumeName) {
nodeName types.NodeName, volumesInUse []v1.UniqueVolumeName, forceUnmount bool) {
Member

Update comment to describe new boolean.

// Get attacher plugin and the volumeName by splitting the volume unique name in case
// there's no VolumeSpec: this happens only on attach/detach controller crash recovery
// when a pod has been deleted during the controller downtime
pluginName, volumeName, err = volumehelper.SplitUniqueName(volumeToDetach.VolumeName)
Member

Agreed, pretty funky; this should be improved in the future.

When the attach/detach controller crashes and a pod with an attached PV is deleted
afterwards, the controller will never detach the pod's attached volumes. To
prevent this, the controller should try to recover the state from the node
status.
@tsmetana
Member Author

Thanks for the comments. I hope I addressed them correctly in the new version (just pushed).

@@ -239,12 +245,136 @@ func (adc *attachDetachController) Run(stopCh <-chan struct{}) {
return
}

err := adc.populateActualStateOfWorld()
Member

There is a controller.WaitForCacheSync() just a few lines above. Does that mean that all the existing objects were already synced through the podAdd() and nodeAdd() callbacks from the shared informers? They may be synced in the wrong order (podAdd before nodeAdd), and these populateActualStateOfWorld() and populateDesiredStateOfWorld() calls will just fix that, right?

Member

I talked to @tsmetana and yes, these new world populators will fix states of the world that could be inconsistent due to bad event ordering.

Member Author

The main issue is that we need to find the volumes attached for pods that were deleted during the controller downtime: this would not be solved by WaitForCacheSync() -- we would still be missing the pod deletion event. And yes, in case the events come in the wrong order, the ASW/DSW populators should help too.
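
A rough sketch of the startup ordering being discussed (the function bodies below are placeholders, not the controller's real code): sync the informer caches, rebuild both states of the world, and only then let the reconciler act.

package main

import "fmt"

// Placeholder steps standing in for the controller's real ones; this only
// illustrates the ordering discussed in the review, not the actual API.
func waitForCacheSync() bool { fmt.Println("informer caches synced"); return true }
func populateActualStateOfWorld() error {
	fmt.Println("actual state rebuilt from node.Status.VolumesAttached")
	return nil
}
func populateDesiredStateOfWorld() error {
	fmt.Println("desired state rebuilt from nodes, then pods")
	return nil
}
func runReconciler() { fmt.Println("reconciler started") }

func main() {
	// 1. Wait until the shared informers have delivered the existing objects.
	if !waitForCacheSync() {
		return
	}
	// 2. Repair any inconsistency left by event ordering or by pods deleted
	//    while the controller was down.
	if err := populateActualStateOfWorld(); err != nil {
		fmt.Println("populate ASW:", err)
	}
	if err := populateDesiredStateOfWorld(); err != nil {
		fmt.Println("populate DSW:", err)
	}
	// 3. Only then let the reconciler attach/detach based on the two states.
	runReconciler()
}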

@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 20, 2017

@tsmetana: The following test(s) failed:

Test name | Commit | Details | Rerun command
Jenkins GKE smoke e2e | 99d461335e8a68f2b356f1580658d9e2d69b45c8 | link | @k8s-bot cvm gke e2e test this
Jenkins GCI GKE smoke e2e | 99d461335e8a68f2b356f1580658d9e2d69b45c8 | link | @k8s-bot gci gke e2e test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jsafrane
Member

@k8s-bot cvm gce e2e test this
These are not attach/detach errors.

@saad-ali
Member

Looks good, thanks for your patience @tsmetana!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, tsmetana

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 44722, 44704, 44681, 44494, 39732)
