Fix issue #34242: Attach/detach should recover from a crash #39732
Conversation
Assigning to @jingxu97, she knows attach/detach logic better
}

	return nil
}
I think we need some tests to verify that these two methods indeed reconstruct the actual and desired states of the world correctly via API calls.
Right. I just wasn't sure how exactly to test the changes: the biggest problem is that I had to change all the volume plugins, so ideally they should all be tested. But for this part I think I should be able to come up with something. Thanks.
Are there any automated tests for this? If not, what testing was done? Looking into the code change, but at first glance the reconstruction of the volume spec might need fixing for vSphere.
Name: volumeName,
VolumeSource: v1.VolumeSource{
	VsphereVolume: &v1.VsphereVirtualDiskVolumeSource{
		VolumePath: volumeName,
This will need fixing. @BaluDontu
OK. Thanks for the response. I tested the patch on AWS, and the purpose of the PR was mainly to solicit feedback on it, since it's actually my first contribution to kubernetes.
Let me know if you need us to address this code if you plan to merge.
Thanks @kerneltime. I've looked at the plugin code and I really did fill the spec with nonsense, probably in more places than just vSphere... I would definitely appreciate any help: I'm not able to test all the plugins myself. I'm quite sure the ones that don't need mountPath to reconstruct the volumeSpec are OK, but I might need some guidance in the other cases.
Sorry for the failed builds. I need to rebase (and fix) the patches.
Force-pushed from 2ff3103 to ba0a062 (compare)
	return nil
}

func (adc *attachDetachController) populateDesiredStateOfWorld() error {
A problem here is the race condition between this population and nodeAdd. If a pod is added before its node is added to the desired state, the podAdd will fail.
One solution could be to get the node list from kubeClient and add all the nodes to the desired state before adding pods.
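The ordering fix suggested here can be sketched with a toy model. All names and types below are invented for illustration; the real logic lives in the attach/detach controller's desired-state-of-world package. The point is only that listing nodes first makes every subsequent AddPod find its node already present:

```go
package main

import "fmt"

// desiredState is a stand-in for the desired state of the world: AddPod
// fails unless the pod's node was added first, mirroring the race above.
type desiredState struct {
	nodes map[string]bool
}

func (d *desiredState) AddNode(name string) { d.nodes[name] = true }

func (d *desiredState) AddPod(pod, node string) error {
	if !d.nodes[node] {
		return fmt.Errorf("node %q does not exist in desired state", node)
	}
	return nil
}

// populate adds all nodes before any pod, so no AddPod call can race
// ahead of the AddNode for its node.
func populate(d *desiredState, nodes []string, pods map[string]string) error {
	for _, n := range nodes {
		d.AddNode(n)
	}
	for pod, node := range pods {
		if err := d.AddPod(pod, node); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	d := &desiredState{nodes: map[string]bool{}}
	err := populate(d, []string{"node-1"}, map[string]string{"pod-a": "node-1"})
	fmt.Println(err) // no error: every pod's node was added first
}
```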
And the funny part is that I found this out myself and circumvented it in the unit tests (so they pass)... Will fix that.
	glog.Errorf("could not construct volume spec in plugin %s for volume %s: %v", pluginName, volumeName, err)
	continue
}
_, err = adc.actualStateOfWorld.AddVolumeNode(volumeSpec, nodeName, attachedVolume.DevicePath)
One thing related to the AddVolumeNode() function: by default it sets mountedByNodeSetCount: 0. This should not cause anything dangerous to happen, but based on the current logic it might delay the reconciler's detach. I am OK with this, but it might be worth putting in a note mentioning it. Or you could get VolumesInUse and update this field.
This is actually what I saw in testing: the volume gets detached after some timeout (6 minutes, I think). After discussing with @jsafrane, this should be addressed too.
Could you capitalize the first letter of the error messages, for consistency?
pkg/volume/aws_ebs/aws_ebs.go (outdated)
Name: volName,
VolumeSource: v1.VolumeSource{
	AWSElasticBlockStore: &v1.AWSElasticBlockStoreVolumeSource{
		VolumeID: volName,
The Name: volName is not the real one, but in our code this Name is not actually used anywhere. Maybe put a note about it here in case the information is needed in the future.
	glog.Errorf("could not find plugin named %s for volume %s: %v", pluginName, volumeName, err)
	continue
}
volumeSpec, err := plug.ConstructVolumeSpecFromName(volumeName)
The ConstructVolumeSpecFromName function fakes volumeSource information for almost all volume plugins. The main issue is that this information will be used by VerifyVolumesAreAttached(). If it (e.g., the volumeID) is wrong, the function will return wrong information about whether volumes are attached or not.
One way to avoid this is to populate the desired state first, which contains this information, and then find the matching volume and populate the spec from the desired state. But the volume spec information might not be recoverable if the pod is deleted, so the desired state does not have it either. PR #40239 might help solve this issue, but that PR was reverted. Will follow up on its status.
Yes, ConstructVolumeSpecFromName fills the spec with wrong data; it's been noted above for the vSphere plugin. This is the most problematic part of the patch, since I really had no idea how to fix it properly. I'll take a look at the other PR. (Or if somebody comes up with an idea, I'm all ears.) Thanks.
It looks to me like PR #40239 would remove the need for recovery in the case where a crash happens before unmounting has succeeded, because then the pod wouldn't be allowed to be deleted in the first place, and the controller could find it and populate its states as usual.
But if a crash happens after unmounting has already succeeded and before the detach happens, then recovery is still needed, because the pod would be allowed to be deleted and won't be around to inspect.
So if the names from node.Status are not enough for reconstruction, you may have to serialize and store entire volume specs somewhere? Maybe as ugly annotations on the node? I have no idea what the "proper" way to do it is.
I went through the plugin code wondering why the volumes get detached in my tests, since I really don't put the correct data into the spec. It seems it works by accident: I use the unique volume name from the node and split it into the plugin name and volume name. The volume name incidentally ends with the same component (path-wise) as the mount path. The Detach() plugin methods usually start by extracting the last component of their first argument. If this is the volume name I have, the result is identical to what it would be with the correct volume source, and the detach operation succeeds (@kerneltime -- I think this is the same in the vSphere case too).
The question is whether I should document this in the code and rely on this behaviour, or whether some deeper re-design of the AttachDetach controller would be required.
Sorry for the confusion about the names. You are right: the UniqueVolume is constructed as pluginName/volumeSourceName. This volumeSourceName comes from plugin.GetVolumeName(), which takes source.volumeID (or other names) from the different plugins. So when you call ConstructVolumeSpecFromName() as below, the volumeID is actually the volumeName you extract from SplitUniqueName(). But the meta Name: volumeName for the Spec is wrong. I checked our code and didn't find anywhere this meta name is actually used. So, overall, your approach should work.
Thanks. I'm actually trying one more thing: storing the string that I'm unable to generate from the volume name in the node status as well. This would be more robust; however, it's an API change (a small one, though).
Can you provide more details about your planned API change? I actually don't recommend having an API change in this PR unless it is really necessary, because getting one approved is a very complicated and long process.
Right now the node status looks like this:
status:
  volumesAttached:
  - devicePath: /dev/vdb
    name: kubernetes.io/...
So I was thinking of adding one more string there, containing the part I'm unable to reconstruct from the name (e.g., the PDName in the case of gce_pd):
status:
  volumesAttached:
  - devicePath: /dev/vdb
    name: kubernetes.io/plugin/volume_name
    detachVolumeName: whatever_the_plugin_needs
I talked to @jsafrane today; he also thinks getting the API change approved would be quite complicated, and suggested storing this data in the node annotations. Which I suppose would work too...
@tsmetana what is the status of the PR? Are there still some tests failing?
Also, I am thinking this PR might not solve the following issue: because the controller deletes the volume from the "VolumesAttached" list before actually performing the detach operation, there is still a chance that we lose the information about the volume while it is still attached. The events that might happen are:
Sorry for the late response... The latest commit seems to be working OK. And yes, you're right -- there are more races affecting attach/detach and this patch fixes just one of them.
Rebased and squashed; included fixes by @wongma7 (thanks!).
@@ -65,7 +65,7 @@ type ActualStateOfWorld interface {
 	// returned.
 	// If no node with the name nodeName exists in list of attached nodes for
 	// the specified volume, an error is returned.
-	SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool) error
+	SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool, forceUnmount bool) error
The function's comment should be amended to include the new forceUnmount parameter, since at the moment the comment pretends there's only one bool.
+1
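One possible wording for the amended comment is sketched below. The types are local placeholders standing in for v1.UniqueVolumeName and types.NodeName so the snippet is self-contained, and the description of forceUnmount is only a plausible reading of the PR's intent, not the authoritative doc text:

```go
package main

// Placeholder aliases for the real v1.UniqueVolumeName and types.NodeName.
type UniqueVolumeName string
type NodeName string

type ActualStateOfWorld interface {
	// SetVolumeMountedByNode sets the MountedByNode value for the given
	// volume and node. When forceUnmount is true, the mounted value is
	// applied even if the node has not yet confirmed an unmount, allowing
	// the reconciler to proceed with detach during crash recovery.
	// If no volume with the name volumeName exists in the list of attached
	// volumes, an error is returned.
	// If no node with the name nodeName exists in the list of attached
	// nodes for the specified volume, an error is returned.
	SetVolumeMountedByNode(volumeName UniqueVolumeName, nodeName NodeName, mounted bool, forceUnmount bool) error
}

// fakeASW is a trivial implementation showing the interface compiles.
type fakeASW struct{}

func (fakeASW) SetVolumeMountedByNode(volumeName UniqueVolumeName, nodeName NodeName, mounted bool, forceUnmount bool) error {
	return nil
}

func main() {
	var _ ActualStateOfWorld = fakeASW{}
}
```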
Overall the patch lgtm. The fishiest part is the use of SplitUniqueName during detach, which relies on a kind of unenforced convention, but we've already decided it's fine. Jing brings up a good issue: VolumesAttached is not, as its name suggests, the source of truth about what volumes are actually attached to the node. So there's still a hole there that needs plugging. IMO it could be fixed in a follow-up PR; this PR has gotten big enough already and fixes the more likely case, where deletion is done while the controller is down. I imagine the fix for Jing's issue will either have to change the behavior around VolumesAttached or force us to introduce a new annotation, "VolumesActuallyAttached", altogether, in which case we may as well not co-opt VolumesAttached and serialize the volume specs to the new annotation. But again, we can figure that stuff out in a follow-up.
A couple of comments. Overall direction looks fine.
nodeName := types.NodeName(node.Name)
for _, attachedVolume := range node.Status.VolumesAttached {
	uniqueName := attachedVolume.Name
	err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil, nodeName, attachedVolume.DevicePath)
The nil value for volumeSpec scares me, like @gnufied mentioned earlier. But it sounds like (#39732 (comment)) you are confident it won't break anything.
nit: when passing in a value where it is unclear what the parameter is, add the parameter name as an inline comment, e.g.:
err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil /* volumeSpec */, nodeName, attachedVolume.DevicePath)
nit: add a comment explaining why you think it is OK to pass in nil (e.g. #39732 (comment)).
uniqueName := attachedVolume.Name
err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil, nodeName, attachedVolume.DevicePath)
if err != nil {
	glog.Errorf("Error adding node to ActualStateOfWorld: %v", err)
The error message doesn't seem to reflect the action.
	continue
}
adc.processVolumesInUse(nodeName, node.Status.VolumesInUse, true /* forceUnmount */)
adc.desiredStateOfWorld.AddNode(types.NodeName(node.Name)) // Needed for DesiredStateOfWorld population
Check if the node is managed by the attach/detach controller:
if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
	// Node specifies annotation indicating it should be managed by attach
	// detach controller. Add it to desired state of world.
	adc.desiredStateOfWorld.AddNode(nodeName)
}
	continue
}
nodeName := types.NodeName(podToAdd.Spec.NodeName)
plugin, err := adc.volumePluginMgr.FindPluginBySpec(volumeSpec)
Use FindAttachablePluginBySpec: we don't care about non-attachable volumes in this controller.
}
nodeName := types.NodeName(podToAdd.Spec.NodeName)
plugin, err := adc.volumePluginMgr.FindPluginBySpec(volumeSpec)
if err != nil {
Check if the returned plugin is nil:
if err != nil || attachableVolumePlugin == nil {
	glog.V(10).Infof(
		"Skipping volume %q for pod %q/%q: it does not implement attacher interface. err=%v",
		podVolume.Name,
		pod.Namespace,
		pod.Name,
		err)
	continue
}
@@ -298,7 +403,7 @@ func (adc *attachDetachController) nodeUpdate(oldObj, newObj interface{}) {
 	// detach controller. Add it to desired state of world.
 	adc.desiredStateOfWorld.AddNode(nodeName)
 	}
-	adc.processVolumesInUse(nodeName, node.Status.VolumesInUse)
+	adc.processVolumesInUse(nodeName, node.Status.VolumesInUse, false)
+1
@@ -65,7 +65,7 @@ type ActualStateOfWorld interface {
 	// returned.
 	// If no node with the name nodeName exists in list of attached nodes for
 	// the specified volume, an error is returned.
-	SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool) error
+	SetVolumeMountedByNode(volumeName v1.UniqueVolumeName, nodeName types.NodeName, mounted bool, forceUnmount bool) error
+1
}

// processVolumesInUse processes the list of volumes marked as "in-use"
// according to the specified Node's Status.VolumesInUse and updates the
// corresponding volume in the actual state of the world to indicate that it is
// mounted.
func (adc *attachDetachController) processVolumesInUse(
-	nodeName types.NodeName, volumesInUse []v1.UniqueVolumeName) {
+	nodeName types.NodeName, volumesInUse []v1.UniqueVolumeName, forceUnmount bool) {
Update the comment to describe the new boolean.
// Get attacher plugin and the volumeName by splitting the volume unique name in case
// there's no VolumeSpec: this happens only on attach/detach controller crash recovery
// when a pod has been deleted during the controller downtime
pluginName, volumeName, err = volumehelper.SplitUniqueName(volumeToDetach.VolumeName)
Agreed, pretty funky; this should be improved in the future.
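The "unenforced convention" being leaned on here is just that a unique volume name is "pluginName/volumeName", with the plugin name itself containing a slash (e.g. "kubernetes.io/gce-pd"). A minimal stand-in for the real volumehelper.SplitUniqueName helper, for illustration only:

```go
package main

import (
	"fmt"
	"strings"
)

// splitUniqueName sketches the split of a unique volume name of the form
// "<pluginName>/<volumeName>". The plugin name contains its own slash, so
// the volume name is everything after the last separator. This is an
// illustrative stand-in, not the real volumehelper implementation.
func splitUniqueName(uniqueName string) (pluginName, volumeName string, err error) {
	idx := strings.LastIndex(uniqueName, "/")
	if idx <= 0 || idx == len(uniqueName)-1 {
		return "", "", fmt.Errorf("cannot split unique volume name %q", uniqueName)
	}
	return uniqueName[:idx], uniqueName[idx+1:], nil
}

func main() {
	plugin, volume, err := splitUniqueName("kubernetes.io/gce-pd/my-disk")
	fmt.Println(plugin, volume, err)
}
```

Nothing in the type system forces GetVolumeName() to produce names that round-trip through this split, which is exactly why the reviewers call the convention unenforced.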
When the attach/detach controller crashes and a pod with an attached PV is deleted afterwards, the controller will never detach the pod's attached volumes. To prevent this, the controller should try to recover the state from the node status.
Thanks for the comments. I hope I addressed them correctly in the new version (just pushed).
@@ -239,12 +245,136 @@ func (adc *attachDetachController) Run(stopCh <-chan struct{}) {
 		return
 	}

+	err := adc.populateActualStateOfWorld()
There is controller.WaitForCacheSync() just a few lines above. Does that mean all the existing objects were already synced through the podAdd() and nodeAdd() callbacks from the shared informers? They may be synced in the wrong order (podAdd before nodeAdd), and these populateActualStateOfWorld() and populateDesiredStateOfWorld() calls will just fix it, right?
I talked to @tsmetana, and yes, these new world populators will fix states of the world that could be inconsistent due to bad event ordering.
The main issue is that we need to find the volumes attached for the pods that were deleted during the controller downtime: this would not be solved by WaitForCacheSync() -- we would still be missing the pod deletion event. And yes, in case the events come in the wrong order, the ASW/DSW populators should help too.
@tsmetana: The following test(s) failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@k8s-bot cvm gce e2e test this
Looks good, thanks for your patience @tsmetana! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: saad-ali, tsmetana
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
Automatic merge from submit-queue (batch tested with PRs 44722, 44704, 44681, 44494, 39732)
When the attach/detach controller crashes and a pod with an attached PV is deleted afterwards, the controller will never detach the pod's attached volumes. To prevent this, the controller should try to recover the state from the node status and figure out which volumes to detach. This requires some changes in the volume providers too: the only information available from the nodes is the volume name and the device path. The controller needs to find the correct volume plugin and reconstruct the volume spec just from the name. This also required a small change in the volume plugin interface.
Fixes Issue #34242.
cc: @jsafrane @jingxu97