
Fix uncertain mounts #82492

Merged
merged 15 commits into kubernetes:master from gnufied:fix-uncertain-mounts on Dec 17, 2019

Conversation

gnufied (Member) commented Sep 9, 2019

Add uncertain state for volume mounts which will handle cleanup of CSI mounts correctly.

Fixes #82426 and #82340 and #83129

This does not fix the similar problem associated with raw block devices. We are blocked on #74026 until we can fix raw block devices.

This PR fixes several issues by tracking certain MountDevice and SetUp failures as uncertain and adding them to actualStateOfWorld, so that the volume can be cleaned up correctly when the Pod is no longer scheduled on the node. Previously these failures caused the volume to not be added to actualStateOfWorld at all, and hence the volume could actually get mounted successfully but never get cleaned up properly.

It does so by returning errors with additional context. The SetUp and MountDevice functions now return the following types of errors:

//   - TransientOperationFailure - operation failed with a transient error and could succeed if retried
//   - UncertainProgressError - operation failed but was started by the reconciler and hence may be in progress
//   - An error of any other type should be considered a final error and causes the volume to not be added to ASOW
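
As a rough illustration of the intended control flow (a standalone sketch, not the actual operation-executor or volume-plugin code; the error types and helper here are stand-ins for the types this PR adds):

package main

import (
    "errors"
    "fmt"
)

// uncertainProgressError models UncertainProgressError: the operation was
// started and may still be in progress on the node.
type uncertainProgressError struct{ msg string }

func (e *uncertainProgressError) Error() string { return e.msg }

// transientOperationFailure models TransientOperationFailure: the operation
// failed before doing anything on the node and can simply be retried.
type transientOperationFailure struct{ msg string }

func (e *transientOperationFailure) Error() string { return e.msg }

// handleMountError decides what the caller should do with actualStateOfWorld
// (ASOW) based on which class of error a MountDevice/SetUp call returned.
func handleMountError(err error) string {
    var uncertain *uncertainProgressError
    var transient *transientOperationFailure
    switch {
    case errors.As(err, &uncertain):
        // Record the mount as uncertain so teardown/UnmountDevice still run
        // later, even though the mount never reported success.
        return "add volume to ASOW as uncertain; retry later"
    case errors.As(err, &transient):
        // Nothing was started on the node; do not touch ASOW, just retry.
        return "do not add to ASOW; retry later"
    default:
        // Final error: the mount is known to have failed, so it is safe to
        // leave the volume out of ASOW and clean up the mount path.
        return "final error: do not add to ASOW; clean up mount path"
    }
}

func main() {
    fmt.Println(handleMountError(&uncertainProgressError{"NodeStageVolume timed out"}))
    fmt.Println(handleMountError(&transientOperationFailure{"failed to fetch secret"}))
    fmt.Println(handleMountError(errors.New("NodeStageVolume: invalid argument")))
}

The key point is that only the final-error branch may clean up mount paths; the uncertain branch must leave them in place and record the volume so that teardown and UnmountDevice run later.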
Track mount operations as uncertain if operation fails with non-final error
@k8s-ci-robot k8s-ci-robot added kind/bug and removed needs-kind labels Sep 9, 2019
@gnufied gnufied force-pushed the gnufied:fix-uncertain-mounts branch from 392d7fa to 0dfdd5c Sep 9, 2019
@k8s-ci-robot k8s-ci-robot added size/XXL and removed size/XL labels Sep 9, 2019
@gnufied gnufied changed the title {WIP} Fix uncertain mounts Fix uncertain mounts Sep 9, 2019
@gnufied gnufied changed the title Fix uncertain mounts {WIP} Fix uncertain mounts Sep 9, 2019
@gnufied gnufied force-pushed the gnufied:fix-uncertain-mounts branch from 0dfdd5c to 8c4f1cf Sep 9, 2019
@k8s-ci-robot k8s-ci-robot added size/XL and removed size/XXL labels Sep 9, 2019
@gnufied gnufied changed the title {WIP} Fix uncertain mounts Fix uncertain mounts Sep 9, 2019
@gnufied gnufied force-pushed the gnufied:fix-uncertain-mounts branch 2 times, most recently from e3f5700 to bae901f Sep 9, 2019
jingxu97 (Contributor) commented Dec 3, 2019

Overall looks good to me. A few small comments. Since this is a big change, maybe add more design details in the PR comment section.

Use typed errors rather than operation status for
indicating operation progress
@gnufied gnufied force-pushed the gnufied:fix-uncertain-mounts branch from 6cac31c to 4b8e552 Dec 4, 2019
gnufied (Member, Author) commented Dec 4, 2019

@jingxu97 added additional comments to the PR. ptal

if volumetypes.IsOperationFinishedError(mountError) &&
    actualStateOfWorld.GetDeviceMountState(volumeToMount.VolumeName) == DeviceMountUncertain {
    // Only devices which were uncertain can be marked as unmounted
    markDeviceUnmountError := actualStateOfWorld.MarkDeviceAsUnmounted(volumeToMount.VolumeName)

jingxu97 (Contributor) commented Dec 4, 2019

My thought is: if you don't mark the device as unmounted when it was previously marked as uncertain, then you could simplify the logic; there would be no need for TransientOperationFailure. MarkDeviceAsUnmounted will be triggered during the volume tear-down phase.
What do you think?

gnufied (Member, Author) commented Dec 5, 2019

There are two reasons I am not very sure about taking this path:

  1. The volumemanager has been designed not to track volumes that are not mounted (or for which the mount failed with 100% certainty). Now we are saying we will keep volumes which failed to mount in the "uncertain" state, and hence they will require "teardown" and "unmountdevice" calls. This introduces an additional design change into the volumemanager, and I hope that no plugin breaks (or fails the unmount call) when unmount is called for failed mounts. My intention has been to stick to code paths which are known to work.

  2. The second reason: what are we supposed to return when a transient error does happen in mount calls?

    • Mount is called and kubelet fails to fetch the secret (on the first attempt): if we return UncertainProgressError here, then the volume will be added to ASOW in the uncertain state even though the mount call has not even been attempted. It seems wrong to accept such a volume into ASOW. This PR also moved the secret-fetching call up, so if there is an error fetching the secret the call returns early without even creating the volume directories.
    • If we return a regular error for transient errors, that would cause the volume directories to be removed. This is a problem if the previous error was UncertainProgressError; we do not want a transient error to remove the volume directories while a previous mount operation may still be in progress.

Given the above reasons, IMO having a transient error type is not problematic. It makes it easier to reason about the state of the volumemanager and mount failures.
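
To make the ordering argument concrete, here is a toy SetUp-style function (a minimal sketch; fetchSecret and nodePublish are hypothetical stubs and the sentinel errors stand in for the typed errors, this is not the csi_mounter code) showing where each error class would be produced:

package main

import (
    "errors"
    "fmt"
    "os"
    "path/filepath"
)

// Sentinels standing in for the typed errors this PR adds.
var errTransient = errors.New("transient: failed to fetch secret, safe to retry")
var errUncertain = errors.New("uncertain: NodePublishVolume may still be in progress")

// Hypothetical stubs for the external calls; they always fail here just to
// exercise the error paths.
func fetchSecret() (string, error)         { return "", errors.New("apiserver unavailable") }
func nodePublish(dir, secret string) error { return errors.New("rpc deadline exceeded") }

func setUp(dir string) error {
    // 1. Fetch the secret before touching the node. A failure here is
    //    transient: nothing was started, so there is nothing to clean up and
    //    nothing to record in actualStateOfWorld.
    secret, err := fetchSecret()
    if err != nil {
        return errTransient
    }
    // 2. Only after the transient preconditions succeed is the volume
    //    directory created, so a transient retry never leaves directories behind.
    if err := os.MkdirAll(dir, 0750); err != nil {
        return err // final error
    }
    // 3. The node call may time out while the driver keeps working; in this
    //    sketch any failure from it is reported as uncertain for simplicity.
    if err := nodePublish(dir, secret); err != nil {
        return errUncertain
    }
    return nil
}

func main() {
    fmt.Println(setUp(filepath.Join(os.TempDir(), "vol")))
}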

gnufied (Member, Author) commented Dec 5, 2019

Updated the second case: since we moved the secret-fetching code to the top of the function, the defer is not executed if fetching the secret fails. But the main point IMO still stands.

jingxu97 (Contributor) commented Dec 6, 2019

The volumemanager has been designed not to track volumes that are not mounted (or for which the mount failed with 100% certainty). Now we are saying we will keep volumes which failed to mount in the "uncertain" state, and hence they will require "teardown" and "unmountdevice" calls. This introduces an additional design change into the volumemanager, and I hope that no plugin breaks (or fails the unmount call) when unmount is called for failed mounts. My intention has been to stick to code paths which are known to work.

For this one, the plugin actually has to deal with "teardown while the previous mount is still in progress, or the mount has actually failed already" after adding this uncertain state, I think. The sequence of events would be:

  1. The volume setup call returns with a timeout error; the volume is added to the state.
  2. The pod is deleted and volume teardown is called (meanwhile the mount might still be in progress, might have failed, or might have succeeded).

Mount is called and kubelet fails to fetch the secret (on the first attempt): if we return UncertainProgressError here, then the volume will be added to ASOW in the uncertain state even though the mount call has not even been attempted. It seems wrong to accept such a volume into ASOW. This PR also moved the secret-fetching call up, so if there is an error fetching the secret the call returns early without even creating the volume directories.

I thought mount is called after fetching the secret?

gnufied (Member, Author) commented Dec 12, 2019

I thought mount is called after fetching the secret?

I have been thinking about this and I think we still have a problem. Let's say MountDevice is called and the first attempt returns an UncertainProgressError, which causes the volume to be added in the uncertain state. The second attempt, however, fails with a final error (not a secret-fetching error; let's say NodeStage itself failed). Then, because the error was final, the error-handling path around

klog.Errorf(log("attacher.MountDevice failed: %v", err))
will remove the deviceMountPath from the node. But the volume is still in the "uncertain" state and must be cleaned up, and the cleanup will throw an error because the data directory inside deviceMountPath has already been removed.

gnufied (Member, Author) commented Dec 12, 2019

What do we actually gain by removing TransientOperationFailure from this PR?

jingxu97 (Contributor) commented Dec 13, 2019

My thought was: if the logic works correctly both with and without the transient error type, I prefer going without it for simplicity. Then we don't need to worry about other people working on this code later having to understand this logic well and set the correct error types.

But as you point out, if the mount fails, the metadata will be cleaned up. Then, when the reconciler calls unmount, it will fail to get the metadata. The logic here helps remove the volume from the actual state cache, so unmount does not need to be triggered.

So I think this is good. Just make sure we add enough comments about the different types of errors.

@@ -254,10 +257,13 @@ func (c *csiMountMgr) SetUpAt(dir string, mounterArgs volume.MounterArgs) error
)

if err != nil {
if removeMountDirErr := removeMountDir(c.plugin, dir); removeMountDirErr != nil {
klog.Error(log("mounter.SetupAt failed to remove mount dir after a NodePublish() error [%s]: %v", dir, removeMountDirErr))
// If operation finished with error then we can remove the mount directory.

jingxu97 (Contributor) commented Dec 6, 2019

MkdirAll() can be called right before the NodePublishVolume call, right? Then none of the earlier error paths would need to worry about removing the dir.

jsafrane (Member) left a comment

Tested with the mock driver and found these issues:

  • User deletes a pod, NodeUnstage times out (it never succeeds), user re-creates the pod -> only NodePublish is called; NodeStage was expected to run again and fix the uncertain Stage.

  • User force-deletes a pod, NodeUnpublish times out (it never succeeds), user re-creates the pod -> NodePublish is called, followed by NodeUnpublish. The pod gets Running even though kubelet tries to Unpublish the volume.

    gRPCCall: {"Method":"/csi.v1.Node/NodeUnpublishVolume",... "Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"}
    gRPCCall: {"Method":"/csi.v1.Node/NodeUnpublishVolume",... "Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"}
    gRPCCall: {"Method":"/csi.v1.Node/NodeUnpublishVolume",... "Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"}
    
    < pod created here >
    
    gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities",... "Error":""}
    gRPCCall: {"Method":"/csi.v1.Node/NodePublishVolume","Request":{"volume_id":"1","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/mypv/globalmount","target_path":"/var/lib/kubelet/pods/6c5f730f-ad66-4cdf-abec-c55a8c04b033/volumes/kubernetes.io~csi/mypv/mount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_context":{"name":"Mock Volume 1","storage.kubernetes.io/csiProvisionerIdentity":"1532709836099-8081-csi-mock"}},"Response":{},"Error":""}
    gRPCCall: {"Method":"/csi.v1.Node/NodeUnpublishVolume","Request":{"volume_id":"1","target_path":"/var/lib/kubelet/pods/356b4361-7e05-47cc-beaa-e649363d1c5e/volumes/kubernetes.io~csi/mypv/mount"},"Response":null,"Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"}
    

    edit: my bad, this is correct. Kubelet published to a new pod dir and is still trying to unpublish the old pod dir.

gnufied (Member, Author) commented Dec 12, 2019

@jsafrane I think we filed fixing the unmount code path as a follow-up issue, since this PR was getting too big as it is - #85248

gnufied (Member, Author) commented Dec 12, 2019

/retest

jsafrane (Member) commented Dec 13, 2019

I think we filed fixing the unmount code path as a follow-up issue, since this PR was getting too big as it is - #85248

#85248 mentions only the NodeUnpublish timeout, not NodeUnstage. If I remember correctly, the old implementation of this PR called NodeStage after a NodeUnstage timeout or error when a new pod appeared. See the last test in #82492 (comment)

jingxu97 (Contributor) commented Dec 13, 2019

@jsafrane In your test case, you mentioned the local pod volume dir is created even when MountDevice times out. Is this still the case? Because from the logic of the code, if MountDevice returns an error, it should not move to the MountVolume step.

gnufied (Member, Author) commented Dec 13, 2019

@jingxu97 that certainly would still be the case. As I mentioned in #84878, just instantiating the CSI mounter results in creation of the pod-local volume directories, and they are left over even if MountDevice errors out. This is a bug; it exists in current k8s, this PR does not fix it, and hence I filed a separate bug to fix it.

@jsafrane Are you sure NodeStage is not being called if Unstage times out but NodeUnpublish succeeds? The way I see it, nothing in the code would prevent calling NodeStage if the volume was unmounted via NodeUnpublish, because stage and mount operations are done as part of the same operation and the code does not check whether the volume is marked as "globally mounted" before calling NodeStage. Here are my tests:

Pod is deleted and NodeUnpublish is successful:

gRPCCall: {"Method":"/csi.v1.Node/NodeUnpublishVolume","Request":{"volume_id":"4","target_path":"/var/lib/kubelet/pods/67560ff7-aa33-4d08-bc45-e74ba188603e/volumes/kubernetes.io~csi/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/mount"},"Response":{},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeUnstageVolume","Request":{"volume_id":"4","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/globalmount"},"Response":null,"Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"
}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeUnstageVolume","Request":{"volume_id":"4","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/globalmount"},"Response":null,"Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"
}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeUnstageVolume","Request":{"volume_id":"4","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/globalmount"},"Response":null,"Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"
}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeUnstageVolume","Request":{"volume_id":"4","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/globalmount"},"Response":null,"Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"
}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeUnstageVolume","Request":{"volume_id":"4","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/globalmount"},"Response":null,"Error":"rpc error: code = DeadlineExceeded desc = Mock timeout"
}

Pod is re-created on same node:

gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeStageVolume","Request":{"volume_id":"4","publish_context":{"device":"/dev/mock","readonly":"false"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/globalmount","volume_capability":{"A
ccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"name":"pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1","storage.kubernetes.io/csiProvisionerIdentity":"1576265141256-8081-io.kubernetes.storage.mock"}},"Response":{},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodePublishVolume","Request":{"volume_id":"4","publish_context":{"device":"/dev/mock","readonly":"false"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/globalmount","target_path":"/var/l
ib/kubelet/pods/9fc9e8bf-b47f-41d1-bf1e-1c278147f72d/volumes/kubernetes.io~csi/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"name":"pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1",
"storage.kubernetes.io/csiProvisionerIdentity":"1576265141256-8081-io.kubernetes.storage.mock"}},"Response":{},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{}}},{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]},"Error":""}
gRPCCall: {"Method":"/csi.v1.Node/NodeGetVolumeStats","Request":{"volume_id":"4","volume_path":"/var/lib/kubelet/pods/9fc9e8bf-b47f-41d1-bf1e-1c278147f72d/volumes/kubernetes.io~csi/pvc-bafb07c3-6f11-4985-9903-6492e46f5dc1/mount"},"Response":{"usage":[{"total":1073741824,"unit":1}]},"
Error":""}

Having said that, the reason NodeStage is being called even though NodeUnstage never succeeded is that the mock driver does not create a mount point on deviceMountPath. If it did, then https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_attacher.go#L235 would cause MountDevice to exit without ever calling NodeStage. So this is still a bug IMO and should be fixed, because for certain CSI driver implementations this will cause NodeStage to never be called.

msau42 (Member) commented Dec 13, 2019

I think we have a separate problem of assuming things are mount points: 1) IsLikelyNotMountPoint doesn't work for bind mounts on the rootfs, so this is problematic for a lot of local-based CSI drivers; 2) the CSI spec technically says these paths should be mount points, but k8s has been pretty lax about this, so we would probably start breaking a lot of drivers if we tried to enforce it.
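
For context on point 1: that heuristic decides by comparing the device number of a path with the device number of its parent directory, which is exactly what a bind mount on the same filesystem defeats. A minimal standalone sketch of the check (written from a general understanding of the Linux behaviour, not copied from the Kubernetes mount utilities):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

// isLikelyNotMountPoint returns true when path appears NOT to be a mount
// point, i.e. it lives on the same device as its parent directory. A bind
// mount of a directory onto another directory on the same filesystem keeps
// the parent's device number, so this check wrongly reports "not a mount
// point" for it; that is the problem described in point 1.
func isLikelyNotMountPoint(path string) (bool, error) {
    st, err := os.Stat(path)
    if err != nil {
        return true, err
    }
    parent, err := os.Stat(filepath.Dir(path))
    if err != nil {
        return true, err
    }
    return st.Sys().(*syscall.Stat_t).Dev == parent.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
    // /proc is a real mount point on most Linux systems, so this prints false.
    fmt.Println(isLikelyNotMountPoint("/proc"))
}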

gnufied (Member, Author) commented Dec 13, 2019

Can we safely call NodeStage even though deviceMountPath is already a mount point? Since all CSI calls are supposed to be idempotent, it should be safe to do that. We might be able to remove the mount-point check entirely before calling the NodeStage RPC.
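
A minimal sketch of what leaning on idempotency instead of the pre-check would look like (all names here are hypothetical stubs, not the csi_attacher code; the only property relied on is the idempotency the CSI spec requires of NodeStageVolume):

package main

import "fmt"

// nodeStageVolume stands in for the NodeStageVolume RPC. The CSI spec
// requires it to be idempotent: staging an already-staged path must return
// success without redoing (or corrupting) the mount.
func nodeStageVolume(stagingPath string, alreadyStaged map[string]bool) error {
    if alreadyStaged[stagingPath] {
        return nil // no-op: the path is already staged
    }
    alreadyStaged[stagingPath] = true
    return nil
}

func main() {
    staged := map[string]bool{}
    path := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/mypv/globalmount"

    // No mount-point check before the call: even if a previous
    // NodeUnstageVolume timed out and the path is still mounted, repeating
    // NodeStageVolume is safe.
    fmt.Println(nodeStageVolume(path, staged)) // first call stages the volume
    fmt.Println(nodeStageVolume(path, staged)) // repeated call is a no-op
}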

jingxu97 (Contributor) commented Dec 13, 2019

This PR is lgtm. Thanks a lot @gnufied for putting great effort into this!

Just waiting for the other reviewers to pass it too.

jsafrane (Member) commented Dec 17, 2019

Filed #86360 for Unstage timeout handling

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 17, 2019
k8s-ci-robot (Contributor) commented Dec 17, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gnufied (Member, Author) commented Dec 17, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 40df9f8 into kubernetes:master Dec 17, 2019
18 checks passed

cla/linuxfoundation gnufied authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
pull-kubernetes-e2e-gce-storage-snapshot Job succeeded.
pull-kubernetes-e2e-kind Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.