NodeVolumeUnpublish: tolerate repeated requests #139

okartau · 2020-01-15T08:42:04Z

What type of PR is this?
/kind failing-test

What this PR does / why we need it:
NodeVolumeUnpublish handling has to be idempotent,
i.e. it has to tolerate repeated requests.
That means we cannot attempt umount and return a failure
without first checking whether the target path really is a mount point.

Which issue(s) this PR fixes:
Fixes #105

Special notes for your reviewer:
Consider this WIP: for now I intentionally replicated the mount check code
in both case: blocks, and that's not meant to be final.
I did it this way to avoid a semantic change to the current switch block, which
already replicates code for the two cases.
This switch does not have default (should it have?), i.e. we assume that
VolAccessType has to be one of two handled by case: blocks?
Does that mean we can move the mount check code before the switch block?
But for that to be correct, we should probably consider adding a default statement to the switch.
Note that two case: parts already have most of code identical
(we can use targetPath instead of req.GetTargetPath() in 2nd case)

Does this PR introduce a user-facing change?:

NONE

k8s-ci-robot · 2020-01-15T08:42:13Z

Hi @okartau. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly · 2020-01-15T09:18:07Z

/ok-to-test

pohly · 2020-01-15T09:18:24Z

/ok-to-test

pohly · 2020-01-21T09:29:59Z

> This switch does not have default (should it have?), i.e. we assume that
> VolAccessType has to be one of two handled by case: blocks?

Better add a default.

> Note that two case: parts already have most of code identical
> (we can use targetPath instead of req.GetTargetPath() in 2nd case)

Please try to avoid code duplication if possible.

okartau · 2020-01-23T09:14:41Z

I added a default case to that switch block, which allowed moving the common replicated parts out of the switch block. I also moved the "is it a mount point" check to happen before the switch, without replication.

Although @pohly and I discussed offline that more idempotency safety via mutex locking could be added in the same pass as this PR, I think it's simpler to take smaller steps: this PR solely improves the ability to withstand repeated single-threaded operations, as can be demonstrated by kubernetes-csi/csi-test#229 once it is merged.
The locking-based improvement is a different story: it will help with test cases that don't exist yet but can be added, making use of parallel requests. Thus, I propose to merge this PR first (which enables merging the csi-test change that adds repeated tests), and then move forward with the locking changes.

pohly · 2020-01-27T12:31:08Z

pkg/hostpath/nodeserver.go

@@ -204,24 +204,27 @@ func (ns *nodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpu
 		return nil, status.Error(codes.NotFound, err.Error())
 	}

+	// Check if the target path is really a mount point. If its not a mount point do nothing


Nit: "If it's not a mount point do nothing."

okartau (2020-01-31): On a closer look, we say "if (not) mount point" twice, which is one too many. The same message can be given in one statement without repeating the same words. Combining that with another improvement idea from below, I changed that comment to "Unmount only if the target path is really a mount point."

pohly · 2020-01-27T12:33:00Z

pkg/hostpath/nodeserver.go

+	// Check if the target path is really a mount point. If its not a mount point do nothing
+	if notMnt, err := mount.IsNotMountPoint(mount.New(""), targetPath); notMnt || err != nil && !os.IsNotExist(err) {
+		glog.V(4).Infof("hostpath: %s is not mount point, skip", targetPath)
+		return &csi.NodeUnpublishVolumeResponse{}, nil


If it is not mounted, we can only skip unmounting. We should not skip the rest of the function because a previous NodeUnpublishVolume might have been interrupted after unmounting and before removing the target path.

okartau (2020-01-31): If we only skip unmounting and proceed, we have to make sure that calling os.RemoveAll repeatedly is safe. It seems os.RemoveAll returns an error if called on a non-existent directory (is that so?), which would mean we have to check for existence before calling it?

pohly (2020-02-03): Better check for the "not-exist" error after calling it and ignore that one.

okartau (2020-02-03): Yes, I also got the idea to re-arrange those statements so that not everything is packed on a single line. (I copied this line from pmem-csi.) IMHO it improves readability to have it on separate lines and check for err first.

I also discovered that the removal of the directory in this function does not follow the CSI spec, which says to remove the directory. The current code does it for VolAccessType=Block only.

So what about:

- we remove the switch statement so that there will be just os.RemoveAll.

- we add a check for allowed VolAccessType values near the start of the function, to avoid the currently possible code path where some changes are made and then a parameter check in the middle of the function returns without making further changes.
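
The second proposal (validate first, change state afterwards) could look roughly like the sketch below. It is only an illustration: `accessType`, `validate`, and `unpublish` are hypothetical names, not the driver's types, and the side effects are passed in as plain functions so the example stays self-contained.

```go
package main

import "fmt"

// accessType is a hypothetical enum standing in for the driver's access type.
type accessType int

const (
	mountAccess accessType = iota
	blockAccess
)

// validate rejects every access type the sketch does not know about.
func validate(t accessType) error {
	switch t {
	case mountAccess, blockAccess:
		return nil
	default:
		return fmt.Errorf("unsupported access type %v", t)
	}
}

// unpublish checks its input first and only then runs the side effects, so a
// rejected request leaves no partial changes behind.
func unpublish(t accessType, sideEffects ...func() error) error {
	if err := validate(t); err != nil {
		return err
	}
	for _, step := range sideEffects {
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	unmount := func() error { fmt.Println("unmounting"); return nil }
	fmt.Println(unpublish(blockAccess, unmount))
	fmt.Println(unpublish(accessType(42), unmount)) // rejected before unmount runs
}
```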

pohly (2020-02-03):
> Yes, I also got the idea to re-arrange those statements so that not everything is packed on a single line. (I copied this line from pmem-csi.) IMHO it improves readability to have it on separate lines and check for err first.

I'm not sure I follow here.

if err := <do something>; <check err here> is the canonical way to handle errors. If the line gets too long, then simply spread it over multiple lines.

> The current code does it for VolAccessType=Block only.

Yes, it should create (or try to create) the target directory in NodePublishVolume and remove it in NodeUnpublishVolume. It is a bug in Kubernetes that it creates the target path for filesystem volumes.

okartau (2020-02-04):
> I'm not sure I follow here.

What I meant is that instead of multiple ANDed and ORed logic on the same line:

if notMnt, err := mount.IsNotMountPoint(mount.New(""), targetPath); notMnt || err != nil && !os.IsNotExist(err) {

I prefer a more straightforward structure:
notMnt, err := stmt
if err
... and so on, with the statement and the checks separated for easier reading,
i.e. the code that is in the PR now (I pushed a new state close to my previous comment).

But what about my idea of removing the switch statement in the middle? If you say we can trust volAccessType to be valid, does that mean we can remove the switch altogether, without adding a separate check closer to the start of the function? And remove the directory in all cases?
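
For reference, && binds more tightly than || in Go, so the one-line condition quoted above groups as notMnt || (err != nil && !os.IsNotExist(err)). A minimal sketch, with hypothetical function names, that compares the ungrouped form against both possible groupings for every input:

```go
package main

import "fmt"

// flat has the same shape as the one-line condition: a || b && c.
func flat(a, b, c bool) bool { return a || b && c }

// grouped spells out the grouping that Go's operator precedence implies.
func grouped(a, b, c bool) bool { return a || (b && c) }

// leftFirst is the other possible reading, which Go does NOT use.
func leftFirst(a, b, c bool) bool { return (a || b) && c }

func main() {
	bools := []bool{false, true}
	for _, a := range bools {
		for _, b := range bools {
			for _, c := range bools {
				if flat(a, b, c) != leftFirst(a, b, c) {
					fmt.Printf("a=%v b=%v c=%v: flat differs from (a || b) && c\n", a, b, c)
				}
			}
		}
	}
}
```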

pohly (2020-02-04):
> I prefer a more straightforward structure:
> notMnt, err := stmt
> if err

That's non-idiomatic Go. You can get the same readability with:

if notMnt, err := stmt; err ...

> If you say we can trust volAccessType to be valid, does that mean we can remove the switch altogether, without adding a separate check closer to the start of the function?

Yes.

> And remove the directory in all cases?

Yes.

okartau (2020-02-07): I pushed a new state addressing the concerns raised. gofmt has its own opinion on how to format that "if" part, so I used that. The switch statement is gone, and we always remove the directory. We also no longer check VolAccessType; we trust that the type is valid.

pohly · 2020-01-27T12:35:51Z

pkg/hostpath/nodeserver.go

 		glog.V(4).Infof("hostpath: volume %s/%s has been unmounted.", targetPath, volumeID)
+	default:
+		return nil, status.Error(codes.Internal, fmt.Sprintf("unsupported access type %v", vol.VolAccessType))


In the unlikely case that we get here, we have potentially already done some work (unmounting), but I think that's okay.

okartau (2020-01-31): For proper handling, should we check earlier that VolAccessType is one of the supported types, and refuse to make any changes if not?

pohly (2020-02-03): I don't think that's necessary. vol.VolAccessType really should be valid, as it was set earlier by the driver itself.

okartau · 2020-02-03T07:48:20Z

/retest
failure seems not related to change in this PR

pohly · 2020-02-07T08:30:38Z

pkg/hostpath/nodeserver.go

 		// Unmounting the image
-		err = mount.New("").Unmount(req.GetTargetPath())
+		err = mount.New("").Unmount(targetPath)


The comment is no longer accurate. It should say "Unmounting the image or filesystem" now.

pohly · 2020-02-07T08:33:50Z

pkg/hostpath/nodeserver.go

 	}
+	// Delete the block file or mount point.


With "block file", do you mean the image file that the loop device is bound to? We don't delete that here.

I would just say "Delete the mount point." here, which is a correct statement regardless of whether that mount point is a directory (for filesystem mode) or a file (for raw block).

okartau (2020-02-07): I just re-used the previous "block file" wording and added another part to that comment. Changed now as proposed, pushed.
One thing I notice when looking at the latest code: the recent change that took away the "return OK" in the "not mount point" case has likely shifted the potential idempotency failure point to the next operation, os.RemoveAll, which will fail if there is no mount point.
I have not tested the newest code against the repeated test yet.
I need to do that before this code can be accepted; I will do it locally soon. My guess is that it will now fail at RemoveAll().
Part of the reason this went unnoticed is that similar code on the pmem-csi side does not check the return code of os.RemoveAll, i.e. it tolerates failure silently, with the end result still being OK.
Failing to remove a non-existent directory is not really a bad thing in that case.
But it is clearer to check the return code and have commented code that handles it.

okartau (2020-02-07): Or what about another "code saving" approach: we do not check the return code and add a comment explaining why not?

okartau (2020-02-07): No, I think the correct way is to check. One of:

- first check whether the path exists; if yes, call RemoveAll and also check the return code

- try RemoveAll and check the return code; if it is "no such path" then OK, otherwise error

pohly (2020-02-07): The latter. Testing before some operation and then doing the operation is always a bit fishy. There have been security exploits because of this (https://hackernoon.com/time-of-check-to-time-of-use-toctou-a-race-condition-99c2311bd9fc).

This shouldn't be a problem here, but it's still better to follow best practices. It's also cheaper (= fewer syscalls), although again that doesn't matter here.

okartau (2020-02-07): I tried the current code and it works without error, i.e. os.RemoveAll does not return an error on a non-existent path.
This is expected and documented at https://golang.org/pkg/os/ :
If the path does not exist, RemoveAll returns nil (no error).
That means the current code is good as is; it only needs a comment stating the above to make it even clearer (I will add a comment and push).

NodeVolumeUnpublish handling has to be idempotent, i.e. it has to tolerate repeated requests. That means we cannot attempt umount and return a failure without checking whether targetPath really is a mount point. Re-arranged the "not mount point" check to be simpler. Removed the switch on VolAccessType, as targetPath has to be removed in both cases per the CSI spec.

okartau · 2020-02-07T16:13:13Z

/retest
failure seems not related to change in this PR

pohly

/lgtm

pohly · 2020-02-10T18:20:52Z

/assign @msau42

For approval.

msau42 · 2020-02-10T19:59:05Z

/approve
Thanks for fixing this!

k8s-ci-robot · 2020-02-10T19:59:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, okartau

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [msau42]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

a1e11275 Merge pull request kubernetes-csi#139 from pohly/kind-for-kubernetes-latest 1c0fb096 prow.sh: use KinD main for latest Kubernetes 1d77cfcb Merge pull request kubernetes-csi#138 from pohly/kind-update-0.10 bff2fb7e prow.sh: KinD 0.10.0 git-subtree-dir: release-tools git-subtree-split: a1e11275b5a4febd6ad21beeac730e22c579825b

k8s-ci-robot requested review from jsafrane and msau42 January 15, 2020 08:42

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 15, 2020

okartau changed the title ~~NodeVolumeUnpublish: tolerate repeted requests~~ [WIP] NodeVolumeUnpublish: tolerate repeted requests Jan 15, 2020

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2020

okartau mentioned this pull request Jan 15, 2020

Idempotency: repeated Node Unpublish operations cause failure #105

Closed

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 15, 2020

okartau changed the title ~~[WIP] NodeVolumeUnpublish: tolerate repeted requests~~ [WIP] NodeVolumeUnpublish: tolerate repeated requests Jan 15, 2020

okartau force-pushed the tolerate-repeated-unpublish branch from 4866976 to 246ccd1 Compare January 21, 2020 11:17

okartau changed the title ~~[WIP] NodeVolumeUnpublish: tolerate repeated requests~~ NodeVolumeUnpublish: tolerate repeated requests Jan 23, 2020

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2020

pohly requested changes Jan 27, 2020


okartau force-pushed the tolerate-repeated-unpublish branch from 246ccd1 to 0b60b66 Compare February 3, 2020 07:16

okartau force-pushed the tolerate-repeated-unpublish branch from 0b60b66 to 961ec99 Compare February 5, 2020 16:46

pohly requested changes Feb 7, 2020


okartau force-pushed the tolerate-repeated-unpublish branch from 961ec99 to 7bd758c Compare February 7, 2020 08:58

okartau force-pushed the tolerate-repeated-unpublish branch from 7bd758c to 2d18f8a Compare February 7, 2020 15:32

pohly approved these changes Feb 10, 2020


k8s-ci-robot assigned pohly Feb 10, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2020

k8s-ci-robot assigned msau42 Feb 10, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2020

k8s-ci-robot merged commit 7691a85 into kubernetes-csi:master Feb 10, 2020

okartau mentioned this pull request Feb 11, 2020

add repetition loop to test idempotency kubernetes-csi/csi-test#229

Merged

msau42 mentioned this pull request Feb 13, 2020

release 1.3.0 preparations #155

Merged

pohly mentioned this pull request Mar 24, 2021

master: update release-tools #264

Merged
