Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NodeVolumeUnpublish: tolerate repeated requests #139

Merged

Conversation

okartau
Copy link
Contributor

@okartau okartau commented Jan 15, 2020

What type of PR is this?
/kind failing-test

What this PR does / why we need it:
NodeVolumeUnpublish handling has to be idempotent,
tolerating repeated requests.
That means, we can not attempt umount and return failure
without checking is that really mount point.

Which issue(s) this PR fixes:
Fixes #105

Special notes for your reviewer:
Consider this as WIP: I intentionally replicated mount check code initially
in two case: blocks, and that's not meant to be final.
I made it like that to avoid semantic change to current switch block, which
already replicates code for two cases.
This switch does not have default (should it have?), i.e. we assume that
VolAccessType has to be one of two handled by case: blocks?
Does that mean we can bring mount check code to be before switch block?
But for that to be correct, we should probably consider adding default stmt in switch.
Note that two case: parts already have most of code identical
(we can use targetPath instead of req.GetTargetPath() in 2nd case)

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 15, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @okartau. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 15, 2020
@okartau okartau changed the title NodeVolumeUnpublish: tolerate repeted requests [WIP] NodeVolumeUnpublish: tolerate repeted requests Jan 15, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2020
@pohly
Copy link
Contributor

pohly commented Jan 15, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 15, 2020
@pohly
Copy link
Contributor

pohly commented Jan 15, 2020

/ok-to-test

@okartau okartau changed the title [WIP] NodeVolumeUnpublish: tolerate repeted requests [WIP] NodeVolumeUnpublish: tolerate repeated requests Jan 15, 2020
@pohly
Copy link
Contributor

pohly commented Jan 21, 2020

This switch does not have default (should it have?), i.e. we assume that
VolAccessType has to be one of two handled by case: blocks?

Better add a default.

Note that two case: parts already have most of code identical
(we can use targetPath instead of req.GetTargetPath() in 2nd case)

Please try to avoid code duplication if possible.

@okartau okartau changed the title [WIP] NodeVolumeUnpublish: tolerate repeated requests NodeVolumeUnpublish: tolerate repeated requests Jan 23, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2020
@okartau
Copy link
Contributor Author

okartau commented Jan 23, 2020

I added default case in that switch block, and that allowed to move common replicated parts out of switch-block. I also moved "check is mount point" test to happen before switch, without replication.

Although we discussed off-line with @pohly that more idempotency-safety via mutex locking could be added in same pass with this PR, I think it's simpler to take smaller steps, as this PR solely improves the ability to withstand repeated single-thread operations, as can be demonstrated by kubernetes-csi/csi-test#229, once merged.
The locking-based improvement is different story, it will improve test cases that dont exist yet, but can be added, making use of parallel requests. Thus, I propose to merge this PR first (enables to merge csi-test change adding repeated tests), and then move forward with locking changes.

@@ -204,24 +204,27 @@ func (ns *nodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpu
return nil, status.Error(codes.NotFound, err.Error())
}

// Check if the target path is really a mount point. If its not a mount point do nothing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "If it's not a mount point do nothing."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a closer look, we tell "if (not) mount point" twice which is one too many. Same message can be given without repeating same words, and in one statement. And combining that with another improvement idea from below, I change that comment to "Unmount only if the target path is really a mount point."

// Check if the target path is really a mount point. If its not a mount point do nothing
if notMnt, err := mount.IsNotMountPoint(mount.New(""), targetPath); notMnt || err != nil && !os.IsNotExist(err) {
glog.V(4).Infof("hostpath: %s is not mount point, skip", targetPath)
return &csi.NodeUnpublishVolumeResponse{}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is not mounted, we can only skip unmounting. We should not skip the rest of the function because a previous NodeUnpublishVolume might have been interrupted after unmounting and before removing the target path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we only skip unmounting and proceed , we have to make sure repeated calling of os.RemoveAll is safe. It seems os.RemoveAll returns error if called with non-existent directory (is it so?), means we have to check for existence before calling?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better check for the "not-exist" error after calling it and ignore that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I also got idea to re-arrange those statements not to have everything packed on single line. (This line I copied from pmem-csi). But it improves readability to have it on different lines and check for err first, IMHO.

But also, I discovered that removal of directory in this functopn does not follow CSI spec which tells to remove the directory. Current code does it for VolAccessTyoe=Block only.

So what about:

  • we remove the switch stmt so that there will be just os.RemoveAll.
  • we add check for VolAccessType allowed values near start of function,
    to avoid currently possible code path where some changes are made, then there is parameters check in the middle of function which can return without further changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I also got idea to re-arrange those statements not to have everything packed on single line. (This line I copied from pmem-csi). But it improves readability to have it on different lines and check for err first, IMHO.

I'm not sure I follow here.

if err := <do something>; <check err here> is the canonical way to handle errors. If the line gets too long, then simply spread it over multiple lines.

Current code does it for VolAccessTyoe=Block only.

Yes, it should create (or try to create) the target directory in NodePublishVolume and remove it in NodeUnpublishVolume. It is a bug in Kubernetes that it creates the target path for filesystem volumes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow here.

What I meant is that instead of multiple ANDed and ORed login on same line:

if notMnt, err := mount.IsNotMountPoint(mount.New(""), targetPath); notMnt || err != nil && !os.IsNotExist(err) {

I prefer more straightforward structure:
notMnt, err := stmt
if err
.... and so on, having statement and checks separated for easier reading.
i.e. the code that is in PR now (I pushed new state close to my previous comment)

But what about my idea of removing switch stmt in middle. If you say we can trust volAccessType to be valid, does it mean then we can remove the switch altogether, without adding separate check closer to start of function ? And remove directory in all cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer more straightforward structure:
notMnt, err := stmt
if err

That's non-idiomatic Go. You can get the same readability with:

if notMnt, err := stmt;
    err ...

If you say we can trust volAccessType to be valid, does it mean then we can remove the switch altogether, without adding separate check closer to start of function ?

Yes.

And remove directory in all cases?

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed new state addressing raised concerns. gofmt has own opinion how to format that "if" part so I used that. The switch stmt is gone, we always remove the directory. And we do not check for VolAccessType any more, we trust the type is valid.

glog.V(4).Infof("hostpath: volume %s/%s has been unmounted.", targetPath, volumeID)
default:
return nil, status.Error(codes.Internal, fmt.Sprintf("unsupported access type %v", vol.VolAccessType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the unlikely case that we get here, we potentially did already some work (unmounting), but I think that's okay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For proper handling, we should check earlier that VolAccessType is one of supported types, and refuse to make any changes if not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's necessary. vol.VolAccessType really should be valid, as it was set earlier by the driver itself.

@okartau
Copy link
Contributor Author

okartau commented Feb 3, 2020

/retest
failure seems not related to change in this PR

// Unmounting the image
err = mount.New("").Unmount(req.GetTargetPath())
err = mount.New("").Unmount(targetPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is no longer accurate. It should say "Unmounting the image or filesystem" now.

}
// Delete the block file or mount point.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With "block file", do you mean the image file that the loop device is bound to? We don't delete that here.

I would just say "Delete the mount point." here, which is a correct statement, regardless whether that mount point is a directory (for filesystem mode) or file (for raw block).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just re-used previous "block file" and added another part in that comment. Changed now as proposed, pushed.
One thing I notice when looking at latest code now, the recent change which took away "return OK" in "not mount point" case, likely shifted the potential idempotency failure point to next operation, os.RemoveAll which will fail if there is no mount point.
And I have not tested newest code against repeated test yet.
I need to do that before accept point of that code, I will do that locally soon, my guess is it now will fail at RemoveAll().
Partial reason why this is left unnoticed is that similar code on pmem-csi side does not check for return code of os.RemoveAll, i.e. tolerates failure silently, with end result being still OK.
The failure to remove non-existing directory is not really bad thing in that case.
But more clear is to check for return code and have code to handle this, commented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or, what about another "code saving" approach: we do not check for return code and add comment why we do not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I think correct way is to check. One of

  • check first does path exist, if yes then RemoveAll, and also check for return code
  • try RemoveAll and check for return code, if "no such path" then OK, otherwise error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter. Testing before some operation and then doing the operation is always a bit fishy. There have been security exploits because of this (https://hackernoon.com/time-of-check-to-time-of-use-toctou-a-race-condition-99c2311bd9fc).

This shouldn't be a problem here, but it's still better to follow best practices. It's also cheaper (= less syscalls), although again that doesn't matter here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried current code and it works without error, i.e. os.RemoveAll does not return error on non-existing path
This is expected and documented on https://golang.org/pkg/os/ :
If the path does not exist, RemoveAll returns nil (no error)
Means, current code is good as is, needs comment stating above to make it even more clear (I will add a comment and push)

NodeVolumeUnpublish handling has to be idempotent,
i.e. has to tolerate repeated requests.
That means, we can not attempt umount and return failure
without checking is targetPath really a mount point.
Re-arranged "not mount point" checking to be simpler.
Removed switch about VolAccessTypes as targetPath has
to be removed in both cases by CSI spec.
@okartau
Copy link
Contributor Author

okartau commented Feb 7, 2020

/retest
failure seems not related to change in this PR

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2020
@pohly
Copy link
Contributor

pohly commented Feb 10, 2020

/assign @msau42

For approval.

@msau42
Copy link
Collaborator

msau42 commented Feb 10, 2020

/approve
Thanks for fixing this!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, okartau

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7691a85 into kubernetes-csi:master Feb 10, 2020
pohly added a commit to pohly/csi-driver-host-path that referenced this pull request Mar 24, 2021
a1e11275 Merge pull request kubernetes-csi#139 from pohly/kind-for-kubernetes-latest
1c0fb096 prow.sh: use KinD main for latest Kubernetes
1d77cfcb Merge pull request kubernetes-csi#138 from pohly/kind-update-0.10
bff2fb7e prow.sh: KinD 0.10.0

git-subtree-dir: release-tools
git-subtree-split: a1e11275b5a4febd6ad21beeac730e22c579825b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Idempotency: repeated Node Unpublish operations cause failure
4 participants