fix: corrupted mount point in csi driver node stage/publish #88569

andyzhangx · 2020-02-26T08:13:05Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes the corrupted mount point issue in the CSI driver. A detailed description of the issue can be found in #70013, "No easy way how to update CSI driver that uses fuse":
"
We recommend to use DaemonSet to run CSI drivers on node. If a driver runs fuse daemon, it's almost impossible to update it, as killing a pod with the driver kills the fuse daemons too and it will kill all mounts, possibly corrupting application data.

We need a documented and supported way to update such CSI drivers. Note that the update process can be manual or the code can live somewhere else, we just need it to be documented and supported so people don't lose data.
"

When a FUSE-based CSI driver DaemonSet is restarted on a node, the original blobfuse mount is broken. With this PR, a broken mount is handled in both NodeStage and NodePublish:

- detect the broken mount path
- unmount the broken mount path
- remount the mount path

I think this issue is not limited to FUSE-based CSI drivers. There are many ways a stage or publish mount path can become broken, and we should let the CSI driver itself handle a corrupted mount point. The current behavior is to return an error directly, which leaves the CSI driver no way to handle it.
For example, flexvolume leaves it to the flexvol driver to handle a corrupted mount point:

kubernetes/pkg/volume/flexvolume/detacher.go

Lines 60 to 61 in e4a5012

    
	if pathErr != nil && !mount.IsCorruptedMnt(pathErr) {
		return fmt.Errorf("Error checking path: %v", pathErr)

As far as I recall, the original in-tree drivers could also handle a corrupted mount point. The CSI code path changed this behavior: if the mount point is already corrupted, the CSI driver currently has no way to handle it.

Which issue(s) this PR fixes:

Fixes #70013

Special notes for your reviewer:
/assign @msau42 @davidz627 @saad-ali @gnufied
/priority important-soon

Does this PR introduce a user-facing change?:

fix: corrupted mount point in csi driver

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

feiskyer · 2020-02-26T08:30:10Z

/milestone v1.18

andyzhangx · 2020-02-26T11:56:12Z

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gce

msau42 · 2020-02-26T17:53:58Z

pkg/volume/csi/csi_attacher.go

@@ -228,13 +228,19 @@ func (c *csiAttacher) MountDevice(spec *volume.Spec, devicePath string, deviceMo
 		return errors.New(log("attacher.MountDevice failed, deviceMountPath is empty"))
 	}

+	corruptedDir := false
 	mounted, err := isDirMounted(c.plugin, deviceMountPath)


I think actually we may want to skip this check altogether and just always call StagePublish. @gnufied @jsafrane do you see any problems with that? Ref #86784

andyzhangx · 2020-02-27

Yes, we don't even need to call os.MkdirAll(deviceMountPath, 0750); we could leave all of that logic to the CSI driver:

kubernetes/pkg/volume/csi/csi_attacher.go

Lines 290 to 293 in e4a5012

	if err = os.MkdirAll(deviceMountPath, 0750); err != nil {
		return errors.New(log("attacher.MountDevice failed to create dir %#v: %v", deviceMountPath, err))
	}
	klog.V(4).Info(log("created target path successfully [%s]", deviceMountPath))

This looks like a big behavior change, though. Is it OK to do this in this PR?

gnufied · 2020-02-27

I am okay with removing the mounted check, but it requires the uncertain-mount fix to work reliably, so we won't be able to backport that change to versions without that fix.

andyzhangx · 2020-02-28

I would like to backport this fix to older releases, so shall we go with this PR as is for now?
As for removing the mounted check, I could work on a separate PR that won't be backported. Is that OK? @gnufied, thanks.

msau42 · 2020-02-26T17:54:42Z

/assign @jsafrane
@kubernetes/sig-storage-pr-reviews

andyzhangx · 2020-02-28T03:54:32Z

@jsafrane could you take a look? thanks.

andyzhangx · 2020-02-28T12:48:25Z

BTW, kubernetes-sigs/blob-csi-driver#117 is an example fix in a FUSE-based driver showing how to handle a corrupted mount point. With these two PRs, even if the FUSE driver DaemonSet is restarted, the driver works again once the pod with the FUSE volume mount is restarted.

This PR also mitigates more than FUSE driver issues. For example, with other CSI drivers based on remote network file systems, if the remote server stops responding transiently, the mount point can be broken, and a new pod mounting on the same mount point (using the same PV) will fail. This PR fixes those issues as well.

smourapina · 2020-03-02T06:12:58Z

@andyzhangx, @jsafrane, @msau42:
Bug Triage for release 1.18 checking in. We are a few days away from code freeze (which happens next Thursday, 5 March). Will this PR be merged on time?

andyzhangx · 2020-03-02T06:17:51Z

@smourapina thanks, we are trying.
@jsafrane, @msau42 @gnufied could you take a look? thanks.

jsafrane · 2020-03-02T10:03:07Z

I think we can merge this PR as it is, so that it is safe to backport, and then, in a separate PR, remove the mount check for releases that reliably track the uncertain state of a mount.

/lgtm
/approve

k8s-ci-robot · 2020-03-02T10:04:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/volume/csi/OWNERS~~ [jsafrane]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

feiskyer

/retest

k8s-ci-robot assigned davidz627 Feb 26, 2020

k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 26, 2020

k8s-ci-robot assigned gnufied Feb 26, 2020

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Feb 26, 2020

k8s-ci-robot assigned msau42 and saad-ali Feb 26, 2020

k8s-ci-robot requested review from humblec and jingxu97 February 26, 2020 08:14

k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2020

andyzhangx force-pushed the csi-corrupt-mnt-fix branch from 2bdc7e9 to beab1ed Compare February 26, 2020 08:21

This was referenced Feb 26, 2020

fix corrupted mount issue when driver daemonset restarted kubernetes-sigs/blob-csi-driver#117

Merged

No easy way how to update CSI driver that uses fuse #70013

Closed

k8s-ci-robot added this to the v1.18 milestone Feb 26, 2020

fix: corrupted mount point in csi driver

5a6435a

add test fix build failure and bazel fix golint

andyzhangx force-pushed the csi-corrupt-mnt-fix branch from beab1ed to 5a6435a Compare February 26, 2020 09:44

andyzhangx mentioned this pull request Feb 26, 2020

restart csi-blobfuse-node daemonset would make current blobfuse mount unavailable kubernetes-sigs/blob-csi-driver#115

Closed

msau42 reviewed Feb 26, 2020

k8s-ci-robot assigned jsafrane Feb 26, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2020

feiskyer approved these changes Mar 2, 2020

k8s-ci-robot merged commit 39ed64e into kubernetes:master Mar 2, 2020

k8s-ci-robot added a commit that referenced this pull request Mar 9, 2020

Merge pull request #88732 from andyzhangx/automated-cherry-pick-of-#8…

e2cfd76

…8569-upstream-release-1.15 Automated cherry pick of #88569: fix: corrupted mount point in csi driver

k8s-ci-robot added a commit that referenced this pull request Mar 9, 2020

Merge pull request #88730 from andyzhangx/automated-cherry-pick-of-#8…

5bd9864

…8569-upstream-release-1.16 Automated cherry pick of #88569: fix: corrupted mount point in csi driver

k8s-ci-robot added a commit that referenced this pull request Mar 10, 2020

Merge pull request #88729 from andyzhangx/automated-cherry-pick-of-#8…

6c82587

…8569-upstream-release-1.17 Automated cherry pick of #88569: fix: corrupted mount point in csi driver

joshimoo mentioned this pull request Jun 2, 2021

[BUG] Volumes are not properly mounted/unmounted when kubelet restarts longhorn/longhorn#2629

Closed
