Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: corrupted mount point in csi driver node stage/publish #88569

Merged
merged 1 commit into from Mar 2, 2020

Conversation

andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Feb 26, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixed the corrupted mount point in csi driver. Detailed description of this issue could be found here: No easy way how to update CSI driver that uses fuse
"
We recommend to use DaemonSet to run CSI drivers on node. If a driver runs fuse daemon, it's almost impossible to update it, as killing a pod with the driver kills the fuse daemons too and it will kill all mounts, possibly corrupting application data.

We need a documented and supported way how to update such CSI drivers. Note that the update process can be manual or the code can live somewhere else, we just need it to to be documented and supported so people don't loose data.
"

With this PR, when fuse based CSI driver daemonset is restarted on the node, original blobfuse mount is broken, this PR would handle broken mount in both NodeStage and NodePublish

  • detect the broken mount path
  • unmount broken mount path
  • remount mount path

And I think this issue is not only related to fuse based CSI driver, there could be lots of possibilities that stage, publish mount path is broken, we should leave CSI driver itself to handle corrupted mount point, while current behavior is return error directly, there is no way to let CSI driver to handle corrupted mount point.
E.g. in flexvolume, it would leave flexvol driver to handle corrupted mount point:

if pathErr != nil && !mount.IsCorruptedMnt(pathErr) {
return fmt.Errorf("Error checking path: %v", pathErr)

And as I could recall, original in-tree driver could also handle corrupted mount point, and now CSI driver changed this behavior, if it's already a corrupted mount point, there is no way in CSI driver to handle this now.

Which issue(s) this PR fixes:

Fixes #70013

Special notes for your reviewer:
/assign @msau42 @davidz627 @saad-ali @gnufied
/priority important-soon

Does this PR introduce a user-facing change?:

fix: corrupted mount point in csi driver

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 26, 2020
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Feb 26, 2020
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2020
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2020
@feiskyer
Copy link
Member

/milestone v1.18

@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Feb 26, 2020
add test

fix build failure and bazel

fix golint
@andyzhangx
Copy link
Member Author

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gce

@@ -228,13 +228,19 @@ func (c *csiAttacher) MountDevice(spec *volume.Spec, devicePath string, deviceMo
return errors.New(log("attacher.MountDevice failed, deviceMountPath is empty"))
}

corruptedDir := false
mounted, err := isDirMounted(c.plugin, deviceMountPath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think actually we may want to skip this check altogether and just always call StagePublish. @gnufied @jsafrane do you see any problems with that? Ref #86784

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we don't even need to do os.MkdirAll(deviceMountPath, 0750), all leave csi driver to handle that logic:

if err = os.MkdirAll(deviceMountPath, 0750); err != nil {
return errors.New(log("attacher.MountDevice failed to create dir %#v: %v", deviceMountPath, err))
}
klog.V(4).Info(log("created target path successfully [%s]", deviceMountPath))

this looks like a big behavior change, is it ok to do this in this PR?

Copy link
Member

@gnufied gnufied Feb 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with removing the mounted check, but it requires uncertain mount fix to work reliably, so we won't be able to backport that change to versions without uncertain fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to back port this fix to old release, so shall we go with this PR now?
about removing the mounted check, I could work out another PR that won't be back ported, is that ok? @gnufied thanks.

@msau42
Copy link
Member

msau42 commented Feb 26, 2020

/assign @jsafrane
@kubernetes/sig-storage-pr-reviews

@andyzhangx
Copy link
Member Author

@jsafrane could you take a look? thanks.

@andyzhangx
Copy link
Member Author

BTW, kubernetes-sigs/blob-csi-driver#117 is an example fix in fuse based driver about how to handle corrupted mount point, so with these two PRs, even fuse driver daemonset is restarted, driver could also work after pod with fuse volume mount restarted.

Also this PR not only mitigated fuse driver issues, e.g. for other remote network file system based csi driver, if remote server does not respond transiently, the mount point could be broken, and new pod mount on the same mount point(using same PV) will fail, this PR could also fix also those issues.

@smourapina
Copy link

@andyzhangx, @jsafrane, @msau42:
Bug Triage for release 1.18 checking in. We are a few days away from code freeze (which happens next Thursday, 5 March). Will this PR be merged on time?

@andyzhangx
Copy link
Member Author

@andyzhangx, @jsafrane, @msau42:
Bug Triage for release 1.18 checking in. We are a few days away from code freeze (which happens next Thursday, 5 March). Will this PR be merged on time?

@smourapina thanks, we are trying.
@jsafrane, @msau42 @gnufied could you take a look? thanks.

@jsafrane
Copy link
Member

jsafrane commented Mar 2, 2020

I think we can merge this PR as it is so it is safe to backport and then, as a separate PR, remove mount check for releases that have reliable uncertain state of mount.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2020
Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/retest

@k8s-ci-robot k8s-ci-robot merged commit 39ed64e into kubernetes:master Mar 2, 2020
k8s-ci-robot added a commit that referenced this pull request Mar 9, 2020
…8569-upstream-release-1.15

Automated cherry pick of #88569: fix: corrupted mount point in csi driver
k8s-ci-robot added a commit that referenced this pull request Mar 9, 2020
…8569-upstream-release-1.16

Automated cherry pick of #88569: fix: corrupted mount point in csi driver
k8s-ci-robot added a commit that referenced this pull request Mar 10, 2020
…8569-upstream-release-1.17

Automated cherry pick of #88569: fix: corrupted mount point in csi driver
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No easy way how to update CSI driver that uses fuse
9 participants