Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix pod stuck in Terminating issue due to corrupt mnt point in flexvol plugin #75234

Merged
merged 1 commit into from Jun 14, 2019

Conversation

@andyzhangx
Copy link
Member

commented Mar 9, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixed this issue:
pod could not be terminated when pod volume path is corrupted due to smb server return error status: resource temporarily unavailable, in this condition, pod will be in Terminating forever. In this PR, it would call Unmount if PathExists returns any error.

$ kubectl get po
NAME              READY   STATUS        RESTARTS   AGE
nginx-flex-smb   0/1     Terminating   0          6h46m

Which issue(s) this PR fixes:

Fixes #75233

Special notes for your reviewer:
EAGAIN - APPLICATION_ERROR: "resource temporarily unavailable"
This error happens when SMB server is unavailable due to network issue, password change, smb path change, etc. , in this PR, it would call Unmount if PathExists returns any error.

workaround for this issue

  • on agent node, run mount | grep cifs to get all cifs mounts
  • check every cifs mount, e.g.
sudo ls -lt /var/lib/kubelet/pods/5c949781-4c6d-11e9-825d-000d3a0dd47b/volumes/microsoft.com~smb/test
ls: cannot access '/var/lib/kubelet/pods/5c949781-4c6d-11e9-825d-000d3a0dd47b/volumes/microsoft.com~smb/test': Permission denied
  • if ls -lt ... failed, run
sudo umount /var/lib/kubelet/pods/5c949781-4c6d-11e9-825d-000d3a0dd47b/volumes/microsoft.com~smb/test

And then the pod will be in terminated soon.

Does this PR introduce a user-facing change?:

fix pod stuck issue due to corrupt mnt point in flexvol plugin, call Unmount if PathExists returns any error

/priority important-soon
/assign @jingxu97 @msau42 @davidz627

pkg/kubelet/kubelet_volumes.go Outdated Show resolved Hide resolved
@gnufied

This comment has been minimized.

Copy link
Member

commented Mar 9, 2019

/assign

cc @chakri-nelluri

pkg/util/mount/mount_helper.go Outdated Show resolved Hide resolved
@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 13, 2019

thanks guys for the comments, let me hold this PR, and track the root case
/hold

@andyzhangx andyzhangx changed the title fix pod stuck issue due to corrupt mnt point fix pod stuck issue due to corrupt mnt point (resource temporarily unavailable) Mar 22, 2019
@andyzhangx andyzhangx force-pushed the andyzhangx:corrupt-mnt-fix branch from cbefeae to cf69d84 Mar 22, 2019
@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 22, 2019

PTAL, thanks.


if pathErr != nil && !mount.IsCorruptedMnt(pathErr) {
return fmt.Errorf("Error checking path: %v", pathErr)
if pathErr != nil {

This comment has been minimized.

Copy link
@jingxu97

jingxu97 May 23, 2019

Contributor

how about
if !pathExists & pathErr == nil {
return nil
}
...

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx May 23, 2019

Author Member

then this function would depend on return value of PathExists, if it return <false, error>, it won't proceed, mainly I would like to cherry pick to old release, only for flexvol.TearDownAt func

This comment has been minimized.

Copy link
@jingxu97

jingxu97 May 23, 2019

Contributor

oh, sorry my mistake. I corrected it.
only if it return (false, nil) which means path is definitely not exist , it will return immediately without any further actions.

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx May 23, 2019

Author Member

Hi @jingxu97 logic of this PR is identical to this your suggested comment:

	if pathErr != nil {
		// only log warning here since plugins should anyways have to deal with errors
		klog.Warningf("Error checking path: %v", pathErr)
	} else {
		if !pathExists {
			klog.Warningf("Warning: Unmount skipped because path does not exist: %v", dir)
			return nil
		}
	}

This comment has been minimized.

Copy link
@jingxu97

jingxu97 Jun 3, 2019

Contributor

yes, the logic is the same. I am thinking to make it easier to read. And I think the other PR #75645 also uses the simpler way.

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx Jun 4, 2019

Author Member

@jingxu97 shall we go with this PR first, and I would like to cherry pick this PR to old release, thanks.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented May 30, 2019

/hold cancel
@jingxu97 PTAL, already addressed your comments, thanks.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jun 3, 2019

@jingxu97

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 7, 2019
@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jun 8, 2019

/lgtm

still needs /approve, and are we going to merge this into v1.15? @jingxu97

@jingxu97

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

/assign @saad-ali @chakri-nelluri @MikaelCluseau

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

@andyzhangx: GitHub didn't allow me to assign the following users: MikaelCluseau.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @saad-ali @chakri-nelluri @MikaelCluseau

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

IIUC, before, Unmount would not be called if PathExists returns non-corrupted error.

Now what this fix does is it always calls Unmount if PathExists returns any error, corrupted or not. Is that the intended behavior @andyzhangx? From the PR description, it sounds like you wanted some change in behavior for corrupted errors?

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2019

IIUC, before, Unmount would not be called if PathExists returns non-corrupted error.

Now what this fix does is it always calls Unmount if PathExists returns any error, corrupted or not. Is that the intended behavior @andyzhangx? From the PR description, it sounds like you wanted some change in behavior for corrupted errors?

I have changed the PR description, current reality is there is some unexpected error in flexvolume path, and in this PR, it would call Unmount if PathExists returns any error. @msau42

@msau42

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

/approve

Thanks for clarifying! Can you also fix the release note?

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jingxu97, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2019

/approve

Thanks for clarifying! Can you also fix the release note?

done, thanks.

@k8s-ci-robot k8s-ci-robot merged commit 3a598d7 into kubernetes:master Jun 14, 2019
14 of 20 checks passed
14 of 20 checks passed
pull-kubernetes-e2e-gce Job triggered.
Details
pull-kubernetes-e2e-gce-100-performance Job triggered.
Details
pull-kubernetes-integration Job triggered.
Details
pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
pull-kubernetes-node-e2e Job triggered.
Details
pull-kubernetes-verify Job triggered.
Details
cla/linuxfoundation andyzhangx authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
Details
pull-kubernetes-godeps Skipped.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-typecheck Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 14, 2019

@andyzhangx: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-integration fb81a28 link /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added a commit that referenced this pull request Jun 20, 2019
…5234-upstream-release-1.12

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
k8s-ci-robot added a commit that referenced this pull request Jun 20, 2019
…5234-upstream-release-1.13

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
k8s-ci-robot added a commit that referenced this pull request Jun 20, 2019
…5234-upstream-release-1.14

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
k8s-ci-robot added a commit that referenced this pull request Jun 28, 2019
…5234-upstream-release-1.15

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
@andyzhangx andyzhangx changed the title fix pod stuck issue due to corrupt mnt point in flexvol plugin fix pod stuck in Terminating issue due to corrupt mnt point in flexvol plugin Jul 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.