
Failed to unmount nfs when connection was lost #72048

Closed
linxiulei opened this issue Dec 14, 2018 · 11 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@linxiulei
Contributor

What happened:

NFS background: most operations on an NFS mountpoint will hang if the connection to the NFS server is lost.

When tearing down a pod, kubelet tries to unmount all volumes the pod has mounted. As part of this it performs operations such as stat and unmount on the mountpoint. If the connection to the NFS server is lost at that moment, those stat/unmount calls hang forever, although they may recover once the connection is re-established.
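For illustration, a minimal sketch of where kubelet gets stuck, assuming an NFS volume mounted under the kubelet pod directory (the path below is a placeholder, not from the report): with the server unreachable, both calls block in uninterruptible sleep until the server answers.

```bash
# Placeholder path; substitute the real pod UID and volume name on the node.
MNT="/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>"

stat "$MNT"     # hangs: the NFS client keeps retrying the unreachable server
umount "$MNT"   # hangs for the same reason, so volume teardown never finishes
```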

What you expected to happen:

No matter what happens to the NFS server, kubelet tears down pods along with their NFS volumes successfully.

How to reproduce it (as minimally and precisely as possible):

  1. set up an NFS service
  2. create a pod with the above NFS volume mounted
  3. tear down the NFS service
  4. destroy the pod

kubelet will hang at step 4 and log messages related to unmounting the volume. A rough reproduction sketch is shown below.
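The server address, export, manifest name, and pod name here are illustrative placeholders, and the manifest is assumed to mount the export as an NFS volume:

```bash
# Placeholders: NFS_SERVER, nfs-pod.yaml, and the pod name "nfs-test"
# are illustrative, not from the original report.
NFS_SERVER=10.0.0.10

# 1. set up an NFS service (an existing export is assumed here)
showmount -e "$NFS_SERVER"

# 2. create a pod that mounts the export as an NFS volume
kubectl apply -f nfs-pod.yaml

# 3. tear down the NFS service; simulated here by dropping NFS traffic
#    (run on the node hosting the pod)
iptables -A OUTPUT -d "$NFS_SERVER" -p tcp --dport 2049 -j DROP

# 4. destroy the pod; it stays in Terminating while kubelet hangs on the unmount
kubectl delete pod nfs-test
kubectl get pod nfs-test   # remains in Terminating
```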

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

k8s-ci-robot added the kind/bug and needs-sig labels on Dec 14, 2018
@linxiulei
Contributor Author

/sig storage

k8s-ci-robot added the sig/storage label and removed the needs-sig label on Dec 14, 2018
@sikhlaqu

sikhlaqu commented Feb 4, 2019

@linxiulei
What is the current workaround for this issue?

We have a customer complaining about this: when their firewall goes down, all of their NFS storage hangs and does not recover even after the firewall comes back. When they delete their application pods, the pods remain in "Terminating" state (Kubernetes 1.12 + containerd runtime). Describing a pod shows "Killing container with id containerd://mobilitx:Need to kill Pod ..."

---solution we tried---
A "--force" pod delete doesn't remove the actual running container, so the only solution we see is a node reboot. When many nodes are impacted, that is not desirable. Is there any alternative?
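For context, a sketch of the force delete we mean (the pod name is a placeholder); it removes only the pod object from the API server, while the container keeps running on the node because kubelet is stuck on the NFS unmount:

```bash
# Placeholder pod name; adjust to the stuck pod.
POD=my-stuck-pod

# Removes only the API object; the container keeps running on the node.
kubectl delete pod "$POD" --grace-period=0 --force
```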

During your test, did those terminating pods disappear once NFS connectivity was back? In our case, it does not come back.

Will this fix solve the above issue? What is the timeline and release for rolling it out?

@linxiulei
Contributor Author

linxiulei commented Feb 4, 2019

@sikhlaqu you could run umount --force (try --lazy if needed) on the NFS mountpoint to resolve the kubelet hang, as sketched below. If the pod is still stuck in deleting status, try deleting the pod directory and restarting kubelet.

Yes, I have tested it; please try it and report any problems so I can see how I could help you. This commit does not need NFS connectivity to come back at all: it unmounts the NFS volume and deletes the pod regardless.
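A hedged sketch of that workaround, run on the affected node; the mountpoint path and pod UID are placeholders:

```bash
# Placeholder path; use the actual pod UID and volume name on the node.
MNT="/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>"

umount --force "$MNT"   # force the unmount despite the unreachable server
umount --lazy "$MNT"    # if it still blocks, detach the mount lazily instead

# If the pod is still stuck Terminating, remove its directory and restart kubelet.
rm -rf "/var/lib/kubelet/pods/<pod-uid>"
systemctl restart kubelet
```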

linxiulei reopened this on Feb 4, 2019
@zrss

zrss commented Feb 11, 2019

Thanks for pointing this out. I can confirm the hang happens when the NFS server is down or there is a network problem.

I have another question: how does the NFS client handle a crash of the NFS server? In particular, after a long downtime, can it still recover automatically?

@linxiulei
Contributor Author

I guess it's a Linux block I/O subsystem flaw that most I/O problems cause hangs. I only tested a short period of downtime, though, and the NFS client recovered automatically.
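A hedged way to spot the hang on the node (general Linux diagnostics, not specific to this change): tasks blocked on NFS I/O sit in uninterruptible sleep, and the kernel logs hung-task warnings for them.

```bash
# Processes stuck on NFS I/O show state "D" (uninterruptible sleep).
ps -eo pid,stat,wchan:24,cmd | awk '$2 ~ /D/'

# The kernel usually logs hung-task warnings for them as well.
dmesg | grep -i "blocked for more than"
```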

@zrss

zrss commented Feb 11, 2019

OK, I found the answer; it seems the client will always retry in the default mode. A sketch of the relevant mount options is below.
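A minimal sketch, assuming standard NFS client mount options ("nfs-server:/export" and /mnt/nfs are placeholders): the default "hard" behaviour retries indefinitely, which is why I/O hangs, while a "soft" mount gives up after bounded retries at the risk of surfacing I/O errors to applications.

```bash
# Default behaviour: a "hard" mount retries forever while the server is away,
# so I/O on the mountpoint hangs until the server returns.
mount -t nfs nfs-server:/export /mnt/nfs

# Bounded retries: fail the I/O after timeo/retrans expire instead of hanging
# (soft mounts can surface I/O errors, or corruption on writes, to applications).
mount -t nfs -o soft,timeo=100,retrans=3 nfs-server:/export /mnt/nfs
```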


thanks anyway

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 12, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 11, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@h0jeZvgoxFepBQ2C

So the only solution is to reboot the node or restart kubelet? We get hit by this very often and it really causes problems for our system...
