
Failed to unmount nfs when connection was lost #72048

Closed
linxiulei opened this issue Dec 14, 2018 · 11 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@linxiulei
Contributor

What happened:

NFS background: most operations on an NFS mountpoint will hang if the connection to the NFS server is lost.

When tearing down a pod, kubelet tries to unmount all volumes the pod has mounted. As part of this it performs operations such as stat and unmount on the mountpoint. If the connection to the NFS server is lost at that moment, those stat/unmount calls hang forever, although they may recover once the connection is re-established.
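For illustration, a minimal sketch of where kubelet gets stuck, assuming an NFS volume mounted under the kubelet pod directory (the path below is a placeholder, not from the report): with the server unreachable, both calls block in uninterruptible sleep until the server answers.

```bash
# Placeholder path; substitute the real pod UID and volume name on the node.
MNT="/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>"

stat "$MNT"     # hangs: the NFS client keeps retrying the unreachable server
umount "$MNT"   # hangs for the same reason, so volume teardown never finishes
```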

What you expected to happen:

No matter what happens to the NFS server, kubelet tears down pods along with their NFS volumes successfully.

How to reproduce it (as minimally and precisely as possible):

  1. set up an NFS service
  2. create a pod with the above NFS volume mounted
  3. tear down the NFS service
  4. destroy the pod

kubelet will hang at step 4 and log messages related to unmounting the volume. A rough reproduction sketch is shown below.
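The server address, export, manifest name, and pod name here are illustrative placeholders, and the manifest is assumed to mount the export as an NFS volume:

```bash
# Placeholders: NFS_SERVER, nfs-pod.yaml, and the pod name "nfs-test"
# are illustrative, not from the original report.
NFS_SERVER=10.0.0.10

# 1. set up an NFS service (an existing export is assumed here)
showmount -e "$NFS_SERVER"

# 2. create a pod that mounts the export as an NFS volume
kubectl apply -f nfs-pod.yaml

# 3. tear down the NFS service; simulated here by dropping NFS traffic
#    (run on the node hosting the pod)
iptables -A OUTPUT -d "$NFS_SERVER" -p tcp --dport 2049 -j DROP

# 4. destroy the pod; it stays in Terminating while kubelet hangs on the unmount
kubectl delete pod nfs-test
kubectl get pod nfs-test   # remains in Terminating
```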

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

k8s-ci-robot added the kind/bug and needs-sig labels on Dec 14, 2018
@linxiulei
Contributor Author

/sig storage

k8s-ci-robot added the sig/storage label and removed the needs-sig label on Dec 14, 2018
@sikhlaqu

sikhlaqu commented Feb 4, 2019

@linxiulei
What is the current workaround for this issue?

We have a customer complaining about this: when their firewall goes down, all of their NFS storage hangs and does not recover even after the firewall comes back. When they delete their application pods, the pods remain in "Terminating" state (Kubernetes 1.12 + containerd runtime). Describing a pod shows "Killing container with id containerd://mobilitx:Need to kill Pod ..."

---solution we tried---
A "--force" pod delete doesn't remove the actual running container, so the only solution we see is a node reboot. When many nodes are impacted, that is not desirable. Is there any alternative?
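For context, a sketch of the force delete we mean (the pod name is a placeholder); it removes only the pod object from the API server, while the container keeps running on the node because kubelet is stuck on the NFS unmount:

```bash
# Placeholder pod name; adjust to the stuck pod.
POD=my-stuck-pod

# Removes only the API object; the container keeps running on the node.
kubectl delete pod "$POD" --grace-period=0 --force
```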

During your test, did those terminating pods disappear once NFS connectivity was back? In our case, it does not come back.

Will this fix solve the above issue? What is the timeline and release for rolling it out?

@linxiulei
Contributor Author

linxiulei commented Feb 4, 2019

@sikhlaqu you could run umount --force (try --lazy if needed) on the NFS mountpoint to resolve the kubelet hang, as sketched below. If the pod is still stuck in deleting status, try deleting the pod directory and restarting kubelet.

Yes, I have tested it; please try it and report any problems so I can see how I could help you. This commit does not need NFS connectivity to come back at all: it unmounts the NFS volume and deletes the pod regardless.
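A hedged sketch of that workaround, run on the affected node; the mountpoint path and pod UID are placeholders:

```bash
# Placeholder path; use the actual pod UID and volume name on the node.
MNT="/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>"

umount --force "$MNT"   # force the unmount despite the unreachable server
umount --lazy "$MNT"    # if it still blocks, detach the mount lazily instead

# If the pod is still stuck Terminating, remove its directory and restart kubelet.
rm -rf "/var/lib/kubelet/pods/<pod-uid>"
systemctl restart kubelet
```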

linxiulei reopened this on Feb 4, 2019
@zrss

zrss commented Feb 11, 2019

Thanks for pointing this out. I can confirm the hang happens when the NFS server is down or there is a network problem.

I have another question: how does the NFS client handle a crash of the NFS server? In particular, after a long downtime, can it still recover automatically?

@linxiulei
Contributor Author

I guess it's a Linux block I/O subsystem flaw that most I/O problems cause hangs. I only tested a short period of downtime, though, and the NFS client recovered automatically.
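A hedged way to spot the hang on the node (general Linux diagnostics, not specific to this change): tasks blocked on NFS I/O sit in uninterruptible sleep, and the kernel logs hung-task warnings for them.

```bash
# Processes stuck on NFS I/O show state "D" (uninterruptible sleep).
ps -eo pid,stat,wchan:24,cmd | awk '$2 ~ /D/'

# The kernel usually logs hung-task warnings for them as well.
dmesg | grep -i "blocked for more than"
```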

@zrss

zrss commented Feb 11, 2019

OK, I found the answer; it seems the client will always retry in the default mode. A sketch of the relevant mount options is below.
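A minimal sketch, assuming standard NFS client mount options ("nfs-server:/export" and /mnt/nfs are placeholders): the default "hard" behaviour retries indefinitely, which is why I/O hangs, while a "soft" mount gives up after bounded retries at the risk of surfacing I/O errors to applications.

```bash
# Default behaviour: a "hard" mount retries forever while the server is away,
# so I/O on the mountpoint hangs until the server returns.
mount -t nfs nfs-server:/export /mnt/nfs

# Bounded retries: fail the I/O after timeo/retrans expire instead of hanging
# (soft mounts can surface I/O errors, or corruption on writes, to applications).
mount -t nfs -o soft,timeo=100,retrans=3 nfs-server:/export /mnt/nfs
```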


thanks anyway

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 12, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 11, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@h0jeZvgoxFepBQ2C

So the only solution is to reboot the node or restart kubelet? We get hit by this very often and it really causes problems for our system...
