Failed to unmount nfs when connection was lost #72048
Comments
/sig storage
@linxiulei We have a customer complaining about this: when their firewall goes down, all of their NFS storage hangs and does not recover even after the firewall is restored. When they delete their application pods, the pods always remain in "Terminating" state (Kubernetes 1.12 + containerd runtime). `kubectl describe pod` shows "Killing container with id containerd://mobilitx:Need to kill Pod ...". ---solution we tried--- During your tests, did those terminating pods disappear once NFS connectivity was back? In our case they do not. Will this fix solve the above issue, and what is the timeline and release for the rollout?
@sikhlaqu You can `umount --force` (try `--lazy` if needed) the NFS mount to resolve the kubelet hang. If the pod is still stuck in deleting status, try deleting the pod directory and restarting kubelet. Yes, I have tested this; please try it and report any problems so I can see what I can help with. This commit does not need NFS connectivity to come back at all: it unmounts the NFS volume and deletes the pod regardless.
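The workaround above can be sketched as a small shell snippet. The mountpoint path is a placeholder; substitute the pod's actual NFS volume directory on the node:

```shell
#!/bin/sh
# Placeholder path: substitute the stuck NFS volume dir, typically
# /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>
MNT="/mnt/stuck-nfs"

# --force aborts in-flight requests to the dead server instead of retrying
# forever; --lazy detaches the mountpoint immediately and cleans up once it
# is no longer busy.
if umount --force "$MNT" 2>/dev/null || umount --lazy "$MNT" 2>/dev/null; then
  result="unmounted"
else
  result="unmount failed"
fi
echo "$result: $MNT"
```

If the pod still shows as Terminating afterwards, the suggestion above is to remove the pod directory under `/var/lib/kubelet/pods` and restart kubelet.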
Thanks for pointing this out. I confirm that the hang happens when the NFS server is down or there is a network problem. I have another question: how does the NFS client handle a crash of the NFS server? Specifically, after a long downtime, can it still auto-recover?
I suspect it's a Linux block-I/O subsystem flaw, in that most I/O issues of this kind cause a hang. I only tested a short period of downtime, though; in that case the NFS client auto-recovered.
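For context on the auto-recovery question: whether an NFS client hangs indefinitely or eventually returns an error is governed by client mount options, not the server. With the default `hard` option the client retries forever (the hang seen in this issue); `soft` gives up after `retrans` retries of `timeo` (tenths of a second) each, at the cost of possible data corruption on interrupted writes. An illustrative /etc/fstab line (server and paths are placeholders):

```
nfs-server:/export  /mnt/data  nfs  soft,timeo=50,retrans=3  0  0
```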
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
So the only solution is to reboot the node or restart kubelet? We get hit by this very often and it causes real problems for our system...
What happened:
NFS background: most operations on an NFS mountpoint will hang if the connection to the NFS server is lost.
When tearing down a pod, kubelet tries to unmount all volumes the pod has mounted. kubelet performs several operations on the mountpoint, such as stat and unmount. Those stat/unmount calls hang forever if the connection to the NFS server is lost, though they may recover once the connection is re-established.
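The stat hang described above can be probed safely by putting a deadline on the call, which the teardown path did not do. A minimal illustration using coreutils `timeout` (this is not kubelet's actual code, just a sketch of the pattern):

```shell
#!/bin/sh
# Probe a path with a deadline so a dead NFS server cannot hang the caller.
# `timeout 5` kills stat after 5 seconds (exit code 124 signals the timeout).
probe() {
  if timeout 5 stat "$1" >/dev/null 2>&1; then
    echo "responsive"
  else
    echo "hung or missing"
  fi
}

probe /tmp            # local filesystem: answers immediately
probe /mnt/no-such    # absent path: stat fails fast, not a hang
```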
What you expected to happen:
No matter what happened to NFS server, kubelet tears down pods along with their nfs volumes successfully.
How to reproduce it (as minimally and precisely as possible):
kubelet will hang at step 4 and log messages related to unmounting the volume
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Kernel (e.g. `uname -a`):

/kind bug