kubelet: storage: don't hang kubelet on unresponsive nfs #35038

sjenning · 2016-10-18T13:57:23Z

Currently, due to the nature of nfs, an unresponsive nfs volume in a pod can wedge the kubelet such that additional pods cannot be run.

The discussion surrounding this issue so far has been to wrap lstat, the syscall that ends up hanging in uninterruptible sleep, in a goroutine and to limit the number of hung goroutines to one per pod per volume.
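For reference, the goroutine-wrapping approach from that discussion could look roughly like the sketch below. It is not code from this PR: `lstatWithTimeout` and `errStatTimeout` are hypothetical names, and the sketch leaves out the per-pod, per-volume cap on hung goroutines.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// errStatTimeout is a hypothetical sentinel error used only in this sketch.
var errStatTimeout = errors.New("lstat timed out")

// lstatWithTimeout runs os.Lstat in its own goroutine and stops waiting after d.
// A call stuck in uninterruptible sleep cannot be cancelled, so on timeout the
// goroutine is leaked; that is why the number of such goroutines needs a cap.
func lstatWithTimeout(path string, d time.Duration) (os.FileInfo, error) {
	type result struct {
		fi  os.FileInfo
		err error
	}
	// Buffered so a late result does not block the goroutine forever.
	ch := make(chan result, 1)
	go func() {
		fi, err := os.Lstat(path)
		ch <- result{fi, err}
	}()
	select {
	case r := <-ch:
		return r.fi, r.err
	case <-time.After(d):
		return nil, errStatTimeout
	}
}

func main() {
	fi, err := lstatWithTimeout(os.TempDir(), 2*time.Second)
	if err != nil {
		fmt.Println("lstat failed:", err)
		return
	}
	fmt.Println("is dir:", fi.IsDir())
}
```

The caller gets control back after the timeout, but the stuck syscall still holds an OS thread, which is part of why avoiding the lstat altogether is attractive.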

However, in my investigation, I found that the callsites that request a listing of the volumes from a particular volume plugin directory don't care about the properties provided by the lstat call. They only care whether or not a directory exists.

Given that constraint, this PR just avoids the lstat call by using Readdirnames() instead of ReadDir() or ReadDirNoExit().

More detail for reviewers

Consider the pod-mounted nfs volume at /var/lib/kubelet/pods/881341b5-9551-11e6-af4c-fa163e815edd/volumes/kubernetes.io~nfs/myvol. The kubelet wedges because ReadDir() and ReadDirNoExit() call syscall.Lstat on myvol, which requires communication with the nfs server. If the nfs server is unreachable, this call hangs forever.

However, for our code, we only care about the names of the files/directories contained in the kubernetes.io~nfs directory, not any of the more detailed information the Lstat call provides. Getting the names can be done with Readdirnames(), which doesn't need to involve the nfs server.
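A minimal sketch of the names-only listing described above, assuming the plugin directory itself lives on the local filesystem. `readDirNames` is a hypothetical helper name for illustration, not necessarily what the PR adds.

```go
package main

import (
	"fmt"
	"os"
)

// readDirNames returns the names of the entries in dirname.
// Unlike (*os.File).Readdir, Readdirnames does not lstat each entry,
// so it does not have to touch whatever is mounted on the entries.
func readDirNames(dirname string) ([]string, error) {
	f, err := os.Open(dirname)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	// A non-positive count returns all remaining names in a single slice.
	return f.Readdirnames(-1)
}

func main() {
	names, err := readDirNames(".")
	if err != nil {
		fmt.Println("listing failed:", err)
		return
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```

The returned names are in directory order rather than sorted, so a caller that relied on the ordering of a sorted listing would need to sort them itself.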

@pmorie @eparis @ncdc @derekwaynecarr @saad-ali @thockin @vishh @kubernetes/rh-cluster-infra


derekwaynecarr · 2016-10-18T14:47:19Z

@k8s-bot gci gke e2e test this

derekwaynecarr · 2016-10-18T15:12:02Z

@sjenning -- per our chat, a comment on this PR that describes the problem with an actual example set of directories would be helpful for simpletons like myself to follow the core problem.

derekwaynecarr · 2016-10-18T16:53:10Z

@k8s-bot gci gke e2e test this

sjenning · 2016-10-18T17:00:02Z

@derekwaynecarr I updated the first comment with a more detailed example of how this happens IRL

derekwaynecarr · 2016-10-18T17:54:52Z

@k8s-bot gci gke e2e test this

derekwaynecarr · 2016-10-18T17:56:04Z

LGTM

derekwaynecarr · 2016-10-18T17:56:32Z

fyi @kubernetes/sig-node

wongma7 · 2016-10-18T18:27:03Z

any chance of this fix being cherry-picked?

derekwaynecarr · 2016-10-18T18:28:58Z

@k8s-bot gci gke e2e test this

k8s-cherrypick-bot · 2016-10-18T18:29:00Z

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

k8s-github-robot · 2016-10-18T18:54:41Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

k8s-ci-robot · 2016-10-18T19:34:31Z

Jenkins GCI GKE smoke e2e failed for commit da3683e. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

k8s-github-robot · 2016-10-18T19:37:31Z

Automatic merge from submit-queue

saad-ali · 2016-10-24T23:01:50Z

Thanks for digging into this and getting a workaround out for 1.5 @sjenning!

Let's get this cherry-picked back to 1.4.

jingxu97 · 2016-10-27T17:09:49Z

@sjenning, during some testing, I noticed that your fix still doesn't solve the problem of the kubelet hanging on unresponsive nfs. When a node has some directories mounted from the exports of an nfs server (running in a container), then after the nfs server container is deleted, the ReadDirNoStat() function hangs too.

Could you please check again and see whether you get a different result? Thank you!

cc @kubernetes/sig-storage

kubelet: storage: don't hang kubelet on unresponsive nfs

da3683e

googlebot added the cla: yes label Oct 18, 2016

k8s-github-robot assigned yujuhong Oct 18, 2016

k8s-github-robot added the size/M and release-note-label-needed labels Oct 18, 2016

derekwaynecarr self-assigned this Oct 18, 2016

derekwaynecarr unassigned yujuhong Oct 18, 2016

derekwaynecarr added the lgtm label Oct 18, 2016

derekwaynecarr added the cherrypick-candidate label Oct 18, 2016

k8s-cherrypick-bot removed the cherrypick-candidate label Oct 18, 2016

derekwaynecarr added this to the v1.5 milestone Oct 18, 2016

k8s-github-robot merged commit 84aa5f6 into kubernetes:master Oct 18, 2016

sjenning mentioned this pull request Oct 18, 2016

UPSTREAM: 35038: don't hang kubelet on unresponsive nfs openshift/origin#11424

Closed

saad-ali mentioned this pull request Oct 24, 2016

Fix volume states out of sync problem after kubelet restarts #33616

Merged

saad-ali modified the milestones: v1.4, v1.5 Oct 24, 2016

jingxu97 mentioned this pull request Oct 27, 2016

Hung volumes can wedge the kubelet #31272

Open

sjenning deleted the nfs-nonblock-reader2 branch November 21, 2016 16:17

tnqn mentioned this pull request Apr 29, 2021

kubelet SyncLoop hangs on "os.Stat" forever if there is an unresponsive NFS volume #101622

Closed
