
Hung volumes can wedge the kubelet #31272

Open
ncdc opened this issue Aug 23, 2016 · 97 comments
Assignees
Labels
  • area/kubelet
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@ncdc
Member

ncdc commented Aug 23, 2016

If you have pods that use something like NFS storage and the system is unable to read or unmount the mounted directory, it is possible to completely wedge the kubelet so that it can't successfully run any new pods that use volumes (which is basically all of them, since they use secret tokens) until either the storage issue is resolved or you restart the kubelet.

To reproduce:

  1. Create a pod that uses an NFS volume
  2. Stop the NFS server
  3. Try to delete the pod
  4. Wait a couple of minutes - you'll see the pod stuck Terminating
  5. Try to create a new pod (kubectl run --rm --attach --restart Never --image busybox bbox date)

The busybox pod will be stuck ContainerCreating with events such as these:

Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath   Type            Reason          Message
  ---------     --------        -----   ----                    -------------   --------        ------          -------
  8m            8m              1       {default-scheduler }                    Normal          Scheduled       Successfully assigned bbox to 127.0.0.1
  6m            6s              4       {kubelet 127.0.0.1}                     Warning         FailedMount     Unable to mount volumes for pod "bbox_default(12a20cdb-694f-11e6-baa4-001c42e13e5d)": timeout expired waiting for volumes to attach/mount for pod "bbox"/"default". list of unattached/unmounted volumes=[default-token-joiwi]
  6m            6s              4       {kubelet 127.0.0.1}                     Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "bbox"/"default". list of unattached/unmounted volumes=[default-token-joiwi]

This stack trace, gathered after I deleted the pod, shows that the volume reconciler is still trying to get the volumes for the pod I just deleted. You'll also see a goroutine that is stuck trying to stop the Docker container.

This stack trace, gathered after I tried to create the bbox pod, shows that the new pod (bbox) is waiting for its volumes (in this case, secrets) to attach/mount.

We've seen this in 1.2.x and I just reproduced it in master (commit f297ea9).

cc @kubernetes/sig-storage @kubernetes/sig-node @kubernetes/rh-cluster-infra @pmorie @derekwaynecarr @timothysc @saad-ali

@ncdc ncdc added the sig/storage label Aug 23, 2016
@pmorie
Member

pmorie commented Aug 23, 2016

Gonna fix this one now.

@luxas
Member

luxas commented Aug 23, 2016

v1.4?

@pmorie
Member

pmorie commented Aug 23, 2016

@luxas you betcha, this is muy no bueno imo

@dchen1107 dchen1107 added the kind/bug and priority/important-soon labels Aug 23, 2016
@dchen1107 dchen1107 added this to the v1.4 milestone Aug 23, 2016
@timothysc
Member

Ohh man...

@jingxu97
Contributor

One thing that might be related: during tearDown() for the volume, if IsLikelyNotMountPoint() fails while checking the dir, it fails the whole tearDown() operation, which blocks the following operations on the volume.

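For context, a minimal Go sketch of the failure mode described above, using simplified hypothetical names (the Mounter interface and tearDown signature here are illustrative, not the kubelet's actual code): the mount-point check runs before the unmount, so an error from the check aborts the whole teardown and the volume never gets cleaned up.

package volumesketch

import "os"

// Mounter is a simplified stand-in for the kubelet's mount interface.
type Mounter interface {
    IsLikelyNotMountPoint(dir string) (bool, error)
    Unmount(dir string) error
}

// tearDown illustrates the ordering problem: the mount-point check happens
// first, and any error from it fails the entire teardown.
func tearDown(m Mounter, dir string) error {
    notMnt, err := m.IsLikelyNotMountPoint(dir) // a stat() against a dead NFS server can hang or return an error
    if err != nil {
        return err // teardown aborts here; the unmount below never runs, and later operations on the volume block
    }
    if notMnt {
        return os.Remove(dir)
    }
    if err := m.Unmount(dir); err != nil {
        return err
    }
    return os.Remove(dir)
}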

@ncdc
Member Author

ncdc commented Aug 23, 2016

Yes, there definitely could be multiple places that are affecting this.

@pmorie
Member

pmorie commented Aug 23, 2016

Yeah, I think I want to audit this entire subsystem before we start making code changes

@pmorie
Member

pmorie commented Aug 23, 2016

@ncdc @jingxu97 I think the long and short of this is that we need to audit for anything that tries to read a directory that might be a volume mountpoint and pass a timeout to it. Agree?
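
For illustration, a minimal sketch of one way to bound such a call; statWithTimeout is a hypothetical helper, not existing kubelet API. The filesystem call runs in its own goroutine, which may stay stuck in the syscall against a dead NFS server, but the caller regains control after the timeout.

package fssketch

import (
    "errors"
    "os"
    "time"
)

var errTimeout = errors.New("timed out waiting for filesystem operation")

// statWithTimeout runs os.Stat in a separate goroutine so that a hung network
// filesystem cannot block the caller indefinitely. The goroutine itself may
// remain stuck in the syscall, but the caller is unblocked after the timeout.
func statWithTimeout(path string, timeout time.Duration) (os.FileInfo, error) {
    type result struct {
        info os.FileInfo
        err  error
    }
    ch := make(chan result, 1) // buffered so the goroutine can finish even after the timeout fires
    go func() {
        info, err := os.Stat(path)
        ch <- result{info, err}
    }()
    select {
    case r := <-ch:
        return r.info, r.err
    case <-time.After(timeout):
        return nil, errTimeout
    }
}

The trade-off is a potentially leaked goroutine per hung call, which is why a timeout alone is a mitigation rather than a fix.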

@ncdc
Member Author

ncdc commented Aug 23, 2016

At least that, yes. Not sure if there is more we need to be looking at.


@pmorie
Member

pmorie commented Aug 23, 2016

I think it has to be anything that can do a stat.

@vishh
Contributor

vishh commented Aug 23, 2016

In general, the kubelet sync loop should run operations asynchronously and not make assumptions about libraries unless the library provides an explicit contract.


@pmorie
Member

pmorie commented Aug 23, 2016

@vishh are you saying we should have made all things that can do a stat have a timeout from the get-go? :)

@vishh
Contributor

vishh commented Aug 23, 2016

I'm saying that each pod's setup should be asynchronous and should not affect the sync loop :)
Having a timeout on all kernel syscalls would be utopia :)


@pmorie
Member

pmorie commented Aug 23, 2016

@vishh so actually -- setup is asynchronous, but the sync loop waits for it to complete 😭
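
For illustration, a minimal sketch of the decoupling being described, with hypothetical names (podWorker and syncPod are not the kubelet's actual pod-worker code): the loop records that setup is in flight and moves on instead of waiting for it.

package syncsketch

import (
    "log"
    "sync"
)

// podWorker dispatches volume setup to a per-pod goroutine so that a hung
// mount only blocks that goroutine, never the sync loop itself.
type podWorker struct {
    mu      sync.Mutex
    pending map[string]bool // pods whose volume setup is still in flight
}

// newPodWorker returns a worker with an empty pending set.
func newPodWorker() *podWorker {
    return &podWorker{pending: make(map[string]bool)}
}

// syncPod is called from the sync loop; it never blocks on setupVolumes.
func (w *podWorker) syncPod(podUID string, setupVolumes func() error) {
    w.mu.Lock()
    if w.pending[podUID] {
        w.mu.Unlock()
        return // setup already running for this pod; check again on the next iteration
    }
    w.pending[podUID] = true
    w.mu.Unlock()

    go func() {
        defer func() {
            w.mu.Lock()
            delete(w.pending, podUID) // allow a retry on the next sync iteration
            w.mu.Unlock()
        }()
        if err := setupVolumes(); err != nil { // may hang on a dead NFS mount, but only in this goroutine
            log.Printf("volume setup for pod %s failed: %v", podUID, err)
        }
    }()
}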

@pmorie
Member

pmorie commented Aug 23, 2016

Scratchpad for call sites that might be problematic:

  1. pkg/kubelet/volumemanager/reconciler/reconciler.go#getVolumesFromPodDir - ioutil.ReadDir
  2. pkg/kubelet/kubelet_getters.go#getPodVolumeNameListFromDisk - ioutil.ReadDir,util.ReadDirNoExit
  3. mounter.IsLikelyNotMountPoint implementations

@jingxu97 @saad-ali does that seem accurate?

@jingxu97
Contributor

@pmorie, I didn't find any others.

@saad-ali
Member

Scratchpad for call sites that might be problematic:

  1. pkg/kubelet/volumemanager/reconciler/reconciler.go#getVolumesFromPodDir - ioutil.ReadDir
  2. pkg/kubelet/kubelet_getters.go#getPodVolumeNameListFromDisk - ioutil.ReadDir,util.ReadDirNoExit
  3. mounter.IsLikelyNotMountPoint implementations

@jingxu97 @saad-ali does that seem accurate?

@pmorie Thanks for jumping on this. Yes, that captures it. The problematic ones are 1 and 2, since those block the volume manager reconciler and the kubelet sync loop. 3 is less of an issue since it only blocks plugin-level operations, which are async.

@pmorie
Member

pmorie commented Aug 24, 2016

@saad-ali in this case I feel like (3) is still an issue, even though with the attach/detach refactor, theoretically, IsLikelyNotMountPoint should never be called when a mountpoint already exists. In the case of NFS it is entirely possible to block forever, and not even the insanely high TCP timeout will save you, because the NFS protocol uses UDP by default. So I'd like to be safe. Safety first, after all.

@pmorie
Member

pmorie commented Aug 24, 2016

So, I have some code wherein I introduced timeouts at (1) and (2), and realized I'm not quite certain what the desired behavior is when this happens. The behavior I see with my WIP code is:

  1. Pod with NFS mount is running
  2. Stop NFS server
  3. Delete pod
  4. Pod gets stuck in terminating, timeout errors show in log
  5. Start NFS server
  6. Pod get terminated

I think this is the correct behavior -- does anyone disagree?

@jeffvance
Contributor

What if the NFS server is never restarted, or its exports are changed, or its IP is changed? Will the pod remain stuck forever?

@ncdc
Member Author

ncdc commented Aug 24, 2016

I think this is the correct behavior -- does anyone disagree?

No disagreement. The user's intent is for the pod to go away.

@ncdc
Member Author

ncdc commented Aug 24, 2016

@jeffvance you can force the deletion of the pod using a grace period of 0.

@pmorie
Member

pmorie commented Aug 24, 2016

@jeffvance, force delete doesn't go through the kubelet.

@saad-ali
Member

saad-ali commented Jun 7, 2017

This did not make it into 1.7. Pushing to 1.8

@gnufied
Member

gnufied commented Jun 27, 2017

I was trying to reproduce this and I do not think the original steps work anymore. The following steps, which @sjenning used, also do not work:

  1. start nfs server
  2. start nfs pod
  3. stop nfs server
  4. delete nfs pod
  5. stop kubelet
  6. start kubelet
  7. try to start a pod

The main thing is that we do not clean up volumes for pods that have not been deleted from the API server (so a pod stuck in the "Terminating" state doesn't count). The e2e tests @jeffvance wrote also don't reproduce this.

I am going to keep looking. It is possible that some other refactoring could bring the issue back, so it is better to safeguard against it. @jingxu97 @sjenning, if you can reproduce this somehow, let me know.

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Labels Complete

@ncdc @pmorie

Issue label settings:

  • sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 7, 2018
@errordeveloper
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Jan 22, 2018
@sp-joseluis-ledesma

Today I realized we are seeing the same behavior with the EFS provider (https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs). EFS is Amazon's NFS implementation; the efs-provider basically creates a directory for each volume you need, and internally it is handled as an NFS PV. The problem is that the efs-provider deletes the volume (that is, it removes a directory on the NFS share) before the kubelet unmounts it, resulting in an error in TearDown:

Jun 15 10:58:16 ip-172-17-96-143 kubelet[2488]: E0615 10:58:16.423138    2488 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/nfs/d2e22f1c-7086-11e8-939c-0a8a94845282-pvc-d25050ba-7086-11e8-b2ec-0e3c06e0fed6\" (\"d2e22f1c-7086-11e8-939c-0a8a94845282\")" failed. No retries permitted until 2018-06-15 11:00:18.423105953 +0000 UTC m=+174041.026910867 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"dynamodb-pd\" (UniqueName: \"kubernetes.io/nfs/d2e22f1c-7086-11e8-939c-0a8a94845282-pvc-d25050ba-7086-11e8-b2ec-0e3c06e0fed6\") pod \"d2e22f1c-7086-11e8-939c-0a8a94845282\" (UID: \"d2e22f1c-7086-11e8-939c-0a8a94845282\") : Error checking if path exists: stat /var/lib/kubelet/pods/d2e22f1c-7086-11e8-939c-0a8a94845282/volumes/kubernetes.io~nfs/pvc-d25050ba-7086-11e8-b2ec-0e3c06e0fed6: stale NFS file handle"

and resulting in the pods getting stuck in the Terminating state:

NAME                        READY     STATUS        RESTARTS   AGE
dynamodb-5bbcb7b45c-62mtn   0/1       Terminating   0          35m

As I understand it, the kubelet should be able to tear down an NFS volume that is in the "stale NFS file handle" state.
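
For what it's worth, a minimal sketch of one way a path-existence check could treat a stale handle as "the mount point is still there", so teardown proceeds to the unmount instead of aborting; pathExistsIgnoringStale is a hypothetical helper, not the kubelet's actual check.

package stalesketch

import (
    "errors"
    "os"
    "syscall"
)

// pathExistsIgnoringStale treats ESTALE ("stale NFS file handle") from stat
// as "the mount point exists", so the caller can go on to unmount it rather
// than failing the teardown.
func pathExistsIgnoringStale(path string) (bool, error) {
    _, err := os.Stat(path)
    if err == nil {
        return true, nil
    }
    if os.IsNotExist(err) {
        return false, nil
    }
    if errors.Is(err, syscall.ESTALE) {
        return true, nil // the server-side object is gone, but the local mount point remains and should be unmounted
    }
    return false, err
}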

@msau42
Member

msau42 commented Jun 19, 2018

Now with the StorageProtection beta in 1.10, we won't allow deleting the PVC (and PV) until all Pods using it are terminated.

@sp-joseluis-ledesma

Does PVCProtection (alpha in 1.9) provide the same functionality?

@msau42
Member

msau42 commented Jun 25, 2018

Yes, the PVCProtection alpha feature in 1.9 provides the same functionality.

j4ckstraw added a commit to j4ckstraw/kubernetes that referenced this issue Aug 15, 2023
see kubernetes#31272, kubernetes#101622

Signed-off-by: j4ckstraw <j4ckstraw@foxmail.com>