Orphaned pod found - but volume paths are still present on disk #60987

Closed
patrickstjohn opened this issue Mar 9, 2018 · 137 comments · Fixed by #95301
Labels
needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@patrickstjohn commented Mar 9, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
BUG

What happened:
Kubelet is periodically going into an error state and causing errors with our storage layer (Ceph, shared filesystem). Upon cleaning out the orphaned pod directory, things eventually right themselves.

  • Workaround: rmdir /var/lib/kubelet/pods/*/volumes/*rook/*

What you expected to happen:
Kubelet should intelligently deal with orphaned pods. Cleaning a stale directory manually should not be required.

How to reproduce it (as minimally and precisely as possible):
Using rook-0.7.0 (this isn't a Rook problem as far as I can tell, but this is how we're reproducing it):
kubectl create -f rook-operator.yaml
kubectl create -f rook-cluster.yaml
kubectl create -f rook-filesystem.yaml

Mount/write to the shared filesystem and monitor /var/log/messages for the following:
kubelet: E0309 16:46:30.429770 3112 kubelet_volumes.go:128] Orphaned pod "2815f27a-219b-11e8-8a2a-ec0d9a3a445a" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.

Anything else we need to know?:
This looks identical to the following: #45464 but for a different plugin.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
    Bare-metal private cloud

  • OS (e.g. from /etc/os-release):
    Red Hat Enterprise Linux Server release 7.4 (Maipo)

  • Kernel (e.g. uname -a):
    Linux 4.4.115-1.el7.elrepo.x86_64 #1 SMP Sat Feb 3 20:11:41 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
    kubeadm

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 9, 2018
@mlmhl (Contributor) commented Mar 10, 2018

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 10, 2018
@benceszikora

I have the same issue, except that I can't even remove that directory:
rmdir: failed to remove ‘/var/lib/kubelet/pods/7b383940-3cc7-11e8-a78b-b8ca3a70880c/volumes/rook.io~rook/backups’: Device or resource busy

@patrickstjohn (Author)

@iliketosneeze When we've run into that issue, the only recourse, unfortunately, is to reboot the host. Once it comes back up, things seem to be in a clean state.

@lukmanulhakimd

I also experience this issue using Kubernetes v1.10.1. Manually deleting the directory solves the problem. But yes, kubelet should intelligently deal with orphaned pods.

@iliketosneeze maybe you should try unmounting the directory from tmpfs.
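
Something like this might work (just a rough sketch; <pod-uid> and the volume name are placeholders you would have to look up on your own node):

# list any tmpfs mounts left under the orphaned pod's directory
findmnt -t tmpfs | grep /var/lib/kubelet/pods/<pod-uid>
# unmount the leftover volume, then the now-empty directory can be removed
umount /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/<volume-name>
rmdir /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/<volume-name>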

@bersace commented Apr 23, 2018

Same here with minikube:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.1", GitCommit:"d4ab47518836c750f9949b9e0d387f20fb92260b", GitTreeState:"clean", BuildDate:"2018-04-12T14:26:04Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
$ journalctl -f
Apr 23 12:37:26 minikube kubelet[2886]: E0423 12:37:26.781919    2886 kubelet_volumes.go:140] Orphaned pod "a08c2261-3eec-11e8-83b3-a0ea30334065" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Apr 23 12:37:28 minikube kubelet[2886]: E0423 12:37:28.789802    2886 kubelet_volumes.go:140] Orphaned pod "a08c2261-3eec-11e8-83b3-a0ea30334065" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
# find /var/lib/kubelet/pods/a08c2261-3eec-11e8-83b3-a0ea30334065/containers/kube-proxy/4ac98a82
/var/lib/kubelet/pods/a08c2261-3eec-11e8-83b3-a0ea30334065/containers/kube-proxy/4ac98a82
# ls /var/lib/kubelet/pods/a08c2261-3eec-11e8-83b3-a0ea30334065/volumes/
kubernetes.io~configmap  kubernetes.io~secret
$ 

That looks like an internal pod. I guess there is nothing to remove.

@Overv commented May 22, 2018

We are seeing a similar issue with our own custom flexvolume in a Kubernetes 1.8.9 cluster. Is there any way to resolve this without restarting the host until there is an actual solution?

@benceszikora

@lukmanulhakimd That did help with removing them, but then new volume mounts failed as the host was stuck in uninterruptible I/O. I had to cold-cycle the nodes in the end.

@michaelkoro

I'm having the same issue with Kubernetes 1.8.5, Rancher 1.6.13, docker-ce 17.03.02.
I think the kubelet should be able to acknowledge this problem, which obviously doesn't happen.

@pvlltvk commented Jul 27, 2018

We're also having this issue with Kubernetes 1.9.6, Docker 17.03.1-ce and vSphere Cloud Provider for persistent storage.

@minhdanh commented Aug 2, 2018

Having the same issue with Kubernetes 1.10.2, Docker 18.06.0-ce

@owend commented Aug 2, 2018

Having the same issue with Kubernetes 1.11, Docker 18.06.0-ce and ceph 13.2.1

@piaoyu commented Aug 9, 2018

Having the same issue with Kubernetes v1.9.0 Docker 1.12.6 and rook master

@redbaron (Contributor) commented Aug 13, 2018

Those who are affected: do you see anything interesting when you run kubelet with --v=5?

@michaelkoro

I ran with --v=10.
Haven't found too much info though.

@majstorki88

I also ran with verbosity 11 and the vSphere dynamic storage provider.

@redbaron (Contributor)

Many cloud providers call a generic util helper to unmount and delete the directory, which would explain why multiple providers show the same symptoms.

I assume the disks are unmounted, but a directory is just left behind, right? The directory deletion code is the following:

notMnt, mntErr := mounter.IsLikelyNotMountPoint(mountPath)
if mntErr != nil {
	return mntErr
}
if notMnt {
	glog.V(4).Infof("%q is unmounted, deleting the directory", mountPath)
	return os.Remove(mountPath)
}
return fmt.Errorf("Failed to unmount path %v", mountPath)

and it has to be producing some error or a v4 info message immediately after "<path> is a mountpoint, unmounting" is printed. In the working case it should print "<path> is unmounted, deleting the directory" or an error.

The only way it can decide not to delete the directory is if IsLikelyNotMountPoint thinks it is still mounted. Maybe it is a containerized kubelet that confuses it, or some mount --bind in upper directories.
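
If anyone wants to check that theory by hand, here is a rough sketch (placeholder paths) for verifying whether the host still considers the leftover path a mount point:

# exit status 0 means the kernel still treats the path as a mount point
mountpoint /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume-name>
# also look for the pod UID in the full mount table, including bind mounts
grep <pod-uid> /proc/self/mountinfo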

@redbaron (Contributor) commented Aug 14, 2018

I tried to reproduce it by creating/deleting a StatefulSet with a single PVC and 50 pods in it, all allocated to a single node, but no luck.

Kubelet 1.11.2 , vSphere cloud provider.

A pretty much identical setup, differing only in the Kubernetes version, was regularly reporting orphaned pods with kubelet 1.10.5.

@timchenxiaoyu (Contributor)

k8s 1.6.4 docker 1.12.6

E0816 16:22:24.166061 415225 kubelet_volumes.go:114] Orphaned pod "4c4f8eb9-a12d-11e8-b849-c0bfc0a0d6e2" found, but volume paths are still present on disk.

@githubcdr

Confirmed here also; I had to clean the files manually, a reboot did not solve this issue.

@fvigotti

I observe a lot of these errors logged by kubelet in my clusters. This bug seems to be related to pods that were using a custom bash FlexVolume plugin (which mounts CIFS volumes). Anyway, this is very annoying: about 90% of the kubelet log lines are these:

E0823 10:31:01.847946    1303 kubelet_volumes.go:140] Orphaned pod "19a4e3e6-a562-11e8-9a25-309c23027882" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
E0823 10:31:03.840552    1303 kubelet_volumes.go:140] Orphaned pod "19a4e3e6-a562-11e8-9a25-309c23027882" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.

printed every two seconds. Fixing this requires a manual operation (or automating a risky rm -Rf based on a log-line parser), but that is a poor man's workaround, while kubelet could/should/must handle the problem itself.
I have seen discussion around this bug for more than a year now. Is it possible that nobody considers this worth fixing?
If nobody wants to fix it, I suggest decreasing the error level (outputting "ERROR" in 90% of my log lines, when you don't consider this a serious bug, is wrong).

@deshui123

Having the same issue:
[root@sandbox-worker-05 /var/log]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

[root@sandbox-worker-05 ~]$ calicoctl version -c /etc/cni/calico.cfg
Client Version: v1.6.3
Build date: 2017-12-20T22:32:36+0000
Git commit: d4cfc95c
Cluster Version: v2.6.6
Cluster Type: unknown

@redbaron (Contributor)

At least version 1.11.2 doesn't have this issue; we stopped seeing this error after an upgrade.

@fredkan (Member) commented Sep 10, 2018

The mountpoint is not unmounted when the orphaned pod is removed, so RemoveAllOneFilesystem is not executed.

Why is the mountpoint not unmounted? Maybe the volume manager hit an error during unmount, maybe a user process was using the mountpoint directory during unmount, or some other reason?

// If there are still volume directories, do not delete directory
volumePaths, err := kl.getPodVolumePathListFromDisk(uid)
if err != nil {
	orphanVolumeErrors = append(orphanVolumeErrors, fmt.Errorf("Orphaned pod %q found, but error %v occurred during reading volume dir from disk", uid, err))
	continue
}
if len(volumePaths) > 0 {
	orphanVolumeErrors = append(orphanVolumeErrors, fmt.Errorf("Orphaned pod %q found, but volume paths are still present on disk", uid))
	continue
}
glog.V(3).Infof("Orphaned pod %q found, removing", uid)
if err := removeall.RemoveAllOneFilesystem(kl.mounter, kl.getPodDir(uid)); err != nil {
	glog.Errorf("Failed to remove orphaned pod %q dir; err: %v", uid, err)
	orphanRemovalErrors = append(orphanRemovalErrors, err)
}

I think we can add an unmount step when leftover mountpoints are detected:

if len(volumePaths) > 0 {
	// TODO: add the umount process and check again
	// orphanVolumeErrors = append(orphanVolumeErrors, fmt.Errorf("Orphaned pod %q found, but volume paths are still present on disk", uid))
	// continue
}

PR: #68616

@smanpathak

Faced the same problem with a Datera array and K8s. Rebooting the nodes cleared up the hung pods.

@MrAmbiG commented Dec 30, 2020

same issue
kubernetes 1.18

@shelmingsong

same issue

Kubernetes 1.18.9
Rook 1.4.4
Node Linux Kernel 5.8.14

any progress on this issue?

@GrzegorzDrozda

same issue
Kubernetes 1.19.4
Rook 1.5.4
Kernel: 3.10.0-1160.11.1.el7.x86_64

@instantlinux

For the past 2 years...steps to reproduce: run k8s for a while. Ungracefully reboot the server. It will come back up with 1 to a dozen of these, spewing out every second or two: enough to dominate all entries sent from a production cluster to a centralized syslog.

@smallersoup

same issue
Kubernetes v1.18.14
Docker 19.03.8
Kernel: 3.10.0-1127.13.1.el7.x86_64

@mengjiao-liu (Member)

Same issue in kubernetes v1.20.2

@andyzhangx (Member)

Why is this issue closed?

@andyzhangx (Member)

/reopen

@k8s-ci-robot (Contributor)

@andyzhangx: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Mar 17, 2021
@k8s-ci-robot (Contributor)

@patrickstjohn: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 17, 2021
@rdxmb commented Mar 17, 2021

k8s-ci-robot closed this in #95301 28 days ago

Seems like there is a merged fix for that: #95301

@andyzhangx (Member)

thanks! close it.
/close

@k8s-ci-robot (Contributor)

@andyzhangx: Closing this issue.

In response to this:

thanks! close it.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx (Member)

/close

@anupamdialpad

For those of you (like me) still using an older version and facing this problem: you tried rm -rf but it failed with Device or resource busy. You can try to unmount the path by:

  • Identifying the mounted path. I used findmnt to get the exact path.
  • Unmounting those paths (roughly as sketched below).
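
A sketch only, with placeholder paths:

# list what is still mounted under the stuck pod directory
findmnt -l | grep /var/lib/kubelet/pods/<pod-uid>
# unmount each path printed above; afterwards the directory can be removed
umount /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume-name>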

@instantlinux

A warning to anyone with those older versions still running: avoid invoking rm -rf in any affected /var/lib/kubelet/pods subdirectory. There could be live data still mounted under those pods, and you will lose it, even if, for example, it's a remote volume on NFS / EFS or the like. Instead, to stop such orphaned-pod warning logs, use an mv command to move the affected directory out from under /var/lib/kubelet/pods.

@msau42 (Member) commented Mar 22, 2021

fyi, the fix in #95301 only removes empty directories. This is to address some use cases where the node rebooted and the mounts are gone but the directories remain.

It doesn't fix other scenarios that others are mentioning where the directory is still mounted. In those cases, you will need to work with the affected volume plugin owners to figure out what is going wrong that is preventing the volume from getting unmounted.
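
A quick way to tell which case a node is in (just a sketch, with a placeholder pod UID): if nothing is mounted under the pod directory and the leftover volume directories are empty, the fix in #95301 covers it; if something is still mounted, it points at the volume plugin.

# anything still mounted under this orphaned pod's directory?
grep <pod-uid> /proc/mounts
# if nothing is mounted, check whether the leftover volume directories are empty
find /var/lib/kubelet/pods/<pod-uid>/volumes -mindepth 1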

@IJOL commented Dec 4, 2021

We had the error below (a big pile of them, I must say) in our kubelet logs just today. Our k8s v1.18.10 seems a little outdated, but we are stuck at that version for now.

E1204 15:46:01.851192 1556 kubelet_volumes.go:154] orphaned pod "067f154f-7ad7-4bbe-b83c-87539b66638e" found, but volume paths are still present on disk : There were a total of 40 errors similar to this. Turn up verbosity to see them.
Is there any script to find those 40 orphaned pods we have? Changing the verbosity is not possible for us just now.

@nickma82 commented Nov 2, 2022

Just leaving this here for further use.. maybe ;)

POD_UUID=$(journalctl -e | tail -n1 | sed -nr 's/.+\\"([0-9a-f\-]{36}).*volume paths are still present on disk.*/\1/p') 
DIRR="/var/lib/kubelet/pods/${POD_UUID}/volumes/kubernetes.io~csi/pvc-*"
rm -rfv ${DIRR}
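
And a rough companion sketch (not an official tool; it assumes kubectl access from the node and ignores the static/mirror pod case) for listing pod directories whose UID the API server no longer knows about:

# collect the UIDs of all pods the API server knows about
KNOWN_UIDS=$(kubectl get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}')
# flag any local pod directory whose UID is not in that list
for dir in /var/lib/kubelet/pods/*; do
  uid=$(basename "$dir")
  echo "$KNOWN_UIDS" | grep -qw "$uid" || echo "possibly orphaned: $dir"
done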

@kwenzh commented Dec 6, 2022

the same with:

E1206 14:22:38.458396    2927 kubelet_volumes.go:179] orphaned pod "4cdc3773-4c0a-499c-a19d-4eed68598887" found, but failed to rmdir() volume at path /xxxxxxx/pods/4cdc3773-4c0a-499c-a19d-4eed68598887/volumes/kubernetes.io~csi/local-volume-dd70eba7-xxxxxxxxxxxxxxxxx b335: directory not empty : There were a total of 1 errors similar to this. Turn up verbosity to see them.


kubectl  version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

uname -rs
Linux 3.10.0-1062.4.1.el7.x86_64

@mingregister (Contributor)

Since we cannot solve this problem once and for all, why aren't we reporting it as an event? Let users choose their own way to fix this problem in their environment.
