Pods stuck on terminating #51835

Open
igorleao opened this Issue Sep 1, 2017 · 115 comments

@igorleao

igorleao commented Sep 1, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
Pods stuck on terminating for a long time

What you expected to happen:
Pods get terminated

How to reproduce it (as minimally and precisely as possible):

  1. Run a deployment
  2. Delete it
  3. Pods are still terminating

Anything else we need to know?:
Kubernetes pods stuck as Terminating for a few hours after getting deleted.

Logs:
kubectl describe pod my-pod-3854038851-r1hc3

Name:				my-pod-3854038851-r1hc3
Namespace:			container-4-production
Node:				ip-172-16-30-204.ec2.internal/172.16.30.204
Start Time:			Fri, 01 Sep 2017 11:58:24 -0300
Labels:				pod-template-hash=3854038851
				release=stable
				run=my-pod-3
Annotations:			kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"container-4-production","name":"my-pod-3-3854038851","uid":"5816c...
				prometheus.io/scrape=true
Status:				Terminating (expires Fri, 01 Sep 2017 14:17:53 -0300)
Termination Grace Period:	30s
IP:
Created By:			ReplicaSet/my-pod-3-3854038851
Controlled By:			ReplicaSet/my-pod-3-3854038851
Init Containers:
  ensure-network:
    Container ID:	docker://guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/ensure-network@sha256:guid-0
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		True
    Restart Count:	0
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Containers:
  container-1:
    Container ID:	docker://container-id-guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-1@sha256:guid-2
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	1G
    Requests:
      cpu:	100m
      memory:	1G
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-2:
    Container ID:	docker://container-id-guid-2
    Image:		alpine:3.4
    Image ID:		docker-pullable://alpine@sha256:alpine-container-id-1
    Port:		<none>
    Command:
      X
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	20m
      memory:	40M
    Requests:
      cpu:		10m
      memory:		20M
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-3:
    Container ID:	docker://container-id-guid-3
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-3@sha256:guid-3
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	200M
    Requests:
      cpu:	100m
      memory:	100M
    Readiness:	exec [nc -zv localhost 80] delay=1s timeout=1s period=5s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-4:
    Container ID:	docker://container-id-guid-4
    Image:		XXXX
    Image ID:		docker-pullable://repo/container-4@sha256:guid-4
    Port:		9102/TCP
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	600m
      memory:	1500M
    Requests:
      cpu:	600m
      memory:	1500M
    Readiness:	http-get http://:8080/healthy delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /app/config/external from volume-2 (ro)
      /data/volume-1 from volume-1 (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Conditions:
  Type		Status
  Initialized 	True
  Ready 	False
  PodScheduled 	True
Volumes:
  volume-1:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	volume-1
    Optional:	false
  volume-2:
    Type:	ConfigMap (a volume populated by a ConfigMap)
    Name:	external
    Optional:	false
  default-token-xxxxx:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-xxxxx
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>

sudo journalctl -u kubelet | grep "my-pod"

[...]
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing address using workloadID" Workload=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing all IPs with handle 'my-pod-3854038851-r1hc3'"
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-3854038851-r1hc3 workloadId=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Teardown processing complete." Workload=my-pod-3854038851-r1hc3 endpoint=<nil>
Sep 01 17:19:06 ip-172-16-30-204 kubelet[9619]: I0901 17:19:06.591946    9619 kubelet.go:1824] SyncLoop (DELETE, "api"):my-pod-3854038851(b8cf2ecd-8f25-11e7-ba86-0a27a44c875)"

sudo journalctl -u docker | grep "docker-id-for-my-pod"

Sep 01 17:17:55 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:55.695834447Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"
Sep 01 17:17:56 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:56.698913805Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T15:13:53Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
    AWS

  • OS (e.g. from /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Kernel (e.g. uname -a):
    Linux ip-172-16-30-204 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
    Kops

  • Others:
    Docker version 1.12.6, build 78d1802

@kubernetes/sig-aws @kubernetes/sig-scheduling

@igorleao


igorleao commented Sep 1, 2017

@kubernetes/sig-aws @kubernetes/sig-scheduling

@resouer

Member

resouer commented Sep 3, 2017

Usually volume and network cleanup consume the most time during termination. Can you find out in which phase your pod is stuck? Volume cleanup, for example?

@dixudx

Member

dixudx commented Sep 3, 2017

Usually volume and network cleanup consume the most time during termination.

Correct. They are the usual suspects.

@igorleao You can try kubectl delete pod xxx --now as well.

@igorleao


igorleao commented Sep 4, 2017

Hi @resouer and @dixudx
I'm not sure. Looking at kubelet logs for a different pod with the same problem, I found:

Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Releasing address using workloadID" Workload=my-pod-969733955-rbxhn
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Releasing all IPs with handle 'my-pod-969733955-rbxhn'"
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-969733955-rbxhn workloadId=my-pod-969733955-rbxhn
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Teardown processing complete." Workload=my-pod-969733955-rbxhn endpoint=<nil>
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.496132    9620 qos_container_manager_linux.go:285] [ContainerManager]: Updated QoS cgroup configuration
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968147    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (spec.Name: "default-token-wrlv3") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968245    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-token-key" (spec.Name: "token-key") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968537    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-token-key\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968508761 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-token-key" (volume.spec.Name: "token-key") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968744    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-default-token-wrlv3\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968719924 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (volume.spec.Name: "default-token-wrlv3") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/default-token-wrlv3 /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_default-token-wrlv3.deleting~940140790: device or resource busy
--
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778742    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_default-token-wrlv3.deleting~940140790" (spec.Name: "wrapped_default-token-wrlv3.deleting~940140790") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778753    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~850807831" (spec.Name: "wrapped_token-key.deleting~850807831") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778764    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~413655961" (spec.Name: "wrapped_token-key.deleting~413655961") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778774    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~818780979" (spec.Name: "wrapped_token-key.deleting~818780979") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778784    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~348212189" (spec.Name: "wrapped_token-key.deleting~348212189") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778796    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~848395852" (spec.Name: "wrapped_token-key.deleting~848395852") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778808    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_default-token-wrlv3.deleting~610264100" (spec.Name: "wrapped_default-token-wrlv3.deleting~610264100") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778820    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~960022821" (spec.Name: "wrapped_token-key.deleting~960022821") devicePath: ""
Sep 02 15:33:05 ip-172-16-30-208 kubelet[9620]: I0902 15:33:05.081380    9620 server.go:778] GET /stats/summary/: (37.027756ms) 200 [[Go-http-client/1.1] 10.0.46.202:54644]
Sep 02 15:33:05 ip-172-16-30-208 kubelet[9620]: I0902 15:33:05.185367    9620 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/GUID-calico-token-w8tzx" (spec.Name: "calico-token-w8tzx") pod "GUID" (UID: "GUID").
Sep 02 15:33:07 ip-172-16-30-208 kubelet[9620]: I0902 15:33:07.187953    9620 kubelet.go:1824] SyncLoop (DELETE, "api"): "my-pod-969733955-rbxhn_container-4-production(GUID)"
Sep 02 15:33:13 ip-172-16-30-208 kubelet[9620]: I0902 15:33:13.879940    9620 aws.go:937] Could not determine public DNS from AWS metadata.
Sep 02 15:33:20 ip-172-16-30-208 kubelet[9620]: I0902 15:33:20.736601    9620 server.go:778] GET /metrics: (53.063679ms) 200 [[Prometheus/1.7.1] 10.0.46.198:43576]
Sep 02 15:33:23 ip-172-16-30-208 kubelet[9620]: I0902 15:33:23.898078    9620 aws.go:937] Could not determine public DNS from AWS metadata.

As you can see, this cluster has Calico for CNI.
The following lines bring my attention:

Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968245    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-token-key" (spec.Name: "token-key") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968537    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-token-key\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968508761 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-token-key" (volume.spec.Name: "token-key") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968744    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-default-token-wrlv3\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968719924 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (volume.spec.Name: "default-token-wrlv3") pod "GUID" (UID: "GUID") with: rename 

Is there a better way to find out in which phase a pod is stuck?

kubectl delete pod xxx --now seems to work pretty well, but I would really like to find the root cause and avoid manual intervention.

@dixudx

Member

dixudx commented Sep 4, 2017

rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy

It seems kubelet failed to tear down the secret volume because that file rename fails with "device or resource busy".

@igorleao Is this reproducible? Or does it only happen occasionally, not consistently? Just to make sure: I've run into such errors before.

@igorleao


igorleao commented Sep 4, 2017

@dixudx It happens several times a day on one particular cluster. Other clusters created with the same version of kops and Kubernetes, in the same week, work just fine.

@jingxu97

Contributor

jingxu97 commented Sep 13, 2017

@igorleao As the log shows, the volume manager failed to remove the secret directory because the device is busy.
Could you please check whether the directory /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key is still mounted or not? Thanks!
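
For example, something along these lines on the node should show whether it is still mounted (GUID is the pod UID placeholder from the sanitized log above):

# on the affected node
mount | grep "kubernetes.io~secret/token-key"
findmnt /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key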

@r7vme


r7vme commented Sep 26, 2017

@igorleao How do you run kubelet? In a container? If so, can you please post your systemd unit or Docker config for kubelet?

We see similar behaviour. We run kubelet as a container, and the problem was partially mitigated by mounting /var/lib/kubelet as shared (by default Docker mounts volumes as rslave). We still see similar issues, but less frequently. Currently I suspect that some other mounts should be done a different way (e.g. /var/lib/docker or /rootfs).

@r7vme


r7vme commented Sep 28, 2017

@stormltf Can you please post your kubelet container configuration?

@r7vme


r7vme commented Sep 29, 2017

@stormltf You're running kubelet in a container without the --containerized flag (which does some tricks with mounts). That basically means all mounts kubelet performs are made in the container's mount namespace. The good news is that they are propagated back to the host's namespace (since you have /var/lib/kubelet as shared), but I'm not sure what happens when that namespace is removed (i.e. when the kubelet container is removed).

Can you please do the following for the stuck pods:

on the node where the pod is running

  • docker exec -ti /kubelet /bin/bash -c "mount | grep STUCK_POD_UUID"
  • and the same on the node itself: mount | grep STUCK_POD_UUID.

Please also do the same for a freshly created pod. I expect to see some /var/lib/kubelet mounts (e.g. default-secret).

@r7vme


r7vme commented Oct 11, 2017

@stormltf did you restart kubelet after first two pods were created?

@r7vme


r7vme commented Oct 12, 2017

@stormltf You can try making /var/lib/docker and /rootfs shared mountpoints (which I don't see in your docker inspect, but do see inside the container).

@ianchakeres

Member

ianchakeres commented Oct 22, 2017

/sig storage

@r7vme


r7vme commented Oct 23, 2017

This might help some people. We are running kubelet in a Docker container with the --containerized flag and were able to solve this issue by mounting /rootfs, /var/lib/docker and /var/lib/kubelet as shared mounts. The final mounts look like this:

      -v /:/rootfs:ro,shared \
      -v /sys:/sys:ro \
      -v /dev:/dev:rw \
      -v /var/log:/var/log:rw \
      -v /run/calico/:/run/calico/:rw \
      -v /run/docker/:/run/docker/:rw \
      -v /run/docker.sock:/run/docker.sock:rw \
      -v /usr/lib/os-release:/etc/os-release \
      -v /usr/share/ca-certificates/:/etc/ssl/certs \
      -v /var/lib/docker/:/var/lib/docker:rw,shared \
      -v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
      -v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
      -v /etc/kubernetes/config/:/etc/kubernetes/config/ \
      -v /etc/cni/net.d/:/etc/cni/net.d/ \
      -v /opt/cni/bin/:/opt/cni/bin/ \

A few more details: this does not completely solve the problem, because for every bind mount you get three mounts inside the kubelet container (two of them parasitic). But at least the shared mounts make it easy to unmount them all in one shot (see the sketch below the list).

CoreOS does not have this problem, because it uses rkt rather than Docker for the kubelet container. In our case kubelet runs in Docker, and every mount inside the kubelet container gets propagated into /var/lib/docker/overlay/... and /rootfs, which is why we get two parasitic mounts for every bind-mounted volume:

  • one from /rootfs in /rootfs/var/lib/kubelet/<mount>
  • one from /var/lib/docker in /var/lib/docker/overlay/.../rootfs/var/lib/kubelet/<mount>
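
A rough illustration of the one-shot unmount (the pod UID is a placeholder, and this assumes the leftover mounts all contain it in their paths; use with care):

# on the node: list every leftover mount referencing the stuck pod's UID, then unmount them
mount | grep "<STUCK_POD_UUID>" | awk '{print $3}' | xargs -r umount
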
@stormltf


stormltf commented Oct 25, 2017

-v /dev:/dev:rw 
-v /etc/cni:/etc/cni:ro 
-v /opt/cni:/opt/cni:ro 
-v /etc/ssl:/etc/ssl:ro 
-v /etc/resolv.conf:/etc/resolv.conf 
-v /etc/pki/tls:/etc/pki/tls:ro 
-v /etc/pki/ca-trust:/etc/pki/ca-trust:ro
-v /sys:/sys:ro 
-v /var/lib/docker:/var/lib/docker:rw 
-v /var/log:/var/log:rw
-v /var/lib/kubelet:/var/lib/kubelet:shared 
-v /var/lib/cni:/var/lib/cni:shared 
-v /var/run:/var/run:rw 
-v /www:/www:rw 
-v /etc/kubernetes:/etc/kubernetes:ro 
-v /etc/os-release:/etc/os-release:ro 
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
@tadas-subonis


tadas-subonis commented Nov 20, 2017

I have the same issue with Kubernetes 1.8.1 on Azure: after the deployment is changed and new pods have been started, the old pods are stuck in Terminating.

@wardhane


wardhane commented Nov 24, 2017

I have the same issue on Kubernetes 1.8.2 on IBM Cloud. After new pods are started the old pods are stuck in terminating.

kubectl version
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.2-1+d150e4525193f1", GitCommit:"d150e4525193f1c79569c04efc14599d7deb5f3e", GitTreeState:"clean", BuildDate:"2017-10-27T08:15:17Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

I have used kubectl delete pod xxx --now as well as kubectl delete pod foo --grace-period=0 --force to no avail.

@r7vme


r7vme commented Nov 24, 2017

If the root cause is still the same (improperly propagated mounts), then this is a distribution-specific bug, IMO.

Please describe how you run kubelet in IBM Cloud. As a systemd unit? Does it have the --containerized flag?

@wardhane


wardhane commented Nov 24, 2017

It is run with the --containerized flag set to false.

   kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2017-11-19 21:48:48 UTC; 4 days ago


--containerized flag:  No
@r7vme


r7vme commented Nov 24, 2017

OK, I need more info. Please see my comment above: #51835 (comment).

Also please show the contents of /lib/systemd/system/kubelet.service, and if there is anything about kubelet in /etc/systemd/system, please share that too.

In particular, if kubelet runs in Docker I want to see all the bind mounts (-v).

@knisbet


knisbet commented Nov 29, 2017

Today I encountered an issue that may be the same as the one described here: pods on one of our customer systems were getting stuck in the Terminating state for several days. We were also seeing the "Error: UnmountVolume.TearDown failed for volume" errors with "device or resource busy" repeated for each of the stuck pods.

In our case it appears to be an issue with Docker on RHEL/CentOS 7.4 based systems, covered in this moby issue: moby/moby#22260 and this moby PR: https://github.com/moby/moby/pull/34886/files

For us, once we set the sysctl option fs.may_detach_mounts=1, all our Terminating pods cleaned up within a couple of minutes.
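
For reference, setting it looks roughly like this (the sysctl.d file name is just an example):

# apply immediately
sudo sysctl -w fs.may_detach_mounts=1
# persist across reboots
echo "fs.may_detach_mounts = 1" | sudo tee /etc/sysctl.d/99-may-detach-mounts.conf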

@nmakhotkin


nmakhotkin commented Nov 29, 2017

I'm also facing this problem: Pods got stuck in Terminating state on 1.8.3.

Relevant kubelet logs from the node:

Nov 28 22:48:51 <my-node> kubelet[1010]: I1128 22:48:51.616749    1010 reconciler.go:186] operationExecutor.UnmountVolume started for volume "nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw" (UniqueName: "kubernetes.io/nfs/58dc413c-d4d1-11e7-870d-3c970e298d91-nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw") pod "58dc413c-d4d1-11e7-870d-3c970e298d91" (UID: "58dc413c-d4d1-11e7-870d-3c970e298d91")
Nov 28 22:48:51 <my-node> kubelet[1010]: W1128 22:48:51.616762    1010 util.go:112] Warning: "/var/lib/kubelet/pods/58dc413c-d4d1-11e7-870d-3c970e298d91/volumes/kubernetes.io~nfs/nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw" is not a mountpoint, deleting
Nov 28 22:48:51 <my-node> kubelet[1010]: E1128 22:48:51.616828    1010 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/nfs/58dc413c-d4d1-11e7-870d-3c970e298d91-nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw\" (\"58dc413c-d4d1-11e7-870d-3c970e298d91\")" failed. No retries permitted until 2017-11-28 22:48:52.616806562 -0800 PST (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw" (UniqueName: "kubernetes.io/nfs/58dc413c-d4d1-11e7-870d-3c970e298d91-nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw") pod "58dc413c-d4d1-11e7-870d-3c970e298d91" (UID: "58dc413c-d4d1-11e7-870d-3c970e298d91") : remove /var/lib/kubelet/pods/58dc413c-d4d1-11e7-870d-3c970e298d91/volumes/kubernetes.io~nfs/nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw: directory not empty
Nov 28 22:48:51 <my-node> kubelet[1010]: W1128 22:48:51.673774    1010 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "<pod>": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "f58ab11527aef5133bdb320349fe14fd94211aa0d35a1da006aa003a78ce0653"

Kubelet is running as systemd unit (not in container) on Ubuntu 16.04.
As you can see, there was a mount to an NFS server, and kubelet tried to delete the mount directory because it considered the directory to be unmounted.

Volumes spec from the pod:

volumes:
  - name: nfs-mtkylje2oc4xlju1ls9rdwjlcmxhyi1ydw
    nfs:
      path: /<path>
      server: <IP>
  - name: default-token-rzqtt
    secret:
      defaultMode: 420
      secretName: default-token-rzqtt

Update: I ran into this problem on 1.6.6 before as well.

@sabbour


sabbour commented Nov 29, 2017

Experiencing the same on Azure..

NAME                        READY     STATUS        RESTARTS   AGE       IP             NODE
busybox2-7db6d5d795-fl6h9   0/1       Terminating   25         1d        10.200.1.136   worker-1
busybox3-69d4f5b66c-2lcs6   0/1       Terminating   26         1d        <none>         worker-2
busybox7-797cc644bc-n5sv2   0/1       Terminating   26         1d        <none>         worker-2
busybox8-c8f95d979-8lk27    0/1       Terminating   25         1d        10.200.1.137   worker-1
nginx-56ccc998dd-hvpng      0/1       Terminating   0          2h        <none>         worker-1
nginx-56ccc998dd-nnsvj      0/1       Terminating   0          2h        <none>         worker-2
nginx-56ccc998dd-rsrvq      0/1       Terminating   0          2h        <none>         worker-1

kubectl version

Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:46:41Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

describe pod nginx-56ccc998dd-nnsvj

Name:                      nginx-56ccc998dd-nnsvj
Namespace:                 default
Node:                      worker-2/10.240.0.22
Start Time:                Wed, 29 Nov 2017 13:33:39 +0400
Labels:                    pod-template-hash=1277755488
                           run=nginx
Annotations:               kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-56ccc998dd","uid":"614f71db-d4e8-11e7-9c45-000d3a25e3c0","...
Status:                    Terminating (expires Wed, 29 Nov 2017 15:13:44 +0400)
Termination Grace Period:  30s
IP:
Created By:                ReplicaSet/nginx-56ccc998dd
Controlled By:             ReplicaSet/nginx-56ccc998dd
Containers:
  nginx:
    Container ID:   containerd://d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07
    Image:          nginx:1.12
    Image ID:       docker.io/library/nginx@sha256:5269659b61c4f19a3528a9c22f9fa8f4003e186d6cb528d21e411578d1e16bdb
    Port:           <none>
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jm7h5 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-jm7h5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jm7h5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type    Reason   Age   From               Message
  ----    ------   ----  ----               -------
  Normal  Killing  41m   kubelet, worker-2  Killing container with id containerd://nginx:Need to kill Pod

sudo journalctl -u kubelet | grep "nginx-56ccc998dd-nnsvj"

Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.124779   64794 kubelet.go:1837] SyncLoop (ADD, "api"): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)"
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.160444   64794 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-jm7h5" (UniqueName: "kubernetes.io/secret/6171e2a7-d4e8-11e7-9c45-000d3a25e3c0-default-token-jm7h5") pod "nginx-56ccc998dd-nnsvj" (UID: "6171e2a7-d4e8-11e7-9c45-000d3a25e3c0")
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.261128   64794 reconciler.go:257] operationExecutor.MountVolume started for volume "default-token-jm7h5" (UniqueName: "kubernetes.io/secret/6171e2a7-d4e8-11e7-9c45-000d3a25e3c0-default-token-jm7h5") pod "nginx-56ccc998dd-nnsvj" (UID: "6171e2a7-d4e8-11e7-9c45-000d3a25e3c0")
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.286574   64794 operation_generator.go:484] MountVolume.SetUp succeeded for volume "default-token-jm7h5" (UniqueName: "kubernetes.io/secret/6171e2a7-d4e8-11e7-9c45-000d3a25e3c0-default-token-jm7h5") pod "nginx-56ccc998dd-nnsvj" (UID: "6171e2a7-d4e8-11e7-9c45-000d3a25e3c0")
Nov 29 09:33:39 worker-2 kubelet[64794]: I1129 09:33:39.431485   64794 kuberuntime_manager.go:370] No sandbox for pod "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)" can be found. Need to start a new one
Nov 29 09:33:42 worker-2 kubelet[64794]: I1129 09:33:42.449592   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerStarted", Data:"0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af"}
Nov 29 09:33:47 worker-2 kubelet[64794]: I1129 09:33:47.637988   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerStarted", Data:"d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07"}
Nov 29 11:13:14 worker-2 kubelet[64794]: I1129 11:13:14.468137   64794 kubelet.go:1853] SyncLoop (DELETE, "api"): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)"
Nov 29 11:13:14 worker-2 kubelet[64794]: E1129 11:13:14.711891   64794 kuberuntime_manager.go:840] PodSandboxStatus of sandbox "0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af" for pod "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)" error: rpc error: code = Unknown desc = failed to get task status for sandbox container "0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af": process id 0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af not found: not found
Nov 29 11:13:14 worker-2 kubelet[64794]: E1129 11:13:14.711933   64794 generic.go:241] PLEG: Ignoring events for pod nginx-56ccc998dd-nnsvj/default: rpc error: code = Unknown desc = failed to get task status for sandbox container "0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af": process id 0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af not found: not found
Nov 29 11:13:15 worker-2 kubelet[64794]: I1129 11:13:15.788179   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07"}
Nov 29 11:13:15 worker-2 kubelet[64794]: I1129 11:13:15.788221   64794 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af"}
Nov 29 11:46:45 worker-2 kubelet[42337]: I1129 11:46:45.384411   42337 kubelet.go:1837] SyncLoop (ADD, "api"): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0), kubernetes-dashboard-7486b894c6-2xmd5_kube-system(e55ca22c-d416-11e7-9c45-000d3a25e3c0), busybox3-69d4f5b66c-2lcs6_default(adb05024-d412-11e7-9c45-000d3a25e3c0), kube-dns-7797cb8758-zblzt_kube-system(e925cbec-d40b-11e7-9c45-000d3a25e3c0), busybox7-797cc644bc-n5sv2_default(b7135a8f-d412-11e7-9c45-000d3a25e3c0)"
Nov 29 11:46:45 worker-2 kubelet[42337]: I1129 11:46:45.387169   42337 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"d00709dfb00ed5ac99dcd092978e44fc018f44cca5229307c37d11c1a4fe3f07"}
Nov 29 11:46:45 worker-2 kubelet[42337]: I1129 11:46:45.387245   42337 kubelet.go:1871] SyncLoop (PLEG): "nginx-56ccc998dd-nnsvj_default(6171e2a7-d4e8-11e7-9c45-000d3a25e3c0)", event: &pleg.PodLifecycleEvent{ID:"6171e2a7-d4e8-11e7-9c45-000d3a25e3c0", Type:"ContainerDied", Data:"0f539a84b96814651bb199e91f71157bc90c6e0c26340001c3f1c9f7bd9165af"}

cat /etc/systemd/system/kubelet.service

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=cri-containerd.service
Requires=cri-containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet \
  --allow-privileged=true \
  --anonymous-auth=false \
  --authorization-mode=Webhook \
  --client-ca-file=/var/lib/kubernetes/ca.pem \
  --cluster-dns=10.32.0.10 \
  --cluster-domain=cluster.local \
  --container-runtime=remote \
  --container-runtime-endpoint=unix:///var/run/cri-containerd.sock \
  --image-pull-progress-deadline=2m \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --network-plugin=cni \
  --pod-cidr=10.200.2.0/24 \
  --register-node=true \
  --require-kubeconfig \
  --runtime-request-timeout=15m \
  --tls-cert-file=/var/lib/kubelet/worker-2.pem \
  --tls-private-key-file=/var/lib/kubelet/worker-2-key.pem \
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
@emparker


emparker commented Aug 4, 2018

This happened to me a couple of days ago. I gave up on deleting it and left the pod as it was. Today it had disappeared; it seems to have been deleted eventually.

@prein


prein commented Aug 7, 2018

Happened to me just now. The --force --now solution didn't work for me. I found the following line in kubelet logs suspicious

Aug 6 15:25:37 kube-minion-1 kubelet[2778]: W0806 15:25:37.986549 2778 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "backend-foos-227474871-gzhw0_default": Unexpected command output nsenter: cannot open : No such file or directory

That led me to the following issue:
openshift/origin#15802

I'm not on OpenShift but on OpenStack, so I thought it could be related. I gave the advice to restart Docker a shot.
Restarting Docker made the pods stuck in "Terminating" go away.
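
For completeness, the workaround is just restarting the Docker daemon on the affected node, roughly like this (note that it restarts every container on that node):

# on the node hosting the stuck pods
sudo systemctl restart docker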

@Rajczyk


Rajczyk commented Aug 21, 2018

I know this is only a workaround, but I'm done getting up at 3 AM to fix this.
Not saying you should use this, but it might help some people.

The sleep matches what my pods' terminationGracePeriodSeconds is set to (30 seconds). If a pod has been in Terminating longer than that, this CronJob deletes it with --force --grace-period=0 to kill it completely:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: stuckpod-restart
spec:
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 5
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: stuckpod-restart
            image: devth/helm:v2.9.1
            args:
            - /bin/sh
            - -c
            - echo "$(date) Job stuckpod-restart Starting"; kubectl get pods --all-namespaces=true | awk '$3=="Terminating" {print "sleep 30; echo "$(date) Killing pod $1"; kubectl delete pod " $1 " --grace-period=0 --force"}'; echo "$(date) Job stuckpod-restart Complete";
          restartPolicy: OnFailure
@erhudy

Contributor

erhudy commented Aug 21, 2018

I am seeing the same error with Kubernetes v1.10.2. Pods get stuck in terminating indefinitely and the kubelet on the node in question repeatedly logs:

Aug 21 13:25:55 node-09 kubelet[164855]: E0821 13:25:55.149132  
164855 nestedpendingoperations.go:267] 
Operation for "\"kubernetes.io/configmap/b838409a-a49e-11e8-bdf7-000f533063c0-configmap\" 
(\"b838409a-a49e-11e8-bdf7-000f533063c0\")" failed. No retries permitted until 2018-08-21 
13:27:57.149071465 +0000 UTC m=+1276998.311766147 (durationBeforeRetry 2m2s). Error: "error 
cleaning subPath mounts for volume \"configmap\" (UniqueName: 
\"kubernetes.io/configmap/b838409a-a49e-11e8-bdf7-000f533063c0-configmap\") pod 
\"b838409a-a49e-11e8-bdf7-000f533063c0\" (UID: \"b838409a-a49e-11e8-bdf7-000f533063c0\") 
: error deleting /var/lib/kubelet/pods/b838409a-a49e-11e8-bdf7-000f533063c0/volume-
subpaths/configmap/pod-master/2: remove /var/lib/kubelet/pods/b838409a-a49e-11e8-bdf7-
000f533063c0/volume-subpaths/configmap/pod-master/2: device or resource busy"

I can manually unmount the subpath volume in question without complaint (Linux does not tell me it is busy). This stops the kubelet from logging the error message. However, this does not inspire Kubernetes to continue cleanup, as the pod is still shown in terminating state. Routinely restarting Docker to clean this up is not really an acceptable solution because of the disruption it causes to running containers.
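
For reference, the manual unmount was along these lines (the pod UID and subpath index come from the log above):

# on the node; this clears the kubelet error but does not finish the pod cleanup by itself
sudo umount /var/lib/kubelet/pods/b838409a-a49e-11e8-bdf7-000f533063c0/volume-subpaths/configmap/pod-master/2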

Also of note: the container itself is gone from docker ps -a with no evidence that it ever existed, so I'm not sure this is actually a Docker issue. We are using Docker version 17.03.2-ce.

@erhudy

Contributor

erhudy commented Aug 21, 2018

An update: we had configured our nodes to redirect the kubelet root directory to a non-OS volume with a symlink (/var/lib/kubelet was a symlink pointing to another directory on a different volume). When I reconfigured things to pass --root-dir to the kubelet so that it went to the desired directory directly, rather than through a symlink, and restarted the kubelet, it cleaned up the volume mounts and cleared out the pods that were stuck terminating without requiring a Docker restart.
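
For anyone wanting to do the same: the change amounts to passing --root-dir so the kubelet uses the real data directory directly instead of following a symlinked /var/lib/kubelet, roughly like this (the target path is just an example):

# added to the kubelet's startup flags (example path)
--root-dir=/data/kubelet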

@walterdolce


walterdolce commented Sep 17, 2018

I experienced this issue today for the first time while running some pods locally on minikube.

I had a bunch of pods stuck in Terminating due to a configmap/secret mounted as a volume which was missing. None of the suggestions/workarounds/solutions posted above worked except this one.

One thing that I think is worth of notice is the following though:

  • When I ran kubectl get pods, I got the list of pods with the Terminating status.
  • When I ran docker ps | grep -i {{pod_name}} though, none of the pods in Terminating status as seen by kubectl get pods were running in the minikube VM.

I was expecting docker ps to return the containers for the pods stuck in the Terminating state, but in reality none of them were running, yet kubectl get pods was still returning data about them. Would anyone be able to explain why that is?

@bronger


bronger commented Sep 17, 2018

I experienced this issue with 4 deployments. Then I switched from “local volume” to “host path” for all mounts, and it is gone for me.

@SachinHg


SachinHg commented Sep 30, 2018

I just had the problem that the pods were not terminating because a secret was missing. After I created that secret in that namespace everything was back to normal.

How do you create a secret in the namespace if the namespace is in "Terminating" state?

@hixichen


hixichen commented Oct 11, 2018

kubectl delete --all pods --namespace=xxxxx --force --grace-period=0

works for me.

Do not forget about "--grace-period=0". It matters

@windoze


windoze commented Oct 13, 2018

kubectl warned me "warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely." when I used --force --grace-period=0.
Can anyone tell me whether that will really happen?

@zhangxiaoyu-zidif

Member

zhangxiaoyu-zidif commented Oct 14, 2018

@windoze


windoze commented Oct 16, 2018

Can you help confirm whether the pod will be deleted immediately?
Does that mean the warning message is actually inaccurate?

@jingxu97

Contributor

jingxu97 commented Oct 16, 2018

@windoze If you use the --force --grace-period=0 option, it means the pod API object will be deleted from the API server immediately. The node's kubelet is responsible for cleaning up volume mounts and killing containers. If the kubelet is not running, or has issues while cleaning up the pod, the container might still be running. But the kubelet should keep trying to clean up the pod whenever possible.
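
In practice that means that after a forced delete it is worth checking on the node that the containers really are gone, for example (names are placeholders):

kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
# then, on the node the pod was scheduled to:
docker ps | grep <pod-name>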

@windoze


windoze commented Oct 18, 2018

So that still means the deletion could take forever, because the kubelet could be malfunctioning?
Is there any way to make sure the pod is deleted?
I'm asking because I have some huge pods running in the cluster, and no node has enough memory to run two instances of them.
If the deletion fails, the node becomes unusable, and if this happens multiple times the service will be completely down, because eventually there will be no node left that can run this pod.

In a plain old Docker environment I could force-kill a container with kill -9 or the like, but k8s doesn't seem to have such a function.

@jingxu97

Contributor

jingxu97 commented Oct 18, 2018

@windoze Do you know why your pod deletions often failed? Is it because the kubelet was not running, or because the kubelet was trying to kill the container but failed with some error?

@windoze


windoze commented Oct 18, 2018

This situation happened several times on my cluster a few months ago: kubelet was running, but the Docker daemon seemed to have some trouble and got stuck with no error log.
My solution was to log in to the node, force-kill the container process, and restart the Docker daemon.
After some upgrades the issue went away and I never saw it again.

@shinebayar-g


shinebayar-g commented Oct 27, 2018

kubectl delete pods <podname> --force --grace-period=0 worked for me!

@agolomoodysaada


agolomoodysaada commented Nov 2, 2018

@shinebayar-g , the problem with --force is that it could mean that your container will keep running. It just tells Kubernetes to forget about this pod's containers. A better solution is to SSH into the VM running the pod and investigate what's going on with Docker. Try to manually kill the containers with docker kill and if successful, attempt to delete the pod normally again.
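
Roughly (pod, namespace and container names are placeholders):

kubectl get pod <pod-name> -n <namespace> -o wide    # the NODE column shows which VM to SSH into
ssh <node>
docker ps | grep <pod-name>                          # kubelet-managed containers are named k8s_...
docker kill <container-id>
kubectl delete pod <pod-name> -n <namespace>         # then retry the normal delete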

@shinebayar-g


shinebayar-g commented Nov 3, 2018

@agolomoodysaada Ah, that makes sense. Thanks for the explanation. So I wouldn't really know whether the actual container was deleted or not, right?

@sokoow


sokoow commented Nov 3, 2018

So, it's the end of 2018, kube 1.12 is out, and... you all still have problems with stuck pods?

@shangxdy


shangxdy commented Nov 5, 2018

I have the same issue; neither --force --grace-period=0 nor --force --now works. Here are the logs:

root@r15-c70-b03-master01:~# kubectl -n infra-lmat get pod node-exporter-zbfpx
NAME READY STATUS RESTARTS AGE
node-exporter-zbfpx 0/1 Terminating 0 4d

root@r15-c70-b03-master01:~# kubectl -n infra-lmat delete pod node-exporter-zbfpx --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "node-exporter-zbfpx" deleted

root@r15-c70-b03-master01:~# kubectl -n infra-lmat get pod node-exporter-zbfpx
NAME READY STATUS RESTARTS AGE
node-exporter-zbfpx 0/1 Terminating 0 4d

root@r15-c70-b03-master01:~# kubectl -n infra-lmat delete pod node-exporter-zbfpx --now --force
pod "node-exporter-zbfpx" deleted

root@r15-c70-b03-master01:~# kubectl -n infra-lmat get pod node-exporter-zbfpx
NAME READY STATUS RESTARTS AGE
node-exporter-zbfpx 0/1 Terminating 0 4d

root@r15-c70-b03-master01:~#

I tried to edit the pod and delete the finalizers section in metadata, but it also failed.
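
For what it's worth, the finalizer removal can also be attempted with a patch instead of kubectl edit; a sketch using the pod from the output above (it may fail for the same reason the edit did):

kubectl -n infra-lmat patch pod node-exporter-zbfpx -p '{"metadata":{"finalizers":null}}'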

@Benjamin-Dobell


Benjamin-Dobell commented Nov 14, 2018

I'm still seeing this in a 100% reproducible fashion (same resource definitions) with the kubectl 1.13 alpha and Docker for Desktop on macOS. By reproducible I mean that the only way to fix it seems to be to factory reset Docker for Mac, and when I set up my cluster again using the same resources (deployment script), the same clean-up script fails.

I'm not sure why it would be relevant but my clean-up script looks like:

#!/usr/bin/env bash
set -e

function usage() {
	echo "Usage: $0 <containers|envs|volumes|all>"
}

if [ "$1" = "--help" ] || [ "$1" = "-h" ] || [ "$1" = "help" ]; then
	echo "$(usage)"
	exit 0
fi

if [ $# -lt 1 ] || [ $# -gt 1 ]; then
	>&2 echo "$(usage)"
	exit 1
fi

MODE=$1

function join_with {
	local IFS="$1"
	shift
	echo "$*"
}

resources=()

if [ "$MODE" = "containers" ] || [ "$MODE" = "all" ]; then
	resources+=(daemonsets replicasets statefulsets services deployments pods rc)
fi

if [ "$MODE" = "envs" ] || [ "$MODE" = "all" ]; then
	resources+=(configmaps secrets)
fi

if [ "$MODE" = "volumes" ] || [ "$MODE" = "all" ]; then
	resources+=(persistentvolumeclaims persistentvolumes)
fi

kubectl delete $(join_with , "${resources[@]}") --all

Because the cluster runs locally, I can verify that there are no containers running in Docker; it's just kubectl that's getting hung up on terminating pods. When I describe the pods, the status is listed as Status: Terminating (lasts <invalid>).

@shinebayar-g


shinebayar-g commented Nov 14, 2018

Just happened to me once again. I was trying to install Percona pmm-server with an NFS share; the software didn't even come up, so I removed it, and this happened (the persistent claim wasn't working for this software). Guess I'm calling good old kubectl delete pods <podname> --force --grace-period=0 once again. But the question is: how do I find out which node this pod lives on?

@agolomoodysaada


agolomoodysaada commented Nov 14, 2018

@shinebayar-g , SSH into the VM it was on and run docker ps.

@shinebayar-g


shinebayar-g commented Nov 14, 2018

Well, it wasn't there. I have a few VMs, so I was asking how to find out which one is the right one. :)

@windoze


windoze commented Nov 23, 2018

@shinebayar-g this may work:
kubectl describe pod/some-pod-name | grep '^Node:'
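
Or, alternatively:
kubectl get pod some-pod-name -o wide    # the NODE column shows where the pod is scheduled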

@chestack

Contributor

chestack commented Dec 13, 2018

same issue.

docker ps shows that the container is in "Dead" status, not Exited (0) as expected.

@nielsole

Contributor

nielsole commented Dec 20, 2018

Manually deleting the container led to the following Docker log entry:

level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 

Unfortunately the line is cut off, but as far as I remember the problem was that the process was not there anymore.
