Pod is removed from store but the containers are not terminated #88613

Closed
ialidzhikov opened this issue Feb 27, 2020 · 40 comments
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@ialidzhikov
Contributor

What happened:
The Pod is removed from the store, but the associated containers can keep running on the Node for a very long time.

What you expected to happen:
I would expect consistent behaviour: when a Pod is removed from the store, the associated containers should be terminated.

How to reproduce it (as minimally and precisely as possible):

  1. Apply the following Pod:
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  activeDeadlineSeconds: 30
  containers:
  - command:
    - sh
    - -c
    - sleep 3600
    image: alpine:3.10.3
    imagePullPolicy: IfNotPresent
    name: alpine
  terminationGracePeriodSeconds: 600
  2. Ensure that after 30s (.spec.activeDeadlineSeconds) the Pod has .status.phase=Failed and .status.reason=DeadlineExceeded. Ensure that the container receives a SIGTERM signal at this point.

  3. Delete the Pod after it is DeadlineExceeded.

$ k delete po alpine
  4. Ensure that the deletion completes right away and the pod is removed from the store.

  5. Ensure that the associated containers continue to run on the Node until .spec.terminationGracePeriodSeconds has passed (see the verification commands after the docker ps output below).

/ # docker ps | grep alpine
f2fdf243db1a        alpine                                                                       "sh -c 'sleep 3600'"     3 minutes ago       Up 3 minutes                            k8s_alpine_alpine_default_c8aa37a1-d248-4831-a06f-e9ac4bac4a62_0
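
For reference, a condensed verification sequence; the outputs shown are illustrative, and on CRI runtimes other than dockershim, crictl ps replaces docker ps:

$ kubectl get pod alpine -o jsonpath='{.status.phase}/{.status.reason}{"\n"}'
Failed/DeadlineExceeded
$ kubectl delete pod alpine    # returns almost immediately
$ kubectl get pod alpine       # the Pod is already gone from the API server
Error from server (NotFound): pods "alpine" not found
$ docker ps | grep k8s_alpine  # on the node: the container keeps running until the grace period elapses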

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.15.10
$ k version

Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.10", GitCommit:"1bea6c00a7055edef03f1d4bb58b773fa8917f11", GitTreeState:"clean", BuildDate:"2020-02-11T20:05:26Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@ialidzhikov ialidzhikov added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 27, 2020
@ialidzhikov
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 27, 2020
@zanetworker
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 27, 2020
@fedebongio
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 27, 2020
@ialidzhikov
Contributor Author

Is there any update on this issue?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2020
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 3, 2020
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2020
@rfranzke
Contributor

Can we assign a SIG to look into this?

@ialidzhikov
Contributor Author

Well, the sig/node label is present. @dchen1107, @derekwaynecarr, can we get some attention on this issue? It has been open for quite a long time and there is no feedback on it at all.

/cc @kubernetes/sig-node-bugs

@k8s-ci-robot
Contributor

@ialidzhikov: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

Well, the sig/node label is present. @dchen1107, @derekwaynecarr, can we get some attention on this issue? It has been open for quite a long time and there is no feedback on it at all.

/cc @kubernetes/sig-node-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ialidzhikov
Contributor Author

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Oct 22, 2020
@zoeeer

zoeeer commented Nov 12, 2020

Hi! I'm having a similar problem here.

I have set "activeDeadlineSeconds" on a Pod spec. When the container runs longer than that time limit, the Pod's Status will turn to Failed with Reason: DeadlineExceeded. However, the container's Status remains Running and seems stay that state forever until I mannually delete the pod. (Actually I'm not sure if container is terminated after the pod is deleted. The cluster is managed by my cloud service provider.)

This happens on both standalone Pod or Job workloads. As with Job, when setting "activeDeadlineSeconds" in Job Spec it ends as expected: after the time limit, all pods in job will be marked as Failed and seems terminated correctly. But when I set "activeDeadlineSeconds" in Pod Spec (Job.spec.template.spec), the pod's container remains Running after the pod reached its tiem limit and turned Failed.

My pod setup:

apiVersion: v1 
kind: Pod   
metadata:
  name: pod-timeout-test 
spec:  
  activeDeadlineSeconds: 20         # Pod Timeout
  containers:
  - image: busybox
    name: container-0 
    resources:  
      limits:
        cpu: 250m
        memory: 1024Mi
      requests:
        cpu: 250m
        memory: 1024Mi
    command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 120']
  imagePullSecrets: 
  - name: imagepullsecret

Status of pod:

-> % kubectl describe pod pod-timeout-test
Name:         pod-timeout-test
Namespace:    ******
Node:         **************/
Start Time:   Thu, 12 Nov 2020 14:54:01 +0800
Labels:       sys_enterprise_project_id=0
              tenant.kubernetes.io/domain-id=0a1391c2008025c20ff2c007fafd9520
Annotations:  cri.cci.io/container-type: secure-container
              k8s.v1.cni.cncf.io/networks: [{"name":"********-default-network","interface":"eth0","network_plane":"default"}]
              kubernetes.io/availablezone: cn-east-3a
Status:       Failed
Reason:       DeadlineExceeded
Message:      Pod was active on the node longer than the specified deadline
IP:           192.168.3.70
Containers:
  container-0:
    Container ID:  docker://758addc7ac9505297f0028c9c4080d32ca80ba2b65433856d59c24e0806add46
    Image:         busybox
    Image ID:      docker-pullable://busybox@sha256:4fe8827f51a5e11bb83afa8227cbccb402df840d32c6b633b7ad079bc8144100
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      echo "Hello, Kubernetes!" && sleep 120
    State:          Running
      Started:      Thu, 12 Nov 2020 14:54:04 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     250m
      memory:  1Gi
    Requests:
      cpu:        250m
      memory:     1Gi
    Environment:  <none>
    Mounts:       <none>
Volumes:          <none>
QoS Class:        Guaranteed
Node-Selectors:   node.cci.io/default-cpu-choice=true
                  node.cci.io/flavor=general-computing
Tolerations:      node.cci.io/allowed-on-shared-node:NoSchedule
                  node.cci.io/occupied=default:NoSchedule
                  node.kubernetes.io/memory-pressure:NoSchedule
                  node.kubernetes.io/not-ready:NoExecute for 300s
                  node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason                 Age                From                                          Message
  ----    ------                 ----               ----                                          -------
  Normal  Scheduled              32s                volcano                                       Successfully assigned *********/pod-timeout-test to cneast3a-pod04-kc1-common001-cna014
  Normal  Pulling                29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Pulling image "busybox"
  Normal  Pulled                 29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Successfully pulled image "busybox"
  Normal  SuccessfulCreate       29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Created container container-0
  Normal  Started                29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Started container container-0
  Normal  SuccessfulMountVolume  28s (x2 over 32s)  kubelet, cneast3a-pod04-kc1-common001-cna014  Successfully mounted volumes for pod "pod-timeout-test_cci-minieye-algo-test(c72c5e3c-e5d0-4344-ae41-e5c0579c9c64)"
  Normal  Killing                12s                kubelet, cneast3a-pod04-kc1-common001-cna014  Stopping container container-0
  Normal  DeadlineExceeded       10s (x3 over 12s)  kubelet, cneast3a-pod04-kc1-common001-cna014  Pod was active on the node longer than the specified deadline

@zoeeer

zoeeer commented Nov 12, 2020

More info:
Once the pod turns Failed, I can no longer access the container's shell (via kubectl exec -it pod-timeout-test -- /bin/sh), even before the container's sleep time runs out. So it is really confusing what state the container has run into.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@gjkim42
Member

gjkim42 commented Jan 2, 2022

Reproduced at master (v1.24.0-alpha.1-233-g7c013c3f64d) with containerd.

By the way, is it valid behavior to use activeDeadlineSeconds at the pod level (not at the job level)?

/assign

@gjkim42
Member

gjkim42 commented Mar 16, 2022

The problem is that, after activeDeadlineSeconds, a pod goes into the PodFailed phase before all of its containers are killed.
Other parts of Kubernetes then assume that the containers of a pod in the PodFailed phase have already been killed (so they assume the pod can be deleted immediately).

for _, podSyncHandler := range kl.PodSyncHandlers {
	if result := podSyncHandler.ShouldEvict(pod); result.Evict {
		s.Phase = v1.PodFailed
		s.Reason = result.Reason
		s.Message = result.Message
		break
	}
}

https://kubernetes.io/docs/concepts/workloads/pods/_print/#pod-phase

Maybe we need to redefine the PodFailed phase or make the pod go into the PodFailed phase after all its containers have terminated.
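
A minimal sketch of the second option, assuming the phase computation can see the observed container statuses. The helpers decidePhase and allContainersTerminated are hypothetical illustrations, not the actual kubelet code or the change that eventually landed:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// allContainersTerminated is a hypothetical helper: it reports whether no
// container in the pod status is still running.
func allContainersTerminated(status *v1.PodStatus) bool {
	for _, cs := range status.ContainerStatuses {
		if cs.State.Running != nil {
			return false
		}
	}
	return true
}

// decidePhase sketches the idea: an evicted pod only reports PodFailed once its
// containers have actually stopped; until then it keeps its current phase, so
// the rest of the system does not treat it as deletable while containers run.
func decidePhase(current v1.PodPhase, evicted bool, status *v1.PodStatus) v1.PodPhase {
	if evicted && allContainersTerminated(status) {
		return v1.PodFailed
	}
	return current
}

func main() {
	// Container still running: the pod keeps its current phase.
	running := &v1.PodStatus{ContainerStatuses: []v1.ContainerStatus{
		{Name: "alpine", State: v1.ContainerState{Running: &v1.ContainerStateRunning{}}},
	}}
	fmt.Println(decidePhase(v1.PodRunning, true, running)) // Running

	// Container terminated: the pod may now report Failed.
	stopped := &v1.PodStatus{ContainerStatuses: []v1.ContainerStatus{
		{Name: "alpine", State: v1.ContainerState{Terminated: &v1.ContainerStateTerminated{ExitCode: 137}}},
	}}
	fmt.Println(decidePhase(v1.PodRunning, true, stopped)) // Failed
}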

@pacoxu
Member

pacoxu commented Mar 19, 2022

#98507 may fix this. @gjkim42 could you help review the fix?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 17, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 17, 2022
@gjkim42
Member

gjkim42 commented Jun 17, 2022

/unassign

(lack of resources...)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@sftim
Contributor

sftim commented Dec 21, 2022

/remove-area docker

We don't directly integrate with Docker any more.

@gjkim42
Member

gjkim42 commented Dec 28, 2022

#88613 (comment)

/assign

@gjkim42
Member

gjkim42 commented Dec 28, 2022

I confirmed that this issue is addressed on master.

I guess #108366 fixed this.
It prevents the illegal phase transition. (Failed phase with a running container is illegal.)

You can update Kubernetes to v1.24+ to address this issue.

Let me know if there is still an issue.

/close

@k8s-ci-robot
Contributor

@gjkim42: Closing this issue.

In response to this:

I confirmed that this issue is addressed on master.

I guess #108366 fixed this.
It prevents the illegal phase transition. (Failed phase with a running container is illegal.)

You can update Kubernetes to v1.24+ to address this issue.

Let me know if there is still an issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
