Pod is removed from store but the containers are not terminated #88613

Closed
ialidzhikov opened this issue Feb 27, 2020 · 40 comments
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@ialidzhikov
Contributor

What happened:
The Pod is removed from the store, but the associated containers can keep running on the Node for a very long time.

What you expected to happen:
I would expect consistent behaviour: when a Pod is removed from the store, the associated containers should be terminated.

How to reproduce it (as minimally and precisely as possible):

  1. Apply the following Pod:
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  activeDeadlineSeconds: 30
  containers:
  - command:
    - sh
    - -c
    - sleep 3600
    image: alpine:3.10.3
    imagePullPolicy: IfNotPresent
    name: alpine
  terminationGracePeriodSeconds: 600
  2. Ensure that after 30s (.spec.activeDeadlineSeconds) the Pod has .status.phase=Failed and .status.reason=DeadlineExceeded. Ensure that the container receives a SIGTERM signal at this point.

  3. Delete the Pod after it is DeadlineExceeded.

$ k delete po alpine
  4. Ensure that the deletion completes right away and the pod is removed from the store.

  5. Ensure that the associated containers continue to run on the Node until .spec.terminationGracePeriodSeconds has passed (see the verification commands after the docker ps output below).

/ # docker ps | grep alpine
f2fdf243db1a        alpine                                                                       "sh -c 'sleep 3600'"     3 minutes ago       Up 3 minutes                            k8s_alpine_alpine_default_c8aa37a1-d248-4831-a06f-e9ac4bac4a62_0
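
For reference, a condensed verification sequence; the outputs shown are illustrative, and on CRI runtimes other than dockershim, crictl ps replaces docker ps:

$ kubectl get pod alpine -o jsonpath='{.status.phase}/{.status.reason}{"\n"}'
Failed/DeadlineExceeded
$ kubectl delete pod alpine    # returns almost immediately
$ kubectl get pod alpine       # the Pod is already gone from the API server
Error from server (NotFound): pods "alpine" not found
$ docker ps | grep k8s_alpine  # on the node: the container keeps running until the grace period elapses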

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.15.10
$ k version

Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.10", GitCommit:"1bea6c00a7055edef03f1d4bb58b773fa8917f11", GitTreeState:"clean", BuildDate:"2020-02-11T20:05:26Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@ialidzhikov ialidzhikov added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 27, 2020
@ialidzhikov
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 27, 2020
@zanetworker
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 27, 2020
@fedebongio
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 27, 2020
@ialidzhikov
Contributor Author

Is there any update on this issue?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2020
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 3, 2020
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2020
@rfranzke
Contributor

Can we assign a SIG to look into this?

@ialidzhikov
Contributor Author

Well, the sig/node label is present. @dchen1107, @derekwaynecarr, can we get some attention on this issue? It has been open for quite a long time and there is no feedback on it at all.

/cc @kubernetes/sig-node-bugs

@k8s-ci-robot
Contributor

@ialidzhikov: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

Well, the sig/node label is present. @dchen1107, @derekwaynecarr, can we get some attention on this issue? It has been open for quite a long time and there is no feedback on it at all.

/cc @kubernetes/sig-node-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ialidzhikov
Contributor Author

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Oct 22, 2020
@zoeeer

zoeeer commented Nov 12, 2020

Hi! I'm having a similar problem here.

I have set "activeDeadlineSeconds" on a Pod spec. When the container runs longer than that time limit, the Pod's Status will turn to Failed with Reason: DeadlineExceeded. However, the container's Status remains Running and seems stay that state forever until I mannually delete the pod. (Actually I'm not sure if container is terminated after the pod is deleted. The cluster is managed by my cloud service provider.)

This happens on both standalone Pod or Job workloads. As with Job, when setting "activeDeadlineSeconds" in Job Spec it ends as expected: after the time limit, all pods in job will be marked as Failed and seems terminated correctly. But when I set "activeDeadlineSeconds" in Pod Spec (Job.spec.template.spec), the pod's container remains Running after the pod reached its tiem limit and turned Failed.

My pod setup:

apiVersion: v1 
kind: Pod   
metadata:
  name: pod-timeout-test 
spec:  
  activeDeadlineSeconds: 20         # Pod Timeout
  containers:
  - image: busybox
    name: container-0 
    resources:  
      limits:
        cpu: 250m
        memory: 1024Mi
      requests:
        cpu: 250m
        memory: 1024Mi
    command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 120']
  imagePullSecrets: 
  - name: imagepullsecret

Status of pod:

-> % kubectl describe pod pod-timeout-test
Name:         pod-timeout-test
Namespace:    ******
Node:         **************/
Start Time:   Thu, 12 Nov 2020 14:54:01 +0800
Labels:       sys_enterprise_project_id=0
              tenant.kubernetes.io/domain-id=0a1391c2008025c20ff2c007fafd9520
Annotations:  cri.cci.io/container-type: secure-container
              k8s.v1.cni.cncf.io/networks: [{"name":"********-default-network","interface":"eth0","network_plane":"default"}]
              kubernetes.io/availablezone: cn-east-3a
Status:       Failed
Reason:       DeadlineExceeded
Message:      Pod was active on the node longer than the specified deadline
IP:           192.168.3.70
Containers:
  container-0:
    Container ID:  docker://758addc7ac9505297f0028c9c4080d32ca80ba2b65433856d59c24e0806add46
    Image:         busybox
    Image ID:      docker-pullable://busybox@sha256:4fe8827f51a5e11bb83afa8227cbccb402df840d32c6b633b7ad079bc8144100
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      echo "Hello, Kubernetes!" && sleep 120
    State:          Running
      Started:      Thu, 12 Nov 2020 14:54:04 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     250m
      memory:  1Gi
    Requests:
      cpu:        250m
      memory:     1Gi
    Environment:  <none>
    Mounts:       <none>
Volumes:          <none>
QoS Class:        Guaranteed
Node-Selectors:   node.cci.io/default-cpu-choice=true
                  node.cci.io/flavor=general-computing
Tolerations:      node.cci.io/allowed-on-shared-node:NoSchedule
                  node.cci.io/occupied=default:NoSchedule
                  node.kubernetes.io/memory-pressure:NoSchedule
                  node.kubernetes.io/not-ready:NoExecute for 300s
                  node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason                 Age                From                                          Message
  ----    ------                 ----               ----                                          -------
  Normal  Scheduled              32s                volcano                                       Successfully assigned *********/pod-timeout-test to cneast3a-pod04-kc1-common001-cna014
  Normal  Pulling                29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Pulling image "busybox"
  Normal  Pulled                 29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Successfully pulled image "busybox"
  Normal  SuccessfulCreate       29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Created container container-0
  Normal  Started                29s                kubelet, cneast3a-pod04-kc1-common001-cna014  Started container container-0
  Normal  SuccessfulMountVolume  28s (x2 over 32s)  kubelet, cneast3a-pod04-kc1-common001-cna014  Successfully mounted volumes for pod "pod-timeout-test_cci-minieye-algo-test(c72c5e3c-e5d0-4344-ae41-e5c0579c9c64)"
  Normal  Killing                12s                kubelet, cneast3a-pod04-kc1-common001-cna014  Stopping container container-0
  Normal  DeadlineExceeded       10s (x3 over 12s)  kubelet, cneast3a-pod04-kc1-common001-cna014  Pod was active on the node longer than the specified deadline

@zoeeer

zoeeer commented Nov 12, 2020

More info:
Once the pod turns Failed, I can no longer access the container's shell (via kubectl exec -it pod-timeout-test -- /bin/sh), even before the container's sleep time runs out. So it is really confusing what state the container has run into.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@gjkim42
Member

gjkim42 commented Jan 2, 2022

Reproduced at master (v1.24.0-alpha.1-233-g7c013c3f64d) with containerd.

By the way, is it valid behavior to use activeDeadlineSeconds at the pod level (not at the job level)?

/assign

@gjkim42
Member

gjkim42 commented Mar 16, 2022

The problem is that, after activeDeadlineSeconds, a pod goes into the PodFailed phase before all of its containers are killed.
Other parts of Kubernetes then assume that the containers of a pod in the PodFailed phase have already been killed (so they assume the pod can be deleted immediately).

for _, podSyncHandler := range kl.PodSyncHandlers {
	if result := podSyncHandler.ShouldEvict(pod); result.Evict {
		s.Phase = v1.PodFailed
		s.Reason = result.Reason
		s.Message = result.Message
		break
	}
}

https://kubernetes.io/docs/concepts/workloads/pods/_print/#pod-phase

Maybe we need to redefine the PodFailed phase or make the pod go into the PodFailed phase after all its containers have terminated.
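
A minimal sketch of the second option, assuming the phase computation can see the observed container statuses. The helpers decidePhase and allContainersTerminated are hypothetical illustrations, not the actual kubelet code or the change that eventually landed:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// allContainersTerminated is a hypothetical helper: it reports whether no
// container in the pod status is still running.
func allContainersTerminated(status *v1.PodStatus) bool {
	for _, cs := range status.ContainerStatuses {
		if cs.State.Running != nil {
			return false
		}
	}
	return true
}

// decidePhase sketches the idea: an evicted pod only reports PodFailed once its
// containers have actually stopped; until then it keeps its current phase, so
// the rest of the system does not treat it as deletable while containers run.
func decidePhase(current v1.PodPhase, evicted bool, status *v1.PodStatus) v1.PodPhase {
	if evicted && allContainersTerminated(status) {
		return v1.PodFailed
	}
	return current
}

func main() {
	// Container still running: the pod keeps its current phase.
	running := &v1.PodStatus{ContainerStatuses: []v1.ContainerStatus{
		{Name: "alpine", State: v1.ContainerState{Running: &v1.ContainerStateRunning{}}},
	}}
	fmt.Println(decidePhase(v1.PodRunning, true, running)) // Running

	// Container terminated: the pod may now report Failed.
	stopped := &v1.PodStatus{ContainerStatuses: []v1.ContainerStatus{
		{Name: "alpine", State: v1.ContainerState{Terminated: &v1.ContainerStateTerminated{ExitCode: 137}}},
	}}
	fmt.Println(decidePhase(v1.PodRunning, true, stopped)) // Failed
}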

@pacoxu
Member

pacoxu commented Mar 19, 2022

#98507 may fix this. @gjkim42 could you help review the fix?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 17, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 17, 2022
@gjkim42
Member

gjkim42 commented Jun 17, 2022

/unassign

(lack of resources...)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@sftim
Contributor

sftim commented Dec 21, 2022

/remove-area docker

We don't directly integrate with Docker any more.

@gjkim42
Member

gjkim42 commented Dec 28, 2022

#88613 (comment)

/assign

@gjkim42
Member

gjkim42 commented Dec 28, 2022

I confirmed that this issue is addressed on master.

I guess #108366 fixed this.
It prevents the illegal phase transition. (Failed phase with a running container is illegal.)

You can update Kubernetes to v1.24+ to address this issue.

Let me know if there is still an issue.

/close

@k8s-ci-robot
Contributor

@gjkim42: Closing this issue.

In response to this:

I confirmed that this issue is addressed on master.

I guess #108366 fixed this.
It prevents the illegal phase transition. (Failed phase with a running container is illegal.)

You can update Kubernetes to v1.24+ to address this issue.

Let me know if there is still an issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
