
terminationGracePeriodSeconds greater than 10 minutes not working as expected #94435

Closed
wingman-chakra opened this issue Sep 2, 2020 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@wingman-chakra

What happened:
Pod with termination grace period of 3 hours is getting killed 10 minutes after SIGTERM

What you expected to happen:
I was expecting the pod to get the full 3 hours before SIGKILL is sent.

How to reproduce it (as minimally and precisely as possible):
Run a long-running process with a termination grace period greater than 20 minutes, then delete the pod; it gets killed after 10 minutes.

Anything else we need to know?:
From kubectl get events:
33m Normal ScaleDown pod/jarvis-6f9c9c79d6-d7vhr deleting pod for node scale down
33m Normal Killing pod/jarvis-6f9c9c79d6-d7vhr Stopping container jarvis
23m Warning FailedKillPod pod/jarvis-6f9c9c79d6-d7vhr error killing pod: failed to "KillContainer" for "jarvis" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.42", GitCommit:"42bef28c2031a74fc68840fce56834ff7ea08518", GitTreeState:"clean", BuildDate:"2020-06-02T16:07:00Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    GKE
  • OS (e.g: cat /etc/os-release):
    NAME="Ubuntu"
    VERSION="18.04.5 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.5 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
    Linux chakradarraju-lenovo 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@wingman-chakra wingman-chakra added the kind/bug Categorizes issue or PR as related to a bug. label Sep 2, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 2, 2020
@wingman-chakra
Author

/sig cluster-lifecycle

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2020
@neolit123
Member

/remove-sig cluster-lifecycle
/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Sep 2, 2020
@ahg-g
Member

ahg-g commented Sep 3, 2020

/remove-sig scheduling
/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 3, 2020
@k8s-ci-robot
Contributor

@ahg-g: The label(s) sig/ cannot be applied, because the repository doesn't have them

In response to this:

/remove-sig scheduling
/sig node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Sep 3, 2020
@ahg-g
Member

ahg-g commented Sep 3, 2020

/sig node

@pacoxu
Member

pacoxu commented Sep 3, 2020

Can you provide the whole pod yaml?

My pod with terminationGracePeriodSeconds: 1800 (30 minutes) is still terminating after 24 minutes.

Also, please check whether your pod is being evicted or deleted by another process. Can you test a pod (command: sleep 3600 with terminationGracePeriodSeconds: 1800)? Is it related to your image (command)?
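For reference, a minimal manifest for the test described above might look like this (the pod and container names and the busybox image are illustrative, not from this thread):

```yaml
# Minimal sketch: a pod that just sleeps, with a 30-minute grace period.
# If this pod also dies ~10 minutes after deletion, the problem is not the
# application image; if it gets the full period, the image/command is suspect.
apiVersion: v1
kind: Pod
metadata:
  name: grace-period-test   # illustrative name
spec:
  terminationGracePeriodSeconds: 1800
  containers:
  - name: sleeper           # illustrative name
    image: busybox
    command: ["sleep", "3600"]
```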

@wingman-chakra
Author

Here is the YAML for the production service that is hitting this issue:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: jarvis
  labels:
    name: jarvis
spec:
  type: ClusterIP
  sessionAffinity: None
  ports:
  - name: grpc
    protocol: TCP
    port: 50051
    targetPort: 50051
  - name: http
    port: 6653
  selector:
    app: jarvis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jarvis
spec:
  progressDeadlineSeconds: 10800
  replicas: 3
  selector:
    matchLabels:
      app: jarvis
  template:
    metadata:
      name: jarvis
      labels:
        app: jarvis
    spec:
      nodeSelector:
        pool: default
      terminationGracePeriodSeconds: 10800
      containers:
      - name: jarvis
        image: <insert_python_long_running_image>
        imagePullPolicy: Always
        env:
        - name: ENV
          value: k8s-prod
        resources:
          requests:
            cpu: "900m"
            memory: "4000Mi"
          limits:
            cpu: "1"
            memory: "5000Mi"
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: jarvis
spec:
  minReplicas: 3
  maxReplicas: 25
  metrics:
  - type: External
    external:
      metricName: kubernetes.io|node|cpu|total_cores
      metricSelector:
        matchLabels:
          metadata.user_labels.pool: large-recorder
          resource.labels.cluster_name: prod
      targetAverageValue: "12"
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jarvis
```

I tried to recreate this with minimal config, but this seems to work fine:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kill-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kill-test
  template:
    metadata:
      name: kill-test
      labels:
        app: kill-test
    spec:
      nodeSelector:
        pool: default
      containers:
      - name: kill-test
        image: node
        command:
        - node
        - -e
        - "process.on('SIGTERM', () => { console.log('caught'); setTimeout(() => {console.log('Done'); process.exit(0);}, 11 * 60 * 1000); }); setInterval(() => console.log('beep'), 60 * 1000);"
      terminationGracePeriodSeconds: 1800
```

I did see "Warning FailedKillPod" in my minimal setup too, so that is a red herring; sorry for that.

Let me know how I can go about debugging why the pod is getting killed 10 minutes after SIGTERM, even though I've configured it to wait 3 hours.

@pacoxu
Member

pacoxu commented Sep 3, 2020

Was the pod killed by the HPA with a grace period of 10 minutes?
@wingman-chakra

I will test with HPA later.

@wingman-chakra
Author

Yes, with HPA it gets killed in 10 minutes. I just noticed it does not happen consistently.

@srikaratstrings

We did some digging. It looks like the nodes were being removed by the cluster autoscaler, which only allows a maximum of 10 minutes for pods to shut down gracefully. From reading through, it sounds like there is no way to configure this limit.
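(For readers hitting the same limit: the open-source cluster-autoscaler exposes this timeout as the --max-graceful-termination-sec flag, which defaults to 600 seconds; GKE's built-in autoscaler is managed and does not let you set it, which matches the observation above. The sketch below assumes a self-managed autoscaler Deployment; the image tag and surrounding fields are illustrative.)

```yaml
# Hypothetical args snippet for a self-managed cluster-autoscaler Deployment.
spec:
  containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.19.0  # illustrative tag
    command:
    - ./cluster-autoscaler
    - --cloud-provider=gce
    - --max-graceful-termination-sec=10800   # default is 600 (10 minutes)
```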

@pacoxu
Member

pacoxu commented Sep 4, 2020

Not sure whether https://predictive-horizontal-pod-autoscaler.readthedocs.io/en/latest/user-guide/downscale-stabilization/ would help.

HPA supports a cooldown delay:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-cooldown-delay

Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
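(A hedged illustration of the downscale stabilization the linked docs describe, not taken from this thread: on clusters that offer autoscaling/v2beta2 it can be set per-HPA via the behavior field; on older clusters it is the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization. Values below are examples only.)

```yaml
# Example only: delay scale-down decisions by 10 minutes for this HPA.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: jarvis
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jarvis
  minReplicas: 3
  maxReplicas: 25
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes before scaling down
```

Note that this only delays the scale-down decision; it does not extend the grace period once a pod is actually being deleted.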

@shapeofarchitect

shapeofarchitect commented Oct 9, 2020

@wingman-chakra @pacoxu I experienced the exact same issue in our clusters too. We run on Azure AKS and run our automated builds in pods. One of our QA long-running tests was set to run for 3 hours, so I set the pods' grace period to 3 hours (10800), but at roughly 1 hour and 7 or 8 minutes Kubernetes sends SIGTERM and the pod is gracefully terminated.

In my case we are using the Selenium Chrome image from https://github.com/SeleniumHQ/docker-selenium. It runs supervisord as the main process, which receives the SIGTERM issued by the container runtime (kubelet). Despite the grace period I set, the process gets terminated; below is the script that runs at that time.

https://github.com/SeleniumHQ/docker-selenium/blob/1a3b0e1cd6d9eb3f2d3b91a5f26e160ab50fcd6b/Video/entry_point.sh

We could never run this for longer than 1 hour and 10 minutes at most either. We also don't use HPA in our setup, so I assume that even in the normal case pods can't run for more than an hour and some minutes? Is that the correct assumption here?

Appreciate your thoughts!

13:24:39.860 INFO [ActiveSessionFactory.lambda$apply$11] - Matched factory org.openqa.selenium.grid.session.remote.ServicedSession$Factory (provider: org.openqa.selenium.chrome.ChromeDriverService)
Starting ChromeDriver 85.0.4183.83 (94abc2237ae0c9a4cb5f035431c8adfb94324633-refs/branch-heads/4183@{#1658}) on port 12330
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
[1602249879.880][SEVERE]: bind() failed: Cannot assign requested address (99)
ChromeDriver was started successfully.
13:24:40.501 INFO [ProtocolHandshake.createSession] - Detected dialect: W3C
13:24:40.503 INFO [RemoteSession$Factory.lambda$performHandshake$0] - Started new session 090f50045c92aa047ff05f3bc26bcebc (org.openqa.selenium.chrome.ChromeDriverService)
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2020-10-09 13:25:03,568 WARN received SIGTERM indicating exit request
2020-10-09 13:25:03,569 INFO waiting for xvfb, selenium-standalone to die
2020-10-09 13:25:03,570 INFO stopped: selenium-standalone (terminated by SIGTERM)
2020-10-09 13:25:03,570 INFO stopped: xvfb (terminated by SIGTERM)
Shutdown complete

@srikaratstrings

srikaratstrings commented Oct 10, 2020 via email

@pacoxu
Member

pacoxu commented Oct 10, 2020

@shapeofarchitect It seems that supervisord sends the kill signal to all its subprocesses, and all processes are killed gracefully after 70 minutes.

If you want more time, you could try adding a preStop hook with 'sleep infinity' so that the 3-hour termination grace period takes effect.
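A sketch of that suggestion (the container name and image placeholder come from the earlier manifest; the hook itself is not in the original thread and assumes the image has a shell and a sleep that accepts "infinity"):

```yaml
# preStop runs before SIGTERM is delivered to the main process; an effectively
# endless sleep makes the kubelet wait out the full terminationGracePeriodSeconds
# before killing the container.
spec:
  terminationGracePeriodSeconds: 10800
  containers:
  - name: jarvis
    image: <insert_python_long_running_image>
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep infinity"]
```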

@shapeofarchitect

shapeofarchitect commented Oct 10, 2020

Unfortunately, in my case I can't directly use a preStop hook, as I have to use this as a services container in our GitLab pipelines. It also seems that I can't edit the pod YAML of the running container to add lifecycle hooks.

But I am still unsure why the API server/kubelet even sends SIGTERM. What if I just don't want the container to die at all?

So I'd appreciate any other suggestions to move this forward.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 7, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
