terminationGracePeriodSeconds greater than 10 minutes not working as expected #94435
Comments
/sig cluster-lifecycle
/remove-sig cluster-lifecycle
/remove-sig scheduling
@ahg-g: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
Here is the YAML for the production service that is facing this issue:
apiVersion: apps/v1
…
apiVersion: autoscaling/v2beta1
…
I tried to recreate this with a minimal config, but that seems to work fine. I also saw "Warning FailedKillPod" in my minimal setup, so that event is a red herring; sorry for that. Let me know how I can go about debugging why the pod gets killed 10 minutes after SIGTERM even though I've configured it to wait 3 hours.
Was the pod that got killed with a grace period of 10 minutes scaled down by HPA? I will test with HPA later.
Yes, with HPA it gets killed in 10 minutes, though I just noticed it does not happen consistently.
We did some digging. It looks like the nodes were getting scaled down by the cluster autoscaler, which allows pods a maximum of 10 minutes to shut down gracefully. From reading through the autoscaler docs, it sounds like there is no way to configure this limit on GKE.
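For reference: on self-managed clusters this 10-minute cap corresponds to the autoscaler's --max-graceful-termination-sec flag, which defaults to 600 seconds; on managed offerings like GKE the flag is not exposed. A rough sketch of where it would be raised in a self-managed cluster-autoscaler Deployment (image tag and cloud provider are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.19.0   # illustrative tag
        command:
        - ./cluster-autoscaler
        - --cloud-provider=gce   # illustrative; not configurable on GKE itself
        # default is 600 (10 minutes); raise it to cover the longest
        # terminationGracePeriodSeconds your workloads declare
        - --max-graceful-termination-sec=10800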
Not sure whether https://predictive-horizontal-pod-autoscaler.readthedocs.io/en/latest/user-guide/downscale-stabilization/ would help. HPA supports a cooldown delay. Note: when tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the replica count may keep thrashing.
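For the plain Horizontal Pod Autoscaler, that cooldown is the scale-down stabilization window. A minimal sketch, assuming the autoscaling/v2beta2 API (available since Kubernetes 1.18); names and thresholds are illustrative, with the target name borrowed from the events in the report below:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: jarvis        # illustrative, after the pod name in the events below
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jarvis
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      # require 30 minutes of sustained low load before removing replicas
      stabilizationWindowSeconds: 1800

This only delays the HPA's decision to scale down; it does not change how long a pod gets to terminate once eviction starts.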
@wingman-chakra @pacoxu I experienced the exact same issue in our clusters too. We operate on Azure AKS and run our automated builds in pods. One of our QA long-running tests was set to run for 3 hours, so I set the grace period for the pods to 3 hours (10800 seconds), but right at 1 hour and 7 or 8 minutes Kubernetes sends SIGTERM and graceful termination of the pod happens. In my case we are using the Selenium Chrome image (https://github.com/SeleniumHQ/docker-selenium), which uses a supervisord process as the main process that receives the SIGTERM issued by the container runtime (kubelet); despite the grace period I set, the process got terminated, and https://github.com/SeleniumHQ/docker-selenium/blob/1a3b0e1cd6d9eb3f2d3b91a5f26e160ab50fcd6b/Video/entry_point.sh is the script that runs at that time. We could just never run this for longer than 1 hour and 10 minutes at most. We also don't use HPA in our setup, so I assume even in the normal case pods can't run for more than an hour and some minutes? Is this the correct assumption? Appreciate your thoughts!
In our case, the issue was that cluster-autoscaler does not respect terminationGracePeriodSeconds. It will wait a maximum of 10 minutes before killing the pod.
@shapeofarchitect it seems that supervisord sends the kill signal to all of its subprocesses, and all processes are killed gracefully after 70 minutes. If you want more time, you may try adding a preStop hook with 'sleep infinity', so that the 3-hour termination grace period takes effect.
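A minimal sketch of that preStop suggestion (pod name is illustrative; the image follows the report above). The kubelet runs the preStop hook to completion before delivering SIGTERM to the container's main process, so the sleep holds off supervisord's shutdown cascade until the grace period is nearly exhausted:

apiVersion: v1
kind: Pod
metadata:
  name: selenium-video        # illustrative
spec:
  terminationGracePeriodSeconds: 10800   # 3 hours
  containers:
  - name: chrome
    image: selenium/standalone-chrome    # per the report above
    lifecycle:
      preStop:
        exec:
          # this hook blocks until the grace period runs out, delaying
          # the SIGTERM to PID 1 (supervisord) until the very end
          command: ["/bin/sh", "-c", "sleep infinity"]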
Unfortunately, in my case I can't directly use that approach. But I am still unsure why the API server/kubelet even sends SIGTERM; what if I just don't want the container to die at all? So I'd appreciate any other suggestions to move this forward.
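One option, if the goal is to keep the cluster autoscaler from evicting the pod at all: the autoscaler honors a safe-to-evict annotation that marks the pod as blocking node scale-down. A sketch with illustrative names and image:

apiVersion: v1
kind: Pod
metadata:
  name: long-running-build    # illustrative
  annotations:
    # tells cluster-autoscaler this pod must not be evicted, so the
    # node it runs on is skipped when scale-down candidates are picked
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
  - name: build
    image: busybox
    command: ["sleep", "10800"]

The trade-off is that the node hosting such a pod can never be scaled down while the pod is running.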
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
A pod with a termination grace period of 3 hours is getting killed 10 minutes after SIGTERM.
What you expected to happen:
I was expecting the pod to get the full 3 hours before SIGKILL is sent.
How to reproduce it (as minimally and precisely as possible):
Run a long-running process in a pod with a termination grace period greater than 20 minutes, then delete the pod; it will be killed after 10 minutes.
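A minimal repro along those lines might look like the sketch below (names and image are illustrative). Note that, per the discussion above, a plain kubectl delete may well honor the full grace period; the premature kill shows up when the cluster autoscaler drains the node:

apiVersion: v1
kind: Pod
metadata:
  name: grace-test            # illustrative
spec:
  terminationGracePeriodSeconds: 1800   # 30 minutes, well over the 10-minute cutoff
  containers:
  - name: sleeper
    image: busybox
    # ignore SIGTERM so the container exits only on SIGKILL; after
    # `kubectl delete pod grace-test`, time how long it survives
    command: ["/bin/sh", "-c", "trap '' TERM; while true; do sleep 1; done"]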
Anything else we need to know?:
From kubectl get events:
33m Normal ScaleDown pod/jarvis-6f9c9c79d6-d7vhr deleting pod for node scale down
33m Normal Killing pod/jarvis-6f9c9c79d6-d7vhr Stopping container jarvis
23m Warning FailedKillPod pod/jarvis-6f9c9c79d6-d7vhr error killing pod: failed to "KillContainer" for "jarvis" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.42", GitCommit:"42bef28c2031a74fc68840fce56834ff7ea08518", GitTreeState:"clean", BuildDate:"2020-06-02T16:07:00Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: GKE
OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Kernel (e.g. uname -a):
Linux chakradarraju-lenovo 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux