
Stop timeout isn't respected at shutdown/reboot #77873

Closed
mrunalp opened this issue May 14, 2019 · 23 comments · Fixed by opencontainers/runc#2224
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@mrunalp
Contributor

mrunalp commented May 14, 2019

What happened:
Containers are terminated by systemd on node reboot or shutdown without respecting the terminationGracePeriodSeconds set in the pod YAML.

What you expected to happen:
terminationGracePeriodSeconds should be respected by systemd when systemd is used as the cgroup manager.

How to reproduce it (as minimally and precisely as possible):

  1. Use systemd as the cgroup manager in your container runtime.
  2. Create a pod yaml with terminationGracePeriodSeconds set to 120 seconds.
  3. Reboot the node.
  4. You will notice that the containers receive SIGTERM, and then, after systemd's default stop timeout (typically 90 seconds), they are sent SIGKILL.
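
For reference, step 2 above corresponds to a minimal pod manifest like the following (the names and image are arbitrary placeholders; the trap makes the container slow to stop so the grace period actually comes into play):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: grace-period-demo
spec:
  terminationGracePeriodSeconds: 120
  containers:
  - name: app
    image: busybox
    # Catch SIGTERM and linger, so shutdown must wait out the grace period.
    command: ["sh", "-c", "trap 'sleep 120' TERM; sleep infinity"]
```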

Anything else we need to know?:
This can be fixed by passing the stop timeout to the container runtime as part of the CreateContainer CRI API. That would let runtimes set a systemd property on the container's scope, overriding the default stop timeout with the value from terminationGracePeriodSeconds.
This needs changes across the stack, as runc doesn't currently provide a way to set TimeoutStopUSec for the systemd scope created for a container.
The behavior with the cgroupfs cgroup manager will need further investigation.
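
Systemd's TimeoutStopUSec property is expressed in microseconds, so whichever component ends up setting it needs a seconds-to-microseconds translation of the pod's grace period. A minimal sketch of that step (the function name below is made up for illustration; it is not an existing runc or CRI-O API):

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriodToTimeoutStopUSec converts a pod's terminationGracePeriodSeconds
// into the microsecond value that systemd's TimeoutStopUSec property expects.
// (Hypothetical helper, for illustration only.)
func gracePeriodToTimeoutStopUSec(gracePeriodSeconds int64) uint64 {
	return uint64(time.Duration(gracePeriodSeconds) * time.Second / time.Microsecond)
}

func main() {
	// The 120s grace period from the reproduction above.
	fmt.Println(gracePeriodToTimeoutStopUSec(120)) // prints 120000000
}
```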

Environment:

  • Kubernetes version (use kubectl version): All versions.
@mrunalp mrunalp added the kind/bug Categorizes issue or PR as related to a bug. label May 14, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 14, 2019
@mrunalp
Contributor Author

mrunalp commented May 14, 2019

cc: @derekwaynecarr @dchen1107

@derekwaynecarr
Member

While this impacts all pods, it is particularly an issue for static pods and DaemonSet-backed pods, which typically are not drained before a maintenance action.

/milestone v1.15

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 29, 2019
@yujuhong
Contributor

@derekwaynecarr IIUC, this is a feature request, since kubernetes has never supported this. Is there a reason why this is marked milestone 1.15 when we are already past enhancement freeze and almost reaching code freeze?

@Random-Liu
Member

Random-Liu commented May 30, 2019

I'm not sure whether we want the user configured grace period to block node reboot.

Do people only reboot node after draining pods?

If not, say I have a pod with a 10min grace period: will it block the node from rebooting for 10min? Is that fair to other pods with a 10s grace period? They may see 10min of downtime or be rescheduled to other nodes unnecessarily.

@mrunalp
Contributor Author

mrunalp commented May 30, 2019 via email

@Random-Liu
Member

Random-Liu commented May 30, 2019

We can restrict it similar to how privileged is restricted. This is useful for daemon sets and static pods that are system owned.

If that is the case, it seems that we need to define a Kubernetes API for it. The kubelet doesn't know whether a pod belongs to a DaemonSet or not.

BTW, maybe it does make sense to block node reboot for that long; I'm not sure either. Just thinking out loud.

@mattjmcnaughton
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 30, 2019
@soggiest

soggiest commented Jun 7, 2019

Is this issue release blocking or can we move it to 1.16?

@liggitt
Member

liggitt commented Jun 8, 2019

If this is not a regression in 1.15, I'd expect it can be moved

@soggiest

soggiest commented Jun 9, 2019

/milestone v1.16

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.15, v1.16 Jun 9, 2019
@josiahbjorgaard
Contributor

Hi all, code freeze for v1.16 is coming up on Aug. 29 (in just 6 days). This issue is set for milestone v1.16; will it be solved by then? It needs a PR merged beforehand. If not, we will remove the v1.16 milestone.

@liggitt
Member

liggitt commented Sep 3, 2019

I'm not aware of any activity on this issue targeting 1.16

@liggitt liggitt removed this from the v1.16 milestone Sep 3, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2019
@dtolmachev

@mrunalp
I've found this issue while trying to fix the same problem described in the "How to reproduce it" section.

We run a static pod with a Docker container; kubelet and containerd also run on the VM.

The problem: I want a graceful shutdown of 60 seconds for the container, but currently we get only 10 seconds. That's because Docker receives SIGTERM and interprets it as docker stop with the default 10s timeout.
It is possible to set --stop-timeout for docker run, but we run the container through kubelet, and terminationGracePeriodSeconds in pod.yaml doesn't change the shutdown duration.
The pull-request in containerd project is still open.

Can I help finish this PR? What work needs to be done?

Thanks a lot for supporting Kubernetes and containerd!

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 5, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrunalp
Contributor Author

mrunalp commented Feb 24, 2020

Can this be reopened? I am opening a PR to address this.

@dims
Member

dims commented Feb 24, 2020

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Feb 24, 2020
@k8s-ci-robot
Contributor

@dims: Reopened this issue.

In response to this:

/reopen


@mrunalp
Contributor Author

mrunalp commented Feb 24, 2020

@dims Thanks :)

kolyshkin added a commit to kolyshkin/cri-o that referenced this issue Feb 25, 2020
Enable passing of sandbox's termination grace period
down to OCI runtime, as an annotation for systemd.

This is a glue between
* kubernetes/kubernetes#88495
and
* opencontainers/runc#2224
  (or containers/crun#266)
that is a part of the fix for
* kubernetes/kubernetes#77873

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/cri-o that referenced this issue Feb 26, 2020
Enable passing of kubernetes termination grace period
down to OCI runtime, as an annotation for systemd.

This builds on top of

* opencontainers/runc#2224
  (or containers/crun#266)

and is part of the fix for

* kubernetes/kubernetes#77873

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/cri-o that referenced this issue Mar 7, 2020
Enable passing of kubernetes termination grace period
down to OCI runtime, as an annotation for systemd.

This builds on top of

* opencontainers/runc#2224
  (or containers/crun#266)

and is part of the fix for

* kubernetes/kubernetes#77873

(cherry picked from commit 1f85692)

Conflicts: a minor conflict in server/container_create_linux.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


kolyshkin added a commit to kolyshkin/cri-o that referenced this issue Apr 22, 2020
Enable passing of kubernetes termination grace period
down to OCI runtime, as an annotation for systemd.

This builds on top of

* opencontainers/runc#2224
  (or containers/crun#266)

and is part of the fix for

* kubernetes/kubernetes#77873

(cherry picked from commit 1f85692)

Conflicts: a minor conflict in server/container_create_linux.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/cri-o that referenced this issue Jul 30, 2020
Enable passing of kubernetes termination grace period
down to OCI runtime, as an annotation for systemd.

This builds on top of

* opencontainers/runc#2224
  (or containers/crun#266)

and is part of the fix for

* kubernetes/kubernetes#77873

(cherry picked from commit 1f85692)

Conflicts: a minor conflict in server/container_create_linux.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>