
[Proposal] Improved graceful shutdown (zero downtime) #18914

Closed
ReToCode opened this issue Mar 9, 2018 · 13 comments
ReToCode (Member) commented Mar 9, 2018

Problem / Motivation

We operate more than 3500 containers on a large OpenShift cluster. A lot of applications have the same problem with the current termination process. To achieve zero downtime during rolling updates, pod restarts, and node evacuations for maintenance, an application has to do the following:

  • The pod has to be killed due to one of the events mentioned above
  • Kubernetes/OpenShift sends a SIGTERM signal
  • The application has to catch the SIGTERM signal
  • The application has to set its readiness probe to false to stop getting new traffic
  • The application has to wait until no new connections are sent via the service
  • In OpenShift the application also has to wait until the HA-Proxy is reloaded so it no longer receives traffic from there. The HA-Proxy also has its own health check, which is not in sync with the readiness state of Kubernetes; to signal HA-Proxy to stop sending traffic, the HTTP listener port has to be closed.
  • The application has to finish its active requests (within the terminationGracePeriodSeconds) and then terminate itself

At @sbb we implemented this behaviour for Spring Boot 1 & 2 with this extension library:
https://github.com/SchweizerischeBundesbahnen/springboot-graceful-shutdown

But this solution only works for Java apps. All other languages/web servers have to implement the same thing again and again. We talked to a lot of people/companies that use OpenShift/Kubernetes, and all of them struggle with this issue. Thus, I would like to propose a solution where the container platform handles termination a bit differently.
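For context, the pieces the application itself controls live in the ordinary pod spec. A minimal sketch, assuming a hypothetical HTTP app image and readiness endpoint (not taken from the original post):

```yaml
# Sketch only: image name, port and probe path are hypothetical examples.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  # Time the application gets between SIGTERM and SIGKILL to finish active requests.
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: example/app:1.0
    ports:
    - containerPort: 8080
    readinessProbe:
      # The probe the application flips to "not ready" after catching SIGTERM.
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
```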

Proposal

Introduce a new pod life cycle state, something like "TerminationPreparation". In this state:

  • Stop the readiness check, set the container to "NotReady", and stop sending SDN traffic to that container
  • In OpenShift, remove the container from the HA-Proxy config and wait until all HA-Proxies are reloaded
  • Introduce a new setting where an application can define how long it needs to finish existing requests. This could also be something like the readiness/liveness checks (e.g. terminationPreparationGracePeriodSeconds).
  • If this time is up, or the application signals that it is done processing requests, send the SIGTERM
  • Applications can still handle that signal if they have to do things like cleanup, but it is no longer mandatory for "zero downtime"

This would massively improve the availability of our applications during any form of container termination. Developers would no longer need to take care of that manually.
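Purely to illustrate the idea, a pod spec under the proposed setting might look roughly like this. The first field below is hypothetical and does not exist in Kubernetes or OpenShift; it is only a rendering of the proposal:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  # Hypothetical, proposed field: how long the pod stays in "TerminationPreparation"
  # (NotReady, removed from endpoints/HA-Proxy) before SIGTERM is sent.
  terminationPreparationGracePeriodSeconds: 30
  # Existing field: time between SIGTERM and SIGKILL.
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    image: example/app:1.0
```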

ReToCode (Member, Author) commented Mar 9, 2018

As discussed in advance, FYI: @sreber84, @eberlec, @saturnism, @smarterclayton, @knobunc.

smarterclayton (Contributor) commented Mar 9, 2018

Just so I understand, using preStop isn't an option? I.e.:

  1. add a preStop hook wait_for_done.sh that:
       • waits 10s
       • optionally sends a signal to the process to tell it to stop accepting new requests
       • waits until the process is done before exiting

won't solve your issue?

I agree that the 10s wait in preStop is ugly, but your wait script can control how long before termination you have to exit, even if your process doesn't cooperate. I.e. if you have a 60s timeout on requests, you should be able to set the termination grace period to 120s, set preStop to wait 60s + 10s of buffer, then exit. Then you'll get 50s for graceful shutdown before SIGKILL gets sent.
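For reference, a minimal sketch of that setup in a pod spec, assuming the 60s request timeout from the example above (a plain sleep stands in for the wait_for_done.sh script, whose exact logic isn't shown in this thread; the image name is a placeholder):

```yaml
spec:
  # Total time the kubelet allows between starting termination and SIGKILL.
  terminationGracePeriodSeconds: 120
  containers:
  - name: app
    image: example/app:1.0
    lifecycle:
      preStop:
        exec:
          # Wait out the longest in-flight request (60s) plus a 10s buffer for
          # endpoint/router updates; only afterwards does the kubelet send SIGTERM,
          # leaving ~50s of the grace period for the process to shut down.
          command: ["/bin/sh", "-c", "sleep 70"]
```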

@ReToCode (Member, Author)

This should work, yes, but this way all apps/devs would have to add that themselves.

At least the first part (stop sending new traffic to an app before sending the SIGTERM) seems like a good idea on the platform level. In a "classic environment" one would remove an app from the load balancer before even thinking about stopping it. I think it would help a lot of people/apps if this were changed globally. I have heard that quite a lot of people struggle with the same problem.

I agree on the second part (finishing active requests before quitting). This is the app's responsibility and can also be done in the way you described.

@jwforres (Member)

@openshift/sig-pod @openshift/sig-networking

@sjenning (Contributor)

This would involve upstream as we are talking about core kube components here. I agree this is an issue and I do hear about it in the community.

In fact, we talked in the sig-node meeting about this proposal, which handles the same problem on the pod bring-up side rather than the tear-down side:
https://docs.google.com/document/d/1VFZbc_IqPf_Msd-jul7LKTmGjvQ5qRldYOFV0lGqxf8/edit

I do think that an additional pod state would be difficult to get accepted upstream.

It seems that this could be accomplished if the Pod was removed from the Endpoints when the deletionTimestamp was set, indicating the Pod is terminating. Then the pod could set whatever terminationGracePeriodSeconds is required as a timeout for draining connections. If the drain completes early, the process can simply exit.

I'm unsure whether Pods are removed from the Endpoints once the deletionTimestamp is set or only when the Pod is actually deleted.

Maybe I'm not grasping the nuance.

smarterclayton (Contributor) commented Mar 21, 2018 via email

smarterclayton (Contributor) commented Mar 21, 2018 via email

@ReToCode (Member, Author)

Thanks for the feedback. As far as I am concerned, a new pod state is not mandatory; anything that helps to improve the situation is welcome :) I agree that this has to be fixed in Kubernetes first, and then OpenShift just needs to add the HA-Proxy part.

@eoftedal

The current behaviour seems to contradict the OpenShift documentation. On https://docs.openshift.com/container-platform/3.7/dev_guide/deployments/advanced_deployment_strategies.html we can read:

On shutdown, OpenShift Container Platform will send a TERM signal to the processes in the container. Application code, on receiving SIGTERM, should stop accepting new connections. This will ensure that load balancers route traffic to other active instances. The application code should then wait until all open connections are closed (or gracefully terminate individual connections at the next opportunity) before exiting.

However, OpenShift will continue to send requests to the pod for some seconds after sending it SIGTERM, and these requests will fail if we stop accepting connections.

eoftedal commented Apr 25, 2018

The current behaviour is quite weird and unexpected:
"Dear pod, please die. Also please handle these requests"
The service should remove the pod from the LB before signalling TERM.

ghost commented Apr 26, 2018

The problem with the recommended process ("application code [...] should stop accepting new connections") is that in Java environments this is part of the application server.
@ReToCode created a workaround for Tomcat (Embedded/Spring Boot), as Tomcat doesn't support graceful shutdown at all.
Undertow supports it, but responds with HTTP status code 503 during the shutdown process.
I adopted that logic for Spring Boot with Tomcat.
But this only works for routes (load balancer) without "SSL-Passthrough". It doesn't work for services.

I fully agree this should be handled by OpenShift (Kubernetes). Before pods are evicted they should be removed from services and routes (load balancers).

A couple of other project teams in my customer's company (operating a large OpenShift cluster) are struggling with the same problem.

@vikaschoudhary16 (Contributor)

@jmencak this seems familiar :)

knobunc (Contributor) commented May 9, 2018

The problem is that there is no tight coupling between the pieces of the system. The router only learns that the backing pods are gone when the endpoint updates. BUT the router can't immediately reload (reloads are rate limited, and even when it can immediately reload, a reload can take a few seconds to a minute depending on the number of routes and the speed of the box).

With haproxy 1.8 we can make some dynamic changes to the running router, so we don't need to do a reload for a lot of changes and responsiveness will be greatly improved.

But for now, you need to make sure that you have some delay between when termination is started and when the pod exits. You can either add a SIGTERM handler to the process, or have a PreStop hook registered that sleeps for a little while.

https://bugzilla.redhat.com/show_bug.cgi?id=1573207#c5
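A minimal sketch of the second option mentioned above (a PreStop hook that just sleeps for a little while; the 10s value is only an example to adjust for your own router reload times):

```yaml
lifecycle:
  preStop:
    exec:
      # Give the router/endpoints time to drop the pod before SIGTERM arrives.
      command: ["/bin/sh", "-c", "sleep 10"]
```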

knobunc closed this as completed May 9, 2018