
[Proposal] Improved graceful shutdown (zero downtime) #18914

Closed
ReToCode opened this issue Mar 9, 2018 · 13 comments
ReToCode (Member) commented Mar 9, 2018

Problem / Motivation

We operate more than 3500 containers on a large OpenShift cluster. A lot of applications have the same problem with the current termination process. To achieve zero downtime during rolling updates, pod restarts, and node evacuations for maintenance, an application has to do the following:

  • The pod has to be killed due to one of the events mentioned above
  • Kubernetes/OpenShift sends a SIGTERM signal
  • The application has to catch the SIGTERM signal
  • The application has to set its readiness probe to false to stop getting new traffic
  • The application has to wait until no new connections are sent via the service
  • In OpenShift the application also has to wait until the HA-Proxy is reloaded so it no longer receives traffic from there. The HA-Proxy also has its own health check, which is not in sync with the readiness state of Kubernetes; to signal HA-Proxy to stop sending traffic, the HTTP listener port has to be closed.
  • The application has to finish its active requests (within the terminationGracePeriodSeconds) and then terminate itself

At @sbb we implemented this behaviour for Spring Boot 1 & 2 with this extension library:
https://github.com/SchweizerischeBundesbahnen/springboot-graceful-shutdown

But this solution only works for Java apps. All other languages/web servers have to implement the same thing again and again. We talked to a lot of people/companies that use OpenShift/Kubernetes, and all of them struggle with this issue. Thus, I would like to propose a solution where the container platform handles termination a bit differently.
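For context, the pieces the application itself controls live in the ordinary pod spec. A minimal sketch, assuming a hypothetical HTTP app image and readiness endpoint (not taken from the original post):

```yaml
# Sketch only: image name, port and probe path are hypothetical examples.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  # Time the application gets between SIGTERM and SIGKILL to finish active requests.
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: example/app:1.0
    ports:
    - containerPort: 8080
    readinessProbe:
      # The probe the application flips to "not ready" after catching SIGTERM.
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
```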

Proposal

Introduce a new pod life cycle state, something like "TerminationPreparation". In this state:

  • Stop the readiness check, set the container to "NotReady", and stop sending SDN traffic to that container
  • In OpenShift, remove the container from the HA-Proxy config and wait until all HA-Proxies are reloaded
  • Introduce a new setting where an application can define how long it needs to finish existing requests. This could also be something like the readiness/liveness checks (e.g. terminationPreparationGracePeriodSeconds).
  • If this time is up, or the application signals that it is done processing requests, send the SIGTERM
  • Applications can still handle that signal if they have to do things like cleanup, but it is no longer mandatory for "zero downtime"

This would massively improve the availability of our applications during any form of container termination. Developers would no longer need to take care of that manually.
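Purely to illustrate the idea, a pod spec under the proposed setting might look roughly like this. The first field below is hypothetical and does not exist in Kubernetes or OpenShift; it is only a rendering of the proposal:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  # Hypothetical, proposed field: how long the pod stays in "TerminationPreparation"
  # (NotReady, removed from endpoints/HA-Proxy) before SIGTERM is sent.
  terminationPreparationGracePeriodSeconds: 30
  # Existing field: time between SIGTERM and SIGKILL.
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    image: example/app:1.0
```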

ReToCode (Member, Author) commented Mar 9, 2018

As discussed in advance, FYI: @sreber84, @eberlec, @saturnism, @smarterclayton, @knobunc.

smarterclayton (Contributor) commented Mar 9, 2018

Just so I understand, using preStop isn't an option? I.e.:

  1. add a preStop hook wait_for_done.sh that:
       • waits 10s
       • optionally sends a signal to the process to tell it to stop accepting new requests
       • waits until the process is done before exiting

won't solve your issue?

I agree that the 10s wait in preStop is ugly, but your wait script can control how long before termination you have to exit, even if your process doesn't cooperate. I.e. if you have a 60s timeout on requests, you should be able to set the termination grace period to 120s, set preStop to wait 60s + 10s of buffer, then exit. Then you'll get 50s for graceful shutdown before SIGKILL gets sent.
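For reference, a minimal sketch of that setup in a pod spec, assuming the 60s request timeout from the example above (a plain sleep stands in for the wait_for_done.sh script, whose exact logic isn't shown in this thread; the image name is a placeholder):

```yaml
spec:
  # Total time the kubelet allows between starting termination and SIGKILL.
  terminationGracePeriodSeconds: 120
  containers:
  - name: app
    image: example/app:1.0
    lifecycle:
      preStop:
        exec:
          # Wait out the longest in-flight request (60s) plus a 10s buffer for
          # endpoint/router updates; only afterwards does the kubelet send SIGTERM,
          # leaving ~50s of the grace period for the process to shut down.
          command: ["/bin/sh", "-c", "sleep 70"]
```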

@ReToCode (Member, Author)

This should work, yes, but this way all apps/devs would have to add that themselves.

At least the first part (stop sending new traffic to an app before sending the SIGTERM) seems like a good idea on the platform level. In a "classic environment" one would remove an app from the load balancer before even thinking about stopping it. I think it would help a lot of people/apps if this were changed globally. I have heard that quite a lot of people struggle with the same problem.

I agree on the second part (finishing active requests before quitting). This is the app's responsibility and can also be done in the way you described.

@jwforres (Member)

@openshift/sig-pod @openshift/sig-networking

@sjenning (Contributor)

This would involve upstream as we are talking about core kube components here. I agree this is an issue and I do hear about it in the community.

In fact, we talked in the sig-node meeting about this proposal, which handles the same problem on the pod bring-up side rather than the tear-down side:
https://docs.google.com/document/d/1VFZbc_IqPf_Msd-jul7LKTmGjvQ5qRldYOFV0lGqxf8/edit

I do think that an additional pod state would be difficult to get accepted upstream.

It seems that this could be accomplished if the Pod was removed from the Endpoints when the deletionTimestamp was set, indicating the Pod is terminating. Then the pod could set whatever terminationGracePeriodSeconds is required as a timeout for draining connections. If the drain completes early, the process can simply exit.

I'm unsure whether Pods are removed from the Endpoints once the deletionTimestamp is set or only when the Pod is actually deleted.

Maybe I'm not grasping the nuance.

smarterclayton (Contributor) commented Mar 21, 2018 via email

smarterclayton (Contributor) commented Mar 21, 2018 via email

@ReToCode (Member, Author)

Thanks for the feedback. As far as I am concerned, a new pod state is not mandatory; anything that helps to improve the situation is welcome :) I agree that this has to be fixed in Kubernetes first, and then OpenShift just needs to add the HA-Proxy part.

@eoftedal

The current behaviour seems to contradict the OpenShift documentation. On https://docs.openshift.com/container-platform/3.7/dev_guide/deployments/advanced_deployment_strategies.html we can read:

On shutdown, OpenShift Container Platform will send a TERM signal to the processes in the container. Application code, on receiving SIGTERM, should stop accepting new connections. This will ensure that load balancers route traffic to other active instances. The application code should then wait until all open connections are closed (or gracefully terminate individual connections at the next opportunity) before exiting.

However, OpenShift will continue to send requests to the pod for some seconds after sending it SIGTERM, and these requests will fail if we stop accepting connections.

eoftedal commented Apr 25, 2018

The current behaviour is quite weird and unexpected:
"Dear pod, please die. Also please handle these requests"
The service should remove the pod from the LB before signalling TERM.

ghost commented Apr 26, 2018

The problem with the recommended process ("application code [...] should stop accepting new connections") is that in Java environments this is part of the application server.
@ReToCode created a workaround for Tomcat (Embedded/Spring Boot), as Tomcat doesn't support graceful shutdown at all.
Undertow supports it, but responds with HTTP status code 503 during the shutdown process.
I adopted that logic for Spring Boot with Tomcat.
But this only works for routes (load balancer) without "SSL-Passthrough". It doesn't work for services.

I fully agree this should be handled by OpenShift (Kubernetes). Before pods are evicted they should be removed from services and routes (load balancers).

A couple of other project teams in my customer's company (operating a large OpenShift cluster) are struggling with the same problem.

@vikaschoudhary16 (Contributor)

@jmencak this seems familiar :)

knobunc (Contributor) commented May 9, 2018

The problem is that there is no tight coupling between the pieces of the system. The router only learns that the backing pods are gone when the endpoint updates. BUT the router can't immediately reload (reloads are rate limited, and even when it can immediately reload, a reload can take a few seconds to a minute depending on the number of routes and the speed of the box).

With haproxy 1.8 we can make some dynamic changes to the running router, so we don't need to do a reload for a lot of changes and responsiveness will be greatly improved.

But for now, you need to make sure that you have some delay between when termination is started and when the pod exits. You can either add a SIGTERM handler to the process, or have a PreStop hook registered that sleeps for a little while.

https://bugzilla.redhat.com/show_bug.cgi?id=1573207#c5
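A minimal sketch of the second option mentioned above (a PreStop hook that just sleeps for a little while; the 10s value is only an example to adjust for your own router reload times):

```yaml
lifecycle:
  preStop:
    exec:
      # Give the router/endpoints time to drop the pod before SIGTERM arrives.
      command: ["/bin/sh", "-c", "sleep 10"]
```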

knobunc closed this as completed May 9, 2018