[close #1802] Close listeners on SIGTERM #1808
Currently, when a SIGTERM is sent to a puma cluster, the signal is trapped and forwarded to all children; the parent then waits for the children to exit before exiting itself. The socket that accepts connections is only closed when the parent process itself exits.
This PR changes the existing behavior by closing the socket as soon as SIGTERM is received, before shutting down the worker (child) processes. Once the socket is closed, any incoming request fails to connect and is rejected, which is our desired behavior; requests already in flight can still receive a response.
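A minimal, hypothetical sketch of the idea (illustrative only, not puma's actual code): once the process that owns the accept socket closes it, the kernel refuses new connection attempts with `ECONNREFUSED` instead of queueing them for a server that is shutting down.

```ruby
require "socket"

# Illustrative only -- not puma's internals. A parent process that owns
# the accept socket and closes it as soon as TERM arrives.
listener = TCPServer.new("127.0.0.1", 0)
port = listener.addr[1]

Signal.trap("TERM") do
  # Closing the accept socket first means the OS refuses (rather than
  # queues) any connection that arrives from here on. Only after this
  # would the parent forward TERM to its workers and wait on them.
  listener.close
end

Process.kill("TERM", Process.pid)  # simulate the platform sending TERM
sleep 0.1                          # let the trap handler run

begin
  TCPSocket.new("127.0.0.1", port)
rescue Errno::ECONNREFUSED
  puts "late request refused"
end
```

The ordering is the whole fix: close the listener first, then drain the workers, so the window where a connection can be accepted but never served disappears.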
This behavior is quite difficult to test; you'll notice that the test is far longer than the code change. The test sends an initial request to an endpoint that sleeps for 1 second, then signals the other threads that they can continue. We send the parent process a SIGTERM while simultaneously sending other requests, some of which arrive after the server has received the SIGTERM. When that happens, we want none of those requests to be accepted and then dropped without a response.
I ran this test in a loop for a few hours and it passes with my patch; it fails immediately if the call to close the listeners is removed.
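The shape of that test can be sketched roughly like this (hypothetical and simplified to raw sockets; the real test drives a full puma server): one request is accepted and "in flight" before TERM arrives, and a request attempted after TERM must be refused outright rather than accepted and abandoned.

```ruby
require "socket"

listener = TCPServer.new("127.0.0.1", 0)
port = listener.addr[1]

child = fork do
  Signal.trap("TERM") { listener.close }  # the behavior under test
  client = listener.accept
  client.puts "accepted"   # tell the parent the request is in flight
  sleep 0.3                # simulate the slow endpoint (shortened here)
  client.puts "done"       # in-flight request still gets its response
  client.close
end
listener.close             # parent drops its copy; only the child listens

inflight = TCPSocket.new("127.0.0.1", port)
inflight.gets                    # wait until the request is in flight
Process.kill("TERM", child)      # now shut the server down
sleep 0.1                        # let the child's trap handler run

late_refused = begin
  TCPSocket.new("127.0.0.1", port)
  false
rescue Errno::ECONNREFUSED
  true
end

puts inflight.gets.chomp  # in-flight request still completed: "done"
puts late_refused         # late request was rejected, not dropped: true
Process.wait(child)
```

The "accepted" handshake is what makes the race deterministic: the parent only sends TERM once it knows a request is genuinely in flight.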
This PR only fixes the problem for "cluster" (i.e. multi-worker) mode. When trying to reproduce the test in single mode (removing the …
After giving it about 24 hours, I'm definitely seeing different behavior, although I can't quite make sense of it. I typically get 1-2 H13 errors during a downscale or restart, spread throughout the day. Yesterday I only saw one instance of H13 errors, but it was a burst of 37 errors. Quite unusual.
I'm hoping that was an anomaly. Going to keep this patch running for now.
Spoke too soon. Got 4 H13 errors last night.
Dug into my logs and found this error which correlates exactly with the H13 errors. This is the only instance of this error in my logs, even though I've had many downscale events.
LMK if there's any more info that would be helpful.
I’ve seen that error before; it happens when a child process exits before the parent tries to wait on it. When there is no child left to wait on, we should rescue that exception and keep going.
Heroku sends TERM to all processes, not just the one it spawned, so sometimes the children exit before the parent can send them a TERM.
I’ll update later today with a fix for that and let you know.
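A sketch of the kind of rescue being described (hypothetical, not the actual patch): if a child has already exited and been reaped by the time the parent waits on it, `Process.wait` raises `Errno::ECHILD`, which is safe to swallow during shutdown.

```ruby
# Hypothetical illustration of waiting on an already-reaped child.
pid = fork { exit 0 }   # stand-in for a puma worker that dies early
Process.wait(pid)       # first wait succeeds and reaps the child

begin
  Process.wait(pid)     # a second wait has no child left to reap...
rescue Errno::ECHILD
  puts "child already gone; continue shutdown"  # ...so rescue and move on
end
```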
If fixing that error doesn't resolve the H13 then I would suggest setting