
[close #1802] Close listeners on SIGTERM #1808

Merged
merged 3 commits into puma:master on Jun 10, 2019

Conversation

@schneems (Contributor) commented May 30, 2019

Currently, when a SIGTERM is sent to a puma cluster, the signal is trapped and forwarded to all children; the parent then waits for the children to exit before exiting itself. The socket that accepts connections is only closed when the parent process calls `exit 0`. The problem with this flow is that there is a period of time in which there are no child processes to work on an incoming connection, yet the socket is still open, so clients can connect to it. When this happens, the client connects, but the connection is closed with no response. Instead, the desired behavior is for the client's connection to be rejected. This allows the client to re-connect, or, if there is a load balancer between the client and the puma server, allows the request to be routed to another node.

This PR fixes the existing behavior by manually closing the socket when SIGTERM is received, before shutting down the worker (child) processes. When the socket is closed, any incoming requests will fail to connect and be rejected, which is our desired behavior. Existing in-flight requests can still receive a response.
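Conceptually, the change to the cluster's SIGTERM handling looks something like the sketch below (not the literal diff; the method names follow the ones discussed in this PR, and the surrounding code is simplified):

```ruby
# Sketch only: simplified view of the cluster's SIGTERM handling with this
# change applied. Not the literal patch.
Signal.trap("SIGTERM") do
  # Close the accepting sockets first so new clients get ECONNREFUSED
  # instead of connecting to a socket no worker will ever service.
  @launcher.close_binder_listeners

  # Then shut the workers down; requests already in flight can still finish.
  stop_workers
end
```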

Test

This behavior is quite difficult to test; you'll notice that the test is far longer than the code change. In this test we send an initial request to an endpoint that sleeps for 1 second. We then signal to other threads that they can continue. We send the parent process a SIGTERM while simultaneously sending other requests, some of which will arrive after the SIGTERM is received by the server. When that happens, we want none of the requests to get an `ECONNRESET` error, which would indicate the request was accepted but then closed. Instead we want `ECONNREFUSED`.
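Roughly, the shape of the test is sketched below (a simplified illustration, not the actual test code; the `request` helper, port 9292, the `/sleep` endpoint, and `server_pid` are all made up for the example):

```ruby
require "socket"

# Minimal helper: open a connection, send a bare HTTP request, read the reply.
def request(port, path)
  sock = TCPSocket.new("127.0.0.1", port)
  sock.write("GET #{path} HTTP/1.0\r\n\r\n")
  sock.read
ensure
  sock&.close
end

slow = Thread.new { request(9292, "/sleep") } # endpoint that sleeps ~1 second
sleep 0.1                                     # let the slow request get accepted

Process.kill(:TERM, server_pid)               # server_pid: the puma parent (hypothetical)

resets = 0
5.times do
  begin
    request(9292, "/")
  rescue Errno::ECONNREFUSED
    # Expected once the listener is closed: the client can safely retry elsewhere.
  rescue Errno::ECONNRESET
    resets += 1                               # accepted but then dropped: the bug
  end
end

slow.join                                     # the in-flight request still gets a response
raise "saw ECONNRESET" unless resets.zero?
```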

I ran this test in a loop for a few hours and it passes with my patch, it fails immediately if you remove the call to close the listeners.

$ while m test/test_integration.rb:235; do :; done

Considerations

This PR only fixes the problem for "cluster" (i.e. multi-worker) mode. When trying to reproduce the bug in single mode (removing the `-w 2` config), the test already passes. This leads us to believe that either the bug does not exist in single mode, or at the very least that reproducing it via a test in single mode requires a different approach.

[close #1802] Close listeners on SIGTERM
Co-authored-by: Danny Fallon <Danny.fallon.ie+github@gmail.com>
Co-authored-by: Richard Schneeman <richard.schneeman+foo@gmail.com>
@nateberkopec (Member) commented Jun 3, 2019

The maintainers think this looks good, but it could use feedback. If you're experiencing issues with closed connections during shutdowns, please give this a shot ❤️

@electron0zero commented Jun 4, 2019

We had this issue; our workaround was to send SIGTERM to nginx (we run puma behind nginx), then sleep for 15 seconds and send SIGTERM to puma.

@adamlogic commented Jun 4, 2019

I see H13 errors all the time when downscaling with Rails Autoscale. I've deployed this patch, and so far so good.

[Screenshot: Heroku metrics, 2019-06-04]

It's only been a couple hours, I'll post another update tomorrow.

@schneems (Contributor, Author) commented Jun 4, 2019

Awesome, thanks for the info @adamlogic!

@adamlogic commented Jun 5, 2019

After giving it about 24 hours, I'm definitely seeing different behavior, although I can't quite make sense of it. I typically get 1-2 H13 errors during a downscale or restart, spread throughout the day. Yesterday I only saw one instance of H13 errors, but it was a burst of 37 errors. Quite unusual.

[Screenshot: Heroku metrics, 2019-06-05]

I'm hoping that was an anomaly. Going to keep this patch running for now.

@schneems (Contributor, Author) commented Jun 5, 2019

There was also an incident at that time, so it might be related? Thanks for running this; keep me updated.

@schneems (Contributor, Author) commented Jun 6, 2019

How did the next 24 hours go?

One theory on where this might be going south is that the default behavior for Puma is not to drain the socket backlog. There's a setting for it, though. Just a theory; I haven't tested it.

@adamlogic commented Jun 6, 2019

No H13s in the last 36+ hours. 👍

[Screenshot: Heroku metrics, 2019-06-06]

[Review thread on lib/puma/cluster.rb (outdated, resolved)]
@adamlogic commented Jun 7, 2019

Spoke too soon. Got 4 H13 errors last night.

[Screenshot: Heroku metrics, 2019-06-07]

Dug into my logs and found this error, which correlates exactly with the H13 errors. This is the only instance of this error in my logs, even though I've had many downscale events.

2019-06-06T21:00:59.137527+00:00 heroku web.2 - - State changed from up to down
2019-06-06T21:01:00.403377+00:00 heroku web.2 - - Stopping all processes with SIGTERM
2019-06-06T21:01:00.428773+00:00 heroku web.2 - - Stopping all processes with SIGTERM
2019-06-06T21:01:01.404286+00:00 app web.2 - - [4] - Gracefully shutting down workers...
2019-06-06T21:01:01.408456+00:00 app web.2 - - bundler: failed to load command: puma (/app/vendor/bundle/ruby/2.5.0/bin/puma)
2019-06-06T21:01:01.408661+00:00 app web.2 - - Errno::ECHILD: No child processes
2019-06-06T21:01:01.408666+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `waitpid'
2019-06-06T21:01:01.408669+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `block in stop_workers'
2019-06-06T21:01:01.408672+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `each'
2019-06-06T21:01:01.408674+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `stop_workers'
2019-06-06T21:01:01.408676+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:406:in `block in setup_signals'
2019-06-06T21:01:01.408678+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:531:in `rescue in run'
2019-06-06T21:01:01.408680+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:494:in `run'
2019-06-06T21:01:01.408682+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/launcher.rb:186:in `run'
2019-06-06T21:01:01.408684+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cli.rb:80:in `run'
2019-06-06T21:01:01.408686+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/bin/puma:10:in `<top (required)>'
2019-06-06T21:01:01.408689+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bin/puma:23:in `load'
2019-06-06T21:01:01.408691+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bin/puma:23:in `<top (required)>'
2019-06-06T21:01:01.975425+00:00 heroku web.2 - - Process exited with status 1

LMK if there's any more info that would be helpful.

@schneems (Contributor, Author) commented Jun 7, 2019

I’ve seen that error before; it happens because the child exited before the parent tried to wait on it. When there is no child, we should rescue that exception and keep going.

Heroku sends TERM to all processes, not just the one it spawned, so sometimes the children exit before the parent can send a TERM to them.

I’ll update later today with a fix for that and let you know.

schneems added some commits Jun 7, 2019

Handle case where children are sent TERM
On Heroku, and potentially other platforms, SIGTERM is sent to ALL processes, not just the parent process. This means that by the time the parent process tries to wait on its children to shut down, they may not exist. When that happens, an `Errno::ECHILD` error is raised, as seen in this comment #1808 (comment).

In that situation we don't really care that there's no child; we want to continue the shutdown process in an orderly fashion, so we can safely ignore the error and move on.
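In code terms, the fix amounts to something like this (a sketch of the idea only, not the actual patch):

```ruby
# Sketch: the platform (e.g. Heroku) may have already TERMed the child
# directly, so tolerate a missing child while reaping workers.
begin
  Process.waitpid(pid) # pid: one worker's process id
rescue Errno::ECHILD
  # No such child: it already exited, so keep shutting down the rest.
end
```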
Make Launcher#close_binder_listeners public
To avoid a `send` call in `cluster.rb` and to indicate to the maintainers that other classes use this method, we will make it public.
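Illustratively, the call-site change looks like this (not the literal diff):

```ruby
# Before: the method was private, so cluster.rb had to bypass visibility:
@launcher.send(:close_binder_listeners)

# After: the method is public and the call is explicit:
@launcher.close_binder_listeners
```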
@schneems (Contributor, Author) commented Jun 7, 2019

@adamlogic I just added 8c78ee2, which should resolve that error. I don't know if that's the cause of the H13s or if it's just covering up some other behavior.

If fixing that error doesn't resolve the H13s, then I would suggest setting `drain_on_shutdown` in your `config/puma.rb` to see if it helps (but please don't set this until after we verify that my above patch works; I would like to rule out one thing at a time, and that's easier to do if we only change one thing).
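For reference, that would look something like this in `config/puma.rb` (assuming the standard `drain_on_shutdown` DSL form):

```ruby
# config/puma.rb
# Process connections already accepted into the socket backlog before
# shutting down, instead of dropping them.
drain_on_shutdown true
```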

@adamlogic commented Jun 7, 2019

👍 Thanks! Just deployed with 84ce04d.

I'll post another update on Monday.

@adamlogic commented Jun 10, 2019

No H13 errors or `Errno::ECHILD` errors in the past 72 hours. 🎉

[Screenshot: Heroku metrics, 2019-06-10]

@nateberkopec (Member) commented Jun 10, 2019

That's the cleanest-looking Heroku metrics timeline I've ever seen.

@schneems merged commit 5e64ed9 into puma:master on Jun 10, 2019

2 checks passed

- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed

schneems added a commit to schneems/puma that referenced this pull request Jun 25, 2019

Handle case where children are sent TERM