H13 (connection closed with no response) from Puma on Heroku #1802

Closed
schneems opened this issue May 21, 2019 · 11 comments


@schneems
Contributor

commented May 21, 2019

For full description see reproduction repository https://github.com/schneems/puma_connection_closed_reproduction.

Note: the hanging issue noted there is due to a Ruby 2.6 bug and has since been fixed in master.

Possible cause and fix

It looks like during SIGTERM the socket connection is never explicitly closed; it is only closed when the process calls exit(0). Instead, we can close the socket manually, which will prevent additional incoming requests while still allowing us to respond to requests that have already been accepted.
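A minimal sketch of that idea with a bare `TCPServer` (an illustration only, not Puma's actual internals): closing the listener inside the TERM handler makes new connection attempts fail with `ECONNREFUSED` instead of being accepted and then dropped.

```ruby
require "socket"

# Hypothetical sketch, not Puma's code: close the listening socket when
# TERM arrives, so new clients are refused while already-accepted
# requests could still be answered.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

Signal.trap("TERM") do
  server.close # stop accepting new connections
end

Process.kill("TERM", Process.pid)
sleep 0.2 # give the trap handler a chance to run

refused = begin
  TCPSocket.new("127.0.0.1", port)
  false
rescue Errno::ECONNREFUSED
  true
end
```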

@schneems

Contributor Author

commented May 21, 2019

Internal Heroku support ticket tracking: 696993

@nateberkopec

Member

commented May 21, 2019

To be clear, this is exactly what you were talking about at RailsConf, right?

@schneems

Contributor Author

commented May 21, 2019

Correct, this is the issue I brought up at RailsConf. I needed a parking place to talk about it and leave notes in public.

@mockdeep


commented May 21, 2019

Thanks for looking into this @schneems. I'm from the Heroku ticket; I didn't want to respond there so that it can auto-close. Please let me know if there's any additional info I can help you with.

@dannyfallon

Contributor

commented May 23, 2019

```
2019-05-10T18:40:56.785664+00:00 app[web.1]: [4] - Worker 0 (pid: 8) booted, phase: 0
2019-05-10T18:40:56.787225+00:00 app[web.1]: [4] - Worker 1 (pid: 11) booted, phase: 0
2019-05-10T18:40:57.475171+00:00 heroku[web.1]: State changed from starting to up
2019-05-10T18:41:06.847732+00:00 heroku[router]: at=info method=GET path="/?request_thread=1" host=ruby-h13.herokuapp.com request_id=d2f9b0fe-2250-47d5-8aea-84d3ecd79ae7 dyno=web.1 connect=1ms service=2ms status=200 bytes=104 protocol=https
2019-05-10T18:41:06.848018+00:00 app[web.1]: [4] - Gracefully shutting down workers...
2019-05-10T18:41:06.859330+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=GET path="/?request_thread=6" host=ruby-h13.herokuapp.com request_id=05696319-a6ff-4fad-b219-6dd043536314 dyno=web.1 connect=0ms service=5ms status=503 bytes=0 protocol=https
2019-05-10T18:41:06.905734+00:00 heroku[web.1]: Process exited with status 143
2019-05-10T18:41:06.937369+00:00 heroku[web.1]: State changed from up to crashed
2019-05-10T18:41:07.862567+00:00 heroku[router]: at=error code=H10 desc="App crashed" method=GET path="/?request_thread=3" host=ruby-h13.herokuapp.com request_id=f939302a-e88a-482a-89c9-dfcddf125ca0 dyno=web.1 connect=0ms service= status=503 bytes= protocol=https
```

Are the H10s expected? There's 0.01 seconds between the log line acking the SIGTERM and the H13. Did it H13 because the worker died? Was that an already in-flight request, meaning we have a problem with Puma not allowing draining? And what does Heroku's forced-kill situation look like, i.e. if I trap a SIGTERM and do nothing, at what point would you come in and kill the dyno, and is that a factor here?

@schneems

Contributor Author

commented May 23, 2019

Heroku gives a process 10 seconds to shut down before sending a SIGKILL.
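That escalation can be simulated with plain process primitives (an illustration under stated assumptions, not platform code; a shortened grace period stands in for Heroku's 10 seconds):

```ruby
# A child that traps TERM and does nothing, like a server that never
# finishes shutting down; the "supervisor" then escalates to KILL.
pid = fork do
  Signal.trap("TERM") { }
  sleep
end

sleep 0.5                 # let the child install its trap
Process.kill("TERM", pid) # the polite request is ignored
sleep 1                   # grace period elapses (10 seconds on Heroku)
Process.kill("KILL", pid) # SIGKILL cannot be trapped
Process.wait(pid)
killed_forcibly = $?.termsig == Signal.list["KILL"]
```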

If you're running my reproduction, the reason an H10 happens is that the main process died, so Heroku reports the app as "crashed" https://devcenter.heroku.com/articles/error-codes#h10-app-crashed. This is because we are sending ourselves a SIGTERM rather than the platform sending us one.

I think the H13 comes about because even though the worker is dead and cannot process requests, the socket is still open and has accepted a request that it can never respond to. The connection is only closed when the main process exits via an exit() call.
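The two client-visible outcomes can be contrasted with bare sockets (a sketch, no Puma involved): an open listener lets the kernel complete the TCP handshake even when nothing will ever respond, while a closed listener rejects the handshake so the client, or a router in front of it, can retry elsewhere.

```ruby
require "socket"

server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

# Listener open but nothing serving: the connect succeeds, then the
# server side closes without writing a byte. The client sees EOF with
# no response, which is the H13 shape.
client = TCPSocket.new("127.0.0.1", port)
server.accept.close
no_response = client.read.empty?
client.close

# Listener closed: the handshake itself is rejected.
server.close
refused = begin
  TCPSocket.new("127.0.0.1", port)
  false
rescue Errno::ECONNREFUSED
  true
end
```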

@nateberkopec nateberkopec added the bug label May 28, 2019

schneems added a commit to schneems/puma that referenced this issue May 30, 2019

[close puma#1802] Close listeners on SIGTERM
Currently, when a SIGTERM is sent to a Puma cluster, the signal is trapped and forwarded to all children; the parent then waits for the children to exit before exiting itself. The socket that accepts connections is only closed when the parent process calls `exit 0`. The problem with this flow is that there is a window during which no child processes are left to work on an incoming connection, yet the socket is still open, so clients can connect to it. When this happens, the client connects, but the connection is closed with no response. The desired behavior is instead for the client's connection to be rejected. That lets the client re-connect, or, if there is a load balancer between the client and the Puma server, lets the request be routed to another node.

This PR fixes the behavior by manually closing the socket when SIGTERM is received, before shutting down the worker (child) processes. Once the socket is closed, any incoming request will fail to connect and be rejected, which is our desired behavior. Requests that are already in flight can still be responded to.

## Test

This behavior is quite difficult to test; you'll notice that the test is far longer than the code change. In this test we send an initial request to an endpoint that sleeps for one second, then signal to the other threads that they can continue. We send the parent process a SIGTERM while simultaneously sending other requests, some of which arrive after the server has received the SIGTERM. We want none of those requests to get an `ECONNRESET` error, which would indicate the request was accepted but then closed; instead we want `ECONNREFUSED`.

I ran this test in a loop for a few hours and it passes with my patch; it fails immediately if you remove the call to close the listeners.

```
$ while m test/test_integration.rb:235; do :; done
```

## Considerations

This PR only fixes the problem for "cluster" (i.e. multi-worker) mode. When reproducing the test in single mode (removing the `-w 2` config), it already passes. This leads us to believe that either the bug does not exist in single mode, or at the very least reproducing it via a test in single mode requires a different approach.

@schneems

Contributor Author

commented May 31, 2019

@mockdeep can you try to run your app with my branch? #1808

```
gem 'puma', github: 'schneems/puma', branch: 'schneems/sigterm-problem'
```
@mockdeep


commented Jun 1, 2019

@schneems I've put in a PR that'll hopefully get deployed early next week. I'll give it a few days and get back to you, probably after another week. Thanks so much for digging into this!

@schneems

Contributor Author

commented Jun 7, 2019

Did you get a chance to try this out, @mockdeep?

@mockdeep


commented Jun 7, 2019

Hey @schneems I just looked over our logs and we haven't seen an H13 error since we deployed this on the 4th. I'll keep an eye on it and let you know if things change, but right now it's looking like it may have done the trick. Thanks!

@schneems schneems closed this in 184e151 Jun 10, 2019

schneems added a commit that referenced this issue Jun 10, 2019

Merge pull request #1808 from schneems/schneems/sigterm-problem
[close #1802] Close listeners on SIGTERM
@mockdeep


commented Jun 10, 2019

Just checked again and still no errors. Thanks again!

schneems added a commit to schneems/puma that referenced this issue Jun 25, 2019

[close puma#1802] Close listeners on SIGTERM

Co-authored-by: Danny Fallon <Danny.fallon.ie+github@gmail.com>
Co-authored-by: Richard Schneeman <richard.schneeman+foo@gmail.com>

MSP-Greg added a commit to MSP-Greg/puma that referenced this issue Aug 23, 2019

test_integration.erb - fix intermittent test
Fixes test_sigterm_closes_listeners_on_forked_servers

When the test was added in PR puma#1802, it used the rackup/1second.ru file, which provided a fixed one-second delay.

PR puma#1903 removed the 1second.ru file and added a sleep.ru file, which allowed variable timing. But the 'sleep1' path was added to only one of the test's two connect calls.

The test has passed most of the time but intermittently fails. This adds the one-second delay to the second connect call, which should make it stable.

nateberkopec added a commit that referenced this issue Aug 23, 2019

test_integration.erb - fix intermittent test (#1924)
Fixes test_sigterm_closes_listeners_on_forked_servers
