
not cleanly closing ssl sockets on phased_restart? #1002

Closed
mattyb opened this issue Jun 16, 2016 · 8 comments

Comments
@mattyb

mattyb commented Jun 16, 2016

We're running apache bench against a running puma server bound to ssl sockets. When we perform a puma phased restart, we get a batch of SSL errors in apache bench each time a worker goes down. With verbose ab flags, a healthy connection includes messages like

SSL/TLS State [connect] SSLv3 flush data
read from 0x7f1e770b74e0 [0x7f1e770aab33] (5 bytes => -1 (0xFFFFFFFFFFFFFFFF))
read from 0x7f1e770b74e0 [0x7f1e770aab33] (5 bytes => 5 (0x5))
0000 - 16 03 03 00 aa                                    .....

Where a failed connection looks like

SSL/TLS State [connect] SSLv3 flush data
read from 0x7f1e770b77a0 [0x7f1e770aab33] (5 bytes => -1 (0xFFFFFFFFFFFFFFFF))
read from 0x7f1e770b77a0 [0x7f1e770aab33] (5 bytes => 0 (0x0))
SSL handshake failed (5).

I believe that this is the reason that our AWS ELB returns 504 errors each time we perform a phased restart. cc @timabdulla, who reported similar symptoms in #957

@evanphx
Member

evanphx commented Jul 24, 2016

If it's the same issue as #957, you can configure persistent_timeout now. Please try that, and if it doesn't work, reopen this issue.
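For reference, persistent_timeout is set in the Puma config file. A minimal sketch (the value 20 is illustrative, not a recommendation; if you sit behind a load balancer, pick something larger than its idle timeout):

```ruby
# config/puma.rb -- sketch; 20 seconds is an illustrative value
persistent_timeout 20
```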

@evanphx evanphx closed this as completed Jul 24, 2016
@mattyb
Author

mattyb commented Jul 24, 2016

Thanks @evanphx, but adjusting timeouts doesn't affect this. This error occurs without going through the ELB.

@evanphx evanphx reopened this Jul 24, 2016
@evanphx
Member

evanphx commented Jul 24, 2016

Hm, ok. I wonder if perhaps on a worker exit persistent sockets aren't being closed. That's not an issue for normal HTTP sockets because the process exit will close the socket, but would be an issue for SSL sockets because the close actually sends data back across telling the other side that it's closed. I'll investigate.
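The distinction evanphx describes can be seen outside Puma entirely. This is an illustrative loopback demo (not Puma code) using Ruby's stdlib OpenSSL: closing a TLS socket sends a close_notify alert, so the peer reads a clean EOF instead of the abrupt `5 bytes => 0` ab logs above.

```ruby
require "openssl"
require "socket"

# Build a throwaway self-signed cert for the loopback server.
key  = OpenSSL::PKey::RSA.new(2048)
name = OpenSSL::X509::Name.parse("/CN=localhost")
cert = OpenSSL::X509::Certificate.new
cert.version    = 2
cert.serial     = 1
cert.subject    = name
cert.issuer     = name
cert.public_key = key.public_key
cert.not_before = Time.now
cert.not_after  = Time.now + 3600
cert.sign(key, OpenSSL::Digest.new("SHA256"))

server_ctx      = OpenSSL::SSL::SSLContext.new
server_ctx.cert = cert
server_ctx.key  = key

tcp_server = TCPServer.new("127.0.0.1", 0)
port       = tcp_server.addr[1]
ssl_server = OpenSSL::SSL::SSLServer.new(tcp_server, server_ctx)

server_thread = Thread.new do
  conn = ssl_server.accept
  conn.puts "hello"
  conn.close # sends the TLS close_notify alert before the TCP socket goes away
end

client_ctx             = OpenSSL::SSL::SSLContext.new
client_ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE # self-signed demo cert
ssl = OpenSSL::SSL::SSLSocket.new(TCPSocket.new("127.0.0.1", port), client_ctx)
ssl.connect

line = ssl.gets # the greeting
eof  = ssl.gets # nil: a clean EOF, because close_notify was sent
server_thread.join
ssl.close
```

If the server process simply exited without the SSL-level close, the client would instead see the connection drop mid-session, which TLS clients treat as an error.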

@evanphx
Member

evanphx commented Jul 25, 2016

Nope, close is called on all sockets. Oh, are you using actioncable or websockets or something else that perhaps uses hijack? If so, I'm betting they're not closing the sockets down on shutdown...

@mattyb
Author

mattyb commented Jul 25, 2016

I've reproduced the error with just a puma config file (no rack app) here: https://github.com/mattyb/puma-socket-test
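For readers without the linked repo handy, a config-only reproduction is roughly of this shape (paths, port, and worker count here are illustrative, not the repro's actual values):

```ruby
# config/puma.rb -- illustrative; see the linked repo for the real config
workers 2
bind "ssl://127.0.0.1:9292?key=certs/server.key&cert=certs/server.crt"
```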

@evanphx
Member

evanphx commented Jul 25, 2016

Thanks so much for the reproduction! I'm working on it now.

@evanphx
Member

evanphx commented Jul 26, 2016

@mattyb That should fix your problem. Do you have a way to try it out easily before I release the fix?

@mattyb
Author

mattyb commented Jul 26, 2016

Thanks for working on this! I no longer see the SSL errors popping up in the console and there are new messages in the verbose logs about connections being closed. Unfortunately, even with the reproduction code updated to the 46416cb commit, I still see failed requests from ab.

Concurrency Level:      25
Time taken for tests:   21.217 seconds
Complete requests:      3000
Failed requests:        21
   (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)

It's possible that this is a problem with ab's implementation, though I believe I'm using the latest version. I confirmed that there are no errors if I bind tcp without ssl. I assume it's relevant that all the failures are in "Length" (ab counts a response as a Length failure when its body length differs from that of the first response it received).

I could try the new commit on our real app in our staging environment and see if the ELB handles whatever error is happening more cleanly, but it seems like something may still be awry in puma.

nateberkopec pushed a commit that referenced this issue Nov 21, 2016
The problem was a few points:

* We were not clearing the reactor on a normal stop, which is what
  is used in a phased restart.
* On close, SSL sockets were not sending the shutdown message.
* SSL sockets that were completely uninitialized ended up sitting
  in the reactor and could not actually be shut down, because they
  were never initialized.
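The first two points of the commit message can be sketched together. This is a hypothetical illustration of the fix's shape, not Puma's actual classes: on a normal stop (the path phased restart takes), drain the reactor and close every socket it still holds, so TLS peers receive close_notify rather than an abrupt reset.

```ruby
# Illustrative sketch only -- names do not match Puma's internals.
class Reactor
  def initialize
    @sockets = []
  end

  def add(sock)
    @sockets << sock
  end

  # Called from the stop path used by phased restart.
  def clear!
    @sockets.each do |sock|
      sock.close # for an SSL socket this also sends the close_notify alert
    rescue StandardError
      # a socket that never finished its handshake may fail to close; skip it
    end
    @sockets.clear
  end
end
```

The rescue mirrors the third bullet: a socket whose handshake never completed cannot perform a TLS shutdown, so the drain has to tolerate that instead of aborting.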