ssl_do_handshake can hang with small buffer #7967
Wouldn't this impact non-blocking sockets too? I've not tried it, but I'm not sure what is special about blocking sockets in this scenario - except of course for non-blocking sockets SSL_do_handshake() would return - but always give SSL_ERROR_WANT_WRITE until the session tickets were written.

I wonder whether this ever actually occurs in a real-world scenario, i.e. not in some "test" application? I just did a quick test and observed an s_client <-> s_server interaction. Two session ticket TCP packets were sent*, each of length 341 bytes, i.e. the client would need to have a buffer of less than 682 bytes before this becomes a problem. Would we ever reasonably expect clients with buffers that small?

Possible fixes and/or workarounds might be:
* aside: I wonder whether this should be optimised so that no "flush" occurs when we know there are more session tickets to write out, in order that all session tickets go into a single TCP packet where possible.
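Spelling out the buffer arithmetic above as a quick sanity check (the 341-byte figure is the observed ticket record size from the s_client/s_server test, and two tickets per connection is OpenSSL 1.1.1's default):

```python
# Sanity check on the numbers above: by default OpenSSL 1.1.1 sends two
# session tickets; in the test each occupied a 341-byte TCP packet, so a
# client's buffer must hold both for the server's write not to stall.
ticket_record_size = 341   # observed size, from the comment above
default_ticket_count = 2   # OpenSSL 1.1.1 default
threshold = ticket_record_size * default_ticket_count
print(threshold)           # 682: buffers smaller than this can trigger the hang
```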
> Getting an SSL_ERROR_WANT_WRITE should not prevent us from calling SSL_read() if we know we can read data. I'm not sure if we currently will then process that data. Don't we have some other open issue about this?
If we are writing out session tickets then we are in the "init" state. If you call SSL_read() while in the "init" state then we immediately go back into the state machine code and start trying to write the tickets out again.
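A toy model may make those mechanics concrete. This is not OpenSSL's code: the class and method names are invented, and the ticket sizes are taken from the s_client/s_server observation above. It shows why a read issued while tickets are still queued re-enters the "write tickets" path and reports WANT_WRITE when the outgoing buffer is too small:

```python
# Toy model (not OpenSSL source): a read() in the "init" state first
# re-drives the state machine, retrying the pending ticket writes before
# any application data is returned.
class ToyTLS:
    def __init__(self, out_buffer_size):
        # two queued tickets of 341 bytes each, per the test above
        self.pending_tickets = [b"T" * 341, b"T" * 341]
        self.out = bytearray()          # un-drained outgoing bytes
        self.out_limit = out_buffer_size

    def _flush_tickets(self):
        while self.pending_tickets:
            t = self.pending_tickets[0]
            if len(self.out) + len(t) > self.out_limit:
                return "WANT_WRITE"     # buffer full: caller must drain
            self.out += t
            self.pending_tickets.pop(0)
        return "OK"

    def read(self, incoming):
        # still "in init": state machine runs before application reads
        if self.pending_tickets and self._flush_tickets() == "WANT_WRITE":
            return "WANT_WRITE"
        return incoming

print(ToyTLS(600).read(b"app"))    # WANT_WRITE: second ticket doesn't fit
print(ToyTLS(1024).read(b"app"))   # b'app': both tickets flushed first
```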
Yeah, the code where I hit this actually uses memory BIOs and handles the I/O itself (to make it easy to run over arbitrary transports, not just OS sockets).
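For readers unfamiliar with that pattern, here is a minimal sketch using Python's stdlib bindings (`ssl.MemoryBIO` and `SSLContext.wrap_bio` wrap the same OpenSSL memory-BIO machinery; the variable names are mine). The key point is that the application, not the OS, must drain the outgoing BIO, and post-handshake records such as session tickets land in that same BIO:

```python
import ssl

# Memory-BIO pattern: the application owns the transport and shuttles
# bytes between it and a pair of in-memory BIOs, so it must keep
# draining the outgoing BIO even after do_handshake() succeeds.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE      # demo only: no peer verification

incoming = ssl.MemoryBIO()           # bytes received from the transport
outgoing = ssl.MemoryBIO()           # bytes to be sent over the transport
tls = ctx.wrap_bio(incoming, outgoing)

try:
    tls.do_handshake()               # no peer here, so it cannot complete
except ssl.SSLWantReadError:
    pass                             # expected: waiting for the ServerHello

client_hello = outgoing.read()       # the application must now send these bytes
print(len(client_hello) > 0)         # True: the ClientHello is waiting in the BIO
```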
Yeah, this is why #7948 emphasized the way automatic session tickets can cause data loss, not the way it can deadlock with small buffers :-). The two issues have the same cause, though, and I just posted there about why I think this is important, even though it seems like it should only matter in rare edge cases.
I guess if these were all implemented, then that would be sufficient to let me work around it in my library for my users – I could unconditionally suppress the automatic sending, and then either let A

In the long run I suspect the only fully-satisfactory solution would be to add a TLS extension that lets the client request tickets as part of the opening handshake. This would help because right now, the problems all happen because clients don't have any way to know whether session tickets are incoming at session startup. If this extension existed, then openssl could make the client-side
> You can argue that sending after the finished message from the client is received and processed we are no longer in init, just trying to write (non-application) data.
This is an interesting nuisance! We're still mulling things over on our end, but I think we're largely leaning towards this option, perhaps even by default. It's the most straightforward. For application protocols where the server never writes, the client may also never call SSL_read().
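As a sketch of what that option could look like, here is a toy model (invented names, not a real OpenSSL API): tickets are queued at handshake completion and flushed ahead of the first application write, so a server that never writes never sends them:

```python
# Toy sketch of "send tickets on first write" (invented names, not
# OpenSSL's API): the server queues its NewSessionTickets and flushes
# them in front of the first application write instead of during the
# handshake, so the handshake itself can't stall on them.
class DeferredTicketServer:
    def __init__(self):
        # two 341-byte tickets, matching the observation earlier in the thread
        self.queued_tickets = [b"T" * 341, b"T" * 341]
        self.sent = bytearray()

    def write(self, app_data):
        # tickets ride along with the first real write
        while self.queued_tickets:
            self.sent += self.queued_tickets.pop(0)
        self.sent += app_data
        return len(app_data)

srv = DeferredTicketServer()
srv.write(b"hello")
print(len(srv.sent))   # 341*2 + 5 = 687 bytes on the wire
```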
> One thing I'm not clear about is why the client would hang if it's not blocking on the write.
Well, that's not entirely true - at least not in OpenSSL (don't know about BoringSSL). If a bi-di shutdown occurs then we recently changed things to process the tickets at that point.
We discard (Usually the sequence of events is the programmer doesn't call

But, yeah, one cannot guarantee TLS 1.3 will always behave like TLS 1.2 w.r.t. session resumption all the time. That was already hopeless with post-handshake tickets.
It's currently a mess because of how openssl v1.1.1 handles session tickets. There isn't consensus yet about whether this is an openssl bug or what: python-trio#819 openssl/openssl#7948 openssl/openssl#7967
Sounds an awful lot like https://tools.ietf.org/html/draft-wood-tls-ticketrequests-01
I'm not sure that quite does it since the server still needs to know the client will be blocking its write on that read. But that turns our nice 1-RTT handshake into a 2-RTT one, so we don't really want that.
@davidben that proposal as currently written doesn't quite solve things, but I think a variant would. Suppose we made it so servers who received a

No matter which handshake mode we're in, the client's
@njsmith you are welcome to subscribe and raise this topic in the context of that document at tls@ietf.org
@njsmith Not quite. The server doesn't send tickets until it receives the client Finished flight. This is necessary for the tickets to, e.g., incorporate any client certificates. That is, the handshake diagram for a full handshake is:

C->S: ClientHello
S->C: ServerHello ... Finished
C->S: (Certificate, CertificateVerify,) Finished
S->C: NewSessionTicket

That means, in a client-speaks-first protocol (this issue doesn't exist in a server-speaks-first protocol), waiting for the tickets costs an extra RTT.
@davidben oh darn, you're right. So what I said is fine for handshakes that are resuming a session, or where the server doesn't request a client certificate, but not when establishing a new session with client auth. |
Most implementations are not going to send the NewSessionTicket early, even in those cases. It's not just that the NewSessionTicket is deferred. The resumption secret is a function of the client's second flight.
I see... it looks like the relevant bit of RFC 8446 is:
So it's doable in principle (and might actually be worthwhile anyway for cases like HTTP/1.1 where browsers want to open multiple connections ASAP), but it requires special-case code. |
Hmm, I just realized yet another wrinkle. Bi-di shutdown has been suggested as a potential workaround for this issue... but as Kurt pointed out
I.e., there is no such thing as "bidi shutdown" in TLS 1.3. I can't make the
Yeah, bidi shutdown has never really been reliable unless you control both endpoints, which is clearly not going to be the case for a generic library like yours.
Just wanted to check in whether anyone has further thoughts here. It's still a blocker for my library supporting TLS 1.3.

It looks like boringssl has switched to sending tickets on the first call to https://boringssl.googlesource.com/boringssl/+/777a239175c26fcaa4c6c2049fedc90e859bd9b6

If openssl could make a similar change in 1.1.1c (or so), then that would be excellent. Otherwise, I'd at least like some documentation about exactly what assumptions openssl is making about the underlying transport (e.g., what's the minimum amount of buffering that's guaranteed to work?).
Ping |
Seems to affect my app (which, like trio, is a wrapper); disabling TLS 1.3 fixes the hang. (Bad systems which have the hang:

What buffer needs to be made bigger, exactly?
We wouldn't make a change like that in a stable branch. I also suspect that, while it might solve this problem, it would cause other applications to break (e.g. applications that don't ever write application data).

It is possible to reduce the ticket size in 1.1.1, which might make it small enough to fit in the "small buffer", e.g. by setting the number of tickets to 1 (SSL_CTX_set_num_tickets) and by using stateful session tickets (see SSL_OP_NO_TICKET).

I could also see an option being added to send tickets on demand (PRs welcome), but such an option would not be backported to 1.1.1 (as a matter of policy we don't allow new features in stable branches). That way you could mimic the boringssl behaviour by setting the automatic number of tickets to 0, and manually sending tickets when applications perform a write. Or alternatively an option could be added which switches the default behaviour from automatically sending tickets after the handshake to sending them on SSL_write() (again PRs welcome). As above, such an option would not be backported.
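For what it's worth, both mitigations Matt mentions are also reachable from Python's stdlib `ssl` module (3.8+), which exposes the corresponding OpenSSL knobs. This sketch only sets the options; it doesn't demonstrate the resulting ticket sizes:

```python
import ssl

# Server-side context; num_tickets maps to SSL_CTX_set_num_tickets and
# OP_NO_TICKET to SSL_OP_NO_TICKET (which, for TLS 1.3 in OpenSSL 1.1.1,
# switches to stateful tickets with much smaller NewSessionTicket records).
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# Send one ticket instead of the default two, halving the post-handshake data:
server_ctx.num_tickets = 1

# Or disable stateless tickets entirely:
server_ctx.options |= ssl.OP_NO_TICKET
```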
Thanks, Matt. As you suggested, adding this to my app seems to work around the problem for me:
I described another workaround here: #7948 (comment) |
I have tried @dankegel's workaround but I'm still getting a hang. Here is my backtrace, if it helps. I'm talking to sentry.io from a CentOS 8 box with a very recent openssl (openssl-libs-1.1.1c-2.el8.x86_64). Backtrace:
The fix I tried:
Should I use OPENSSL_VERSION_NUMBER >= 0x10101000L as suggested here? Or is it that I should not pass in SSL_OP_CIPHER_SERVER_PREFERENCE?
I don't expect either the
Is this still relevant? |
I personally have worked around this, so nothing to say on my end ...
I believe I am hitting this issue as well in the context of python-websockets/websockets#1245. The test suite hangs randomly. Unfortunately, I'm doing all this from Python, which makes it hard for me to provide you with the sequence of OpenSSL calls :-( It is happening in the following sequence over the loopback interface. Client socket
Server socket
Running this in a loop triggers the bug after a few iterations. The workaround suggested above made the issue go away:

```python
import ssl

CLIENT_CONTEXT.options |= ssl.OP_NO_TLSv1_3
```

This isn't adding much new information. I'm only confirming that it happens and that it's quite difficult to track down from a higher-level language (here, Python).
As a complement to the above, here's what the situation looks like in Wireshark. openssl-7967-packet-captures.zip
One of them succeeds (OK) and one hangs (KO). At the network layer, everything looks identical until the 14th packet. Then the first one continues while the second one hangs, terminating 1 second later because I have a timeout for closing the socket after 1 second. This isn't going to help you solve the issue; I'm only illustrating how frustrating it is to debug.

I'm not completely sure why the error occurs randomly. My hypothesis is: the hang occurs when the OS interrupts the server thread before it can write the session tickets and yields control to the client thread, which sends data. If I insert a small delay on the client side between "completing the TLS handshake" and "sending data", the issue doesn't occur anymore, which is consistent with that hypothesis. If I could insert a delay on the server side before sending the session tickets and reproduce the issue reliably (rather than randomly), that would strengthen my hypothesis. Unfortunately, I don't have a realistic way to do that.

I can put together a minimal reproduction in Python if that helps. If I were you, it probably wouldn't help me... so I'm not doing it now... but happy to do it if you'd like.
From #7948 by @njsmith:
I assume that this is with both sides having blocking sockets. Both sides doing a write and blocking on it is always going to deadlock.
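That generic deadlock can be demonstrated without TLS at all. In this sketch (plain sockets; timeouts stand in for the permanent hang so the demo terminates), the kernel buffers are shrunk and both peers write while neither reads:

```python
import socket

# Two peers that both block on a write while neither reads will deadlock.
# Small SO_SNDBUF/SO_RCVBUF make it trigger with a modest payload, just
# as small TLS buffers make the session-ticket write stall.
a, b = socket.socketpair()
for s in (a, b):
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 8192)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8192)
    s.settimeout(0.5)          # stand-in for "hangs forever"

payload = b"x" * 1_000_000     # far larger than the socket buffers
stalled = 0
for s in (a, b):
    try:
        s.sendall(payload)     # fills its send buffer + peer's receive buffer
    except socket.timeout:
        stalled += 1           # a blocking socket would hang here instead

print(stalled)                 # 2: both writers stall
```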
One difference between TLS 1.2 and 1.3 is that in TLS 1.2 the server sends the last message of the handshake, while in TLS 1.3 the client does. In TLS 1.3 the client can start sending directly after it has sent the Finished message, while in TLS 1.2 it needs to wait for the Finished message from the server.
OpenSSL currently seems to hang in SSL_do_handshake() trying to send the session tickets. I guess we could make SSL_do_handshake() return once we've received the Finished message from the client, but I don't think this solves anything. We would then just have to send the tickets in SSL_read(), and we'd hang in SSL_read() instead of SSL_do_handshake().
I currently don't see how we can support TLS 1.3 with blocking sockets on both the client and the server where the first thing the client wants to do is write and the server wants to send session tickets.