Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ssl_do_handshake can hang with small buffer #7967

Open
kroeckx opened this issue Dec 29, 2018 · 31 comments
Open

ssl_do_handshake can hang with small buffer #7967

kroeckx opened this issue Dec 29, 2018 · 31 comments
Labels
branch: master Merge to master branch branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 help wanted severity: important Important bugs affecting a released version triaged: bug The issue/pr is/fixes a bug

Comments

@kroeckx
Copy link
Member

kroeckx commented Dec 29, 2018

From #7948 by @njsmith:

My test suite literally started deadlocking when I upgraded to 1.1.1. It's a test where the client does SSL_do_handshakeSSL_write, while the server does SSL_do_handshakeSSL_read, and the transport layer between them has minimal buffering. This is the same communication pattern used by e.g. HTTP/1.1, and it worked fine with previous versions of openssl.

The deadlock happens because the server's SSL_do_handshake doesn't return until it finishes writing out the session tickets, but by the time the server starts trying to write the tickets, the client's SSL_do_handshake has already completed and the client has switched to SSL_write. Neither write can complete because neither side is reading → deadlock.

I could make a standalone reproducer (in Python, say), if that would be helpful, but the problem is very straightforward. As long as the server's SSL_do_handshake sends session tickets but the client's SSL_do_handshake doesn't read them, you have a deadlock hazard here.

I assume that this is with both sides having blocking sockets. Both sides doing a write and blocking on it is always going to deadlock.

One difference between TLS 1.2 and 1.3 is that in TLS 1.2 the server sends the last message of the handshake, while in TLS 1.3 the client does it. In TLS 1.3 the client can directly start sending after it has sent the finish message, while in TLS 1.2 it needs to wait for finish message from the server.

OpenSSL seems to currently hang in SSL_do_handshake() to sent the session tickets. I guess we could make SSL_do_handshake() return when we received the finish message from the client, but I don't think this solves anything. We would then just have to send them in SSL_read(), and we'd hang in SSL_read() instead of SSL_do_handshake().

I currently don't see how we can support TLS 1.3 with blocking sockets on both the client and the server where the first thing the client wants to do is write and the server wants to send session tickets.

@mattcaswell
Copy link
Member

Wouldn't this impact non-blocking sockets too? I've not tried it, but I'm not sure what is special about blocking sockets in this scenario - except of course for non-blocking sockets SSL_do_handshake() would return - but always give SSL_ERROR_WANT_WRITE until the session tickets were written.

I wonder whether this ever actually occurs in a real world scenario, i.e. not in some "test" application? I just did a quick test and observed an s_client <-> s_server interaction. Two session ticket TCP packets were sent* each of length 341 bytes, i.e. the client would need to have a buffer of less than 642 bytes before this becomes a problem. Would we ever reasonably expect clients with buffers that small?

Possible fixes and/or workarounds might be:

  • We could delay sending the tickets until SSL_write() is called on the server (which might be never for some applications). If we went this route I'd suggest making it a non-default option
  • We could introduce an option to enable sending of the tickets "on demand" rather than automatically. Again I'd think this would be a non-default scenario.
  • The application could configure itself to send no tickets (obviously at the cost of not being able to do resumption)
  • We could introduce some kind of "give up" capability when writing the session tickets, e.g. if we've tried X times and we still can't write then don't bother. This wouldn't help with blocking sockets though.
  • We just document this as a restriction

* aside: I wonder this should be optimised so that no "flush" occurs when we know there are more session tickets to write out, in order that all session tickets go into a single TCP packet where possible?

@kroeckx
Copy link
Member Author

kroeckx commented Jan 2, 2019 via email

@mattcaswell
Copy link
Member

Getting an SSL_ERROR_WANT_WRITE should not prevent us from calling
SSL_read() if we know we can read data. I'm not sure if we
currently will then process that data. Don't we have some other
open issue about this?

If we are writing out session tickets then we are in the "init" state. If you call SSL_read() while in the "init" state then we immediately go back into the state machine code and start trying to write the tickets out again.

@njsmith
Copy link

njsmith commented Jan 3, 2019

Wouldn't this impact non-blocking sockets too?

Yeah, the code where I hit this actually uses memory BIOs and handles the I/O itself (to make it easy to run over arbitrary transports, not just OS sockets).

I wonder whether this ever actually occurs in a real world scenario, i.e. not in some "test" application? I just did a quick test and observed an s_client <-> s_server interaction. Two session ticket TCP packets were sent* each of length 341 bytes, i.e. the client would need to have a buffer of less than 642 bytes before this becomes a problem. Would we ever reasonably expect clients with buffers that small?

Yeah, this is why #7948 emphasized the way automatic session tickets can cause data loss, not the way it can deadlock with small buffers :-). The two issues have the same cause, though, and I just posted there about why I think this is important, even though it seems like it should only matter in rare edge cases.

We could delay sending the tickets until SSL_write() is called on the server (which might be never for some applications). If we went this route I'd suggest making it a non-default option
We could introduce an option to enable sending of the tickets "on demand" rather than automatically. Again I'd think this would be a non-default scenario.
The application could configure itself to send no tickets (obviously at the cost of not being able to do resumption)

I guess if these were all implemented, then that would be sufficient to let me work around it in my library for my users – I could unconditionally suppress the automatic sending, and then either let SSL_write take care of it or implement my own tracking and auto-sending in my write wrapper. Unfortunately, right now the only workaround seems to be to disable session tickets entirely for all users, with no way to recover, which is really not great :-(. (And even doing that requires some pretty arcane tweaks at the C level.)

A SSL_write_ticket function seems like a reasonable thing to have in any case, just to expose the TLS 1.3 functionality that a ticket can be sent at any time?

In the long run I suspect the only fully-satisfactory solution would be to add a TLS extension that lets the client request tickets as part of the opening handshake. This would help because right now, the problems all happen because clients don't have any way to know whether session tickets are incoming at session startup. If this extension existed, then openssl could make the client-side SSL_do_handshake request tickets unconditionally, and if the ServerHello confirmed that the extension was understood and session tickets were coming, the client-side SSL_do_handshake would wait until the tickets were received before returning. But I have no idea how hard it would be to get such an extension standardized, and obviously it would still be nice to have some plan in the mean time.

@kroeckx
Copy link
Member Author

kroeckx commented Jan 3, 2019 via email

@davidben
Copy link
Contributor

davidben commented Jan 3, 2019

We could delay sending the tickets until SSL_write() is called on the server (which might be never for some applications). If we went this route I'd suggest making it a non-default option

This is an interesting nuisance! We're still mulling things over on our end, but I think we're largely leaning towards this option, perhaps even by default. It's the most straightforward. For application protocols where the server never writes, the client may also never call SSL_read, so it wouldn't pick up the ticket anyway. (Of course, the client code could be 1.3-aware and pump SSL_read to pick up tickets, but then the server code could also be 1.3-aware and pump an empty SSL_write or some dedicated API to flush the post-handshake bits.)

@kroeckx
Copy link
Member Author

kroeckx commented Jan 3, 2019 via email

@mattcaswell
Copy link
Member

For application protocols where the server never writes, the client may also never call SSL_read, so it wouldn't pick up the ticket anyway.

Well, that's not entirely true - at least not in OpenSSL (don't know about BoringSSL). If a bi-di shutdown occurs then we recently changed things to process the tickets at that point.

@davidben
Copy link
Contributor

davidben commented Jan 4, 2019

We discard SSL_shutdown tickets to avoid confusing callers with unexpected callback calls. The only code I've ever seen do bidi shutdown (it's kinda pointless and the API is awful) does it on accident in teardown code, at a point where getting a ticket callback is questionable. Given, in practice, servers only really send tickets at the beginning, and the ticket will have been picked up already, we thought picking up tickets in that weird edge case wasn't worth the breakage risk.

(Usually the sequence of events is the programmer doesn't call SSL_shutdown at all, notices session resumption is bust due to OpenSSL junking the sessions on SSL_free, finds out they need to call SSL_shutdown, and calls it twice because they found mention of that somewhere. Setting quiet shutdown by default is probably more compatible with what existing OpenSSL code wants than the actual behavior.)

But, yeah, one cannot guarantee TLS 1.3 will always behave like TLS 1.2 w.r.t. session resumption all the time. That was already hopeless with post-handshake tickets. (SSL_do_handshake used to make the session available at the client. Now the client needs to pump SSL_read a bit.) The best we can do is preserve resumption most of the time. Otherwise, connections will still work, but we may not resume in weird cases unless you incorporate a few more things into your program.

njsmith added a commit to njsmith/trio that referenced this issue Jan 7, 2019
It's currently a mess because of how openssl v1.1.1 handles session
tickets. There isn't consensus yet about whether this is an openssl
bug or what:

python-trio#819
openssl/openssl#7948
openssl/openssl#7967
njsmith added a commit to njsmith/trio that referenced this issue Jan 7, 2019
It's currently a mess because of how openssl v1.1.1 handles session
tickets. There isn't consensus yet about whether this is an openssl
bug or what:

python-trio#819
openssl/openssl#7948
openssl/openssl#7967
@kaduk
Copy link
Contributor

kaduk commented Jan 8, 2019

In the long run I suspect the only fully-satisfactory solution would be to add a TLS extension that lets the client request tickets as part of the opening handshake.

Sounds an awful lot like https://tools.ietf.org/html/draft-wood-tls-ticketrequests-01

@davidben
Copy link
Contributor

davidben commented Jan 8, 2019

I'm not sure that quite does it since the server still needs to know the client will be blocking its write on that read. But that turns our nice 1-RTT handshake into a 2-RTT one, so we don't really want that.

@njsmith
Copy link

njsmith commented Jan 8, 2019

@davidben that proposal as currently written doesn't quite solve things, but I think a variant would. Suppose we made it so servers who received a ticket_request extension in the ClientHello confirmed that by echoing it back in EncryptedExtensions, and that when they did this echo that was a guarantee that they would send the specified number of tickets immediately after they sent Finished.

No matter which handshake mode we're in, the client's SSL_do_handshake always reads until it sees the server's Finished. With the extension above, the client could instead keep reading until it sees the expected number of session tickets. And since the client receives EncryptedExtensions before it receives Finished, this can all be done reliably without any extra round trips.

@kaduk
Copy link
Contributor

kaduk commented Jan 8, 2019

@njsmith you are welcome to subscribe and raise this topic in the context of that document at tls@ietf.org

@davidben
Copy link
Contributor

davidben commented Jan 8, 2019

@njsmith Not quite. The server doesn't send tickets until it receives the client Finished flight. This is necessary for the tickets to, e.g, incorporate any client certificates. That is, the handshake diagram for a full handshake is:

C->S: ClientHello
S->C: ServerHello, EncryptedExtensions, CertificateRequest*, Certificate, CertificateVerify, Finished
C->S: Certificate*, CertificateVerify*, Finished
(formally handshake done, but in practice followed by...)
S->C: NewSessionTicket
(* = optional)

That means, in a client-speaks-first protocol (this issue doesn't exist in a server-speaks-first protocol), waiting for the tickets costs an extra RTT.

@njsmith
Copy link

njsmith commented Jan 8, 2019

@davidben oh darn, you're right. So what I said is fine for handshakes that are resuming a session, or where the server doesn't request a client certificate, but not when establishing a new session with client auth.

@davidben
Copy link
Contributor

davidben commented Jan 8, 2019

Most implementations are not going to send the NewSessionTicket early, even in those cases. It's not just that the NewSessionTicket is deferred. The resumption secret is a function of the client second flight.

@njsmith
Copy link

njsmith commented Jan 9, 2019

I see... it looks like the relevant bit of RFC 8446 is:

Note: Although the resumption master secret depends on the client’s second flight, servers which do not request client authentication MAY compute the remainder of the transcript independently and then send a NewSessionTicket immediately upon sending its Finished rather than waiting for the client Finished. This might be appropriate in cases where the client is expected to open multiple TLS connections in parallel and would benefit from the reduced overhead of a resumption handshake, for example.

So it's doable in principle (and might actually be worthwhile anyway for cases like HTTP/1.1 where browsers want to open multiple connections ASAP), but it requires special-case code.

@njsmith
Copy link

njsmith commented Jan 11, 2019

Hmm, I just realized yet another wrinkle. Bi-di shutdown has been suggested as a potential workaround for this issue... but as Kurt pointed out

TLS 1.3 makes it very explicit that sending a close notify only closes the write direction

I.e., there is no such thing as "bidi shutdown" in TLS 1.3. I can't make the close method in my library wait for the peer to send close_notify back, because I don't know anything about the protocol being used, and maybe the peer will just ignore the close_notify and keep going indefinitely. (Also, even in TLS 1.2, attempting to do bidi shutdown will deadlock if the peer never reads.)

@kaduk
Copy link
Contributor

kaduk commented Jan 11, 2019

Yeah, bidi shutdown has never really been reliable unless you control both endpoints, which is clearly not going to be the case for a generic library like yours.

@njsmith
Copy link

njsmith commented May 10, 2019

Just wanted to check in whether anyone has further thoughts here. It's still a blocker for my library supporting TLS 1.3.

It looks like boringssl has switched to sending tickets on the first call to SSL_write, which solves the problem:

https://boringssl.googlesource.com/boringssl/+/777a239175c26fcaa4c6c2049fedc90e859bd9b6

If openssl could make a similar change in 1.1.1c (or so), then that would be excellent. Otherwise, I'd at least like some documentation about exactly what assumptions openssl is making about the underlying transport (e.g., what's the minimum amount of buffering that's guaranteed to work?).

@njsmith
Copy link

njsmith commented May 25, 2019

Ping

@dankegel
Copy link

dankegel commented May 29, 2019

Seems to affect my app (which, like trio, is a wrapper); disabling tsl 1.3 fixes the hang.

(Bad systems which have the hang:
Ubuntu 19.04, which uses libssl-dev 1.1.1b-1ubuntu2,
and mac with home-compiled openssl 1.1.1b;
Good systems which lack the hang:
Ubuntu 18.04, which uses 1.1.0g-2ubuntu4.3, and earlier Ubuntu.)

What buffer needs to be made bigger, exactly?

@mattcaswell
Copy link
Member

If openssl could make a similar change in 1.1.1c (or so), then that would be excellent. Otherwise, I'd at least like some documentation about exactly what assumptions openssl is making about the underlying transport (e.g., what's the minimum amount of buffering that's guaranteed to work?).

We wouldn't make a change like that in a stable branch. I also suspect that, while it might solve this problem, it will cause other applications to break (e.g. applications that don't ever write application data).

It is possible to reduce the ticket size in 1.1.1 which might make it small enough to fit in the "small buffer", e.g. by setting the number of tickets to 1 (SSL_CTX_set_num_tickets) and by using stateful session tickets (see SSL_OP_NO_TICKET).

I could also see an option being added to send tickets on demand (PRs welcome), but such an option would not be backported to 1.1.1 (as a matter of policy we don't allow new features in stable branches). In that way you could mimic the boringssl behaviour by setting the automatic number of tickets to 0, and manually sending tickets when applications perform a write.

Or alternatively an option could be added which switches the default behaviour from automatically sending tickets after the handshake to sending the tickets on SSL_write() (again PRs welcome). As above such an option would not be backported.

@dankegel
Copy link

dankegel commented May 30, 2019

Thanks, Matt. As you suggested, adding this to my app seems to work around the problem for me:

+#ifdef SSL_OP_NO_TLSv1_3
+      /* (partially?) work around hang in openssl 1.1.1b.
+      * https://github.com/openssl/openssl/issues/7967
+      *
+      * Alternate approach: leave num_tickets alone, do
+      *    options |= SSL_OP_NO_TLSv1_3;
+      */
+
+      options |= SSL_OP_NO_TICKET;
+      SSL_CTX_set_num_tickets (context, 0);
+#endif
    SSL_CTX_set_options (context, options);

@njsmith
Copy link

njsmith commented Dec 8, 2019

I described another workaround here: #7948 (comment)

@bsergean
Copy link

I have tried @dankegel workaround but I'm still getting a hang. Here is my backtrace if it helps. I'm talking to sentry.io from a centos8 with very recent openssl (openssl-libs-1.1.1c-2.el8.x86_64)

Backtrace:

Thread 6 (Thread 0x7f72648ea700 (LWP 13)):
#0  0x00007f72677f6b94 in read () from /lib64/libpthread.so.0
#1  0x00007f7267cd1a62 in sock_read () from /lib64/libcrypto.so.1.1
#2  0x00007f7267ccb22e in bread_conv () from /lib64/libcrypto.so.1.1
#3  0x00007f7267cca093 in bio_read_intern () from /lib64/libcrypto.so.1.1
#4  0x00007f7267cca647 in BIO_read () from /lib64/libcrypto.so.1.1
#5  0x00007f726811f0bf in ssl3_read_n () from /lib64/libssl.so.1.1
#6  0x00007f7268123ae6 in ssl3_get_record () from /lib64/libssl.so.1.1
#7  0x00007f7268120c80 in ssl3_read_bytes () from /lib64/libssl.so.1.1
#8  0x00007f7268154c7d in tls_get_message_header () from /lib64/libssl.so.1.1
#9  0x00007f726814a458 in state_machine.part () from /lib64/libssl.so.1.1
#10 0x00007f7268135728 in SSL_do_handshake () from /lib64/libssl.so.1.1

The fix I tried:

-            SSL_CTX_set_options(
-                ctx, SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3 | SSL_OP_CIPHER_SERVER_PREFERENCE);
+            int options = SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3 | SSL_OP_CIPHER_SERVER_PREFERENCE;
+
+#ifdef SSL_OP_NO_TLSv1_3
+            // (partially?) work around hang in openssl 1.1.1b, by disabling TLS V1.3
+            // https://github.com/openssl/openssl/issues/7967
+            options |= SSL_OP_NO_TLSv1_3;
+#endif
+            SSL_CTX_set_options(ctx, options);
         }
         return ctx;
     }

Should I use OPENSSL_VERSION_NUMBER >= 0x10101000L as suggested in here.

Or is it that I should not pass in SSL_OP_CIPHER_SERVER_PREFERENCE ?

@kaduk
Copy link
Contributor

kaduk commented Mar 16, 2020

I don't expect either the OPENSSL_VERSION_NUMBER check or avoiding SSL_OP_CIPHER_SERVER_PREFERENCE to make a difference.

@kroeckx kroeckx added branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch branch: master Merge to master branch severity: important Important bugs affecting a released version triaged: bug The issue/pr is/fixes a bug labels Oct 4, 2020
@t8m
Copy link
Member

t8m commented Nov 10, 2022

Is this still relevant?

@bsergean
Copy link

bsergean commented Nov 10, 2022 via email

@aaugustin
Copy link

I believe I am hitting this issue as well in the context of python-websockets/websockets#1245. The test suite hangs randomly.

Unfortunately, I'm doing all this from Python, which makes it hard for me to provide you with the sequence of OpenSSL calls :-(

It is happening in the following sequence over the loopback interface.

Client socket

  • Connect to server
  • Perform TLS handshake
  • send(<286 bytes>) <-- this is a WebSocket connection handshake; sending a HTTP request would be the same
  • recv() <-- waiting for the response from the server

Server socket

  • Accept connection
  • Perform TLS handshake
  • recv(1024) <-- sometimes, I never receive anything here, despite send() completing successfully on the client!

Running this in a loop triggers the bug after a few iterations.

The workaround suggested above made the issue go away:

import ssl
CLIENT_CONTEXT.options |= ssl.OP_NO_TLSv1_3

This isn't adding much new information. I'm only confirming that it happens and that it's quite difficult to track down from a higher level language (here, Python). You call SSLSocket.sendall(...) on one end; sometimes the data shows up in SSLSocket.recv() on the other end and sometimes it doesn't, even though Wireshark shows the exact same sequence of packets.

@aaugustin
Copy link

As a complement to the above, here's what the situation looks like in Wireshark.

openssl-7967-packet-captures.zip

ok.pcapng and ko.pcapng are two runs of the same code described above, with the associated Pre-Master Secret Log file.

One of them succeeds (OK) and one hangs (KO). At the network layer, everything looks identical until the 14th packet. Then, the first one continue while the second one hangs, terminating 1 second later because I have a timeout for closing the socket after 1 second.

This isn't going to help you solve the issue. I'm only illustrating how frustrating it is to debug.


I'm not completely sure why the error occurs randomly. My hypothesis is: the hang occurs when the OS interrupts the server thread before it can write the session tickets and yields control to the client thread, which sends data.

If I insert a small delay on the client side between "completing the TLS handshake" and "sending data", the issue doesn't occur anymore, which is consistent with that hypothesis.

If I could insert a delay on the server side before sending the session tickets and produce the issue reliably (rather than randomly), that would strengthen my hypothesis. Unfortunately, I don't have a realistic way to do that.

I can put together a minimal reproduction in Python if that helps. If I were you, probably that wouldn't help me... so I'm not doing it now... but happy to do it if you'd like.

@t8m t8m added help wanted branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 and removed branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch labels Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch: master Merge to master branch branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 help wanted severity: important Important bugs affecting a released version triaged: bug The issue/pr is/fixes a bug
Projects
Status: New
Development

No branches or pull requests

9 participants