
ssl_do_handshake can hang with small buffer #7967

Open
kroeckx opened this issue Dec 29, 2018 · 24 comments


@kroeckx (Member) commented Dec 29, 2018

From #7948 by @njsmith:

My test suite literally started deadlocking when I upgraded to 1.1.1. It's a test where the client does SSL_do_handshake followed by SSL_write, while the server does SSL_do_handshake followed by SSL_read, and the transport layer between them has minimal buffering. This is the same communication pattern used by e.g. HTTP/1.1, and it worked fine with previous versions of openssl.

The deadlock happens because the server's SSL_do_handshake doesn't return until it finishes writing out the session tickets, but by the time the server starts trying to write the tickets, the client's SSL_do_handshake has already completed and the client has switched to SSL_write. Neither write can complete because neither side is reading → deadlock.

I could make a standalone reproducer (in Python, say), if that would be helpful, but the problem is very straightforward. As long as the server's SSL_do_handshake sends session tickets but the client's SSL_do_handshake doesn't read them, you have a deadlock hazard here.

I assume that this is with both sides having blocking sockets. Both sides doing a write and blocking on it is always going to deadlock.

One difference between TLS 1.2 and 1.3 is that in TLS 1.2 the server sends the last message of the handshake, while in TLS 1.3 the client does. In TLS 1.3 the client can start sending application data directly after it has sent its Finished message, while in TLS 1.2 it needs to wait for the Finished message from the server.

OpenSSL currently seems to hang in SSL_do_handshake() to send the session tickets. I guess we could make SSL_do_handshake() return once we have received the Finished message from the client, but I don't think this solves anything. We would then just have to send the tickets in SSL_read(), and we'd hang in SSL_read() instead of SSL_do_handshake().

I currently don't see how we can support TLS 1.3 with blocking sockets on both the client and the server where the first thing the client wants to do is write and the server wants to send session tickets.
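The blocking-write scenario described above can be illustrated with a small model (an illustrative sketch, not OpenSSL code; the buffer capacity and byte counts are hypothetical): each peer pushes its pending write into a bounded pipe, and refuses to read anything until its own write has fully completed, which is what a blocking SSL_do_handshake()/SSL_write() pair does.

```python
def simulate(buffer_size, server_bytes, client_bytes):
    """Model two peers joined by bounded one-way pipes.

    Each side blocks until its own pending write completes, and only
    then starts draining its incoming pipe (i.e. calling read).
    Returns "ok" if both writes complete, "deadlock" otherwise.
    """
    s2c = c2s = 0                       # bytes currently queued in each pipe
    srv, cli = server_bytes, client_bytes
    while srv or cli:
        progress = False
        if srv:                         # server: session tickets from the handshake
            step = min(srv, buffer_size - s2c)
            srv -= step; s2c += step
            progress |= step > 0
        if cli:                         # client: its first application write
            step = min(cli, buffer_size - c2s)
            cli -= step; c2s += step
            progress |= step > 0
        if not srv and c2s:             # server finished writing -> it reads
            c2s = 0; progress = True
        if not cli and s2c:             # client finished writing -> it reads
            s2c = 0; progress = True
        if not progress:
            return "deadlock"
    return "ok"
```

For example, simulate(16384, 682, 4096) completes, while simulate(512, 682, 4096) reports a deadlock: the ticket bytes no longer fit in the pipe, and the client never reads them because it is itself stuck in a blocking write.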

@mattcaswell (Member) commented Jan 2, 2019

Wouldn't this impact non-blocking sockets too? I've not tried it, but I'm not sure what is special about blocking sockets in this scenario - except of course for non-blocking sockets SSL_do_handshake() would return - but always give SSL_ERROR_WANT_WRITE until the session tickets were written.

I wonder whether this ever actually occurs in a real world scenario, i.e. not in some "test" application? I just did a quick test and observed an s_client <-> s_server interaction. Two session ticket TCP packets were sent*, each of length 341 bytes, i.e. the client would need to have a buffer of less than 682 bytes before this becomes a problem. Would we ever reasonably expect clients with buffers that small?

Possible fixes and/or workarounds might be:

  • We could delay sending the tickets until SSL_write() is called on the server (which might be never for some applications). If we went this route I'd suggest making it a non-default option
  • We could introduce an option to enable sending of the tickets "on demand" rather than automatically. Again I'd think this would be a non-default scenario.
  • The application could configure itself to send no tickets (obviously at the cost of not being able to do resumption)
  • We could introduce some kind of "give up" capability when writing the session tickets, e.g. if we've tried X times and we still can't write then don't bother. This wouldn't help with blocking sockets though.
  • We just document this as a restriction

* aside: I wonder whether this should be optimised so that no "flush" occurs when we know there are more session tickets to write out, in order that all session tickets go into a single TCP packet where possible?
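As a quick check on the arithmetic above (the 341-byte record size is a single observed measurement from one s_server run, so treat the threshold as approximate rather than a guarantee):

```python
# Observed in the s_client <-> s_server test above: two NewSessionTicket
# records of 341 bytes each. A client-side buffer smaller than their sum
# cannot absorb the tickets the server insists on writing before its
# SSL_do_handshake() returns.
TICKET_RECORD_BYTES = 341        # one observed NewSessionTicket record
DEFAULT_NUM_TICKETS = 2          # OpenSSL 1.1.1 default ticket count
threshold = TICKET_RECORD_BYTES * DEFAULT_NUM_TICKETS
print(threshold)  # 682
```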

@kroeckx (Member, Author) commented Jan 2, 2019 (comment minimized; content hidden)

@mattcaswell (Member) commented Jan 3, 2019

Getting an SSL_ERROR_WANT_WRITE should not prevent us from calling SSL_read() if we know we can read data. I'm not sure if we currently will then process that data. Don't we have some other open issue about this?

If we are writing out session tickets then we are in the "init" state. If you call SSL_read() while in the "init" state then we immediately go back into the state machine code and start trying to write the tickets out again.

@njsmith commented Jan 3, 2019

Wouldn't this impact non-blocking sockets too?

Yeah, the code where I hit this actually uses memory BIOs and handles the I/O itself (to make it easy to run over arbitrary transports, not just OS sockets).

I wonder whether this ever actually occurs in a real world scenario, i.e. not in some "test" application? I just did a quick test and observed an s_client <-> s_server interaction. Two session ticket TCP packets were sent*, each of length 341 bytes, i.e. the client would need to have a buffer of less than 682 bytes before this becomes a problem. Would we ever reasonably expect clients with buffers that small?

Yeah, this is why #7948 emphasized the way automatic session tickets can cause data loss, not the way it can deadlock with small buffers :-). The two issues have the same cause, though, and I just posted there about why I think this is important, even though it seems like it should only matter in rare edge cases.

  • We could delay sending the tickets until SSL_write() is called on the server (which might be never for some applications). If we went this route I'd suggest making it a non-default option
  • We could introduce an option to enable sending of the tickets "on demand" rather than automatically. Again I'd think this would be a non-default scenario.
  • The application could configure itself to send no tickets (obviously at the cost of not being able to do resumption)

I guess if these were all implemented, then that would be sufficient to let me work around it in my library for my users – I could unconditionally suppress the automatic sending, and then either let SSL_write take care of it or implement my own tracking and auto-sending in my write wrapper. Unfortunately, right now the only workaround seems to be to disable session tickets entirely for all users, with no way to recover, which is really not great :-(. (And even doing that requires some pretty arcane tweaks at the C level.)

A SSL_write_ticket function seems like a reasonable thing to have in any case, just to expose the TLS 1.3 functionality that a ticket can be sent at any time?

In the long run I suspect the only fully-satisfactory solution would be to add a TLS extension that lets the client request tickets as part of the opening handshake. This would help because right now, the problems all happen because clients don't have any way to know whether session tickets are incoming at session startup. If this extension existed, then openssl could make the client-side SSL_do_handshake request tickets unconditionally, and if the ServerHello confirmed that the extension was understood and session tickets were coming, the client-side SSL_do_handshake would wait until the tickets were received before returning. But I have no idea how hard it would be to get such an extension standardized, and obviously it would still be nice to have some plan in the mean time.

@kroeckx (Member, Author) commented Jan 3, 2019 (comment minimized; content hidden)

@davidben (Contributor) commented Jan 3, 2019

We could delay sending the tickets until SSL_write() is called on the server (which might be never for some applications). If we went this route I'd suggest making it a non-default option

This is an interesting nuisance! We're still mulling things over on our end, but I think we're largely leaning towards this option, perhaps even by default. It's the most straightforward. For application protocols where the server never writes, the client may also never call SSL_read, so it wouldn't pick up the ticket anyway. (Of course, the client code could be 1.3-aware and pump SSL_read to pick up tickets, but then the server code could also be 1.3-aware and pump an empty SSL_write or some dedicated API to flush the post-handshake bits.)

@kroeckx (Member, Author) commented Jan 3, 2019 (comment minimized; content hidden)

@mattcaswell (Member) commented Jan 4, 2019

For application protocols where the server never writes, the client may also never call SSL_read, so it wouldn't pick up the ticket anyway.

Well, that's not entirely true - at least not in OpenSSL (don't know about BoringSSL). If a bi-di shutdown occurs then we recently changed things to process the tickets at that point.

@davidben (Contributor) commented Jan 4, 2019

We discard SSL_shutdown tickets to avoid confusing callers with unexpected callback calls. The only code I've ever seen do bidi shutdown (it's kinda pointless and the API is awful) does it on accident in teardown code, at a point where getting a ticket callback is questionable. Given, in practice, servers only really send tickets at the beginning, and the ticket will have been picked up already, we thought picking up tickets in that weird edge case wasn't worth the breakage risk.

(Usually the sequence of events is the programmer doesn't call SSL_shutdown at all, notices session resumption is bust due to OpenSSL junking the sessions on SSL_free, finds out they need to call SSL_shutdown, and calls it twice because they found mention of that somewhere. Setting quiet shutdown by default is probably more compatible with what existing OpenSSL code wants than the actual behavior.)

But, yeah, one cannot guarantee TLS 1.3 will always behave like TLS 1.2 w.r.t. session resumption all the time. That was already hopeless with post-handshake tickets. (SSL_do_handshake used to make the session available at the client. Now the client needs to pump SSL_read a bit.) The best we can do is preserve resumption most of the time. Otherwise, connections will still work, but we may not resume in weird cases unless you incorporate a few more things into your program.

njsmith added a commit to njsmith/trio that referenced this issue Jan 7, 2019
It's currently a mess because of how openssl v1.1.1 handles session
tickets. There isn't consensus yet about whether this is an openssl
bug or what:

python-trio#819
openssl/openssl#7948
openssl/openssl#7967
@kaduk (Contributor) commented Jan 8, 2019

In the long run I suspect the only fully-satisfactory solution would be to add a TLS extension that lets the client request tickets as part of the opening handshake.

Sounds an awful lot like https://tools.ietf.org/html/draft-wood-tls-ticketrequests-01

@davidben (Contributor) commented Jan 8, 2019

I'm not sure that quite does it, since the server would still need to know that the client is going to block its write on that read. And having the client wait for the tickets turns our nice 1-RTT handshake into a 2-RTT one, so we don't really want that.

@njsmith commented Jan 8, 2019

@davidben that proposal as currently written doesn't quite solve things, but I think a variant would. Suppose we made it so servers who received a ticket_request extension in the ClientHello confirmed that by echoing it back in EncryptedExtensions, and that when they did this echo that was a guarantee that they would send the specified number of tickets immediately after they sent Finished.

No matter which handshake mode we're in, the client's SSL_do_handshake always reads until it sees the server's Finished. With the extension above, the client could instead keep reading until it sees the expected number of session tickets. And since the client receives EncryptedExtensions before it receives Finished, this can all be done reliably without any extra round trips.
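A toy model of that proposal (illustrative Python, not OpenSSL code; message names only, and it deliberately ignores the server-side timing constraint raised later in the thread): the client's read loop stops at the server's Finished today, but with an echoed ticket-request extension it would keep reading until the promised number of tickets has arrived.

```python
def client_handshake_reads(server_flight, promised_tickets):
    """Consume the server's messages; return how many are left unread.

    promised_tickets = 0 models today's behaviour (stop at Finished);
    a nonzero value models the proposed echoed ticket_request extension.
    """
    seen_finished, tickets, consumed = False, 0, 0
    for msg in server_flight:
        consumed += 1
        if msg == "Finished":
            seen_finished = True
        elif msg == "NewSessionTicket":
            tickets += 1
        if seen_finished and tickets >= promised_tickets:
            break
    return len(server_flight) - consumed

flight = ["ServerHello", "EncryptedExtensions", "Certificate",
          "CertificateVerify", "Finished",
          "NewSessionTicket", "NewSessionTicket"]
print(client_handshake_reads(flight, 0))  # 2: tickets left sitting in the pipe
print(client_handshake_reads(flight, 2))  # 0: handshake returns with tickets in hand
```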

@kaduk (Contributor) commented Jan 8, 2019

@njsmith you are welcome to subscribe and raise this topic in the context of that document at tls@ietf.org

@davidben (Contributor) commented Jan 8, 2019

@njsmith Not quite. The server doesn't send tickets until it receives the client Finished flight. This is necessary for the tickets to, e.g., incorporate any client certificates. That is, the handshake diagram for a full handshake is:

C->S: ClientHello
S->C: ServerHello, EncryptedExtensions, CertificateRequest*, Certificate, CertificateVerify, Finished
C->S: Certificate*, CertificateVerify*, Finished
(formally handshake done, but in practice followed by...)
S->C: NewSessionTicket
(* = optional)

That means, in a client-speaks-first protocol (this issue doesn't exist in a server-speaks-first protocol), waiting for the tickets costs an extra RTT.

@njsmith commented Jan 8, 2019

@davidben oh darn, you're right. So what I said is fine for handshakes that are resuming a session, or where the server doesn't request a client certificate, but not when establishing a new session with client auth.

@davidben (Contributor) commented Jan 8, 2019

Most implementations are not going to send the NewSessionTicket early, even in those cases. It's not just that the NewSessionTicket is deferred. The resumption secret is a function of the client second flight.

@njsmith commented Jan 9, 2019

I see... it looks like the relevant bit of RFC 8446 is:

Note: Although the resumption master secret depends on the client’s second flight, servers which do not request client authentication MAY compute the remainder of the transcript independently and then send a NewSessionTicket immediately upon sending its Finished rather than waiting for the client Finished. This might be appropriate in cases where the client is expected to open multiple TLS connections in parallel and would benefit from the reduced overhead of a resumption handshake, for example.

So it's doable in principle (and might actually be worthwhile anyway for cases like HTTP/1.1 where browsers want to open multiple connections ASAP), but it requires special-case code.

@njsmith commented Jan 11, 2019

Hmm, I just realized yet another wrinkle. Bi-di shutdown has been suggested as a potential workaround for this issue... but as Kurt pointed out

TLS 1.3 makes it very explicit that sending a close notify only closes the write direction

I.e., there is no such thing as "bidi shutdown" in TLS 1.3. I can't make the close method in my library wait for the peer to send close_notify back, because I don't know anything about the protocol being used, and maybe the peer will just ignore the close_notify and keep going indefinitely. (Also, even in TLS 1.2, attempting to do bidi shutdown will deadlock if the peer never reads.)

@kaduk (Contributor) commented Jan 11, 2019

Yeah, bidi shutdown has never really been reliable unless you control both endpoints, which is clearly not going to be the case for a generic library like yours.

@njsmith commented May 10, 2019

Just wanted to check in whether anyone has further thoughts here. It's still a blocker for my library supporting TLS 1.3.

It looks like boringssl has switched to sending tickets on the first call to SSL_write, which solves the problem:

https://boringssl.googlesource.com/boringssl/+/777a239175c26fcaa4c6c2049fedc90e859bd9b6

If openssl could make a similar change in 1.1.1c (or so), then that would be excellent. Otherwise, I'd at least like some documentation about exactly what assumptions openssl is making about the underlying transport (e.g., what's the minimum amount of buffering that's guaranteed to work?).

@njsmith commented May 25, 2019

Ping

@dankegel commented May 29, 2019

Seems to affect my app (which, like trio, is a wrapper); disabling TLS 1.3 fixes the hang.

(Bad systems which have the hang:
Ubuntu 19.04, which uses libssl-dev 1.1.1b-1ubuntu2,
and mac with home-compiled openssl 1.1.1b;
Good systems which lack the hang:
Ubuntu 18.04, which uses 1.1.0g-2ubuntu4.3, and earlier Ubuntu.)

What buffer needs to be made bigger, exactly?

@mattcaswell (Member) commented May 29, 2019

If openssl could make a similar change in 1.1.1c (or so), then that would be excellent. Otherwise, I'd at least like some documentation about exactly what assumptions openssl is making about the underlying transport (e.g., what's the minimum amount of buffering that's guaranteed to work?).

We wouldn't make a change like that in a stable branch. I also suspect that, while it might solve this problem, it will cause other applications to break (e.g. applications that don't ever write application data).

It is possible to reduce the amount of ticket data sent in 1.1.1, which might make it small enough to fit in the "small buffer", e.g. by setting the number of tickets to 1 (SSL_CTX_set_num_tickets) or by using stateful session tickets (see SSL_OP_NO_TICKET).

I could also see an option being added to send tickets on demand (PRs welcome), but such an option would not be backported to 1.1.1 (as a matter of policy we don't allow new features in stable branches). In that way you could mimic the boringssl behaviour by setting the automatic number of tickets to 0, and manually sending tickets when applications perform a write.

Or alternatively an option could be added which switches the default behaviour from automatically sending tickets after the handshake to sending the tickets on SSL_write() (again PRs welcome). As above such an option would not be backported.

@dankegel commented May 30, 2019

Thanks, Matt. As you suggested, adding this to my app seems to work around the problem for me:

+#ifdef SSL_OP_NO_TLSv1_3
+      /* (partially?) work around hang in openssl 1.1.1b.
+      * https://github.com/openssl/openssl/issues/7967
+      *
+      * Alternate approach: leave num_tickets alone, do
+      *    options |= SSL_OP_NO_TLSv1_3;
+      */
+
+      options |= SSL_OP_NO_TICKET;
+      SSL_CTX_set_num_tickets (context, 0);
+#endif
    SSL_CTX_set_options (context, options);
