AFAICT openssl's strategy for handling TLS 1.3 session tickets makes it impossible to reliably implement communication patterns where the server never sends application-level data #7948
I maintain a Python networking library called Trio, and I've been struggling to get it working with openssl v1.1.1/TLS 1.3. We use openssl with memory BIOs and have an extensive test suite that passes with earlier openssl versions, but hits a number of problems after upgrading to v1.1.1. The main issue seems to be the session tickets that openssl sends after the handshake in server mode, and how they affect connections where the client never calls `SSL_read()`.
Due diligence: I found these previous issues/PRs that are all about the exact same issue that I'm facing, but having read them carefully I still can't figure out how to make this work: #6342, #6904, #6944. Also, for reference, this is the Trio bug: python-trio/trio#819
TCP (as I understand it)
Let's ignore TLS for the moment and just talk about TCP. Suppose we have a client that connects, sends some data, and then disconnects, without ever calling `recv()`:
```
# Socket client
sock = connect(...)
while ...:
    sock.send(...)
sock.close()

# Socket server (safe)
sock = accept()
while ...:
    sock.recv(...)
    if eof:
        sock.close()
        break
```
This isn't a terribly common pattern, but it's perfectly legal and reliable. Call this the safe pattern.
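For concreteness, here's a runnable Python sketch of the safe pattern over localhost (the helper names `run_client`/`run_server` are mine, not part of the discussion):

```python
import socket
import threading

def run_client(addr, payload):
    # Safe pattern: connect, send, close -- never call recv().
    sock = socket.create_connection(addr)
    sock.sendall(payload)
    sock.close()

def run_server(listener, received):
    # Safe pattern: read until EOF, then close.
    sock, _ = listener.accept()
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # EOF: the client's FIN arrived
            break
        chunks.append(data)
    sock.close()
    received.append(b"".join(chunks))

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
received = []
server_thread = threading.Thread(target=run_server, args=(listener, received))
server_thread.start()
run_client(listener.getsockname(), b"hello " * 1000)
server_thread.join()
# Since the server never sends anything, the client's close() produces a
# clean FIN and every byte arrives intact.
```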
But, TCP has a gotcha: suppose we have the server send a bit of data, and change nothing else. In particular the client still never calls `recv()`:
```
# Socket server (unsafe)
sock = accept()
sock.send(<one byte of data>)  # <--- This is the only line that's different
while ...:
    sock.recv(...)
    if eof:
        sock.close()
        break
```
Now, this is a little funny looking, because the server is sending some data that the client will never read. But, whatever, that shouldn't affect anything, right? In particular, it shouldn't affect the data being sent from the client→server... right?
Well, that would be logical, but it's wrong! In the unsafe pattern, an arbitrary amount of the data sent by the client can be lost, even though the client code didn't change at all.
This happens because of arcane details of how TCP works: in the safe pattern, when the client calls `close()`, its kernel sends a FIN, the server reads until EOF, and everything shuts down cleanly. But in the unsafe pattern, the client closes its socket while there's still unread data from the server sitting in (or on its way to) its receive buffer – and when a socket is closed with unread data pending, the kernel performs an abortive close.
So in this case, the client's kernel might send a RST, instead of or in addition to the FIN. And then when the server's kernel sees an RST packet, it discards all buffered data. So, if there's any data that the client sent that's still sitting in the server kernel buffers, it disappears forever.
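The abortive-close machinery can be poked at directly from Python. This sketch (mine, not from the thread) uses `SO_LINGER` with a zero timeout to force a RST on close – the same kind of abort the kernel performs when a socket is closed with unread data; whether the server can still read the already-buffered bytes after the RST arrives varies by OS and kernel version:

```python
import socket
import struct
import time

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()

client.sendall(b"important data")
time.sleep(0.2)  # let the bytes reach the server's kernel buffer

# SO_LINGER with l_onoff=1, l_linger=0 turns close() into an abortive
# close: the kernel sends RST instead of FIN.
client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                  struct.pack("ii", 1, 0))
client.close()
time.sleep(0.2)  # let the RST arrive

try:
    got = server.recv(1024)  # the buffered data may or may not survive
except OSError:  # e.g. ConnectionResetError
    got = b""
```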
What does this have to do with TLS?
Generally speaking, it should be possible to take any application that uses raw TCP, and switch it to use TLS-over-TCP instead, right? (With the notable exception of half-closed connections, but never mind.) So let's port our client/server to use TLS:
```
# TLS-over-TCP client
tlssock = connect(...)
tlssock.do_handshake()
while ...:
    tlssock.send(...)
tlssock.send_close_notify()  # Often skipped in practice, but let's be standards-compliant
tlssock.close_tcp()

# TLS-over-TCP server ("safe")
tlssock = accept()
tlssock.do_handshake()
while ...:
    tlssock.recv(...)
    if eof:
        tlssock.send_close_notify()
        tlssock.close_tcp()
        break
```
Now here's the issue: with TLS 1.2 and earlier, if we follow the "safe pattern" at the application layer, like this, then openssl will ultimately translate that into the "safe pattern" at the TCP layer. But with TLS 1.3, openssl's habit of sending session tickets after the handshake means that this exact same code now produces the unsafe pattern at the TCP layer. The server→client session tickets could cause the client→server application data to be lost.
#6944 changes how openssl reacts to getting notified of a client close while sending session tickets, so that the server can keep calling `SSL_read()` and retrieve the data the client already sent.
There's also a secondary problem, but it's more theoretical: if the server→client buffer is small enough, then this code could deadlock at the handshake – the server's call to `do_handshake()` won't return until the session tickets have been written out, but the client never reads, so once the buffer fills up both sides block forever.
What to do?
Of course we can disable session tickets, but this is difficult in our case because Python's openssl bindings don't expose `SSL_CTX_set_num_tickets()`.
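(As an aside that postdates this thread: CPython 3.8 eventually did expose this knob as `SSLContext.num_tickets`, which wraps `SSL_CTX_set_num_tickets()`. A minimal sketch, assuming Python 3.8+:)

```python
import ssl

# Server-side contexts (and only server-side ones) accept num_tickets;
# setting it to 0 stops OpenSSL from sending TLS 1.3 session tickets.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.num_tickets = 0
```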
We could do full bidirectional-shutdown, as suggested here. This seems problematic, though – the RFCs explicitly say that bidi shutdown is never required, bidi shutdown has always been useless before, and I've never heard of any existing software that does it. Trio doesn't have any way to force its peers to use it. If OpenSSL is going to say that all TLS 1.3 implementations that want to interoperate with OpenSSL have to switch to bidi shutdown, then that seems like a huge ask. Also, isn't TLS 1.3 supposed to reduce the number of round-trips we make?
The best idea I have is: by default, have the server hold off on sending session tickets until its first call to `SSL_write()`, so that servers which never send application data never send tickets, together with some way for applications that want different behaviour to opt out.
Have you read the shutdown documentation as updated after #7188 was merged?
Did you read the wiki about TLS 1.3, in particular the section about sessions?
Yes. But generally speaking, any encapsulation of a protocol by another protocol causes problems. Generally, it's hard to abstract things away. And you pointed it out yourself above: there is an unsafe pattern in TCP.
You really can't skip sending the close notify. If you skip that you're vulnerable to a truncation attack. The other side should react to that with a protocol error.
The standards-compliant case would be to receive the close notify before closing the TCP socket. The other side could send back a close notify alert; closing the connection without waiting for it can cause your unsafe TCP pattern. So this really is the unsafe TLS example.
That you need to close the TLS layer before you close the TCP layer is one of the many ways in which TLS leaks things to the application.
I'm sure that you can cause any TCP connection to deadlock.
The only case where the bidirectional shutdown is required is when you want to resume the session. I currently don't see why a one-way shutdown shouldn't work, as long as you disable tickets.
Any TLS 1.3 implementation that wants to support resumption really is going to have the same problem. And you really want to support resumption.
So for clients that only send data and never receive any, TLS 1.3 really forces you to either disable resumption or do a bidirectional shutdown.
Only at the start of the connection, to get the data faster.
You really also want a client that only writes to support resumption. In that case, if SSL_read() is never called, you can never resume the session.
It's true that this kind of encapsulation is difficult, but previous versions of openssl managed it, and current openssl manages it except when using TLS 1.3 with session tickets.
For sure. But it used to be that if you used the "safe pattern" at the application level, that would map onto the "safe pattern" at the TCP level, and vice-versa – that's a non-leaky abstraction. The regression is that now if you use the safe pattern at the application level, openssl will silently convert that into the unsafe pattern at the TCP level. That's a leaky abstraction.
Yeah, Trio supports two modes: by default it does unidirectional close_notify and expects unidirectional close_notify. Or, if you set `https_compatible=True`, it tolerates the peer closing the connection without a proper close_notify, to match how HTTPS implementations behave in the wild.
Actually, no! It's not at all obvious, but AFAICT it really is true that safe patterns used to map to safe patterns and vice-versa, even if you have a mix of peers using unidirectional and bidi shutdown.
In your example, say that our program implements bidi shutdown, and we're talking to a peer that does unidirectional shutdown. It sends us a close_notify, and we send one back. You're right, our close_notify may provoke them to send a RST... but the fact that we've already received their close_notify means that we've already emptied our receive buffer, so their RST can't cause any harm. Disaster is narrowly averted.
Another tricky case is when our peer performs a unidirectional shutdown while we're in the middle of sending data. In this case, the data we're sending might provoke them to send a RST back, and that might cause their close_notify to be lost, and now we don't know where the end of their data is. But, this still doesn't break the abstraction, because in this case plain TCP suffers from exactly the same problem: if they close their socket when we're in the middle of sending data, then their FIN may be lost, and we don't know where the end of their data is. So the semantics are annoying, but they're predictable and consistent regardless of whether you're using TLS.
Not sure what you mean. Empirically, this is something that we abstracted away and it used to work fine. Trio's `SSLStream` presents the same interface as a plain TCP stream and handles the close-the-TLS-layer-before-the-TCP-layer sequencing internally.
??? "Writing reliable network protocols is impossible, so there's no point in trying" is not the attitude I was hoping to hear from an openssl dev :-(.
Empirically, we have plenty of protocol implementations that survive our deadlock torture test just fine, including previous versions of openssl.
And actually, now that I think about it, this deadlock thing is more of a practical problem than I realized, because it means that with openssl 1.1.1 we can no longer use the deadlock torture test to test protocols that run on top of TLS :-(
Right, but here I'm talking about the case where you don't disable tickets. If tickets aren't disabled, then @mattcaswell pointed out in #6904 that the example client/server can be made to work again by turning on bidi shutdown on the client, so it's a possible workaround.
My understanding is that bidi shutdown on the client is sufficient to prevent data loss in cases like my example client/server, but it isn't sufficient to support resumption, because openssl doesn't process session tickets that it receives after sending a close_notify.
In any case, sure, it would be nice if our example client/server could support resumption. It's too bad that with TLS 1.3, they can't, without further changes. But that doesn't mean it's OK to stop transmitting their data reliably! Our client/server here are only relying on the traditional guarantees made by TCP and all previous versions of openssl.
In your quote you dropped my third bullet point, which partially addresses this?
Besides... current openssl has exactly the same flaw in practice: if you have a protocol where the server never calls `SSL_write()`, then the client has no reason to call `SSL_read()`, so it never processes the session tickets it was sent, and can't resume anyway.
The safe pattern for both TCP and TLS is to make sure that both ends agree all data has been sent before closing the connection. TLS has a mechanism for that, and it's to send the close notify in both directions.
I don't know enough about http(s), but is it always clear from the protocol when it's finished sending the data? Is a truncation attack possible?
In https tests I've done I've seen every combination of close notify that's possible, including none at all and bidirectional.
TLS 1.3 makes it very explicit that sending a close notify only closes the write direction. The other peer is still allowed to send application data back. If it still wants to send application data, and not yet the close notify, you do have a problem. In TCP you can also do a shutdown(SHUT_WR); the connection isn't fully closed.
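That TCP half-close can be sketched in a few lines of Python, with a `socketpair` standing in for a connected client/server pair:

```python
import socket

a, b = socket.socketpair()
a.sendall(b"request")
a.shutdown(socket.SHUT_WR)  # half-close: "a" is done sending

req = b.recv(100)           # "b" still receives the request...
eof = b.recv(100)           # ...then sees EOF (b"")
b.sendall(b"response")      # but data can still flow the other way
resp = a.recv(100)
```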
OpenSSL has supported that for a very long time, even when the older TLS standards didn't say that was supported.
That's not at all what I'm saying. I'm saying that if both sides of a connection are doing the wrong thing, it's very easy to cause a deadlock. If you think there is a bug in OpenSSL that makes it deadlock while talking to itself, file a bug. Example code would be nice.
We've fixed processing session tickets after sending close notify in #7114.
Oh, you want a server that never calls SSL_write() to call some other function instead if it wants to be able to resume? That suddenly changes servers not to support resumption any more, and even if the server changes to do it, the client still has the same problem that it needs to read the session tickets.
Oh wow, I'd missed this, and it's super exciting, thank you! That fixes the one place where it used to be impossible to abstract away the difference between TLS and other byte-stream transports. That's really my main concern here... I want to be able to write protocol code in a generic way, that works the same on TCP, TLS, or whatever other transport makes sense. The openssl 1.1.1 session ticket handling breaks that guarantee.
In the past it's always been safe to take any program that worked correctly using plain TCP, and switch it to use TLS with unidirectional close_notify. I guess you can argue that unidirectional close_notify was always "wrong" in some vague existential sense, but it was explicitly allowed by the RFCs and it worked in practice, which seems more important to me.
This is the bug you are asking me to file :-). My test suite literally started deadlocking when I upgraded to 1.1.1. It's a test where the client does the handshake and then starts sending data over a transport with a bounded buffer, without ever reading.
The deadlock happens because the server's handshake doesn't complete until the session tickets have been written out, the client never reads them, the buffer between the two peers fills up, and then both sides are stuck blocking on writes.
I could make a standalone reproducer (in Python, say), if that would be helpful, but the problem is very straightforward. As long as the server's session tickets are bigger than the buffer between the peers, and the client never reads, the two sides deadlock.
Oh, excellent, thanks for the update.
I think this is confusing and we need to break it down by cases :-)
Case 1: client/server where the server sends application data to the client: the session tickets travel alongside data the client is reading anyway, so they get processed and nothing is lost, under either strategy.
Case 2: client/server where the server never sends application data to the client, with non-bidirectional close: under openssl's current strategy, this is exactly the code that silently produces the unsafe TCP pattern, so the client's data can be lost.
Case 3: client/server where the server never sends application data to the client, with bidi close: this works correctly, because the client drains everything the server sends (including the tickets) while waiting for the server's close_notify.
So both strategies have some downsides compared to the old TLS 1.2 way of doing things, but make slightly different trade-offs about who suffers if they don't update their programs, and what kind of suffering they experience. It seems like openssl is saying that it's OK if some "case 2" users lose data, because in exchange, some "case 3" users will get slightly lower latency.
My impression is that for existing apps, "case 1" is far more common than "case 2", and "case 2" is far more common than "case 3".
I don't think it's a good trade-off to sacrifice correctness in the relatively common case, in order to speed up the relatively uncommon case.
To be clear though, I'm not like, wedded to that particular proposal or anything. It's just one idea, and I totally get that these are tricky issues and TLS 1.3's new session ticket design forces implementors to make awkward trade-offs. But I think for something as fundamental and widely-used as openssl, it's worth getting the edge cases right.
At least one concern I have is what happens now in the case where the server never sends data, the client tries to do a bidirectional shutdown (that it didn't used to do) and the server just closes the connection instead. The client should get an error in that case. I think the above is also broken behaviour. The server should not assume that the client doesn't want the bidirectional shutdown. I think the only correct behaviour is to send the close notify alert and not wait for the other side to send it back before closing the TCP connection. And that works only in the case where it's clear the other side is never going to send data back. I don't know if this actually happens now, but I assume it does. Which means that fixing the client to support resumption with TLS 1.3 might instead result in the client getting an error.
Is this the scenario you're worried about?
The server here has violated the spec – you aren't supposed to just close the socket like that. So the client TLS library should report an unclean shutdown. But... beyond that, I don't think anything changes? The server did receive all the data, and the client will safely read any session tickets before it sees a clean TCP-level shutdown and reports an unclean TLS-level shutdown.
I might not be understanding what you're worried about.
Where I feel like I'm missing something is, I don't understand how this scenario relates to session tickets :-). It's generally true that peers shouldn't violate the spec and doing so has some consequences, but the consequences don't seem to be any different here than they would be in any other case?
TLS 1.3 will force a client that only sends to change behaviour in case it wants session resumption. And I'm just worried that that might break something. Note that your safe TLS example with TLS 1.2 also produces the unsafe TCP pattern: the server will try to send back a close notify. TLS 1.3 just sends more.
I think I already explained why this isn't true up above? I'll try explaining again in the hopes that I was just unclear, but if there's a deeper disagreement lmk.
The "unsafe TCP pattern" requires a very specific combination of things to all happen together:

1. Peer B sends some data to peer A.
2. Peer A closes its socket without ever reading that data.
3. Because of the unread data, peer A's kernel sends a RST to peer B.
4. When the RST arrives, peer B still has unread data from A sitting in its kernel receive buffer – which the RST then destroys.
With TLS 1.2, I think you're talking about the case where the client (peer A) sends a unidirectional close_notify and then closes its socket, and then the server (peer B) does what the spec says it should do: after it receives the client's close_notify, it sends back its own close_notify. This means that conditions 1, 2, and 3 are all satisfied... but condition 4 is not. If the server has received the client's close_notify, then by definition the server has already read all the client's data, and can't possibly lose any of it, no matter whether it sends its own close_notify back or not.
You're right that in the example you give as safe, the server will always properly process the client's close notify in TLS 1.2. But at the same time, the TCP connection is not closed properly, because the server will attempt to write a close notify back, and there will be some error. I think this is an unsafe pattern that just happens to work. Using the same pattern in TLS 1.3, the server will also get an error like in the TLS 1.2 case, but it might get it earlier – it might now get that error before it has processed the application data. So I think that even when the RFC suggests it's possible, it was never a good idea to do it. Since there might be applications that do this, something needs to change. I'm currently not sure if that something is openssl, or the client application. And I currently don't know of any real affected client other than some test suite.
Yeah, I guess it's again a matter of definitions... I don't know if anyone intentionally designed this behavior, or if it's an accidental outcome of several different features in TCP and its common implementations. But either way, the behavior has been standard and documented for decades, and common protocols like websockets are designed around it.
Yeah, I don't know of any real affected applications either, and I totally sympathize with your reluctance to change things based on a weird corner case like this. The reason I think this is important, though, is the way it breaks abstractions.
I'm not writing an application; I'm writing a generic networking library. I guess most applications that use TLS don't use the openssl C API directly, but go through some layer like Trio (my library), or the node.js tls module, or similar. With TLS 1.2, it was certainly difficult for networking library authors to understand all the details of the openssl API and to abstract away the differences between TCP vs. TLS-over-TCP, but it was possible, and that was a cost paid once by the networking libraries, instead of over and over by every application developer.
Like you, I have no idea whether any of my users are writing applications that would be broken by the new session ticket behavior, so I have to assume the worst. If I can't abstract away the differences, then I have to take on the burden of figuring out exactly what openssl does and doesn't guarantee, how that translates into the higher-level API that my library provides, and then teach all my users about these rare corner cases, just in case they ever write an app that would run into them.
There's been a ton of work in recent years to make TLS more universal and accessible. Right now in Trio it's extremely simple... you write `await trio.open_ssl_over_tcp_stream(host, port)` and get back a stream object that behaves just like the plain TCP stream you'd get from `await trio.open_tcp_stream(host, port)`.
But this all relies on being able to abstract away the difference between TCP and TLS. If that goes away, then it doesn't really matter that it's only in an obscure edge condition; we still risk losing all these features.