AFAICT openssl's strategy for handling TLS 1.3 session tickets makes it impossible to reliably implement communication patterns where the server never sends application-level data #7948
AFAICT openssl's strategy for handling TLS 1.3 session tickets makes it impossible to reliably implement communication patterns where the server never sends application-level data #7948
I maintain a Python networking library called Trio, and I've been struggling to get it working with Openssl v1.1.1/TLS 1.3. We use openssl with memory BIOs and have an extensive test suite that passes with earlier openssl versions, but hits a number of problems after upgrading to v1.1.1. The main issue seems to be the session tickets that openssl sends after the handshake in server mode, and how they affect connections where the client never calls
Due diligence: I found these previous issues/PRs that are all about the exact same issue that I'm facing, but having read them carefully I still can't figure out how to make this work: #6342, #6904, #6944. Also, for reference, this is the Trio bug: python-trio/trio#819
TCP (as I understand it)
Let's ignore TLS for the moment and just talk about TCP. Suppose we have a client that connects, sends some data, and then disconnects, without ever calling
# Socket client sock = connect(...) while ...: sock.send(...) sock.close() # Socket server (safe) sock = accept() while ...: sock.recv(...) if eof: sock.close() break
This isn't a terribly common pattern, but it's perfectly legal and reliable. Call this the safe pattern.
But, TCP has a gotcha: suppose we have the server send a bit of data, and change nothing else. In particular the client still never calls
# Socket server (unsafe) sock = accept() sock.send(<one byte of data>) # <--- This is the only line that's different while ...: sock.recv(...) if eof: sock.close() break
Now, this is a little funny looking, because the server is sending some data that the client will never read. But, whatever, that shouldn't affect anything, right? This shouldn't affect the data being sent from the client→server, right?
Well, that would be logical, but it's wrong! In the unsafe pattern, an arbitrary amount of the data sent by the client can be lost, even though the client code didn't change at all.
This happens because of arcane details of how TCP works: in the safe pattern, when the client calls
So in this case, the client's kernel might send a RST, instead of or in addition to the FIN. And then when the server's kernel sees an RST packet, it discards all buffered data. So, if there's any data that the client sent that's still sitting in the server kernel buffers, it disappears forever.
What does this have to do with TLS?
Generally speaking, it should be possible to take any application that uses raw TCP, and switch it to use TLS-over-TCP instead, right? (With the notable exception of half-closed connections, but never mind.) So let's port our client/server to use TLS:
# TLS-over-TCP client tlssock = connect(...) tlssock.do_handshake() while ...: tlssock.send(...) tlssock.send_close_notify() # Often skipped in practice, but let's be standards-compliant tlssock.close_tcp() # TLS-over-TCP server ("safe") tlssock = accept() tlssock.do_handshake() while ...: tlssock.recv(...) if eof: tlssock.send_close_notify() tlssock.close_tcp() break
Now here's the issue: with TLS 1.2 and earlier, if we follow the "safe pattern" at the application layer, like this, then openssl will ultimately translate that into the "safe pattern" at the TCP layer. But with TLS 1.3, openssl's habit of sending session tickets after the handshake means that this exact same code now produces the unsafe pattern at the TCP layer. The server→client session tickets could cause the client→server application data to be lost.
#6944 changes how openssl reacts to getting notified of a client close while sending session tickets, so that the server can keep calling
There's also a secondary problem, but it's more theoretical: if the server→client buffer is small enough, then this code could deadlock at the handshake – the server's call to
What to do?
Of course we can disable session tickets, but this is difficult in our case because Python's openssl bindings don't expose
We could do full bidirectional-shutdown, as suggested here. This seems problematic, though – the RFCs explicitly say that bidi shutdown is never required, bidi shutdown has always been useless before, and I've never heard of any existing software that does it. Trio doesn't have any way to force its peers to use it. If OpenSSL is going to say that all TLS 1.3 implementations that want to interoperate with OpenSSL have to switch to bidi shutdown, then that seems like a huge ask. Also, isn't TLS 1.3 supposed to reduce the number of round-trips we make?
The best idea I have is:
The text was updated successfully, but these errors were encountered:
Have you read the shutdown documentation after #7188 has been merged?
Did you read the wiki about TLS 1.3, in particular the section about session?
Yes. But generally speaking, any encapsulation of a protocol by an other protocol causes problems. Generally, it's hard to abstract things away. And you point it out yourself above that there is an unsafe pattern in TCP.
You really can't skip sending the close notify. If you skip that you're vulnerable to a truncation attack. The other side should react to that with a protocol error.
The standards compliant case would be to receive the close notify before closing the tcp socket. The other side could send back a close notify alert, closing the connection without waiting for it can cause your unsafe TCP pattern. So this really is the unsafe TLS example.
That you need to close the TLS layer before you close the TCP layer is one of the many ways in which TLS leaks things to the application.
I'm sure that you can cause any TCP connection to deadlock.
The only case where the bidirectional shutdown is required is when you want to resume the session. I currently don't see when only a one way shutdown shouldn't work, as long as you disable tickets.
Any TLS 1.3 implementation that wants to support resumption really is going to have the same problem. And you really want to support resumption.
So for clients that only send data and never receive any, TLS 1.3 really forces you to either disable resumption or do a bidirectional shutdown.
Only at the start of the connection, to get the data faster.
You really also want a client that only writes to support resumption. In that case if SSL_write() is never called, you can never resume the session.
It's true that this kind of encapsulation is difficult, but previous versions of openssl managed it, and current openssl manages it except when using TLS 1.3 with session tickets.
For sure. But it used to be that if you used the "safe pattern" at the application level, that would map onto the "safe pattern" at the TCP level, and vice-versa – that's a non-leaky abstraction. The regression is that now if you use the safe pattern at the application level, openssl will silently convert that into the unsafe pattern at the TCP level. That's a leaky abstraction.
Yeah, Trio supports two modes: by default it does unidirectional close_notify and expects unidirectional close_notify. Or, if you set
Actually, no! It's not at all obvious, but AFAICT it really is true that safe patterns used to map to safe patterns and vice-versa, even if you have a mix of peers using unidirectional and bidi shutdown.
In your example, say that our program implements bidi shutdown, and we're talking to a peer that does unidirectional shutdown. It sends us a close_notify, and we send one back. You're right, our close_notify may provoke them to send a RST... but the fact that we've already received their close_notify means that we've already emptied our receive buffer, so their RST can't cause any harm. Disaster is narrowly averted.
Another tricky case is when our peer performs a unidirectional shutdown while we're in the middle of sending data. In this case, the data we're sending might provoke them to send a RST back, and that might cause their close_notify to be lost, and now we don't know where the end of their data is. But, this still doesn't break the abstraction, because in this case plain TCP suffers from exactly the same problem: if they close their socket when we're in the middle of sending data, then their FIN may be lost, and we don't know where the end of their data is. So the semantics are annoying, but they're predictable and consistent regardless of whether you're using TLS.
Not sure what you mean. Empirically, this is something that we abstracted away and it used to work fine. Trio's
??? "Writing reliable network protocols is impossible, so there's no point in trying" is not the attitude I was hoping to hear from an openssl dev :-(.
Empirically, we have plenty of protocol implementations that survive our deadlock torture test just fine, including previous versions of openssl.
And actually, now that I think about it, this deadlock thing is actually more of a practical problem than I realized, because it means that with openssl 1.1.1 we can no longer use the deadlock torture test to test protocols that run on top of TLS :-(
Right, but here I'm talking about the case where you don't disable tickets. If tickets aren't disabled, then @mattcaswell pointed out in #6904 that the example client/server can be made to work again by turning on bidi shutdown on the client, so it's a possible workaround.
My understanding is that bidi shutdown on the client is sufficient to prevent data loss in cases like my example client/server, but it isn't sufficient to support resumption, because openssl doesn't process session tickets that it receives after sending a close_notify.
In any case, sure, it would be nice if our example client/server could support resumption. It's too bad that with TLS 1.3, they can't, without further changes. But that doesn't mean it's OK to stop transmitting their data reliably! Our client/server here are only relying on the traditional guarantees made by TCP and all previous versions of openssl.
In your quote you dropped my third bullet point, which partially addresses this?
Besides... current openssl has exactly the same flaw in practice: if you have a protocol where the server never calls
The safe pattern for both TCP and TLS is to make sure that both ends agree all data has been sent before closing the connection. TLS has a mechanism for that, and it's to send the close notify in both directions.
I don't know enough about http(s), but is it always clear from the protocol when it's finished sending the data? Is a truncation attack possible?
In https tests I've done I've seen any combination of close notify that's possible, including no and bidirectional.
TLS 1.3 makes it very explicit that sending a close notify only closes the write direction. The other peer is still allowed to send application data back. If it still wants to send application data, and not yet the close notify, you do have a problem. In TCP you can also do a shutdown(SHUT_WR), the connection isn't fully closed.
OpenSSL has supported that for a very long time, even when the older TLS standards didn't say that was supported.
That's not at all what I'm saying. I'm saying that if both sides of a connection are doing the wrong thing, it's very easy to cause a deadlock. If you think there is a bug in OpenSSL that makes it deadlock while talking to itself, file a bug. Example code would be nice.
We've fixed processing session tickets after sending close notify in #7114.
Oh, you want a server that never calls SSL_write() to call some other function instead if it wants to be able to resume? That suddenly changes servers not to support resumption any more, and if the server changes to do it, the client still has the same problem that it needs to read the session.
Oh wow, I'd missed this, and it's super exciting, thank you! That fixes the one place where it used to be impossible to abstract away the difference between TLS and other byte-stream transports. That's really my main concern here... I want to be able to write protocol code in a generic way, that works the same on TCP, TLS, or whatever other transport makes sense. The openssl 1.1.1 session ticket handling breaks that guarantee.
In the past it's always been safe to take any program that worked correctly using plain TCP, and switch it to use TLS with unidirectional close_notify. I guess you can argue that unidirectional close_notify was always "wrong" in some vague existential sense, but it was explicitly allowed by the RFCs and it worked in practice, which seems more important to me.
This is the bug you are asking me to file :-). My test suite literally started deadlocking when I upgraded to 1.1.1. It's a test where the client does
The deadlock happens because the server's
I could make a standalone reproducer (in Python, say), if that would be helpful, but the problem is very straightforward. As long as the server's
Oh, excellent, thanks for the update.
I think this is confusing and we need to break it down by cases :-)
Case 1: client/server where the server sends application data to the client:
Case 2: client/server where the server never sends application data to client, with non-bidirectional close:
Case 3: client/server where the server never sends application data to the client, with bidi close:
So both strategies have some downsides compared to the old TLS 1.2 way of doing things, but make slightly different trade-offs about who suffers if they don't update their programs, and what kind of suffering they experience. It seems like openssl is saying that it's OK if some "case 2" users lose data, because in exchange, some "case 3" users will get slightly lower latency.
My impression is that for existing apps, "case 1" is far more common than "case 2", and "case 2" is far more common than "case 3".
I don't think it's a good trade-off to sacrifice correctness in the relatively common case, in order to speed up the relatively uncommon case.
To be clear though, I'm not like, wedded to that particular proposal or anything. It's just one idea, and I totally get that these are tricky issues and TLS 1.3's new session ticket design forces implementors to make awkward trade-offs. But I think for something as fundamental and widely-used as openssl, it's worth getting the edge cases right.
At least one concern I have is what happens now in the case where the server never sends data, the client tries to do a bidirectional shutdown (that it didn't used to do) and the server just closes the connection instead. The client should get an error in that case. I think the above is also broken behaviour. The server should not assume that the client doesn't want the bidirectional shutdown. I think the only correct behaviour is to send the close notify alert and not wait for the other side to send it back before closing the TCP connection. And that works only in the case where it's clear the other side is never going to send data back. I don't know if this actually happens now, but I assume it does. Which means that fixing the client to support resumption with TLS 1.3 might instead result in the client getting an error.
Is this the scenario you're worried about?
The server here has violated the spec – you aren't supposed to just close the socket like that. So the client TLS library should report an unclean shutdown. But... beyond that, I don't think anything changes? The server did receive all the data, and the client will safely read any session tickets before it sees a clean TCP-level shutdown and reports an unclean TLS-level shutdown.
I might not be understanding what you're worried about.
Where I feel like I'm missing something is, I don't understand how this scenario relates to session tickets :-). It's generally true that peers shouldn't violate the spec and doing so has some consequences, but the consequences don't seem to be any different here than they would be in any other case?
TLS 1.3 will force a client that only sends to change behaviour in case it wants session resumption. And I'm just worrried that that might break something. Note that you safe example of TLS with TLS 1.2 also produces the unsafe TCP pattern, the server will try to send back a close notify. TLS 1.3 just sends more.
I think I already explained why this isn't true up above? I'll try explaining again in the hopes that I was just unclear, but if there's a deeper disagreement lmk.
The "unsafe TCP pattern" requires a very specific combination of things all happen together:
With TLS 1.2, I think you're talking about the case where the client (peer A) sends a unidirectional close_notify and then closes its socket, and then the server (peer B) does what the spec says it should do: after it receives the client's close_notify, it sends back its own close_notify. This means that conditions 1, 2, and 3 are all satisfied... but condition 4 is not. If the server has received the client's close_notify, then by definition the server has already read all the client's data, and can't possibly lose any of it, no matter whether it sends its own close_notify back or not.
You're right that in the example you give as safe, the server will always properly process the close notify of the client in TLS 1.2. But at the same time, the TCP connection is not closed properly, because the server will attempt to write a close notify back, and there will be some error. I think this is an unsafe pattern that just happens to work. Using the same pattern in TLS 1.3, the server will also get an error like in the case of TLS 1.2, but it might get it earlier than in the case of TLS 1.2. It might now get that error before it processed the application data. So I think that even when the RFC suggest it's possible, that it was never a good idea to do it. Since there might be applications that do this, something needs to change. I'm currently not sure if that something is openssl, or the client application. And I currently don't know of any real affected client other than some test suite.
Yeah, I guess it's again a matter of definitions... I don't know if anyone intentionally designed this behavior, or if it's an accidental outcome of several different features in TCP and its common implementations. But either way, the behavior has been standard and documented for decades, and common protocols like websockets are designed around it.
Yeah, I don't know of any real affected applications either, and I totally sympathize with your reluctance to change things based on a weird corner case like this. The reason I think this is important, though, is the way it breaks abstractions.
I'm not writing an application; I'm writing a generic networking library. I guess most applications that use TLS don't use the openssl C API directly, but go through some layer like Trio (my library), or the node.js tls module, or similar. With TLS 1.2, it was certainly difficult for networking library authors to understand all the details of the openssl API and to abstract away the differences between TCP vs. TLS-over-TCP, but it was possible, and that was a cost paid once by the networking libraries, instead of over and over by every applications developer.
Like you, I have no idea whether any of my users are writing applications that would be broken by the new session ticket behavior, so I have to assume the worst. If I can't abstract away the differences, then I have to take on the burden of figuring out exactly what openssl does and doesn't guarantee, how that translates into the higher-level API that my library provides, and then teach all my users about these rare corner cases, just case they happen to ever write an app that would run into them.
There's been a ton of work in recent years to make TLS more universal and accessible. Right now in Trio it's extremely simple... you write
But this all relies on being able to abstract away the difference between TCP and TLS. If that goes away, then it doesn't really matter that it's only in an obscure edge condition; we still risk losing all these features.
@rdp Disabling tickets is quite tricky; you need to use both
We did find and deploy a workaround for this that I think makes TLS 1.3 work correctly on all openssl versions, while still supporting session tickets. Instead of letting openssl do I/O directly, we use MemoryBIO, and manually copy data to/from the socket. After
This is convoluted and gross, but I believe it is a correct and full workaround. And unless the openssl devs decide to fix this bug and backport it to v1.1.1, I guess this is the only way to get correct TLS 1.3 support from openssl on RHEL 8.
Analysis of why this workaround is correct: python-trio/trio#819 (comment)
Example implementation: python-trio/trio#1171
I sense some dissatisfaction with how this issue has been addressed up to now. @njsmith and @kroeckx have different views about it and I am not in the position to judge who’s arguments are more conclusive. Maybe some other member of @openssl/committers can chime in and help resolve the ‘deadlock’ in this discussion?
I'm not sure why we stopped discussing this. So I understand that what you is to have the same effect as only writing the session data on the first SSL_write() or SSL_shutdown(). I don't think doing it at SSL_shutdown() was ever mentioned, and doing it at the first SSL_write() or SSL_shutdown() might actually be better that what we do now.
In summary, the problem is that the client might have send all it's data, send a close_notify, and closed the connection before the server tries send the session ticket. The server tries to write the tickets before it reads the data from the client. When the server writes the tickets after the client closed the connection, the TCP connection will be reset, and the server can't actually read what the client send. The solution is not to write any ticket as part of the handshake, but at a later time when we would send something anyway. We might still get reset, but we would have received all the data. It also doesn't change anything for a client that does a bidirectional shutdown.
This can also come up in non-closing scenarios. Consider an HTTP/1.1 client. HTTP/1.1 never reads and writes at the same time. Typically, the client sends the entire HTTP request and only tries to read once it's finished with that. This means OpenSSL will deadlock if neither the ticket (which scales with client cert size) nor the HTTP request (which is not under OpenSSL's control) fit in transport receive buffers. BoringSSL defers the ticket write to the first
If you get reset, you won't (reliably) receive the data. TCP treats resets as error conditions and will drop buffers, truncate sequence numbers, stop trying to retransmit, etc., when it happens. The first half of the bug report discusses this. We ran into a similar situation with trying to make client certificate alerts (which are now post-handshake from the client's perspective in TLS 1.3) reliably delivered, though I think that one may be unfixable.
However, deferring the ticket write to the first
This does mean the reset is avoided by saying a server which never writes never sends tickets, but I don't see another option under this I/O pattern. Also protocols without server writes likely have no client reads, so the client won't pick up the ticket anyway.
In general, the TLS library should avoid writing "out of turn" or it will confuse application-level handling of transport backpressure and resets. (See also mentions of surprising I/O patterns in #8677. I think OpenSSL only defers if the write buffer is busy. This may still cause TCP resets in this I/O pattern if the peer's TLS library sends KeyUpdates transparently. BoringSSL unconditionally defers to
Oh, sorry, I just realized by "send something anyway", you might have meant the
On the topic of
It's true that ignoring
However, there is no interop requirement to not send
HTTP/2 frames are self-delimited and streams have an explicit close, so there isn't a truncation problem there. Transport EOF should not be considered a clean stream closure, whether or not it's authenticated.
HTTP/1.1 is a complex mess of a text protocol and has potential truncation problems. First, the header block is delimited by two newlines. It is important that an HTTPS parser either enforce
Second, there are three ways to delimit message bodies:
(1) and (2) are self-delimited. You don't need
(3) has truncation problems and needs
Text protocols are the worst.
On Sun, Dec 08, 2019 at 06:55:38AM -0800, David Benjamin wrote: > > The solution is not to write any ticket as part of the handshake, but at a later time when we would send something anyway. We might still get reset, but we would have received all the data. > > If you get reset, you won't (reliably) receive the data [...] Oh, sorry, I just realized by "send something anyway", you might have meant the `close_notify`. I think that is fine as long as the server send `close_notify` until after it's done reading whatever it would have tried to read. Then the reset won't drop anything because it's all been read. But this is also super weird. IMO TCP is being overaggressive in classifying out-of-turn writes as an error condition. Or perhaps it needs a notion of droppable "FIN data". But it's a bit late to fix that. :-/
So don't call SSL_shutdown() just after accept, but wait until the client has send you the close notify before you call it. @mattcaswell: Ca you implement that logic that we don't send the tickets until we have an other reason to send something?
This is a workaround for an OpenSSL TLS 1.3 bug that results in data loss when one-way protocols are used and a connection is closed by the client right after sending data. "TLS 1.3 session tickets makes it impossible to reliably implement communication patterns where the server never sends application-level data." - openssl/openssl#10880 - openssl/openssl#7948 Signed-off-by: László Várady <email@example.com>
I can do. Should that be something we do by default or should it be an option? Making it the default would be quite a significant change of behaviour and I suspect might break other applications that now assume the tickets will be sent immediately. Also would we backport whatever we implement to 1.1.1?
Having just pushed #11416 for review I can say that doing this on top of that would be pretty elegant and result in some code reduction if non-optional. Alas, I'm inclined to agree with the concerns that this would break things and is not appropriate for 1.1.1, and maybe not even for 3.0
For what it's worth, this issue is not just hypothetical. The scenario described by #7948 (comment) is exactly what happens during an FTPS upload when the data connection uses TLSv1.3, where the server only ever reads, and the client only ever writes, resulting in an
The workaround I used was to accept TLSv1.3 session tickets for the data connection TLS handshake, but not to renew them, so that the server does not do any TCP writes on the data connection for uploads using TLSv1.3.
I don't think anyone was claiming it was purely theoretical, just that the people with the expertise to write the patch have a lot of conflicting demands on their time.
This is a workaround for an OpenSSL TLS 1.3 bug that results in data loss when one-way protocols are used and a connection is closed by the client right after sending data. "TLS 1.3 session tickets makes it impossible to reliably implement communication patterns where the server never sends application-level data." - openssl/openssl#10880 - openssl/openssl#7948 Backported from OSE: 28c8013ca35be06387cf692c9ba1baee6af33511 Signed-off-by: László Várady <firstname.lastname@example.org> Signed-off-by: Attila Szakacs <email@example.com>