AFAICT openssl's strategy for handling TLS 1.3 session tickets makes it impossible to reliably implement communication patterns where the server never sends application-level data #7948

@njsmith

I maintain a Python networking library called Trio, and I've been struggling to get it working with OpenSSL v1.1.1/TLS 1.3. We use openssl with memory BIOs and have an extensive test suite that passes with earlier openssl versions, but hits a number of problems after upgrading to v1.1.1. The main issue seems to be the session tickets that openssl sends after the handshake in server mode, and how they affect connections where the client never calls SSL_read.

Due diligence: I found these previous issues/PRs that are all about the exact same issue that I'm facing, but having read them carefully I still can't figure out how to make this work: #6342, #6904, #6944. Also, for reference, this is the Trio bug: python-trio/trio#819

TCP (as I understand it)

Let's ignore TLS for the moment and just talk about TCP. Suppose we have a client that connects, sends some data, and then disconnects, without ever calling recv. In pseudo-code:

# Socket client
sock = connect(...)
while ...:
    sock.send(...)
sock.close()

# Socket server (safe)
sock = accept()
while ...:
    sock.recv(...)
    if eof:
        sock.close()
        break

This isn't a terribly common pattern, but it's perfectly legal and reliable. Call this the safe pattern.
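
A minimal runnable sketch of the safe pattern, assuming Python sockets over loopback. The helper name `run_safe_pattern` and the payload sizes are made up for illustration:

```python
import socket
import threading


def run_safe_pattern(payload: bytes, chunks: int) -> int:
    """Client only sends, server only receives. Returns the bytes the server saw."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: the OS picks a free port
    port = listener.getsockname()[1]
    received = 0

    def server() -> None:
        nonlocal received
        conn, _ = listener.accept()
        with conn:
            while True:
                data = conn.recv(65536)
                if not data:  # EOF: the client's FIN is queued behind all of its data
                    break
                received += len(data)

    t = threading.Thread(target=server)
    t.start()

    client = socket.create_connection(("127.0.0.1", port))
    for _ in range(chunks):
        client.sendall(payload)
    client.close()  # nothing unread on the client side, so this is an orderly close

    t.join()
    listener.close()
    return received
```

Because the server never sends anything, the client has no unread inbound data at close time, and the server should see every byte before EOF.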

But, TCP has a gotcha: suppose we have the server send a bit of data, and change nothing else. In particular the client still never calls recv:

# Socket server (unsafe)
sock = accept()
sock.send(<one byte of data>)   # <--- This is the only line that's different
while ...:
    sock.recv(...)
    if eof:
        sock.close()
        break

Now, this is a little funny looking, because the server is sending some data that the client will never read. But, whatever, that shouldn't affect the data being sent from client→server, right?

Well, that would be logical, but it's wrong! In the unsafe pattern, an arbitrary amount of the data sent by the client can be lost, even though the client code didn't change at all.

This happens because of arcane details of how TCP works: in the safe pattern, when the client calls close, the client's kernel sends a FIN packet, and the server's kernel queues that up behind all the other data in the server's receive buffer, and everything proceeds in an orderly fashion. But in the unsafe pattern, the client has incoming data. And if there's incoming data before or after a close, then RFC 1122 says:

If such a host issues a CLOSE call while received data is still pending in TCP, or if new data is received after CLOSE is called, its TCP SHOULD send a RST to show that data was lost.

So in this case, the client's kernel might send an RST, instead of or in addition to the FIN. And then when the server's kernel sees an RST packet, it discards all buffered data. So, if there's any data that the client sent that's still sitting in the server kernel buffers, it disappears forever.
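
A rough sketch for poking at the unsafe pattern on your own machine, assuming Python sockets over loopback. The name `run_unsafe_pattern`, the 16 KiB payload, and the sleep are all illustrative choices. Whether a reset shows up, and how much data survives, depends on the OS and on timing, so the sketch reports what happened rather than promising a particular result:

```python
import select
import socket
import threading
import time

TOTAL = 16 * 1024  # small enough to fit in kernel buffers while the server is not reading


def run_unsafe_pattern():
    """Returns (bytes_received_by_server, server_saw_connection_error)."""
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    client_closed = threading.Event()
    result = {"received": 0, "reset": False}

    def server() -> None:
        conn, _ = listener.accept()
        with conn:
            conn.sendall(b"!")    # the one extra byte that the client never reads
            client_closed.wait()  # let the client's data sit in our receive buffer
            time.sleep(0.1)       # give any RST a moment to arrive
            try:
                while True:
                    data = conn.recv(65536)
                    if not data:
                        break
                    result["received"] += len(data)
            except ConnectionError:  # e.g. ECONNRESET after the peer's RST
                result["reset"] = True

    t = threading.Thread(target=server)
    t.start()

    client = socket.create_connection(("127.0.0.1", port))
    select.select([client], [], [], 5.0)  # wait until the byte is pending, without reading it
    client.sendall(b"x" * TOTAL)
    client.close()  # unread inbound data: the client's kernel may send an RST here
    client_closed.set()

    t.join()
    listener.close()
    return result["received"], result["reset"]
```

Calling `run_unsafe_pattern()` and comparing the first element of the result against `TOTAL` shows whether anything was lost on a given platform.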

What does this have to do with TLS?

Generally speaking, it should be possible to take any application that uses raw TCP, and switch it to use TLS-over-TCP instead, right? (With the notable exception of half-closed connections, but never mind.) So let's port our client/server to use TLS:

# TLS-over-TCP client
tlssock = connect(...)
tlssock.do_handshake()
while ...:
    tlssock.send(...)
tlssock.send_close_notify()  # Often skipped in practice, but let's be standards-compliant
tlssock.close_tcp()

# TLS-over-TCP server ("safe")
tlssock = accept()
tlssock.do_handshake()
while ...:
    tlssock.recv(...)
    if eof:
        tlssock.send_close_notify()
        tlssock.close_tcp()
        break

Now here's the issue: with TLS 1.2 and earlier, if we follow the "safe pattern" at the application layer, like this, then openssl will ultimately translate that into the "safe pattern" at the TCP layer. But with TLS 1.3, openssl's habit of sending session tickets after the handshake means that this exact same code now produces the unsafe pattern at the TCP layer. The server→client session tickets could cause the client→server application data to be lost.

#6944 changes how openssl reacts to getting notified of a client close while sending session tickets, so that the server can keep calling recv. But that doesn't help if the kernel has already discarded the buffer that recv is trying to read out of. The problem here is all inside the TCP stacks; there's nothing openssl can do about it, except avoid sending the session tickets in the first place.

There's also a secondary problem, but it's more theoretical: if the server→client buffer is small enough, then this code could deadlock at the handshake. The server's call to SSL_do_handshake won't return until the client calls SSL_read, but the client is calling SSL_write, which will eventually block until the server calls SSL_read, but the server can't because it's waiting for the client to call SSL_read... I'm not sure if there are any realistic cases where people use buffers that are small enough to trigger this though. (We have some torture tests with small buffers to flush out problems like this, which of course did catch it.)

What to do?

Of course we can disable session tickets, but this is difficult in our case because Python's openssl bindings don't expose SSL_set_num_tickets and SSL_CTX_set_session_ticket_cb. Also, as a generic networking library, we wouldn't want to disable session tickets in general. But we also really don't want to have to explain to our users that "enabling TLS is just a matter of switching from a SocketStream to an SSLStream; the APIs are identical, except that if you happen to know that your client might never read data, then on the server side you have to call this special API before the handshake". That's a super leaky abstraction.
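
Newer Python releases (3.8 and later, as far as I know) do expose the ticket count as the `ssl.SSLContext.num_tickets` attribute. A minimal sketch of the workaround on such a version; `make_server_context` is a made-up helper name, and certificate loading is left out:

```python
import ssl


def make_server_context(send_tickets: bool) -> ssl.SSLContext:
    """Build a server-side context, optionally with TLS 1.3 session tickets turned off."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if not send_tickets:
        # Only affects TLS 1.3; must be set on a server context before the handshake.
        ctx.num_tickets = 0
    # A real server would also call ctx.load_cert_chain(...) here.
    return ctx
```

This is exactly the "special API before the handshake" problem: the caller has to know in advance that the peer may never read.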

We could do full bidirectional-shutdown, as suggested here. This seems problematic, though – the RFCs explicitly say that bidi shutdown is never required, bidi shutdown has always been useless before, and I've never heard of any existing software that does it. Trio doesn't have any way to force its peers to use it. If OpenSSL is going to say that all TLS 1.3 implementations that want to interoperate with OpenSSL have to switch to bidi shutdown, then that seems like a huge ask. Also, isn't TLS 1.3 supposed to reduce the number of round-trips we make?

The best idea I have is:

  • don't send tickets automatically after the TLS 1.3 handshake
  • automatically send tickets the first time the server calls SSL_write
  • also provide an explicit SSL_write_tickets function to send tickets immediately
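
To make the proposal concrete, here is a toy model of the policy in Python. It is not OpenSSL code and none of these names exist in any real API; `DeferredTicketServer` and the record labels are purely hypothetical stand-ins for what the server would put on the wire:

```python
class DeferredTicketServer:
    """Toy model of the proposed ticket policy (hypothetical, not an OpenSSL API)."""

    def __init__(self, num_tickets=2):
        self.num_tickets = num_tickets
        self.tickets_sent = False
        self.outbound = []  # labels for records queued toward the client

    def do_handshake(self):
        # The handshake completes without queuing any session tickets.
        self.outbound.append("HandshakeFlight")

    def write_tickets(self):
        # Explicit opt-in: send the tickets right now, at most once.
        if not self.tickets_sent:
            self.outbound.extend(["NewSessionTicket"] * self.num_tickets)
            self.tickets_sent = True

    def write(self, data):
        # The first application write piggybacks the tickets in front of the data.
        self.write_tickets()
        self.outbound.append(f"ApplicationData({len(data)})")
```

In this model a server that never writes never sends anything after the handshake, so the receive-only server stays in the safe pattern at the TCP layer, while any server that does write still hands out tickets.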

Labels

  • branch: 1.1.1 (Merge to OpenSSL_1_1_1-stable branch)
  • branch: master (Merge to master branch)
  • severity: important (Important bugs affecting a released version)
  • triaged: bug (The issue/pr is/fixes a bug)
