QUIC I/O Architecture Design Document #19770
Conversation
> changes needed on different platforms in the common case of an OS network
> socket. (Use of an `int` here is strictly incorrect for Windows; however, this
> style of usage is prevalent in the OpenSSL codebase, so for consistency we
> continue the pattern here.)
Since it's already a union, I see no reason to just have the int.
Is the idea to `#ifdef WIN32` and use `SOCKET` in that case? It's possible, but use of `int` for Win32 `SOCKET`s is so prolific in the codebase I'm not sure there's really a benefit.
Updated: tweaked wording, added diagram. (Force-pushed 547b065 to 740c31d.)
> - The fact that a socket is writeable does not necessarily mean that a datagram
>   of the size we wish to send is writeable, so a send(3) call could block
>   anyway.
Why does this matter? If we are blocked on write we can't send anything that might occur as the result of a timeout anyway. I would expect write blocks to be much shorter and more transient than read blocks (read blocks are dependent on the remote peer to supply more data, while write blocks are only governed by how quickly the OS can empty its write buffers to the network). It would mean that we can't process newly incoming "read" data. But that is the case for application controlled TLS event loops and so far hasn't been a real problem (probably because of the transient nature of write blocks).
Writes aren't the only possible byproduct of a timeout.
> Writes aren't the only possible byproduct of a timeout.
What other byproducts are there - and what is the impact of them being delayed by a small amount?
When we get a timeout it could cause a write, but it could also cause connection teardown (idle timeout), meaning application SSL API calls need to return, etc.
I recall seeing research suggesting that the latency of the path to the network is actually quite a relevant factor for optimising QUIC performance. Unfortunately I can't seem to place it currently. It's true that delays on an OS's kernel write queues will probably be short enough that it's not going to be a severe detriment to performance. I am quite concerned however by the prospect of walling ourselves in here and preventing ourselves from tuning this properly in the future.
> - This usage pattern precludes multithreaded use barring some locking scheme
>   due to the possibility of other threads racing between the call to select(3)
>   and the subsequent I/O call. This undermines our intentions to support
>   multi-threaded network I/O on the backend.
Multi-threaded network IO is a nice-to-have (and not mentioned in the original requirements doc). I think it will be mostly useful for high-performance server apps which are likely to be using non-blocking IO anyway. This doesn't seem like a reason in itself to exclude blocking sockets as an option.
But the API-facing architectural decisions we take now influence what we can do in the future. I'm rather confused about this, since I was under the impression it was the entire reason we spent time developing `BIO_sendmmsg`.
I cannot see a use case for multiple threads on the same socket in combination with blocking I/O. I think that if multiple threads are used with blocking I/O, you have one thread per socket, and so there is no race.
We can't have one thread per socket if we do blocking I/O on the network side because we need to do reads and writes (and timeouts) simultaneously, hence @mattcaswell proposing we have multiple background threads, one making BIO_read calls to an application's (potentially custom) BIO and one making BIO_write calls. But our existing BIO API contract doesn't provide for concurrent use of BIO_read and BIO_write on the same BIO (not even our own BIO_s_datagram is safe under these conditions), so an application would have to make major changes to make its custom BIO concurrent to support this. The scale of those changes imposed upon an application seems to me to be a lot more major than just mandating non-blocking operation (as this document explains).
But don't we require that any custom BIO implement support for BIO_sendmmsg() and BIO_recvmmsg()? Those APIs are defined to be thread-safe, so this restriction applies even in the non-blocking scenario.
> have a BIO interface which provides for select(3)-like functionality or which
> can implement the required semantics above. Therefore, trying to implement QUIC
> on top of blocking I/O in this way would require violating the BIO abstraction
> layer, and would not work with custom BIOs.
This comment is valid - but seems to apply to using "select" with non-blocking sockets too.
The entire point of poll descriptors is that we explicitly make this part of the BIO abstraction layer and give a BIO implementation control of it.
Moreover, suppose we exposed the BIO poll descriptor interface I've designed for non-blocking I/O but said it can also be used when the network BIO could be blocking. But this doesn't actually solve anything because again, we have no contractual guarantee as to how many system calls a BIO_read/BIO_write implementation is allowed to make, whereas select(3) (as exposed by poll descriptors) only guarantees a single subsequent system call will work (on a compliant OS). Updated text to clarify this.
> Where we are provided with a non-pollable BIO, we cannot provide the application
> with any primitive used for synchronisation and it is assumed that the
> application will handle its own network I/O, for example via a
How will the application know when any given stream is readable/writeable?
It calls `SSL_read` and gets `SSL_ERROR_WANT_READ`, etc. We can also provide interfaces like `SSL_pending`, etc.
Many people get confused by this, and I'm not sure if we can fix this. It seems that some people think that SSL_ERROR_WANT_READ means that they need to wait for data from the socket to arrive, but it's really not what it means. (But they should wait for it anyway.) It means that for whatever reason, the last thing we tried to do was read from the socket, and there was no data available on the socket. If the application calls SSL_read(), it wants application data. Maybe there was some other non-application data on the socket, and we processed that, but there is no application data and so we return SSL_ERROR_WANT_READ. But SSL_ERROR_WANT_READ can also mean that the protocol requires that we read some non-application data before we can continue, and so other functions might return that too.
But I think the question is, if data was received on the connection, how does it know for which of the streams on that connection it needs to call SSL_read(), try all the streams?
> But I think the question is, if data was received on the connection, how does it know for which of the streams on that connection it needs to call SSL_read(), try all the streams?
This is a good question and we will need to answer it when we support multiple streams.
It's quite possible we will provide some select-esque call, say `SSL_stream_select`, which can determine which streams are ready.
Just nits. In general I agree with the document.
> BIO and violates the BIO abstraction. For BIOs in general, there does not
> appear to be any viable solution to the teardown issue.

> Even if this approach were successfully implemented, applications will still
...approach was successfully ....
This reads correct to me...
> Even if this approach were successfully implemented, applications will still
> need to change to using network BIOs with datagram semantics. For applications
> using custom BIOs, this is likely to require substantial rework of those BIOs.
> There is no possible way around this. Thus, even if this solution were adopted
were->was
Updated.
> Is the BIO abstraction even the right one to use? Optimal performance with
I've been thinking about this too, but I'm currently unsure what the best approach is. The design goal for me is to be able to make an implementation that uses io_uring and do the minimal amount of copying. Some of the things I have to think about:
@kroeckx: what is the reason for OpenSSL implementing QUIC itself in the first place, as opposed to supporting third-party implementations such as MsQuic or LSQUIC?
The current QUIC code has been designed with single-copy in mind, though this is not fully achieved yet. Take a look at the QUIC record layer (QTX), for example: it uses an iovec-like structure to allow the plaintext to come from multiple buffers during the encryption stage. Handling of QUIC packet headers during this is already done. Currently there are two copies: one when an application calls BIO_write to add data to a stream's buffer, and another when encrypting from that buffer while generating a packet (three if you count the copy made by the kernel into its own buffers when we pass the final datagram to the OS). Reducing the number of copies further would require some API changes, which obviously undermines our attempt to support existing users of OpenSSL with only minimal changes:
I think one of the points we want to make is that the BIO should not create an additional copy; it should use the same buffer. This might require a new API for the BIOs, where the BIO is able to say it no longer needs the buffer. At least on the write path, Linux has support via io_uring for not making a copy at all: the kernel doesn't copy the data; the network hardware uses DMA to read from the user-space buffer, and the BIO can then get a completion event when the buffer is no longer needed, at which point it can indicate that it's no longer using the buffer. That is, once the data is encrypted, the CPU should not copy it at all.
@kroeckx Absolutely, I agree we want to get to that eventually and this has informed our implementation. There is an OMC dimension here, as currently we are developing to the premise that we should allow existing network-side BIOs to be used with only minimal changes, as was asked for*. Achieving single-copy means developing a new BIO API and requiring it to be supported for the network-side BIO by applications which want to enjoy the single-copy fast path.

\* Of course, "minimal" does not necessarily mean "small": a network-side BIO has to be adapted to have datagram semantics rather than bytestream semantics, etc., which is unavoidable.
This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago
This PR is in a state where it requires action by @openssl/committers but the last update was 61 days ago
This pull request is ready to merge
Merged to master. Thanks for the reviews.
Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #19770)
This document describes the chosen internal I/O architecture for our QUIC implementation and the underlying motivations.
Feedback welcomed.
See also #19769.