QUIC I/O Architecture Design Document #19770
Conversation
> changes needed on different platforms in the common case of an OS network
> socket. (Use of an `int` here is strictly incorrect for Windows; however, this
> style of usage is prevalent in the OpenSSL codebase, so for consistency we
> continue the pattern here.)
Since it's already a union, I see no reason to just have the int.
Is the idea to `#ifdef WIN32` and use `SOCKET` in that case? It's possible, but use of `int` for Win32 `SOCKET`s is so prolific in the codebase I'm not sure there's really a benefit.
Updated: tweaked wording, added diagram. (Force-pushed 547b065 to 740c31d.)
> - The fact that a socket is writeable does not necessarily mean that a datagram
>   of the size we wish to send is writeable, so a send(3) call could block
>   anyway.
Why does this matter? If we are blocked on write we can't send anything that might occur as the result of a timeout anyway. I would expect write blocks to be much shorter and more transient than read blocks (read blocks are dependent on the remote peer to supply more data, while write blocks are only governed by how quickly the OS can empty its write buffers to the network). It would mean that we can't process newly incoming "read" data. But that is the case for application controlled TLS event loops and so far hasn't been a real problem (probably because of the transient nature of write blocks).
Writes aren't the only possible byproduct of a timeout.
> Writes aren't the only possible byproduct of a timeout.
What other byproducts are there - and what is the impact of them being delayed by a small amount?
When we get a timeout it could cause a write, but it could also cause connection teardown (idle timeout), meaning application SSL API calls need to return, etc.
I recall seeing research suggesting that the latency of the path to the network is actually quite a relevant factor for optimising QUIC performance. Unfortunately I can't seem to place it currently. It's true that delays on an OS's kernel write queues will probably be short enough that it's not going to be a severe detriment to performance. I am quite concerned however by the prospect of walling ourselves in here and preventing ourselves from tuning this properly in the future.
> - This usage pattern precludes multithreaded use barring some locking scheme
>   due to the possibility of other threads racing between the call to select(3)
>   and the subsequent I/O call. This undermines our intentions to support
>   multi-threaded network I/O on the backend.
Multi-threaded network IO is a nice-to-have (and not mentioned in the original requirements doc). I think it will be mostly useful for high-performance server apps which are likely to be using non-blocking IO anyway. This doesn't seem like a reason in itself to exclude blocking sockets as an option.
But the API-facing architectural decisions we take now influence what we can do in the future. I'm rather confused about this, since I was under the impression it was the entire reason we spent time developing `BIO_sendmmsg`.
I cannot see a use case for multiple threads on the same socket in combination with blocking I/O. I think that if multiple threads are used with blocking I/O, you have one thread per socket, and so there is no race.
We can't have one thread per socket if we do blocking I/O on the network side because we need to do reads and writes (and timeouts) simultaneously, hence @mattcaswell proposing we have multiple background threads, one making BIO_read calls to an application's (potentially custom) BIO and one making BIO_write calls. But our existing BIO API contract doesn't provide for concurrent use of BIO_read and BIO_write on the same BIO (not even our own BIO_s_datagram is safe under these conditions), so an application would have to make major changes to make its custom BIO concurrent to support this. The scale of those changes imposed upon an application seems to me to be a lot more major than just mandating non-blocking operation (as this document explains).
But don't we require that any custom BIO implement support for BIO_sendmmsg() and BIO_recvmmsg()? Those APIs are defined to be thread-safe, so this restriction applies even in the non-blocking scenario.
> have a BIO interface which provides for select(3)-like functionality or which
> can implement the required semantics above. Therefore, trying to implement QUIC
> on top of blocking I/O in this way would require violating the BIO abstraction
> layer, and would not work with custom BIOs.
This comment is valid - but seems to apply to using "select" with non-blocking sockets too.
The entire point of poll descriptors is that we explicitly make this part of the BIO abstraction layer and give a BIO implementation control of it.
Moreover, suppose we exposed the BIO poll descriptor interface I've designed for non-blocking I/O but said it can also be used when the network BIO could be blocking. But this doesn't actually solve anything because again, we have no contractual guarantee as to how many system calls a BIO_read/BIO_write implementation is allowed to make, whereas select(3) (as exposed by poll descriptors) only guarantees a single subsequent system call will work (on a compliant OS). Updated text to clarify this.
> Where we are provided with a non-pollable BIO, we cannot provide the application
> with any primitive used for synchronisation and it is assumed that the
> application will handle its own network I/O, for example via a
How will the application know when any given stream is readable/writeable?
It calls `SSL_read` and gets `SSL_ERROR_WANT_READ`, etc. We can also provide interfaces like `SSL_pending`, etc.
Many people get confused by this, and I'm not sure if we can fix this. It seems that some people think that SSL_ERROR_WANT_READ means that they need to wait for data from the socket to arrive, but it's really not what it means. (But they should wait for it anyway.) It means that for whatever reason, the last thing we tried to do was read from the socket, and there was no data available on the socket. If the application calls SSL_read(), it wants application data. Maybe there was some other non-application data on the socket, and we processed that, but there is no application data and so we return SSL_ERROR_WANT_READ. But SSL_ERROR_WANT_READ can also mean that the protocol requires that we read some non-application data before we can continue, and so other functions might return that too.
But I think the question is, if data was received on the connection, how does it know for which of the streams on that connection it needs to call SSL_read(), try all the streams?
> But I think the question is, if data was received on the connection, how does it know for which of the streams on that connection it needs to call SSL_read(), try all the streams?
This is a good question and we will need to answer it when we support multiple streams.
It's quite possible we will provide some select-esque call, say `SSL_stream_select`, which can determine which streams are ready.
Just nits. In general I agree with the document.
> BIO and violates the BIO abstraction. For BIOs in general, there does not
> appear to be any viable solution to the teardown issue.

> Even if this approach were successfully implemented, applications will still
...approach was successfully ....
This reads correct to me...
> Even if this approach were successfully implemented, applications will still
> need to change to using network BIOs with datagram semantics. For applications
> using custom BIOs, this is likely to require substantial rework of those BIOs.
> There is no possible way around this. Thus, even if this solution were adopted
were->was
Updated.
> Is the BIO abstraction even the right one to use? Optimal performance with
I've been thinking about this too, but I'm currently unsure what the best approach is. The design goal for me is to be able to make an implementation that uses io_uring and do the minimal amount of copying. Some of the things I have to think about:
@kroeckx: what is the reason for OpenSSL implementing QUIC itself in the first place, as opposed to supporting third-party implementations such as MsQuic or LSQUIC?
The current QUIC code has been designed with single-copy in mind, though this is not fully achieved yet. Take a look at the QUIC record layer (QTX), for example: it uses an iovec-like structure to allow the plaintext to come from multiple buffers during the encryption stage. Handling of QUIC packet headers during this is already done. Currently there are two copies: one when an application calls BIO_write to add data to a stream's buffer, and another when encrypting from that buffer while generating a packet (three if you count the copy made by the kernel into its own buffers when we pass the final datagram to the OS). Reducing the number of copies further would require some API changes, which obviously undermines our attempt to support existing users of OpenSSL with only minimal changes:
I think one of the points we want to make is that the BIO should not create an additional copy; it should use the same buffer. This might require a new API for the BIOs, where the BIO is able to say it no longer needs the buffer. At least on the write path, Linux has support via io_uring for not making a copy at all: the kernel doesn't copy the data; the network hardware uses DMA to read from the user-space buffer, and the BIO can then get a completion event when the buffer is no longer needed, at which point it can indicate that it's no longer using the buffer. That is, once the data is encrypted, the CPU should not copy it at all.
@kroeckx Absolutely, I agree we want to get to that eventually and this has informed our implementation. There is an OMC dimension here, as currently we are developing to the premise that we should allow existing network-side BIOs to be used with only minimal changes, as was asked for*. Achieving single-copy means developing a new BIO API and requiring it to be supported for the network-side BIO by applications which want to enjoy the single-copy fast path.

\* Of course, "minimal" does not necessarily mean "small": a network-side BIO has to be adapted to have datagram semantics rather than bytestream semantics, etc., which is unavoidable.
This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago
This PR is in a state where it requires action by @openssl/committers but the last update was 61 days ago
This pull request is ready to merge
Merged to master. Thanks for the reviews.
Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #19770)
This document describes the chosen internal I/O architecture for our QUIC implementation and the underlying motivations.
Feedback welcomed.
See also #19769.