Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QUIC I/O Architecture Design Document #19770

Closed
wants to merge 3 commits into from

Conversation

hlandau
Copy link
Member

@hlandau hlandau commented Nov 25, 2022

This document describes the chosen internal I/O architecture for our QUIC implementation and the underlying motivations.

Feedback welcomed.

See also #19769.

@hlandau hlandau added branch: master Merge to master branch approval: review pending This pull request needs review by a committer approval: otc review pending This pull request needs review by an OTC member triaged: documentation The issue/pr deals with documentation (errors) labels Nov 25, 2022
@hlandau hlandau requested a review from a team November 25, 2022 12:49
@hlandau hlandau self-assigned this Nov 25, 2022
changes needed on different platforms in the common case of an OS network
socket. (Use of an `int` here is strictly incorrect for Windows; however, this
style of usage is prevalent in the OpenSSL codebase, so for consistency we
continue the pattern here.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's already a union, I see no reason to just have the int.

Copy link
Member Author

@hlandau hlandau Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea to #ifdef WIN32 and use SOCKET in that case?

It's possible, but use of int for Win32 SOCKETs is so prolific in the codebase I'm not sure there's really a benefit.

doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
@t8m t8m added the triaged: design The issue/pr deals with a design document label Nov 25, 2022
@hlandau
Copy link
Member Author

hlandau commented Nov 28, 2022

Updated: tweaked wording, added diagram.

@hlandau hlandau force-pushed the quic-ioarch branch 3 times, most recently from 547b065 to 740c31d Compare November 28, 2022 13:19
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved

- The fact that a socket is writeable does not necessarily mean that a datagram
of the size we wish to send is writeable, so a send(3) call could block
anyway.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this matter? If we are blocked on write we can't send anything that might occur as the result of a timeout anyway. I would expect write blocks to be much shorter and more transient than read blocks (read blocks are dependent on the remote peer to supply more data, while write blocks are only governed by how quickly the OS can empty its write buffers to the network). It would mean that we can't process newly incoming "read" data. But that is the case for application controlled TLS event loops and so far hasn't been a real problem (probably because of the transient nature of write blocks).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writes aren't the only possible byproduct of a timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writes aren't the only possible byproduct of a timeout.

What other byproducts are there - and what is the impact of them being delayed by a small amount?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we get a timeout it could cause a write, but it could also cause connection teardown (idle timeout), meaning application SSL API calls need to return, etc.

I recall seeing research suggesting that the latency of the path to the network is actually quite a relevant factor for optimising QUIC performance. Unfortunately I can't seem to place it currently. It's true that delays on an OS's kernel write queues will probably be short enough that it's not going to be a severe detriment to performance. I am quite concerned however by the prospect of walling ourselves in here and preventing ourselves from tuning this properly in the future.

- This usage pattern precludes multithreaded use barring some locking scheme
due to the possibility of other threads racing between the call to select(3)
and the subsequent I/O call. This undermines our intentions to support
multi-threaded network I/O on the backend.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-threaded network IO is a nice-to-have (and not mentioned in the original requirements doc). I think it will be mostly useful for high-performance server apps which are likely to be using non-blocking IO anyway. This doesn't seem like a reason in itself to exclude blocking sockets as an option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the API-facing architectural decisions we take now influence what we can do in the future. Rather confused about this since I was under the impression it was the entire reason we spent time developing BIO_sendmmsg.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not see a use case of multiple threads on the same socket in combination with blocking I/O. I think that if multiple threads are used, and you're using blocking I/O, you have 1 thread per socket, and so there is no race.

Copy link
Member Author

@hlandau hlandau Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't have one thread per socket if we do blocking I/O on the network side because we need to do reads and writes (and timeouts) simultaneously, hence @mattcaswell proposing we have multiple background threads, one making BIO_read calls to an application's (potentially custom) BIO and one making BIO_write calls. But our existing BIO API contract doesn't provide for concurrent use of BIO_read and BIO_write on the same BIO (not even our own BIO_s_datagram is safe under these conditions), so an application would have to make major changes to make its custom BIO concurrent to support this. The scale of those changes imposed upon an application seems to be to be a lot more major than just mandating non-blocking operation (as this document explains).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But don't we require that any custom BIO implement support for BIO_sendmmsg() and BIO_recvmmsg()? Those APIs are defined to be thread safe....so this restriction applies even in the non-blocking scenario.

have a BIO interface which provides for select(3)-like functionality or which
can implement the required semantics above. Therefore, trying to implement QUIC
on top of blocking I/O in this way would require violating the BIO abstraction
layer, and would not work with custom BIOs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is valid - but seems to apply to using "select" with non-blocking sockets too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire point of poll descriptors is that we explicitly make this part of the BIO abstraction layer and give a BIO implementation control of it.

Copy link
Member Author

@hlandau hlandau Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, suppose we exposed the BIO poll descriptor interface I've designed for non-blocking I/O but said it can also be used when the network BIO could be blocking. But this doesn't actually solve anything because again, we have no contractual guarantee as to how many system calls a BIO_read/BIO_write implementation is allowed to make, whereas select(3) (as exposed by poll descriptors) only guarantees a single subsequent system call will work (on a compliant OS). Updated text to clarify this.

doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved

Where we are provided with a non-pollable BIO, we cannot provide the application
with any primitive used for synchronisation and it is assumed that the
application will handle its own network I/O, for example via a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will the application know when any given stream is readable/writeable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It calls SSL_read and gets SSL_ERROR_WANT_READ etc. We can also provide interfaces like SSL_pending etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many people get confused by this, and I'm not sure if we can fix this. It seems that some people think that SSL_ERROR_WANT_READ means that they need to wait for data from the socket to arrive, but it's really not what it means. (But they should wait for it anyway.) It means that for whatever reason, the last thing we tried to do was read from the socket, and there was no data available on the socket. If the application calls SSL_read(), it wants application data. Maybe there was some other non-application data on the socket, and we processed that, but there is no application data and so we return SSL_ERROR_WANT_READ. But SSL_ERROR_WANT_READ can also mean that the protocol requires that we read some non-application data before we can continue, and so other functions might return that too.

But I think the question is, if data was received on the connection, how does it know for which of the streams on that connection it needs to call SSL_read(), try all the streams?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think the question is, if data was received on the connection, how does it know for which of the streams on that connection it needs to call SSL_read(), try all the streams?

This is a good question and we will need to answer it when we support multiple streams.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite possible we will provide some select-esque call, say SSL_stream_select, which can determine which streams are ready.

Copy link
Member

@t8m t8m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just nits. In general I agree with the document.

BIO and violates the BIO abstraction. For BIOs in general, there does not
appear to be any viable solution to the teardown issue.

Even if this approach were successfully implemented, applications will still
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...approach was successfully ....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads correct to me...

Even if this approach were successfully implemented, applications will still
need to change to using network BIOs with datagram semantics. For applications
using custom BIOs, this is likely to require substantial rework of those BIOs.
There is no possible way around this. Thus, even if this solution were adopted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

were->was

doc/designs/quic-design/quic-io-arch.md Outdated Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Outdated Show resolved Hide resolved
doc/designs/quic-design/quic-io-arch.md Show resolved Hide resolved
@hlandau
Copy link
Member Author

hlandau commented Dec 14, 2022

Updated.

@hlandau hlandau mentioned this pull request Jan 3, 2023
@DemiMarie
Copy link

Is the BIO abstraction even the right one to use? Optimal performance with io_uring requires that buffers are owned by the kernel and that the application uses a callback-on-completion model. This means that OpenSSL needs to use application-provided buffers, rather than having OpenSSL allocate the buffers.

@kroeckx
Copy link
Member

kroeckx commented Jan 3, 2023

I've been thinking about this too, but I'm currently unsure what the best approach is. The design goal for me is to be able to make an implementation that uses io_uring and do the minimal amount of copying. Some of the things I have to think about:

  • OpenSSL need to encrypt things, so reading 1 buffer and writing it to an allocated buffer might be a solution.
  • As far as I know, the OpenSSL always supports inplace encryption, so 1 buffer might be possible. I have no idea if this provides any speed advantages or not, and if that's the same for all algorithms we have.
  • We will need to add a header in front of the encrypted data. This makes using an application provider buffer harder, unless we can make use of iovec. It's my understanding that we will not be able to use iovec on all platforms, they would get an additional alloc+copy, and we want to avoid that.
  • If the application provider buffer is used, we will need a new API anyway, where we can say that they need to leave a fix amount of room that OpenSSL can use. I'm not sure it's actually a fixed amount. Either we need an API where they can allocate the correct buffer, or, if they allocate it, we need a callback to indicate they can free it. (You can't free something not allocated by your own dll.)

@DemiMarie
Copy link

@kroeckx: what is the reason for OpenSSL implementing QUIC itself in the first place, as opposed to supporting third-party implementations such as MsQuic or LSQUIC?

@hlandau
Copy link
Member Author

hlandau commented Jan 4, 2023

I've been thinking about this too, but I'm currently unsure what the best approach is. The design goal for me is to be able to make an implementation that uses io_uring and do the minimal amount of copying. Some of the things I have to think about:

  • OpenSSL need to encrypt things, so reading 1 buffer and writing it to an allocated buffer might be a solution.

  • As far as I know, the OpenSSL always supports inplace encryption, so 1 buffer might be possible. I have no idea if this provides any speed advantages or not, and if that's the same for all algorithms we have.

  • We will need to add a header in front of the encrypted data. This makes using an application provider buffer harder, unless we can make use of iovec. It's my understanding that we will not be able to use iovec on all platforms, they would get an additional alloc+copy, and we want to avoid that.

The current QUIC code has been designed with single-copy in mind, though this is not fully achieved yet. Take a look at the QUIC record layer (QTX), for example; it uses an iovec-like structure to allow the plaintext to come from multiple buffers during the encryption stage. Handling of QUIC packet headers during this is already done.

Currently there are two copies; one when an application calls BIO_write to add data to a stream's buffer, and another when encrypting from that buffer while generating a packet. Three copies if you count the copy made by the kernel in its own buffers when we pass the final datagram to the OS.

Reducing the number of copies further would require some API changes, which obviously undermines our attempt to support existing users of OpenSSL with only minimal changes:

  • We would need to evolve the application-side BIO API to allow an application to lend OpenSSL a reference to a buffer instead of using BIO_write. Presumably OpenSSL would call some sort of callback when done with the buffer. I believe we've already discussed the idea of such an API in future.

  • We would need to eliminate the need of the OS to make a copy of the data we pass it, via something like io_uring or similar mechanisms on Windows. This creates substantial questions of how to evolve the BIO abstraction layer on the network side to support this. If we want to support existing network-side BIOs with only 'minimal' changes this would presumably need to be some kind of opt-in enhancement.

@hlandau hlandau added the tests: exempted The PR is exempt from requirements for testing label Jan 4, 2023
@kroeckx
Copy link
Member

kroeckx commented Jan 4, 2023

I think one of the points we want to make is that the BIO should not create an additional copy, it should be using the same buffer. This might require a new API for the BIOs, where the BIO should be able to say it no longer needs the buffer. At least on the write path, there is support in Linux using io_uring not to make a copy at all, the kernel doesn't copy it, the network hardware does DMA to get to the user space buffer, and the BIO can then get a completion request when the buffer is no longer needed, and can then also indicate that it's no longer using the buffer.

That is, once it's encrypted, the CPU should no copy it at all.

@hlandau
Copy link
Member Author

hlandau commented Jan 5, 2023

@kroeckx Absolutely, I agree we want to get to that eventually and this has informed our implementation.

There is an OMC dimension here, as currently we are developing to the premise that we should be allowing existing network-side BIOs to be used with only minimal changes, as was asked for*. Achieving single-copy means developing a new BIO API and requiring it to be supported for the network-side BIO by applications which want to enjoy the single-copy fast path.

* Of course, "minimal" does not necessarily mean "small" — a network-side BIO has be adapted to have datagram semantics rather than bytestream semantics, etc., which is unavoidable.

@paulidale paulidale removed the approval: otc review pending This pull request needs review by an OTC member label Feb 2, 2023
@openssl-machine
Copy link
Collaborator

This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago

@openssl-machine
Copy link
Collaborator

This PR is in a state where it requires action by @openssl/committers but the last update was 61 days ago

@t8m t8m added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Apr 25, 2023
@openssl-machine openssl-machine added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Apr 26, 2023
@openssl-machine
Copy link
Collaborator

This pull request is ready to merge

@hlandau
Copy link
Member Author

hlandau commented Apr 26, 2023

Merged to master. Thanks for the reviews.

@hlandau hlandau closed this Apr 26, 2023
openssl-machine pushed a commit that referenced this pull request Apr 26, 2023
Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #19770)
openssl-machine pushed a commit that referenced this pull request Apr 26, 2023
Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #19770)
openssl-machine pushed a commit that referenced this pull request Apr 26, 2023
Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #19770)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Merge to master branch tests: exempted The PR is exempt from requirements for testing triaged: design The issue/pr deals with a design document triaged: documentation The issue/pr deals with documentation (errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants