New kernel polling interface for Linux 4.18 (io_uring)? #1947

Closed
joehillen opened this issue Aug 14, 2018 · 49 comments
Labels
enhancement, not-stale (Issues that should never be marked stale)

Comments

@joehillen

Would libuv be able to benefit from leveraging the new kernel polling interface that was just released with Linux kernel version 4.18?

I'm curious if I should target the kernel API directly or if it would be better to abstract it as part of libuv.

@gireeshpunathil
Contributor

For me, the central parts of this interface are io_submit and io_getevents, whose mechanism is exactly what libuv already covers with the epoll_wait primitive and the algorithms around it.

So this interface is probably best leveraged by other platforms to implement non-blocking I/O in a more systematic way, and is not of much use for libuv and its consumers. However, if the kernel offers performance benefits for this interface over the design around the epoll_wait primitive, then it makes sense to review - I don't know whether that is the case.

@bnoordhuis
Member

I sent patches to teach aio poll about more file descriptor types. But then aio poll got temporarily reverted wholesale, and I don't think my patches were reapplied when it was merged back in. It's probably too limited right now to be useful for libuv.

I don't have time to work on it, but for anyone interested: the ability to embed an epoll fd in an aio pollset is the key, because it means you can always use epoll as a fallback.

@bnoordhuis
Member

IOCB_CMD_POLL didn't make the cut for 4.18 but it looks to be on track for 4.19 and from testing linux master I can confirm that embedding an epoll fd in an aio pollset works! I hope to have time this month to put together a pull request.

However, if the kernel offers performance benefits for this interface over the design around the epoll_wait primitive, then it makes sense to review - I don't know whether that is the case.

@gireeshpunathil The big benefit is that aio lets you use a ring buffer. That means you don't have to make a (slow) system call to check for pending events; you can just pull them from the ring buffer.
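
The same idea is central to io_uring, which this thread eventually settles on: completions land in a ring buffer shared with the kernel, so checking for them is just a memory read. A minimal sketch of that check using liburing (not the Linux AIO ring mentioned above, and not libuv code):

```c
/* Sketch: draining completions from io_uring's mmap'd completion ring via
 * liburing. io_uring_peek_cqe() only inspects the ring's head/tail indices
 * in shared memory, so no system call is made when completions are already
 * available. */
#include <liburing.h>
#include <stdio.h>

static void drain_completions(struct io_uring *ring) {
  struct io_uring_cqe *cqe;

  /* Returns 0 and sets cqe while completions are pending; -EAGAIN otherwise. */
  while (io_uring_peek_cqe(ring, &cqe) == 0) {
    printf("completed: res=%d user_data=%llu\n",
           cqe->res, (unsigned long long) cqe->user_data);
    io_uring_cqe_seen(ring, cqe);  /* advance the ring head */
  }
}
```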

@bnoordhuis
Member

https://lwn.net/ml/linux-fsdevel/20190121201456.28338-1-rpenyaev@suse.de/ - there seems to be some movement on adding a ring buffer to epoll.

That would make life a little easier for libuv because it means we won't have to support two completely disparate AIO mechanisms.

@sam-github
Contributor

Background: https://lwn.net/Articles/776703/, the discussion on that article may have led to the above.

@rektide rektide mentioned this issue Mar 31, 2019
@saghul saghul changed the title New kernel polling interface for Linux 4.18? New kernel polling interface for Linux 4.18 (io_uring)? Apr 1, 2019
@zbjornson
Contributor

I think there are two different interfaces being discussed -- io_uring is in Linux 5.1; IOCB_CMD_POLL is (was?) an addition to Linux AIO and I'm unclear what its status is. It sounds like AIO is even being removed entirely from the kernel. 🎻

I have a proof-of-concept Node.js addon that implements read and write using liburing here, using libuv's idle checks for polling (not sure if that's the best way). 5.1-rc3 is sufficient to test it.

From limited benchmarking:

  • The latency is more consistent between calls (2.6% vs 8.7%)
  • Sequential read throughput is about 38% better
  • Parallel: TBD, the benchmark is still inconsistent.

There's more advanced stuff I haven't gotten into, like registering fds to reduce overhead, which would be useful for Node.js' streams (see io_uring_register(2)).

Windows' overlapped I/O could be used equivalently, but I don't know about any of the other platforms. Would libuv ever use a mix of techniques for disk I/O (async I/O where it's usable, threadpool on platforms where it's not)?
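
For what it's worth, here is a rough sketch of the fd-registration idea mentioned above, using liburing's io_uring_register_files() and the IOSQE_FIXED_FILE flag; the path, sizes and error handling are placeholders rather than anything from the proof-of-concept addon:

```c
/* Sketch: pre-register file descriptors with the ring so the kernel can skip
 * the per-operation fd lookup, then reference them by index with
 * IOSQE_FIXED_FILE. See io_uring_register(2). */
#include <fcntl.h>
#include <liburing.h>
#include <sys/uio.h>

int registered_read_example(void) {
  struct io_uring ring;
  static char buf[4096];
  struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
  int fds[1];

  if (io_uring_queue_init(32, &ring, 0) < 0) return -1;

  fds[0] = open("/tmp/example.dat", O_RDONLY);  /* placeholder path */
  if (fds[0] < 0) return -1;

  /* One-time registration of the fd table. */
  if (io_uring_register_files(&ring, fds, 1) < 0) return -1;

  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_readv(sqe, 0 /* index into registered fds */, &iov, 1, 0);
  sqe->flags |= IOSQE_FIXED_FILE;

  io_uring_submit(&ring);
  /* ... reap the completion as usual ... */
  return 0;
}
```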

@bnoordhuis
Member

It sounds like AIO is even being removed entirely from the kernel.

Note the post date however. :)

Would libuv ever use a mix of techniques for disk I/O

Yes, provided it's reliable. (That was always the issue with Linux AIO, it wasn't.)

@zbjornson
Contributor

Note the post date however.

Oh good grief 😣. Based on the replies at least I wasn't the only one fooled!

@sam-github
Contributor

Linus is famous for a firm commitment to not break user-space, so ripping out an entire API set is unlikely - it should set your spidey senses tingling!

@jimmyjones2

Some more docs for io_uring have just been released: http://kernel.dk/io_uring.pdf

@zbjornson
Contributor

Does anyone familiar with libuv have a minute to review the approach in the repo I linked to previously, just to make sure it's reasonable so that I can keep evaluating it and maybe move toward a PR, please? Specifically: I'm polling and draining the completion queue in an uv_idle_t callback. I wasn't sure how else to get the loop to run so that the cq could be polled. I know that's not exactly how it would be if it was integrated into libuv, but is that a close approximation?

@bnoordhuis
Member

uv_idle_t basically turns it into busy-loop polling so that's no good (in general, I mean; it might be okay for your use case.)

I think the way to go is to have AIO events reported to an eventfd that's watched by the event loop. You check the ring buffer and only enter epoll_wait() or io_submit() if it's empty.

That should be safe because even if events arrive between the check and the system call, the fact that they signal the eventfd means you won't lose them, you'll return from the system call straight away.
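
For concreteness, a minimal sketch of that pattern with Linux AIO and libaio (-laio). It uses a zero-timeout io_getevents() as the "is anything pending?" check rather than reading the completion ring directly, and the file path and sizes are only illustrative:

```c
/* Sketch of the pattern described above: AIO completions signal an eventfd
 * that epoll watches; we drain completed events without blocking and only
 * block in epoll_wait() when nothing is pending. A completion that races in
 * between still bumps the eventfd, so epoll_wait() returns immediately. */
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int aio_eventfd_demo(void) {
  io_context_t ctx = 0;
  struct iocb cb, *cbs[1] = { &cb };
  struct io_event events[16];
  struct timespec zero = { 0, 0 };
  static char buf[4096];

  int efd = eventfd(0, EFD_NONBLOCK);
  int epfd = epoll_create1(0);
  int fd = open("/tmp/example.dat", O_RDONLY);  /* placeholder path */

  if (io_setup(16, &ctx) < 0) return -1;

  struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
  epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

  io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
  io_set_eventfd(&cb, efd);            /* completion bumps the eventfd */
  io_submit(ctx, 1, cbs);

  for (;;) {
    /* Check for already-completed events without blocking. */
    int n = io_getevents(ctx, 0, 16, events, &zero);
    if (n > 0)
      break;                           /* handle events[0..n) here */

    epoll_wait(epfd, &ev, 1, -1);      /* nothing pending: block */
    uint64_t cnt;
    read(efd, &cnt, sizeof(cnt));      /* reset the eventfd counter */
  }
  return 0;
}
```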

@zbjornson
Contributor

Thanks. It sounds like you're talking about AIO though. io_uring has no io_set_eventfd equivalent AFAIK. Because the event loop is generally running anyway, and because checking the completion queue is very cheap, it seems okay to poll for completion events? There's only one uv_idle_t for all pending I/O operations, so it also doesn't deteriorate with higher load.

@sam-github
Contributor

Isn't the event loop generally blocked on epoll, and won't it only wake up (i.e. return from epoll) and go check the uring when something, like a notification on an eventfd, causes it to wake up? If you see the loop running continuously, it's probably because you have a uv_idle_t and so have forced it to busy-loop.

@zbjornson
Contributor

I'm thinking in the context of Node.js where it's almost invariably running.

@sam-github
Contributor

So am I! :-) Why would it be constantly running? I guess if node always has outstanding I/O and never quite catches up, it would always be running.

@zbjornson
Contributor

zbjornson commented Apr 10, 2019

Maybe I'm wrong there :-) I assumed a busy server always has pending network IO and timers at least.

edit - Jens has kindly sent me a patch that adds notifications to io_uring. I'll try it out shortly.

@axboe

axboe commented Apr 11, 2019

I did send Zach a patch, it's also in my io_uring-next repository.

BTW, for poll(2) type checking, it's also very possible to do that on the ring fd. That'll work as well, without having to add support for eventfd.

@zbjornson
Contributor

zbjornson commented Apr 21, 2019

edited after prototype code fixed to reduce epoll_ctl calls
edited to show varying threadpool sizes

My test repo is updated to use an eventfd with uv_poll_t. Here's how it benchmarks for varying numbers of simultaneous reads:

[benchmark chart: per-read latency for varying numbers of simultaneous reads]

That benchmark measured how long each read took for each of one thousand 1024B files, read 250 times. I used small files to try to get at the interface overhead and not the I/O throughput.

  • threadpool={1,2,4} is libuv's current method with a threadpool size of 1, 2 or 4. This worsening latency with more threads is repeatable between runs of the benchmark 🤔.
  • io_uring calls io_uring_submit in each read()/write() call. This is expensive (strace below), so...
  • io_uring batched submit calls io_uring_submit in a uv_prepare_t. This helped a lot but obviously means the kernel can't start work on the I/O right away. (I don't know when libuv actually starts threadpool I/O -- is it immediate assuming no queuing, or does a different phase of the loop have to begin?)

io_uring_submit on each read/write()

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.07    2.482327           9    250000           io_uring_enter
  0.93    0.025472          22      1109        21 futex
  0.14    0.003930           7       509       501 epoll_ctl
  0.01    0.000384          22        17           read
  0.00    0.000015           5         3           epoll_pwait
------ ----------- ----------- --------- --------- ----------------
100.00    2.725836                258207       610 total

io_uring_submit in uv_prepare_t

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.67    0.480573          60      8000           io_uring_enter
 13.97    0.143823          17      8016           read
       # NB: the read call count is high because read() is used to reset the eventfd
 13.39    0.137910          17      8002           epoll_pwait
  4.09    0.042105          29      1444         2 futex
  1.35    0.013950          18       758       252 epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    1.029654                 32789       342 total
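
For reference, a minimal sketch of the "io_uring batched submit" variant measured above: read requests only queue an SQE, and a single io_uring_submit() runs from a uv_prepare_t callback once per loop iteration. The helper names are illustrative, not libuv internals:

```c
/* Sketch: batch SQE submission in a uv_prepare_t so each loop iteration pays
 * for at most one io_uring_enter() instead of one per read/write request. */
#include <liburing.h>
#include <sys/uio.h>
#include <uv.h>

static struct io_uring ring;
static uv_prepare_t prepare;
static unsigned pending_sqes;

/* Called instead of issuing the syscall immediately. */
int queue_read(int fd, struct iovec *iov, off_t offset, void *user_data) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  if (sqe == NULL)
    return -1;                 /* SQ full; a real version would flush here */
  io_uring_prep_readv(sqe, fd, iov, 1, offset);
  io_uring_sqe_set_data(sqe, user_data);
  pending_sqes++;
  return 0;
}

/* Runs once per loop iteration, just before the loop polls for I/O. */
static void on_prepare(uv_prepare_t *handle) {
  (void) handle;
  if (pending_sqes > 0) {
    io_uring_submit(&ring);    /* one io_uring_enter() for the whole batch */
    pending_sqes = 0;
  }
}

int setup_batched_ring(uv_loop_t *loop) {
  if (io_uring_queue_init(256, &ring, 0) < 0) return -1;
  uv_prepare_init(loop, &prepare);
  return uv_prepare_start(&prepare, on_prepare);
}
```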

I haven't finished the Windows overlapped+IOCP version for comparison yet.


@axboe

for poll(2) type checking, it's also very possible to do that on the ring fd.

Currently possible, or do you mean that would be an alternative interface to the eventfd?

(Thanks again for the speedy patch!)

@bnoordhuis
Member

torvalds/linux@9b40284 ("io_uring: add support for eventfd notifications") appears to be on track for Linux 5.2.

I'm posting this mostly as a follow-up to the discussion above because I assume libuv would poll the io_uring fd itself, something that's supported in 5.1.
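
Assuming the eventfd route, the wiring into a libuv loop could look roughly like the sketch below: register an eventfd with the ring, watch it with a uv_poll_t, and drain the completion queue whenever it fires. This is only an illustration of the approach discussed here, not the code from the prototype branch:

```c
/* Sketch: io_uring eventfd notifications delivered through uv_poll_t. */
#include <liburing.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <uv.h>

static struct io_uring ring;
static uv_poll_t ring_poll;
static int ring_efd;

static void on_ring_readable(uv_poll_t *handle, int status, int events) {
  struct io_uring_cqe *cqe;
  uint64_t cnt;
  (void) handle; (void) status; (void) events;

  read(ring_efd, &cnt, sizeof(cnt));             /* reset the eventfd */
  while (io_uring_peek_cqe(&ring, &cqe) == 0) {  /* drain the CQ ring */
    /* dispatch cqe->user_data / cqe->res to the waiting request here */
    io_uring_cqe_seen(&ring, cqe);
  }
}

int attach_ring(uv_loop_t *loop) {
  if (io_uring_queue_init(256, &ring, 0) < 0) return -1;

  ring_efd = eventfd(0, EFD_NONBLOCK);
  if (ring_efd < 0) return -1;

  /* Ask the kernel to signal the eventfd on every completion. */
  if (io_uring_register_eventfd(&ring, ring_efd) < 0) return -1;

  uv_poll_init(loop, &ring_poll, ring_efd);
  return uv_poll_start(&ring_poll, UV_READABLE, on_ring_readable);
}
```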

@zbjornson W.r.t. libuv integration, I expect you want to feature-detect in uv__platform_loop_init() in src/unix/linux-core.c and then either set up io_uring or fall back to epoll.

You can't change sizeof(uv_loop_t) because of ABI, and it's already kind of crowded in there, but you can turn int backend_fd into a union { int fd; void* data; } backend (if I'm not mistaken, there's a 4 byte hole there on 64-bit architectures just waiting to be filled) and use one bit from loop->flags to discriminate between the two.[1] You can then use backend.data to point to dynamically allocated memory.

[1] Currently only bit 0 is in use, for UV_LOOP_BLOCK_SIGPROF.
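
To make the layout trick concrete, a standalone illustration (not libuv source; the struct and the second flag name are made up for the example):

```c
/* Illustration: an int backend fd slot widened to a union, with a spare bit
 * in the flags word recording which member is live. */
#define EX_LOOP_BLOCK_SIGPROF  (1u << 0)  /* bit 0 already taken, per [1] */
#define EX_LOOP_USE_IOURING    (1u << 1)  /* hypothetical new flag */

union loop_backend {
  int fd;       /* classic epoll backend fd */
  void *data;   /* pointer to dynamically allocated io_uring state */
};

struct loop_like {
  unsigned int flags;
  union loop_backend backend;
};

static int loop_backend_fd(const struct loop_like *loop) {
  if (loop->flags & EX_LOOP_USE_IOURING)
    return -1;                 /* callers must go through backend.data */
  return loop->backend.fd;
}
```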

zbjornson added a commit to zbjornson/libuv that referenced this issue May 28, 2019
zbjornson added a commit to zbjornson/libuv that referenced this issue Jun 2, 2019
Currently trying to fix a new test failure that arose from only the preparations for io_uring.

Ref libuv#1947
@zbjornson
Contributor

zbjornson commented Jun 2, 2019

@bnoordhuis thanks for the pointers. Is this what you had in mind?

(as of that commit I'm still working on a test failure. edit: all good - a test was failing because I was writing the io_uring setup status to stderr)

I assume libuv would poll the io_uring fd itself

I haven't figured out how to make this work. The ring fd added with epoll_ctl never triggers the epoll instance. (I'm also not sure what the difference would be between that and an eventfd(2), aside from slightly earlier kernel support and avoiding two syscalls on init.)

zbjornson added a commit to zbjornson/libuv that referenced this issue Jun 2, 2019
Refactoring works. Need to actually use io_uring now.

ref libuv#1947
@zbjornson
Contributor

Can ignore the first part of the previous comment; I opened #2322 for easier discussion.

@axboe

axboe commented Dec 3, 2019

I have read through it, and the io_uring parts look fine to me. It's possible to remove some boilerplate code by adding a liburing dependency (like the setup code, or the ring fill-in), but in all fairness, that code is probably never going to get touched once it's merged. So I don't think that is a big deal, and there's nothing wrong with using the raw interface vs going with liburing. The only real upside I can think of is getting rid of the memory barriers.

I'm not familiar with the libuv code base so I focused on the specific io_uring bits.

@vcaputo

vcaputo commented Nov 26, 2020

@axboe It may be worth considering having applications go through liburing: it provides a convenient boundary for hooking io_uring emulation in userspace, for either pre-io_uring kernel support or development/debugging purposes. liburing itself could implement this, enabled via some environment variable, or it could just be replaced via LD_PRELOAD.

@ioquatix

ioquatix commented Nov 27, 2020

@vcaputo I don't think you can replace liburing with an emulation layer without a huge amount of impedance mismatch.

I also don't think it's necessary to pull in liburing, and I think a lot of people will refer to this code base to see how to do it without liburing, so there is value in going straight to the system calls, etc.

@concatime
Contributor

concatime commented Nov 27, 2020

I think liburing makes sense for apps like QEMU or nginx. This way, you have an abstraction layer between the app and io_uring. In our case, I think libuv should deal directly with io_uring because libuv is already an abstraction layer by itself. And it's one less dependency at compile time and runtime :)
Also, like @ioquatix said, it would be nice to have a code base to refer to for how to deal with io_uring without liburing.

EDIT: After reading @axboe's comment below, I changed my mind; using liburing is the de facto way to interact with io_uring. It keeps libuv's code simple. And I would trust @axboe's liburing more than any other io_uring code. Thank you :)

@vcaputo

vcaputo commented Nov 27, 2020

Have you looked at the liburing code? It's just a thin veneer over the syscall interface; it's not even 1000 lines of .c files.

The value, IMHO, of having most stuff go through it is that it's a single point for implementing things like an emulation layer for compatibility or debugging, and I strongly disagree with the claim of a "huge amount of impedance mismatch" pertaining to emulation.

For ages there was protest against adding such asynchronous syscall interfaces to the Linux kernel, because they never did anything that couldn't be done perfectly well from userspace after NPTL landed.

Sure, now that syscalls are more expensive it won't be as fast when emulated, but there are no major barriers or contortions necessary to make it work at all. The submission queue maps to threads, the threads perform the syscalls instead of the kernel, and the results get serialized and pumped out the completion queue. No. Big. Deal. Now applications written for just liburing can work on any kernel, and libuv targeting liburing could be both developed on and tested against pre-io_uring kernels, or hell, even non-Linux systems like OSX.

That sort of layer belongs in liburing, and if it doesn't hurt libuv significantly to go through liburing, my vote is to do the same, since many applications use libuv.

Just my $.02, I don't really have a dog in this race otherwise.

@jorangreef
Contributor

I would say: if you can use liburing, then go for it. liburing has solved some tricky bugs in the past, and there are a few surprising edge cases you need to get right if you want to do it yourself. liburing is essentially the test suite for the kernel, so it's a pretty good reference implementation.

@jobs-git

jobs-git commented Jun 5, 2021

What happened here? nginx already has io_uring support.


@jobs-git

jobs-git commented Jun 8, 2021

liburing is made by the creator of io_uring, so...

@axboe

axboe commented Jun 9, 2021

liburing should just be considered a reference implementation. I've written a few things that use the raw interface, so that's quite possible too. Or even a mix - using liburing for the ring setup to avoid a bunch of boilerplate code, then using the raw interface after that.

It's possible to be slightly faster with the raw interface. As with any kind of API, liburing will add some fat to the middle and have some indirection, as well as catering to cases that any one specific user may not care about. In terms of API stability, both the kernel and liburing won't change (on purpose...).

That said, unless there's a good reason not to, I'd probably just stick with liburing.

@jorangreef
Contributor

And liburing also has a great, clean design. We've had a few comments from people looking at @ziglang's io_uring design, and we just say: well, this is just what @axboe did! Take a look at liburing...

@amirouche

Take a look at liburing...

https://github.com/axboe/liburing

@tontinton

Wouldn't this be a big step for libuv?
I don't get why it's been on hold for so long...

@xorgy

xorgy commented Sep 13, 2022

Wouldn't this be a big step for libuv? I don't get why it's been on hold for so long...

libuv needs more maintainers, and maybe those maintainers should be getting paid pretty well given how much money people are making with libuv. The PR could probably use rebasing at this point too, but it can be used as-is by anyone brave enough, so give it a shot. :)

@oerdnj
Contributor

oerdnj commented Feb 23, 2023

I’ve recently stumbled upon this: https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023 and my employer (ISC) would be willing to sponsor io_uring support in libuv. We are not a big tech company, so reasonable proposals only ;). I think you might know how to reach me via email.

@bnoordhuis
Member

@oerdnj I'll be in touch. It's something I've been chipping away at for some time, if ever so slowly.

@plajjan

plajjan commented May 18, 2023

Awesome work in #3952, really cool that this is happening. #3952 is focused on fs operations. Are there plans (ISC sponsorship? @oerdnj) that cover network sockets too?

@oerdnj
Contributor

oerdnj commented May 19, 2023

Awesome work in #3952, really cool that this is happening. #3952 is focused on fs operations. Are there plans (ISC sponsorship? @oerdnj) that cover network sockets too?

There’s #3979 too. As for the network sockets, we are talking about that, but the benefit seems to be smaller as epoll is already very effective. @bnoordhuis is willing to do the work, but as far as I understood it, it would require new APIs. Anybody is welcome to chime in with sponsoring the work too though.

@bnoordhuis
Member

New APIs - that's right. To work well with io_uring, reading data should be request-based (like writing, connecting, etc. already are) because the "firehose" approach isn't a good fit. Libuv users will be able to opt in to the new behavior.

I have a pretty good idea of what needs to change where, but it's going to be a fair amount of work. Sponsorship welcome.
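
Purely as an illustration of what "request-based reading" could look like (a hypothetical API shape, not an existing or planned libuv interface), the contrast with the current uv_read_start() firehose model is roughly this:

```c
/* Hypothetical API shape only. A request-based read mirrors uv_write(): the
 * caller owns the buffer and asks for exactly one read, which maps naturally
 * onto a single io_uring SQE and a single completion. */
#include <uv.h>

typedef struct my_read_req_s my_read_req_t;
typedef void (*my_read_cb)(my_read_req_t *req, ssize_t nread);

struct my_read_req_s {
  uv_stream_t *handle;
  uv_buf_t buf;          /* caller-owned; must stay valid until the callback */
  my_read_cb cb;
  void *data;
};

/* One read per request, completed by one callback; contrast with
 * uv_read_start(), which keeps delivering data until uv_read_stop(). */
int my_stream_read(my_read_req_t *req, uv_stream_t *handle,
                   const uv_buf_t *buf, my_read_cb cb);
```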

@ocodista

@bnoordhuis is there any other async operation from libuv that could benefit from migration to io_uring?

jeffhhk added a commit to jeffhhk/benchmark_random_io that referenced this issue Feb 12, 2024