New kernel polling interface for Linux 4.18 (io_uring)? #1947

Closed
joehillen opened this issue Aug 14, 2018 · 49 comments
Labels
enhancement, not-stale (Issues that should never be marked stale)

Comments

@joehillen

Would libuv be able to benefit from leveraging the new kernel polling interface that was just released with Linux kernel version 4.18?

I'm curious if I should target the kernel API directly or if it would be better to abstract it as part of libuv.

@gireeshpunathil
Contributor

For me, the central parts of this interface are io_submit and io_getevents, whose mechanism is exactly what libuv already covers with the epoll_wait primitive and the algorithms around it.

So this interface is probably best leveraged by other platforms to implement non-blocking I/O in a more systematic way, and is not of much use for libuv and its consumers. However, if the kernel offers performance benefits for this interface over the design around the epoll_wait primitive, then it makes sense to review - I don't know whether that is the case.

@bnoordhuis
Member

I sent patches to teach aio poll about more file descriptor types. But then aio poll got temporarily reverted wholesale, and I don't think my patches were reapplied when it was merged back in. It's probably too limited right now to be useful for libuv.

I don't have time to work on it, but for anyone interested: the ability to embed an epoll fd in an aio pollset is the key, because it means you can always use epoll as a fallback.

@bnoordhuis
Member

IOCB_CMD_POLL didn't make the cut for 4.18 but it looks to be on track for 4.19 and from testing linux master I can confirm that embedding an epoll fd in an aio pollset works! I hope to have time this month to put together a pull request.

However, if the kernel offers performance benefits for this interface over the design around the epoll_wait primitive, then it makes sense to review - I don't know whether that is the case.

@gireeshpunathil The big benefit is that aio lets you use a ring buffer. That means you don't have to make a (slow) system call to check for pending events; you can just pull them from the ring buffer.
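
The same idea is central to io_uring, which this thread eventually settles on: completions land in a ring buffer shared with the kernel, so checking for them is just a memory read. A minimal sketch of that check using liburing (not the Linux AIO ring mentioned above, and not libuv code):

```c
/* Sketch: draining completions from io_uring's mmap'd completion ring via
 * liburing. io_uring_peek_cqe() only inspects the ring's head/tail indices
 * in shared memory, so no system call is made when completions are already
 * available. */
#include <liburing.h>
#include <stdio.h>

static void drain_completions(struct io_uring *ring) {
  struct io_uring_cqe *cqe;

  /* Returns 0 and sets cqe while completions are pending; -EAGAIN otherwise. */
  while (io_uring_peek_cqe(ring, &cqe) == 0) {
    printf("completed: res=%d user_data=%llu\n",
           cqe->res, (unsigned long long) cqe->user_data);
    io_uring_cqe_seen(ring, cqe);  /* advance the ring head */
  }
}
```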

@bnoordhuis
Member

https://lwn.net/ml/linux-fsdevel/20190121201456.28338-1-rpenyaev@suse.de/ - there seems to be some movement on adding a ring buffer to epoll.

That would make life a little easier for libuv because it means we won't have to support two completely disparate AIO mechanisms.

@sam-github
Contributor

Background: https://lwn.net/Articles/776703/, the discussion on that article may have led to the above.

@rektide rektide mentioned this issue Mar 31, 2019
@saghul saghul changed the title New kernel polling interface for Linux 4.18? New kernel polling interface for Linux 4.18 (io_uring)? Apr 1, 2019
@zbjornson
Contributor

I think there are two different interfaces being discussed -- io_uring is in Linux 5.1; IOCB_CMD_POLL is (was?) an addition to Linux AIO and I'm unclear what its status is. It sounds like AIO is even being removed entirely from the kernel. 🎻

I have a proof-of-concept Node.js addon that implements read and write using liburing here, using libuv's idle checks for polling (not sure if that's the best way). 5.1-rc3 is sufficient to test it.

From limited benchmarking:

  • The latency is more consistent between calls (2.6% vs 8.7%)
  • Sequential read throughput is about 38% better
  • Parallel: TBD, the benchmark is still inconsistent.

There's more advanced stuff I haven't gotten into, like registering fds to reduce overhead, which would be useful for Node.js' streams (see io_uring_register(2)).

Windows' overlapped I/O could be used equivalently, but I don't know about any of the other platforms. Would libuv ever use a mix of techniques for disk I/O (async I/O where it's usable, threadpool on platforms where it's not)?
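
For what it's worth, here is a rough sketch of the fd-registration idea mentioned above, using liburing's io_uring_register_files() and the IOSQE_FIXED_FILE flag; the path, sizes and error handling are placeholders rather than anything from the proof-of-concept addon:

```c
/* Sketch: pre-register file descriptors with the ring so the kernel can skip
 * the per-operation fd lookup, then reference them by index with
 * IOSQE_FIXED_FILE. See io_uring_register(2). */
#include <fcntl.h>
#include <liburing.h>
#include <sys/uio.h>

int registered_read_example(void) {
  struct io_uring ring;
  static char buf[4096];
  struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
  int fds[1];

  if (io_uring_queue_init(32, &ring, 0) < 0) return -1;

  fds[0] = open("/tmp/example.dat", O_RDONLY);  /* placeholder path */
  if (fds[0] < 0) return -1;

  /* One-time registration of the fd table. */
  if (io_uring_register_files(&ring, fds, 1) < 0) return -1;

  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_readv(sqe, 0 /* index into registered fds */, &iov, 1, 0);
  sqe->flags |= IOSQE_FIXED_FILE;

  io_uring_submit(&ring);
  /* ... reap the completion as usual ... */
  return 0;
}
```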

@bnoordhuis
Member

It sounds like AIO is even being removed entirely from the kernel.

Note the post date however. :)

Would libuv ever use a mix of techniques for disk I/O

Yes, provided it's reliable. (That was always the issue with Linux AIO, it wasn't.)

@zbjornson
Contributor

Note the post date however.

Oh good grief 😣. Based on the replies at least I wasn't the only one fooled!

@sam-github
Contributor

Linus is famous for a firm commitment to not break user-space, so ripping out an entire API set is unlikely - it should set your spidey senses tingling!

@jimmyjones2

Some more docs for io_uring have just been released: http://kernel.dk/io_uring.pdf

@zbjornson
Contributor

Does anyone familiar with libuv have a minute to review the approach in the repo I linked to previously, just to make sure it's reasonable so that I can keep evaluating it and maybe move toward a PR, please? Specifically: I'm polling and draining the completion queue in an uv_idle_t callback. I wasn't sure how else to get the loop to run so that the cq could be polled. I know that's not exactly how it would be if it was integrated into libuv, but is that a close approximation?

@bnoordhuis
Member

uv_idle_t basically turns it into busy-loop polling so that's no good (in general, I mean; it might be okay for your use case.)

I think the way to go is to have AIO events reported to an eventfd that's watched by the event loop. You check the ring buffer and only enter epoll_wait() or io_submit() if it's empty.

That should be safe because even if events arrive between the check and the system call, the fact that they signal the eventfd means you won't lose them, you'll return from the system call straight away.
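
For concreteness, a minimal sketch of that pattern with Linux AIO and libaio (-laio). It uses a zero-timeout io_getevents() as the "is anything pending?" check rather than reading the completion ring directly, and the file path and sizes are only illustrative:

```c
/* Sketch of the pattern described above: AIO completions signal an eventfd
 * that epoll watches; we drain completed events without blocking and only
 * block in epoll_wait() when nothing is pending. A completion that races in
 * between still bumps the eventfd, so epoll_wait() returns immediately. */
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int aio_eventfd_demo(void) {
  io_context_t ctx = 0;
  struct iocb cb, *cbs[1] = { &cb };
  struct io_event events[16];
  struct timespec zero = { 0, 0 };
  static char buf[4096];

  int efd = eventfd(0, EFD_NONBLOCK);
  int epfd = epoll_create1(0);
  int fd = open("/tmp/example.dat", O_RDONLY);  /* placeholder path */

  if (io_setup(16, &ctx) < 0) return -1;

  struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
  epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

  io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
  io_set_eventfd(&cb, efd);            /* completion bumps the eventfd */
  io_submit(ctx, 1, cbs);

  for (;;) {
    /* Check for already-completed events without blocking. */
    int n = io_getevents(ctx, 0, 16, events, &zero);
    if (n > 0)
      break;                           /* handle events[0..n) here */

    epoll_wait(epfd, &ev, 1, -1);      /* nothing pending: block */
    uint64_t cnt;
    read(efd, &cnt, sizeof(cnt));      /* reset the eventfd counter */
  }
  return 0;
}
```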

@zbjornson
Contributor

Thanks. It sounds like you're talking about AIO though. io_uring has no io_set_eventfd equivalent AFAIK. Because the event loop is generally running anyway, and because checking the completion queue is very cheap, it seems okay to poll for completion events? There's only one uv_idle_t for all pending I/O operations, so it also doesn't deteriorate with higher load.

@sam-github
Contributor

Isn't the event loop generally blocked on epoll, and won't it only wake up (i.e. return from epoll) and go check the uring when something, like a notification on an eventfd, causes it to wake up? If you see the loop running continuously, it's probably because you have a uv_idle_t and so have forced it to busy-loop.

@zbjornson
Contributor

I'm thinking in the context of Node.js where it's almost invariably running.

@sam-github
Contributor

So am I! :-) Why would it be constantly running? I guess if node always has outstanding I/O and never quite catches up, it would always be running.

@zbjornson
Contributor

zbjornson commented Apr 10, 2019

Maybe I'm wrong there :-) I assumed a busy server always has pending network IO and timers at least.

edit - Jens has kindly sent me a patch that adds notifications to io_uring. I'll try it out shortly.

@axboe

axboe commented Apr 11, 2019

I did send Zach a patch, it's also in my io_uring-next repository.

BTW, for poll(2) type checking, it's also very possible to do that on the ring fd. That'll work as well, without having to add support for eventfd.

@zbjornson
Contributor

zbjornson commented Apr 21, 2019

edited after prototype code fixed to reduce epoll_ctl calls
edited to show varying threadpool sizes

My test repo is updated to use an eventfd with uv_poll_t. Here's how it benchmarks for varying numbers of simultaneous reads:

[benchmark chart: per-read latency for varying numbers of simultaneous reads]

That benchmark measured how long each read took for each of one thousand 1024B files, read 250 times. I used small files to try to get at the interface overhead and not the I/O throughput.

  • threadpool={1,2,4} is libuv's current method with a threadpool size of 1, 2 or 4. This worsening latency with more threads is repeatable between runs of the benchmark 🤔.
  • io_uring calls io_uring_submit in each read()/write() call. This is expensive (strace below), so...
  • io_uring batched submit calls io_uring_submit in a uv_prepare_t. This helped a lot but obviously means the kernel can't start work on the I/O right away. (I don't know when libuv actually starts threadpool I/O -- is it immediate assuming no queuing, or does a different phase of the loop have to begin?)

io_uring_submit on each read/write()

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.07    2.482327           9    250000           io_uring_enter
  0.93    0.025472          22      1109        21 futex
  0.14    0.003930           7       509       501 epoll_ctl
  0.01    0.000384          22        17           read
  0.00    0.000015           5         3           epoll_pwait
------ ----------- ----------- --------- --------- ----------------
100.00    2.725836                258207       610 total

io_uring_submit in uv_prepare_t

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.67    0.480573          60      8000           io_uring_enter
 13.97    0.143823          17      8016           read
       # NB: the read call count is high because read() is used to reset the eventfd
 13.39    0.137910          17      8002           epoll_pwait
  4.09    0.042105          29      1444         2 futex
  1.35    0.013950          18       758       252 epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    1.029654                 32789       342 total
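
For reference, a minimal sketch of the "io_uring batched submit" variant measured above: read requests only queue an SQE, and a single io_uring_submit() runs from a uv_prepare_t callback once per loop iteration. The helper names are illustrative, not libuv internals:

```c
/* Sketch: batch SQE submission in a uv_prepare_t so each loop iteration pays
 * for at most one io_uring_enter() instead of one per read/write request. */
#include <liburing.h>
#include <sys/uio.h>
#include <uv.h>

static struct io_uring ring;
static uv_prepare_t prepare;
static unsigned pending_sqes;

/* Called instead of issuing the syscall immediately. */
int queue_read(int fd, struct iovec *iov, off_t offset, void *user_data) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  if (sqe == NULL)
    return -1;                 /* SQ full; a real version would flush here */
  io_uring_prep_readv(sqe, fd, iov, 1, offset);
  io_uring_sqe_set_data(sqe, user_data);
  pending_sqes++;
  return 0;
}

/* Runs once per loop iteration, just before the loop polls for I/O. */
static void on_prepare(uv_prepare_t *handle) {
  (void) handle;
  if (pending_sqes > 0) {
    io_uring_submit(&ring);    /* one io_uring_enter() for the whole batch */
    pending_sqes = 0;
  }
}

int setup_batched_ring(uv_loop_t *loop) {
  if (io_uring_queue_init(256, &ring, 0) < 0) return -1;
  uv_prepare_init(loop, &prepare);
  return uv_prepare_start(&prepare, on_prepare);
}
```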

I haven't finished the Windows overlapped+IOCP version for comparison yet.


@axboe

for poll(2) type checking, it's also very possible to do that on the ring fd.

Currently possible, or do you mean that would be an alternative interface to the eventfd?

(Thanks again for the speedy patch!)

@bnoordhuis
Member

torvalds/linux@9b40284 ("io_uring: add support for eventfd notifications") appears to be on track for Linux 5.2.

I'm posting this mostly as a follow-up to the discussion above because I assume libuv would poll the io_uring fd itself, something that's supported in 5.1.
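
Assuming the eventfd route, the wiring into a libuv loop could look roughly like the sketch below: register an eventfd with the ring, watch it with a uv_poll_t, and drain the completion queue whenever it fires. This is only an illustration of the approach discussed here, not the code from the prototype branch:

```c
/* Sketch: io_uring eventfd notifications delivered through uv_poll_t. */
#include <liburing.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <uv.h>

static struct io_uring ring;
static uv_poll_t ring_poll;
static int ring_efd;

static void on_ring_readable(uv_poll_t *handle, int status, int events) {
  struct io_uring_cqe *cqe;
  uint64_t cnt;
  (void) handle; (void) status; (void) events;

  read(ring_efd, &cnt, sizeof(cnt));             /* reset the eventfd */
  while (io_uring_peek_cqe(&ring, &cqe) == 0) {  /* drain the CQ ring */
    /* dispatch cqe->user_data / cqe->res to the waiting request here */
    io_uring_cqe_seen(&ring, cqe);
  }
}

int attach_ring(uv_loop_t *loop) {
  if (io_uring_queue_init(256, &ring, 0) < 0) return -1;

  ring_efd = eventfd(0, EFD_NONBLOCK);
  if (ring_efd < 0) return -1;

  /* Ask the kernel to signal the eventfd on every completion. */
  if (io_uring_register_eventfd(&ring, ring_efd) < 0) return -1;

  uv_poll_init(loop, &ring_poll, ring_efd);
  return uv_poll_start(&ring_poll, UV_READABLE, on_ring_readable);
}
```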

@zbjornson W.r.t. libuv integration, I expect you want to feature-detect in uv__platform_loop_init() in src/unix/linux-core.c and then either set up io_uring or fall back to epoll.

You can't change sizeof(uv_loop_t) because of ABI, and it's already kind of crowded in there, but you can turn int backend_fd into a union { int fd; void* data; } backend (if I'm not mistaken, there's a 4 byte hole there on 64-bit architectures just waiting to be filled) and use one bit from loop->flags to discriminate between the two.[1] You can then use backend.data to point to dynamically allocated memory.

[1] Currently only bit 0 is in use, for UV_LOOP_BLOCK_SIGPROF.
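
To make the layout trick concrete, a standalone illustration (not libuv source; the struct and the second flag name are made up for the example):

```c
/* Illustration: an int backend fd slot widened to a union, with a spare bit
 * in the flags word recording which member is live. */
#define EX_LOOP_BLOCK_SIGPROF  (1u << 0)  /* bit 0 already taken, per [1] */
#define EX_LOOP_USE_IOURING    (1u << 1)  /* hypothetical new flag */

union loop_backend {
  int fd;       /* classic epoll backend fd */
  void *data;   /* pointer to dynamically allocated io_uring state */
};

struct loop_like {
  unsigned int flags;
  union loop_backend backend;
};

static int loop_backend_fd(const struct loop_like *loop) {
  if (loop->flags & EX_LOOP_USE_IOURING)
    return -1;                 /* callers must go through backend.data */
  return loop->backend.fd;
}
```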

zbjornson added a commit to zbjornson/libuv that referenced this issue May 28, 2019
zbjornson added a commit to zbjornson/libuv that referenced this issue Jun 2, 2019
Currently trying to fix a new test failure that arose from only the preparations for io_uring.

Ref libuv#1947
@zbjornson
Contributor

zbjornson commented Jun 2, 2019

@bnoordhuis thanks for the pointers. Is this what you had in mind?

(as of that commit I'm still working on a test failure. edit: all good - a test was failing because I was writing the io_uring setup status to stderr)

I assume libuv would poll the io_uring fd itself

I haven't figured out how to make this work. The ring fd added with epoll_ctl never triggers the epoll instance. (I'm also not sure what the difference would be between that and an eventfd(2), aside from slightly earlier kernel support and avoiding two syscalls on init.)

zbjornson added a commit to zbjornson/libuv that referenced this issue Jun 2, 2019
Refactoring works. Need to actually use io_uring now.

ref libuv#1947
@zbjornson
Contributor

Can ignore the first part of the previous comment; I opened #2322 for easier discussion.

@axboe

axboe commented Dec 3, 2019

I have read through it, and the io_uring parts look fine to me. It's possible to remove some boilerplate code by adding a liburing dependency (like the setup code, or the ring fill-in), but in all fairness, that code is probably never going to get touched once it's merged. So I don't think that is a big deal, and there's nothing wrong with using the raw interface vs going with liburing. The only real upside I can think of is getting rid of the memory barriers.

I'm not familiar with the libuv code base so I focused on the specific io_uring bits.

@vcaputo

vcaputo commented Nov 26, 2020

@axboe It may be worth considering having applications go through liburing: it provides a convenient boundary for hooking io_uring emulation in userspace, for either pre-io_uring kernel support or development/debugging purposes. liburing itself could implement this, enabled via some environment variable, or it could just be replaced via LD_PRELOAD.

@ioquatix

ioquatix commented Nov 27, 2020

@vcaputo I don't think you can replace liburing with an emulation layer without a huge amount of impedance mismatch.

I also don't think it's necessary to pull in liburing, and I think a lot of people will refer to this code base to see how to do it without liburing, so there is value in going straight to the system calls, etc.

@concatime
Contributor

concatime commented Nov 27, 2020

I think liburing makes sense for apps like QEMU or nginx. This way, you have an abstraction layer between the app and io_uring. In our case, I think libuv should deal directly with io_uring because libuv is already an abstraction layer by itself. And it's one less dependency at compile time and runtime :)
Also, like @ioquatix said, it would be nice to have a code base to refer to for how to deal with io_uring without liburing.

EDIT: After reading @axboe's comment below, I changed my mind; using liburing is the de facto way to interact with io_uring. It keeps libuv's code simple. And I would trust @axboe's liburing more than any other io_uring code. Thank you :)

@vcaputo

vcaputo commented Nov 27, 2020

Have you looked at the liburing code? It's just a thin veneer over the syscall interface; it's not even 1000 lines of .c files.

The value, IMHO, of having most stuff go through it is that it's a single point for implementing things like an emulation layer for compatibility or debugging, and I strongly disagree with the claim of a "huge amount of impedance mismatch" pertaining to emulation.

For ages there was protest against adding such asynchronous syscall interfaces to the Linux kernel, because they never did anything that couldn't be done perfectly well from userspace after NPTL landed.

Sure, now that syscalls are more expensive it won't be as fast when emulated, but there are no major barriers or contortions necessary to make it work at all. The submission queue maps to threads, the threads perform the syscalls instead of the kernel, and the results get serialized and pumped out the completion queue. No. Big. Deal. Now applications written for just liburing can work on any kernel, and libuv targeting liburing could be both developed on and tested against pre-io_uring kernels, or hell, even non-Linux systems like OSX.

That sort of layer belongs in liburing, and if it doesn't hurt libuv significantly to go through liburing, my vote is to do the same, since many applications use libuv.

Just my $.02, I don't really have a dog in this race otherwise.

@jorangreef
Contributor

I would say: if you can use liburing, then go for it. liburing has solved some tricky bugs in the past, and there are a few surprising edge cases you need to get right if you want to do it yourself. liburing is essentially the test suite for the kernel, so it's a pretty good reference implementation.

@jobs-git

jobs-git commented Jun 5, 2021

What happened here? nginx already has io_uring support.


@jobs-git

jobs-git commented Jun 8, 2021

liburing is made by the creator of io_uring, so...

@axboe

axboe commented Jun 9, 2021

liburing should just be considered a reference implementation. I've written a few things that use the raw interface, so that's quite possible too. Or even a mix - using liburing for the ring setup to avoid a bunch of boilerplate code, then using the raw interface after that.

It's possible to be slightly faster with the raw interface. As with any kind of API, liburing will add some fat to the middle and have some indirection, as well as catering to cases that any one specific user may not care about. In terms of API stability, both the kernel and liburing won't change (on purpose...).

That said, unless there's a good reason not to, I'd probably just stick with liburing.

@jorangreef
Contributor

And liburing also has a great, clean design. We've had a few comments from people looking at @ziglang's io_uring design, and we just say: well, this is just what @axboe did! Take a look at liburing...

@amirouche

Take a look at liburing...

https://github.com/axboe/liburing

@tontinton

Wouldn't this be a big step for libuv?
I don't get why it's been on hold for so long...

@xorgy

xorgy commented Sep 13, 2022

Wouldn't this be a big step for libuv? I don't get why it's been on hold for so long...

libuv needs more maintainers, and maybe those maintainers should be getting paid pretty well given how much money people are making with libuv. The PR could probably use rebasing at this point too, but it can be used as-is by anyone brave enough, so give it a shot. :)

@oerdnj
Contributor

oerdnj commented Feb 23, 2023

I’ve recently stumbled upon this: https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023 and my employer (ISC) would be willing to sponsor io_uring support in libuv. We are not a big tech company, so reasonable proposals only ;). I think you might know how to reach me via email.

@bnoordhuis
Member

@oerdnj I'll be in touch. It's something I've been chipping away at for some time, if ever so slowly.

@plajjan

plajjan commented May 18, 2023

Awesome work in #3952, really cool that this is happening. #3952 is focused on fs operations. Are there plans (ISC sponsorship? @oerdnj) that cover network sockets too?

@oerdnj
Contributor

oerdnj commented May 19, 2023

Awesome work in #3952, really cool that this is happening. #3952 is focused on fs operations. Are there plans (ISC sponsorship? @oerdnj) that cover network sockets too?

There’s #3979 too. As for the network sockets, we are talking about that, but the benefit seems to be smaller as epoll is already very effective. @bnoordhuis is willing to do the work, but as far as I understood it, it would require new APIs. Anybody is welcome to chime in with sponsoring the work too though.

@bnoordhuis
Member

New APIs - that's right. To work well with io_uring, reading data should be request-based (like writing, connecting, etc. already are) because the "firehose" approach isn't a good fit. Libuv users will be able to opt in to the new behavior.

I have a pretty good idea of what needs to change where, but it's going to be a fair amount of work. Sponsorship welcome.
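
Purely as an illustration of what "request-based reading" could look like (a hypothetical API shape, not an existing or planned libuv interface), the contrast with the current uv_read_start() firehose model is roughly this:

```c
/* Hypothetical API shape only. A request-based read mirrors uv_write(): the
 * caller owns the buffer and asks for exactly one read, which maps naturally
 * onto a single io_uring SQE and a single completion. */
#include <uv.h>

typedef struct my_read_req_s my_read_req_t;
typedef void (*my_read_cb)(my_read_req_t *req, ssize_t nread);

struct my_read_req_s {
  uv_stream_t *handle;
  uv_buf_t buf;          /* caller-owned; must stay valid until the callback */
  my_read_cb cb;
  void *data;
};

/* One read per request, completed by one callback; contrast with
 * uv_read_start(), which keeps delivering data until uv_read_stop(). */
int my_stream_read(my_read_req_t *req, uv_stream_t *handle,
                   const uv_buf_t *buf, my_read_cb cb);
```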

@ocodista

@bnoordhuis is there any other async operation from libuv that could benefit from migration to io_uring?

jeffhhk added a commit to jeffhhk/benchmark_random_io that referenced this issue Feb 12, 2024