
Use io_uring for read/write/fsync on Linux #2322

Open · zbjornson wants to merge 3 commits into master from zbjornson:zb/iouring-2

Conversation

zbjornson (Contributor) commented Jun 3, 2019

TODO:

  • Get read working.
  • Implement write and fsync; identify code to consolidate and clean up.
  • Handle SQ overflow per the TODO in the code -- for now this punts to the threadpool.
  • Before submitting a job, check that the CQ will also have space (compare pending count to capacity).
  • Handle cancellation (#2322 (comment))
  • Q: Are non-atomic seeking operations (lseek+read+lseek) okay? A: No, these are now punted to the threadpool.
  • Replace the eventfd with just using the ring fd.

Ref #1947

CarterLi commented Jun 8, 2019

What about using io_uring directly for polling?

If we must use epoll, what about polling ring_fd directly?

#include <liburing.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main() {
    int epfd = epoll_create(1);

    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_nop(sqe);

    struct epoll_event ev, ret;
    ev.data.fd = ring.ring_fd;
    ev.events = EPOLLIN | EPOLLET;

    printf("%d\n", epoll_ctl(epfd, EPOLL_CTL_ADD, ring.ring_fd, &ev)); // => 0
    printf("%d\n", epoll_wait(epfd, &ret, 1, 500)); // => 0 (nothing submitted yet)

    io_uring_submit(&ring);

    printf("%d\n", epoll_wait(epfd, &ret, 1, 500)); // => 1 (CQE ready: ring_fd is readable)

    close(epfd);
    io_uring_queue_exit(&ring);
    return 0;
}
zbjornson (Contributor) commented Jun 8, 2019

You mean just for file I/O, not replacing epoll entirely (io_uring for file I/O, io_submit for TCP/UDP, idk what else epoll is used for), right? 😟

io_uring_enter(IORING_ENTER_GETEVENTS) has no timeout like epoll_pwait does, which I think means it wouldn't work for us. I'm also not sure how it would work to have more than a single "wait" syscall in the core of the event loop. However, maybe we could do something like:

// in uv__io_poll()
if (QUEUE_EMPTY(watcher_queue)
    && there are pending io_uring requests
    && io_uring fd is not in watcher_queue) {
  add io_uring fd to watcher_queue
  // else save an epoll_ctl and don't add io_uring fd to the epoll instance
}

epoll_ctl(EPOLL_CTL_ADD the fds in watcher_queue);

epoll_pwait();
process fds returned by epoll_pwait
process any io_uring CQEs. If the watcher queue was empty, the epoll_pwait waited on
  the io_uring requests. Otherwise, anything that completed by this point just happened
  to complete during the wait period.

if (no more requests pending in io_uring) {
  remove io_uring fd from watcher_queue
} else if (QUEUE_EMPTY(watcher_queue)) {
  schedule loop again for io_uring requests
}

That saves only zero, one, or two epoll_ctl calls per loop iteration, regardless of how many calls the user makes to uv_fs_read/write/fsync (i.e. there's a single fd for all io_uring requests).

what about polling ring_fd directly?

Thanks for that code sample. I had tried this earlier unsuccessfully, and now see what I did wrong.

CarterLi commented Jun 8, 2019

You mean just for file I/O, not replacing epoll entirely (io_uring for file I/O, io_submit for TCP/UDP, idk what else epoll is used for), right? 😟

Sorry for my confusing words; I mean replacing epoll entirely using IORING_OP_POLL_ADD.

io_uring_enter(IORING_ENTER_GETEVENTS) has no timeout like epoll_pwait does.

Yes. Only poll_add is cancelable. We can simulate a timeout by polling another timerfd: if the timerfd fires before the polled fd does, cancel the poll operation using IORING_OP_POLL_REMOVE.
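
A minimal sketch of that timerfd approach (illustrative, not code from this PR; assumes liburing's 2019-era poll_add/poll_remove API, with error handling elided):

#include <liburing.h>
#include <poll.h>
#include <sys/timerfd.h>
#include <unistd.h>

#define TAG_IO    ((void *) 1)
#define TAG_TIMER ((void *) 2)

/* Returns 1 if fd became readable, 0 on timeout. */
int poll_with_timeout(struct io_uring *ring, int fd, long timeout_ms) {
  struct itimerspec ts = {0};
  struct io_uring_sqe *sqe;
  struct io_uring_cqe *cqe;
  int tfd, timed_out;

  ts.it_value.tv_sec = timeout_ms / 1000;
  ts.it_value.tv_nsec = (timeout_ms % 1000) * 1000000L;
  tfd = timerfd_create(CLOCK_MONOTONIC, 0);
  timerfd_settime(tfd, 0, &ts, NULL);

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_poll_add(sqe, fd, POLLIN);
  io_uring_sqe_set_data(sqe, TAG_IO);

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_poll_add(sqe, tfd, POLLIN);
  io_uring_sqe_set_data(sqe, TAG_TIMER);

  io_uring_submit(ring);
  io_uring_wait_cqe(ring, &cqe);
  timed_out = (io_uring_cqe_get_data(cqe) == TAG_TIMER);
  io_uring_cqe_seen(ring, cqe);

  if (timed_out) {
    /* The timer fired first: cancel the still-pending poll on fd. */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_remove(sqe, TAG_IO);
    io_uring_submit(ring);
  }
  /* A real implementation would also reap the leftover CQEs (the canceled
   * poll completes with -ECANCELED) and cancel the timer poll on success. */
  close(tfd);
  return !timed_out;
}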

Sigh. The old Linux AIO io_getevents does support timeouts and cancellation. I just tweeted @axboe about how to cancel operations other than poll_add, and got his reply:

You can't, and that is deliberate. It's basically impossible to add support for cancel of read/write requests, since ultimately it would need hardware support. One exception would be for linked requests where the linked request isn't started yet. But I suspect that's not common.

Sockets is different, we could support cancel on those. For filesystem io, it’s very different.

io_uring has other known limitations: IORING_OP_READV doesn't work on ttys, timerfds, eventfds and friends, and IORING_OP_POLL_ADD hangs forever on signalfd, which may be a bug.

zbjornson (Contributor) commented Jun 8, 2019

Oh! I like that. Will work on it this weekend.

Do we need cancellation? (What needs to be cancelled?)

CarterLi commented Jun 9, 2019

Polling a timerfd seems like a hack. Since a kernel that supports io_uring must also support IOCB_CMD_POLL, using IOCB_CMD_POLL could be another option (we would still need to poll io_uring events though). The bug workaround could also be removed.

Do we need cancellation? (What needs to be cancelled?)

All operations that may not start immediately (i.e. that return EAGAIN) need cancellation, especially for sockets.

EDIT: just wrote timedwait_event in C++. Please see if it helps: https://github.com/CarterLi/liburing-http-demo/blob/master/coroutine.hpp#L190

bnoordhuis (Member) left a review:

This is really nice, thanks for submitting this!

Inline comments (since resolved) on src/unix/fs.c and src/unix/internal.h; one of them quoted this TODO from the diff:
* TODO what do we do if the user submits more than this number of requests at
* once? We can't start draining the CQ until the loop goes around, so do we
* need to overflow into dynamically allocated space and wait to submit the
* overflow entries until the queues have space?

bnoordhuis (Member) replied Jun 10, 2019:

Yes, this is a problem. I suppose as a stop-gap measure you could return UV_ENOMEM and fall back to the thread pool in fs.c.

Further inline comments on src/unix/linux-core.c, src/unix/linux-iouring.c, src/unix/linux-iouring.h, and src/unix/linux-syscalls.c.
CarterLi commented Jun 10, 2019

I'm suggesting not putting liburing code into libuv's codebase. Besides the differing coding styles, liburing is being actively developed, and it's hard to track its changes (for example, high-priority bug fixes and incoming features such as IORING_OP_SENDMSG).

Since liburing uses git too, I suggest just making it a submodule. That way, Linux distributions can package liburing as a dependency of libuv, which results in a smaller binary.

zbjornson (Contributor) commented Jun 10, 2019

@CarterLi

I'm suggesting not putting liburing code into libuv's codebase.

liburing is LGPL, which AFAIK means we can't look at it. I re-derived the library code in this PR based on the description in http://kernel.dk/io_uring.pdf and the man pages. (I was tempted to ask Jens about re-licensing it, as re-deriving it was a painful task.) I don't think dynamic linking is an option either.


@bnoordhuis thank you for the helpful comments! Update coming in a day or three. But real quick, do these three specific lines look correct-ish? https://github.com/libuv/libuv/pull/2322/files#diff-6a16903c26af4b4035eda9922a73ecc9R1519 I haven't made another attempt at debugging yet, but I was having a problem that presumably originates at that point (more details in the OP).

(Also, please speak up if you don't like force-pushed branches in PRs, otherwise I'll force-push changes to keep things tidy.)

CarterLi commented Jun 10, 2019

liburing is LGPL, which AFAIK means we can't look at it.

Oh, sorry, I missed that.

bnoordhuis (Member) commented Jun 11, 2019

do these three specific lines look correct-ish? https://github.com/libuv/libuv/pull/2322/files#diff-6a16903c26af4b4035eda9922a73ecc9R1519

Ah no, they don't. handle is a stack-allocated uv_poll_t, not a pointer. You're making a copy of backend_data->ring->poll_handle but that won't work.

zbjornson force-pushed the zbjornson:zb/iouring-2 branch 3 times, most recently from c70c049 to 732d551, Jun 16, 2019
zbjornson (Contributor) commented Jun 18, 2019

@bnoordhuis two more impl questions I ran into...

  1. io_uring does not (and per Jens will not) support seeking (current-position) ops; it only supports positional ops. In this PR I'm using lseek to get and set the file cursor (see the sketch after this list). I think that would work for Node.js, because it's documented as not being safe to call fs.read/write on the same file without waiting for the callback, but it's not pretty. Is this okay, or should seeking ops use the threadpool with readv()? I'm hesitant to use the threadpool for them, given that fs.read/write default to "current position".

  2. I think I need one private bit in uv_fs_s to track that the req is using io_uring, so that I can specialize its handling in uv_cancel (intending to set req->result = UV_CANCELLED in uv_cancel()). I'm not seeing anything available, aside from turning something into a union. Advice?
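
A minimal sketch of the lseek approach from item 1 (illustrative, not the PR's actual code; the helper name is made up and error paths are elided):

#include <errno.h>
#include <liburing.h>
#include <sys/uio.h>
#include <unistd.h>

/* Emulate a current-position read with io_uring's positional readv. */
static int uv__iouring_read_cur(struct io_uring* ring, int fd,
                                const struct iovec* bufs, unsigned nbufs) {
  struct io_uring_sqe* sqe;
  off_t off;

  off = lseek(fd, 0, SEEK_CUR);   /* get the current file cursor */
  if (off == -1)
    return -errno;                /* unseekable fd: punt to the threadpool */

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_readv(sqe, fd, bufs, nbufs, off);  /* positional read at off */
  io_uring_submit(ring);

  /* In the completion handler, advance the cursor past the bytes read:
   *   if (cqe->res > 0) lseek(fd, off + cqe->res, SEEK_SET);
   */
  return 0;
}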

Still have more comments from your first review to get through, but this PR is now functional for reads. Thanks!

zbjornson force-pushed the zbjornson:zb/iouring-2 branch 3 times, most recently from 47254e9 to c56be46, Jun 18, 2019
bnoordhuis (Member) commented Jun 18, 2019

@zbjornson

  1. I don't think lseek would work for concurrent writes. Non-positional writes below a certain size are atomic (i.e., they may interleave with other writes but writer A won't overwrite data from writer B) but that kind of overlap is definitely possible with lseek + write. Linux makes pretty strong atomicity guarantees and no doubt some libuv users depend on that.

Interesting side question: how does io_uring interact with writes to files opened in O_APPEND mode? I assume the position argument is effectively ignored?

That's how lseek(SEEK_SET) works on O_APPEND file descriptors - it changes the read position but writes still go at the end.

  2. uv_fs_t has void* reserved[6] (see UV_REQ_FIELDS) that you can use (sketch below).
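
For example (illustrative; the slot index, marker value, and helper names are arbitrary):

#include "uv.h"

/* Mark a uv_fs_t as io_uring-backed via one of the reserved slots from
 * UV_REQ_FIELDS, so uv_cancel() can special-case it. */
#define UV__FS_REQ_IOURING ((void*) 1)

static void uv__fs_req_mark_iouring(uv_fs_t* req) {
  req->reserved[0] = UV__FS_REQ_IOURING;
}

static int uv__fs_req_is_iouring(const uv_fs_t* req) {
  return req->reserved[0] == UV__FS_REQ_IOURING;
}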

I'll try to re-review later today / this week.

zbjornson (Contributor) commented Jun 18, 2019

how does io_uring interact with writes to files opened in O_APPEND mode? I assume the position argument is effectively ignored?

Appears to be ignored, yes.
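
The underlying O_APPEND behavior is easy to demonstrate with plain syscalls, independent of io_uring:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  int fd = open("log.txt", O_WRONLY | O_APPEND | O_CREAT, 0644);
  lseek(fd, 0, SEEK_SET);   /* moves the file offset...                */
  write(fd, "x", 1);        /* ...but O_APPEND sends the write to EOF  */
  printf("offset after write: %lld\n", (long long) lseek(fd, 0, SEEK_CUR));
  close(fd);
  return 0;
}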

Thanks! (I've been keeping the OP updated with my TODO list, if you want the latest update when you review.)

zbjornson force-pushed the zbjornson:zb/iouring-2 branch 5 times, most recently from db71bda to 99a0bc1, Jun 24, 2019
zbjornson (Contributor) commented Jun 24, 2019

@bnoordhuis this is ready for final review.

IDK why GitHub is saying the branch is out of date. Edit: it fixed itself.

I did not add acquire_fence() and release_fence(). If you still think those would be helpful despite the new code comments explaining each barrier, let me know (although I might call them read_acquire/write_release to avoid suggesting that they emit the mfence instruction on x86-64).
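
A possible shape for such helpers, using compiler builtins (illustrative, not this PR's code); acquire/release thread fences compile to a compiler-only barrier on x86-64, i.e. no mfence:

/* Pair a relaxed load/store with the appropriate fence. */
#define uv__read_acquire()   __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define uv__write_release()  __atomic_thread_fence(__ATOMIC_RELEASE)

/* Typical use against the kernel-shared rings:
 *   head = *cq_head; uv__read_acquire();   // before reading CQEs
 *   uv__write_release(); *sq_tail = tail;  // after filling SQEs
 */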

Unfortunately the need to test if an fd is seekable means a call to uv_fs_read/write/fsync entails calling lseek(2) in addition to io_uring_enter(2) (and zero, one or two epoll_ctl(2) calls).

I haven't benchmarked Node.js built with this yet. Is there a branch of Node.js somewhere that works with libuv 2.x?

I am realizing that there's probably still utility in a separate Node.js addon that exposes more of io_uring's specific functionality than could be reasonably provided by a cross-platform library (registered buffers, registered fds, possibly a faster fs_read/write/fsync that skips the lseek check). Overall this PR came out cleaner than I expected it to though.

zbjornson marked this pull request as ready for review Jun 24, 2019
zbjornson force-pushed the zbjornson:zb/iouring-2 branch from 99a0bc1 to 3d87f89, Jun 24, 2019
zbjornson (Contributor) commented Jun 24, 2019

False-positives are okay though (i.e. it's always safe to use the threadpool), so it might still be an okay optimization. 🤷‍♂

bnoordhuis (Member) left a review:

Getting close!

Inline comments (since resolved) on README.md, docs/src/fs.rst, include/uv.h, src/unix/fs.c, src/unix/linux-iouring.c, and src/unix/linux-syscalls.c.
zbjornson force-pushed the zbjornson:zb/iouring-2 branch 3 times, most recently from 4ee7964 to 6f8c717, Jul 21, 2019
zbjornson force-pushed the zbjornson:zb/iouring-2 branch from 6f8c717 to f7e91d2, Jul 22, 2019
zbjornson force-pushed the zbjornson:zb/iouring-2 branch from f7e91d2 to 9098e63, Jul 22, 2019
zbjornson (Contributor) commented Jul 22, 2019

Sorry for the delay. I think this is ready.

All io_uring reqs that fail with EINVAL are retried on the threadpool. This presumably means reads/writes on pipes and streams will have higher latency. IDK if we want to assume STDIN/OUT/ERR_FILENO are usually streams and always send those to the threadpool? (See the last few comments before this one.)
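
A condensed sketch of that retry-on-EINVAL path (illustrative names; assumes libuv's internal uv__work_submit/uv__fs_work/uv__fs_done threadpool entry points in src/unix/fs.c):

/* On completion, detect EINVAL (an fd without ->read_iter/->write_iter
 * support) and redo the whole request on the threadpool. */
static void uv__iouring_fs_completed(uv_loop_t* loop, struct io_uring_cqe* cqe) {
  uv_fs_t* req = io_uring_cqe_get_data(cqe);

  if (cqe->res == -EINVAL) {
    /* Retry via the existing threadpool path instead of failing the req. */
    uv__work_submit(loop, &req->work_req, UV__WORK_FAST_IO,
                    uv__fs_work, uv__fs_done);
    return;
  }

  req->result = cqe->res;
  uv__fs_done(&req->work_req, 0);  /* runs the user's callback */
}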

Is there a Node.js branch that works with libuv 2.x? I'd like to test and benchmark there before this is merged.

Edit 3-Sep: fixed the broken Windows build by adding a uv__platform_work_cancel impl. Not sure where the best place is for that code to live.

axboe commented Sep 23, 2019 (quoting zbjornson's earlier comment):

I need to sign off for the night, but I guess a better solution is for the completion handler to always retry EINVAL results with the threadpool impl. AFAIK there's no way from userspace to check for read_iter/write_iter; trying lseek was a (bad) approximation of that. Retrying would save the upfront cost of trying to check fd capabilities with a syscall, but add some overhead to reqs that have to use the threadpool impl. Could add a fast-path that checks if file == STDIN/OUT/ERR_FILENO and skips io_uring for those reqs. (Sockets shouldn't be using the fs_ APIs.)

Better late than never, but I queued up non ->read/write_iter() support and will push it for 5.4:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.4/io_uring&id=32960613b7c3352ddf38c42596e28a16ae36335e

and liburing has a test case for it as well:

http://git.kernel.dk/cgit/liburing/commit/?id=556960942eaa69fd53544932f00db3fa9f196e00

If you want to detect support for this at load time, we can also add a feature flag for it, similar to what we have for the single mmap:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.4/io_uring&id=ac90f249e15cd2a850daa9e36e15f81ce1ff6550
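
Load-time detection would look roughly like this (raw syscall shown; assumes kernel headers new enough to define SYS_io_uring_setup and the params features field; IORING_FEAT_SINGLE_MMAP is the flag from the commit above, and a flag for non-iter fd support would be analogous):

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the kernel supports mapping SQ+CQ with a single mmap. */
int uv__iouring_probe_single_mmap(void) {
  struct io_uring_params p;
  int fd;

  memset(&p, 0, sizeof(p));
  fd = syscall(SYS_io_uring_setup, 1, &p);  /* minimal one-entry ring */
  if (fd < 0)
    return 0;                               /* io_uring unavailable */
  close(fd);
  return (p.features & IORING_FEAT_SINGLE_MMAP) != 0;
}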

wadetb (Contributor) commented Sep 26, 2019

Hi there, this work seems to have a lot of potential! I have two questions about this patch and io_uring in general:

  1. What is the potential for using io_uring for socket I/O in addition to FS?

In libuv, sockets don't currently require a threadpool, but I presume io_uring is still applicable from a syscall reduction perspective. I see that in the patch, the io_uring code is factored out of the filesystem code, indicating that it could be used for other purposes. How difficult would it be to extend this patch to support sockets?

  2. How does io_uring deal with paging?

When a write or send syscall is invoked, I believe the kernel synchronously locks or copies the pages into kernel memory. Does io_uring somehow ensure that data pointers are locked in physical memory until the IO completes?

axboe commented Sep 26, 2019

I can't comment on the libuv addition, I'll just say that io_uring does support sendmsg/recvmsg for sockets.

For any buffered write, io_uring or otherwise, the data is indeed copied to page cache pages first. So there's no paging of that memory.

axboe commented Sep 26, 2019

Ditto for send()/sendmsg(), for what it's worth; only zero-copy or O_DIRECT would behave differently.

axboe commented on src/unix/linux-core.c in 9098e63, Dec 3, 2019

From Linux 5.5, we have flexible CQ ring sizing and overflow support. This means that you can size the CQ ring independently of the SQ ring, and that you can rely on the fact that the CQ ring will never overflow.

In terms of submission, you should never run out of entries as you can simply submit pending IO if you can't retrieve a new sqe.
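
That submission pattern is straightforward with liburing (a sketch following the advice above):

#include <liburing.h>

/* If the SQ has no free entries, submit what's pending and try again. */
static struct io_uring_sqe* uv__get_sqe(struct io_uring* ring) {
  struct io_uring_sqe* sqe;

  sqe = io_uring_get_sqe(ring);
  if (sqe == NULL) {
    io_uring_submit(ring);          /* hand pending SQEs to the kernel */
    sqe = io_uring_get_sqe(ring);   /* slots should now be free */
  }
  return sqe;  /* may still be NULL if submission failed */
}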

axboe commented on src/unix/linux-core.c in 9098e63, Dec 3, 2019

Should also work fine in later kernels. But I guess it doesn't really matter, since you have a viable fallback alternative in the threadpool.

stale bot commented Dec 25, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Dec 25, 2019
rektide commented Dec 27, 2019

Please keep this issue open stale bot. This is epic work that could greatly help libuv.

axboe commented Jan 24, 2020

Gentle nudge on this again. As before, do let me know if there's anything I can do to help move this forward!

zbjornson (Contributor) commented Jan 24, 2020

Thanks @axboe .

I have a few updates I can work on since io_uring has gained new features, but they're not strictly necessary for merging this.

@bnoordhuis since liburing is now MIT-licensed, can I remove the custom io_uring code here and just pull that in as a vendored dependency? That'd be nice for maintainability -- liburing detects single mmap kernel support (saving a syscall), for example.

bnoordhuis (Member) commented Jan 24, 2020

Yes, that's acceptable provided it's linked in 'hidden static' fashion. Downstream libuv consumers shouldn't be able to see (or more accurately: clash with) liburing's symbols.
