Use io_uring for read/write/fsync on Linux #2322
What about using io_uring directly for polling?
If we must use epoll, what about polling ring_fd directly?

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <liburing.h>

int main() {
  int epfd = epoll_create(1);

  struct io_uring ring;
  io_uring_queue_init(32, &ring, 0);

  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_nop(sqe);

  struct epoll_event ev, ret;
  ev.data.fd = ring.ring_fd;
  ev.events = EPOLLIN | EPOLLET;
  printf("%d\n", epoll_ctl(epfd, EPOLL_CTL_ADD, ring.ring_fd, &ev)); // => 0
  printf("%d\n", epoll_wait(epfd, &ret, 1, 500)); // => 0, nothing submitted yet

  io_uring_submit(&ring);
  printf("%d\n", epoll_wait(epfd, &ret, 1, 500)); // => 1, ring_fd becomes readable once the CQE arrives

  close(epfd);
  io_uring_queue_exit(&ring);
}
You mean just for file I/O, not replacing epoll entirely (io_uring for file I/O, io_submit for TCP/UDP, idk what else epoll is used for), right? 😟
// in uv__io_poll()
if (QUEUE_EMPTY(watcher_queue)
    && there are pending io_uring requests
    && io_uring fd is not in watcher_queue) {
  add io_uring fd to watcher_queue
  // else save an epoll_ctl and don't add io_uring fd to the epoll instance
}
epoll_ctl(EPOLL_CTL_ADD the fds in watcher_queue);
epoll_pwait();
process fds returned by epoll_pwait
process any io_uring CQEs. If the watcher queue was empty, the epoll_pwait waited on
  the io_uring requests. Otherwise, anything that completed by this point just happened
  to complete during the wait period.
if (no more requests pending in io_uring) {
  remove io_uring fd from watcher_queue
} else if (QUEUE_EMPTY(watcher_queue)) {
  schedule loop again for io_uring requests
}

That only saves zero, one or two epoll_ctl calls.
Thanks for that code sample. I had tried this earlier unsuccessfully, and now see what I did wrong.
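A condensed C sketch of the lazy-registration idea in the pseudocode above; the helper name, the pending counter and the registered flag are illustrative, not part of this PR:

#include <sys/epoll.h>
#include <liburing.h>

/* Sketch: keep the ring fd in the loop's epoll set only while io_uring
 * requests are in flight, so idle loops pay no extra epoll_ctl cost. */
static void update_ring_fd_registration(int epfd,
                                         struct io_uring* ring,
                                         int pending,
                                         int* registered) {
  struct epoll_event ev;

  if (pending > 0 && !*registered) {
    ev.events = EPOLLIN;
    ev.data.fd = ring->ring_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, ring->ring_fd, &ev);
    *registered = 1;
  } else if (pending == 0 && *registered) {
    epoll_ctl(epfd, EPOLL_CTL_DEL, ring->ring_fd, NULL);
    *registered = 0;
  }
}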
Sorry for my confusing words; I meant replacing epoll entirely using IORING_OP_POLL_ADD.
Yes. Only poll_add is cancelable. We can simulate a timeout by polling another timerfd. If the timerfd fires before the polled fd does, cancel the polling operation using IORING_OP_POLL_REMOVE. Sigh. The old Linux AIO io_getevents does support timeouts and canceling. I just tweeted @axboe about how to cancel operations other than poll_add and got his reply:
io_uring has other known limitations. IORING_OP_READV doesn't work on tty, timerfd, eventfd and friends. IORING_OP_POLL_ADD hangs forever on signalfd, which may be a bug.
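For reference, a minimal standalone sketch of the timerfd-based timeout trick described above (not part of this PR; the helper name and error handling are illustrative, and it uses the older liburing poll_remove signature that takes the target user_data as a pointer):

#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/timerfd.h>
#include <liburing.h>

/* Poll `fd` for readability, but give up after `seconds`.
 * Returns 1 if the fd became readable, 0 on timeout, -1 on error. */
static int wait_readable_or_timeout(struct io_uring* ring, int fd, int seconds) {
  struct itimerspec its;
  struct io_uring_sqe* sqe;
  struct io_uring_cqe* cqe;
  int winner;
  int tfd;

  tfd = timerfd_create(CLOCK_MONOTONIC, 0);
  memset(&its, 0, sizeof(its));
  its.it_value.tv_sec = seconds;
  timerfd_settime(tfd, 0, &its, NULL);

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_poll_add(sqe, fd, POLLIN);
  sqe->user_data = 1;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_poll_add(sqe, tfd, POLLIN);
  sqe->user_data = 2;

  io_uring_submit(ring);

  if (io_uring_wait_cqe(ring, &cqe) < 0) {
    close(tfd);
    return -1;
  }
  winner = (int)cqe->user_data;
  io_uring_cqe_seen(ring, cqe);

  /* Cancel whichever poll lost the race; its -ECANCELED CQE (and the CQE for
   * the poll_remove itself) still need to be reaped by the caller. */
  sqe = io_uring_get_sqe(ring);
  io_uring_prep_poll_remove(sqe, (void*)(uintptr_t)(winner == 1 ? 2 : 1));
  io_uring_submit(ring);

  close(tfd);
  return winner == 1 ? 1 : 0;
}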
Oh! I like that. Will work on it this weekend. Do we need cancellation? (What needs to be cancelled?)
Polling a timerfd seems like a hack. Since a kernel that supports io_uring must also support IOCB_CMD_POLL, using IOCB_CMD_POLL could be another option (we would still need to poll for io_uring events, though). The bug workaround could also be removed.
All operations that may not start immediately (i.e. that would return EAGAIN) need cancellation, especially socket operations. EDIT: I just wrote timedwait_event in C++. Please see if it helps: https://github.com/CarterLi/liburing-http-demo/blob/master/coroutine.hpp#L190
bnoordhuis left a comment:
This is really nice, thanks for submitting this!
src/unix/linux-core.c (outdated):
 * TODO what do we do if the user submits more than this number of requests at
 * once? We can't start draining the CQ until the loop goes around, so do we
 * need to overflow into dynamically allocated space and wait to submit the
 * overflow entries until the queues have space?
Yes, this is a problem. I suppose as a stop-gap measure you could return UV_ENOMEM and fall back to the thread pool in fs.c.
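A rough sketch of that stop-gap, assuming a hypothetical helper in linux-iouring.c; the function name and the exact SQE field assignments are illustrative, not the PR's actual code:

#include <stdint.h>
#include <liburing.h>
#include "uv.h"

/* Hypothetical helper: if the submission queue is full, report UV_ENOMEM so
 * the caller in fs.c can fall back to the thread pool instead. */
static int uv__try_iouring_submit(struct io_uring* ring,
                                  uv_fs_t* req,
                                  uint8_t opcode,
                                  int file,
                                  const uv_buf_t* bufs,
                                  unsigned nbufs,
                                  int64_t off) {
  struct io_uring_sqe* sqe;

  sqe = io_uring_get_sqe(ring);
  if (sqe == NULL)
    return UV_ENOMEM;  /* SQ full: punt this request to the thread pool */

  sqe->opcode = opcode;
  sqe->fd = file;
  sqe->off = off;
  sqe->addr = (uint64_t)(uintptr_t)bufs;  /* uv_buf_t matches struct iovec on Unix */
  sqe->len = nbufs;
  sqe->user_data = (uint64_t)(uintptr_t)req;

  if (io_uring_submit(ring) < 0)
    return UV_ENOMEM;

  return 0;
}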
I'm suggesting not putting liburing code into libuv's codebase. Besides the different coding style, liburing is being actively developed and it's hard to track its changes (for example, high-priority bug fixes and incoming features such as IORING_OP_SENDMSG). Since liburing uses git too, I suggest you just make it a submodule. That way Linux distributions can make liburing a dependency package of libuv, which results in a smaller binary.
liburing is LGPL, which AFAIK means we can't look at it. I re-derived the library code in this PR based on the description in http://kernel.dk/io_uring.pdf and the man pages. (I was tempted to ask Jens about re-licensing it as it was a painful task.) I don't think dynamic linking is an option either. @bnoordhuis thank you for the helpful comments! Update coming in a day or three. But real quick, do these three specific lines look correct-ish? https://github.com/libuv/libuv/pull/2322/files#diff-6a16903c26af4b4035eda9922a73ecc9R1519 I haven't made another attempt at debugging yet, but I was having a problem that presumably originates at that point (more details in the OP). (Also, please speak up if you don't like force-pushed branches in PRs, otherwise I'll force-push changes to keep things tidy.)
Oh, sorry I missed that.
Ah no, they don't.
@bnoordhuis two more impl questions I ran into...
Still have more comments from your first review to get through, but this PR is now functional for reads. Thanks!
Interesting side question: how does io_uring interact with writes to files opened in O_APPEND mode? I assume the position argument is effectively ignored? That's how pwrite(2) behaves.
I'll try to re-review later today / this week.
Appears to be ignored, yes. Thanks! (I've been keeping the OP updated with my TODO list, if you want the latest update when you review.)
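As a quick standalone check of that behaviour (not part of the PR): on Linux, positional writes to an fd opened with O_APPEND land at the end of the file regardless of the offset passed, which matches what was observed for io_uring above.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  int fd = open("/tmp/append-test", O_CREAT | O_TRUNC | O_WRONLY | O_APPEND, 0644);

  write(fd, "aaaa", 4);
  /* The offset of 0 is ignored because of O_APPEND; "bbbb" lands at offset 4. */
  pwrite(fd, "bbbb", 4, 0);

  printf("file size: %lld\n", (long long)lseek(fd, 0, SEEK_END));  /* => 8 */
  close(fd);
  return 0;
}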
@bnoordhuis this is ready for final review.
I did not add …
Unfortunately, the need to test whether an fd is seekable means a call to lseek(2). I haven't benchmarked Node.js built with this yet. Is there a branch of Node.js somewhere that works with libuv 2.x?
I'm realizing that there's probably still utility in a separate Node.js addon that exposes more of io_uring's specific functionality than could reasonably be provided by a cross-platform library (registered buffers, registered fds, possibly a faster fs_read/write/fsync that skips the lseek check). Overall this PR came out cleaner than I expected, though.
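For context on that lseek check, a minimal sketch (hypothetical function name, not the PR's code) of how seekability can be probed; pipes, sockets and, on Linux, ttys fail with ESPIPE, so their requests would stay on the thread pool:

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: probe whether `fd` supports positional I/O and, if it
 * does, report its current offset (pre-5.4 io_uring needs an explicit offset).
 * Returns 1 if seekable, 0 if not (ESPIPE), or a negative errno on error. */
static int fd_current_offset(int fd, int64_t* off) {
  off_t pos = lseek(fd, 0, SEEK_CUR);
  if (pos == (off_t)-1)
    return errno == ESPIPE ? 0 : -errno;
  *off = pos;
  return 1;
}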
@zbjornson I've been looking into the cancellation test. I think the problem is that once the fs op has been submitted, it cannot be cancelled unless … Looking at how … I think the solution, until proper io_uring cancellation is implemented, would be just returning … An issue we would have implementing …
What about this?

int uv__platform_work_cancel(uv_req_t* req) {
  if (req->type != UV_FS) return UV_ENOSYS;

  uv_fs_t* fs_req = (uv_fs_t*)req;
  if (fs_req->priv.fs_req_engine != UV__ENGINE_IOURING) return UV_UNKNOWN;

  struct uv__backend_data_io_uring* backend_data = fs_req->loop->backend.data;
  struct io_uring* ring = &backend_data->ring;
  struct io_uring_sqe* sqe = io_uring_get_sqe(ring);

  /* Tag the cancel SQE with a value that can't collide with a request pointer. */
  uint64_t user_data = ~(uint64_t)req;
  io_uring_prep_cancel(sqe, req, 0);
  sqe->user_data = user_data;
  io_uring_submit_and_wait(ring, 1);

  /* IORING_OP_ASYNC_CANCEL should always complete inline. */
  unsigned head;
  struct io_uring_cqe* cqe;
  io_uring_for_each_cqe(ring, head, cqe) {
    if (cqe->user_data == user_data) {
      /* A CQE with user_data == 0 should be ignored. */
      cqe->user_data = 0;
      /* cqe->res of the original fs request should be -ECANCELED if it was truly canceled. */
      return cqe->res;
    }
  }

  return UV_UNKNOWN;  /* the cancel CQE was not found */
}

The problem is: before Linux 5.9, most disk I/O requests are not cancelable unless they have not started yet (e.g. they were requested in a link chain). But with the recent async buffered read changes I think it's doable (the disk-to-kernel data copy is still not cancelable, but the kernel-to-userland copy can be canceled). Haven't tested it yet.
From e28c0e14ee0a685176d210932a345545bc882f66 Mon Sep 17 00:00:00 2001
From: 李通洲 <carter.li@eoitek.com>
Date: Mon, 17 Aug 2020 14:17:58 +0800
Subject: [PATCH] Use io_uring for current-position reads where possible
---
src/unix/internal.h | 1 +
src/unix/linux-core.c | 3 ++-
src/unix/linux-iouring.c | 10 ++++------
3 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/unix/internal.h b/src/unix/internal.h
index ce425cfb..6f5073c1 100644
--- a/src/unix/internal.h
+++ b/src/unix/internal.h
@@ -152,6 +152,7 @@ struct uv__backend_data_io_uring {
int32_t pending;
struct io_uring ring;
uv_poll_t poll_handle;
+ struct io_uring_params params;
};
UV_UNUSED(static int uv__get_backend_fd(const uv_loop_t* loop)) {
diff --git a/src/unix/linux-core.c b/src/unix/linux-core.c
index 4151c25c..cc76df14 100644
--- a/src/unix/linux-core.c
+++ b/src/unix/linux-core.c
@@ -130,8 +130,9 @@ int uv__platform_loop_init(uv_loop_t* loop) {
return UV_ENOMEM;
ring = &backend_data->ring;
+ memset(&backend_data->params, 0, sizeof(backend_data->params));
- rc = io_uring_queue_init(IOURING_SQ_SIZE, ring, 0);
+ rc = io_uring_queue_init_params(IOURING_SQ_SIZE, ring, &backend_data->params);
if (rc != 0) {
uv__free(backend_data);
diff --git a/src/unix/linux-iouring.c b/src/unix/linux-iouring.c
index b0b9e0df..94c2e406 100644
--- a/src/unix/linux-iouring.c
+++ b/src/unix/linux-iouring.c
@@ -40,14 +40,12 @@ int uv__io_uring_fs_work(uint8_t opcode,
if (req->priv.fs_req_engine == UV__ENGINE_THREADPOOL)
return UV_ENOTSUP;
- /* io_uring does not support current-position ops, and we can't achieve atomic
- * behavior with lseek(2). TODO it can in Linux 5.4+
- */
- if (off < 0)
- return UV_ENOTSUP;
-
backend_data = loop->backend.data;
+ /* io_uring supports current-position ops in Linux 5.4+ */
+ if (off < 0 && !(backend_data->params.features & IORING_FEAT_RW_CUR_POS))
+ return UV_ENOTSUP;
+
/* The CQ is 2x the size of the SQ, but the kernel quickly frees up the slot
* in the SQ after submission, so we could potentially overflow it if we
* submit a ton of SQEs in one loop iteration.
--
2.25.1
Nice, I didn't know that. Thanks @CarterLi |
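As a standalone illustration of the feature probe used in that patch (not part of the PR): io_uring_queue_init_params fills in params.features, which can then be tested for IORING_FEAT_RW_CUR_POS (Linux 5.4+), where an offset of -1 means "use and update the file's current position", like read(2)/write(2).

#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void) {
  struct io_uring ring;
  struct io_uring_params params;

  memset(&params, 0, sizeof(params));
  if (io_uring_queue_init_params(32, &ring, &params) < 0) {
    perror("io_uring_queue_init_params");
    return 1;
  }

  /* With IORING_FEAT_RW_CUR_POS, reads/writes submitted with off == -1 use
   * and update the file's current position. */
  printf("current-position I/O supported: %s\n",
         (params.features & IORING_FEAT_RW_CUR_POS) ? "yes" : "no");

  io_uring_queue_exit(&ring);
  return 0;
}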
Small optimization based on my last patch:
From c421e324cbf680bfe8bc7b2278fe4531a8e74f8d Mon Sep 17 00:00:00 2001
From: 李通洲 <carter.li@eoitek.com>
Date: Thu, 20 Aug 2020 10:31:04 +0800
Subject: [PATCH] linux-iouring.c: use non-vectored read/write commands when
possible
---
src/unix/linux-iouring.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/src/unix/linux-iouring.c b/src/unix/linux-iouring.c
index 94c2e406..42bc250d 100644
--- a/src/unix/linux-iouring.c
+++ b/src/unix/linux-iouring.c
@@ -59,11 +59,27 @@ int uv__io_uring_fs_work(uint8_t opcode,
if (sqe == NULL)
return UV_ENOMEM;
- sqe->opcode = opcode;
+ if ((opcode == IORING_OP_READV || opcode == IORING_OP_WRITEV)
+ && nbufs == 1
+ && (backend_data->params.features & IORING_FEAT_CUR_PERSONALITY)) {
+ /*
+ * `IORING_OP_(READ|WRITE)` is faster than `IORING_OP_(READ|WRITE)V`, because
+ * the kernel doesn't need to import iovecs in advance.
+ *
+ * If the kernel supports `IORING_FEAT_CUR_PERSONALITY`, it should support
+ * non-vectored read/write commands too. It's not perfect, but avoids an extra
+ * feature-test syscall.
+ */
+ sqe->opcode = opcode == IORING_OP_READV ? IORING_OP_READ : IORING_OP_WRITE;
+ sqe->addr = (uint64_t)bufs[0].base;
+ sqe->len = bufs[0].len;
+ } else {
+ sqe->opcode = opcode;
+ sqe->addr = (uint64_t)bufs;
+ sqe->len = nbufs;
+ }
sqe->fd = file;
sqe->off = off;
- sqe->addr = (uint64_t)bufs;
- sqe->len = nbufs;
sqe->user_data = (uint64_t)req;
submitted = io_uring_submit(&backend_data->ring);
--
2.25.1
Fixed the bug causing EFAULTs. The benchmark is still hanging because the timer isn't being allowed to run, I think. Need to dig into how timers work vs. poll handles again. Thanks @CarterLi for the patches.
How do you monitor a Linux fd without uv_tcp_t/uv_udp_t? (socket, pipe, tty, ...)
Question for libuv folks please (@santigimeno @bnoordhuis): there's a subloop happening (from the snippet below) that's preventing timers from running if the cb schedules another io_uring req. Should I be putting the completed …
I think that could work.
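One possible reading of the truncated question above is deferring the completed requests rather than running their callbacks while draining the CQ. A hedged sketch of that shape, where uv__iouring_fs_done, the fixed-size buffer and the field usage are placeholders rather than the PR's code:

#include <stdint.h>
#include <liburing.h>
#include "uv.h"

static void uv__iouring_fs_done(uv_loop_t* loop, uv_fs_t* req);  /* hypothetical */

/* Sketch: drain all CQEs into a local list first, then invoke completion
 * callbacks afterwards, so a callback that submits another io_uring request
 * cannot re-enter the drain loop and starve timers. */
static void uv__iouring_process_cqes(uv_loop_t* loop, struct io_uring* ring) {
  struct io_uring_cqe* cqe;
  uv_fs_t* done[256];  /* the CQ ring is bounded, so a fixed buffer works for a sketch */
  unsigned head;
  unsigned n = 0;

  io_uring_for_each_cqe(ring, head, cqe) {
    uv_fs_t* req = (uv_fs_t*)(uintptr_t)cqe->user_data;
    req->result = cqe->res;
    if (n < 256)
      done[n++] = req;
  }
  io_uring_cq_advance(ring, n);

  /* Callbacks run only after the CQ is drained; anything they submit is
   * handled on the next loop iteration, letting timers run in between. */
  for (unsigned i = 0; i < n; i++)
    uv__iouring_fs_done(loop, done[i]);
}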
A small update on this question. The Netty team is working on io_uring support and they have some preliminary benchmark results: netty/netty#10622 (comment). I'm pretty sure they will run more benchmarks and publish more detailed information in the future, and it's too early to make any claims. But, so far, the benchmark shows some benefits of io_uring in comparison with epoll.
@zbjornson I think santigimeno@97f23dd adds support for autotools, and I think #3005 should hide the symbols for autotools. Let me know if it works for you.
@puzpuzpuz from the work I've been doing, replacing epoll with io_uring and only using the …
io_uring backend to be used if the kernel supports it, otherwise epoll will be used. This is a WIP:
- It needs to have liburing installed in the system. I'll update it once PR libuv#2322 lands so that requirement is not needed.
Hi all, I recently stumbled upon this open PR. Some things I have addressed:
I have just tested it on Linux 5.12 with liburing 0.7 and all tests pass so far (with and without liburing) :) There might be room for performance improvements, but for now the focus was on achieving nearly the same behaviour as with the threadpool implementation. Edit: Update: bbara@4a506d2
@bbara awesome, thank you for picking this up.
From #2322 (comment), I think we do want to vendor it in, but one of the maintainers can say for certain. I'll post a few other questions/comments on the commit itself momentarily.
- bind iouring to uv loop
- remove liburing code and use package instead
- encapsulate code in linux-iouring.c under LIBUV_LIBURING define -- set define if liburing package is found by cmake and LIBUV_LIBURING_EXPERIMENTAL cmake option is set
- fix cancellation by submitting work one by one
- add statx
@zbjornson there are some failing branches in the CI, is it ok?
Yo y'all, we should make this happen. Let's get good.
Thanks for the pull request, Zach, but I'm going to close this in favor of #3952. |
TODO:
… pending to capacity). Ref #1947