extstore: support direct I/O and async I/O #703
base: next
Conversation
Thanks! This is fun, I've been wanting to do both of these things for a while but my priorities are elsewhere. Might take a while to review due to time right now. Do you have any performance test results for the various modes, etc? Not a lot of comments in the patch and a lot of new things are going on in there. With io_uring I was hoping to replace the entire event loop so extstore requests don't have to push to a separate thread anymore, but of course that's a lot more work.
In order to make it easy to review, I plan to split this single change into at least three smaller changes.
I usually do performance tests against extstore intensive read-only workloads.
The ext_path is a file in an ext4 file system on an Intel SSD 900P, with additional extended options.
Thanks for doing this change-set. I've done one pass and left some comments for review.
The usage information is missing in the memcached.c file. Can you add that as well?
extstore.c
Outdated
if (e->ops->thread_init) {
    *res = e->ops->thread_init(&e->io_threads[i]);
    if (*res) {
        free(e);
Free e->pages and e->io_threads before returning NULL. Better yet, introduce a goto label and handle the cleanup there.
Fixed by commit 884e3b1
The change also releases e->wbuf_stack and e->io_stack.
extstore.c
Outdated
if ((e->max_io_size % e->io_align) != 0 || !mc_powerof2(e->io_align)) {
    *res = EXTSTORE_INIT_BAD_IO_SIZE;
    free(e);
Need to free e->pages before returning NULL.
Fixed by commit 884e3b1
extstore.c
Outdated
ret = io_uring_queue_init(e->io_depth, &priv->ring, 0);
if (ret < 0) {
    E_DEBUG("io_uring_queue_init: %s\n", strerror(-ret));
Free priv before returning.
Fixed by commit 884e3b1
extstore.c
Outdated
struct store_io_data *data = malloc(sizeof(*data));

if (e->direct) {
    if (posix_memalign(&data->read_buf, e->io_align, e->max_io_size)) {
Missing the following before returning EXTSTORE_INIT_OOM:
io_uring_queue_exit(&priv->ring);
// free data and data->read_buf that are added to the priv->io_free list.
free(priv);
Fixed by commit 884e3b1
extstore.c
Outdated
static void extstore_io_submit(store_io_thread *t, obj_io *io) {
    if (t->e->ops->submit(t, io))
        t->inflight++;
Missing mutex lock and unlock around t->inflight++?
t->inflight is only updated by the thread t itself, so it doesn't need to be protected by a mutex.
extstore.c
Outdated
static void extstore_io_complete(store_io_thread *t) {
    if (t->e->ops->complete) {
        if (t->e->ops->complete(t))
            t->inflight--;
Missing mutex lock and unlock around t->inflight--?
Same as above.
extstore.c
Outdated
extstore_io_done(t, io, ret);

return false;
This function always returns false and won't increment the inflight counter. Can you please add a comment to struct store_io_thread noting that inflight is used only for async I/O?
Updated comment about inflight in struct store_io_thread:
unsigned int inflight; // number of inflight I/O requests, this is used only for async I/O
extstore.c
Outdated
uint64_t iov_offset;
int fd;
void *read_buf;
struct iovec iov_oneshot;
Can you clarify when we use iov_oneshot vs. the array of iovs?
I've added comment to clarify:
/* this is used when io->iov is NULL or resubmitting after short read/write */
extstore.c
Outdated
data = io_uring_cqe_get_data(cqe);

ret = cqe->res;
if (ret <= 0) {
Shouldn't this be ret < 0?
Presumably ret == 0 can only happen when reading at the end of the file (e.g. the extstore file was accidentally truncated), so I would like to avoid an infinite resubmit loop with this condition.
extstore.c
Outdated
if (done) {
    ret = data->io->len;
} else {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&priv->ring);
For the -EAGAIN error on line 1166, can the request be tried again, similar to lines 1185-1199 where we re-submit the partial request? Can you please clarify this part?
Fixed by commit f2328d4
@mita we are excited about this change. What are the liburing and kernel versions you used?
I got the latest liburing (0.7+) and built it from source on Ubuntu 18.04 LTS.
Thanks! I saw that liburing is updated to match kernel 5.7 and was wondering about its compatibility. Usually it's a good idea to stick with a version, so the behavior is deterministic. You may consider upgrading the Travis builds to use focal/kernel 5.4 with liburing.
@@ -228,6 +234,7 @@ AM_CONDITIONAL([ENABLE_SASL],[test "$enable_sasl" = "yes"])
AM_CONDITIONAL([ENABLE_EXTSTORE],[test "$enable_extstore" != "no"])
AM_CONDITIONAL([ENABLE_ARM_CRC32],[test "$enable_arm_crc32" = "yes"])
AM_CONDITIONAL([ENABLE_TLS],[test "$enable_tls" = "yes"])
AM_CONDITIONAL([ENABLE_LIBURING],[test "$enable_liburing" = "yes"])
If you haven't done so already, it's a good idea to run the extstore tests in async mode. You may do something similar to TLS (https://github.com/memcached/memcached/blob/master/Makefile.am#L128) and run all extstore tests with io_uring.
Sounds good.
Commit 5dee70a introduced a new environment variable which can be used to test with additional memcached server options.
What's memstrike from? I'm still a few days behind being able to look closely. I like the idea of splitting up the patch; I'm very likely to take the direct io patch. I probably want to at least keep the io_uring one around for reference but I had different plans for it. A closer look will help me figure this out though. I feel like there are also too many tunables now. For direct io at least, you'll probably see the biggest differences in performance under mixed read/write workloads. The performance could go either way.
memstrike is a memcached benchmark tool which supports multiple servers and mget, but does not support mixed read/write workloads. https://github.com/frsyuki/memstrike I have split up the direct IO and async IO work into three patches.
Hey, I fell off the earth for a while. This is a pretty involved change so I'm not going to get away with some quick testing :) What would be nice, if you or anyone else wants to: please do mixed read/write workload tests. Use mc-crusher (https://github.com/memcached/mc-crusher) and create some test configs to share here. You can use the test-suites code to design a quick suite of tests which will do ramp-to-failure and other interesting things. I've had to do all this (and more!) while working on larger changes like this. The tunables I'd have to think about one by one; I'm trying pretty hard to not add more of those and keep sensible defaults. It's not quite obvious to me from a skim what those might be. As far as plans go, I'm hoping to have an entire io_uring event loop. Then extstore requests can get executed internally to each worker and we avoid the sidethread push and multiple syscalls on pipes. This code still helps a lot as a point of reference at very least, and I hope I can take the direct io code still.
I've created a test script based on test-suite-example. https://github.com/mita/mc-crusher/blob/direct_io/test-suites/test-direct In addition to the normal extstore test and a RAM test without extstore, this script adds direct IO and direct IO + io_uring test configurations. I agree that extstore requests with io_uring should not be pushed to a separate thread. I plan to work on this.
This adds support for direct I/O on extstore. With `-o ext_direct` at startup, memcached opens the extstore files with the O_DIRECT flag. When direct I/O is used, extstore reads and writes bypass the operating system caches.
This adds support for async I/O on extstore. To build with async I/O support you will have to install the io_uring library (http://git.kernel.dk/cgit/liburing/) and run ./configure --enable-liburing. With `-o ext_io_engine=io_uring` and an I/O depth > 1 (e.g. `-o ext_io_depth=64`), the worker threads read and write directly and asynchronously, without the IO threads.
This adds a new MEMCACHED_EXTRA_ARGS environment variable for running tests with additional memcached server options. A test with direct I/O: MEMCACHED_EXTRA_ARGS="-o ext_direct" make test A test with direct I/O and async I/O: MEMCACHED_EXTRA_ARGS="-o ext_io_engine=io_uring,ext_direct,ext_io_depth=64" make test
The latest version does not use the I/O threads when the io_uring engine is selected.
This table shows a performance comparison between the default sync engine and the io_uring engine.
Benchmark details:
The value of EXTRA_ARGS varies depending on the configuration.
The ext_path is a file in ext4 file system on Intel SSD 900P.
Whoops, I missed this update. Any idea why direct IO is so much slower?
Thanks for looking at the update. The direct IO performance is limited by the drive's random read IOPS.
@mita did you end up using this in a production context anywhere? I've been both too busy and too-allocated to do the work necessary to validate this change, but that might be changing. I finally spent some time with io_uring and have more thoughts on it now as well.
mark
This adds support for direct I/O and async I/O on extstore.
With -o ext_direct at startup, memcached opens the extstore files with the O_DIRECT flag. When direct I/O is used, extstore reads and writes bypass the operating system caches.
To build with async I/O support you will have to install the io_uring library (http://git.kernel.dk/cgit/liburing/) and run ./configure --enable-liburing.
With -o ext_io_engine=io_uring and an I/O depth > 1 (e.g. -o ext_io_depth=64), the IO threads execute each IO object on their queues in an asynchronous manner.
This maximizes I/O throughput with a smaller number of IO threads when direct I/O is used.