
extstore: support direct I/O and async I/O #703

Open · wants to merge 4 commits into base: next

Conversation

@mita commented Jul 31, 2020

This adds support for direct I/O and async I/O on extstore.

With -o ext_direct at startup, memcached opens the extstore files with
the O_DIRECT flag. When direct I/O is used, extstore reads and writes
bypass the operating system caches.

To build with async I/O support you will have to install the io_uring library
(http://git.kernel.dk/cgit/liburing/) and run ./configure --enable-liburing.
With `-o ext_io_engine=io_uring` and an I/O depth > 1
(e.g. `-o ext_io_depth=64`), the IO threads execute each IO object on their
queues in an asynchronous manner.

This maximizes I/O throughput with a smaller number of IO threads when
direct I/O is used.

@dormando (Member) commented

Thanks! This is fun, I've been wanting to do both of these things for a while but my priorities are elsewhere.

Might take a while to review due to time right now. Do you have any performance test results for the various modes, etc? Not a lot of comments in the patch and a lot of new things are going on in there.

With io_uring I was hoping to replace the entire event loop so extstore requests don't have to push to a separate thread anymore, but of course that's a lot more work.

@mita (Author) commented Aug 1, 2020

In order to make it easy to review, I plan to split this single change into at least three smaller changes.

  • Add extstore I/O engine infrastructure
  • Support direct I/O
  • Support async I/O

I usually run performance tests against extstore-intensive read-only workloads.

```
# start memcached
memcached -t 16 -c 48000 -m 40000 -b 4096 \
  -o ext_wbuf_size=64,ext_item_size=240,ext_max_frag=0.80,hashpower=30,no_lru_crawler,ext_recache_rate=0,ext_item_age=5,ext_path=/mnt/ssd/extstore:200G
# mget benchmark (1000 mget, 8 io threads, 256 bytes value, 10000000 requests)
memstrike -d 127.0.0.1:11211 -m 1000 -t 8 -v 256 10000000
```

The ext_path is a file in ext4 file system on Intel SSD 900P.

With the additional extended options `-o ext_threads=4,ext_io_engine=io_uring,ext_direct,ext_io_depth=64`
(direct I/O, async I/O, and four IO threads): 527712.951173 req/sec.
This saturates the SSD bandwidth.

With the additional extended options `-o ext_threads=20,ext_io_engine=sync,ext_direct`
(direct I/O, normal blocking I/O, and twenty IO threads): 483425.397672 req/sec.

@pkarumanchi9 (Contributor) left a comment

Thanks for doing this change-set. I've done one pass and left some comments for review.
The usage information is missing in the memcached.c file. Can you add that as well?

extstore.c Outdated
```c
if (e->ops->thread_init) {
    *res = e->ops->thread_init(&e->io_threads[i]);
    if (*res) {
        free(e);
```
Contributor:

free e->pages and e->io_threads before returning NULL. Better yet introduce goto label and handle this there.

Author:

Fixed by commit 884e3b1
The change also releases e->wbuf_stack and e->io_stack.

extstore.c Outdated

```c
if ((e->max_io_size % e->io_align) != 0 || !mc_powerof2(e->io_align)) {
    *res = EXTSTORE_INIT_BAD_IO_SIZE;
    free(e);
```
Contributor:

need to free e->pages before returning NULL

Author:

Fixed by commit 884e3b1

extstore.c Outdated

```c
ret = io_uring_queue_init(e->io_depth, &priv->ring, 0);
if (ret < 0) {
    E_DEBUG("io_uring_queue_init: %s\n", strerror(-ret));
```
Contributor:

free priv before returning

Author:

Fixed by commit 884e3b1

extstore.c Outdated
```c
struct store_io_data *data = malloc(sizeof(*data));

if (e->direct) {
    if (posix_memalign(&data->read_buf, e->io_align, e->max_io_size)) {
```
Contributor:

Missing the following before returning EXTSTORE_INIT_OOM:

```c
io_uring_queue_exit(&priv->ring);
/* free data and data->read_buf that are added to the priv->io_free list */
free(priv);
```

Author:

Fixed by commit 884e3b1

extstore.c Outdated

```c
static void extstore_io_submit(store_io_thread *t, obj_io *io) {
    if (t->e->ops->submit(t, io))
        t->inflight++;
```
Contributor:

missing mutex lock and unlock around t->inflight++?

Author:

t->inflight is only updated by the thread `t', so it doesn't need to be protected by mutex.

extstore.c Outdated
```c
static void extstore_io_complete(store_io_thread *t) {
    if (t->e->ops->complete) {
        if (t->e->ops->complete(t))
            t->inflight--;
```
Contributor:

missing mutex lock and unlock around t->inflight--?

Author:

Same as above.

extstore.c Outdated

```c
extstore_io_done(t, io, ret);

return false;
```
Contributor:

This function returns false all the time and won't increment the inflight counter. Can you please add a comment to struct store_io_thread that inflight is used only for async IO?

Author:

Updated the comment about inflight in struct store_io_thread:

```c
unsigned int inflight; // number of inflight I/O requests, this is used only for async I/O
```

extstore.c Outdated
```c
uint64_t iov_offset;
int fd;
void *read_buf;
struct iovec iov_oneshot;
```
Contributor:

can you clarify when do we use iov_oneshot vs array of iov

Author:

I've added a comment to clarify:

```c
/* this is used when io->iov is NULL or resubmitting after short read/write */
```

extstore.c Outdated
```c
data = io_uring_cqe_get_data(cqe);

ret = cqe->res;
if (ret <= 0) {
```
Contributor:

shouldn't this be ret < 0?

Author:

Presumably `ret == 0` can only happen when reading past the end of the file (e.g. the extstore file was accidentally truncated).
So I would like to avoid an infinite loop with this condition.

extstore.c Outdated
```c
if (done) {
    ret = data->io->len;
} else {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&priv->ring);
```
Contributor:

For -EAGAIN error on line 1166, can the request be tried again? - similar to lines 1185-1199, where we re-submit the partial request? Can you please clarify this part?

Author:

Fixed by commit f2328d4

@tharanga (Contributor) commented Aug 3, 2020

@mita we are excited about this change. What are the liburing and kernel versions you used?

@mita (Author) commented Aug 4, 2020

> @mita we are excited about this change. What are the liburing and kernel versions you used?

I get the latest liburing (0.7+) and build from source on Ubuntu 18.04 LTS.
The kernel version is 5.4.0-42-generic.

@tharanga (Contributor) commented Aug 4, 2020

> > @mita we are excited about this change. What are the liburing and kernel versions you used?
>
> I get the latest liburing (0.7+) and build from source on Ubuntu 18.04 LTS.
> The kernel version is 5.4.0-42-generic.

Thanks! I saw that liburing was updated to match kernel 5.7 and was wondering about its compatibility. Usually it's a good idea to stick with a version, so the behavior is deterministic. You may consider upgrading the Travis builds to use focal/kernel 5.4 with liburing.

```diff
@@ -228,6 +234,7 @@ AM_CONDITIONAL([ENABLE_SASL],[test "$enable_sasl" = "yes"])
 AM_CONDITIONAL([ENABLE_EXTSTORE],[test "$enable_extstore" != "no"])
 AM_CONDITIONAL([ENABLE_ARM_CRC32],[test "$enable_arm_crc32" = "yes"])
 AM_CONDITIONAL([ENABLE_TLS],[test "$enable_tls" = "yes"])
+AM_CONDITIONAL([ENABLE_LIBURING],[test "$enable_liburing" = "yes"])
```
Contributor:

If you haven't done already, it's a good idea to run extstore tests in the async mode. You may do something similar to TLS (https://github.com/memcached/memcached/blob/master/Makefile.am#L128), run all extstore tests with io_uring.

Author:

Sounds good.

Commit 5dee70a introduced a new environment variable which can be used to test with additional memcached server options.

@dormando (Member) commented Aug 7, 2020

What's memstrike from?

I'm still a few days behind being able to look closely. I like the idea of splitting up the patch; I'm very likely to take the direct io patch. I probably want to at least keep the io_uring one around for reference but I had different plans for it. A closer look will help me figure this out though. I feel like there are also too many tunables now.

For direct io at least, you'll probably see the biggest differences in performance under mixed read/write workloads. The performance could go either way.

@mita (Author) commented Aug 10, 2020

> What's memstrike from?

memstrike is a memcached benchmark tool which supports multiple servers and mget, but does not support mixed read/write workloads.

https://github.com/frsyuki/memstrike

I have split up the direct IO and async IO patch into three patches.

  extstore: support direct I/O
  extstore: introduce extstore IO engine infrastructure
  extstore: support async I/O

@dormando (Member) commented

Hey,

I fell off the earth for a while. This is a pretty involved change so I'm not going to get away with some quick testing :)

What would be nice, if you or anyone else wants to: please do mixed read/write workload tests. Use mc-crusher (https://github.com/memcached/mc-crusher) and create some test configs to share here. You can use the test-suites code to design a quick suite of tests which will do ramp-to-failure and other interesting things. I've had to do all this (and more!) while working on larger changes like this.

The tunables I'd have to think about one by one; I'm trying pretty hard to not add more of those and keep sensible defaults. It's not quite obvious to me from a skim what those might be.

As far as plans go, I'm hoping to have an entire io_uring event loop. Then extstore requests can get executed internally to each worker and we avoid the sidethread push and multiple syscalls on pipes. This code still helps a lot as a point of reference at very least, and I hope I can take the direct io code still.

@mita (Author) commented Nov 24, 2020

I've created a test script based on test-suite-example.

https://github.com/mita/mc-crusher/blob/direct_io/test-suites/test-direct

In addition to the normal extstore test and the RAM test without extstore, this script adds direct IO and direct IO + io_uring test configurations.

I agree that extstore requests with io_uring should not be pushed to a separate thread.  I plan to work on this.

This adds support for direct I/O on extstore.

With `-o ext_direct` at startup, memcached opens the extstore files with
the O_DIRECT flag.  When direct I/O is used, extstore reads and writes
bypass the operating system caches.

This adds support for async I/O on extstore.

To build with async I/O support you will have to install io_uring library
(http://git.kernel.dk/cgit/liburing/) and ./configure --enable-liburing.
With `-o ext_io_engine=io_uring` and an I/O depth > 1
(e.g. `-o ext_io_depth=64`), the worker threads directly read and write
asynchronously without the IO threads.

This adds a new MEMCACHED_EXTRA_ARGS environment variable for running
tests with additional memcached server options.

A test with direct I/O:

```
MEMCACHED_EXTRA_ARGS="-o ext_direct" make test
```

A test with direct I/O and async I/O:

```
MEMCACHED_EXTRA_ARGS="-o ext_io_engine=io_uring,ext_direct,ext_io_depth=64" make test
```
@mita (Author) commented Mar 18, 2021

The latest version does not use the I/O threads when the io_uring engine is selected.
The worker threads directly submit I/O requests to extstore asynchronously.

@mita (Author) commented Mar 19, 2021

This table shows a performance comparison between the default sync engine and the io_uring engine.
The io_uring engine gets better throughput (requests/sec) with a smaller number of client threads.

| client threads | 2 | 4 | 8 | 12 | 16 |
|---|---|---|---|---|---|
| sync engine | 338768 | 712797 | 1145488 | 1351642 | 1323820 |
| io_uring engine | 455830 | 855805 | 1266225 | 1384047 | 1323683 |
| sync engine, direct I/O | 95181 | 177153 | 305078 | 419204 | 490403 |
| io_uring engine, direct I/O | 289145 | 481540 | 507430 | 511633 | 512510 |

Benchmark details:

1. Run memcached:

```
./memcached/memcached -t 16 -m 8000 -o ext_item_age=0,ext_path=/mnt/ssd/extstore:200G $EXTRA_ARGS
```

The value of EXTRA_ARGS varies depending on the configuration.

  • sync engine: -o ext_io_engine=sync,ext_threads=16,ext_io_depth=64
  • io_uring engine: -o ext_io_engine=io_uring,ext_io_depth=64
  • sync engine, direct I/O: -o ext_io_engine=sync,ext_direct,ext_threads=16,ext_io_depth=64
  • io_uring engine, direct I/O: -o ext_io_engine=io_uring,ext_direct,ext_io_depth=64

The ext_path is a file in ext4 file system on Intel SSD 900P.

2. Load 10,000,000 records:

```
./memstrike/memstrike -d '127.0.0.1:11211' -t 16 -v 512 10000000 -s
```

3. Wait for extstore to settle:

```
./mc-crusher/bench-sample 127.0.0.1:11211 0 2 extstore_objects_written
```

4. Run read-only workloads with a varying number of client threads:

```
for i in 2 4 8 12 16; do
  ./memstrike/memstrike -d '127.0.0.1:11211' -m 10 -t $i -v 512 10000000 -g
done
```

@dormando (Member) commented

Whoops I missed this update. Any idea why direct IO is so much slower?

@mita (Author) commented Apr 20, 2021

> Whoops I missed this update. Any idea why direct IO is so much slower?

Thanks for looking at the update.

The direct IO performance is limited by the random read IOPS.
These results were measured with a single Intel SSD 900P.
We'll get better numbers by using multiple devices for extstore.

@dormando (Member) commented

@mita did you end up using this in a production context anywhere?

I've been both too busy and too-allocated to do the work necessary to validate this change, but that might be changing. I finally spent some time with io_uring and have more thoughts on it now as well.

@lidaohang commented

mark
