
Linux AIO Support #2565

Closed
wants to merge 1 commit into from

Conversation

ryao
Contributor

@ryao ryao commented Aug 4, 2014

nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fall back to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That is unnecessary overhead, so we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping its calls to VOP_WRITE,
VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself.
Linux filesystems therefore must implement their own AIO logic and
nearly all of them implement fops->aio_write synchronously.
Consequently, they do not implement aio_fsync(). However, since the ZPL
works by mapping Linux's VFS calls to the functions implementing
Illumos' VFS operations, we instead implement AIO in the kernel by
mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC
equivalents. We therefore implement fops->aio_fsync.

One might be inclined to implement fops->aio_write synchronously to make
software that expects this behavior safe. However, there are several
reasons not to do this:

  1. Other platforms do not implement aio_write() synchronously and since
    the majority of userland software using AIO should be cross platform,
    expectations of synchronous behavior should not be a problem.
  2. We would hurt the performance of programs that use POSIX interfaces
    properly while simultaneously encouraging the creation of more
    non-compliant software.
  3. The broader community concluded that userland software should be
    patched to properly use POSIX interfaces instead of implementing hacks
    in filesystems to cater to broken software. This concept is best
    described as the O_PONIES debate.
  4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will, in the
event of a kernel panic or system failure, suffer the loss of at most
zfs_txg_timeout seconds of writes on ZFSOnLinux (5 seconds by default).
This seems like a reasonable consequence of using non-compliant
software.
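
For illustration, here is a minimal userland sketch (not part of this patch; the file path, buffer size and build command are arbitrary) of the compliant pattern argued for above: a portable program gets durability by following aio_write() with aio_fsync() and waiting for both, rather than assuming aio_write() itself is synchronous.

```c
/* Minimal sketch: portable POSIX AIO usage that does not assume a
 * synchronous aio_write().  Durability comes from aio_fsync().
 * Typical glibc build: cc demo.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	static char buf[4096] = "example data";
	struct aiocb wr = { 0 }, fs = { 0 };

	int fd = open("/tmp/aio-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return 1;

	wr.aio_fildes = fd;
	wr.aio_buf = buf;
	wr.aio_nbytes = sizeof (buf);
	wr.aio_offset = 0;
	if (aio_write(&wr) != 0)
		return 1;

	/* Wait for the asynchronous write to finish... */
	while (aio_error(&wr) == EINPROGRESS)
		usleep(1000);
	if (aio_return(&wr) < 0)
		return 1;

	/* ...then explicitly ask for it to reach stable storage. */
	fs.aio_fildes = fd;
	if (aio_fsync(O_SYNC, &fs) != 0)
		return 1;
	while (aio_error(&fs) == EINPROGRESS)
		usleep(1000);

	close(fd);
	return (aio_return(&fs) == 0) ? 0 : 1;
}
```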

It should be noted that this is also a problem in the kernel itself
where nfsd does not respect O_SYNC on files and assumes synchronous
behavior from do_readv_writev(), even though its fallback clearly does
not enforce it.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure. Exporting any file
system that implements AIO the way this patch does bears similar risk.
However, it seems reasonable to forgo crippling our AIO implementation
in favor of developing patches to fix this problem in Linux's nfsd for
the reasons stated earlier. In the interim, the risk will remain.
Failing to implement AIO will not change the problem that nfsd created,
so there is no reason for nfsd's mistake to block our implementation of
AIO.

Closes:
#223
#2373

Signed-off-by: Richard Yao ryao@gentoo.org

@ryao
Contributor Author

ryao commented Aug 4, 2014

It is worth noting that this passes my local xfstests run. In particular, xfstests generic/246 is a good test of this functionality.

@dswartz
Contributor

dswartz commented Aug 4, 2014

Richard, thanks! I will give this a shot sometime in the next couple of days and see what the performance is like...

@ryao
Contributor Author

ryao commented Aug 4, 2014

It is worth noting that aio_cancel() will always return AIO_NOTCANCELED under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and use kiocb_set_cancel_fn() to set a callback function for cancelling work sent to taskqs, but the simpler approach is sufficient for our current purposes. Vectored IO users such as NFS benefit from dramatically reduced processing. The specification implies that an implementation that does not permit cancellation is fine:

Which operations are cancelable is implementation-defined.

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

According to a recursive grep of my system's /usr/src/debug, the only programs on my system that are capable of using aio_cancel() are QEMU, beecrypt and fio. That suggests to me that aio_cancel() users are rare.
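
To show what that means for callers, here is a small sketch (illustrative only, not from the patch; file path and size are arbitrary): portable code must already cope with AIO_NOTCANCELED, which is exactly what an in-flight request returns under this implementation.

```c
/* Sketch: a portable aio_cancel() caller has to handle AIO_NOTCANCELED
 * by waiting for the request to complete, which is what this
 * implementation makes it do.  Typical glibc build: cc demo.c -lrt */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];
	struct aiocb cb = { 0 };
	const struct aiocb *list[1] = { &cb };

	int fd = open("/tmp/aio-cancel-demo", O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof (buf);
	if (aio_write(&cb) != 0)
		return 1;

	switch (aio_cancel(fd, &cb)) {
	case AIO_CANCELED:
		puts("request canceled");
		break;
	case AIO_ALLDONE:
		puts("request had already completed");
		break;
	case AIO_NOTCANCELED:
		/* Cancellation refused; wait for completion instead. */
		aio_suspend(list, 1, NULL);
		break;
	}

	close(fd);
	return 0;
}
```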

@dswartz
Contributor

dswartz commented Aug 6, 2014

Is there a readme or something on how to build from a git repo? I'd love
to test the nfs write perf to compare against stock, but not sure how to
proceed...

@dswartz
Contributor

dswartz commented Aug 6, 2014

Sigh. So after much hacking around, I figured out how to generate
configure and Makefile and actually build things. Then, when I try to
install the zfs stuff, it pukes with:

make[2]: Entering directory `/usr/src/kernels/3.10.0-123.4.4.el7.x86_64'
INSTALL /root/git/module/avl/zavl.ko
Can't read private key
INSTALL /root/git/module/nvpair/znvpair.ko
Can't read private key
INSTALL /root/git/module/unicode/zunicode.ko
Can't read private key
INSTALL /root/git/module/zcommon/zcommon.ko
Can't read private key
INSTALL /root/git/module/zfs/zfs.ko
Can't read private key
INSTALL /root/git/module/zpios/zpios.ko
Can't read private key

I googled for this error but didn't really see anything helpful. I assume
there is a way around this, but damned if I know what it is. What a PITA.

@dswartz
Contributor

dswartz commented Aug 6, 2014

Well, I think I give up for now. Despite the private key messages, it
seems like the modules installed okay, so I rebooted. Now, when I try
writing to the dataset from a remote client, I don't get any visible
errors, but it's doing almost no I/O at all. I suspect my update/install
didn't do something right, but I have no idea what or how to tell. If
someone can generate the requisite rpms or something, I'll be happy to try
again...

@behlendorf
Contributor

@dswartz You can use the directions for building generic deb/rpm packages, http://zfsonlinux.org/generic-deb.html. You'll just need to cherry-pick this commit to test it.

@behlendorf behlendorf added this to the 0.6.4 milestone Aug 6, 2014
@behlendorf behlendorf added Bug and removed Bug labels Aug 6, 2014
@dswartz
Contributor

dswartz commented Aug 6, 2014

Cool thx


@behlendorf
Contributor

@ryao This looks nice. It's a simpler patch than I initially expected, although as you point out it doesn't add support for cancel.

You suggested implementing cancel support using a taskq but I think it might be simpler to just undirty the dbufs for writes. It's probably something we should investigate before merging this work, but if it looks to be too involved it can be deferred.

@ryao
Contributor Author

ryao commented Aug 7, 2014

It seems that generic_segment_checks() has been removed in Linux 3.16 by torvalds/linux@cb66a7a. Al is correct that the additional work is redundant, but it is not immediately clear that this is also historically true. We will need to either review all supported kernels or err on the side of caution by using an autotools check.
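
For context, this is roughly the shape such a configure-time probe could take; the HAVE_GENERIC_SEGMENT_CHECKS name is hypothetical and this is not code from the patch, just a compile test of the symbol in question built against the target kernel's headers.

```c
/* Hypothetical autotools compile probe: if this builds against the
 * target kernel, the check would define something like
 * HAVE_GENERIC_SEGMENT_CHECKS so the ZPL only calls
 * generic_segment_checks() on kernels that still provide it. */
#include <linux/fs.h>
#include <linux/uio.h>
#include <linux/uaccess.h>

void
zfs_probe_generic_segment_checks(void)
{
	const struct iovec *iov = NULL;
	unsigned long nr_segs = 0;
	size_t count = 0;

	(void) generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
}
```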

@behlendorf I will look into undirtying the dbufs before pushing a revision.

@behlendorf
Contributor

I will look into undirtying the dbufs before pushing a revision.

Upon further reflection I don't think this is workable. It should be straightforward to undirty the dbuf. But that would be racy since there's no easy way I'm aware of to check that a specific aio was responsible for dirtying it...

@ryao
Contributor Author

ryao commented Aug 7, 2014

@behlendorf We could implement a counter. If it is greater than 1, then multiple parties dirtied the buffer and we cannot trivially undirty it. The POSIX specification gives us full control over when we can and cannot cancel an AIO operation, so we would be free to refuse to cancel the AIO when that happens.

That being said, I am inclined to omit an implementation for aio_cancel() until it is clear that there are consumers that do benefit from it to justify the additional work. That is allowed by the specification.

@behlendorf
Contributor

That being said, I am inclined to omit an implementation for aio_cancel() until it is clear that there are consumers that do benefit from it to justify the additional work. That is allowed by the specification.

Agreed.

@ryao
Contributor Author

ryao commented Aug 7, 2014

@behlendorf I have updated the pull request to address your comments. The commit message has also been updated. It is clearer about what nfsd does that is problematic and also discusses aio_cancel().

@DeHackEd
Contributor

DeHackEd commented Aug 9, 2014

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int rounds = 0;
  while (rounds < 1000) {
    /* Move up to 4096 bytes at a time from stdin to stdout. */
    int result = splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE);
    if (result <= 0)
      break;
    rounds++;
  }
  return 0;
}
# ls -l testfile
-rwxr-xr-x 1 dhe dhe 414563 Dec  3  2013 testfile
# strace ./testprogram < testfile | cat > /dev/null
execve("./splicetest", ["./splicetest"], [/* 26 vars */]) = 0
brk(0)                                  = 0x1904000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf4f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=124586, ...}) = 0
mmap(NULL, 124586, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd3ecf30000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\356\1g1\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1926800, ...}) = 0
mmap(0x3167000000, 3750152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3167000000
mprotect(0x316718b000, 2093056, PROT_NONE) = 0
mmap(0x316738a000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18a000) = 0x316738a000
mmap(0x316738f000, 18696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x316738f000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2d000
arch_prctl(ARCH_SET_FS, 0x7fd3ecf2e700) = 0
mprotect(0x316738a000, 16384, PROT_READ) = 0
mprotect(0x3166e1f000, 4096, PROT_READ) = 0
munmap(0x7fd3ecf30000, 124586)          = 0
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
... ... ...
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 867
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = -867
exit_group(0)                           = ?
+++ exited with 0 +++

Wat?

Backing out this one patch returns 0 for the last splice as normal.

Edit: Additional info

Linux atmaweapon 3.14.15 #9 SMP Sat Aug 9 15:11:26 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

https://github.com/DeHackEd/zfs/commits/dehacked

@ryao
Contributor Author

ryao commented Aug 10, 2014

I cannot reproduce this problem locally, but after examining things, I found a mistake. `size_t len = iovp->iov_len;` followed by `read = len - uio.uio_resid;` is wrong. It is conceivable that it resulted in the -867 value, but I have not identified how uio.uio_resid, which would have been initialized through iov_length(), could have been 867 at the EOF in the first place. However, 414563 = 867 (mod 4096), which explains how we are seeing 867 here. I have pushed a revised patch that fixes this via some internal API changes to pass ppos to the new zpl_read_common_iovec(). @DeHackEd tells me that his userland software works properly with it applied.
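
To make the arithmetic concrete, a hedged fragment using the names from the comment above (illustrative only, not the committed fix):

```c
/* uio_resid is initialized from the total length of all segments
 * (iov_length()), so bytes transferred must be derived from that total,
 * not from the first segment's iov_len alone. */
size_t total = iov_length(iovp, nr_segs);   /* sum of every iov_len */
ssize_t done = total - uio.uio_resid;       /* bytes actually moved */

/* The buggy form described above only accounts for the first segment,
 * which is how a short final transfer can even appear negative:
 *
 *     ssize_t done = iovp->iov_len - uio.uio_resid;
 */
```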

@behlendorf
Contributor

@ryao Could you please rebase this against master and repush it.

@behlendorf
Contributor

@ryao Thanks for eliminating the additional pos argument, I think that helped considerably. Just a few more review comments, but I think this is getting very close.

@behlendorf
Contributor

Before merging this it would also be nice to have some results showing its impact on NFS performance.

@dswartz
Contributor

dswartz commented Aug 12, 2014

I'm planning on that but have had delays; hopefully tonight.


@ryao
Contributor Author

ryao commented Aug 12, 2014

@behlendorf I have refreshed the pull request. The refresh addresses all of the points made, but does not include NFS performance benchmarks. I will let @dswartz do those while I tackle other things.

@dswartz
Contributor

dswartz commented Aug 13, 2014

Okay, I am feeling mentally disabled here. What is the command line incantation to do a git pull for a specific pull request? I seem to recall seeing such in the first mail sent, but reading all the comments in the pull conversation, there is nothing that tells me what to do...

@DeHackEd
Contributor

git remote add ryao git://github.com/ryao/zfs.git
git fetch ryao
...
git checkout -b ryao-aio
...
git reset --hard ryao/aio
# or
git merge ryao/aio

(didn't actually try it, just typing from memory)

@nedbass
Contributor

nedbass commented Aug 13, 2014

@dswartz you can do something like

git fetch -t git://github.com/zfsonlinux/zfs.git refs/pull/2565/head
git checkout -b b_2565 FETCH_HEAD

@dswartz
Contributor

dswartz commented Aug 13, 2014

@dswartz you can do something like

git fetch -t git://github.com/zfsonlinux/zfs.git refs/pull/2565/head
git checkout -b b_2565 FETCH_HEAD

Thanks, building now...

@dswartz
Contributor

dswartz commented Aug 13, 2014

Okay, good news (mostly). Booted centos7 with the AIO zfs module and ran crystaldiskmark. Target is a virtual disk in a windows 7 VM under vsphere. Stock ZoL gets about 30MB/sec sequential writes. With AIO it gets 55MB/sec. sync=disabled turns in 95MB/sec (pretty close to wire speed). This is using the same intel s3700 over-provisioned at 20%. I notice that the SLOG device doesn't seem to get much above 1K IOPS (it should be capable of far more than that). I started to wonder if there is some kind of ZIL pathology at play, so I created another dataset and set sync=always on it, then read 1MB blocks from the SSD main pool device with output to a file on the new dataset. 63MB/sec! If I set sync=disabled on the new dataset and re-run the test, I get 172MB/sec (keep in mind this is an old pentium D with 8GB ram on vanilla sata2 motherboard ports). Is there some kind of ZIL tuning I can play with?

@dswartz
Contributor

dswartz commented Aug 13, 2014

Richard, I think you've lost the original context. My original "slow sync
writes with slog" question was about why sync=standard (when the client
forces sync on) was dog-slow even with a high-performance SSD (intel s3700).
The exact same crystaldiskmark run (sequential, 512k reads and writes) works
at near wire speed using omnios (an illumos variant). We concluded it was the
nfs server breaking up large client writes into a ton of 4KB sync writes,
which swamps the SLOG. Sure enough, with your AIO patches, I see sequential
write performance go from 30MB/sec to 55MB/sec (almost 2X), but still
barely half what omnios turns out. As I said in a previous update, even
doing local disk writes, if they are synchronous, I am barely getting
60MB/sec, whereas raw writes to a partition on the SLOG device are doing
almost 200MB/sec. If I do the same 'dd' to a file on the sync=standard
dataset, it gets 167MB/sec. Looking at 'zpool iostat -v' output when
writing to a file on the sync dataset, I see:

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        11.7G   107G      0  1.25K      0  91.8M
  sdc       11.7G   107G      0    335      0  29.4M
logs            -      -      -      -      -      -
  sdb1       256M  19.6G      0    940      0  62.5M
----------  -----  -----  -----  -----  -----  -----


Notice the write IOPS to the SLOG is just under 1000/sec. This SSD should
be able to do 10X or 20X or more IOPS. It's almost like something is
throttling the writes to the SLOG device. This is why I was hoping
someone would go 'aha! zil writes are limited by X and you can tweak
module parameter Y to play with that...'

@ryao
Contributor Author

ryao commented Aug 13, 2014

@dswartz My apologies. This actually was something multiple parties wanted for different reasons, so I was somewhat confused. Based on some benchmarks of an earlier version of my softnas-sp2 branch, which I do not think an NDA allows me to share, I have seen evidence to suggest that the kmem rework included in that branch will increase performance. I suggest testing it.

@behlendorf
Contributor

@ryao Enabling the various aio tests in xfstests resulted in 4 test failures: 113 198 239 240. I've briefly looked at each and they are almost all the result of issues in the test framework. For now we can patch xfstests to handle this.

113, 198, 239 - These all failed because they attempted to use O_DIRECT, which is unsupported.
240 - Failed because blockdev --getss $TEST_DEV doesn't work for a zfs filesystem.

However, test 198 may have exposed the same issue @DeHackEd reported. This is going to need to be explained and resolved before this change can be merged.

$ cat 198.out.bad 
QA output created by 198
Silence is golden.
AIO write offset 0 expected 65536 got 0
AIO write offset 5242880 expected 65536 got 0
AIO write offset 10485760 expected 65536 got 0
AIO write offset 15728640 expected 65536 got 0
AIO write offset 20971520 expected 65536 got 0
AIO write offset 26214400 expected 65536 got 0
AIO write offset 31457280 expected 65536 got 0
AIO write offset 36700160 expected 65536 got 0
AIO write offset 41943040 expected 65536 got 0
AIO write offset 47185920 expected 65536 got 0
AIO write offset 52428800 expected 65536 got 0
AIO write offset 57671680 expected 65536 got 0
AIO write offset 62914560 expected 65536 got 0
AIO write offset 68157440 expected 65536 got 0
AIO write offset 73400320 expected 65536 got 0
AIO write offset 78643200 expected 65536 got 0
non one buffer at buf[0] => 0x00,00,00,00
non-one read at offset 83886080

@behlendorf
Contributor

@dswartz It sounds like we may need to investigate any ZIL issues as well. However, for reference we've done some testing with a pool built with 10 Intel 3500's (5 mirror pairs) and we've seen sustained rates of 10,000 IOPs per device.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@DeHackEd When you get a chance, would you try verifying that the issue you encountered is not a problem under the latest version of this patch? The patch should cleanly apply to 0.6.3.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf I have not sufficiently quantified it myself, but I have seen benchmarks by others that suggest that my kmem rework increased IOPS performance. It includes a change to keep KM_SLEEP from blocking KM_PUSHPAGE allocations during SLAB growth and a few other things that might impact performance, such as changing the reclaim behavior via PF_MEMALLOC_NOIO when PF_FSTRANS is set in performance critical code paths. These changes resolve issues that both harm performance and do not exist on Illumos. The idea that we have issues in memory allocations that hurt performance is consistent with what @dswartz is saying about Illumos performing better.

@behlendorf
Contributor

@ryao That sounds plausible, but it's a little early to jump to the conclusion that that's what's causing @dswartz's remaining issue. Regardless, xfstests 198 needs to be resolved before this change can be merged.

@DeHackEd
Contributor

Due to time constraints I'm not going to be able to do this very soon.

If you want to reproduce it yourself install CentOS 6 with the Gnome desktop group selected, then install ZFS with /home on a ZFS filesystem. If you can use the standard Gnome login screen and get to a gnome desktop, you pass my test... Or just use the sample program and command-line I demonstrated.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf I will investigate the xfs test failure today.

Also, I agree that it is early to draw conclusions on the cause of @dswartz's performance issue, but I would like to consider the performance discrepancy between Linux and Illumos that is present even after this patch is applied to be separate from this pull request. That is why I suggest that the kmem rework could be something to consider.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf Test generic/198 fails for me because of DirectIO. See aiodio_sparse() in src/aio-dio-regress/aiodio_sparse2.c:

fd = open(filename, direct|O_WRONLY|O_CREAT, 0666);

The precise test output on my machine is consistent with opening a file with O_DIRECT:

QA output created by 198
Silence is golden.
cannot create file: Invalid argument

Did you mean some other test?

@ryao
Contributor Author

ryao commented Aug 13, 2014

Actually, disregard that. We seem to be using different versions of the XFS Test Suite.

@aarcane

aarcane commented Aug 13, 2014

@ryao: compare your NDA with the license of the software and the terms of
your contract, and think long and hard about whether you can release it or not.

@dswartz
Contributor

dswartz commented Aug 13, 2014

Agreed, since it happens even with AIO not involved. I will throw Illumos on a 2nd boot drive and try the dd local test again.


@dswartz
Contributor

dswartz commented Aug 13, 2014

Brian, I agree...

Brian Behlendorf notifications@github.com wrote:

@dswartz That's good news. Thanks for running some additional tests. Regarding zil tuning, there's the logbias property and then some statistics in /proc/spl/kstat/zfs/zil to get a feel for what's happening. However, if you want to pursue that further we should do it in another issue.

@ryao The updated patch looks good to me. Thanks for the multiple refreshes, I may be able to get it merged today after a little manual testing of my own.



@ryao
Contributor Author

ryao commented Aug 13, 2014

There is an explicit exemption for OSS code in the NDA in question, so it is fine. I would not have signed it if there was a chance it kept me from sharing code. Unfortunately, I did not think of the benchmark results that party produces when I requested the exemption, so those are likely covered by confidentiality requirements. If I do my own benchmarks, it should be fine, but I lack the time to do everything that I would like to do.


@behlendorf
Contributor

@ryao Any progress on the remaining xfstests issue? I'd like this to be part of the next tag if possible.

@ryao
Contributor Author

ryao commented Aug 18, 2014

@behlendorf I suspect that there is something subtle about torvalds/linux@73a7075 that is the cause of the discrepancy, but I have not tested that theory yet. I likely will not be able to finish this until I return from LinuxCon in Chicago later this week.

@ryao
Contributor Author

ryao commented Sep 2, 2014

@behlendorf My intuition was close. Responsibility for updating the iovecs when writing differs between Illumos and Linux. On Illumos, it is the filesystem's responsibility, while on Linux, it is the VFS' responsibility. On older kernels, the retry code would invoke our ->aio_write() on each segment individually, while torvalds/linux@73a7075 in 3.12 removed that code. Doing our own iov->iov_len updates on older kernels caused aio_advance_iovec() to skip over everything after the first segment. The Linux 3.12 change allowed my initial AIO implementation to work on newer kernels because Linux was no longer updating the iovecs at all.

Since it is clear that we are not supposed to be touching these structures, I have modified the patch to copy the contents of the iovecs into dynamic memory, which is then used to simulate the Solaris environment. This incurs a small performance penalty, but it should be fairly negligible. We can revisit it later if performance profiling shows that the allocation is causing a bottleneck. The pull request has been refreshed.
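
A minimal sketch of that approach (SPL kmem interfaces; the variable names are illustrative, not the committed code):

```c
/* Duplicate the caller's iovec array so the common ZFS read/write path
 * can advance its private copy, Illumos-style, while Linux's original
 * iovecs are left untouched. */
struct iovec *iov_copy;
size_t size = nr_segs * sizeof (struct iovec);

iov_copy = kmem_alloc(size, KM_SLEEP);
memcpy(iov_copy, iovp, size);

/* ... build the uio_t from iov_copy and call zfs_read()/zfs_write() ... */

kmem_free(iov_copy, size);
```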

@ryao
Contributor Author

ryao commented Sep 3, 2014

The revised commit had a few typos that were caught by the builders. I pushed a revised commit to fix them. I have also adjusted the comment about the iovecs.

nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fall back to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That is unnecessary overhead, so we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping its calls to VOP_WRITE,
VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself.
Linux filesystems therefore must implement their own AIO logic and
nearly all of them implement fops->aio_write synchronously.
Consequently, they do not implement aio_fsync(). However, since the ZPL
works by mapping Linux's VFS calls to the functions implementing
Illumos' VFS operations, we instead implement AIO in the kernel by
mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC
equivalents. We therefore implement fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself,
where nfsd does not pass O_SYNC on the files it opens and instead
relies on an open()/write()/close() sequence to enforce synchronous
behavior, when the flush is only guaranteed on the last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

According to a recursive grep of my system's `/usr/src/debug`, the only
programs on my system that are capable of using `aio_cancel()` are
QEMU, beecrypt and fio. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from it enough to justify
the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux, it
is the VFS' responsibility, while on Illumos, it is the filesystem's
responsibility. We take the intermediate solution of copying the iovec so
that the ZFS code can update it like on Solaris while leaving the originals
alone. This imposes some overhead. We could always revisit this should
profiling show that the allocations are a problem.

Closes:
openzfs#223
openzfs#2373

Signed-off-by: Richard Yao <ryao@gentoo.org>
@ryao
Contributor Author

ryao commented Sep 3, 2014

After sleeping on it, I decided to make a couple of additional revisions. This should be safe to merge provided that the tests do not reveal any problems.

@behlendorf
Contributor

@ryao It looks like you managed to hit #2523 in the buildbot which is unrelated to this change. I'll pull the latest patch and run it again on EPEL7 with all the AIO tests enabled.

@behlendorf
Contributor

Merged as:

cd3939c Linux AIO Support

All the AIO tests are now passing on the various buildslaves. I'll enable the AIO tests by default next week. This will give us a chance to clear the buildbot backlog and for new changes to be rebased on a version of master which now supports AIO.

@behlendorf behlendorf closed this Sep 5, 2014