
Linux AIO Support #2565

Closed
wants to merge 1 commit into from

Conversation

ryao
Contributor

@ryao ryao commented Aug 4, 2014

nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fall back to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That is unnecessary overhead, so we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping its calls to VOP_WRITE,
VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself.
Linux filesystems therefore must implement their own AIO logic and
nearly all of them implement fops->aio_write synchronously.
Consequently, they do not implement aio_fsync(). However, since the ZPL
works by mapping Linux's VFS calls to the functions implementing
Illumos' VFS operations, we instead implement AIO in the kernel by
mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC
equivalents. We therefore implement fops->aio_fsync.

One might be inclined to implement fops->aio_write synchronously to make
software that expects this behavior safe. However, there are several
reasons not to do this:

  1. Other platforms do not implement aio_write() synchronously and since
    the majority of userland software using AIO should be cross platform,
    expectations of synchronous behavior should not be a problem.
  2. We would hurt the performance of programs that use POSIX interfaces
    properly while simultaneously encouraging the creation of more
    non-compliant software.
  3. The broader community concluded that userland software should be
    patched to properly use POSIX interfaces instead of implementing hacks
    in filesystems to cater to broken software. This concept is best
    described as the O_PONIES debate.
  4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will, in the
event of a kernel panic or system failure, suffer the loss of at most
zfs_txg_timeout seconds of writes on ZFSOnLinux (5 seconds by default).
This seems like a reasonable consequence of using non-compliant
software.
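
For illustration, here is a minimal userland sketch (not part of this patch; the file path, buffer size and build command are arbitrary) of the compliant pattern argued for above: a portable program gets durability by following aio_write() with aio_fsync() and waiting for both, rather than assuming aio_write() itself is synchronous.

```c
/* Minimal sketch: portable POSIX AIO usage that does not assume a
 * synchronous aio_write().  Durability comes from aio_fsync().
 * Typical glibc build: cc demo.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	static char buf[4096] = "example data";
	struct aiocb wr = { 0 }, fs = { 0 };

	int fd = open("/tmp/aio-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return 1;

	wr.aio_fildes = fd;
	wr.aio_buf = buf;
	wr.aio_nbytes = sizeof (buf);
	wr.aio_offset = 0;
	if (aio_write(&wr) != 0)
		return 1;

	/* Wait for the asynchronous write to finish... */
	while (aio_error(&wr) == EINPROGRESS)
		usleep(1000);
	if (aio_return(&wr) < 0)
		return 1;

	/* ...then explicitly ask for it to reach stable storage. */
	fs.aio_fildes = fd;
	if (aio_fsync(O_SYNC, &fs) != 0)
		return 1;
	while (aio_error(&fs) == EINPROGRESS)
		usleep(1000);

	close(fd);
	return (aio_return(&fs) == 0) ? 0 : 1;
}
```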

It should be noted that this is also a problem in the kernel itself
where nfsd does not respect O_SYNC on files and assumes synchronous
behavior from do_readv_writev(), even though its fallback clearly does
not enforce it.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure. Exporting any file
system that implements AIO the way this patch does bears similar risk.
However, it seems reasonable to forgo crippling our AIO implementation
in favor of developing patches to fix this problem in Linux's nfsd for
the reasons stated earlier. In the interim, the risk will remain.
Failing to implement AIO will not change the problem that nfsd created,
so there is no reason for nfsd's mistake to block our implementation of
AIO.

Closes:
#223
#2373

Signed-off-by: Richard Yao ryao@gentoo.org

@ryao
Contributor Author

ryao commented Aug 4, 2014

It is worth noting that this passes my local xfstests run. In particular, xfstests generic/246 is a good test of this functionality.

@dswartz
Contributor

dswartz commented Aug 4, 2014

Richard, thanks! I will give this a shot sometime in the next couple of days and see what the performance is like...

@ryao
Contributor Author

ryao commented Aug 4, 2014

It is worth noting that aio_cancel() will always return AIO_NOTCANCELED under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and use kiocb_set_cancel_fn() to set a callback function for cancelling work sent to taskqs, but the simpler approach is sufficient for our current purposes. Vectored IO users such as NFS benefit from dramatically reduced processing. The specification implies that an implementation that does not permit cancellation is fine:

Which operations are cancelable is implementation-defined.

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

According to a recursive grep of my system's /usr/src/debug, the only programs on my system that are capable of using aio_cancel() are QEMU, beecrypt and fio. That suggests to me that aio_cancel() users are rare.
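
To show what that means for callers, here is a small sketch (illustrative only, not from the patch; file path and size are arbitrary): portable code must already cope with AIO_NOTCANCELED, which is exactly what an in-flight request returns under this implementation.

```c
/* Sketch: a portable aio_cancel() caller has to handle AIO_NOTCANCELED
 * by waiting for the request to complete, which is what this
 * implementation makes it do.  Typical glibc build: cc demo.c -lrt */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];
	struct aiocb cb = { 0 };
	const struct aiocb *list[1] = { &cb };

	int fd = open("/tmp/aio-cancel-demo", O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof (buf);
	if (aio_write(&cb) != 0)
		return 1;

	switch (aio_cancel(fd, &cb)) {
	case AIO_CANCELED:
		puts("request canceled");
		break;
	case AIO_ALLDONE:
		puts("request had already completed");
		break;
	case AIO_NOTCANCELED:
		/* Cancellation refused; wait for completion instead. */
		aio_suspend(list, 1, NULL);
		break;
	}

	close(fd);
	return 0;
}
```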

@dswartz
Contributor

dswartz commented Aug 6, 2014

Is there a readme or something on how to build from a git repo? I'd love
to test the nfs write perf to compare against stock, but not sure how to
proceed...

@dswartz
Contributor

dswartz commented Aug 6, 2014

Sigh. So after much hacking around, I figured out how to generate
configure and Makefile and actually build things. Then, when I try to
install the zfs stuff, it pukes with:

make[2]: Entering directory `/usr/src/kernels/3.10.0-123.4.4.el7.x86_64'
INSTALL /root/git/module/avl/zavl.ko
Can't read private key
INSTALL /root/git/module/nvpair/znvpair.ko
Can't read private key
INSTALL /root/git/module/unicode/zunicode.ko
Can't read private key
INSTALL /root/git/module/zcommon/zcommon.ko
Can't read private key
INSTALL /root/git/module/zfs/zfs.ko
Can't read private key
INSTALL /root/git/module/zpios/zpios.ko
Can't read private key

I googled for this error but didn't really see anything helpful. I assume
there is a way around this, but damned if I know what it is. What a PITA.

@dswartz
Contributor

dswartz commented Aug 6, 2014

Well, I think I give up for now. Despite the private key messages, it
seems like the modules installed okay, so I rebooted. Now, when I try
writing to the dataset from a remote client, I don't get any visible
errors, but it's doing almost no I/O at all. I suspect my update/install
didn't do something right, but I have no idea what or how to tell. If
someone can generate the requisite rpms or something, I'll be happy to try
again...

@behlendorf
Contributor

@dswartz You can use the directions for building generic deb/rpm packages, http://zfsonlinux.org/generic-deb.html. You'll just need to cherry-pick this commit to test it.

@behlendorf behlendorf added this to the 0.6.4 milestone Aug 6, 2014
@behlendorf behlendorf added Bug and removed Bug labels Aug 6, 2014
@dswartz
Contributor

dswartz commented Aug 6, 2014

Cool thx


@behlendorf
Contributor

@ryao This looks nice. It's a simpler patch than I initially expected, although as you point out it doesn't add support for cancel.

You suggested implementing cancel support using a taskq but I think it might be simpler to just undirty the dbufs for writes. It's probably something we should investigate before merging this work, but if it looks to be too involved it can be deferred.

@ryao
Contributor Author

ryao commented Aug 7, 2014

It seems that generic_segment_checks() has been removed in Linux 3.16 by torvalds/linux@cb66a7a. Al is correct that the additional work is redundant, but it is not immediately clear that this is also historically true. We will need to either review all supported kernels or err on the side of caution by using an autotools check.
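
For context, this is roughly the shape such a configure-time probe could take; the HAVE_GENERIC_SEGMENT_CHECKS name is hypothetical and this is not code from the patch, just a compile test of the symbol in question built against the target kernel's headers.

```c
/* Hypothetical autotools compile probe: if this builds against the
 * target kernel, the check would define something like
 * HAVE_GENERIC_SEGMENT_CHECKS so the ZPL only calls
 * generic_segment_checks() on kernels that still provide it. */
#include <linux/fs.h>
#include <linux/uio.h>
#include <linux/uaccess.h>

void
zfs_probe_generic_segment_checks(void)
{
	const struct iovec *iov = NULL;
	unsigned long nr_segs = 0;
	size_t count = 0;

	(void) generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
}
```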

@behlendorf I will look into undirtying the dbufs before pushing a revision.

@behlendorf
Contributor

I will look into undirtying the dbufs before pushing a revision.

Upon further reflection I don't think this is workable. It should be straightforward to undirty the dbuf. But that would be racy since there's no easy way I'm aware of to check that a specific aio was responsible for dirtying it...

@ryao
Contributor Author

ryao commented Aug 7, 2014

@behlendorf We could implement a counter. If it is greater than 1, then multiple parties dirtied the buffer and we cannot trivially undirty it. The POSIX specification gives us full control over when we can and cannot cancel an AIO operation, so we would be free to refuse to cancel the AIO when that happens.

That being said, I am inclined to omit an implementation for aio_cancel() until it is clear that there are consumers that do benefit from it to justify the additional work. That is allowed by the specification.

@behlendorf
Contributor

That being said, I am inclined to omit an implementation for aio_cancel() until it is clear that there are consumers that do benefit from it to justify the additional work. That is allowed by the specification.

Agreed.

@ryao
Contributor Author

ryao commented Aug 7, 2014

@behlendorf I have updated the pull request to address your comments. The commit message has also been updated. It is clearer about what nfsd does that is problematic and also discusses aio_cancel().

@DeHackEd
Contributor

DeHackEd commented Aug 9, 2014

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int rounds = 0;
  while (rounds < 1000) {
    /* Move up to 4096 bytes at a time from stdin to stdout. */
    int result = splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE);
    if (result <= 0)
      break;
    rounds++;
  }
  return 0;
}
# ls -l testfile
-rwxr-xr-x 1 dhe dhe 414563 Dec  3  2013 testfile
# strace ./testprogram < testfile | cat > /dev/null
execve("./splicetest", ["./splicetest"], [/* 26 vars */]) = 0
brk(0)                                  = 0x1904000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf4f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=124586, ...}) = 0
mmap(NULL, 124586, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd3ecf30000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\356\1g1\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1926800, ...}) = 0
mmap(0x3167000000, 3750152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3167000000
mprotect(0x316718b000, 2093056, PROT_NONE) = 0
mmap(0x316738a000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18a000) = 0x316738a000
mmap(0x316738f000, 18696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x316738f000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2d000
arch_prctl(ARCH_SET_FS, 0x7fd3ecf2e700) = 0
mprotect(0x316738a000, 16384, PROT_READ) = 0
mprotect(0x3166e1f000, 4096, PROT_READ) = 0
munmap(0x7fd3ecf30000, 124586)          = 0
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
... ... ...
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 867
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = -867
exit_group(0)                           = ?
+++ exited with 0 +++

Wat?

Backing out this one patch returns 0 for the last splice as normal.

Edit: Additional info

Linux atmaweapon 3.14.15 #9 SMP Sat Aug 9 15:11:26 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

https://github.com/DeHackEd/zfs/commits/dehacked

@ryao
Contributor Author

ryao commented Aug 10, 2014

I cannot reproduce this problem locally, but after examining things, I found a mistake. `size_t len = iovp->iov_len;` followed by `read = len - uio.uio_resid;` is wrong. It is conceivable that it resulted in the -867 value, but I have not identified how uio.uio_resid, which would have been initialized through iov_length(), could have been 867 at the EOF in the first place. However, 414563 = 867 (mod 4096), which explains how we are seeing 867 here. I have pushed a revised patch that fixes this via some internal API changes to pass ppos to the new zpl_read_common_iovec(). @DeHackEd tells me that his userland software works properly with it applied.
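
To make the arithmetic concrete, a hedged fragment using the names from the comment above (illustrative only, not the committed fix):

```c
/* uio_resid is initialized from the total length of all segments
 * (iov_length()), so bytes transferred must be derived from that total,
 * not from the first segment's iov_len alone. */
size_t total = iov_length(iovp, nr_segs);   /* sum of every iov_len */
ssize_t done = total - uio.uio_resid;       /* bytes actually moved */

/* The buggy form described above only accounts for the first segment,
 * which is how a short final transfer can even appear negative:
 *
 *     ssize_t done = iovp->iov_len - uio.uio_resid;
 */
```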

@behlendorf
Contributor

@ryao Could you please rebase this against master and repush it.

@behlendorf
Contributor

@ryao Thanks for eliminating the additional pos argument, I think that helped considerably. Just a few more review comments, but I think this is getting very close.

@behlendorf
Contributor

Before merging this it would also be nice to have some results showing its impact on NFS performance.

@dswartz
Contributor

dswartz commented Aug 12, 2014

I'm planning on that but have had delays; hopefully tonight.


@ryao
Contributor Author

ryao commented Aug 12, 2014

@behlendorf I have refreshed the pull request. The refresh addresses all of the points made, but does not include NFS performance benchmarks. I will let @dswartz do those while I tackle other things.

@dswartz
Contributor

dswartz commented Aug 13, 2014

Okay, I am feeling mentally disabled here. What is the command line incantation to do a git pull for a specific pull request? I seem to recall seeing such in the first mail sent, but reading all the comments in the pull conversation, there is nothing that tells me what to do...

@DeHackEd
Contributor

git remote add ryao git://github.com/ryao/zfs.git
git fetch ryao
...
git checkout -b ryao-aio
...
git reset --hard ryao/aio
# or
git merge ryao/aio

(didn't actually try it, just typing from memory)

@nedbass
Contributor

nedbass commented Aug 13, 2014

@dswartz you can do something like

git fetch -t git://github.com/zfsonlinux/zfs.git refs/pull/2565/head
git checkout -b b_2565 FETCH_HEAD

@dswartz
Contributor

dswartz commented Aug 13, 2014

@dswartz you can do something like

git fetch -t git://github.com/zfsonlinux/zfs.git refs/pull/2565/head
git checkout -b b_2565 FETCH_HEAD

Thanks, building now...

@dswartz
Contributor

dswartz commented Aug 13, 2014

Okay, good news (mostly). Booted centos7 with the AIO zfs module and ran crystaldiskmark. Target is a virtual disk in a windows 7 VM under vsphere. Stock ZoL gets about 30MB/sec sequential writes. With AIO it gets 55MB/sec. sync=disabled turns in 95MB/sec (pretty close to wire speed). This is using the same intel s3700 over-provisioned at 20%. I notice that the SLOG device doesn't seem to get much above 1K IOPS (it should be capable of far more than that). I started to wonder if there is some kind of ZIL pathology at play, so I created another dataset and set sync=always on it, then read 1MB blocks from the SSD main pool device with output to a file on the new dataset. 63MB/sec! If I set sync=disabled on the new dataset and re-run the test, I get 172MB/sec (keep in mind this is an old pentium D with 8GB ram on vanilla sata2 motherboard ports). Is there some kind of ZIL tuning I can play with?

@dswartz
Contributor

dswartz commented Aug 13, 2014

Richard, I think you've lost the original context. My original "slow sync
writes with slog" question was about why sync=standard (when the client
forces sync on) was dog-slow even with a high-performance SSD (intel s3700).
The exact same crystaldiskmark run (sequential, 512k reads and writes) works
at near wire speed using omnios (an illumos variant). We concluded it was the
nfs server breaking up large client writes into a ton of 4KB sync writes,
which swamps the SLOG. Sure enough, with your AIO patches, I see sequential
write performance go from 30MB/sec to 55MB/sec (almost 2X), but still
barely half what omnios turns out. As I said in a previous update, even
doing local disk writes, if they are synchronous, I am barely getting
60MB/sec, whereas raw writes to a partition on the SLOG device are doing
almost 200MB/sec. If I do the same 'dd' to a file on the sync=standard
dataset, it gets 167MB/sec. Looking at 'zpool iostat -v' output when
writing to a file on the sync dataset, I see:

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        11.7G   107G      0  1.25K      0  91.8M
  sdc       11.7G   107G      0    335      0  29.4M
logs            -      -      -      -      -      -
  sdb1       256M  19.6G      0    940      0  62.5M
----------  -----  -----  -----  -----  -----  -----


Notice the write IOPS to the SLOG is just under 1000/sec. This SSD should
be able to do 10X or 20X or more IOPS. It's almost like something is
throttling the writes to the SLOG device. This is why I was hoping
someone would go 'aha! zil writes are limited by X and you can tweak
module parameter Y to play with that...'

@ryao
Contributor Author

ryao commented Aug 13, 2014

@dswartz My apologies. This actually was something multiple parties wanted for different reasons, so I was somewhat confused. Based on some benchmarks of an earlier version of my softnas-sp2 branch, which I do not think an NDA allows me to share, I have seen evidence to suggest that the kmem rework included in that branch will increase performance. I suggest testing it.

@behlendorf
Contributor

@ryao Enabling the various aio tests in xfstests resulted in 4 test failures: 113 198 239 240. I've briefly looked at each and they are almost all the result of issues in the test framework. For now we can patch xfstests to handle this.

113, 198, 239 - These all failed because they attempted to use O_DIRECT, which is unsupported.
240 - Failed because blockdev --getss $TEST_DEV doesn't work for a zfs filesystem.

However, test 198 may have exposed the same issue @DeHackEd reported. This is going to need to be explained and resolved before this change can be merged.

$ cat 198.out.bad 
QA output created by 198
Silence is golden.
AIO write offset 0 expected 65536 got 0
AIO write offset 5242880 expected 65536 got 0
AIO write offset 10485760 expected 65536 got 0
AIO write offset 15728640 expected 65536 got 0
AIO write offset 20971520 expected 65536 got 0
AIO write offset 26214400 expected 65536 got 0
AIO write offset 31457280 expected 65536 got 0
AIO write offset 36700160 expected 65536 got 0
AIO write offset 41943040 expected 65536 got 0
AIO write offset 47185920 expected 65536 got 0
AIO write offset 52428800 expected 65536 got 0
AIO write offset 57671680 expected 65536 got 0
AIO write offset 62914560 expected 65536 got 0
AIO write offset 68157440 expected 65536 got 0
AIO write offset 73400320 expected 65536 got 0
AIO write offset 78643200 expected 65536 got 0
non one buffer at buf[0] => 0x00,00,00,00
non-one read at offset 83886080

@behlendorf
Contributor

@dswartz It sounds like we may need to investigate any ZIL issues as well. However, for reference we've done some testing with a pool built with 10 Intel 3500's (5 mirror pairs) and we've seen sustained rates of 10,000 IOPs per device.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@DeHackEd When you get a chance, would you try verifying that the issue you encountered is not a problem under the latest version of this patch? The patch should cleanly apply to 0.6.3.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf I have not sufficiently quantified it myself, but I have seen benchmarks by others that suggest that my kmem rework increased IOPS performance. It includes a change to keep KM_SLEEP from blocking KM_PUSHPAGE allocations during SLAB growth and a few other things that might impact performance, such as changing the reclaim behavior via PF_MEMALLOC_NOIO when PF_FSTRANS is set in performance critical code paths. These changes resolve issues that both harm performance and do not exist on Illumos. The idea that we have issues in memory allocations that hurt performance is consistent with what @dswartz is saying about Illumos performing better.

@behlendorf
Contributor

@ryao That sounds plausible, but it's a little early to jump to the conclusion that that's what's causing @dswartz's remaining issue. Regardless, xfstests 198 needs to be resolved before this change can be merged.

@DeHackEd
Contributor

Due to time constraints I'm not going to be able to do this very soon.

If you want to reproduce it yourself install CentOS 6 with the Gnome desktop group selected, then install ZFS with /home on a ZFS filesystem. If you can use the standard Gnome login screen and get to a gnome desktop, you pass my test... Or just use the sample program and command-line I demonstrated.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf I will investigate the xfs test failure today.

Also, I agree that it is early to draw conclusions on the cause of @dswartz's performance issue, but I would like to consider the performance discrepancy between Linux and Illumos that is present even after this patch is applied to be separate from this pull request. That is why I suggest that the kmem rework could be something to consider.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf Test generic/198 fails for me because of DirectIO. See aiodio_sparse() in src/aio-dio-regress/aiodio_sparse2.c:

fd = open(filename, direct|O_WRONLY|O_CREAT, 0666);

The precise test output on my machine is consistent with opening a file with O_DIRECT:

QA output created by 198
Silence is golden.
cannot create file: Invalid argument

Did you mean some other test?

@ryao
Contributor Author

ryao commented Aug 13, 2014

Actually, disregard that. We seem to be using different versions of the XFS Test Suite.

@aarcane

aarcane commented Aug 13, 2014

@ryao: compare your NDA with the license of the software and the terms of
your contract, and think long and hard about whether you can release it or not.

@dswartz
Contributor

dswartz commented Aug 13, 2014

Agreed, since it happens even with AIO not involved. I will throw Illumos on a 2nd boot drive and try the dd local test again.


@dswartz
Contributor

dswartz commented Aug 13, 2014

Brian, I agree...

Brian Behlendorf notifications@github.com wrote:

@dswartz That's good news. Thanks for running some additional tests. Regarding zil tuning, there's the logbias property and then some statistics in /proc/spl/kstat/zfs/zil to get a feel for what's happening. However, if you want to pursue that further we should do it in another issue.

@ryao The updated patch looks good to me. Thanks for the multiple refreshes, I may be able to get it merged today after a little manual testing of my own.



@ryao
Contributor Author

ryao commented Aug 13, 2014

There is an explicit exemption for OSS code in the NDA in question, so it is fine. I would not have signed it if there was a chance it kept me from sharing code. Unfortunately, I did not think of the benchmark results that party produces when I requested the exemption, so those are likely covered by confidentiality requirements. If I do my own benchmarks, it should be fine, but I lack the time to do everything that I would like to do.


@behlendorf
Contributor

@ryao Any progress on the remaining xfstests issue? I'd like this to be part of the next tag if possible.

@ryao
Contributor Author

ryao commented Aug 18, 2014

@behlendorf I suspect that there is something subtle about torvalds/linux@73a7075 that is the cause of the discrepancy, but I have not tested that theory yet. I likely will not be able to finish this until I return from LinuxCon in Chicago later this week.

@ryao
Contributor Author

ryao commented Sep 2, 2014

@behlendorf My intuition was close. Responsibility for updating the iovecs when writing differs between Illumos and Linux. On Illumos, it is the filesystem's responsibility, while on Linux, it is the VFS' responsibility. On older kernels, the retry code would invoke our ->aio_write() on each segment individually, while torvalds/linux@73a7075 in 3.12 removed that code. Doing our own iov->iov_len updates on older kernels caused aio_advance_iovec() to skip over everything after the first segment. The Linux 3.12 change allowed my initial AIO implementation to work on newer kernels because Linux was no longer updating the iovecs at all.

Since it is clear that we are not supposed to be touching these structures, I have modified the patch to copy the contents of the iovecs into dynamic memory, which is then used to simulate the Solaris environment. This incurs a small performance penalty, but it should be fairly negligible. We can revisit it later if performance profiling shows that the allocation is causing a bottleneck. The pull request has been refreshed.
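
A minimal sketch of that approach (SPL kmem interfaces; the variable names are illustrative, not the committed code):

```c
/* Duplicate the caller's iovec array so the common ZFS read/write path
 * can advance its private copy, Illumos-style, while Linux's original
 * iovecs are left untouched. */
struct iovec *iov_copy;
size_t size = nr_segs * sizeof (struct iovec);

iov_copy = kmem_alloc(size, KM_SLEEP);
memcpy(iov_copy, iovp, size);

/* ... build the uio_t from iov_copy and call zfs_read()/zfs_write() ... */

kmem_free(iov_copy, size);
```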

@ryao
Contributor Author

ryao commented Sep 3, 2014

The revised commit had a few typos that were caught by the builders. I pushed a revised commit to fix them. I have also adjusted the comment about the iovecs.

nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fall back to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That is unnecessary overhead, so we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping its calls to VOP_WRITE,
VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself.
Linux filesystems therefore must implement their own AIO logic and
nearly all of them implement fops->aio_write synchronously.
Consequently, they do not implement aio_fsync(). However, since the ZPL
works by mapping Linux's VFS calls to the functions implementing
Illumos' VFS operations, we instead implement AIO in the kernel by
mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC
equivalents. We therefore implement fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself,
where nfsd does not pass O_SYNC on the files it opens and instead
relies on an open()/write()/close() sequence to enforce synchronous
behavior, when the flush is only guaranteed on the last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

According to a recursive grep of my system's `/usr/src/debug`, the only
programs on my system that are capable of using `aio_cancel()` are
QEMU, beecrypt and fio. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from it enough to justify
the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux, it
is the VFS' responsibility, while on Illumos, it is the filesystem's
responsibility. We take the intermediate solution of copying the iovec so
that the ZFS code can update it like on Solaris while leaving the originals
alone. This imposes some overhead. We could always revisit this should
profiling show that the allocations are a problem.

Closes:
openzfs#223
openzfs#2373

Signed-off-by: Richard Yao <ryao@gentoo.org>
@ryao
Contributor Author

ryao commented Sep 3, 2014

After sleeping on it, I decided to make a couple of additional revisions. This should be safe to merge provided that the tests do not reveal any problems.

@behlendorf
Contributor

@ryao It looks like you managed to hit #2523 in the buildbot which is unrelated to this change. I'll pull the latest patch and run it again on EPEL7 with all the AIO tests enabled.

@behlendorf
Contributor

Merged as:

cd3939c Linux AIO Support

All the AIO tests are now passing on the various buildslaves. I'll enable the AIO tests by default next week. This will give us a chance to clear the buildbot backlog and for new changes to be rebased on a version of master which now supports AIO.

@behlendorf behlendorf closed this Sep 5, 2014