Linux AIO Support #2565
Conversation
It is worth noting that this passes my local XFS tests. In particular, xfstests generic/246 is a good test of this functionality.
Richard, thanks! I will give this a shot sometime in the next couple of days and see what the performance is like...
It is worth noting that which operations are cancelable is implementation-defined:
http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html
The only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio.
Is there a readme or something on how to build from a git repo? I'd love
Sigh. So after much hacking around, I figured out how to generate
make[2]: Entering directory `/usr/src/kernels/3.10.0-123.4.4.el7.x86_64'
I googled for this error but didn't really see anything helpful. I assume
Well, I think I give up for now. Despite the private key messages, it
@dswartz You can use the directions for building generic deb/rpm packages, http://zfsonlinux.org/generic-deb.html. You'll just need to cherry-pick this commit to test it. |
Cool, thx.
@ryao This looks nice. It's a simpler patch than I initially expected, although as you point out it doesn't add support for cancel. You suggested implementing cancel support using a taskq but I think it might be simpler to just undirty the dbufs for writes. It's probably something we should investigate before merging this work, but if it looks to be too involved it can be deferred.
@behlendorf I will look into undirtying the dbufs before pushing a revision.
Upon further reflection I don't think this is workable. It should be straightforward to undirty the dbuf. But that would be racy since there's no easy way I'm aware of to check that a specific AIO was responsible for dirtying it...
@behlendorf We could implement a counter. If it is greater than 1, then multiple parties dirtied the buffer and we cannot trivially undirty it. The POSIX specification gives us full control over when we can and cannot cancel an AIO operation, so we would be free to refuse to cancel the AIO when that happens. That being said, I am inclined to omit an implementation for `aio_cancel()`.
Agreed.
@behlendorf I have updated the pull request to address your comments. The commit message has also been updated; it is clearer about what nfsd does that is problematic and also discusses
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int rounds = 0;
    while (rounds < 1000) {
        int result = splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE);
        if (result <= 0)
            break;
        rounds++;
    }
    return 0;
}

# ls -l testfile
-rwxr-xr-x 1 dhe dhe 414563 Dec  3  2013 testfile
# strace ./testprogram < testfile | cat > /dev/null
execve("./splicetest", ["./splicetest"], [/* 26 vars */]) = 0
brk(0) = 0x1904000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf4f000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=124586, ...}) = 0
mmap(NULL, 124586, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd3ecf30000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\356\1g1\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1926800, ...}) = 0
mmap(0x3167000000, 3750152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3167000000
mprotect(0x316718b000, 2093056, PROT_NONE) = 0
mmap(0x316738a000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18a000) = 0x316738a000
mmap(0x316738f000, 18696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x316738f000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3ecf2d000
arch_prctl(ARCH_SET_FS, 0x7fd3ecf2e700) = 0
mprotect(0x316738a000, 16384, PROT_READ) = 0
mprotect(0x3166e1f000, 4096, PROT_READ) = 0
munmap(0x7fd3ecf30000, 124586) = 0
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
...
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 4096
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = 867
splice(0, NULL, 1, NULL, 4096, SPLICE_F_MOVE) = -867
exit_group(0) = ?
+++ exited with 0 +++
Wat? Backing out this one patch returns 0 for the last splice as normal.
Edit: Additional info: Linux atmaweapon 3.14.15 #9 SMP Sat Aug 9 15:11:26 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
I cannot reproduce this problem locally, but after examining things, I found a mistake.
@ryao Could you please rebase this against master and repush it.
@ryao Thanks for eliminating the addition
Before merging this it would also be nice to have some results showing its impact on NFS performance.
I'm planning on that but have had delays; hopefully tonight.
@behlendorf I have refreshed the pull request. The refresh addresses all of the points made, but does not include NFS performance benchmarks. I will let @dswartz do those while I tackle other things.
Okay, I am feeling dense here. What is the command line incantation to do a git pull for a specific pull request? I seem to recall seeing such in the first mail sent, but reading all the comments in the pull conversation, there is nothing that tells me what to do...
@dswartz you can do something like
git remote add ryao git://github.com/ryao/zfs.git
git fetch ryao
git checkout -b ryao-aio
git reset --hard ryao/aio   # or: git merge ryao/aio
(didn't actually try it, just typing from memory)
Thanks, building now...
Okay, good news (mostly). Booted centos7 with AIO zfs module and ran crystaldiskmark. Target is a virtual disk in a windows 7 VM under vsphere. Stock ZoL gets about 30MB/sec sequential writes. With AIO it gets 55MB/sec. sync=disabled turns in 95MB/sec (pretty close to wire speed.) This is using the same intel s3700 over-provisioned at 20%.

I notice that the SLOG device doesn't seem to get much above 1K IOPS (should be capable of far more than that...) I started to wonder if there is some kind of ZIL pathology at play, so I created another dataset and set sync=always on it, then read 1MB blocks from the SSD main pool device with output to a file on the new dataset. 63MB/sec! If I set sync=disabled on the new dataset and re-run the test, I get 172MB/sec (keep in mind this is an old pentium D with 8GB ram on vanilla sata2 motherboard ports.) Is there some kind of ZIL tuning I can play with?
Richard, I think you've lost the original context. My original "slow sync
pool   alloc   free   read  write   read  write
test   11.7G   107G      0  1.25K      0  91.8M
Notice the write IOPS to the SLOG is just under 1000/sec. This SSD should
@dswartz My apologies. This actually was something multiple parties wanted for different reasons, so I was somewhat confused. Based on some benchmarks of an earlier version of my softnas-sp2 branch that I do not think an NDA allows me to share, I can say that I have seen evidence to suggest that the kmem-rework, which I included in my softnas-sp2 branch, will increase performance. I suggest testing it.
@ryao Enabling the various aio tests in xfstests resulted in 4 test failures: 113, 198, 239, 240. I've briefly looked at each and they are almost all the result of issues in the test framework. For now we can patch xfstests to handle this.

113, 198, 239 - These all failed because they attempted to use O_DIRECT, which is unsupported. However, test 198 may have exposed the same issue @DeHackEd reported. This is going to need to be explained and resolved before this change can be merged.
@dswartz It sounds like we may need to investigate any ZIL issues as well. However, for reference we've done some testing with a pool built with 10 Intel S3500s (5 mirror pairs) and we've seen sustained rates of 10,000 IOPS per device.
@DeHackEd When you get a chance, would you try verifying that the issue you encountered is not a problem under the latest version of this patch? The patch should cleanly apply to 0.6.3.
@behlendorf I have not sufficiently quantified it myself, but I have seen benchmarks by others that suggest that my kmem rework increased IOPS performance. It includes a change to keep KM_SLEEP from blocking KM_PUSHPAGE allocations during SLAB growth and a few other things that might impact performance, such as changing the reclaim behavior via PF_MEMALLOC_NOIO when PF_FSTRANS is set in performance critical code paths. These changes resolve issues that both harm performance and do not exist on Illumos. The idea that we have issues in memory allocations that hurt performance is consistent with what @dswartz is saying about Illumos performing better.
Due to time constraints I'm not going to be able to do this very soon. If you want to reproduce it yourself install CentOS 6 with the Gnome desktop group selected, then install ZFS with /home on a ZFS filesystem. If you can use the standard Gnome login screen and get to a gnome desktop, you pass my test... Or just use the sample program and command-line I demonstrated.
@behlendorf I will investigate the xfstests failure today. Also, I agree that it is early to draw conclusions on the cause of @dswartz's performance issue, but I would like to consider the performance discrepancy between Linux and Illumos that is present even after this patch is applied to be separate from this pull request. That is why I suggest that the kmem-rework could be something to consider.
@behlendorf Test generic/198 fails for me because of DirectIO. See
The precise test output on my machine is consistent with opening a file with O_DIRECT:
Did you mean some other test?
Actually, disregard that. We seem to be using different versions of the XFS Test Suite.
@ryao: compare your NDA with the license of the software and the terms of
Agreed, since it happens even with AIO not involved. I will throw Illumos on a 2nd boot drive and try the dd local test again.
Brian, I agree...
|
There is an explicit exemption for OSS code in the NDA in question, so it is fine. I would not have signed it if there was a chance it kept me from sharing code. Unfortunately, I did not think of the benchmark results that party produces when I requested the exemption, so those are likely covered by confidentiality requirements. If I do my own benchmarks, it should be fine, but I lack time to do everything that I would like to do.
@ryao Any progress on the remaining xfstests issue? I'd like this to be part of the next tag if possible.
@behlendorf I suspect that there is something subtle about torvalds/linux@73a7075 that is the cause of the discrepancy, but I have not tested that theory yet. I likely will not be able to finish this until I return from LinuxCon in Chicago later this week.
@behlendorf My intuition was close. Responsibility for updating the iovecs when writing differs between Illumos and Linux. On Illumos, it is the filesystem's responsibility, while on Linux it is the VFS' responsibility. On older kernels, the retry code would invoke our ->aio_write() on each segment individually, while torvalds/linux@73a7075 in 3.12 killed that code. Doing our own iov->iov_len updates on older kernels caused aio_advance_iovec() to skip over everything after the first segment. The Linux 3.12 change allowed my initial AIO implementation to work on newer kernels because Linux was no longer updating the iovecs at all. Since it is clear that we are not supposed to be touching these structures, I have modified the patch to copy the contents of the iovecs into dynamic memory, which is then used to simulate the Solaris environment. This incurs a small performance penalty, but it should be fairly negligible. We can revisit it later if performance profiling shows that the allocation is causing a bottleneck. The pull request has been refreshed.
The revised commit had a few typos that were caught by the builders. I pushed a revised commit to fix that. I have also adjusted the comment about the iovecs.
Force-pushed from 8f185fe to 4632dbd.
nfsd uses do_readv_writev() to implement fops->read and fops->write. do_readv_writev() will attempt to read/write using fops->aio_read and fops->aio_write, but it will fall back to fops->read and fops->write when AIO is not available. However, the fallback will perform a call for each individual data page. Since our default recordsize is 128KB, sequential operations on NFS will generate 32 DMU transactions where only 1 transaction was needed. That was unnecessary overhead and we implement fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented in userland's libc by intelligently mapping the calls to VOP_WRITE, VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux filesystems therefore must implement their own AIO logic, and nearly all of them implement fops->aio_write synchronously. Consequently, they do not implement aio_fsync(). However, since the ZPL works by mapping Linux's VFS calls to the functions implementing Illumos' VFS operations, we instead implement AIO in the kernel by mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation synchronous to make software that expects this behavior safe. However, there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since the majority of userland software using AIO should be cross platform, expectations of synchronous behavior should not be a problem.
2. We would hurt the performance of programs that use POSIX interfaces properly while simultaneously encouraging the creation of more non-compliant software.
3. The broader community concluded that userland software should be patched to properly use POSIX interfaces instead of implementing hacks in filesystems to cater to broken software. This concept is best described as the O_PONIES debate.
4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will suffer data loss on ZFSOnLinux in a kernel panic / system failure of at most zfs_txg_timeout seconds, which by default is 5 seconds. This seems like a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself, where nfsd does not pass O_SYNC on files opened with it and instead relies on open()/write()/close() to enforce synchronous behavior when the flush is only guaranteed on last close. Exporting any filesystem that does not implement AIO via NFS risks data loss in the event of a kernel panic / system failure when something else is also accessing the file. Exporting any file system that implements AIO the way this patch does bears similar risk. However, it seems reasonable to forgo crippling our AIO implementation in favor of developing patches to fix this problem in Linux's nfsd, for the reasons stated earlier. In the interim, the risk will remain. Failing to implement AIO will not change the problem that nfsd created, so there is no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return `AIO_NOTCANCELED` under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and using `kiocb_set_cancel_fn()` to set a callback function for cancelling work sent to taskqs, but the simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

According to a recursive grep of my system's `/usr/src/debug`, the only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio. That suggests that `aio_cancel()` users are rare. Implementing aio_cancel() is left to a future date, when it is clear that there are enough consumers benefiting from its implementation to justify the work.

Lastly, it is important to know that handling of the iovec updates differs between Illumos and Linux in the implementation of read/write. On Linux, it is the VFS' responsibility while on Illumos, it is the filesystem's responsibility. We take the intermediate solution of copying the iovec so that the ZFS code can update it like on Solaris while leaving the originals alone. This imposes some overhead. We could always revisit this should profiling show that the allocations are a problem.

Closes: openzfs#223 openzfs#2373
Signed-off-by: Richard Yao <ryao@gentoo.org>
After sleeping on it, I decided to make a couple of additional revisions. This should be safe to merge provided that the tests do not reveal any problems.
Merged as: cd3939c Linux AIO Support. All the AIO tests are now passing on the various buildslaves. I'll enable the AIO tests by default next week. This will give us a chance to clear the buildbot backlog and for new changes to be rebased on a version of master which now supports AIO.
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fall back to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.
ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.
One might be inclined to make fops->aio_write synchronous to make
software that expects this behavior safe. However, there are several
reasons not to do this:
1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.
2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.
3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.
4. Making an asynchronous write synchronous is a non sequitur.
Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.
It should be noted that this is also a problem in the kernel itself
where nfsd does not respect O_SYNC on files and assumes synchronous
behavior from do_readv_writev(), even though its fallback clearly does
not enforce it.
Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure. Exporting any file
system that implements AIO the way this patch does bears similar risk.
However, it seems reasonable to forgo crippling our AIO implementation
in favor of developing patches to fix this problem in Linux's nfsd for
the reasons stated earlier. In the interim, the risk will remain.
Failing to implement AIO will not change the problem that nfsd created,
so there is no reason for nfsd's mistake to block our implementation of
AIO.
Closes:
#223
#2373
Signed-off-by: Richard Yao ryao@gentoo.org