Add Linux posix_fadvise(..., POSIX_FADV_WILLNEED) support. #9807
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #9807      +/-   ##
==========================================
+ Coverage   75.17%   79.59%    +4.41%
==========================================
  Files         402      289      -113
  Lines      128071    82335    -45736
==========================================
- Hits        96283    65536    -30747
+ Misses      31788    16799    -14989
Thanks for looking into this!

Regarding test cases I'd suggest starting by taking a look at tests/zfs-tests/tests/functional/arc/dbufstats_001_pos.ksh. It leverages the /proc/spl/kstat/zfs/dbufs kstat file to check which blocks have been cached by the ARC; that file also provides some coarser, but still useful, statistics.

You'll also want to include some logic similar to that used by tmpfile/setup.sh, which skips the test case when fadvise isn't supported by the kernel+zfs combination. When log_unsupported is called in setup the whole test group is skipped; when it is called in an individual test case, only that test is skipped.
if (offset < 0 || len < 0) {
	error = -EINVAL;
	goto out;
It'd be nice not to need the out: label and instead just return (-EINVAL); here.
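That is, something along these lines (just a sketch of the shape being suggested, not the final code):

/* Early return instead of jumping to a cleanup label. */
if (offset < 0 || len < 0)
	return (-EINVAL);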
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);

switch (advice) {
By chance did you investigate additionally calling generic_fadvise()? While the majority of cached ZFS data is stored in the ARC, we do make use of the page cache for mmap(2). It looks like calling generic_fadvise() first would let us add the missing EPIPE error handling, honor the mmap case, and set ra_pages, which should be harmless and may in fact be desirable. You'd of course still need to explicitly call dmu_prefetch() here.
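For illustration only, here is a rough sketch of what that combined flow could look like. The function name zpl_fadvise, the znode/objset accessors, and the exact dmu_prefetch() argument list are assumptions made for the example, and a real implementation would likely split this between the zpl_ and zfs_ layers as the diff hunks above suggest:

/*
 * Illustrative sketch only, not the actual patch.  Let generic_fadvise()
 * do the argument validation, mmap'd page cache work, and ra_pages
 * adjustment, then additionally start an ARC prefetch for WILLNEED.
 */
static int
zpl_fadvise(struct file *filp, loff_t offset, loff_t len, int advice)
{
	struct inode *ip = file_inode(filp);
	znode_t *zp = ITOZ(ip);
	zfsvfs_t *zfsvfs = ITOZSB(ip);
	int error;

	/* Generic path: argument checking plus the page cache/mmap case. */
	error = generic_fadvise(filp, offset, len, advice);
	if (error)
		return (error);

	if (advice != POSIX_FADV_WILLNEED)
		return (0);

	/* fadvise(2) treats len == 0 as "to end of file". */
	if (len == 0)
		len = i_size_read(ip) - offset;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	/* Assumed dmu_prefetch() arguments: objset, object, level, offset, len, priority. */
	dmu_prefetch(zfsvfs->z_os, zp->z_id, 0, offset, len,
	    ZIO_PRIORITY_ASYNC_READ);

	ZFS_EXIT(zfsvfs);
	return (0);
}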
Could we just call generic_fadvise if zp->z_is_mapped is true?
My only concern with that is it'd be nice to keep the generic error handling (EPIPE, EINVAL), or we'd need to handle that ourselves. A reasonable middle ground might be to only pass the full len when zp->z_is_mapped.
Reviewed By: allanjude@freebsd.org
Nice. Beats having to parse blockdump output, which is what I do now for this.
I'd like to offer a strong vote in favor of incorporating this patch. While the author's motivation was to improve database random read performance, my testing suggests it could also be a big help for single-threaded sequential file access.

My Tests Show Double Throughput on Sequential File Reads by Real-World Backup Jobs

I have several sets of data consisting of tens of TB of files, with average file sizes around 10 MB (zfs recordsize=1M), which I'd like to back up to LTO-8 tape. I'm using bacula, which (like many common Unix programs) sequentially reads files in chunks, file-by-file, with only a single thread for any given backup job. Bacula calls "posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED)" on every file before starting to read from it.

Without this patch, I was getting throughput of around 80 MB/s backing up from a mirrored array of 40 7200 rpm HDDs. Apparently zfetch was ramping up too slowly to be of help, and ZFS was effectively reading from one drive at a time (zfetch works great with multi-GB files on my system, but doesn't do much good on smaller multi-record files). With this patch, throughput almost doubles, to 150 MB/s. The actual gain is more than double, because the tape drive has a minimum write speed of 112 MB/s, so 80 MB/s throughput means either spooling to disk before writing to tape or constant back-hitching of the tape.

FADVISE_SEQUENTIAL Should Have the Same Behavior as FADVISE_WILLNEED

I would also strongly suggest treating POSIX_FADV_SEQUENTIAL the same as POSIX_FADV_WILLNEED. The "copy" function used by cp and mv in the GNU coreutils calls "fdadvise (source_desc, 0, 0, FADVISE_SEQUENTIAL)" before it starts reading the source file in chunks (https://github.com/coreutils/coreutils/blob/master/src/copy.c). Although I have not tested it, that suggests to me that prefetching the file (up to dmu_prefetch_max) when POSIX_FADV_SEQUENTIAL is specified would similarly increase the performance of cp and mv for files that are multiples of the recordsize but too small to really benefit from zfetch.

This behavior would be a little different from the documented behavior of the Linux kernel, which simply doubles the readahead window in response to POSIX_FADV_SEQUENTIAL. But it seems like the better behavior for a system like ZFS, which is designed for hardware that can manage much better throughput when prefetching and which is designed to use large caches. If an application goes to the trouble of telling the OS that it is going to read the whole file (or some large chunk), it is probably a throughput-sensitive application, and what it is saying is probably true. It seems to me that the filesystem should take it at its word and be at least as aggressive in prefetching as it is when the zfetch heuristics are fully ramped up.

Increasing performance like this when using standard OS utilities to do normal things like copying and moving directories of files, with sizes typical of user data like PDFs, Office files, and photographs (on the order of 10 MB), without needing fancy heuristics in the filesystem or multi-threaded I/O on the application side, seems like a huge win to me.

The M4 Code Appears to Need Updating

I will note that simply applying the patch as written did not successfully enable the feature on my system (applied to ZFS 2.0.6, with Linux kernel 5.13.0 on Ubuntu). From poking around a little, it looks to me like the system for testing for OS features has changed a little since this patch was submitted, but I don't know M4 well enough to figure out what the issue is. However, when I commented out the HAVE_FADVISE #ifdef (forcing the added code to be included), the code compiled successfully and I observed the roughly 2x speed-up on my backup workloads.
Nice result @neilrubin! Thanks for poking this sleeping patch, which I'd love to see in the tree. It looks like to have a chance of landing this we need: (1) an updated feature test for Linux 5.x, which I can do soon; (2) a call to generic_fadvise() as suggested, which I'll look into; and (3) a basic test -- if you have any concrete ideas for how to satisfy that requirement I'm all ears. I looked around a bit, understood that fio was used, and wondered about teaching it to issue these calls (see axboe/fio#911), but that seems tricky and an unnecessary distraction; maybe something basic, perhaps just asserting that blocks are in the ARC within X seconds, is all we need?
It'd be great to see this work dusted off and integrated. It sounds like there's a significant performance win to be had for some common workloads. @macdice your plan of attack sounds good to me. As for the test case, something as simple as adding a small test program which is a wrapper around posix_fadvise(2) would probably be enough to get started. I suspect @amotin may also be interested in getting this wired up for FreeBSD given the encouraging results. With a little more work it looks like we could also support the described behavior of POSIX_FADV_SEQUENTIAL.
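For example, a minimal standalone helper along these lines (entirely hypothetical, not part of any patch) could be called from a ksh test case with various offsets and lengths; the test could then poll the /proc/spl/kstat/zfs/dbufs kstat mentioned earlier to confirm the advised blocks actually land in the ARC:

/*
 * fadvise_willneed.c -- hypothetical test helper.
 * Usage: fadvise_willneed <file> [offset] [length]
 * Issues POSIX_FADV_WILLNEED and reports the result, so a test script
 * can check both the error handling and (indirectly) ARC prefetching.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc < 2 || argc > 4) {
		fprintf(stderr, "usage: %s <file> [offset] [length]\n", argv[0]);
		return (2);
	}

	off_t offset = (argc > 2) ? (off_t)strtoll(argv[2], NULL, 10) : 0;
	off_t length = (argc > 3) ? (off_t)strtoll(argv[3], NULL, 10) : 0;

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return (2);
	}

	/* posix_fadvise() returns an errno value directly rather than -1/errno. */
	int error = posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
	if (error != 0) {
		fprintf(stderr, "posix_fadvise: %s\n", strerror(error));
		close(fd);
		return (1);
	}

	close(fd);
	return (0);
}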
Allow user programs to request prefetching of regions of files. Other kinds of advice are ignored for now. Signed-off-by: Thomas Munro <tmunro@freebsd.org>
dnl #
dnl # Linux 4.19 API
dnl #
AC_DEFUN([ZFS_AC_KERNEL_FADVISE], [
This configure check should be updated to operate in parallel like all the others, with a ZFS_AC_KERNEL_FADVISE and a ZFS_AC_KERNEL_SRC_FADVISE macro.
Just saw that #13694 was merged, so I'll close this one. Very happy to see this feature -- thanks very much @Finix1979 & @behlendorf!
Allow user programs to request asynchronous prefetching of regions of files.
Motivation and Context
Some programs, including PostgreSQL and MongoDB, use POSIX_FADV_WILLNEED in cases where they have visibility of future random reads. Issuing the advice some time before you call pread() is a very simple kind of asynchronous I/O, and it works pretty well at reducing I/O waits if you can get the distance right. PostgreSQL has an "effective_io_concurrency" setting that controls how far ahead it issues the advice in certain cases.
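As a concrete (and simplified, hypothetical) illustration of that pattern: the loop below advises block N+DISTANCE while reading block N. PostgreSQL does the equivalent for a stream of random block numbers; the block size and distance here are arbitrary values chosen for the example.

/* Minimal illustration of "advise ahead, read later"; not taken from
 * PostgreSQL or MongoDB source. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCKSZ   8192
#define DISTANCE  16            /* how far ahead to advise, in blocks */

int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <nblocks>\n", argv[0]);
		return (1);
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	long nblocks = atol(argv[2]);
	char buf[BLOCKSZ];

	for (long i = 0; i < nblocks; i++) {
		/* Hint the block we will want DISTANCE iterations from now. */
		long ahead = i + DISTANCE;
		if (ahead < nblocks)
			(void) posix_fadvise(fd, (off_t)ahead * BLOCKSZ,
			    BLOCKSZ, POSIX_FADV_WILLNEED);

		/* Read the current block; ideally already prefetched. */
		if (pread(fd, buf, BLOCKSZ, (off_t)i * BLOCKSZ) < 0) {
			perror("pread");
			break;
		}
	}

	close(fd);
	return (0);
}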
Description
In Linux 4.19, fadvise was added as a VFS file operation. Previously it worked only at the page cache level, so it wasn't usable for ZFS. This patch connects POSIX_FADV_WILLNEED up to dmu_prefetch() using the new VFS callback. (It may be possible to do something meaningful with the other kinds of advice too, but I haven't looked into that.)
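For context, the hook in question is the member Linux 4.19 added to struct file_operations; a filesystem wires a handler in alongside its existing callbacks. The zpl_fadvise name below is just a placeholder for whatever the ZFS handler ends up being called:

/* From include/linux/fs.h (Linux 4.19+): the callback a filesystem can
 * implement to see posix_fadvise()/fadvise64() calls on its files. */
int (*fadvise)(struct file *, loff_t, loff_t, int);

/* Hypothetical wiring in ZFS's Linux-specific file operations table: */
const struct file_operations zpl_file_operations = {
	/* ... existing .read_iter, .write_iter, .mmap, etc. ... */
	.fadvise	= zpl_fadvise,
};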
How Has This Been Tested?
I used the program at https://github.com/macdice/some-io-tests to verify that random access to larger-than-memory files could be made faster by prefetching N reads. I've tested on a virtual machine with 1GB and 4GB of RAM using Debian with a 4.19 kernel. The best speed-up from doing that is around 4x on my laptop, comparing "./random-read 10gb-file 0 100000 0" and "./random-read 10gb-file 20 100000 0".
I get similar results with the same test on ext4 (though the speed-up is greater there, I'm not sure why, possibly tuning related).
I also tested an equivalent patch for FreeBSD VFS (about which more soon), and it shows the same sort of results on various systems.
This is a draft for now, because I haven't added anything under "tests" yet. I'm trying to work out what tests to provide for this -- perhaps a tiny program to just exercise it and check for invalid parameters, or perhaps there is a way to introspect enough things to verify its actual effect. Any suggestions?