
Add Linux posix_fadvise(..., POSIX_FADV_WILLNEED) support. #9807

Closed

Conversation

@macdice commented Jan 4, 2020

Allow user programs to request asynchronous prefetching of regions of files.

Motivation and Context

Some programs, including PostgreSQL and MongoDB, use POSIX_FADV_WILLNEED in cases where they have visibility of future random reads. Issuing the advice some time before calling pread() is a very simple kind of asynchronous I/O, and it works pretty well at reducing I/O waits if you can get the distance right. PostgreSQL has an "effective_io_concurrency" setting that controls how far ahead it issues the advice in certain cases.
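
For illustration, a minimal userspace sketch of this pattern (the helper name and its parameters are invented for the example, not taken from any of those programs):

#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint that the block at next_off will be needed
 * soon, then read the block at cur_off that was (hopefully) prefetched
 * on an earlier iteration.  Error handling kept minimal for brevity.
 */
static ssize_t
read_with_prefetch(int fd, void *buf, size_t len, off_t cur_off, off_t next_off)
{
	(void) posix_fadvise(fd, next_off, (off_t)len, POSIX_FADV_WILLNEED);
	return (pread(fd, buf, len, cur_off));
}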

Description

In Linux 4.19, fadvise was added as a VFS file operation. Previously it worked at the page cache level only, so it wasn't usable for ZFS. This patch connects POSIX_FADV_WILLNEED up to dmu_prefetch() using the new VFS callback. (It may be possible to do something meaningful with the other kinds of advice too, but I haven't looked into that.)
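
For readers unfamiliar with the new hook, here is a heavily simplified sketch of the general shape of such a handler. It is illustrative only, not the patch verbatim; names follow the existing zpl_file.c conventions, and range handling (e.g. len == 0 meaning "to end of file") and the other advice values are elided.

static int
zpl_fadvise(struct file *filp, loff_t offset, loff_t len, int advice)
{
	struct inode *ip = file_inode(filp);
	znode_t *zp = ITOZ(ip);
	zfsvfs_t *zfsvfs = ITOZSB(ip);
	int error = 0;

	if (offset < 0 || len < 0)
		return (-EINVAL);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	switch (advice) {
	case POSIX_FADV_WILLNEED:
		/* Start an asynchronous prefetch of the requested range. */
		dmu_prefetch(zfsvfs->z_os, zp->z_id, 0, offset, len,
		    ZIO_PRIORITY_ASYNC_READ);
		break;
	default:
		/* Other kinds of advice are ignored for now. */
		break;
	}

	ZFS_EXIT(zfsvfs);
	return (error);
}

/* Registered via the new .fadvise member of zpl_file_operations. */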

How Has This Been Tested?

I used the program at https://github.com/macdice/some-io-tests to verify that random access to larger-than-memory files could be made faster by prefetching N reads. I've tested on a virtual machine with 1GB and 4GB of RAM using Debian with a 4.19 kernel. The best speed-up from doing that is around 4x on my laptop, comparing "./random-read 10gb-file 0 100000 0" and "./random-read 10gb-file 20 100000 0".

I get similar results with the same test on ext4 (though the speed-up is greater there, I'm not sure why, possibly tuning related).

I also tested an equivalent patch for FreeBSD VFS (about which more soon), and it shows the same sort of results on various systems.

This is a draft for now, because I haven't added anything under "tests" yet. I'm trying to work out what tests to provide for this -- perhaps a tiny program to just exercise it and check for invalid parameters, or perhaps there is a way to introspect enough things to verify its actual effect. Any suggestions?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

codecov bot commented Jan 4, 2020

Codecov Report

Merging #9807 (47bdf58) into master (161ed82) will increase coverage by 4.41%.
The diff coverage is n/a.

❗ Current head 47bdf58 differs from pull request most recent head 9bd54d4. Consider uploading reports for the commit 9bd54d4 to get more accurate results.


@@            Coverage Diff             @@
##           master    #9807      +/-   ##
==========================================
+ Coverage   75.17%   79.59%   +4.41%     
==========================================
  Files         402      289     -113     
  Lines      128071    82335   -45736     
==========================================
- Hits        96283    65536   -30747     
+ Misses      31788    16799   -14989     
Flag     Coverage Δ
kernel   79.59% <ø> (+0.83%) ⬆️
user     ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
module/os/linux/zfs/zpl_file.c 90.38% <ø> (+0.87%) ⬆️
module/zfs/vdev_raidz_math_sse2.c 14.39% <0.00%> (-84.85%) ⬇️
module/icp/algs/skein/skein_block.c 31.94% <0.00%> (-68.06%) ⬇️
module/os/linux/spl/spl-vmem.c 54.16% <0.00%> (-45.84%) ⬇️
module/icp/algs/edonr/edonr.c 45.72% <0.00%> (-42.59%) ⬇️
include/os/linux/zfs/sys/trace_acl.h 33.33% <0.00%> (-33.34%) ⬇️
module/zcommon/zfs_namecheck.c 54.78% <0.00%> (-32.91%) ⬇️
module/icp/algs/skein/skein.c 18.33% <0.00%> (-30.84%) ⬇️
module/zfs/gzip.c 46.15% <0.00%> (-30.77%) ⬇️
module/zfs/ddt_zap.c 57.14% <0.00%> (-29.23%) ⬇️
... and 368 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@behlendorf (Contributor) left a comment

Thanks for looking into this!

Regarding test cases, I'd suggest starting by taking a look at tests/zfs-tests/tests/functional/arc/dbufstats_001_pos.ksh. It leverages the /proc/spl/kstat/zfs/dbufs kstat file to check which blocks have been cached by the ARC. Additionally, the /proc/spl/kstat/zfs/dbufs file provides some coarser, but useful, statistics.

You'll also want to include some logic similar to that used by tmpfile/setup.sh, which skips the test case when fadvise isn't supported by the kernel and ZFS. When log_unsupported is called in setup, the whole test group is skipped; when placed in an individual test case, only that test is skipped.


if (offset < 0 || len < 0) {
error = -EINVAL;
goto out;
Contributor commented:

It'd be nice not to need the out: label and instead just return (-EINVAL); here.

ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);

switch (advice) {
Contributor commented:

By chance did you investigate additionally calling generic_fadvise()? While the majority of cached ZFS data is stored in the ARC, we do make use of the page cache for mmap(2). It looks like calling generic_fadvise() first would let us add the missing EPIPE error handling, honor the mmap case, and set ra_pages, which shouldn't be harmful and may in fact be desirable. You'd of course still need to explicitly call dmu_prefetch() here.
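
For illustration, the suggested combination might look roughly like this inside the advice switch shown above (a sketch under the assumption that generic_fadvise() is exported to modules on the kernels being targeted; this is not the patch's actual code):

	case POSIX_FADV_WILLNEED:
		/*
		 * Let the generic code validate the arguments, handle the
		 * special file cases, update readahead state, and cover any
		 * mmap(2)'d (page cache) data ...
		 */
		error = generic_fadvise(filp, offset, len, advice);
		if (error)
			break;
		/* ... then explicitly prefetch the range into the ARC. */
		dmu_prefetch(zfsvfs->z_os, zp->z_id, 0, offset, len,
		    ZIO_PRIORITY_ASYNC_READ);
		break;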

Contributor commented:

Could we just call generic_fadvise if zp->z_is_mapped is true?

Contributor commented:

My only concern with that is that it'd be nice to keep the generic error handling (EPIPE, EINVAL); otherwise we'd need to handle that ourselves. A reasonable middle ground might be to only pass the full len when zp->z_is_mapped.

@allanjude (Contributor) left a comment

Reviewed By: allanjude@freebsd.org

@h1z1 commented Jan 11, 2020

Nice. Beats having to parse blockdump output, which is what I do now for this.

@neilrubin commented

I'd like to offer a strong vote in favor of incorporating this patch. While the author's motivation was to improve database random read performance, my testing suggests it could also be a big help for single-threaded sequential file access.

My Tests Show Double Throughput on Sequential File Reads by Real-World Backup Jobs

I have several sets of data which consist of tens of TB of files, with average file sizes around 10 MB (zfs recordsize=1M), which I'd like to back up to LTO-8 tape. I'm using bacula, which (like many common Unix programs) sequentially reads files in chunks, file-by-file, with only a single thread for any given backup job. Bacula calls "posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED)" on every file before starting to read from it.

Without this patch, I was getting throughput of around 80 MB/s backing up from a mirrored array of 40 7200 rpm HDDs. Apparently zfetch was ramping up too slowly to be of help, and ZFS was effectively reading from one drive at a time (zfetch works great with multi-GB files on my system, but doesn't do much good on smaller multi-record files). With this patch, throughput almost doubles to 150 MB/s. The practical benefit is more than double, because the tape drive has a minimum write speed of 112 MB/s, so 80 MB/s throughput means either spooling to disk before writing to tape or constant back-hitching of the tape.

FADVISE_SEQUENTIAL Should Have the Same Behavior as FADVISE_WILLNEED

I would also strongly suggest treating POSIX_FADV_SEQUENTIAL the same as FADVISE_WILLNEED. The "copy" function used by cp and mv in the GNU coreutils calls "fdadvise (source_desc, 0, 0, FADVISE_SEQUENTIAL)" before it starts reading the source file in chunks. (https://github.com/coreutils/coreutils/blob/master/src/copy.c) Although I have not tested it, that suggests to me that pre-fetching the file (up to size dmu_prefetch_max) when FADVISE_SEQUENTIAL is specified would similarly increase the performance of cp and mv, for files that are multiples of the recordsize, but too small to really benefit from zfetch.

This behavior would be a little different from the documented behavior of the Linux kernel, which is to simply double the readahead buffer in response to FADVISE_SEQUENTIAL. But it seems like the better behavior for a system like ZFS, which is designed for hardware that can achieve much better throughput when prefetching, and to use large caches. If an application goes to the trouble of telling the OS that it is going to read the whole file (or some large chunk), it is probably a throughput-sensitive application, and what it is saying is probably true. It seems to me that the filesystem should take it at its word and be at least as aggressive in prefetching as it is when the zfetch heuristics are fully ramped up.
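
A hedged sketch of what that could look like in the advice switch (illustrative only; whether dmu_prefetch() itself caps the request, or a cap such as dmu_prefetch_max needs to be applied here, would have to be checked):

	case POSIX_FADV_SEQUENTIAL:
		/*
		 * The caller has promised a front-to-back read, so treat it
		 * like WILLNEED and start prefetching now.  A real
		 * implementation would presumably cap the amount prefetched
		 * (e.g. at dmu_prefetch_max).
		 */
		/* FALLTHROUGH */
	case POSIX_FADV_WILLNEED:
		if (len == 0 && offset < zp->z_size)
			len = zp->z_size - offset;	/* 0 means "to EOF" */
		dmu_prefetch(zfsvfs->z_os, zp->z_id, 0, offset, len,
		    ZIO_PRIORITY_ASYNC_READ);
		break;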

Increasing performance like this when using standard OS utilities to do normal things like copy and move directories of files, with sizes typical of user data like PDFs, Office files, and photographs (order of magnitude 10 MB), without needing fancy heuristics in the filesystem or multi-threaded I/O on the application side, seems like a huge win to me.

The M4 Code Appears to Need Updating

I will note that simply applying the patch as written did not successfully enable the feature on my system (applied to ZFS 2.0.6, with Linux kernel 5.13.0 on Ubuntu). From poking around, it looks to me like the system for testing for OS features has changed since this patch was submitted, but I don't know M4 well enough to figure out what the issue is. However, when I commented out the HAVE_FADVISE #ifdef (forcing the added code to be included), the code compiled successfully, and I observed the roughly 2x speed-up on my backup workloads.

@macdice (Author) commented Feb 7, 2022

Nice result @neilrubin! Thanks for poking this sleeping patch, which I'd love to see in the tree. So it looks like to have a chance of landing this we need (1) an updated feature test for Linux 5.x, which I can do soon, (2) a call to generic_fadvise() as suggested, which I will look into, and (3) a basic test -- if you have any concrete ideas for how to satisfy that requirement, I'm all ears. I looked around a bit, understood that fio is used, and wondered about teaching it to issue these calls (see axboe/fio#911), but that seems tricky and an unnecessary distraction; maybe something basic that just asserts the blocks are in the ARC within X seconds is all we need?

@behlendorf (Contributor) commented

It'd be great to see this work dusted off and integrated. It sounds like for some common workloads there's a significant performance win to be had.

@macdice your plan of attack sounds good to me. As for the test case, something as simple as adding a small test program which is a wrapper around posix_fadvise would be nice. It could simply make sure the system call succeeds, then verify after a brief delay that the reads are ARC cache hits. Basically, just exercise the minimum functionality.
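
Something along these lines, perhaps (a hypothetical standalone helper for the test suite; the name and argument handling are invented for the example, and the ksh side would then poll the dbufs/ARC kstats for hits):

#define	_XOPEN_SOURCE 600	/* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int fd, error;
	off_t offset = 0;
	off_t len = 0;		/* 0 means "to end of file" */

	if (argc < 2) {
		(void) fprintf(stderr, "usage: %s <file> [offset] [len]\n",
		    argv[0]);
		return (2);
	}
	if (argc > 2)
		offset = (off_t)strtoll(argv[2], NULL, 0);
	if (argc > 3)
		len = (off_t)strtoll(argv[3], NULL, 0);

	if ((fd = open(argv[1], O_RDONLY)) == -1) {
		perror("open");
		return (2);
	}

	/* posix_fadvise() returns an errno value directly, not -1/errno. */
	error = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
	if (error != 0) {
		(void) fprintf(stderr, "posix_fadvise: error %d\n", error);
		(void) close(fd);
		return (1);
	}

	(void) close(fd);
	return (0);
}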

I suspect @amotin may also be interested in getting this wired up for FreeBSD given the encouraging results. With a little more work it looks like we could also support the described behavior of POSIX_FADV_SEQUENTIAL, POSIX_FADV_NORMAL, and POSIX_FADV_RANDOM by extending the zfetch code a bit.

Allow user programs to request prefetching of regions of files.  Other
kinds of advice are ignored for now.

Signed-off-by: Thomas Munro <tmunro@freebsd.org>

dnl #
dnl # Linux 4.19 API
dnl #
AC_DEFUN([ZFS_AC_KERNEL_FADVISE], [
Contributor commented:

This configure check should be updated to operate in parallel like all the others, with separate ZFS_AC_KERNEL_FADVISE and ZFS_AC_KERNEL_SRC_FADVISE macros.

@macdice (Author) commented Sep 25, 2022

Just saw that #13694 was merged, so I'll close this one. Very happy to see this feature -- thanks very much @Finix1979 & @behlendorf!

@macdice closed this Sep 25, 2022