
Add Linux posix_fadvise(..., POSIX_FADV_WILLNEED) support. #9807

Closed

Conversation

@macdice commented Jan 4, 2020

Allow user programs to request asynchronous prefetching of regions of files.

Motivation and Context

Some programs, including PostgreSQL and MongoDB, use POSIX_FADV_WILLNEED in cases where they have visibility of future random reads. Issuing the advice some time before calling pread() is a very simple kind of asynchronous I/O, and it works pretty well at reducing I/O waits if you can get the distance right. PostgreSQL has an "effective_io_concurrency" setting that controls how far ahead it issues the advice in certain cases.
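
For illustration, a minimal userspace sketch of this pattern (the helper name and its parameters are invented for the example, not taken from any of those programs):

#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint that the block at next_off will be needed
 * soon, then read the block at cur_off that was (hopefully) prefetched
 * on an earlier iteration.  Error handling kept minimal for brevity.
 */
static ssize_t
read_with_prefetch(int fd, void *buf, size_t len, off_t cur_off, off_t next_off)
{
	(void) posix_fadvise(fd, next_off, (off_t)len, POSIX_FADV_WILLNEED);
	return (pread(fd, buf, len, cur_off));
}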

Description

In Linux 4.19, fadvise was added as a VFS file operation. Previously it worked at the page cache level only, so it wasn't usable for ZFS. This patch connects POSIX_FADV_WILLNEED up to dmu_prefetch() using the new VFS callback. (It may be possible to do something meaningful with the other kinds of advice too, but I haven't looked into that.)
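
For readers unfamiliar with the new hook, here is a heavily simplified sketch of the general shape of such a handler. It is illustrative only, not the patch verbatim; names follow the existing zpl_file.c conventions, and range handling (e.g. len == 0 meaning "to end of file") and the other advice values are elided.

static int
zpl_fadvise(struct file *filp, loff_t offset, loff_t len, int advice)
{
	struct inode *ip = file_inode(filp);
	znode_t *zp = ITOZ(ip);
	zfsvfs_t *zfsvfs = ITOZSB(ip);
	int error = 0;

	if (offset < 0 || len < 0)
		return (-EINVAL);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	switch (advice) {
	case POSIX_FADV_WILLNEED:
		/* Start an asynchronous prefetch of the requested range. */
		dmu_prefetch(zfsvfs->z_os, zp->z_id, 0, offset, len,
		    ZIO_PRIORITY_ASYNC_READ);
		break;
	default:
		/* Other kinds of advice are ignored for now. */
		break;
	}

	ZFS_EXIT(zfsvfs);
	return (error);
}

/* Registered via the new .fadvise member of zpl_file_operations. */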

How Has This Been Tested?

I used the program at https://github.com/macdice/some-io-tests to verify that random access to larger-than-memory files could be made faster by prefetching N reads. I've tested on a virtual machine with 1GB and 4GB of RAM using Debian with a 4.19 kernel. The best speed-up from doing that is around 4x on my laptop, comparing "./random-read 10gb-file 0 100000 0" and "./random-read 10gb-file 20 100000 0".

I get similar results with the same test on ext4 (though the speed-up is greater there, I'm not sure why, possibly tuning related).

I also tested an equivalent patch for FreeBSD VFS (about which more soon), and it shows the same sort of results on various systems.

This is a draft for now, because I haven't added anything under "tests" yet. I'm trying to work out what tests to provide for this -- perhaps a tiny program to just exercise it and check for invalid parameters, or perhaps there is a way to introspect enough things to verify its actual effect. Any suggestions?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

codecov bot commented Jan 4, 2020

Codecov Report

Merging #9807 (47bdf58) into master (161ed82) will increase coverage by 4.41%.
The diff coverage is n/a.

❗ Current head 47bdf58 differs from pull request most recent head 9bd54d4. Consider uploading reports for the commit 9bd54d4 to get more accurate results.


@@            Coverage Diff             @@
##           master    #9807      +/-   ##
==========================================
+ Coverage   75.17%   79.59%   +4.41%     
==========================================
  Files         402      289     -113     
  Lines      128071    82335   -45736     
==========================================
- Hits        96283    65536   -30747     
+ Misses      31788    16799   -14989     
Flag     Coverage Δ
kernel   79.59% <ø> (+0.83%) ⬆️
user     ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
module/os/linux/zfs/zpl_file.c 90.38% <ø> (+0.87%) ⬆️
module/zfs/vdev_raidz_math_sse2.c 14.39% <0.00%> (-84.85%) ⬇️
module/icp/algs/skein/skein_block.c 31.94% <0.00%> (-68.06%) ⬇️
module/os/linux/spl/spl-vmem.c 54.16% <0.00%> (-45.84%) ⬇️
module/icp/algs/edonr/edonr.c 45.72% <0.00%> (-42.59%) ⬇️
include/os/linux/zfs/sys/trace_acl.h 33.33% <0.00%> (-33.34%) ⬇️
module/zcommon/zfs_namecheck.c 54.78% <0.00%> (-32.91%) ⬇️
module/icp/algs/skein/skein.c 18.33% <0.00%> (-30.84%) ⬇️
module/zfs/gzip.c 46.15% <0.00%> (-30.77%) ⬇️
module/zfs/ddt_zap.c 57.14% <0.00%> (-29.23%) ⬇️
... and 368 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@behlendorf (Contributor) left a comment

Thanks for looking into this!

Regarding test cases, I'd suggest starting by taking a look at tests/zfs-tests/tests/functional/arc/dbufstats_001_pos.ksh. It leverages the /proc/spl/kstat/zfs/dbufs kstat file to check which blocks have been cached by the ARC. Additionally, the /proc/spl/kstat/zfs/dbufs file provides some coarser, but useful, statistics.

You'll also want to include some logic similar to that used by tmpfile/setup.sh, which skips the test case when fadvise isn't supported by the kernel and ZFS. When log_unsupported is called in setup, the whole test group is skipped; when placed in an individual test case, only that test is skipped.


if (offset < 0 || len < 0) {
error = -EINVAL;
goto out;
Contributor commented:

It'd be nice not to need the out: label and instead just return (-EINVAL); here.

ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);

switch (advice) {
Contributor commented:

By chance did you investigate additionally calling generic_fadvise()? While the majority of cached ZFS data is stored in the ARC, we do make use of the page cache for mmap(2). It looks like calling generic_fadvise() first would let us add the missing EPIPE error handling, honor the mmap case, and set ra_pages, which shouldn't be harmful and may in fact be desirable. You'd of course still need to explicitly call dmu_prefetch() here.
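
For illustration, the suggested combination might look roughly like this inside the advice switch shown above (a sketch under the assumption that generic_fadvise() is exported to modules on the kernels being targeted; this is not the patch's actual code):

	case POSIX_FADV_WILLNEED:
		/*
		 * Let the generic code validate the arguments, handle the
		 * special file cases, update readahead state, and cover any
		 * mmap(2)'d (page cache) data ...
		 */
		error = generic_fadvise(filp, offset, len, advice);
		if (error)
			break;
		/* ... then explicitly prefetch the range into the ARC. */
		dmu_prefetch(zfsvfs->z_os, zp->z_id, 0, offset, len,
		    ZIO_PRIORITY_ASYNC_READ);
		break;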

Contributor commented:

Could we just call generic_fadvise if zp->z_is_mapped is true?

Contributor commented:

My only concern with that is that it'd be nice to keep the generic error handling (EPIPE, EINVAL); otherwise we'd need to handle that ourselves. A reasonable middle ground might be to only pass the full len when zp->z_is_mapped.

@allanjude (Contributor) left a comment

Reviewed By: allanjude@freebsd.org

@h1z1 commented Jan 11, 2020

Nice. Beats having to parse blockdump output, which is what I do now for this.

@neilrubin commented

I'd like to offer a strong vote in favor of incorporating this patch. While the author's motivation was to improve database random read performance, my testing suggests it could also be a big help for single-threaded sequential file access.

My Tests Show Double Throughput on Sequential File Reads by Real-World Backup Jobs

I have several sets of data which consist of tens of TB of files, with average file sizes around 10 MB (zfs recordsize=1M), which I'd like to back up to LTO-8 tape. I'm using bacula, which (like many common Unix programs) sequentially reads files in chunks, file-by-file, with only a single thread for any given backup job. Bacula calls "posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED)" on every file before starting to read from it.

Without this patch, I was getting throughput of around 80 MB/s backing up from a mirrored array of 40 7200 rpm HDDs. Apparently zfetch was ramping up too slowly to be of help, and ZFS was effectively reading from one drive at a time (zfetch works great with multi-GB files on my system, but doesn't do much good on smaller multi-record files). With this patch, throughput almost doubles to 150 MB/s. The practical benefit is more than double, because the tape drive has a minimum write speed of 112 MB/s, so 80 MB/s throughput means either spooling to disk before writing to tape or constant back-hitching of the tape.

FADVISE_SEQUENTIAL Should Have the Same Behavior as FADVISE_WILLNEED

I would also strongly suggest treating POSIX_FADV_SEQUENTIAL the same as FADVISE_WILLNEED. The "copy" function used by cp and mv in the GNU coreutils calls "fdadvise (source_desc, 0, 0, FADVISE_SEQUENTIAL)" before it starts reading the source file in chunks. (https://github.com/coreutils/coreutils/blob/master/src/copy.c) Although I have not tested it, that suggests to me that pre-fetching the file (up to size dmu_prefetch_max) when FADVISE_SEQUENTIAL is specified would similarly increase the performance of cp and mv, for files that are multiples of the recordsize, but too small to really benefit from zfetch.

This behavior would be a little different from the documented behavior of the Linux kernel, which is to simply double the readahead buffer in response to FADVISE_SEQUENTIAL. But it seems like the better behavior for a system like ZFS, which is designed for hardware that can achieve much better throughput when prefetching, and to use large caches. If an application goes to the trouble of telling the OS that it is going to read the whole file (or some large chunk), it is probably a throughput-sensitive application, and what it is saying is probably true. It seems to me that the filesystem should take it at its word and be at least as aggressive in prefetching as it is when the zfetch heuristics are fully ramped up.
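
A hedged sketch of what that could look like in the advice switch (illustrative only; whether dmu_prefetch() itself caps the request, or a cap such as dmu_prefetch_max needs to be applied here, would have to be checked):

	case POSIX_FADV_SEQUENTIAL:
		/*
		 * The caller has promised a front-to-back read, so treat it
		 * like WILLNEED and start prefetching now.  A real
		 * implementation would presumably cap the amount prefetched
		 * (e.g. at dmu_prefetch_max).
		 */
		/* FALLTHROUGH */
	case POSIX_FADV_WILLNEED:
		if (len == 0 && offset < zp->z_size)
			len = zp->z_size - offset;	/* 0 means "to EOF" */
		dmu_prefetch(zfsvfs->z_os, zp->z_id, 0, offset, len,
		    ZIO_PRIORITY_ASYNC_READ);
		break;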

Increasing performance like this when using standard OS utilities to do normal things like copy and move directories of files, with sizes typical of user data like PDFs, Office files, and photographs (order of magnitude 10 MB), without needing fancy heuristics in the filesystem or multi-threaded I/O on the application side, seems like a huge win to me.

The M4 Code Appears to Need Updating

I will note that simply applying the patch as written did not successfully enable the feature on my system (applied to ZFS 2.0.6, with Linux kernel 5.13.0 on Ubuntu). From poking around, it looks to me like the system for testing for OS features has changed since this patch was submitted, but I don't know M4 well enough to figure out what the issue is. However, when I commented out the HAVE_FADVISE #ifdef (forcing the added code to be included), the code compiled successfully, and I observed the roughly 2x speed-up on my backup workloads.

@macdice (Author) commented Feb 7, 2022

Nice result @neilrubin! Thanks for poking this sleeping patch, which I'd love to see in the tree. So it looks like to have a chance of landing this we need (1) an updated feature test for Linux 5.x, which I can do soon, (2) a call to generic_fadvise() as suggested, which I will look into, and (3) a basic test -- if you have any concrete ideas for how to satisfy that requirement, I'm all ears. I looked around a bit, understood that fio is used, and wondered about teaching it to issue these calls (see axboe/fio#911), but that seems tricky and an unnecessary distraction; maybe something basic that just asserts the blocks are in the ARC within X seconds is all we need?

@behlendorf (Contributor) commented

It'd be great to see this work dusted off and integrated. It sounds like for some common workloads there's a significant performance win to be had.

@macdice your plan of attack sounds good to me. As for the test case, something as simple as adding a small test program which is a wrapper around posix_fadvise would be nice. It could simply make sure the system call succeeds, then verify after a brief delay that the reads are ARC cache hits. Basically, just exercise the minimum functionality.
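
Something along these lines, perhaps (a hypothetical standalone helper for the test suite; the name and argument handling are invented for the example, and the ksh side would then poll the dbufs/ARC kstats for hits):

#define	_XOPEN_SOURCE 600	/* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int fd, error;
	off_t offset = 0;
	off_t len = 0;		/* 0 means "to end of file" */

	if (argc < 2) {
		(void) fprintf(stderr, "usage: %s <file> [offset] [len]\n",
		    argv[0]);
		return (2);
	}
	if (argc > 2)
		offset = (off_t)strtoll(argv[2], NULL, 0);
	if (argc > 3)
		len = (off_t)strtoll(argv[3], NULL, 0);

	if ((fd = open(argv[1], O_RDONLY)) == -1) {
		perror("open");
		return (2);
	}

	/* posix_fadvise() returns an errno value directly, not -1/errno. */
	error = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
	if (error != 0) {
		(void) fprintf(stderr, "posix_fadvise: error %d\n", error);
		(void) close(fd);
		return (1);
	}

	(void) close(fd);
	return (0);
}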

I suspect @amotin may also be interested in getting this wired up for FreeBSD given the encouraging results. With a little more work it looks like we could also support the described behavior of POSIX_FADV_SEQUENTIAL, POSIX_FADV_NORMAL, and POSIX_FADV_RANDOM by extending the zfetch code a bit.

Allow user programs to request prefetching of regions of files.  Other
kinds of advice are ignored for now.

Signed-off-by: Thomas Munro <tmunro@freebsd.org>

dnl #
dnl # Linux 4.19 API
dnl #
AC_DEFUN([ZFS_AC_KERNEL_FADVISE], [
Contributor commented:

This configure check should be updated to operate in parallel like all the others, with separate ZFS_AC_KERNEL_FADVISE and ZFS_AC_KERNEL_SRC_FADVISE macros.

@macdice (Author) commented Sep 25, 2022

Just saw that #13694 was merged, so I'll close this one. Very happy to see this feature -- thanks very much @Finix1979 & @behlendorf!

@macdice closed this Sep 25, 2022