Implement uncached prefetch #14243

Merged
merged 1 commit into openzfs:master from the uncached branch on Jan 5, 2023

Conversation

@amotin (Member) commented Nov 30, 2022

Before this change the primarycache property was handled only at the dbuf layer, controlling the dbuf cache and, through a hack, ARC evictions. Since the speculative prefetcher is implemented at the ARC level, it had to be disabled for uncacheable buffers, because otherwise falsely prefetched and never-read buffers would stay cached in ARC.

This change gives the ARC knowledge about uncacheable buffers. The flag is passed to arc_read() and arc_write() and stored in the ARC header. When remove_reference() drops the last reference on the ARC header, it can either destroy the header immediately or, if the buffer is marked as prefetch, put it into the new arc_uncached state. That state is scanned every second, looking for stale buffers that were never demand-read (demand-read buffers are evicted immediately).
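
Roughly, the reference-drop decision looks like the following simplified sketch. This is not the actual ARC code; all types, fields, and function names here are invented for illustration.

```c
#include <stdbool.h>

typedef enum { STATE_ANON, STATE_UNCACHED, STATE_DESTROYED } state_t;

typedef struct hdr {
	int	refcount;	/* active references on the header */
	bool	uncacheable;	/* primarycache says: do not cache */
	bool	prefetched;	/* brought in by speculative prefetch */
	bool	demand_read;	/* an actual read consumed the buffer */
	state_t	state;
} hdr_t;

/* Drop one reference; returns true if the header should be destroyed now. */
static bool
hdr_remove_reference(hdr_t *hdr)
{
	if (--hdr->refcount > 0)
		return (false);
	if (!hdr->uncacheable)
		return (false);			/* normal MRU/MFU handling */
	if (hdr->prefetched && !hdr->demand_read) {
		hdr->state = STATE_UNCACHED;	/* wait briefly for a demand read */
		return (false);
	}
	hdr->state = STATE_DESTROYED;		/* demand-read: gone immediately */
	return (true);
}

/* Periodic (roughly once per second) reaper of stale prefetched buffers. */
static void
uncached_scan(hdr_t **hdrs, int n)
{
	for (int i = 0; i < n; i++) {
		if (hdrs[i]->state == STATE_UNCACHED && hdrs[i]->refcount == 0)
			hdrs[i]->state = STATE_DESTROYED;
	}
}
```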

To handle short or misaligned reads, this change tracks at the dbuf layer buffers that were read from the beginning but not to the end. It is assumed that such buffers may receive further reads, so they are kept in the dbuf cache. If a following read reaches the end of such a buffer, it is evicted immediately. Otherwise it follows regular dbuf cache eviction and is evicted from ARC at the same moment. Since the dbuf layer does not know the actual file size, this logic is not applied to the last buffer of a dnode, which is always evicted as before.
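
A minimal sketch of that dbuf-layer tracking, again with invented names and a single flag standing in for the real bookkeeping:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct dbuf {
	uint64_t size;			/* block size */
	bool	uncacheable;		/* primarycache disabled for it */
	bool	partial_read;		/* read from offset 0, not to the end */
	bool	last_of_dnode;		/* file size unknown: skip the trick */
} dbuf_t;

/* After a read of [off, off + len), decide whether to evict the dbuf now. */
static bool
dbuf_read_done(dbuf_t *db, uint64_t off, uint64_t len)
{
	if (!db->uncacheable)
		return (false);			/* cached normally */
	if (db->last_of_dnode)
		return (true);			/* evicted right away, as before */
	if (off + len >= db->size)
		return (true);			/* reached the end: evict now */
	if (off == 0 || db->partial_read) {
		db->partial_read = true;	/* expect more reads; keep cached */
		return (false);
	}
	return (true);				/* misaligned one-off read */
}
```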

Since uncacheable buffers should no longer stay in ARC for long, this patch also tries to optimize I/O by allocating ARC physical buffers as linear to allow buffer sharing. This avoids one of the two memory copies for uncompressed data on both reads and writes, at the cost of somewhat higher KVA usage in the prefetch case. When decompression is needed, sharing is impossible, but an extra memory copy is still avoided, since decompression requires a linear buffer anyway.
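
The allocation policy itself is simple; a hypothetical stand-alone version (the allocator below is a stand-in, not the real ABD interfaces) might look like this:

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct pbuf { void *addr; size_t size; bool linear; } pbuf_t;

/* Trivial stand-in for the real scatter/linear ARC buffer allocators. */
static pbuf_t *
pbuf_alloc(size_t size, bool linear)
{
	pbuf_t *b = malloc(sizeof (*b));
	if (b == NULL)
		return (NULL);
	b->addr = malloc(size);
	if (b->addr == NULL) {
		free(b);
		return (NULL);
	}
	b->size = size;
	b->linear = linear;
	return (b);
}

static pbuf_t *
arc_alloc_physical(size_t size, bool uncacheable)
{
	if (uncacheable) {
		/*
		 * Short-lived anyway, so the extra KVA use is tolerable.
		 * A linear buffer can be shared with the dbuf layer when
		 * the data is uncompressed, or decompressed from directly
		 * when it is compressed, skipping one memory copy either way.
		 */
		return (pbuf_alloc(size, true));
	}
	/* Cacheable, long-lived data keeps scatter allocation as before. */
	return (pbuf_alloc(size, false));
}
```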

With the combination of enabled prefetch and the avoided memory copy, this change improves sequential single-threaded read speed from a wide NVMe pool from 2049 to 3932 MiB/s. During writes, the profiler shows a 22% reduction in unhalted CPU cycles at the same throughput of 3653 MiB/s.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@amotin (Member Author) commented Nov 30, 2022

I think this may nicely fill the gap between regular cached I/O and the zero-copy O_DIRECT of #10018, handling cases where full zero-copy is not possible, or where the user cannot give up prefetch and writeback.

@amotin added the Status: Code Review Needed (Ready for review and testing) and Type: Performance (Performance improvement or performance problem) labels on Dec 1, 2022
@amotin (Member Author) commented Dec 9, 2022

As discussed at the latest monthly meeting, I've enabled linear buffer allocation for data that needs to be decompressed. While those buffers cannot be shared, since the decompression code requires linear buffers, this still avoids an extra memory copy. My tests show a read speed improvement for 50% LZ4-compressed 128KB blocks from 2900 MB/s to 3400 MB/s.

@ghost left a comment:

This seems good to me!

@amotin (Member Author) commented Dec 20, 2022

Just added some more comments and rebased.

@amotin force-pushed the uncached branch 2 times, most recently from 60567c5 to 4ca49ca on December 20, 2022 20:13
@behlendorf (Contributor) left a comment:

Looks good, and it has worked as expected in my local testing. I am a little concerned about always allocating uncacheable buffers as linear, but this should only be an issue if they were allowed to accumulate, and as you pointed out they're short-lived, so this should be okay.

@behlendorf added the Status: Accepted (Ready to integrate; reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label on Dec 20, 2022
@amotin force-pushed the uncached branch 4 times, most recently from 44178ac to f9647b9 on December 22, 2022 15:20
@behlendorf (Contributor) commented Dec 22, 2022

Performance results of primarycache=none and primarycache=metadata running on a striped NVMe pool look very good.

[benchmark graph]

@amotin (Member Author) commented Dec 22, 2022

Rebased onto the new master after the #14123 merge.

@grwilson (Member) left a comment:

Can you update arcstat to get the new "uncached" stats?

@amotin (Member Author) commented Dec 23, 2022

> Can you update arcstat to get the new "uncached" stats?

Yes, it will need an update after both PRs.

@amotin (Member Author) commented Dec 24, 2022

> Can you update arcstat to get the new "uncached" stats?

Done.

@amotin (Member Author) commented Dec 27, 2022

I've added one more chunk for dmu_tx_check_ioerr() to avoid a double block read during short/misaligned uncached writes, as described in #14333. It keeps uncacheable L0 blocks read by dmu_tx_check_ioerr() in the dbuf cache until they are read again by dmu_buf_will_dirty(). For higher-level blocks this trick may not work, leaving uncacheable metadata in cache, since they do not always seem to be read explicitly later to clear the flag. But that should be less critical, since dmu_buf_hold_array_by_dnode() should be able to handle indirect read errors, and hopefully not many people run primarycache=none with a random write workload.
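
Schematically (hypothetical names below, not the real dbuf interfaces), the trick amounts to a "pending rewrite" mark:

```c
#include <stdbool.h>

typedef struct dbuf {
	bool uncacheable;
	bool keep_for_write;	/* hypothetical "pending rewrite" flag */
} dbuf_t;

/* Error-check read path: remember that a write will re-read this L0 block. */
static void
tx_check_ioerr_done(dbuf_t *db, int level)
{
	if (db->uncacheable && level == 0)
		db->keep_for_write = true;
}

/* Write path: the second read consumes the flag, restoring normal eviction. */
static void
buf_will_dirty(dbuf_t *db)
{
	db->keep_for_write = false;
}

/* Dbuf cache check: keep the block only while the rewrite is still pending. */
static bool
dbuf_should_stay_cached(const dbuf_t *db)
{
	return (!db->uncacheable || db->keep_for_write);
}
```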

@amotin mentioned this pull request on Dec 29, 2022
@mmaybee merged commit ed2f7ba into openzfs:master on Jan 5, 2023
behlendorf pushed a commit that referenced this pull request Jan 5, 2023
Recent ARC commits added new statistic counters, such as iohits,
the uncached state, etc.  Represent those.  Also, some of the
previously reported numbers were confusing or even made no sense.
Clean up and restructure the existing reports.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Issue #14115 
Issue #14123
Issue #14243 
Closes #14320
@shodanshok (Contributor) commented:

Thanks for implementing that feature, which will be very useful on fast NVMe devices!

One question: since uncached buffers are linearly allocated, does that mean increased memory fragmentation despite zfs_abd_scatter_enabled? Anything to be aware of?

Thanks.

@amotin (Member Author) commented Jan 7, 2023

> One question: since uncached buffers are linearly allocated, does that mean increased memory fragmentation despite zfs_abd_scatter_enabled? Anything to be aware of?

If the data is not compressed, then only for the amount of prefetched but not-yet-read data, and for no more than one second. For compressed data it may add up to another dbuf cache's worth, which is 3% of ARC by default. So I don't think it should cause problems. The dbuf cache and all dirty data are always linear, so this is not a dramatic change.
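
For a rough, purely hypothetical example: with a 64 GiB ARC, the default 3% dbuf cache is about 2 GiB, so that is roughly the upper bound on extra linear allocations for compressed data, on top of whatever was prefetched within the last second.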

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.

This change gives the ARC knowledge about uncacheable buffers
via  arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state. 
That state is scanned every second, evicting stale buffers that were
not demand read.

This change also tracks dbufs that were read from the beginning,
but not to the end.  It is assumed that such buffers may receive further
reads, and so they are stored in the dbuf cache.  If a following
read reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction.  Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.

Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes openzfs#14243
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
Recent ARC commits added new statistic counters, such as iohits,
the uncached state, etc.  Represent those.  Also, some of the
previously reported numbers were confusing or even made no sense.
Clean up and restructure the existing reports.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Issue openzfs#14115 
Issue openzfs#14123
Issue openzfs#14243 
Closes openzfs#14320
@amotin deleted the uncached branch on October 9, 2023 13:23
behlendorf added a commit that referenced this pull request Oct 13, 2023
New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613,
  #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@amotin mentioned this pull request on Jul 2, 2024
Labels: Status: Accepted (Ready to integrate; reviewed, tested), Type: Performance (Performance improvement or performance problem)

5 participants