Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arc_read()/arc_access() refactoring and cleanup. #14123

Merged
merged 1 commit into from Dec 22, 2022
Merged

Conversation

amotin
Copy link
Member

@amotin amotin commented Nov 2, 2022

ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable.

  • Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO.
  • Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
  • Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations.
  • Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses.
  • Remove hash_lock argument from functions not needing it.
  • Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
  • Fix few false positive tests found in the process.

How Has This Been Tested?

Lots of code reading, basic read/write tests with ZFS_DEBUG and lots of ZTS runs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin mentioned this pull request Nov 2, 2022
13 tasks
@shodanshok
Copy link
Contributor

As this patch refactors arc_read() and arc_access(), any clue if it would affect #14120 ?

@amotin
Copy link
Member Author

amotin commented Nov 2, 2022

As this patch refactors arc_read() and arc_access(), any clue if it would affect #14120 ?

I saw that report of yours, but I am not certain about its causes. I don't think this patch should change MRU/MFU distribution (arc_p), especially after you disabled prefetch, that is mostly affected by this PR. So I don't think it would, but its hard to say for sure, its a quite tangled area.

@amotin
Copy link
Member Author

amotin commented Nov 3, 2022

@mmaybee @ahrens Could you please take a look on this PR? And in particular, could you explain me why prefetched read in arc_mru_ghost and especially arc_mfu_ghost states promoted buffer to arc_mru instead of arc_mfu? Don't you remember what thinking was behind it 16 years ago? Unfortunately message in the illumos commit 13506d1eefb is not descriptive.

@ahrens
Copy link
Member

ahrens commented Nov 4, 2022

could you explain me why prefetched read in arc_mru_ghost and especially arc_mfu_ghost states promoted buffer to arc_mru instead of arc_mfu?

I think a prefetched read in arc_mru_ghost means that the block was read by a predictive prefetch, and then evicted before being demand-accessed. In that case it seems reasonable for a subsequent access to move it to arc_mru, since it was only demand-accessed once. However, I see that if we get a hit on a prefetched block in arc_mru, we move it to mfu. It seems odd to me that we would have a block that was only requested once by the user (but happens to have been correctly predictively prefetched) to be in the MFU.

It definitely make sense that a hit on a prefetched block in arc_mru_ghost or arc_mru should go to the same state, so the current logic seems wrong. But I would think it should go to arc_mru, not arc_mfu, because it was only "hit" once.

@amotin
Copy link
Member Author

amotin commented Nov 5, 2022

I think a prefetched read in arc_mru_ghost means that the block was read by a predictive prefetch, and then evicted before being demand-accessed. In that case it seems reasonable for a subsequent access to move it to arc_mru, since it was only demand-accessed once.

Thanks, @ahrens. I should have noticed that part myself. Fixed.

However, I see that if we get a hit on a prefetched block in arc_mru, we move it to mfu. It seems odd to me that we would have a block that was only requested once by the user (but happens to have been correctly predictively prefetched) to be in the MFU.

Neither my nor I think the existing code promotes previously prefetched block from MRU to MFU on demand hit. Only on following access, whether it will be another demand or another prefetch after the current demand. It may look like promotes only on demand requests could be right, but from anon or ghost we promote to MRU on any request, including prefetch, so the current logic is at least consistent.

@amotin
Copy link
Member Author

amotin commented Nov 29, 2022

As promised during the Developer Summit, I am working on uncached prefetch implementation. Since those two patches are going to overlap in arc.c, I'd appreciate some reviews for this one meanwhile. Thanks in advance.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactoring and cleanup and test changes look good. I have not so much expertise in the area to comment on the behavioral changes.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks pretty reasonable and it's held up well in my local stress testing. I'll keep throwing interesting workloads at it to be certain, but right now this LGTM.

With the addition of the new iostat kstat entries we'll want to update the arcstat and arc_summary commands. Otherwise, we'll end up reporting slightly skewed results due to the ignored "iohit" counters.

module/zfs/arc.c Outdated Show resolved Hide resolved
include/sys/arc_impl.h Show resolved Hide resolved
@amotin
Copy link
Member Author

amotin commented Dec 20, 2022

Added sync_pool to couple more unstable tests.

@behlendorf behlendorf removed the Status: Code Review Needed Ready for review and testing label Dec 20, 2022
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Dec 20, 2022
@amotin
Copy link
Member Author

amotin commented Dec 20, 2022

Removed hash_lock argument from arc_evict_hdr(), noticed by @ryao.

@amotin
Copy link
Member Author

amotin commented Dec 20, 2022

Slightly refactor beginning of arc_access() as proposed by @ryao, and fix minor stats issues I spotted while at it.

include/sys/arc.h Show resolved Hide resolved
module/zfs/arc.c Outdated
* its logic to balance MRU/MFU based on the original state.
*/
arc_adapt(arc_hdr_size(hdr), hdr->b_l1hdr.b_state);
add_reference(hdr, hdr); /* For IO_IN_PROGRESS. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add an additional reference here? Maybe a comment indicating why.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To stop arc_access() from putting this header without any buffers and so other references but obviously nonevictable onto the evictable list of MRU or MFU state. Added comment.

module/zfs/arc.c Show resolved Hide resolved
module/zfs/arc.c Outdated Show resolved Hide resolved
ASSERT(HDR_HAS_L1HDR(hdr));
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we guarantee that an I/O is not in progress?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because header having reference will never appear on evictable headers list of any state. That is why I added references for in-progress headers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

@@ -174,6 +178,7 @@ typedef struct l1arc_buf_hdr {
zfs_refcount_t b_refcnt;

arc_callback_t *b_acb;
arc_callback_t **b_acbp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two accesses to the callback?

Copy link
Member Author

@amotin amotin Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To efficiently implement FIFO to keep original order. First pointer points to the first element, second -- to the acb_next of the last element.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ordering critical here? Just wondering about preserving the ordering vs the additional memory requirement for each l1hdr.

Copy link
Member Author

@amotin amotin Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be. In many benchmarks I saw that requests arriving in order from different threads and blocked on the same indirect block are reaching the data in wrong order, confusing speculative prefetcher and potentially I/O scheduler and disks. Some time ago I made prefetcher to do speculations for read before reading indirect blocks, that significantly helped reads, but for writes it appeared difficult. I hope this change should if not fix the problem, but may be improve it a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It is probably more efficient to make arc_callback_t have an extra pointer for the previous element so that we can process the callback list in reverse order in arc_read_done. This would mean that the memory penalty would only apply when we have a read in progress and not to every l1hdr and gives you the ordering you need. Would that suffice?

Copy link
Member Author

@amotin amotin Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be cheaper for the header, indeed, but still could add one more traverse. I'll think about it closer tomorrow morning. But if we care so much about this structure, we could probably save much more by hiding b_freeze_* under ZFS_DEBUG.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented. First traversal left reversed to not traverse extra time. Hope there will be no significant downsides. The callbacks though should be called in right order, that should be more important.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I also like the idea of putting b_freeze_* behind a debug flag to reduce the memory.

ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.

 - Decouple prefetch status tracking from b_refcnt.  It made sense
originally, but became highly cryptic over the years.  Move all the
logic into arc_access().  While there, clean up and comment state
transitions in arc_access().  Some transitions were weird IMO.
 - Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done().  To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
 - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
 - Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
 - Remove hash_lock argument from functions not needing it.
 - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set.  We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
 - Fix few false positive tests found in the process.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
@amotin amotin mentioned this pull request Dec 22, 2022
13 tasks
@behlendorf behlendorf merged commit c935fe2 into openzfs:master Dec 22, 2022
behlendorf pushed a commit that referenced this pull request Jan 5, 2023
Recent ARC commits added new statistic counters, such as iohits,
uncached state, etc.  Represent those.  Also some of previously
reported numbers were confusing or even made no sense.  Cleanup
and restructure existing reports.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Issue #14115 
Issue #14123
Issue #14243 
Closes #14320
ryao pushed a commit to ryao/zfs that referenced this pull request Jan 10, 2023
ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.

 - Decouple prefetch status tracking from b_refcnt.  It made sense
originally, but became highly cryptic over the years.  Move all the
logic into arc_access().  While there, clean up and comment state
transitions in arc_access().  Some transitions were weird IMO.
 - Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done().  To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
 - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
 - Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
 - Remove hash_lock argument from functions not needing it.
 - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set.  We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
 - Fix few false positive tests found in the process.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes openzfs#14123
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.

 - Decouple prefetch status tracking from b_refcnt.  It made sense
originally, but became highly cryptic over the years.  Move all the
logic into arc_access().  While there, clean up and comment state
transitions in arc_access().  Some transitions were weird IMO.
 - Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done().  To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
 - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
 - Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
 - Remove hash_lock argument from functions not needing it.
 - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set.  We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
 - Fix few false positive tests found in the process.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes openzfs#14123
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
Recent ARC commits added new statistic counters, such as iohits,
uncached state, etc.  Represent those.  Also some of previously
reported numbers were confusing or even made no sense.  Cleanup
and restructure existing reports.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Issue openzfs#14115 
Issue openzfs#14123
Issue openzfs#14243 
Closes openzfs#14320
amotin added a commit to amotin/zfs that referenced this pull request Oct 2, 2023
Earlier as part of openzfs#14123 I've removed one use of b_cv.  This patch
reuses the same approach to remove the other one from much more
rare code path.

This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from
200 to 184 bytes) and seems even more on Linux.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
behlendorf pushed a commit that referenced this pull request Oct 4, 2023
Earlier as part of #14123 I've removed one use of b_cv.  This patch
reuses the same approach to remove the other one from much more
rare code path.

This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from
200 to 184 bytes) and seems even more on Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15340
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 6, 2023
Earlier as part of openzfs#14123 I've removed one use of b_cv.  This patch
reuses the same approach to remove the other one from much more
rare code path.

This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from
200 to 184 bytes) and seems even more on Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15340
behlendorf pushed a commit that referenced this pull request Oct 7, 2023
Earlier as part of #14123 I've removed one use of b_cv.  This patch
reuses the same approach to remove the other one from much more
rare code path.

This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from
200 to 184 bytes) and seems even more on Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15340
@amotin amotin deleted the arc branch October 9, 2023 13:23
behlendorf added a commit that referenced this pull request Oct 13, 2023
New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613,
  #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
Earlier as part of openzfs#14123 I've removed one use of b_cv.  This patch
reuses the same approach to remove the other one from much more
rare code path.

This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from
200 to 184 bytes) and seems even more on Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15340
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants