Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix prefetching of indirect blocks while destroying #14603

Merged
merged 2 commits into from Mar 24, 2023

Conversation

ahrens
Copy link
Member

@ahrens ahrens commented Mar 10, 2023

Motivation and Context

When traversing a tree of block pointers (e.g. for zfs destroy <fs> or zfs send), we prefetch the indirect blocks that will be needed, in traverse_prefetch_metadata(). In the case of zfs destroy <fs>, we do a little traversing each txg, and resume the traversal the next txg. So the indirect blocks that will be needed, and thus are candidates for prefetching, does not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are before the resume point is incorrect, causing the (up to 1024) L1 indirect blocks that are inside the first L2 to not be prefetched. In practice, if we are able to read many more than 1024 blocks per txg, then this will be inconsequential. But if i/o latency is more than a few milliseconds, almost no L1's will be prefetched, so they will be read serially, and thus the destroying will be very slow. This can be observed as zpool get freeing decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll be resuming from, we have not yet resumed, so td_resume is nonzero. At this point, all calls to traverse_prefetch_metadata() will fail, even if the L1 in question is after the resume point. It isn't until the callback is issued for the resume point that we zero out td_resume, but by this point we've already attempted and failed to prefetch everything under this L2 indirect block.

Description

This commit addresses the issue by reusing the existing resume_skip_check() to determine if the L1's bookmark is before or after the resume point. To do so, this function is made non-mutating (the caller now zeros td_resume).

Note, this bug likely predates (was not introduced by) #11803.

How Has This Been Tested?

With high-latency storage, I saw a >10x improvement in space reclamation performance following a zfs destroy <fs>.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ahrens ahrens requested review from grwilson and mmaybee March 10, 2023 04:12
module/zfs/dmu_traverse.c Outdated Show resolved Hide resolved
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Mar 10, 2023
@ryao
Copy link
Contributor

ryao commented Mar 10, 2023

The zloop test failed because this PR's branch was not rebased on the most recent master, so it is missing the recently merged fix from #14583. The FreeBSD test failure is a pre-existing issue.

When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, does not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the new code will prefetch again the indirect blocks that were prefetched beyond the resume point on previous iteration. Though that is probably not as bad as the opposite.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 16, 2023
@behlendorf behlendorf merged commit d2d4f85 into openzfs:master Mar 24, 2023
7 of 10 checks passed
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, does not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#14603
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, does not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#14603
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, does not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#14603
behlendorf added a commit that referenced this pull request Oct 13, 2023
New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613,
  #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants