
zfs: add zpool load ddt subcommand #9464

Closed · wants to merge 11 commits

Conversation

@wca (Contributor) commented Oct 14, 2019

Motivation and Context

This change adds a new zpool load ddt command which causes a pool's DDT to be loaded into ARC. The primary goal is to avoid having to wait for a pool's cache to "warm" before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT that have been evicted due to inactivity.

Description

This change:

  • adds a zpool load ddt subcommand,
  • adds a new DDTLOAD ioctl,
  • adds a new DDT subsystem loadall hook for loading all entries of a given DDT object,
  • implements the hook for the ZAP object adapter by prefetching all level-zero blocks of existing DDT ZAP objects (sketched below).
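As a rough illustration of the last bullet, the ZAP adapter's loadall hook could be as simple as asking the DMU for the object's size and prefetching every level-zero block; the function name and exact shape below are assumptions, not the PR's code:

/*
 * Illustrative sketch only (hypothetical name): prefetch all L0 blocks
 * of a DDT ZAP object so its entries are pulled into ARC.
 */
static int
ddt_zap_loadall(objset_t *os, uint64_t object)
{
	dmu_object_info_t doi;
	int error;

	error = dmu_object_info(os, object, &doi);
	if (error != 0)
		return (error);

	/* Level 0 blocks, from offset 0 through the end of the object. */
	dmu_prefetch(os, object, 0, 0, doi.doi_max_offset,
	    ZIO_PRIORITY_SYNC_READ);
	return (0);
}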

How Has This Been Tested?

I wrote a simple C program that generates trivially dedupable 512-byte counter records and writes them out to a file (a sketch of such a generator follows). I then compared DDT lookup latency by exporting, importing, and copying such files to a new name in the same directory. The goal was to force deterministic DDT lookup patterns in the I/O pipeline.
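A minimal sketch of such a generator (not the exact program used for the numbers below; the file and record-count arguments are arbitrary):

/*
 * Write N 512-byte records, each tagged with a unique counter, so that
 * copying the resulting file within the pool produces 100% DDT hits.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <nrecords>\n", argv[0]);
		return (1);
	}

	FILE *fp = fopen(argv[1], "w");
	if (fp == NULL) {
		perror("fopen");
		return (1);
	}

	uint64_t nrecords = strtoull(argv[2], NULL, 10);
	unsigned char rec[512];

	for (uint64_t i = 0; i < nrecords; i++) {
		memset(rec, 0, sizeof (rec));
		memcpy(rec, &i, sizeof (i));	/* unique per-record pattern */
		if (fwrite(rec, sizeof (rec), 1, fp) != 1) {
			perror("fwrite");
			fclose(fp);
			return (1);
		}
	}

	fclose(fp);
	return (0);
}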

I also added a new functional test to cover zpool load in general:

Test: /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_load/setup (run as root) [00:00] [PASS]
Test: /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_load/zpool_load_001_pos (run as root) [00:25] [PASS]
Test: /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zpool_load/cleanup (run as root) [00:00] [PASS]

The bpftrace script used to collect the latency histograms below:

// Histogram the latency (in ns) of each ddt_zap_lookup() call.
kprobe:ddt_zap_lookup {
	@ts = nsecs;	// entry timestamp
}

kretprobe:ddt_zap_lookup {
	@times = hist(nsecs - @ts);	// elapsed time since entry
}

Tests using export/import cycle (no ddtload):

[128, 256)             1 |                    
[256, 512)            12 |                                                    |
[512, 1K)             20 |                                                    |
[1K, 2K)              39 |                                                    |
[2K, 4K)              54 |                                                    |
[4K, 8K)              69 |                                                    |
[8K, 16K)          17036 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)          1311 |@@@@                                                |
[32K, 64K)           235 |                                                    |
[64K, 128K)           49 |                                                    |
[128K, 256K)          38 |                                                    |
[256K, 512K)          43 |                                                    |
[512K, 1M)            20 |                                                    |
[1M, 2M)               3 |                                                    |
[2M, 4M)               4 |                                                    |
[4M, 8M)               6 |                                                    |
[8M, 16M)              5 |                                                    |
[16M, 32M)             8 |                                                    |
[32M, 64M)           240 |                                                    |
[64M, 128M)          224 |                                                    |
[128M, 256M)         114 |                                                    |
[256M, 512M)           1 |                                                    |
[128, 256)             2 |                                                    |
[256, 512)             7 |                                                    |
[512, 1K)             13 |                                                    |
[1K, 2K)              40 |                                                    |
[2K, 4K)              58 |                                                    |
[4K, 8K)              65 |                                                    |
[8K, 16K)          17081 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)          1290 |@@@                                                 |
[32K, 64K)           235 |                                                    |
[64K, 128K)           56 |                                                    |
[128K, 256K)          22 |                                                    |
[256K, 512K)          31 |                                                    |
[512K, 1M)            20 |                                                    |
[1M, 2M)               5 |                                                    |
[2M, 4M)               5 |                                                    |
[4M, 8M)               4 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             4 |                                                    |
[32M, 64M)           273 |                                                    |
[64M, 128M)          235 |                                                    |
[128M, 256M)          82 |                                                    |
[256M, 512M)           4 |                                                    |
[256, 512)             8 |                                                    |
[512, 1K)             10 |                                                    |
[1K, 2K)              35 |                                                    |
[2K, 4K)              56 |                                                    |
[4K, 8K)             230 |                                                    |
[8K, 16K)          16923 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)          1282 |@@@                                                 |
[32K, 64K)           255 |                                                    |
[64K, 128K)           51 |                                                    |
[128K, 256K)          40 |                                                    |
[256K, 512K)          51 |                                                    |
[512K, 1M)            35 |                                                    |
[1M, 2M)              12 |                                                    |
[2M, 4M)              13 |                                                    |
[4M, 8M)               6 |                                                    |
[8M, 16M)             12 |                                                    |
[16M, 32M)            17 |                                                    |
[32M, 64M)           189 |                                                    |
[64M, 128M)          172 |                                                    |
[128M, 256M)         134 |                                                    |
[256M, 512M)           1 |                                                    |

Tests performed after export/import/ddtload cycle:

[128, 256)             3 |                                                    |
[256, 512)            24 |                                                    |
[512, 1K)             22 |                                                    |
[1K, 2K)              55 |                                                    |
[2K, 4K)              74 |                                                    |
[4K, 8K)              87 |                                                    |
[8K, 16K)          17281 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)          1668 |@@@@@                                               |
[32K, 64K)           263 |                                                    |
[64K, 128K)           48 |                                                    |
[128K, 256K)           3 |                                                    |
[256K, 512K)           3 |                                                    |
[512K, 1M)             1 |                                                    |
[128, 256)             2 |                                                    |
[256, 512)            15 |                                                    |
[512, 1K)             23 |                                                    |
[1K, 2K)              37 |                                                    |
[2K, 4K)              57 |                                                    |
[4K, 8K)             462 |@                                                   |
[8K, 16K)          16914 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)          1669 |@@@@@                                               |
[32K, 64K)           287 |                                                    |
[64K, 128K)           54 |                                                    |
[128K, 256K)           8 |                                                    |
[256K, 512K)           4 |                                                    |
[128, 256)             1 |                                                    |
[256, 512)            12 |                                                    |
[512, 1K)             34 |                                                    |
[1K, 2K)              55 |                                                    |
[2K, 4K)              93 |                                                    |
[4K, 8K)             698 |@@                                                  |
[8K, 16K)          16795 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)          1514 |@@@@                                                |
[32K, 64K)           268 |                                                    |
[64K, 128K)           59 |                                                    |
[128K, 256K)           3 |                                                    |

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Like most performance-related changes, this one isn't practical to cover with automated testing, and it only affects pool operation when the command is run, so I elected not to run the existing test suite either.

@ahrens added the "Type: Feature" (feature request or new feature) label Oct 14, 2019
@ahrens (Member) commented Oct 14, 2019

This is a cool idea. Could you send an email to developer@open-zfs.org to let folks on all platforms know about this proposed new feature?

@richardelling (Contributor) commented:

This is actually pretty simple to test and I think you should create a test.

  1. create pool
  2. enable ddt
  3. write some data
  4. export pool
  5. measure in-core size of ddt
  6. import pool
  7. zpool ddtload
  8. measure in-core size of ddt
  9. compare

codecov bot commented Oct 15, 2019

Codecov Report

Merging #9464 into master will decrease coverage by 3%.
The diff coverage is 86%.


@@            Coverage Diff            @@
##           master    #9464     +/-   ##
=========================================
- Coverage      79%      76%     -3%     
=========================================
  Files         385      381      -4     
  Lines      121481   120850    -631     
=========================================
- Hits        96461    92368   -4093     
- Misses      25020    28482   +3462
Flag     Coverage Δ
#kernel  76% <93%> (-4%) ⬇️
#user    65% <32%> (-2%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@behlendorf added the "Status: Code Review Needed" (ready for review and testing) label Oct 24, 2019
@behlendorf (Contributor) commented:

Neat idea! I can definitely see the value in being able to force-load the DDT. Which got me wondering if there are other things that might be useful to preload on request. Assuming for the moment there are, it would be nice to make the command a little more generic and future-proof. Perhaps something similar in spirit to the zpool wait command, i.e.:

zpool load [-t type[,type]...] pool

Where type can initially only be ddt but it leaves us the flexibility to add new types. One related idea would be to load all of the metadata for each dataset in the pool. For pools with a large number of filesystems and clones this could significantly speed up the first run of zpool list. Given our track record, I suspect we'll end up coming up with additional uses for something like this.

I agree with @richardelling that we'd also want to include at least some basic functional tests. Plus some sanity tests for the CLI options under tests/functional/cli_root/zpool_load/.

wca and others added 6 commits January 1, 2020 15:39
Implement an ioctl which interfaces with a DDT adapter callback for
loading all entries of a given DDT object type, and calls it for every
DDT object that exists in a pool.

Implement the ZAP adapter callback by prefetching the entire zap object.

This subcommand enables users to pre-warm (or re-warm) the cache for DDT
entries if they reboot or otherwise perform an export/import cycle, and
skip the wait for the entries to be loaded, to restore normal I/O write
performance to the pool.

Signed-off-by:	Will Andrews <will@firepipe.net>
Add dmu_prefetch_wait() to make ddtload synchronous, which consists of
calling dmu_buf_hold_array_by_dnode() on all blocks.  This also fixes the
issue that ddtload finished too early for tables larger than DMU_MAX_ACCESS.

Signed-off-by:	Will Andrews <will@firepipe.net>
Signed-off-by:	Will Andrews <will@firepipe.net>
Export this via a new 'dedupcached' r/o pool property.  Also change the
existing ddt object stats to be returned as they were, so `zpool status -D`
shows the full quantities, which is easier to compare to dedupcached.
Averages can be generated by users directly.

dmu: expose dmu_object_cached_size() which generates the ARC L1 and L2
sizes for a given ZFS object.  This works by reading all L1 dbufs in the
object and asking ARC the state of each non-hole blkptr_t found.

arc: expose arc_cached() which indicates whether a given object is in
ARC and if so whether L1, MRU/MFU, or L2 cached.

ZFS_IOC_POOL_GET_PROPS: Add the ability to request specific properties
in addition to the usual "get all".  This enables the new DEDUPCACHED
property to only be generated when requested (via zpool status -D),
which avoids delaying `zpool import` as well as regular `zpool status`.

Signed-off-by:	Will Andrews <will@firepipe.net>
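To illustrate how a caller such as dmu_object_cached_size() might consume arc_cached(), here is a sketch; the (spa, blkptr) signature and the ARC_CACHED_IN_L2 flag are assumptions on my part (ARC_CACHED_IN_MRU and ARC_CACHED_IN_L1 do appear in the review comments further down), so treat the names as illustrative:

/*
 * Sketch only: classify one block pointer's ARC residency and account
 * it toward per-object L1/L2 totals.
 */
static void
cached_size_account(spa_t *spa, const blkptr_t *bp,
    uint64_t *l1_bytes, uint64_t *l2_bytes)
{
	uint64_t flags;

	if (bp == NULL || BP_IS_HOLE(bp))
		return;		/* holes have nothing cached */

	flags = arc_cached(spa, bp);
	if (flags & ARC_CACHED_IN_L1)
		*l1_bytes += BP_GET_LSIZE(bp);
	if (flags & ARC_CACHED_IN_L2)
		*l2_bytes += BP_GET_LSIZE(bp);
}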
Change the zpool status -D output to be more easily parseable, and further
support the -p flag to render the raw byte values.  This enables the test to
perform more precise arithmetic comparisons.

Signed-off-by:	Will Andrews <will@firepipe.net>
This allows the subcommand to take other types of data to force-load in the
future, as requested during review.

Signed-off-by:	Will Andrews <will@firepipe.net>
@wca (Contributor, Author) commented Jan 1, 2020

I've updated the branch to fix the merge conflict owing to commit c5ebfbb. Additionally:

  • zpool ddtload is now zpool load ddt.
  • There's now a test for this functionality.
  • It's now possible to query the ARC cache state for any object, and further to do so for a pool's DDT.
  • Larger DDTs will now be fully loaded.
  • zpool load ddt is now synchronous, and can be cancelled via SIGINT.

@wca changed the title from "zfs: add zpool ddtload subcommand" to "zfs: add zpool load ddt subcommand" Jan 1, 2020
@ahrens (Member) commented Jan 3, 2020

I like the idea of making this more generic - I could imagine wanting to preload other kinds of metadata (e.g. most objects in the MOS). But I think it would be better to stick with the CLI template of zpool <subcommand> <flags> <target>, e.g. zpool load rpool or zpool load -t ddt rpool or zpool load --ddt rpool. As opposed to the current proposal of zpool <subcommand> <subsubcommand> <flags> <target>, e.g. zpool load ddt rpool (Or maybe ddt is a <subtarget> rather than a <subsubcommand>, but in that case should the order be zpool <subcommand> <target> <subtarget>, e.g. zpool load rpool ddt? - another argument for making it a flag.)

Without reading the manpage, it's hard to guess what zpool load rpool would do. An admin might wonder "Do I need to load the pool before using it? How is that different than zpool import rpool?"

Maybe we can find a more specific verb to use for this subcommand. For example zpool prefetch rpool (and zpool prefetch -t ddt rpool)

@ahrens (Member) commented Jan 3, 2020

I agree that the new behavior of being synchronous (and cancelable with SIGINT) is better than kicking off an un-cancelable, un-observable background task.

I'm not sure how this would fit into the current implementation, but from a user interface point of view, it might be more consistent with other long-running tasks to make zpool prefetch -t ddt rpool return immediately, kicking it off in the background. And to add the ability to observe its progress with zpool status, and wait for it with zpool wait -t prefetch (or wait with zpool prefetch --wait -t ddt rpool).

However, a counter-argument to this is that unlike other background tasks (scrub, remove, trim, initialize, zfs destroy), zpool prefetch doesn't change any on-disk state, and it doesn't continue across reboot/export/import.

@wca (Contributor, Author) commented Jan 6, 2020

I'm not sure how this would fit into the current implementation, but from a user interface point of view, it might be more consistent with other long-running tasks to make zpool prefetch -t ddt rpool return immediately, kicking it off in the background. And to add the ability to observe its progress with zpool status, and wait for it with zpool wait -t prefetch (or wait with zpool prefetch --wait -t ddt rpool).

However, a counter-argument to this is that unlike other background tasks (scrub, remove, trim, initialize, zfs destroy), zpool prefetch doesn't change any on-disk state, and it doesn't continue across reboot/export/import.

I'm not sure it's worth the extra infrastructure required to support a background prefetch for the DDT; a synchronous foreground implementation is much simpler to manage.

Another counter-argument is that, at any time, ARC can choose to evict blocks that were loaded, i.e. if the system workload requires too much memory for other purposes to retain the DDT blocks. How would such a scenario be reflected in the status? In the end, I think the user cares more about the actual cached state of the pool's DDT than about the progress of this particular action.

Without reading the manpage, it's hard to guess what zpool load rpool would do.

What does zpool wait rpool or zpool prefetch rpool mean without any type argument?

I think I am comfortable with changing this PR to use the form zpool wait -t ddt rpool rather than zpool load ddt rpool. That implies a synchronous implementation, and isn't easily confused with zpool import as you point out.

Also, I recognize the PR needs to be updated to fix the compile issue in debug mode (i.e. my testing didn't check assertions), so I'd like to address that as part of the same update.

@ahrens (Member) commented Jan 6, 2020

What does zpool wait rpool or zpool prefetch rpool mean without any type argument?

From the zpool-wait manpage:

If no activities are specified, the command waits until background activity of every type [...] has ceased.

So I think zpool prefetch rpool would mean to read everything that we know how to prefetch, same as zpool prefetch -t X,Y,Z rpool, where X,Y,Z is a list of all valid options. But if that's too confusing then we could say that you have to use the -t flag to say what to prefetch.

I think I am comfortable with changing this PR to use the form zpool wait -t ddt rpool rather than zpool load ddt rpool. That implies a synchronous implementation, and isn't easily confused with zpool import as you point out.

zpool wait is used to wait for activity that's been kicked off by some other action. What command would kick off the prefetching/loading of the DDT?

I'm suggesting something like zpool prefetch -t ddt rpool. If the implementation is always synchronous, then there would be no changes to zpool wait (it wouldn't support waiting for prefetch).

@wca (Contributor, Author) commented Jan 6, 2020

I'm suggesting something like zpool prefetch -t ddt rpool. If the implementation is always synchronous, then there would be no changes to zpool wait (it wouldn't support waiting for prefetch).

I'm not sure this quite fits within the notion of prefetch in ZFS, which is generally a background process, but I'm okay with doing it that way if others agree.

Remove the props lock hold when looking up dedupcached, as the lock isn't
needed to protect the context accessed.

Update the test to switch to zpool status -DD.

Signed-off-by:	Will Andrews <will@firepipe.net>
Signed-off-by:	Will Andrews <will@firepipe.net>
Signed-off-by:	Will Andrews <will@firepipe.net>
Requested by Matt Ahrens.

Signed-off-by:	Will Andrews <will@firepipe.net>
@behlendorf (Contributor) commented:

Thanks for updating this. It looks like the CI failed since the dn_struct_rwlock lock wasn't held by dmu_object_cached_size when calling dbuf_hold_impl.

[ 5860.973163] VERIFY(RW_LOCK_HELD(&dn->dn_struct_rwlock)) failed
[ 5860.977780] PANIC at dbuf.c:3267:dbuf_hold_impl()
[ 5860.981817] Showing stack for process 8094
[ 5860.998303] Call Trace:
[ 5861.001224]  dump_stack+0x66/0x90
[ 5861.004540]  spl_panic+0xd3/0xfb [spl]
[ 5861.046436]  dbuf_hold_impl+0x701/0xc20 [zfs]
[ 5861.054103]  dmu_object_cached_size+0x14f/0x1d0 [zfs]
[ 5861.058228]  ddt_get_pool_dedup_cached+0x69/0xb0 [zfs]
[ 5861.062447]  spa_prop_add+0x47/0x70 [zfs]
[ 5861.066142]  spa_prop_get_nvlist+0x41/0x80 [zfs]
[ 5861.070126]  zfs_ioc_pool_get_props+0x15d/0x200 [zfs]
[ 5861.074302]  zfsdev_ioctl_common+0x1d4/0x5f0 [zfs]
[ 5861.078373]  zfsdev_ioctl+0x4d/0xd0 [zfs]
[ 5861.082077]  do_vfs_ioctl+0xa4/0x630
[ 5861.089220]  ksys_ioctl+0x60/0x90
[ 5861.092559]  __x64_sys_ioctl+0x16/0x20
[ 5861.096077]  do_syscall_64+0x53/0x110
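A sketch of the kind of fix implied here (not the actual patch): bracket the dbuf walk in dmu_object_cached_size() with the dnode's struct lock taken as reader, which is what the assertion in dbuf_hold_impl() demands.

	/* Sketch: hold dn_struct_rwlock as reader around the L1 dbuf walk. */
	rw_enter(&dn->dn_struct_rwlock, RW_READER);
	/* ... dbuf_hold_impl() / dbuf_read() loop over the object's L1 blocks ... */
	rw_exit(&dn->dn_struct_rwlock);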

@mattmacy (Contributor) commented:

@wca unrelated to the current PR - but please take a look at #9923 #9924 #9936 #9937 #9939 #9952 #9954 - the last in particular would ideally be refactored into smaller commits.

@mattmacy (Contributor) commented:

@wca and #9936 (comment)

Signed-off-by:	Will Andrews <will@firepipe.net>
.Dt ZPOOL-LOAD 8
.Os Linux
.Sh NAME
.Nm zpool Ns Pf - Cm load
Review comment (Member):

This should match the subcommand name from the implementation in zpool_main.c, which has zpool prefetch, not zpool load.

Comment on lines +168 to +169
.It Xr zpool-load 8
Force loads specific types of data.
Review comment (Member):

same here - zpool-prefetch.8

@ahrens (Member) commented Apr 30, 2020

Could you update the first comment (the one with "Description", etc.) to reflect the design changes? E.g. changing the subcommand name, adding the zpool status output, sync vs. async. Could you also include example output from zpool status that shows the new functionality?

I'd also like to get feedback on this proposal that I mentioned in #9464 (comment):

I think it would be better to stick with the CLI template of zpool <subcommand> <flags> <target>, e.g. zpool prefetch -t ddt rpool or zpool prefetch --ddt rpool. As opposed to the current proposal of zpool <subcommand> <subsubcommand> <flags> <target>, e.g. zpool prefetch ddt rpool (Or maybe ddt is a <subtarget> rather than a <subsubcommand>, but in that case should the order be zpool <subcommand> <target> <subtarget>, e.g. zpool prefetch rpool ddt? - another argument for making it a flag.)

@ahrens (Member) left a review comment:

Wow, github remembered some pending comments that I wrote a year ago but never submitted!

Comment on lines +1643 to +1644
if (issig(JUSTLOOKING) && issig(FORREAL))
	break;
Review comment (Member):

I think this should return EINTR, so that callers can know that the counts are incomplete.
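That is, something along the lines of (sketch only):

	if (issig(JUSTLOOKING) && issig(FORREAL)) {
		err = SET_ERROR(EINTR);	/* tell callers the counts are partial */
		break;
	}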

Comment on lines +1628 to +1629
for (uint64_t off = 0; off < doi.doi_max_offset;
    off += doi.doi_metadata_block_size) {
Review comment (Member):

I think we should iterate through the L1's using dnode_next_offset, which will skip L1's that are not present (in sparse files). See get_next_chunk() for an example of finding L1's.

if (err != 0)
	continue;

err = dbuf_read(db, NULL, DB_RF_CANFAIL);
Review comment (Member):

We could get an i/o error here (or from dbuf_hold_impl). I guess if the L1 can't be read, nothing under it could be cached, so this seems reasonable. Could you add a comment here explaining that, since at first glance ignoring the error seems questionable.
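For example, the skip could be documented roughly like this (a sketch, not the PR's code):

	err = dbuf_read(db, NULL, DB_RF_CANFAIL);
	/*
	 * An I/O error here (or from dbuf_hold_impl()) means this L1 can't
	 * be read, so nothing below it can be cached; skip it rather than
	 * failing the whole accounting walk.
	 */
	if (err != 0)
		continue;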

Comment on lines -494 to -498
/* ... and compute the averages. */
if (ddo_total->ddo_count != 0) {
	ddo_total->ddo_dspace /= ddo_total->ddo_count;
	ddo_total->ddo_mspace /= ddo_total->ddo_count;
}
Review comment (Member):

Looks like this is changing the meaning of ddo_[dm]space, and also what we print in print_dedup_stats()? Maybe the change will be obvious since the number will be so much larger?

if (HDR_HAS_L1HDR(hdr)) {
	arc_state_t *state = hdr->b_l1hdr.b_state;
	if (state == arc_mru || state == arc_mru_ghost)
		flags |= ARC_CACHED_IN_MRU;
Review comment (Member):

I would have expected this to return with flags including ARC_CACHED_IN_L1
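That is, something like the following (a sketch of the reviewer's suggestion; whether ghost states should also count as L1-resident is a separate question):

	if (HDR_HAS_L1HDR(hdr)) {
		arc_state_t *state = hdr->b_l1hdr.b_state;

		flags |= ARC_CACHED_IN_L1;	/* report L1 residency explicitly */
		if (state == arc_mru || state == arc_mru_ghost)
			flags |= ARC_CACHED_IN_MRU;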

@ahrens (Member) commented Jun 10, 2021

I noticed that this PR hasn’t been updated in some time. We would like to see this feature added, but it seems that it isn’t quite complete. @wca, are you planning to continue work on this in the near future? If not, we'll probably close this PR and you can reopen it or submit a new PR if/when you have time to get back to it.

@ahrens (Member) commented Nov 9, 2021

If anyone has time to pick this up later, feel free to reopen this PR (or open a new one).

@ahrens closed this Nov 9, 2021
@allanjude mentioned this pull request Feb 14, 2024