zfs: add zpool load ddt subcommand #9464
Conversation
This is a cool idea. Could you send an email to developer@open-zfs.org to let folks on all platforms know about this proposed new feature?
This is actually pretty simple to test and I think you should create a test.
Codecov Report
@@ Coverage Diff @@
## master #9464 +/- ##
=========================================
- Coverage 79% 76% -3%
=========================================
Files 385 381 -4
Lines 121481 120850 -631
=========================================
- Hits 96461 92368 -4093
- Misses 25020 28482 +3462
Continue to review full report at Codecov.
Neat idea! I can definitely see the value in being able to force load the DDT. Which got me wondering if there are other things which might be useful to preload on request. Assuming for the moment there are, then it would be nice to make the command a little more generic and future proof. Perhaps something similar in spirit to the -t <type> option used by zpool wait.
I agree with @richardelling that we'd also want to include at least some basic functional tests, plus some sanity tests for the CLI options under the cli_root test suite.
Implement an ioctl which interfaces with a DDT adapter callback for loading all entries of a given DDT object type, and calls it for every DDT object that exists in a pool. Implement the ZAP adapter callback by prefetching the entire zap object. This subcommand enables users to pre-warm (or re-warm) the cache for DDT entries if they reboot or otherwise perform an export/import cycle, and skip the wait for the entries to be loaded, to restore normal I/O write performance to the pool. Signed-off-by: Will Andrews <will@firepipe.net>
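As a rough illustration of the ZAP adapter described in this commit, here is a minimal sketch, assuming the callback receives the objset and object number of a DDT ZAP. dmu_object_info() and dmu_prefetch() are existing DMU interfaces, but the callback name and signature are hypothetical:

/*
 * Hypothetical sketch of the ZAP adapter: asynchronously prefetch
 * every block of the given DDT ZAP object.
 */
static int
ddt_zap_prefetch_all(objset_t *os, uint64_t object)
{
	dmu_object_info_t doi;
	int err;

	err = dmu_object_info(os, object, &doi);
	if (err != 0)
		return (err);

	/* Issue an async prefetch covering the entire object. */
	dmu_prefetch(os, object, 0, 0, doi.doi_max_offset,
	    ZIO_PRIORITY_ASYNC_READ);
	return (0);
}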
Add dmu_prefetch_wait() to make ddtload synchronous, which consists of calling dmu_buf_hold_array_by_dnode() on all blocks. This also fixes the issue that ddtload finished too early for tables larger than DMU_MAX_ACCESS. Signed-off-by: Will Andrews <will@firepipe.net>
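A condensed sketch of what this synchronous walk might look like, assuming a held dnode; dmu_buf_hold_array_by_dnode() is the interface named in the commit, but the chunking, flags, and error handling here are simplified assumptions:

static int
prefetch_wait_sketch(dnode_t *dn, uint64_t size)
{
	uint64_t off = 0;
	int err = 0;

	/* Walk the object in chunks no larger than DMU_MAX_ACCESS. */
	while (off < size && err == 0) {
		uint64_t chunk = MIN(size - off, DMU_MAX_ACCESS / 2);
		dmu_buf_t **dbp;
		int numbufs;

		/* Holding with read=TRUE blocks until the data is in ARC. */
		err = dmu_buf_hold_array_by_dnode(dn, off, chunk, TRUE,
		    FTAG, &numbufs, &dbp, DMU_READ_PREFETCH);
		if (err == 0)
			dmu_buf_rele_array(dbp, numbufs, FTAG);
		off += chunk;
	}
	return (err);
}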
Signed-off-by: Will Andrews <will@firepipe.net>
Export this via a new 'dedupcached' r/o pool property. Also change the existing ddt object stats to be returned as they were, so zpool status -D shows the full quantities, which is easier to compare to dedupcached. Averages can be generated by users directly.

dmu: expose dmu_object_cached_size(), which generates the ARC L1 and L2 sizes for a given ZFS object. This works by reading all L1 dbufs in the object and asking ARC the state of each non-hole blkptr_t found.

arc: expose arc_cached(), which indicates whether a given object is in ARC and, if so, whether it is L1, MRU/MFU, or L2 cached.

ZFS_IOC_POOL_GET_PROPS: add the ability to request specific properties in addition to the usual "get all". This enables the new DEDUPCACHED property to be generated only when requested (via zpool status -D), which avoids delaying zpool import as well as regular zpool status.

Signed-off-by: Will Andrews <will@firepipe.net>
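For context, a hedged sketch of the per-L1 accounting described above: arc_cached() and the ARC_CACHED_IN_* flags are named in this PR, while the helper name, signature, and size accounting are assumptions.

/*
 * For one already-read L1 indirect dbuf, ask ARC about each non-hole
 * block pointer and tally the logical size cached in L1 and L2.
 */
static void
count_cached_blocks(spa_t *spa, dmu_buf_t *l1db, int epb,
    uint64_t *l1sz, uint64_t *l2sz)
{
	const blkptr_t *bps = l1db->db_data;

	for (int i = 0; i < epb; i++) {
		const blkptr_t *bp = &bps[i];

		if (BP_IS_HOLE(bp))
			continue;

		int flags = arc_cached(spa, bp);
		if (flags & ARC_CACHED_IN_L1)
			*l1sz += BP_GET_LSIZE(bp);
		if (flags & ARC_CACHED_IN_L2)
			*l2sz += BP_GET_LSIZE(bp);
	}
}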
Change the zpool status -D output to be more easily parseable, and further support the -p flag to render the raw byte values. This enables the test to perform more precise arithmetic comparisons. Signed-off-by: Will Andrews <will@firepipe.net>
This allows the subcommand to take other types of data to force-load in the future, as requested during review. Signed-off-by: Will Andrews <will@firepipe.net>
I've updated the branch to fix the merge conflict owing to commit c5ebfbb, along with a few additional changes.
I like the idea of making this more generic - I could imagine wanting to preload other kinds of metadata (e.g. most objects in the MOS). But I think it would be better to stick with the CLI template of zpool <subcommand> [options] <pool>. Without reading the manpage, it's hard to guess what zpool load ddt <pool> does. Maybe we can find a more specific verb to use for this subcommand. For example, zpool prefetch -t ddt <pool>.
I agree that the new behavior of being synchronous (and cancelable with SIGINT) is better than kicking off an un-cancelable, un-observable background task. I'm not sure how this would fit into the current implementation, but from a user interface point of view, it might be more consistent with other long-running tasks to make this a background activity whose progress is observable and which can be waited on with zpool wait. However, a counter-argument to this is that unlike other background tasks (scrub, remove, trim, initialize, zfs destroy), this one makes no persistent on-disk changes, so there is less need to track its progress.
I'm not sure the extra infrastructure required to support a background prefetch for the DDT is worth it; a synchronous foreground implementation is much simpler to manage. Another counter-argument is the fact that, at any time, ARC can choose to evict blocks that were loaded, i.e. when the system workload requires too much memory for other purposes to retain the DDT blocks. How would such a scenario be reflected in the status? In the end, I think the user cares more about the actual cached state of the pool's DDT than about the progress of this particular action.
What does zpool wait do if the requested activity isn't currently running? I think I am comfortable with changing this PR to use the form zpool prefetch -t ddt <pool>. Also, I recognize the PR needs to be updated to fix the compile issue in debug mode (i.e. my testing didn't check assertions), so I'd like to address this issue as part of the same update.
From the zpool-wait manpage: if there is no activity of the given types in progress, the command returns immediately.

So I think the case where no prefetch is running is already handled. I'm suggesting something like zpool wait -t prefetch <pool>.
I'm not sure this quite fits within the notion of prefetch in ZFS, which is generally a background process, but I'm okay with doing it that way if others agree.
Remove the props lock hold when looking up dedupcached, as the lock isn't needed to protect the context accessed. Update the test to switch to zpool status -DD. Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Will Andrews <will@firepipe.net>
Requested by Matt Ahrens. Signed-off-by: Will Andrews <will@firepipe.net>
Thanks for updating this. It looks like the CI failed.
@wca, see also #9936 (comment).
Signed-off-by: Will Andrews <will@firepipe.net>
.Dt ZPOOL-LOAD 8
.Os Linux
.Sh NAME
.Nm zpool Ns Pf - Cm load
This should match the subcommand name from the implementation in zpool_main.c, which has zpool prefetch, not zpool load.
.It Xr zpool-load 8
Force loads specific types of data.
same here - zpool-prefetch.8
Could you update the first comment (the one with "Description", etc.) to reflect the design changes? e.g. changing the subcommand name, adding the zpool status output, sync vs async. Could you also include example output from the new command? I'd also like to get feedback on the proposal that I mentioned in #9464 (comment).
Wow, github remembered some pending comments that I wrote a year ago but never submitted!
if (issig(JUSTLOOKING) && issig(FORREAL))
	break;
I think this should return EINTR, so that callers can know that the counts are incomplete.
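A sketch of the suggested change, keeping the surrounding loop structure from the diff; SET_ERROR() is the usual ZFS error-tagging macro:

if (issig(JUSTLOOKING) && issig(FORREAL)) {
	/* Tell callers the counts computed so far are incomplete. */
	err = SET_ERROR(EINTR);
	break;
}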
for (uint64_t off = 0; off < doi.doi_max_offset;
    off += doi.doi_metadata_block_size) {
I think we should iterate through the L1's using dnode_next_offset, which will skip L1's that are not present (in sparse files). See get_next_chunk() for an example of finding L1's.
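Sketched out, the suggested iteration might look like the following; dnode_next_offset() with minlvl 2 is the pattern get_next_chunk() uses, but the forward-walking loop and span arithmetic here are assumptions:

/* Bytes of data spanned by a single L1 indirect block. */
uint64_t l1span = (uint64_t)dn->dn_datablksz <<
    (dn->dn_indblkshift - SPA_BLKPTRSHIFT);
uint64_t off = 0;

/* minlvl = 2 skips L1s that don't exist (holes in sparse files). */
while (dnode_next_offset(dn, 0, &off, 2, 1, 0) == 0) {
	/* ... hold and read the L1 dbuf covering "off" here ... */
	off = P2ROUNDUP(off + 1, l1span);
}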
if (err != 0)
	continue;

err = dbuf_read(db, NULL, DB_RF_CANFAIL);
We could get an i/o error here (or from dbuf_hold_impl). I guess if the L1 can't be read, nothing under it could be cached, so this seems reasonable. Could you add a comment here explaining that, since at first glance ignoring the error seems questionable.
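The requested comment might read something like this (a sketch; the surrounding hold/release calls follow the diff above):

/*
 * If the L1 indirect block itself can't be read, nothing beneath
 * it can be cached in ARC, so ignoring the error (rather than
 * failing the whole scan) still produces a correct count.
 */
err = dbuf_read(db, NULL, DB_RF_CANFAIL);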
/* ... and compute the averages. */
if (ddo_total->ddo_count != 0) {
	ddo_total->ddo_dspace /= ddo_total->ddo_count;
	ddo_total->ddo_mspace /= ddo_total->ddo_count;
}
Looks like this is changing the meaning of ddo_[dm]space, and also what we print in print_dedup_stats()? Maybe the change will be obvious since the number will be so much larger?
if (HDR_HAS_L1HDR(hdr)) {
	arc_state_t *state = hdr->b_l1hdr.b_state;
	if (state == arc_mru || state == arc_mru_ghost)
		flags |= ARC_CACHED_IN_MRU;
I would have expected this to return with flags including ARC_CACHED_IN_L1
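Concretely, the reviewer's expectation might look like the sketch below. The flag names follow this PR, while the ghost-state handling is an assumption (ghost headers track evicted data, so only non-ghost states would count as L1-resident):

if (HDR_HAS_L1HDR(hdr)) {
	arc_state_t *state = hdr->b_l1hdr.b_state;

	if (state == arc_mru || state == arc_mru_ghost)
		flags |= ARC_CACHED_IN_MRU;
	else if (state == arc_mfu || state == arc_mfu_ghost)
		flags |= ARC_CACHED_IN_MFU;

	/* Ghost states have no data in memory; only real states are L1. */
	if (state != arc_mru_ghost && state != arc_mfu_ghost)
		flags |= ARC_CACHED_IN_L1;
}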
I noticed that this PR hasn’t been updated in some time. We would like to see this feature added, but it seems that it isn’t quite complete. @wca, are you planning to continue work on this in the near future? If not, we'll probably close this PR and you can reopen it or submit a new PR if/when you have time to get back to it.
If anyone has time to pick this up later, feel free to reopen this PR (or open a new one).
Motivation and Context
This change adds a new zpool load ddt command which causes a pool's DDT to be loaded into ARC. The primary goal is to remove the need to "warm" a pool's cache before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT if they have been flushed due to inactivity.

Description
This change:
- adds a new zpool load ddt subcommand,
- adds a loadall hook for loading all entries of a given DDT object.

How Has This Been Tested?
I wrote a simple C program that generates trivially dedupable 512-byte counter records and writes them out to a file. Then I compared DDT lookup latency by exporting, importing, and copying such files to a new name in the same directory. The goal was to force deterministic DDT lookup patterns in the I/O pipeline.
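The generator itself wasn't included in the PR; a minimal userland sketch in the same spirit might look like this, assuming recordsize=512 on the test dataset so each record maps to one block. Copying the resulting file rewrites identical blocks, forcing one DDT lookup per block:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Write <nrecords> 512-byte records, each containing a unique counter.
 * The blocks within the file are unique, but a copy of the file
 * dedups entirely against the original.
 */
int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <nrecords>\n", argv[0]);
		return (1);
	}
	FILE *fp = fopen(argv[1], "w");
	if (fp == NULL) {
		perror("fopen");
		return (1);
	}
	uint64_t nrec = strtoull(argv[2], NULL, 0);
	unsigned char buf[512];
	for (uint64_t i = 0; i < nrec; i++) {
		memset(buf, 0, sizeof (buf));
		memcpy(buf, &i, sizeof (i));	/* per-record counter */
		if (fwrite(buf, sizeof (buf), 1, fp) != 1) {
			perror("fwrite");
			return (1);
		}
	}
	fclose(fp);
	return (0);
}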
I also added a new functional test to cover zpool load in general.

bpftrace script used below
Tests using export/import cycle (no ddtload):

Tests performed after export/import/ddtload cycle:
Types of changes

Checklist:

All commit messages are properly formatted and contain Signed-off-by.

This change, like most performance-related changes, isn't feasible to cover with automated testing, and it only affects pool operation if the command is run, so I elected not to run the existing test suite either.