Fast Clone Deletion #8416
I'd seen the presentation about this before, so it was nice to see the code that went along with it. I don't have situations where I have lots of clones to delete frequently, but I can see the need for it. This looks like a marked improvement there.
Nothing obvious leaped out at me as problematic. I'll probably try to go over the changes again, but in the meantime it looks good.
Thanks for taking a look! I'll go through your comments soon.
@shartse can you please rebase this on master now that redacted send/recv has been merged.
Thanks for addressing the review feedback. Rebasing this one last time on master should resolve the kernel.org build failure. The cstyle checker also caught a few minor things. Then this looks good to me!
@behlendorf - Great, thanks for your feedback! I updated this review with one more fix to address a race condition we found recently. It'd be great if you could take a look at that as well and let me know if you have any concerns.
Deleting a clone requires finding blocks that are clone-only, not shared with the snapshot. This was done by traversing the entire block tree, which results in a large performance penalty for sparsely written clones. This new method keeps track of clone blocks when they are modified in a "Livelist" so that, when it's time to delete, the clone-specific blocks are already at hand. We see performance improvements because deletion work is now proportional to the number of clone-modified blocks, not the size of the original dataset.

Signed-off-by: sara hartse <sara.hartse@delphix.com>
Codecov Report

@@            Coverage Diff             @@
##           master    #8416      +/-   ##
==========================================
+ Coverage   78.78%   79.02%   +0.24%
==========================================
  Files         400      400
  Lines      121004   121645     +641
==========================================
+ Hits        95328    96130     +802
+ Misses      25676    25515     -161

Continue to review full report at Codecov.
Deleting a clone requires finding blocks that are clone-only, not shared with the snapshot. This was done by traversing the entire block tree, which results in a large performance penalty for sparsely written clones. This new method keeps track of clone blocks when they are modified in a "Livelist" so that, when it's time to delete, the clone-specific blocks are already at hand. We see performance improvements because deletion work is now proportional to the number of clone-modified blocks, not the size of the original dataset.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes openzfs#8416
Can someone look at this corruption bug?
Video of talk at BSDCan
Note: I've updated the deletion performance figures and test script since this presentation as they weren't doing quite what I thought they were. Effectively, I was passing the `stride` option in `dd` in bytes instead of blocks. This meant I was writing more than I thought to the clone, and in the sparse case I was writing off the end of the file. The old tests still demonstrated meaningful performance differences, just in a less realistic scenario.

Motivation and Context
Deleting zfs clones requires determining which parts of the block tree belong to the clone and which are still part of the original snapshot. The basic algorithm is to traverse the block tree, checking the birth time of each block. If the block was born before the clone was created, we know it is part of the snapshot and can stop there. If it was born after the clone was created, free it and recursively check its child blocks in the same way.
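As a concrete illustration, here is a minimal C sketch of that birth-time traversal. The `block_t` type and the `free_block()` helper are simplified, hypothetical stand-ins, not the real ZFS `blkptr_t` traversal machinery.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for a block-tree node; real ZFS uses blkptr_t. */
typedef struct block {
	uint64_t	birth_txg;	/* txg in which the block was written */
	size_t		nchildren;	/* nonzero for indirect blocks */
	struct block	**children;
} block_t;

extern void free_block(block_t *bp);	/* hypothetical: frees one block */

/*
 * Old deletion algorithm: walk the tree, keeping blocks born at or
 * before the clone's creation (shared with the snapshot) and freeing
 * blocks born after it (clone-specific).
 */
static void
destroy_clone_blocks(block_t *bp, uint64_t clone_birth_txg)
{
	if (bp == NULL || bp->birth_txg <= clone_birth_txg)
		return;	/* missing or shared with the snapshot: leave it */

	/* Clone-specific: children may still be shared, so recurse. */
	for (size_t i = 0; i < bp->nchildren; i++)
		destroy_clone_blocks(bp->children[i], clone_birth_txg);
	free_block(bp);
}
```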
This algorithm is far from ideal for sparsely modified clones (those with a small number of scattered writes). Even though very little has been written to the clone, it takes a long time to scan through the block tree; i.e., deletion time for a sparsely written clone is proportional to the size of the underlying block tree, which is the size of the snapshot. Our goal is to change this to be proportional to the space used by the clone only.
The following figure shows the performance difference between deleting clones with the same amount of modification, just with contiguous vs. sparsely distributed writes.
The script I wrote for this test is available here: lx_fast_delete
Description
Livelist Algorithm
Livelists use the same `deadlist_t` struct as Deadlists and are also used to track block pointers over the lifetime of a dataset. Livelists, however, belong to clones and track the block pointers that are clone-specific (i.e., were born after the clone's creation). The exception is embedded block pointers, which are not included in livelists because they do not need to be freed.

When it comes time to delete the clone, the livelist provides a quick reference as to what needs to be freed. For this reason, livelists also track when clone-specific block pointers are freed before deletion, to prevent double frees. Each block pointer in a livelist is marked as a `FREE` or an `ALLOC`, and the deletion algorithm iterates backwards over the livelist, matching `FREE`/`ALLOC` pairs and then freeing those `ALLOC`s which remain. Livelists are also updated when block pointers are remapped during device removal: the old version of the blkptr is cancelled out with a `FREE` and the new version is tracked with an `ALLOC`.
To bound the amount of memory required for deletion, livelists over a certain size are spread over multiple entries (made natural by the use of the `deadlist_t` data structure). Entries are grouped by birth txg so we can be sure the `ALLOC`/`FREE` pair for a given block pointer will be in the same entry. This allows us to delete livelists incrementally over multiple syncs, one entry at a time. The threshold size at which we create a new sub-livelist is an important tunable for livelist performance: we had to balance the fact that larger sublists mean fewer sublists (decreasing the cost of insertion) against the consideration that sublists will be loaded into memory and shouldn't take up an inordinate amount of space. We settled on ~500000 entries, corresponding to roughly 128MB.
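As an illustration of the birth-txg grouping, here is one way the routing invariant could look in C. The `sublist_t` type is a hypothetical simplification; the real implementation stores sub-livelists as `deadlist_t` entries.

```c
#include <stdint.h>
#include <stddef.h>

#define	LL_MAX_ENTRIES	500000	/* sublist threshold from above */

/* Hypothetical sub-livelist covering a contiguous range of birth txgs. */
typedef struct sublist {
	uint64_t	max_txg;	/* highest birth txg this list covers */
	size_t		nentries;
	struct sublist	*next;		/* next sublist, in txg order */
} sublist_t;

/*
 * Route an entry to the sublist covering its birth txg. Because both
 * the ALLOC and any later FREE of a block carry the block's birth txg,
 * they always land in the same sublist, which is what lets a sublist
 * be processed (or condensed) independently of the others.
 */
static sublist_t *
sublist_for_txg(sublist_t *head, uint64_t birth_txg)
{
	sublist_t *s = head;

	while (s->next != NULL && birth_txg > s->max_txg)
		s = s->next;
	return (s);
}
```

A new sublist with a higher txg range would then be opened once the last one exceeds `LL_MAX_ENTRIES`, keeping insertion cost and per-sync memory bounded.

Condensing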
We still have the issue that the livelist can grow arbitrarily large, continually gaining sub-livelists. However, because of the restriction that block pointers of the same txg must reside in the same sub-livelist, we can determine whether a given block has been freed in time proportional to the size of that sub-livelist instead of the total number of frees and allocs. This means we can periodically clear out `ALLOC`/`FREE` pairs from sublists and then merge the smaller lists together.
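A minimal sketch of the pair-cancelling step, reusing the simplified `ll_entry_t` from the deletion sketch; the quadratic matching is for brevity, where the real code can afford a tree keyed on the block pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { LL_ALLOC, LL_FREE } ll_type_t;
typedef struct { ll_type_t type; uint64_t blk_id; } ll_entry_t;

/*
 * Cancel ALLOC/FREE pairs within one sublist in place; returns the new
 * entry count. Sublists whose counts shrink enough can then be merged.
 */
static size_t
sublist_condense(ll_entry_t *e, size_t n)
{
	bool *dead = calloc(n, sizeof (bool));
	size_t out = 0;

	/* Pair each FREE with the earlier ALLOC of the same block. */
	for (size_t i = 0; i < n; i++) {
		if (e[i].type != LL_FREE)
			continue;
		for (size_t j = 0; j < i; j++) {
			if (!dead[j] && e[j].type == LL_ALLOC &&
			    e[j].blk_id == e[i].blk_id) {
				dead[i] = dead[j] = true;
				break;
			}
		}
	}

	/* Compact the surviving entries, preserving their order. */
	for (size_t i = 0; i < n; i++) {
		if (!dead[i])
			e[out++] = e[i];
	}
	free(dead);
	return (out);
}
```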
Disabling

Livelists are not always the most effective deletion method. We can approximate how much of a performance gain a livelist will give us based on the percentage of blocks shared between the clone and its origin. Zero percent shared means that the clone has completely diverged and that the old method is maximally effective: every read from the block tree will result in lots of frees. Livelists give us gains when they track blocks scattered across the tree, where one read in the old method might only result in a few frees. Once the clone has been overwritten enough, writes are no longer sparse and we no longer get much of a benefit from tracking them with a livelist. We chose a lower limit of 75 percent shared (25 percent overwritten). Once the amount of shared space drops below this threshold, the clone reverts to the old deletion method.
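A minimal sketch of that heuristic, assuming `referenced` and `shared` are byte counts the dataset layer can already provide:

```c
#include <stdint.h>
#include <stdbool.h>

#define	LIVELIST_MIN_PERCENT_SHARED	75	/* threshold from above */

/*
 * Once less than 75% of the clone's space is still shared with its
 * origin, fall back to the traversal-based deletion. Assumes byte
 * counts small enough that the multiply cannot overflow a uint64_t.
 */
static bool
livelist_still_worthwhile(uint64_t referenced, uint64_t shared)
{
	if (referenced == 0)
		return (false);
	/* Integer math: multiply first to avoid truncating the ratio. */
	return (shared * 100 / referenced >= LIVELIST_MIN_PERCENT_SHARED);
}
```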
Testing
Performance Results
Deletion
I repeated the first test discussed above for deleting sparsely written clones and found a dramatic performance improvement using the new algorithm. As the above figure shows, the old deletion time grows linearly with respect to the size of the snapshot, while the new deletion time grows at a much smaller rate. In the following figure, I compare the new deletion time to the size of the data written to the clone and we can see that it's now proportional to that.
I also examined performance changes in the old algorithm's "best case" scenario: that of contiguously overwritten clones. I wasn't expecting a performance improvement with the livelist strategy, since the old method was already very efficient there (because of the way the overwritten block pointers are laid out in the block tree), and we can see from the figure below that the livelist does add some overhead. However, since the deletion time is still proportional to the amount of data written, and real-world scenarios are likely to be a hybrid of sparse and contiguous write patterns, I'm not concerned about this performance change.
Writes
One of the main drawbacks of the livelist deletion algorithm is the extra work it imposes at write time to track the modified block pointers for clones. I explored possible write performance degradation by constructing a "worst case" scenario for writes: one in which a write modifies blocks tracked in many different livelist sublists, so a `FREE` entry has to be appended to each one. The test:

- Write `X` bytes of data in `N` different chunks to the clone, such that each chunk is separated into its own sublist.
- Overwrite `Y` bytes, spread evenly into `N` pieces, such that each piece touches a different sublist.
- Measure the `sync` time, which is when the block pointers are flushed to the livelist.

The script I used for this test is available here: lx_write_test
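To connect the scenario to the mechanism, here is a small C sketch of the write-time bookkeeping being stressed. It reuses the hypothetical `sublist_for_txg()` routing from the Sublists sketch; `sublist_append_free()` and `overwrite_t` are likewise made-up stand-ins.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical record of an old block displaced by a clone write. */
typedef struct {
	uint64_t	blk_id;
	uint64_t	birth_txg;	/* birth txg of the old block */
} overwrite_t;

/* From the Sublists sketch: */
typedef struct sublist sublist_t;
extern sublist_t *sublist_for_txg(sublist_t *head, uint64_t birth_txg);
extern void sublist_append_free(sublist_t *s, uint64_t blk_id);

/*
 * At sync time, every old block pointer displaced by a write must be
 * recorded as a FREE in the sublist covering its birth txg. N scattered
 * overwrites hitting N different sublists therefore cost N separate
 * sublist appends, which is exactly what the worst-case test measures.
 */
static void
track_overwrites(sublist_t *head, const overwrite_t *old, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		sublist_t *s = sublist_for_txg(head, old[i].birth_txg);
		sublist_append_free(s, old[i].blk_id);
	}
}
```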
The following figure shows the results from repeated experiments with `N = 50`, `N = 1`, and `N = 0` (the old algorithm that doesn't use livelists). We can see that a larger number of sublists does result in degraded write performance. This motivates having the sublist size be as large as is reasonable given memory constraints, as well as the condensing of sublists and the disabling of livelists altogether once the clone has been overwritten past a certain point.

Note: the figure shows the average time of 1000 writes and the standard deviation.
Correctness
I added zfs tests to cover basic assumptions about livelists, as well as edge cases for deletion, condensing, and disabling.
The illumos implementation of this feature has been active internally at Delphix for the past 10 months.
I've been running these changes through `zloop` for several days without hitting anything related to this.