Make block-dedupe the default
The idea of find-dupes is a great one - we want to cut down on the number of
extent references placed on disk by building extents out of our dupe blocks
tree.

The problem is that we've never been able to get this to perform reasonably
well and give good dedupe results at the same time. The design doc in our
wiki has the full details, but the most relevant excerpt is:

We're trying to balance at least 3 very important resources:
 - cpu usage
 - memory usage
 - quality of dedupe

Right now we catch all possible extents (100% dedupe quality) at the expense
of a ton of memory and CPU. Turning down the quality in favor of fewer
expended resources tends to get us into situations where the pattern of dedupe
is seemingly random, or we always miss at least some obvious cases (such as
identical files).

We can continue to experiment until we get something that works well -
there are still many options going forward. In the meantime, however, the
number of bug reports I have received where find-dupes is a severe
performance problem is too high. We want to ensure a smooth user experience,
especially for those with large dedupe sets, so make find-dupes optional.
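
Users who still want the extent-optimizing behavior can ask for it
explicitly through the dedupe options documented below, for example
(the target path is just a placeholder):

    duperemove -d --dedupe-options=noblock /path/to/files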

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Mark Fasheh committed Sep 26, 2016
1 parent c2e3229 commit ac32d43
Showing 5 changed files with 23 additions and 30 deletions.
9 changes: 1 addition & 8 deletions FAQ.md
@@ -2,7 +2,7 @@

### Is there an upper limit to the amount of data duperemove can process?

Duperemove v0.10 is fast at reading and cataloging data. Dedupe runs
Duperemove v0.11 is fast at reading and cataloging data. Dedupe runs
will be memory limited unless the '--hashfile' option is used. '--hashfile'
allows duperemove to temporarily store duplicated hashes to disk, thus removing
the large memory overhead and allowing for a far larger amount of data to be
@@ -13,13 +13,6 @@ Actual performance numbers are dependent on hardware - up to date
testing information is kept [on the wiki](https://github.com/markfasheh/duperemove/wiki/Performance-Numbers)


### Why does it not print out all duplicate extents?

Internally duperemove is classifying extents based on various criteria
like length, number of identical extents, etc. The printout we give is
based on the results of that classification.


### How can I find out my space savings after a dedupe?

Duperemove will print out an estimate of the saved space after a
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@ This README is for duperemove v0.11.
Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
to each other, finding and categorizing blocks that match each
other. When given the -d option, duperemove will submit those
extents for deduplication using the Linux kernel extent-same ioctl.
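
For reference, the "extent-same ioctl" mentioned above is the kernel's
dedupe-range interface, exposed as FIDEDUPERANGE on Linux 4.5 and newer
(older kernels provide the btrfs-specific equivalent,
BTRFS_IOC_FILE_EXTENT_SAME). The sketch below is illustrative only - it
is not duperemove's internal code, and the 128KiB length and the file
arguments are placeholders:

    /* dedupe_one.c - minimal example of a single extent-same request.
     * Build with: cc dedupe_one.c -o dedupe_one */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
                    return 1;
            }

            int src = open(argv[1], O_RDONLY);
            int dest = open(argv[2], O_RDWR);
            if (src < 0 || dest < 0) {
                    perror("open");
                    return 1;
            }

            /* One source range, one destination. The kernel verifies that
             * the ranges hold identical data before sharing any extents. */
            struct file_dedupe_range *range =
                    calloc(1, sizeof(*range) +
                              sizeof(struct file_dedupe_range_info));
            if (!range)
                    return 1;
            range->src_offset = 0;
            range->src_length = 128 * 1024;  /* placeholder length */
            range->dest_count = 1;
            range->info[0].dest_fd = dest;
            range->info[0].dest_offset = 0;

            if (ioctl(src, FIDEDUPERANGE, range) < 0) {
                    perror("FIDEDUPERANGE");
                    return 1;
            }

            printf("deduped %llu bytes, status %d\n",
                   (unsigned long long)range->info[0].bytes_deduped,
                   range->info[0].status);
            free(range);
            return 0;
    }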

22 changes: 11 additions & 11 deletions docs/duperemove.html
@@ -41,7 +41,7 @@ <h1><a name="2"></a>"DESCRIPTION"</h1>
<p class="pp j"><b>duperemove</b> is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
to each other, finding and categorizing blocks that match each
other. When given the <b>-d</b> option, <b>duperemove</b> will submit
those extents for deduplication using the Linux kernel extent-same
ioctl.
@@ -81,11 +81,6 @@ <h2><a name="4"></a>"Readonly / Non-deduplicating Mode"</h2>
deduplication at a later time.
<br />
<br />
It is important to note that this mode will not print out <b>all</b>
instances of matching extents, just those it would consider for
deduplication.
<br />
<br />
Generally, duperemove does not concern itself with the underlying
representation of the extents it processes. Some of them could be
compressed, undergoing I/O, or even have already been deduplicated. In
@@ -275,7 +270,7 @@ <h1><a name="6"></a>"OPTIONS"</h1>
<br />
</p>

<p class="tp1"><b>---dedupe-options=</b><i>options</i>
<p class="tp1"><b>--dedupe-options=</b><i>options</i>
</p>

<p class="tp2 j">Comma separated list of options which alter how we dedupe. Prepend 'no' to an
@@ -313,10 +308,15 @@ <h1><a name="6"></a>"OPTIONS"</h1>
<p class="tp1"><b>[no]block</b>
</p>

<p class="tp2 j">Defaults to <b>off</b>. Dedupe by block - don't optimize our data into
extents before dedupe. Generally this is undesirable as it will
greatly increase the total number of dedupe requests. There is also a
larger potential for file fragmentation.
<p class="tp2 j">Defaults to <b>on</b>. Duperemove submits duplicate blocks directly to
the dedupe engine.
<br />
<br />
Duperemove can optionally optimize the duplicate block lists into
larger extents prior to dedupe submission. The search algorithm used
for this has a very high memory and CPU overhead, but it may reduce the
number of extent references created during dedupe. If you'd like to
try this, run with 'noblock'.
<br />
<br />
</p>
18 changes: 9 additions & 9 deletions duperemove.8
@@ -8,7 +8,7 @@ duperemove \- Find duplicate extents and print them to stdout
\fBduperemove\fR is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
to each other, finding and categorizing blocks that match each
other. When given the \fB-d\fR option, \fBduperemove\fR will submit
those extents for deduplication using the Linux kernel extent-same
ioctl.
@@ -36,10 +36,6 @@ seeing what duperemove might do when run with \fB-d\fR. The output could
also be used by some other software to submit the extents for
deduplication at a later time.

It is important to note that this mode will not print out \fBall\fR
instances of matching extents, just those it would consider for
deduplication.

Generally, duperemove does not concern itself with the underlying
representation of the extents it processes. Some of them could be
compressed, undergoing I/O, or even have already been deduplicated. In
@@ -186,10 +182,14 @@ fiemap during the file scan stage, you will also want to use the
\fB--lookup-extents=no\fR option.
.TP
\fB[no]block\fR
Defaults to \fBoff\fR. Dedupe by block - don't optimize our data into
extents before dedupe. Generally this is undesirable as it will
greatly increase the total number of dedupe requests. There is also a
larger potential for file fragmentation.
Defaults to \fBon\fR. Duperemove submits duplicate blocks directly to
the dedupe engine.

Duperemove can optionally optimize the duplicate block lists into
larger extents prior to dedupe submission. The search algorithm used
for this has a very high memory and CPU overhead, but it may reduce the
number of extent references created during dedupe. If you'd like to
try this, run with 'noblock'.
.RE

.TP
2 changes: 1 addition & 1 deletion duperemove.c
Expand Up @@ -56,7 +56,7 @@ unsigned int blocksize = DEFAULT_BLOCKSIZE;
int run_dedupe = 0;
int recurse_dirs = 0;
int one_file_system = 1;
int block_dedupe = 0;
int block_dedupe = 1;
int dedupe_same_file = 0;
int skip_zeroes = 0;
