Make block-dedupe the default
The idea of find-dupes is a great one - we want to cut down on the number of
extent references placed on disk by building extents out of our dupe blocks
tree.

The problem is that we've never been able to get this to perform reasonably
well and give good dedupe results at the same time. The design doc in our
wiki has the full details, but the most relevant excerpt is:

We're trying to balance at least 3 very important resources:
 - cpu usage
 - memory usage
 - quality of dedupe

Right now we catch all possible extents (100% dedupe quality) at the expense
of a ton of memory and CPU. Turning down the quality in favor of fewer
expended resources tends to get us into situations where the pattern of dedupe
is seemingly random, or we always miss at least some obvious cases (such as
identical files).

We can continue to experiment until we get something that works well -
there are still many options going forward. In the meantime, however, the
number of bug reports I have received where find-dupes is a severe
performance problem is too high. We want to ensure a smooth user experience,
especially for those with large dedupe sets, so make find-dupes optional.
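
Users who still want the extent-optimizing behavior can ask for it
explicitly through the dedupe options documented below, for example
(the target path is just a placeholder):

    duperemove -d --dedupe-options=noblock /path/to/files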

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Mark Fasheh committed Sep 26, 2016
1 parent c2e3229 commit ac32d43
Showing 5 changed files with 23 additions and 30 deletions.
9 changes: 1 addition & 8 deletions FAQ.md
@@ -2,7 +2,7 @@

### Is there an upper limit to the amount of data duperemove can process?

Duperemove v0.10 is fast at reading and cataloging data. Dedupe runs
Duperemove v0.11 is fast at reading and cataloging data. Dedupe runs
will be memory limited unless the '--hashfile' option is used. '--hashfile'
allows duperemove to temporarily store duplicated hashes to disk, thus removing
the large memory overhead and allowing for a far larger amount of data to be
@@ -13,13 +13,6 @@ Actual performance numbers are dependent on hardware - up to date
testing information is kept [on the wiki](https://github.com/markfasheh/duperemove/wiki/Performance-Numbers)


### Why does it not print out all duplicate extents?

Internally duperemove is classifying extents based on various criteria
like length, number of identical extents, etc. The printout we give is
based on the results of that classification.


### How can I find out my space savings after a dedupe?

Duperemove will print out an estimate of the saved space after a
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@ This README is for duperemove v0.11.
Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
to each other, finding and categorizing blocks that match each
other. When given the -d option, duperemove will submit those
extents for deduplication using the Linux kernel extent-same ioctl.
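
For reference, the "extent-same ioctl" mentioned above is the kernel's
dedupe-range interface, exposed as FIDEDUPERANGE on Linux 4.5 and newer
(older kernels provide the btrfs-specific equivalent,
BTRFS_IOC_FILE_EXTENT_SAME). The sketch below is illustrative only - it
is not duperemove's internal code, and the 128KiB length and the file
arguments are placeholders:

    /* dedupe_one.c - minimal example of a single extent-same request.
     * Build with: cc dedupe_one.c -o dedupe_one */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
                    return 1;
            }

            int src = open(argv[1], O_RDONLY);
            int dest = open(argv[2], O_RDWR);
            if (src < 0 || dest < 0) {
                    perror("open");
                    return 1;
            }

            /* One source range, one destination. The kernel verifies that
             * the ranges hold identical data before sharing any extents. */
            struct file_dedupe_range *range =
                    calloc(1, sizeof(*range) +
                              sizeof(struct file_dedupe_range_info));
            if (!range)
                    return 1;
            range->src_offset = 0;
            range->src_length = 128 * 1024;  /* placeholder length */
            range->dest_count = 1;
            range->info[0].dest_fd = dest;
            range->info[0].dest_offset = 0;

            if (ioctl(src, FIDEDUPERANGE, range) < 0) {
                    perror("FIDEDUPERANGE");
                    return 1;
            }

            printf("deduped %llu bytes, status %d\n",
                   (unsigned long long)range->info[0].bytes_deduped,
                   range->info[0].status);
            free(range);
            return 0;
    }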

22 changes: 11 additions & 11 deletions docs/duperemove.html
@@ -41,7 +41,7 @@ <h1><a name="2"></a>"DESCRIPTION"</h1>
<p class="pp j"><b>duperemove</b> is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
to each other, finding and categorizing blocks that match each
other. When given the <b>-d</b> option, <b>duperemove</b> will submit
those extents for deduplication using the Linux kernel extent-same
ioctl.
@@ -81,11 +81,6 @@ <h2><a name="4"></a>"Readonly / Non-deduplicating Mode"</h2>
deduplication at a later time.
<br />
<br />
It is important to note that this mode will not print out <b>all</b>
instances of matching extents, just those it would consider for
deduplication.
<br />
<br />
Generally, duperemove does not concern itself with the underlying
representation of the extents it processes. Some of them could be
compressed, undergoing I/O, or even have already been deduplicated. In
@@ -275,7 +270,7 @@ <h1><a name="6"></a>"OPTIONS"</h1>
<br />
</p>

<p class="tp1"><b>---dedupe-options=</b><i>options</i>
<p class="tp1"><b>--dedupe-options=</b><i>options</i>
</p>

<p class="tp2 j">Comma separated list of options which alter how we dedupe. Prepend 'no' to an
@@ -313,10 +308,15 @@ <h1><a name="6"></a>"OPTIONS"</h1>
<p class="tp1"><b>[no]block</b>
</p>

<p class="tp2 j">Defaults to <b>off</b>. Dedupe by block - don't optimize our data into
extents before dedupe. Generally this is undesirable as it will
greatly increase the total number of dedupe requests. There is also a
larger potential for file fragmentation.
<p class="tp2 j">Defaults to <b>on</b>. Duperemove submits duplicate blocks directly to
the dedupe engine.
<br />
<br />
Duperemove can optionally optimize the duplicate block lists into
larger extents prior to dedupe submission. The search algorithm used
for this has a very high memory and CPU overhead, but it may reduce the
number of extent references created during dedupe. If you'd like to
try this, run with 'noblock'.
<br />
<br />
</p>
18 changes: 9 additions & 9 deletions duperemove.8
@@ -8,7 +8,7 @@ duperemove \- Find duplicate extents and print them to stdout
\fBduperemove\fR is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
to each other, finding and categorizing blocks that match each
other. When given the \fB-d\fR option, \fBduperemove\fR will submit
those extents for deduplication using the Linux kernel extent-same
ioctl.
@@ -36,10 +36,6 @@ seeing what duperemove might do when run with \fB-d\fR. The output could
also be used by some other software to submit the extents for
deduplication at a later time.

It is important to note that this mode will not print out \fBall\fR
instances of matching extents, just those it would consider for
deduplication.

Generally, duperemove does not concern itself with the underlying
representation of the extents it processes. Some of them could be
compressed, undergoing I/O, or even have already been deduplicated. In
@@ -186,10 +182,14 @@ fiemap during the file scan stage, you will also want to use the
\fB--lookup-extents=no\fR option.
.TP
\fB[no]block\fR
Defaults to \fBoff\fR. Dedupe by block - don't optimize our data into
extents before dedupe. Generally this is undesirable as it will
greatly increase the total number of dedupe requests. There is also a
larger potential for file fragmentation.
Defaults to \fBon\fR. Duperemove submits duplicate blocks directly to
the dedupe engine.

Duperemove can optionally optimize the duplicate block lists into
larger extents prior to dedupe submission. The search algorithm used
for this has a very high memory and CPU overhead, but it may reduce the
number of extent references created during dedupe. If you'd like to
try this, run with 'noblock'.
.RE

.TP
2 changes: 1 addition & 1 deletion duperemove.c
Expand Up @@ -56,7 +56,7 @@ unsigned int blocksize = DEFAULT_BLOCKSIZE;
int run_dedupe = 0;
int recurse_dirs = 0;
int one_file_system = 1;
int block_dedupe = 0;
int block_dedupe = 1;
int dedupe_same_file = 0;
int skip_zeroes = 0;
