Tools for deduping file systems

Latest commit a38bf63038, authored by Mark Fasheh:

    Fix typo in cmp_filerecs()

    We're accidentally going in the same rbtree direction regardless of
    subvol id value, which is sub-optimal at best.

    Signed-off-by: Mark Fasheh <mfasheh@suse.de>
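
The fix addresses a classic comparator slip: both subvolume-id branches returned the same direction, so rbtree insertion always descended the same way regardless of the key. A hypothetical sketch of the corrected pattern (the real function lives in filerec.c; the struct and field names here are illustrative, not duperemove's actual ones):

#include <stdint.h>

struct filerec_key {
	uint64_t subvolid;	/* btrfs subvolume id */
	uint64_t inum;		/* inode number within that subvolume */
};

/* Order filerecs by (subvolid, inum) for rbtree insert/lookup. */
static int cmp_filerecs(const struct filerec_key *a,
			const struct filerec_key *b)
{
	if (a->subvolid < b->subvolid)
		return -1;
	if (a->subvolid > b->subvolid)
		return 1;	/* the typo'd version returned -1 here too */
	if (a->inum < b->inum)
		return -1;
	if (a->inum > b->inum)
		return 1;
	return 0;
}
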
Files in the repository, each with the subject of its last commit:

.gitignore - initial commit for duperemove from my private repo. This will be the
FAQ.md Pull the FAQ into its own file, that way I don't have to update 2 pl…
LICENSE - initial commit for duperemove from my private repo. This will be the
LICENSE.libbloom few comments
LICENSE.xxhash Add a license file for xxHash.
Makefile Fix building, move -lm flag
README.md Update README.md with a link to v0.09 branch.
SubmittingPatches - Add documentation describing how to submit patches to duperemove.
bloom.c Bump bloom internal size to uint64_t
bloom.h Bump bloom internal size to uint64_t
bswap.h Use proper 8 and 16 bit types in bswap.h
btrfs-extent-same.8 - Add btrfs-extent-same.8
btrfs-extent-same.c - Add author attribution
btrfs-ioctl.h - sync btrfs_ioctl_same_args definition to kernel
btrfs-util.c Inodes between subvolumes on a btrfs file system can have the same i_…
btrfs-util.h Inodes between subvolumes on a btrfs file system can have the same i_…
csum-murmur3.c Make a few local declarations static.
csum-sha256.c Make a few local declarations static.
csum-test.c Add option to choose hash type used.
csum-xxhash.c Make a few local declarations static.
csum.c Make a few local declarations static.
csum.h Use murmur3 for default hash.
d_tree.c Combine digest_new and digest_insert. Check for malloc errors inside
d_tree.h Free d_tree asap
debug.h Import unlikely() from linux kernel source and use it for abort_on().
dedupe.c Clean up some of our debug and verbose prints
dedupe.h - Putting filerecs on lists for the dedupe context doesn't work out b…
duperemove.8 - Introduce --hashfile= option. Users can use this to get the bloom f…
duperemove.c Remove debug print.
file_scan.c Merge branch 'memfree' of git://github.com/clobrother/duperemove
file_scan.h A lot of style fixes
filerec.c Fix typo in cmp_filerecs()
filerec.h Free filerecs before exiting. This doesn't really make a difference in
find_dupes.c Clean up some of our debug and verbose prints
find_dupes.h Move the code implementing our duplicate extent search into find_dupes.c
hash-tree.c Few fixes & cleanup
hash-tree.h Few fixes & cleanup
hashstats.8 - Add manpages for show-shared-extents and hashstats.
hashstats.c Check for errors from read_hash_tree(), use a function to pretty print
kernel.h - initial commit for duperemove from my private repo. This will be the
list.h - initial commit for duperemove from my private repo. This will be the
memstats.c Move memstat tracking code to memstats.c
memstats.h alloc tracking macros should declare the variable inside the C file a…
rbtree.c - initial commit for duperemove from my private repo. This will be the
rbtree.h - initial commit for duperemove from my private repo. This will be the
rbtree.txt - initial commit for duperemove from my private repo. This will be the
results-tree.c Use remove_extent() to cleverly free dupe_extents
results-tree.h Use remove_extent() to cleverly free dupe_extents
run_dedupe.c Use remove_extent() to cleverly free dupe_extents
run_dedupe.h - Change --hash-threads option to --io-threads.
serialize.c Check for errors from read_hash_tree(), use a function to pretty print
serialize.h Check for errors from read_hash_tree(), use a function to pretty print
sha256-config.h Add csum-sha256, using the polarssl implementation I imported in the
sha256.c Add csum-sha256, using the polarssl implementation I imported in the
sha256.h Add csum-sha256, using the polarssl implementation I imported in the
show-shared-extents.8 - Add manpages for show-shared-extents and hashstats.
util.c Move memstat tracking code to memstats.c
util.h - add -h option to pretty-print numbers
xxhash.c xxhash: Disable XXH_FORCE_NATIVE_FORMAT
xxhash.h Added xxhash algorithm

README.md

This README is for the development branch of duperemove. If you're looking for a stable version which is continually updated with fixes, please see the v0.09 branch.

Duperemove

Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files it will hash their contents on a block-by-block basis and compare those hashes to each other, finding and categorizing extents that match. When given the -d option, duperemove will submit those extents for deduplication using the btrfs-extent-same ioctl.
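
The btrfs-extent-same.c utility in this repository is a thin command-line wrapper around that ioctl. Below is a minimal sketch of submitting a single extent pair, assuming linux/btrfs.h from a 3.13-era or newer kernel supplies the structures (the repository carries its own definitions in btrfs-ioctl.h); it is an illustration, not duperemove's actual submission path:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_same_args *same;
	struct btrfs_ioctl_same_extent_info *info;
	int src, dst, ret;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}
	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Room for the args header plus one destination extent. */
	same = calloc(1, sizeof(*same) + sizeof(*info));
	if (!same)
		return 1;
	same->logical_offset = 0;	/* source extent offset */
	same->length = 128 * 1024;	/* assume one 128K extent */
	same->dest_count = 1;
	info = &same->info[0];
	info->fd = dst;
	info->logical_offset = 0;	/* destination extent offset */

	/* The kernel re-reads and compares both ranges before sharing
	   them, so a stale or false hash match cannot corrupt data. */
	ret = ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, same);
	if (ret)
		perror("BTRFS_IOC_FILE_EXTENT_SAME");
	else
		printf("status %d, %llu bytes deduped\n", info->status,
		       (unsigned long long)info->bytes_deduped);
	free(same);
	return ret ? 1 : 0;
}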

Duperemove has two major modes of operation, one of which is a subset of the other.

Readonly / Non-deduplicating Mode

When run without -d (the default), duperemove will print out one or more tables of matching extents it has determined would be ideal candidates for deduplication. As a result, readonly mode is useful for seeing what duperemove might do when run with '-d'. The output could also be used by some other software to submit the extents for deduplication at a later time.

It is important to note that this mode will not print out all instances of matching extents, just those it would consider for deduplication.

Generally, duperemove does not concern itself with the underlying representation of the extents it processes. Some of them could be compressed, undergoing I/O, or even have already been deduplicated. In dedupe mode, the kernel handles those details and therefore we try not to replicate that work.

Deduping Mode

This functions similarly to readonly mode with the exception that the duplicated extents found in our "read, hash, and compare" step will actually be submitted for deduplication. An estimate of the total data deduplicated will be printed after the operation is complete. This estimate is calculated by comparing the total amount of shared bytes in each file before and after the dedupe.
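
One way to measure a file's shared bytes from userspace is the fiemap ioctl, which flags extents that are shared with some other extent (reflinked or already deduped). A simplified, hypothetical sketch of the idea, assuming the whole extent list fits in one fixed-size buffer:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 512

/* Sum the lengths of this file's extents that the filesystem
   reports as shared with another extent. */
static unsigned long long count_shared_bytes(int fd)
{
	char buf[sizeof(struct fiemap) +
		 MAX_EXTENTS * sizeof(struct fiemap_extent)];
	struct fiemap *fiemap = (struct fiemap *)buf;
	unsigned long long shared = 0;
	unsigned int i;

	memset(buf, 0, sizeof(buf));
	fiemap->fm_length = ~0ULL;		/* map the whole file */
	fiemap->fm_extent_count = MAX_EXTENTS;
	if (ioctl(fd, FS_IOC_FIEMAP, fiemap) < 0)
		return 0;

	for (i = 0; i < fiemap->fm_mapped_extents; i++)
		if (fiemap->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED)
			shared += fiemap->fm_extents[i].fe_length;
	return shared;
}

int main(int argc, char **argv)
{
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	printf("%s: %llu shared bytes\n", argv[1], count_shared_bytes(fd));
	return 0;
}

The printed estimate is then the difference between per-file totals like this, taken before and after the dedupe passes.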

See the duperemove man page for further details about running duperemove.

Requirements

The latest stable code can be found in v0.09-branch.

Kernel: Duperemove needs a kernel version equal to or greater than 3.13.

Libraries: Duperemove uses glib2.

FAQ

Please see the FAQ file provided in the duperemove source.

Usage Examples

Duperemove takes a list of files and directories to scan for dedupe. If a directory is specified, all regular files within it will be scanned. Duperemove can also be told to recursively scan directories with the '-r' switch. If '-h' is provided, duperemove will print numbers in powers of 1024 (e.g., "128K").

Assume this arbitrary layout for the following examples.

.
├── dir1
│   ├── file3
│   ├── file4
│   └── subdir1
│       └── file5
├── file1
└── file2

This will dedupe files 'file1' and 'file2':

duperemove -dh file1 file2

This does the same but adds any files in dir1 (file3 and file4):

duperemove -dh file1 file2 dir1

This will dedupe exactly the same files as above, but will recursively walk dir1, thus also adding file5:

duperemove -dhr file1 file2 dir1/

An actual run (output will differ according to duperemove version):

duperemove -dhr file1 file2 dir1
Using 128K blocks
Using hash: SHA256
Using 2 threads for file hashing phase
csum: file1     [1/5]
csum: file2     [2/5]
csum: dir1/file3       [3/5]
csum: dir1/subdir1/file5       [4/5]
csum: dir1/file4       [5/5]
Hashed 80 blocks, resulting in 17 unique hashes. Calculating duplicate
extents - this may take some time.
[########################################]
Search completed with no errors.
Simple read and compare of file data found 2 instances of extents that might
benefit from deduplication.
Start           Length          Filename (2 extents)
0.0     2.0M    "file2"
0.0     2.0M    "dir1//file4"
Start           Length          Filename (3 extents)
0.0     2.0M    "file1"
0.0     2.0M    "dir1//file3"
0.0     2.0M    "dir1//subdir1/file5"
Dedupe 1 extents with target: (0.0, 2.0M), "file2"
Dedupe 2 extents with target: (0.0, 2.0M), "file1"
Kernel processed data (excludes target files): 6.0M
Comparison of extent info shows a net change in shared extents of: 10.0M
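
To unpack those last two numbers: every matched extent in this run is 2.0M. The two dedupe submissions sent three non-target extents to the kernel (dir1/file4 for the first, dir1/file3 and dir1/subdir1/file5 for the second), giving 6.0M of processed data, and afterwards all five files share their 2.0M extents where none did before, for a net change of 10.0M in shared extents.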