Tools for deduping file systems


This README is for duperemove v0.11.

Duperemove

Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files, it hashes their contents block by block and compares those hashes to each other, finding and categorizing blocks that match. When given the -d option, duperemove submits those extents for deduplication using the Linux kernel extent-same ioctl.
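
The block-by-block comparison described above can be illustrated with a minimal sketch. This is not duperemove's code: the function names are hypothetical, SHA-256 from the Python standard library stands in for the hash duperemove uses, and no deduplication is submitted to the kernel.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # 128K, the block size reported in the sample run


def hash_blocks(path, block_size=BLOCK_SIZE):
    """Yield (offset, digest) for each fixed-size block of the file at `path`."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            # SHA-256 is a stand-in here, chosen only because it ships with Python.
            yield offset, hashlib.sha256(block).hexdigest()
            offset += len(block)


def find_duplicate_blocks(paths, block_size=BLOCK_SIZE):
    """Map each digest seen more than once to its (path, offset) locations."""
    seen = defaultdict(list)
    for path in paths:
        for offset, digest in hash_blocks(path, block_size):
            seen[digest].append((path, offset))
    return {d: locs for d, locs in seen.items() if len(locs) > 1}
```

Calling `find_duplicate_blocks(["file1", "file2"])` returns only the digests that occur at two or more locations, which are the candidates a tool like this would go on to group into extents.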

Duperemove can store the hashes it computes in a 'hashfile'. If given an existing hashfile, duperemove will only compute hashes for those files which have changed since the last run. Thus you can run duperemove repeatedly on your data as it changes, without having to re-checksum unchanged data.
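
To make the incremental behaviour concrete, here is a rough sketch of how a hashfile could let a scanner skip unchanged files. The table layout and function names are hypothetical and do not reflect duperemove's actual hashfile schema. Detecting changes by comparing size and modification time is also an assumption made for this example.

```python
import os
import sqlite3


def open_hashfile(db_path):
    """Open (or create) a hypothetical hashfile; this is not duperemove's schema."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER)"
    )
    return db


def files_needing_rehash(db, paths):
    """Return the paths that are new or whose size/mtime differ from the record."""
    stale = []
    for path in paths:
        st = os.stat(path)
        row = db.execute(
            "SELECT size, mtime_ns FROM files WHERE path = ?", (path,)
        ).fetchone()
        # `row` is None for files never seen before, so they count as stale too.
        if row != (st.st_size, st.st_mtime_ns):
            stale.append(path)
    return stale


def record_scanned(db, path):
    """Store the file's current size and mtime once it has been hashed."""
    st = os.stat(path)
    db.execute(
        "INSERT OR REPLACE INTO files (path, size, mtime_ns) VALUES (?, ?, ?)",
        (path, st.st_size, st.st_mtime_ns),
    )
    db.commit()
```

On a second run over the same data, `files_needing_rehash` returns only the files that changed since `record_scanned` was last called for them.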

Duperemove can also take input from the fdupes program.

See the duperemove man page for further details about running duperemove.

Requirements

The latest stable code (v0.11) can be found in the v0.11 branch on GitHub.

Kernel: Duperemove requires Linux kernel version 3.13 or later.
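
A script that wraps duperemove might want to check this requirement up front. The following is a small sketch under stated assumptions: `kernel_at_least` is a hypothetical helper, and it assumes the release string begins with `major.minor`, as returned by `platform.release()` on typical Linux systems.

```python
import platform
import re


def kernel_at_least(release, minimum=(3, 13)):
    """Return True if a release string such as '4.15.0-91-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    # platform.release() reports the running kernel's release string.
    print(kernel_at_least(platform.release()))
```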

Libraries: Duperemove uses glib2 and sqlite3.

FAQ

Please see the FAQ section in the duperemove man page.

For bug reports and feature requests, please use the GitHub issue tracker.

Examples

Please see the examples section of the duperemove man page for a complete set of usage examples, including hashfile usage.

A simple example, with program output

Duperemove takes a list of files and directories to scan for dedupe. If a directory is specified, all regular files within it will be scanned. Duperemove can also be told to recursively scan directories with the '-r' switch. If '-h' is provided, duperemove will print numbers in powers of 1024 (e.g., "128K").
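
As an illustration of what printing numbers in powers of 1024 means, the sketch below formats byte counts in the style of the extent sizes in the sample run in this README (for example "512.0K" and "1.5M"). `human_size` is a hypothetical helper, not duperemove's formatter. The one-decimal layout is inferred from that sample output only.

```python
def human_size(num_bytes):
    """Format a byte count in powers of 1024 with one decimal place."""
    value = float(num_bytes)
    for suffix in ("", "K", "M", "G", "T"):
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit this sketch knows about.
        if value < 1024 or suffix == "T":
            return f"{value:.1f}{suffix}"
        value /= 1024


print(human_size(512 * 1024))      # 512.0K
print(human_size(1536 * 1024))     # 1.5M
```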

Assume this arbitrary layout for the following examples.

.
├── dir1
│   ├── file3
│   ├── file4
│   └── subdir1
│       └── file5
├── file1
└── file2

This will dedupe files 'file1' and 'file2':

duperemove -dh file1 file2

This does the same but adds any files in dir1 (file3 and file4):

duperemove -dh file1 file2 dir1

This dedupes the same files as above but also walks dir1 recursively, which adds file5:

duperemove -dhr file1 file2 dir1/
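
The difference between the last two commands comes down to how directory arguments are expanded into files. A minimal sketch of that expansion follows. `collect_files` is a hypothetical function written for this README's example layout, not duperemove's file scanner, and it ignores details such as exclude patterns.

```python
import os


def collect_files(args, recursive=False):
    """Expand a list of file and directory arguments into regular files."""
    found = []
    for arg in args:
        if os.path.isfile(arg):
            found.append(arg)
        elif os.path.isdir(arg):
            if recursive:
                # Walk the whole tree under the directory.
                for root, _dirs, names in os.walk(arg):
                    for name in names:
                        path = os.path.join(root, name)
                        if os.path.isfile(path):
                            found.append(path)
            else:
                # Only the regular files directly inside the directory.
                found.extend(e.path for e in os.scandir(arg) if e.is_file())
    return sorted(found)
```

With the layout above, `collect_files(["file1", "file2", "dir1"])` yields four files, and passing `recursive=True` adds `dir1/subdir1/file5`.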

Output from an actual run. It will differ according to duperemove version.

Using 128K blocks
Using hash: murmur3
Using 4 threads for file hashing phase
csum: /btrfs/file1 	[1/5] (20.00%)
csum: /btrfs/file2 	[2/5] (40.00%)
csum: /btrfs/dir1/subdir1/file5 	[3/5] (60.00%)
csum: /btrfs/dir1/file3 	[4/5] (80.00%)
csum: /btrfs/dir1/file4 	[5/5] (100.00%)
Total files:  5
Total hashes: 80
Loading only duplicated hashes from hashfile.
Hashing completed. Calculating duplicate extents - this may take some time.
Simple read and compare of file data found 3 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 512.0K with id 0971ffa6
Start		Filename
512.0K	"/btrfs/file1"
1.5M	"/btrfs/dir1/file4"
Showing 2 identical extents of length 1.0M with id b34ffe8f
Start		Filename
0.0	"/btrfs/dir1/file4"
0.0	"/btrfs/dir1/file3"
Showing 3 identical extents of length 1.5M with id f913dceb
Start		Filename
0.0	"/btrfs/file2"
0.0	"/btrfs/dir1/file3"
0.0	"/btrfs/dir1/subdir1/file5"
Using 4 threads for dedupe phase
[0x147f4a0] Try to dedupe extents with id 0971ffa6
[0x147f770] Try to dedupe extents with id b34ffe8f
[0x147f680] Try to dedupe extents with id f913dceb
[0x147f4a0] Dedupe 1 extents (id: 0971ffa6) with target: (512.0K, 512.0K), "/btrfs/file1"
[0x147f770] Dedupe 1 extents (id: b34ffe8f) with target: (0.0, 1.0M), "/btrfs/dir1/file4"
[0x147f680] Dedupe 2 extents (id: f913dceb) with target: (0.0, 1.5M), "/btrfs/file2"
Kernel processed data (excludes target files): 4.5M
Comparison of extent info shows a net change in shared extents of: 5.5M

Links of interest

The duperemove wiki has both design and performance documentation.

duperemove-tests has a growing assortment of regression tests.

Duperemove web page