Performance Numbers ‐ v0.09

With this page I want to provide some example performance numbers so users have an idea of what to expect when they run duperemove on non-trivial data sets.

The following tests were run on a Dell Precision T3610 workstation with a copy of /home from my workstation rsynced to a fresh btrfs partition. You can find more information about the hardware and software setup here.

The version of duperemove used here is v0.09beta2 plus a few extra bug fixes (no performance improvements) that will be part of v0.09beta3.

There are 1151400 files in the data set (about 760 gigabytes of data), of which duperemove finds 1151142 to be hashable. The average file size works out to about 700K, but in truth it's a very mixed set of general user data (dotfiles, dotfile directories, source code, documents) and media files (ISO images, music, movies, books).
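For reference, the file count and total size quoted above can be gathered with standard tools before a run. A quick sketch (assuming /btrfs/ is the mount point used in these tests):

# Count regular files and report total size of the data set
find /btrfs/ -type f | wc -l
du -sh /btrfs/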

The first two tests measure the performance of the file hashing and extent-finding steps independently of each other. Finally, we do a full combined run with dedupe to get a more realistic test.

File scan / hash performance

weyoun2:~ # time ./duperemove -hr --hash-threads=16 --write-hashes=/root/slash-home-pre-dedupe.dup /btrfs/ &> slash-home-write-hashes.log
real    26m54.741s
user    79m3.896s
sys     5m2.168s

The large user time is partially attributable to the hash function in use here (and to the fact that 16 hash threads are at work). I expect to merge support for alternative hash algorithms in the near future.
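If you are reproducing this on different hardware, a reasonable starting point is to match the hash thread count to the number of CPUs. A sketch using the same flags as above (nproc comes from coreutils; the hash file path is only an example):

# Scale --hash-threads to the CPU count and save the hashes for later reuse
duperemove -hr --hash-threads=$(nproc) --write-hashes=/root/hashes.dup /btrfs/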

Find extents performance

weyoun2:~ # time ./duperemove --read-hashes=/root/slash-home-pre-dedupe.dup &> /dev/null
real    18m2.981s
user    18m1.232s
sys     0m1.676s
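The saved hash file is not only useful for timing the extent-finding step on its own; it should also be able to feed a dedupe pass so the file scan does not have to be repeated. A sketch, assuming -d combines with --read-hashes the same way it does with a normal scan:

# Dedupe straight from the previously written hash file, skipping the file scan
duperemove -dh --read-hashes=/root/slash-home-pre-dedupe.dup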

Full run

We reboot so the run starts with no disk cache present. The numbers up to this point just broke down the first two steps for informational purposes; this run is representative of what a user would actually experience running duperemove against this data set. I saved the output to a file to check for errors.

weyoun2:~ # time ./duperemove -dhr --hash-threads=16 /btrfs/ &> full_run.txt
real    120m37.026s
user    99m28.944s
sys     62m46.664s
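Since the output was captured to a file, checking it for problems afterwards is straightforward. A sketch (the exact strings duperemove prints on failure may vary):

# Look for anything that went wrong during the full run
grep -iE 'error|fail' full_run.txt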

So, on this hardware duperemove took about 2 hours to hash and dedupe 760 gigabytes of data. The dedupe step was the longest at around 1.25 hours, whereas the other two steps together took around 0.75 hours (roughly the 27 minutes of hashing plus 18 minutes of extent finding measured above). Performance optimizations to the dedupe step are planned (see Development Tasks), so hopefully we can get that number down in the future.
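One way to gauge the effect of the dedupe pass is to compare btrfs data usage before and after the run. A sketch (run it both times and compare the Data figures; freed space may take a short while to show up):

# Report per-allocation-type usage on the deduped filesystem
btrfs filesystem df /btrfs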