
Development Tasks


Reasonably developed ideas for Duperemove. If you're interested in taking one of these on, let me know.

If a task has been checkmarked, it is at least feature complete and in master branch.

Small / Medium Tasks

  • Import sha256 module from polarssl

    • It looks like their sha256.c and sha256.h files are pretty nicely self-contained
    • Doing this would allow us to drop the dependency on libgcrypt, which in turn makes it easier for us to choose a new default hash.
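
    A rough sketch of what per-block hashing could look like with the imported module, assuming the PolarSSL 1.3-era streaming API (sha256_init/starts/update/finish/free); the wrapper name and DIGEST_LEN_MAX below are illustrative, not duperemove's actual code:

      #include <stddef.h>
      #include "sha256.h"   /* the imported polarssl sha256.h */

      #define DIGEST_LEN_MAX 32   /* illustrative digest buffer size */

      static void csum_block_sha256(const unsigned char *buf, size_t len,
                                    unsigned char digest[DIGEST_LEN_MAX])
      {
          sha256_context ctx;

          sha256_init(&ctx);
          sha256_starts(&ctx, 0);    /* 0 => SHA-256, 1 => SHA-224 */
          sha256_update(&ctx, buf, len);
          sha256_finish(&ctx, digest);
          sha256_free(&ctx);
      }
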
  • Multi-threaded dedupe stage

    • dedupe_extent_list() is a great candidate for putting on a worker thread. A quick glance shows that only filerec->fd would be written by multiple threads, and this can be easily handled by just letting each thread store the fd locally.
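
    A minimal sketch of the thread-local fd idea, assuming we hand per-file work to plain pthread workers; struct dedupe_work and the commented call are hypothetical stand-ins for the real dedupe_extent_list() plumbing:

      #include <fcntl.h>
      #include <pthread.h>
      #include <unistd.h>

      struct dedupe_work {
          const char *filename;   /* taken from the filerec */
          /* ... the extent list for this file ... */
      };

      /* pthread start routine */
      static void *dedupe_worker(void *arg)
      {
          struct dedupe_work *work = arg;
          /* Open a private fd instead of writing to the shared filerec->fd. */
          int fd = open(work->filename, O_RDONLY);

          if (fd < 0)
              return NULL;

          /* dedupe_extent_list(work, fd);  -- pass the thread-local fd down */

          close(fd);
          return NULL;
      }
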
  • Test/benchmark the following possible enhancements for csum_whole_file()

    • posix_fadvise with POSIX_FADV_SEQUENTIAL
    • readahead(2)
    • mmap (with madvise)
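
    All three candidates are stock Linux/POSIX calls, so the benchmarking harness can be as small as the following (readahead() is Linux-specific and needs _GNU_SOURCE):

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <sys/mman.h>
      #include <unistd.h>

      static void hint_sequential(int fd, off_t filesize)
      {
          /* 1) Advise the kernel that reads will be sequential. */
          posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

          /* 2) Explicitly populate the page cache. */
          readahead(fd, 0, filesize);
      }

      static void *map_for_csum(int fd, size_t filesize)
      {
          /* 3) mmap the file and advise sequential access on the mapping. */
          void *addr = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0);

          if (addr != MAP_FAILED)
              madvise(addr, filesize, MADV_SEQUENTIAL);
          return addr;
      }
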
  • csum_whole_file() still does a read/checksum of holes and unwritten extents (even though we detect and mark them now). If we calculate and store (in memory) the checksum of a zeroed block, we can skip the read and copy our known value directly into the block digest member.
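
    A sketch of the known-zero-digest shortcut; BLOCK_SIZE, DIGEST_LEN_MAX and the commented checksum_block() call are stand-ins for duperemove's actual block size and hashing helpers:

      #include <string.h>

      #define BLOCK_SIZE      (128 * 1024)    /* stand-in for the configured block size */
      #define DIGEST_LEN_MAX  32

      static unsigned char zero_digest[DIGEST_LEN_MAX];

      static void init_zero_digest(void)
      {
          static const unsigned char zero_block[BLOCK_SIZE];  /* all zeroes */

          /* checksum_block(zero_block, BLOCK_SIZE, zero_digest); */
          (void)zero_block;
      }

      /* Called from csum_whole_file() when fiemap marked the block as a
       * hole or unwritten extent - no read required. */
      static void csum_zero_block(unsigned char *block_digest)
      {
          memcpy(block_digest, zero_digest, DIGEST_LEN_MAX);
      }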

  • csum_whole_file() should count the pre-deduped shared bytes for each file while fiemapping for extent flags. Then we won't have to do it in the dedupe stage, reducing the total number of fiemap calls we make.
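
    While walking the fiemap results we already fetch for extent flags, the shared-byte count is just a sum over extents carrying FIEMAP_EXTENT_SHARED; roughly (batching over fm_mapped_extents and error handling omitted):

      #include <linux/fiemap.h>
      #include <stdint.h>

      static uint64_t count_shared_bytes(struct fiemap *fiemap)
      {
          uint64_t shared = 0;
          unsigned int i;

          for (i = 0; i < fiemap->fm_mapped_extents; i++) {
              struct fiemap_extent *ext = &fiemap->fm_extents[i];

              if (ext->fe_flags & FIEMAP_EXTENT_SHARED)
                  shared += ext->fe_length;
          }
          return shared;
      }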

  • (not needed now that we have the bloom filter code) Improve memory usage by freeing non-duplicated hashes after the csum step

    • This is a tiny bit tricky because the extent search assumes file block hashes are logically contiguous - it needs to know that the search can end when the next block isn't contiguous.
  • Add a mode (or new binary called duperemove-fdupes?) where we take a file list from fdupes and do whole-file dedupe on said list.

    • For this we skip most of the file scan and all of the csum stages
    • It does not interact with most options (i.e., we're not writing hashfiles from this)
    • Might require small enhancements to the dedupe core but otherwise it can handle this pretty easily.
    • Related Discussion
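
    Reading the fdupes output is simple - its default format is one path per line with duplicate sets separated by blank lines. A rough sketch (the group storage and the whole-file dedupe call are hypothetical):

      #include <limits.h>
      #include <stdio.h>
      #include <string.h>

      static void read_fdupes_list(FILE *in)
      {
          char path[PATH_MAX + 2];
          int files_in_group = 0;

          while (fgets(path, sizeof(path), in)) {
              path[strcspn(path, "\n")] = '\0';

              if (path[0] == '\0') {
                  /* Blank line ends one duplicate set. */
                  if (files_in_group > 1) {
                      /* dedupe_whole_files(current_group); */
                  }
                  files_in_group = 0;
                  continue;
              }

              /* add_to_current_group(path); */
              files_in_group++;
          }
          /* Don't forget a possible final group with no trailing blank line. */
      }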

Large Tasks

  • Store results of our search to speed up subsequent runs.

    • When writing hashes, store latest btrfs transaction id in the hash file (import find-new from btrfsprogs for this)
    • When reading hashes and we have a transaction id, do a scan of btrfs objects to see which inodes have changed.
      • Changed inodes get re-checksummed
      • Deleted inodes get their hashes removed
      • New inodes get checksummed
    • To tie this all together, we need to add an option (maybe simply called --hash-file?) which acts as a create-or-update mode for hash file usage. Now the user can run duperemove, with one command, on a regular basis and we'll automatically keep the hash database updated.
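
    The overall flow of such a create-or-update run might look like the sketch below; every helper named here is hypothetical, and the changed-inode scan is where the imported find-new style tree search would plug in:

      #include <stdbool.h>
      #include <stdint.h>

      /* Hypothetical helpers - not existing duperemove or btrfs-progs API. */
      bool hashfile_exists(const char *hashfile);
      int full_scan_and_write(const char *hashfile, const char *subvol);
      uint64_t hashfile_read_generation(const char *hashfile);
      uint64_t btrfs_current_generation(const char *subvol);
      void rescan_changed_inodes(const char *subvol, uint64_t since_gen);
      void drop_hashes_for_deleted_inodes(const char *hashfile);
      void hashfile_write_generation(const char *hashfile, uint64_t gen);

      static int hashfile_create_or_update(const char *hashfile, const char *subvol)
      {
          uint64_t last_gen, cur_gen;

          if (!hashfile_exists(hashfile))
              return full_scan_and_write(hashfile, subvol);

          last_gen = hashfile_read_generation(hashfile);
          cur_gen = btrfs_current_generation(subvol);

          /* find-new style tree search: re-checksum inodes changed since
           * last_gen (new inodes fall out of the same search), and drop
           * hashes whose inodes no longer exist. */
          rescan_changed_inodes(subvol, last_gen);
          drop_hashes_for_deleted_inodes(hashfile);

          hashfile_write_generation(hashfile, cur_gen);
          return 0;
      }
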
  • Lookup extent owners during checksum phase

    • We can use this later for a more accurate duplicate extent search.
    • Can also be used to skip reading extents that have already been read. For example, if files A and B share an extent, we can just copy the hashes from the first read instead of checksumming it twice.
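
    One cheap way to get at this during the checksum phase is to key already-computed digests by the physical extent start that fiemap reports (fe_physical); a sketch, with the lookup and copy helpers left hypothetical:

      #include <linux/fiemap.h>
      #include <stdbool.h>
      #include <stdint.h>

      struct csummed_extent {
          uint64_t phys_start;    /* fe_physical of the extent we hashed */
          uint64_t len;           /* fe_length */
          /* block digests for this extent ... */
      };

      /* Returns true if we could reuse hashes instead of re-reading. */
      static bool maybe_copy_hashes(struct fiemap_extent *ext)
      {
          if (!(ext->fe_flags & FIEMAP_EXTENT_SHARED))
              return false;

          /*
           * struct csummed_extent *seen = lookup_by_physical(ext->fe_physical);
           * if (seen) {
           *     copy_block_digests(seen, ext);
           *     return true;
           * }
           */
          return false;
      }
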
  • Multi-threaded extent search.

    • This needs locking and atomic updates to some fields, so it's not quite as straightforward as threading the hash or dedupe stage. A start would be to have find_all_dups queue up appropriate dup_blocks_lists to worker threads. The threads have to properly lock the filerec compared trees and results trees. b_seen also needs to be read/written atomically.
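
    For b_seen specifically, a C11 atomic flag gives a race-free "first thread wins" marker; the struct here is a stand-in, not duperemove's real file_block definition:

      #include <stdatomic.h>
      #include <stdbool.h>

      struct file_block {
          /* ... */
          atomic_flag b_seen;     /* set to ATOMIC_FLAG_INIT when the block is created */
      };

      /* Returns true only for the first thread to claim this block. */
      static bool mark_block_seen(struct file_block *block)
      {
          return !atomic_flag_test_and_set(&block->b_seen);
      }
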
  • ✓ (feature-complete now, thanks to Jack Kenrick) Largely decrease memory footprint by using a bloom filter (thanks to darkling on #btrfs for this idea). Currently we store all hashes in memory in an index (hash tree) and search that index to find duplicates, so our working set in memory is all hashes. Instead, we can use a bloom filter to record when a hash has been seen, allowing us to reduce the working set to only those hashes with dupes (a sketch of the filter follows at the end of this item). From http://en.wikipedia.org/wiki/Bloom_filter:

    A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

    • The general idea is that the bloom filter can tell us (with a certain amount of confidence) whether a hash was seen already. Memory overhead for a bloom filter is tiny, far less than what we would save by being able to discard non-duped blocks:

      fewer than 10 bits per element are required for a 1% false positive probability

    • However, we can't get the exact hash values back from the bloom filter, so we would have to do multiple read/csum passes of the data:

    PASS 1: Read+csum file blocks, for each checksum:

    • add the csum to the bloom filter
    • if the csum already exists in the bloom filter, add the block to our index of duplicate blocks
    • if the csum already exists in the index of duplicated blocks, dedupe it against the block we found there

    PASS 2: Read+hash file blocks, for each hash:

    • if the hash exists in the bloom filter and is not in the index, we dedupe it against what is in the index.

    The 2nd pass is basically re-finding the 1st block of any set of dupes, which was added to the bloom filter but never stored in the index. In theory the user could skip that pass if they didn't care about missing blocks with only one duplicate.

    There are two potential drawbacks:

    1. The 2nd file scan is expensive. We can work around this by streaming our hashes to a temporary file during the 1st pass, and streaming them back from the file during our 2nd pass.

    2. We don't get the chance to make extents for dedupe but instead pass individual blocks, which might result in a less efficient space layout. This can also be worked around, by pushing the dedupe portion of our algorithm to the 2nd pass and updating the extent search code to handle discontiguous regions (which happens to be a TODO item above).
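
    A minimal sketch of the filter itself. Because the block checksums are already uniformly-distributed hashes, the k probe positions can be derived from two 64-bit slices of the digest (double hashing) rather than hashing again; the sizes below are illustrative, not what duperemove actually uses:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      #define BLOOM_BITS      (1ULL << 30)    /* 2^30 bits = 128MiB; real size comes from block count */
      #define BLOOM_HASHES    7               /* ~1% false positives at ~10 bits per element */

      struct bloom {
          unsigned char *bits;
      };

      static int bloom_init(struct bloom *b)
      {
          b->bits = calloc(1, BLOOM_BITS / 8);
          return b->bits ? 0 : -1;
      }

      static uint64_t bloom_pos(const unsigned char *digest, int i)
      {
          uint64_t h1, h2;

          /* Double hashing: position h1 + i*h2, from two digest slices. */
          memcpy(&h1, digest, 8);
          memcpy(&h2, digest + 8, 8);
          return (h1 + (uint64_t)i * h2) % BLOOM_BITS;
      }

      /* Add a digest; returns true if it was (probably) seen before. */
      static bool bloom_add(struct bloom *b, const unsigned char *digest)
      {
          bool seen = true;
          int i;

          for (i = 0; i < BLOOM_HASHES; i++) {
              uint64_t pos = bloom_pos(digest, i);
              unsigned char mask = 1 << (pos % 8);

              if (!(b->bits[pos / 8] & mask)) {
                  seen = false;
                  b->bits[pos / 8] |= mask;
              }
          }
          return seen;
      }

      /*
       * Pass 1: bloom_add() every block digest; when it returns true the
       * block goes into the in-memory duplicate index.
       * Pass 2: re-hash (or stream back) the blocks and look each digest
       * up in that index to catch the first block of every duplicate set.
       */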