Development Tasks


Reasonably developed ideas for Duperemove. If you're interested in taking one of these on, let me know.

Small / Medium Tasks

  • Multi-threaded dedupe stage

    • dedupe_extent_list() is a great candidate for running on a worker thread. A quick glance suggests that only filerec->fd would be written concurrently between threads, and this is easily handled by having each thread store the fd locally (sketched below).
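
A minimal sketch of that model, assuming GLib's thread pools. dedupe_worker() and the one-argument dedupe_extent_list() stub are illustrative, not duperemove's real signatures:

```c
/* Hypothetical sketch - not duperemove's real interfaces. */
#include <glib.h>

struct dupe_extents;			/* opaque stand-in */
extern void dedupe_extent_list(struct dupe_extents *dext);

static void dedupe_worker(gpointer data, gpointer user_data)
{
	/* Workers open files themselves, so filerec->fd is never
	 * written by two threads at once. */
	dedupe_extent_list(data);
}

static int run_dedupe_threaded(GList *dupe_lists, int nr_threads)
{
	GList *l;
	GThreadPool *pool = g_thread_pool_new(dedupe_worker, NULL,
					      nr_threads, FALSE, NULL);

	if (!pool)
		return -1;

	for (l = dupe_lists; l; l = l->next)
		g_thread_pool_push(pool, l->data, NULL);

	/* FALSE: don't drop queued work; TRUE: wait for it all. */
	g_thread_pool_free(pool, FALSE, TRUE);
	return 0;
}
```
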
  • Test/benchmark the following possible enhancements for csum_whole_file() (all three are sketched after this list)

    • posix_fadvise with POSIX_FADV_SEQUENTIAL
    • readahead(2)
    • mmap (with madvise)
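
Sketches of the three candidate read strategies; which of them (if any) actually helps is exactly what the benchmark would need to decide:

```c
#define _GNU_SOURCE			/* readahead(2) is Linux-only */
#include <fcntl.h>
#include <sys/mman.h>

/* 1) Hint that the whole file will be read front to back. */
static void hint_sequential(int fd)
{
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/* 2) Explicitly prefetch the next chunk we intend to checksum. */
static void prefetch_chunk(int fd, off64_t off, size_t len)
{
	readahead(fd, off, len);
}

/* 3) Map the file and let the csum loop walk the mapping
 * instead of calling read(2). */
static unsigned char *map_for_csum(int fd, size_t len)
{
	unsigned char *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE,
				  fd, 0);

	if (buf != MAP_FAILED)
		madvise(buf, len, MADV_SEQUENTIAL);
	return buf;
}
```
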
  • csum_whole_file() still does a read/checksum of holes and unwritten extents (even though we detect and mark them now). If we calculate and store (in memory) the checksum of a zeroed block, we can skip the read and copy our known value directly into the block digest member (sketched below).
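
A minimal sketch of the zero-block shortcut. BLOCKSIZE, DIGEST_LEN and checksum_block() are stand-ins for duperemove's real hashing interface:

```c
#include <string.h>

#define BLOCKSIZE	(128 * 1024)	/* stand-in for the real blocksize */
#define DIGEST_LEN	16		/* stand-in for the real digest size */

extern void checksum_block(const char *buf, int len, unsigned char *digest);

static unsigned char zero_digest[DIGEST_LEN];
static int zero_digest_valid;

/* Called instead of read+checksum when a block is flagged as a
 * hole or unwritten extent. */
static void csum_zero_block(unsigned char *digest)
{
	if (!zero_digest_valid) {
		/* static, so zero-filled by the C runtime */
		static const char zeroes[BLOCKSIZE];

		checksum_block(zeroes, BLOCKSIZE, zero_digest);
		zero_digest_valid = 1;
	}
	memcpy(digest, zero_digest, DIGEST_LEN);
}
```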

  • csum_whole_file() should count the pre-deduped shared bytes for each file while it is already fiemapping for extent flags (sketched below). Then we won't have to do it in the dedupe stage, reducing the total number of fiemap calls we make.
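
A sketch of the tally over the struct fiemap we already fetch; count_shared_bytes() is hypothetical, the flag is the kernel's:

```c
#include <linux/fiemap.h>
#include <stdint.h>

static uint64_t count_shared_bytes(struct fiemap *fiemap)
{
	uint64_t shared = 0;
	unsigned int i;

	for (i = 0; i < fiemap->fm_mapped_extents; i++) {
		struct fiemap_extent *ext = &fiemap->fm_extents[i];

		/* Bytes the kernel already reports as shared - the
		 * pre-deduped bytes we currently recount later. */
		if (ext->fe_flags & FIEMAP_EXTENT_SHARED)
			shared += ext->fe_length;
	}
	return shared;
}
```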

  • Improve memory usage by freeing non-duplicated hashes after the csum step

    • This is a tiny bit tricky because the extent search assumes file block hashes are logically contiguous - it needs to be told when the next block is not contiguous so that the search can end there. One way to mark that is sketched below.
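
One possible marker, with illustrative field names: flag the predecessor of a freed hash so the search stops extending matches there.

```c
#include <stdint.h>

#define FILE_BLOCK_END_OF_RUN	0x01	/* next logical block was freed */

/* Illustrative stand-in for the block hash record. */
struct file_block {
	uint64_t	b_loff;		/* logical offset in the file */
	unsigned int	b_flags;
};

/* Call on the predecessor before freeing a non-duplicated hash. */
static void mark_end_of_run(struct file_block *prev)
{
	prev->b_flags |= FILE_BLOCK_END_OF_RUN;
}

/* The extent search stops extending a match here instead of
 * assuming the next hash in the list is logically contiguous. */
static int can_extend_past(struct file_block *block)
{
	return !(block->b_flags & FILE_BLOCK_END_OF_RUN);
}
```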

Large Tasks

  • Store results of our search to speed up subsequent runs.

    • When writing hashes, store the latest btrfs transaction id in the hash file (import find-new from btrfs-progs for this)
    • When reading hashes and we have a stored transaction id, do a scan of btrfs objects to see which inodes have changed (a find-new style scan is sketched after this list).
      • Changed inodes get re-checksummed
      • Deleted inodes get their hashes removed
      • New inodes get checksummed
    • To tie this all together, we need to add an option (maybe simply called --hash-file?) which acts as a create-or-update mode for hash file usage. The user can then run duperemove with one command on a regular basis and we'll automatically keep the hash database updated.
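
A hedged sketch of the find-new style scan, assuming the btrfs-progs ioctl headers. tree_id = 0 searches the subvolume the fd belongs to, and real code would loop until the search key range is exhausted:

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <btrfs/ioctl.h>

/* Return the number of tree items changed since min_transid -
 * candidates for re-checksumming. Deleted and brand-new inodes
 * fall out of diffing this against the hash file contents. */
static int scan_changed_inodes(int fd, uint64_t min_transid)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_key *sk = &args.key;

	memset(&args, 0, sizeof(args));
	sk->tree_id = 0;		/* subvolume tree of fd */
	sk->min_transid = min_transid;	/* only items changed since then */
	sk->max_transid = (uint64_t)-1;
	sk->max_objectid = (uint64_t)-1;
	sk->max_offset = (uint64_t)-1;
	sk->max_type = (uint32_t)-1;
	sk->nr_items = 4096;

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0)
		return -1;

	/* Real code would walk args.buf, advance the min_* fields
	 * past the last key returned, and repeat until nr_items
	 * comes back zero. */
	return sk->nr_items;	/* kernel rewrites this with the count */
}
```
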
  • Look up extent owners during the checksum phase

    • We can use this later for a more accurate duplicate extent search.
    • Can also be used to skip reading extents that have already been read. For example, if files A and B share an extent, we can just copy the hashes from the first read instead of checksumming the same data twice (an owner lookup is sketched below).
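
A hedged sketch of the owner lookup, again assuming the btrfs-progs headers. On btrfs, fiemap's fe_physical is the filesystem's logical extent address, which is the value BTRFS_IOC_LOGICAL_INO expects:

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <btrfs/ioctl.h>

/* Fill 'owners' with (inum, offset, root) triples for every file
 * referencing the extent at 'logical'. Any fd on the fs will do. */
static int lookup_extent_owners(int fd, uint64_t logical,
				struct btrfs_data_container *owners,
				uint64_t owners_size)
{
	struct btrfs_ioctl_logical_ino_args args;

	memset(&args, 0, sizeof(args));
	args.logical = logical;
	args.size = owners_size;
	args.inodes = (uintptr_t)owners;

	return ioctl(fd, BTRFS_IOC_LOGICAL_INO, &args);
}
```

With the owners known, a shared extent could be checksummed once and its digests copied to every referencing file.
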
  • Multi-threaded extent search.

    • This needs locking and atomic updates to some fields, so it's not quite as straightforward as threading the hash or dedupe stages. A start would be to have find_all_dups queue up appropriate dup_blocks_lists to worker threads. The threads would have to properly lock the filerec compared trees and results trees, and b_seen would need to be read and written atomically (sketched below).
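
A sketch of the synchronization that stage would need, assuming C11 atomics for b_seen and a per-filerec mutex for the compared and results trees; all names besides the locking primitives are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>

struct file_block {
	atomic_flag	b_seen;		/* raced on by every worker */
	/* ... */
};

struct filerec {
	pthread_mutex_t	tree_lock;	/* guards compared + results trees */
	/* ... */
};

/* Returns 1 exactly once per block no matter how many worker
 * threads race on it; losers see 0 and skip the block. */
static int claim_block(struct file_block *block)
{
	return !atomic_flag_test_and_set(&block->b_seen);
}

/* Tree updates happen under the owning filerec's lock. */
static void update_results_locked(struct filerec *file,
				  void (*update)(struct filerec *))
{
	pthread_mutex_lock(&file->tree_lock);
	update(file);
	pthread_mutex_unlock(&file->tree_lock);
}
```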