Development Tasks


Reasonably developed ideas for Duperemove. If you're interested in taking one of these on, let me know.

Small / Medium Tasks

  • Multi-threaded dedupe stage

    • dedupe_extent_list() is a great candidate for running on a worker thread. A quick glance suggests that only filerec->fd would be written concurrently between threads, and this is easily handled by having each thread store the fd locally (sketched below).
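
A minimal sketch of that model, assuming GLib's thread pools. dedupe_worker() and the one-argument dedupe_extent_list() stub are illustrative, not duperemove's real signatures:

```c
/* Hypothetical sketch - not duperemove's real interfaces. */
#include <glib.h>

struct dupe_extents;			/* opaque stand-in */
extern void dedupe_extent_list(struct dupe_extents *dext);

static void dedupe_worker(gpointer data, gpointer user_data)
{
	/* Workers open files themselves, so filerec->fd is never
	 * written by two threads at once. */
	dedupe_extent_list(data);
}

static int run_dedupe_threaded(GList *dupe_lists, int nr_threads)
{
	GList *l;
	GThreadPool *pool = g_thread_pool_new(dedupe_worker, NULL,
					      nr_threads, FALSE, NULL);

	if (!pool)
		return -1;

	for (l = dupe_lists; l; l = l->next)
		g_thread_pool_push(pool, l->data, NULL);

	/* FALSE: don't drop queued work; TRUE: wait for it all. */
	g_thread_pool_free(pool, FALSE, TRUE);
	return 0;
}
```
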
  • Test/benchmark the following possible enhancements for csum_whole_file() (all three are sketched after this list)

    • posix_fadvise with POSIX_FADV_SEQUENTIAL
    • readahead(2)
    • mmap (with madvise)
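
Sketches of the three candidate read strategies; which of them (if any) actually helps is exactly what the benchmark would need to decide:

```c
#define _GNU_SOURCE			/* readahead(2) is Linux-only */
#include <fcntl.h>
#include <sys/mman.h>

/* 1) Hint that the whole file will be read front to back. */
static void hint_sequential(int fd)
{
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/* 2) Explicitly prefetch the next chunk we intend to checksum. */
static void prefetch_chunk(int fd, off64_t off, size_t len)
{
	readahead(fd, off, len);
}

/* 3) Map the file and let the csum loop walk the mapping
 * instead of calling read(2). */
static unsigned char *map_for_csum(int fd, size_t len)
{
	unsigned char *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE,
				  fd, 0);

	if (buf != MAP_FAILED)
		madvise(buf, len, MADV_SEQUENTIAL);
	return buf;
}
```
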
  • csum_whole_file() still does a read/checksum of holes and unwritten extents (even though we detect and mark them now). If we calculate and store (in memory) the checksum of a zeroed block, we can skip the read and copy our known value directly into the block digest member (sketched below).
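
A minimal sketch of the zero-block shortcut. BLOCKSIZE, DIGEST_LEN and checksum_block() are stand-ins for duperemove's real hashing interface:

```c
#include <string.h>

#define BLOCKSIZE	(128 * 1024)	/* stand-in for the real blocksize */
#define DIGEST_LEN	16		/* stand-in for the real digest size */

extern void checksum_block(const char *buf, int len, unsigned char *digest);

static unsigned char zero_digest[DIGEST_LEN];
static int zero_digest_valid;

/* Called instead of read+checksum when a block is flagged as a
 * hole or unwritten extent. */
static void csum_zero_block(unsigned char *digest)
{
	if (!zero_digest_valid) {
		/* static, so zero-filled by the C runtime */
		static const char zeroes[BLOCKSIZE];

		checksum_block(zeroes, BLOCKSIZE, zero_digest);
		zero_digest_valid = 1;
	}
	memcpy(digest, zero_digest, DIGEST_LEN);
}
```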

  • csum_whole_file() should count the pre-deduped shared bytes for each file while it is already fiemapping for extent flags (sketched below). Then we won't have to do it in the dedupe stage, reducing the total number of fiemap calls we make.
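
A sketch of the tally over the struct fiemap we already fetch; count_shared_bytes() is hypothetical, the flag is the kernel's:

```c
#include <linux/fiemap.h>
#include <stdint.h>

static uint64_t count_shared_bytes(struct fiemap *fiemap)
{
	uint64_t shared = 0;
	unsigned int i;

	for (i = 0; i < fiemap->fm_mapped_extents; i++) {
		struct fiemap_extent *ext = &fiemap->fm_extents[i];

		/* Bytes the kernel already reports as shared - the
		 * pre-deduped bytes we currently recount later. */
		if (ext->fe_flags & FIEMAP_EXTENT_SHARED)
			shared += ext->fe_length;
	}
	return shared;
}
```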

  • Improve memory usage by freeing non-duplicated hashes after the csum step

    • This is a tiny bit tricky because the extent search assumes file block hashes are logically contiguous - it needs to be told when the next block is not contiguous so that the search can end there. One way to mark that is sketched below.
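
One possible marker, with illustrative field names: flag the predecessor of a freed hash so the search stops extending matches there.

```c
#include <stdint.h>

#define FILE_BLOCK_END_OF_RUN	0x01	/* next logical block was freed */

/* Illustrative stand-in for the block hash record. */
struct file_block {
	uint64_t	b_loff;		/* logical offset in the file */
	unsigned int	b_flags;
};

/* Call on the predecessor before freeing a non-duplicated hash. */
static void mark_end_of_run(struct file_block *prev)
{
	prev->b_flags |= FILE_BLOCK_END_OF_RUN;
}

/* The extent search stops extending a match here instead of
 * assuming the next hash in the list is logically contiguous. */
static int can_extend_past(struct file_block *block)
{
	return !(block->b_flags & FILE_BLOCK_END_OF_RUN);
}
```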

Large Tasks

  • Store results of our search to speed up subsequent runs.

    • When writing hashes, store the latest btrfs transaction id in the hash file (import find-new from btrfs-progs for this)
    • When reading hashes and we have a stored transaction id, do a scan of btrfs objects to see which inodes have changed (a find-new style scan is sketched after this list).
      • Changed inodes get re-checksummed
      • Deleted inodes get their hashes removed
      • New inodes get checksummed
    • To tie this all together, we need to add an option (maybe simply called --hash-file?) which acts as a create-or-update mode for hash file usage. The user can then run duperemove with one command on a regular basis and we'll automatically keep the hash database updated.
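
A hedged sketch of the find-new style scan, assuming the btrfs-progs ioctl headers. tree_id = 0 searches the subvolume the fd belongs to, and real code would loop until the search key range is exhausted:

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <btrfs/ioctl.h>

/* Return the number of tree items changed since min_transid -
 * candidates for re-checksumming. Deleted and brand-new inodes
 * fall out of diffing this against the hash file contents. */
static int scan_changed_inodes(int fd, uint64_t min_transid)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_key *sk = &args.key;

	memset(&args, 0, sizeof(args));
	sk->tree_id = 0;		/* subvolume tree of fd */
	sk->min_transid = min_transid;	/* only items changed since then */
	sk->max_transid = (uint64_t)-1;
	sk->max_objectid = (uint64_t)-1;
	sk->max_offset = (uint64_t)-1;
	sk->max_type = (uint32_t)-1;
	sk->nr_items = 4096;

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0)
		return -1;

	/* Real code would walk args.buf, advance the min_* fields
	 * past the last key returned, and repeat until nr_items
	 * comes back zero. */
	return sk->nr_items;	/* kernel rewrites this with the count */
}
```
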
  • Look up extent owners during the checksum phase

    • We can use this later for a more accurate duplicate extent search.
    • Can also be used to skip reading extents that have already been read. For example, if files A and B share an extent, we can just copy the hashes from the first read instead of checksumming the same data twice (an owner lookup is sketched below).
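
A hedged sketch of the owner lookup, again assuming the btrfs-progs headers. On btrfs, fiemap's fe_physical is the filesystem's logical extent address, which is the value BTRFS_IOC_LOGICAL_INO expects:

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <btrfs/ioctl.h>

/* Fill 'owners' with (inum, offset, root) triples for every file
 * referencing the extent at 'logical'. Any fd on the fs will do. */
static int lookup_extent_owners(int fd, uint64_t logical,
				struct btrfs_data_container *owners,
				uint64_t owners_size)
{
	struct btrfs_ioctl_logical_ino_args args;

	memset(&args, 0, sizeof(args));
	args.logical = logical;
	args.size = owners_size;
	args.inodes = (uintptr_t)owners;

	return ioctl(fd, BTRFS_IOC_LOGICAL_INO, &args);
}
```

With the owners known, a shared extent could be checksummed once and its digests copied to every referencing file.
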
  • Multi-threaded extent search.

    • This needs locking and atomic updates to some fields, so it's not quite as straightforward as threading the hash or dedupe stages. A start would be to have find_all_dups queue up appropriate dup_blocks_lists to worker threads. The threads would have to properly lock the filerec compared trees and results trees, and b_seen would need to be read and written atomically (sketched below).
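
A sketch of the synchronization that stage would need, assuming C11 atomics for b_seen and a per-filerec mutex for the compared and results trees; all names besides the locking primitives are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>

struct file_block {
	atomic_flag	b_seen;		/* raced on by every worker */
	/* ... */
};

struct filerec {
	pthread_mutex_t	tree_lock;	/* guards compared + results trees */
	/* ... */
};

/* Returns 1 exactly once per block no matter how many worker
 * threads race on it; losers see 0 and skip the block. */
static int claim_block(struct file_block *block)
{
	return !atomic_flag_test_and_set(&block->b_seen);
}

/* Tree updates happen under the owning filerec's lock. */
static void update_results_locked(struct filerec *file,
				  void (*update)(struct filerec *))
{
	pthread_mutex_lock(&file->tree_lock);
	update(file);
	pthread_mutex_unlock(&file->tree_lock);
}
```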