phiresky/dupegone

small and fast duplicate file finder

Walks a directory, adding all files into a sorted map keyed by size. Then, for every set of same-sized files, hashes the first 64 KiB of each file, and only if those hashes match, hashes the whole file.
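The size-bucket / partial-hash / full-hash cascade described above can be sketched roughly like this. This is my reconstruction, not the actual source: it uses std::filesystem and std::hash for self-containment, where the real tool uses boost::filesystem and OpenSSL SHA-1.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hash the first `limit` bytes of a file; limit == 0 means the whole file.
// std::hash stands in for SHA-1 here to keep the sketch dependency-free.
static std::size_t hash_prefix(const fs::path& p, std::size_t limit) {
    std::ifstream in(p, std::ios::binary);
    std::string buf;
    if (limit) {
        buf.resize(limit);
        in.read(buf.data(), static_cast<std::streamsize>(limit));
        buf.resize(static_cast<std::size_t>(in.gcount()));
    } else {
        buf.assign(std::istreambuf_iterator<char>(in), {});
    }
    return std::hash<std::string>{}(buf);
}

std::vector<std::vector<fs::path>> find_dupes(const fs::path& root) {
    // 1. walk the tree, bucketing files by size in a sorted map
    std::map<std::uintmax_t, std::vector<fs::path>> by_size;
    for (const auto& e : fs::recursive_directory_iterator(root))
        if (e.is_regular_file()) by_size[e.file_size()].push_back(e.path());

    std::vector<std::vector<fs::path>> groups;
    for (auto& [size, paths] : by_size) {
        if (paths.size() < 2) continue;  // unique size: no duplicate possible
        // 2. within a size bucket, compare hashes of the first 64 KiB
        std::map<std::size_t, std::vector<fs::path>> by_head;
        for (const auto& p : paths) by_head[hash_prefix(p, 64 * 1024)].push_back(p);
        for (auto& [head, cand] : by_head) {
            if (cand.size() < 2) continue;  // unique prefix: not a duplicate
            // 3. only now pay for hashing the whole file
            std::map<std::size_t, std::vector<fs::path>> by_full;
            for (const auto& p : cand) by_full[hash_prefix(p, 0)].push_back(p);
            for (auto& [full, g] : by_full)
                if (g.size() > 1) groups.push_back(g);
        }
    }
    return groups;
}
```

The point of the cascade is that most files are eliminated by the size comparison alone, and most of the rest by reading only their first 64 KiB, so whole-file reads happen only for likely duplicates.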

On a sample of 24,000 files (16 GB, mostly PDF):

  • fdupes: 14:03 min
  • this: 9:21 min
  • rmlint: 7:38 min

I was planning to make the fastest duplicate file finder, but then I found rmlint, which seems to do pretty much everything possible.

Interestingly enough, simply sha1sum-ing every complete file in the order find finds them is even faster than using any of these tools. Well fuck me for trying to use a nice algorithm when in reality the only thing taking any time is the hard drive skipping around between files:

find "$1" -type f -exec sha1sum {} + | sort | uniq -w 32 --all-repeated=separate

(by /u/Rangi42) takes 4:41 min. sort -R randomizes file order, which brings back the seeking: the same pipeline with random file order takes 3 min vs 20 seconds.
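The randomized-order comparison presumably looked something like the following (my reconstruction; the original only mentions sort -R). Shuffling the file list before hashing forces the disk to seek per file instead of reading in on-disk order, which is what makes it so much slower:

```shell
find "$1" -type f -print0 | sort -zR | xargs -0 sha1sum | sort | uniq -w 32 --all-repeated=separate
```

-print0/-z/-0 keep paths with spaces intact through the shuffle; the trailing sort | uniq regroups the now-unordered hashes so duplicates still end up adjacent.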

Made mostly as an exercise in C++ and algorithms/data structures.

Dependencies

  • OpenSSL for SHA-1
  • Boost for filesystem access
