Walks a directory, adding all files into a map sorted by file size. Then, for every set of same-sized files, hashes the first 64 KB of each file, and only if those hashes match, hashes the whole files.
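Roughly, the core loop looks something like this. This is a minimal sketch of the three-stage idea, not the actual source; it assumes C++17, Boost.Filesystem, and OpenSSL's `SHA1_*` API (matching the dependencies listed at the bottom):

```cpp
// Sketch of the three-stage filter: size map -> 64 KB prefix hash -> full hash.
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <map>
#include <string>
#include <vector>
#include <boost/filesystem.hpp>
#include <openssl/sha.h>

namespace fs = boost::filesystem;

// SHA-1 of at most `limit` bytes of a file; limit == 0 hashes everything.
static std::string sha1_of(const fs::path& p, std::size_t limit) {
    SHA_CTX ctx;
    SHA1_Init(&ctx);
    std::FILE* f = std::fopen(p.string().c_str(), "rb");
    if (!f) return {};                  // unreadable files bucket together; fine for a sketch
    unsigned char buf[8192];            // 8 KB divides 64 KB evenly
    std::size_t total = 0, n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        SHA1_Update(&ctx, buf, n);
        total += n;
        if (limit && total >= limit) break;
    }
    std::fclose(f);
    unsigned char md[SHA_DIGEST_LENGTH];
    SHA1_Final(md, &ctx);
    return std::string(reinterpret_cast<char*>(md), SHA_DIGEST_LENGTH);
}

int main(int argc, char** argv) {
    if (argc != 2) return 1;

    // Stage 1: walk the tree, bucket every regular file by size.
    std::map<std::uintmax_t, std::vector<fs::path>> by_size;
    for (fs::recursive_directory_iterator it(argv[1]), end; it != end; ++it)
        if (fs::is_regular_file(it->status()))
            by_size[fs::file_size(it->path())].push_back(it->path());

    for (auto& [size, paths] : by_size) {
        if (paths.size() < 2) continue;     // unique size: cannot be a duplicate
        // Stage 2: within a same-size set, hash only the first 64 KB.
        std::map<std::string, std::vector<fs::path>> by_head;
        for (auto& p : paths)
            by_head[sha1_of(p, 64 * 1024)].push_back(p);
        for (auto& [head, cand] : by_head) {
            if (cand.size() < 2) continue;
            // Stage 3: prefixes matched, so now hash the whole files.
            std::map<std::string, std::vector<fs::path>> by_full;
            for (auto& p : cand)
                by_full[sha1_of(p, 0)].push_back(p);
            for (auto& [h, dups] : by_full)
                if (dups.size() > 1) {
                    for (auto& p : dups)
                        std::cout << p.string() << '\n';
                    std::cout << '\n';      // blank line between groups
                }
        }
    }
}
```

Compiling would look something like `g++ -std=c++17 -O2 dedup.cpp -lboost_filesystem -lboost_system -lcrypto` (exact flags depend on how Boost and OpenSSL are installed).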
On a sample of 24,000 files (16 GB, mostly PDF):

- fdupes: 14:03 min
- this: 9:21 min
- rmlint: 7:38 min
I was planning to make the fastest duplicate file finder, but then I found rmlint, which seems to do pretty much everything possible.
Interestingly enough, simply sha1sum-ing every complete file in the order `find` finds them is even faster than any of these tools. Well, fuck me for trying to use a nice algorithm when in reality the only thing taking any time is the hard drive seeking around between files:
find "$1" -type f -exec sha1sum {} + | sort | uniq -w 32 --all-repeated=separate
This one-liner (by /u/Rangi42) takes 4:41 min.
`sort -R` randomizes the file order: the same operation takes 3 min with the files in random order versus 20 seconds in the order `find` returns them.
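The randomized run was presumably something along these lines. This is my reconstruction, not the exact command; the `-print0`/`-z`/`-0` null-termination is my addition so paths with spaces survive the shuffle:

```sh
# shuffle the file list before hashing, forcing the disk to seek between files
find "$1" -type f -print0 | sort -zR | xargs -0 sha1sum | sort | uniq -w 40 --all-repeated=separate
```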
Made mostly as an exercise in C++ and algorithms/data structures, using:
- OpenSSL for SHA-1
- Boost for filesystem access