dedup whole files #43
No, but this is something that will happen implicitly anyway as duperemove discovers that the majority of a file is the same.
There are two issues still worth thinking about
There is another project, bedup, but it's unmaintained and currently broken for new installs.
Yes, though if it's a big deal you can mitigate this somewhat by using a smaller blocksize.
Ok, so duperemove is a lot more concerned with individual block dedupe than you want. Internally we're classifying duplicates on a block-by-block basis; changing to whole files (optionally) is a possibility, but it goes against the grain of everything else we're doing in duperemove, so it could get a bit messy adding it in there. Well, we could probably hack it by saying "only dedupe extents which cover the entire file". A couple of questions:
My answers to those questions, although I am not the original requester:
I've not checked how often duplicate blocks are found in identical vs. non-identical files, but a cursory glance at the output makes me think most of them (for me) are in the identical ones. Originally-identical files which have since diverged could well benefit from the incremental approach, and I guess that was one of the motivations behind it? Though unfortunately only approximations of large blocks are going to end up matched, due to block alignment and such. I tried running with a 4k block size but had to stop before hashing completed, due to being at 14/16 GB of memory. After running across my system at default values, about 600MB of RAM was eaten due to the bug I've seen mentioned.
So if we want speed we can't hack this into the current extent search / checksum stage - that code is intentionally chopping up the checksums into blocks. Taking our input from fdupes is doable, though it would be a special mode where we skip the file scan and checksum stages and proceed directly to dedupe. The downside would be that most other features of duperemove aren't available in this mode (but that doesn't seem to be a big deal for your use case). I should add, the one thing I want to avoid is having duperemove do yet another kind of file checksum.
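An fdupes-input mode like the one described mainly needs to parse fdupes' output, which lists groups of identical file paths separated by blank lines. A minimal sketch of that parsing step in Python (illustrative only, not duperemove's actual code):

```python
def read_fdupes_groups(stream):
    """Parse fdupes-style output: groups of identical file paths,
    one path per line, with groups separated by blank lines.
    Yields each group as a list of paths."""
    group = []
    for line in stream:
        line = line.rstrip("\n")
        if line:
            group.append(line)
        elif group:
            yield group
            group = []
    if group:  # final group may not be followed by a blank line
        yield group
```

Each yielded group could then be handed directly to the dedupe stage, skipping the file scan and checksum stages entirely, as described above.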
Perhaps it would be possible to build a separate binary called "duperemovefiles" which takes a list of known-to-be-identical files? Hopefully this could share some deduping code without messing with what "duperemove" does, which, as you say, is concern itself with blocks. This would keep the checksumming completely separate, or it could be dropped entirely from "duperemovefiles".
Either way (different binary, or special mode of duperemove) it's the same amount of work. Taking a file list from fdupes for whole-file dedupe seems useful; I've added it to the list of development tasks: https://github.com/markfasheh/duperemove/wiki/Development-Tasks Let me know if you agree/disagree with that writeup; otherwise I'll close this issue for now.
Looks good to me |
FYI we now have an --fdupes option which will take output from fdupes and run whole-file dedupe on the results. Please give it a shot (example: fdupes . | duperemove --fdupes) and file any bugs you might find. Thanks again for the suggestion! |
Sorry for not getting back to you sooner; just wanted to say thanks so much for implementing this. I'm trying it now.
Thanks for adding this. It gives me a way to dedupe large quantities of small files safely without having to set the block size awfully low.
Is there any option to skip files that are already deduped? Running "fdupes . | duperemove --fdupes" twice, the second run also spends a lot of time trying to dedupe files that were already deduped in the previous run.
The readme mentions that hashing is done per 128KB block by default; is there any way to force hashing and dedup of whole files only?
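For context on the block-based behavior the readme describes, per-block hashing can be sketched roughly like this; the 128KiB default matches the readme, but the hash function and data layout here are purely illustrative, not duperemove's actual implementation:

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # duperemove's documented default block size

def block_hashes(path, block_size=BLOCK_SIZE):
    """Hash a file block by aligned block. Two files share a
    duplicate block only when an entire aligned block matches.
    Illustrative sketch only, not duperemove's actual hash."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes
```

This also illustrates the alignment limitation mentioned earlier in the thread: identical data shifted by less than a block produces different aligned blocks, so it goes unmatched.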