dedup whole files #43

Closed
pwr22 opened this issue Jan 31, 2015 · 13 comments

@pwr22 commented Jan 31, 2015

The README mentions that hashing is done per 128KB block by default. Is there any way to force hashing and dedup of whole files only?

@markfasheh (Owner)

No, but this is something that will happen implicitly anyway as duperemove discovers that the majority of a file is the same.

@pwr22 (Author) commented Feb 2, 2015

There are still two issues worth thinking about:

  • What if the last chunk of the file is smaller than the block size? Will this always lead to the final chunk being stored uniquely each time and unnecessary fragmentation?
  • It is faster to do the hashing with the largest allowed blocksize (1M), and the dedupe processing after this is much, much faster too. I only ever want to dedupe entire files, and I think this would be even faster, so for people like me maybe some kind of --whole-files option could be extremely useful.

There is another project, bedup, but it's unmaintained and currently broken for new installs.

@markfasheh (Owner)

  • What if the last chunk of the file is smaller than the block size? Will this always lead to the final chunk being stored uniquely each time and unnecessary fragmentation?

Yes, though if it's a big deal you can mitigate this somewhat by using a smaller blocksize.

  • I only ever want to dedupe entire files, and I think this would be even faster, so for people like me maybe some kind of --whole-files option could be extremely useful.

Ok, so duperemove is a lot more concerned with individual block dedupe than you want. Internally we're classifying duplicates on a block-by-block basis; optionally switching to whole files is a possibility, but it goes against the grain of everything else we're doing in duperemove, so it could get a bit messy adding it in there. Well, we could probably hack it by saying "only dedupe extents which cover the entire file".
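
To make the "only dedupe extents which cover the entire file" idea concrete, here is a minimal sketch of such a filter. The struct and function are invented for illustration and do not reflect duperemove's real internals:

```c
/*
 * Hypothetical filter in the spirit of "only dedupe extents which cover
 * the entire file". The types here are made up for this sketch and are
 * not duperemove's actual data structures.
 */
#include <stdbool.h>
#include <stdint.h>

struct candidate_extent {
	uint64_t file_offset;	/* offset of the duplicate region in the file */
	uint64_t length;	/* length of the duplicate region, in bytes */
	uint64_t file_size;	/* total size of the file the region belongs to */
};

/* Accept a candidate only if the duplicate region spans the whole file. */
static bool covers_whole_file(const struct candidate_extent *e)
{
	return e->file_offset == 0 && e->length == e->file_size;
}
```

A filter like this would sit after the existing extent search, so it would not make the hashing itself any cheaper; that limitation is what leads to the fdupes-based mode discussed further down.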

A couple questions:

  • Why is it that you require whole-file dedupe only?
  • Do you need the software to discover the duplicated files, or is that information you would already have?

@markfasheh reopened this Feb 3, 2015
@dioni21 commented Feb 3, 2015

My answers to those questions, although I am not the original requester:

  1. Speed?
  2. Most distros have fdupes: http://en.wikipedia.org/wiki/Fdupes - maybe an option to receive input from fdupes and call ioctls to dedup whole files? (See the sketch below.)
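
For illustration only, here is a rough sketch of what "call ioctls to dedup whole files" could look like for two files already known to be identical, using the btrfs same-extent ioctl (BTRFS_IOC_FILE_EXTENT_SAME from linux/btrfs.h). This is not duperemove's code; the chunk size, open modes and error handling are simplified assumptions.

```c
/*
 * Rough sketch: whole-file dedupe of two files already known to be
 * identical, via the btrfs same-extent ioctl. Not duperemove's actual
 * implementation; error handling is minimal.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* The kernel caps a single dedupe request, historically at 16 MiB. */
#define DEDUPE_CHUNK (16ULL * 1024 * 1024)

int dedupe_whole_file(const char *src_path, const char *dst_path)
{
	/* Some kernels require the dedupe target to be opened for writing. */
	int src = open(src_path, O_RDONLY);
	int dst = open(dst_path, O_RDWR);
	struct stat st;
	struct btrfs_ioctl_same_args *args = NULL;
	struct btrfs_ioctl_same_extent_info *info;
	uint64_t off = 0;
	int ret = -1;

	if (src < 0 || dst < 0 || fstat(src, &st) < 0)
		goto out;

	args = calloc(1, sizeof(*args) + sizeof(*info));
	if (!args)
		goto out;
	info = &args->info[0];

	while (off < (uint64_t)st.st_size) {
		uint64_t len = (uint64_t)st.st_size - off;
		if (len > DEDUPE_CHUNK)
			len = DEDUPE_CHUNK;

		args->logical_offset = off;	/* offset within the source */
		args->length = len;
		args->dest_count = 1;
		info->fd = dst;			/* destination file descriptor */
		info->logical_offset = off;	/* same offset in the destination */
		info->status = 0;

		/*
		 * The kernel compares the two ranges itself and only shares
		 * extents whose contents really match, so this stays safe
		 * even if the "identical" list is wrong.
		 */
		if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0 ||
		    info->status != 0) {
			fprintf(stderr, "dedupe failed at offset %llu (status %d)\n",
				(unsigned long long)off, info->status);
			goto out;
		}
		off += len;
	}
	ret = 0;
out:
	free(args);
	if (src >= 0)
		close(src);
	if (dst >= 0)
		close(dst);
	return ret;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}
	return dedupe_whole_file(argv[1], argv[2]) ? 1 : 0;
}
```

fdupes' default output (groups of identical paths separated by blank lines) maps onto this naturally: take the first file of each group as the source and dedupe every other member against it.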

@pwr22 (Author) commented Feb 3, 2015

  1. Speed is a major concern, as is the memory used to store the results of hashing large amounts of data, and my particular use case often has many duplicate files being written. I'm running LXC containers on top of btrfs.
  2. I was going to say the same thing as @dioni21.

I've not checked how often duplicate blocks are found in identical vs non-identical files, but a cursory glance at the output makes me think most of them (for me) are in the identical ones.

Originally-identical files which have since diverged could well benefit from the incremental approach, and I guess that was one of the motivations behind it? Though unfortunately, due to block alignment and so on, only approximations of large duplicate regions are going to end up matched.

I tried running with a 4k block size but had to stop before hashing completed because I was at 14/16 GB of memory. After running across my system at the default values, about 600MB of RAM was eaten due to the bug I've seen mentioned.

@markfasheh (Owner)

So if we want speed we can't hack this into the current extent search / checksum stage - that code is intentionally chopping up the checksums into blocks.

Taking our input from fdupes is doable, though it would be a special mode where we skip the file scan and checksum stages and proceed directly to dedupe. The downside would be that most other features of duperemove aren't available in this mode (but that doesn't seem to be a big deal for your use case).
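
As a sketch of what reading fdupes output in such a mode could involve (again an illustration, not duperemove's code): fdupes prints groups of identical paths separated by blank lines, so the mode only has to pair each file in a group with the group's first file and hand those pairs to the dedupe step.

```c
/*
 * Dry-run sketch of an fdupes input mode: read fdupes' default output
 * from stdin (groups of identical file paths separated by blank lines)
 * and pair every file in a group with the group's first file. A real
 * mode would pass each pair to the dedupe ioctl instead of printing it.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	char first[4096] = "";

	while (fgets(line, sizeof(line), stdin)) {
		line[strcspn(line, "\n")] = '\0';

		if (line[0] == '\0') {		/* a blank line ends the group */
			first[0] = '\0';
			continue;
		}
		if (first[0] == '\0') {		/* first path of a new group */
			snprintf(first, sizeof(first), "%s", line);
			continue;
		}
		/* every later member would be deduped against the first */
		printf("would dedupe '%s' against '%s'\n", line, first);
	}
	return 0;
}
```

For example, piping the output of "fdupes -r /some/dir" into this program prints one line per file that would be deduped.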

I should add, the one thing I want to avoid is having duperemove do yet another kind of file checksum.

@pwr22 (Author) commented Feb 4, 2015

Perhaps it would be possible to build a separate binary called "duperemovefiles" which takes a list of known-to-be-identical files? Hopefully this would be able to share some deduping code without messing with what "duperemove" does, which, as you say, is to concern itself with blocks. This would keep the checksumming completely separate, or it could be dropped entirely from "duperemovefiles".

@markfasheh (Owner)

Either way (different binary, or special mode of duperemove) it's the same amount of work. Taking a file list from fdupes for whole-file dedupe seems useful, so I've added it to the list of development tasks:

https://github.com/markfasheh/duperemove/wiki/Development-Tasks

Let me know if you agree/disagree with that writeup, otherwise I'll close this issue for now.

@pwr22 (Author) commented Feb 9, 2015

Looks good to me

@markfasheh (Owner)

FYI we now have an --fdupes option which will take output from fdupes and run whole-file dedupe on the results. Please give it a shot (example: fdupes . | duperemove --fdupes) and file any bugs you might find. Thanks again for the suggestion!

@pwr22 (Author) commented Sep 17, 2015

Sorry for not getting back to you sooner; just wanted to say thanks so much for implementing this. I'm trying it now.

@tlhonmey commented Oct 2, 2015

Thanks for adding this. It gives me a way to dedupe large quantities of small files safely without having to set the block size awfully low.

@Floyddotnet

Is there any option to skip files that are already deduped? If I run "fdupes . | duperemove --fdupes" twice, the second run also spends a lot of time trying to dedupe files that were already deduped in the previous run.
