dedup whole files #43

Closed
pwr22 opened this issue Jan 31, 2015 · 13 comments

@pwr22 commented Jan 31, 2015

The README mentions that hashing is done per 128KB block by default. Is there any way to force hashing and dedup of whole files only?

@markfasheh (Owner)

No, but this is something that will happen implicitly anyway as duperemove discovers that the majority of a file is the same.

@pwr22 (Author) commented Feb 2, 2015

There are still two issues worth thinking about:

  • What if the last chunk of the file is smaller than the block size? Will this always lead to the final chunk being stored uniquely each time and unnecessary fragmentation?
  • It is faster to do the hashing with the largest allowed blocksize (1M), and the dedupe processing after this is much, much faster too. I only ever want to dedupe entire files, and I think this would be even faster, so for people like me maybe some kind of --whole-files option could be extremely useful.

There is another project, bedup, but it's unmaintained and currently broken for new installs.

@markfasheh (Owner)

  • What if the last chunk of the file is smaller than the block size? Will this always lead to the final chunk being stored uniquely each time and unnecessary fragmentation?

Yes, though if it's a big deal you can mitigate this somewhat by using a smaller blocksize.

  • I only ever want to dedupe entire files, and I think this would be even faster, so for people like me maybe some kind of --whole-files option could be extremely useful.

Ok, so duperemove is a lot more concerned with individual block dedupe than you want. Internally we're classifying duplicates on a block-by-block basis; optionally switching to whole files is a possibility, but it goes against the grain of everything else we're doing in duperemove, so it could get a bit messy adding it in there. Well, we could probably hack it by saying "only dedupe extents which cover the entire file".
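
To make the "only dedupe extents which cover the entire file" idea concrete, here is a minimal sketch of such a filter. The struct and function are invented for illustration and do not reflect duperemove's real internals:

```c
/*
 * Hypothetical filter in the spirit of "only dedupe extents which cover
 * the entire file". The types here are made up for this sketch and are
 * not duperemove's actual data structures.
 */
#include <stdbool.h>
#include <stdint.h>

struct candidate_extent {
	uint64_t file_offset;	/* offset of the duplicate region in the file */
	uint64_t length;	/* length of the duplicate region, in bytes */
	uint64_t file_size;	/* total size of the file the region belongs to */
};

/* Accept a candidate only if the duplicate region spans the whole file. */
static bool covers_whole_file(const struct candidate_extent *e)
{
	return e->file_offset == 0 && e->length == e->file_size;
}
```

A filter like this would sit after the existing extent search, so it would not make the hashing itself any cheaper; that limitation is what leads to the fdupes-based mode discussed further down.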

A couple questions:

  • Why is it that you require whole-file dedupe only?
  • Do you need the software to discover the duplicated files, or is that information you would already have?

@markfasheh reopened this Feb 3, 2015
@dioni21 commented Feb 3, 2015

My answers to those questions, although I am not the original requester:

  1. Speed?
  2. Most distros have fdupes: http://en.wikipedia.org/wiki/Fdupes - maybe an option to receive input from fdupes and call ioctls to dedup whole files? (See the sketch below.)
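
For illustration only, here is a rough sketch of what "call ioctls to dedup whole files" could look like for two files already known to be identical, using the btrfs same-extent ioctl (BTRFS_IOC_FILE_EXTENT_SAME from linux/btrfs.h). This is not duperemove's code; the chunk size, open modes and error handling are simplified assumptions.

```c
/*
 * Rough sketch: whole-file dedupe of two files already known to be
 * identical, via the btrfs same-extent ioctl. Not duperemove's actual
 * implementation; error handling is minimal.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* The kernel caps a single dedupe request, historically at 16 MiB. */
#define DEDUPE_CHUNK (16ULL * 1024 * 1024)

int dedupe_whole_file(const char *src_path, const char *dst_path)
{
	/* Some kernels require the dedupe target to be opened for writing. */
	int src = open(src_path, O_RDONLY);
	int dst = open(dst_path, O_RDWR);
	struct stat st;
	struct btrfs_ioctl_same_args *args = NULL;
	struct btrfs_ioctl_same_extent_info *info;
	uint64_t off = 0;
	int ret = -1;

	if (src < 0 || dst < 0 || fstat(src, &st) < 0)
		goto out;

	args = calloc(1, sizeof(*args) + sizeof(*info));
	if (!args)
		goto out;
	info = &args->info[0];

	while (off < (uint64_t)st.st_size) {
		uint64_t len = (uint64_t)st.st_size - off;
		if (len > DEDUPE_CHUNK)
			len = DEDUPE_CHUNK;

		args->logical_offset = off;	/* offset within the source */
		args->length = len;
		args->dest_count = 1;
		info->fd = dst;			/* destination file descriptor */
		info->logical_offset = off;	/* same offset in the destination */
		info->status = 0;

		/*
		 * The kernel compares the two ranges itself and only shares
		 * extents whose contents really match, so this stays safe
		 * even if the "identical" list is wrong.
		 */
		if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0 ||
		    info->status != 0) {
			fprintf(stderr, "dedupe failed at offset %llu (status %d)\n",
				(unsigned long long)off, info->status);
			goto out;
		}
		off += len;
	}
	ret = 0;
out:
	free(args);
	if (src >= 0)
		close(src);
	if (dst >= 0)
		close(dst);
	return ret;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}
	return dedupe_whole_file(argv[1], argv[2]) ? 1 : 0;
}
```

fdupes' default output (groups of identical paths separated by blank lines) maps onto this naturally: take the first file of each group as the source and dedupe every other member against it.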

@pwr22 (Author) commented Feb 3, 2015

  1. Speed is a major concern, as is the memory used to store the results of hashing large amounts of data, and my particular use case often has many duplicate files being written. I'm running LXC containers on top of btrfs.
  2. I was going to say the same thing as @dioni21.

I've not checked how often duplicate blocks are found in identical vs non-identical files, but a cursory glance at the output makes me think most of them (for me) are in the identical ones.

Originally-identical files which have since diverged could well benefit from the incremental approach, and I guess that was one of the motivations behind it? Though unfortunately, due to block alignment and so on, only approximations of large duplicate regions are going to end up matched.

I tried running with a 4k block size but had to stop before hashing completed because I was at 14/16 GB of memory. After running across my system at the default values, about 600MB of RAM was eaten due to the bug I've seen mentioned.

@markfasheh (Owner)

So if we want speed we can't hack this into the current extent search / checksum stage - that code is intentionally chopping up the checksums into blocks.

Taking our input from fdupes is doable, though it would be a special mode where we skip the file scan and checksum stages and proceed directly to dedupe. The downside would be that most other features of duperemove aren't available in this mode (but that doesn't seem to be a big deal for your use case).
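
As a sketch of what reading fdupes output in such a mode could involve (again an illustration, not duperemove's code): fdupes prints groups of identical paths separated by blank lines, so the mode only has to pair each file in a group with the group's first file and hand those pairs to the dedupe step.

```c
/*
 * Dry-run sketch of an fdupes input mode: read fdupes' default output
 * from stdin (groups of identical file paths separated by blank lines)
 * and pair every file in a group with the group's first file. A real
 * mode would pass each pair to the dedupe ioctl instead of printing it.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	char first[4096] = "";

	while (fgets(line, sizeof(line), stdin)) {
		line[strcspn(line, "\n")] = '\0';

		if (line[0] == '\0') {		/* a blank line ends the group */
			first[0] = '\0';
			continue;
		}
		if (first[0] == '\0') {		/* first path of a new group */
			snprintf(first, sizeof(first), "%s", line);
			continue;
		}
		/* every later member would be deduped against the first */
		printf("would dedupe '%s' against '%s'\n", line, first);
	}
	return 0;
}
```

For example, piping the output of "fdupes -r /some/dir" into this program prints one line per file that would be deduped.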

I should add, the one thing I want to avoid is having duperemove do yet another kind of file checksum.

@pwr22 (Author) commented Feb 4, 2015

Perhaps it would be possible to build a separate binary called "duperemovefiles" which takes a list of known-to-be-identical files? Hopefully this would be able to share some deduping code without messing with what "duperemove" does, which, as you say, is to concern itself with blocks. This would keep the checksumming completely separate, or it could be dropped entirely from "duperemovefiles".

@markfasheh (Owner)

Either way (different binary, or special mode of duperemove) it's the same amount of work. Taking a file list from fdupes for whole-file dedupe seems useful, so I've added it to the list of development tasks:

https://github.com/markfasheh/duperemove/wiki/Development-Tasks

Let me know if you agree/disagree with that writeup, otherwise I'll close this issue for now.

@pwr22 (Author) commented Feb 9, 2015

Looks good to me

@markfasheh (Owner)

FYI we now have an --fdupes option which will take output from fdupes and run whole-file dedupe on the results. Please give it a shot (example: fdupes . | duperemove --fdupes) and file any bugs you might find. Thanks again for the suggestion!

@pwr22 (Author) commented Sep 17, 2015

Sorry for not getting back to you sooner; just wanted to say thanks so much for implementing this. I'm trying it now.

@tlhonmey commented Oct 2, 2015

Thanks for adding this. It gives me a way to dedupe large quantities of small files safely without having to set the block size awfully low.

@Floyddotnet

Is there any option to skip files that are already deduped? If I run "fdupes . | duperemove --fdupes" twice, the second run also spends a lot of time trying to dedupe files that were already deduped in the previous run.
