Allow setting SomeByteSize for first/last bytes checks #29

Open
mbirth opened this issue Jun 28, 2019 · 5 comments
Labels
enhancement (New feature or request), good idea (something that I find being a good idea to work on)

Comments

@mbirth

mbirth commented Jun 28, 2019

I tried rdfind with firmware update files. The problem with these is that the first 1000 bytes are identical, and even the last 64 (current default in Fileinfo.hh) don't differ much between the files. So rdfind resorts to calculating checksums, which takes a long time by comparison because the files are large.

Could you maybe add a parameter to set the SomeByteSize value to higher values?
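For illustration, the requested behaviour can be sketched outside rdfind. This is not rdfind's code: `read_edge` and `same_edges` are hypothetical helpers, and `some_byte_size` merely mirrors the SomeByteSize name from the issue. The point is only that a caller-supplied size lets the first/last comparison reach past a shared header.

```python
import os


def read_edge(path, some_byte_size, from_end=False):
    """Return up to some_byte_size bytes from the start (or end) of a file."""
    size = os.path.getsize(path)
    n = min(some_byte_size, size)  # never ask for more than the file holds
    with open(path, "rb") as f:
        if from_end:
            f.seek(size - n)
        return f.read(n)


def same_edges(path_a, path_b, some_byte_size):
    """True if both the first and the last some_byte_size bytes match."""
    return (
        read_edge(path_a, some_byte_size) == read_edge(path_b, some_byte_size)
        and read_edge(path_a, some_byte_size, from_end=True)
        == read_edge(path_b, some_byte_size, from_end=True)
    )
```

With two files that share a 1000-byte header, a call such as `same_edges(a, b, 64)` cannot tell them apart, while a larger value can do so without hashing the whole file.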

@pauldreik
Owner

That sounds like a good idea - I think I also would need to hash the first SomeByteSize bytes in case they exceed the hash buffer size, which may slow things down.
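A rough sketch of that idea, assuming a fixed-size read buffer: when the requested prefix is larger than the buffer, it is fed to a hash in buffer-sized chunks, so only the digest has to be kept for comparison. The names `prefix_digest` and `BUFFER_SIZE`, the 4096-byte value, and the choice of SHA-1 are all assumptions for illustration, not rdfind's actual implementation.

```python
import hashlib

BUFFER_SIZE = 4096  # hypothetical buffer size, chosen only for this sketch


def prefix_digest(path, some_byte_size, buffer_size=BUFFER_SIZE):
    """Hash the first some_byte_size bytes of a file, one buffer at a time."""
    h = hashlib.sha1()
    remaining = some_byte_size
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(buffer_size, remaining))
            if not chunk:  # file is shorter than the requested prefix
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.digest()
```

The extra cost mentioned above shows up here as one hash update per chunk, which a plain byte comparison of a 64-byte prefix does not need.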

@pauldreik
Owner

It would be interesting to hear about your use case - how many duplicate files are there, and how big are they? I am thinking about how these partial comparisons could be improved further.

@pauldreik added the enhancement and good idea labels Jun 28, 2019
@mbirth
Author

mbirth commented Jun 28, 2019

Basically I've mirrored http://gawisp.com/perry/ and since one firmware often works for multiple devices, it appears in multiple files that are all the same. E.g. from the fenix_D2_tactix/ directory, the D2Delta_520.gcd, D2DeltaS_520.gcd and D2DeltaPX_520.gcd are identical. But while these firmwares are ~10 MiB in size, firmwares for other devices can be up to 600 MiB.

@laktak

laktak commented Sep 22, 2020

The first/last byte check does not seem very efficient, at least for me:

Now have 512092 files in total.
Total size is 3869933516438 bytes or 4 TiB
Removed 10321 files due to unique sizes from list.501771 files left.
Now eliminating candidates based on first bytes:removed 1252 files from list.500519 files left.
Now eliminating candidates based on last bytes:removed 332 files from list.500187 files left.

If you add this option it would be nice if 0 could turn that feature off.
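As a sketch of how a value of 0 could switch the stage off: the elimination step below returns its input untouched when the size is 0, and otherwise drops files whose size and first bytes are unique. `eliminate_by_first_bytes` is a hypothetical function written for this illustration and does not reflect rdfind's internals.

```python
import os
from collections import defaultdict


def eliminate_by_first_bytes(paths, some_byte_size):
    """Keep only files that share their size and first bytes with another file.

    A some_byte_size of 0 disables the stage: every path is kept.
    """
    if some_byte_size == 0:
        return list(paths)
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            key = (os.path.getsize(path), f.read(some_byte_size))
        groups[key].append(path)
    # A file that is alone in its group cannot have a duplicate.
    return [p for group in groups.values() if len(group) > 1 for p in group]
```

In a run like the one quoted above, where the stage removes about 0.25% of the candidates, skipping it would avoid opening each of the roughly 500000 files just to read a few bytes.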

@brainchild0

I recently processed a very large collection of small files, most of them a few megabytes each. Opening each file in turn carries a great deal of overhead, and on most modern systems the difference between processing a kilobyte and a few megabytes is negligible. It may be useful as an optimization to read the full file on the first pass if it is not large, and to cache the full-file checksum for the later stage, if needed. For large data sets it may also help to collect both the header and the footer in the same pass, instead of in two separate ones.

I do agree that 1000 bytes is a small size, given how easily much more data can be read very quickly on a modern system. For efficiency, the data may be compared as a checksum instead of as raw data, the same as is done for the file contents, as suggested above.
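The single-pass suggestion could look roughly like this. Each file is opened once; head and tail are reduced to checksums, and a file at or below a size threshold is read in full so that its complete checksum can be cached for the later stage. `scan_once`, `SMALL_FILE_LIMIT`, the 4 MiB threshold, the 4096-byte edge size and the use of SHA-1 are assumptions made for this sketch only.

```python
import hashlib
import os

SMALL_FILE_LIMIT = 4 * 1024 * 1024  # hypothetical threshold for "not large"


def _sha1(data):
    return hashlib.sha1(data).digest()


def scan_once(path, edge_size=4096, small_limit=SMALL_FILE_LIMIT):
    """Open a file once and return checksums of its head, tail and,
    for small files, its full contents (otherwise "full" is None)."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size <= small_limit:
            data = f.read()  # small file: one read serves all three checksums
            tail_start = max(0, len(data) - edge_size)
            return {
                "head": _sha1(data[:edge_size]),
                "tail": _sha1(data[tail_start:]),
                "full": _sha1(data),
            }
        head = f.read(edge_size)
        f.seek(max(0, size - edge_size))
        tail = f.read(edge_size)
    return {"head": _sha1(head), "tail": _sha1(tail), "full": None}
```

Whether the cached full checksum pays off depends on how many small files survive the earlier stages; for files that are eliminated by size alone, the full read would be wasted work.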

4 participants