Allow setting SomeByteSize for first/last bytes checks #29

mbirth · 2019-06-28T12:34:36Z

I tried rdfind with firmware update files. The problem with these is that the first 1000 bytes are identical, and even the last 64 (the current default in Fileinfo.hh) don't differ much between the files. So rdfind falls back to calculating checksums, which takes a long time by comparison because the files are large.

Could you maybe add a parameter to set the SomeByteSize value to higher values?


pauldreik · 2019-06-28T19:19:25Z

That sounds like a good idea - I think I also would need to hash the first SomeByteSize bytes in case they exceed the hash buffer size, which may slow things down.
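A minimal sketch of that idea, written in Python rather than rdfind's C++. The helper name `prefix_digest`, the 4096-byte buffer, and the choice of SHA-1 are all assumptions for illustration and are not taken from rdfind. When the requested prefix is larger than the buffer, the prefix is fed to the hash chunk by chunk and only the digest is kept for comparison.

```python
import hashlib

BUFFER_SIZE = 4096  # assumed buffer size, not rdfind's actual value


def prefix_digest(path, some_byte_size, buffer_size=BUFFER_SIZE):
    """Return a SHA-1 hex digest of the first some_byte_size bytes of path."""
    digest = hashlib.sha1()
    remaining = some_byte_size
    with open(path, "rb") as handle:
        while remaining > 0:
            # Never read more than the buffer holds or the prefix needs.
            chunk = handle.read(min(buffer_size, remaining))
            if not chunk:  # file is shorter than the requested prefix
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()
```

Two files whose digests differ cannot be duplicates. Equal digests still need the later full-file stage, since only the prefix was examined.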

pauldreik · 2019-06-28T19:20:40Z

It would be interesting to hear about your use case - how many duplicate files are there, and how big are they? I'm thinking about how these partial comparisons could be improved further.

mbirth · 2019-06-28T19:45:27Z

Basically I've mirrored http://gawisp.com/perry/ and since one firmware often works for multiple devices, it appears in multiple files that are all the same. E.g. from the fenix_D2_tactix/ directory, the D2Delta_520.gcd, D2DeltaS_520.gcd and D2DeltaPX_520.gcd are identical. But while these firmwares are ~10 MiB in size, firmwares for other devices can be up to 600 MiB.

laktak · 2020-09-22T11:50:43Z

The first/last byte check does not seem very efficient, at least for me:

Now have 512092 files in total.
Total size is 3869933516438 bytes or 4 TiB
Removed 10321 files due to unique sizes from list.501771 files left.
Now eliminating candidates based on first bytes:removed 1252 files from list.500519 files left.
Now eliminating candidates based on last bytes:removed 332 files from list.500187 files left.

If you add this option it would be nice if 0 could turn that feature off.
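As a rough illustration of how such an option might behave (a Python sketch, not rdfind code): the function name `filter_by_first_bytes` and its parameter are hypothetical, and the input is assumed to be a list of files that already share the same size. A value of 0 skips the check entirely.

```python
from collections import defaultdict


def filter_by_first_bytes(paths, some_byte_size):
    """Keep only files whose first some_byte_size bytes match another file's.

    A size of 0 disables the check and returns the list unchanged.
    """
    if some_byte_size == 0:
        return list(paths)
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as handle:
            # Files with the same leading bytes land in the same group.
            groups[handle.read(some_byte_size)].append(path)
    # A file alone in its group has a unique prefix and cannot be a duplicate.
    return [p for group in groups.values() if len(group) > 1 for p in group]
```

With 0 no file is opened at all, which is the point when the check removes as few candidates as in the output above.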

brainchild0 · 2022-10-11T12:29:01Z

I recently processed a very large collection of small files, most of them several megabytes in size. Opening each file in turn carries a great deal of overhead, and on most modern systems the difference between processing a kilobyte and a few megabytes is negligible. It may be useful as an optimization to read the full file on the first pass if it is not large, and to cache the full-file checksum for the later stage, if needed. It may also help for large data sets to collect both the header and the footer in the same pass, instead of in two separate ones.

I do agree that 1000 bytes is a small size, given the ease of reading much more data very quickly on a modern system. The data may be compared as a checksum instead of the raw data, the same as is done for the file contents, for efficiency, as suggested above.
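The single-pass suggestion could look roughly like the following Python sketch. The name `file_signature`, the 4 MiB threshold, and the use of SHA-1 are assumptions chosen for illustration and do not come from rdfind. Each file is opened once: small files are read in full and their checksum is cached, while larger files only have their head and tail read.

```python
import hashlib
import os

SMALL_FILE_LIMIT = 4 * 1024 * 1024  # assumed threshold for "not large"


def file_signature(path, some_byte_size, small_file_limit=SMALL_FILE_LIMIT):
    """Return (head, tail, full_digest_or_None) using a single open of path."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        if size <= small_file_limit:
            # Small file: read it all once and keep the checksum for later.
            data = handle.read()
            head = data[:some_byte_size]
            tail = data[max(0, len(data) - some_byte_size):]
            return head, tail, hashlib.sha1(data).hexdigest()
        # Large file: read only the first and last bytes, no full checksum yet.
        head = handle.read(some_byte_size)
        handle.seek(max(0, size - some_byte_size))
        tail = handle.read(some_byte_size)
    return head, tail, None
```

Whether this pays off depends on the data set: it trades extra reads on files that would have been eliminated early for fewer open calls overall.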

pauldreik added the "enhancement" (New feature or request) and "good idea" (something that I find being a good idea to work on) labels on Jun 28, 2019

mauromol mentioned this issue Jul 21, 2021

Allow to swap first bytes/last bytes checks #76

Open

fziglio mentioned this issue Aug 23, 2022

Do not eliminate candidates using first/last bytes for smaller files. #114

Open
