Allow setting SomeByteSize for first/last bytes checks #29
Comments
That sounds like a good idea. I think I would also need to hash the first SomeByteSize bytes in case they exceed the hash buffer size, which may slow things down.
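One way to picture the concern above is to hash the prefix in chunks, so that the prefix length is no longer tied to the buffer size. The following is only a sketch under assumptions: `hash_prefix` and its parameters are hypothetical names, not rdfind code, and FNV-1a stands in for whatever checksum rdfind would actually use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of rdfind.
// Hashes at most prefix_len leading bytes of a file while reading through a
// buffer of only buffer_len bytes. FNV-1a is a stand-in: any incremental
// checksum could be fed chunk by chunk in the same loop.
// Returns false if the file cannot be opened or a read error occurs.
inline bool hash_prefix(const std::string& path, std::size_t prefix_len,
                        std::size_t buffer_len, std::uint64_t& digest) {
  assert(buffer_len > 0);
  std::ifstream in(path, std::ios::binary);
  if (!in) return false;

  std::vector<char> buffer(buffer_len);
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  std::size_t remaining = prefix_len;
  while (remaining > 0) {
    const std::size_t want = std::min(remaining, buffer.size());
    in.read(buffer.data(), static_cast<std::streamsize>(want));
    if (in.bad()) return false;
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    for (std::size_t i = 0; i < got; ++i) {
      h ^= static_cast<unsigned char>(buffer[i]);
      h *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
    if (got < want) break;  // the file is shorter than prefix_len
    remaining -= got;
  }
  digest = h;
  return true;
}
```

Two files whose first prefix_len bytes match get the same digest regardless of buffer_len, so the buffer can stay small even if the prefix grows. The cost is that every extra byte in the prefix has to be read and hashed, which is the slowdown mentioned above.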
It would be interesting to hear about your use case: how many duplicate files are there, and how big are they? I am thinking about how these partial comparisons could be improved further.
Basically, I've mirrored http://gawisp.com/perry/, and since one firmware often works for multiple devices, it appears in multiple files that are all identical. E.g. from the
The first/last byte check does not seem very efficient, at least for me:
If you add this option, it would be nice if
I recently processed a very large collection of small files, most of them several megabytes in size. Opening each file in turn carries a great deal of overhead, and on most modern systems the difference between processing a kilobyte and processing a few megabytes is negligible. As an optimization, it may be useful to read the full file on the first pass if it is not large, and to cache the full-file checksum for the later stage, if needed. For large data sets, it may also help to collect both the header and the footer in the same pass instead of in two separate ones. I do agree that 1000 bytes is a small size, given how easily a modern system can read much more data very quickly. For efficiency, the data may be compared as a checksum instead of as raw data, the same way the file contents are compared, as suggested above.
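The single-pass idea above could look roughly like the sketch below. It is a minimal illustration under assumptions, not rdfind's implementation: `Edges`, `read_edges`, and the `small_limit` threshold are hypothetical names. Each file is opened once, and files at or below the threshold are read completely so that a caller could checksum them right away and cache the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical result type; none of these names come from rdfind.
struct Edges {
  std::string head;         // first n bytes, or the whole file if it is small
  std::string tail;         // last n bytes (at most the file size)
  bool whole_file = false;  // true when head holds the complete contents
};

// Collects the first and last n bytes of a file with a single open.
// Files no larger than small_limit are read completely.
// Returns false if the file cannot be opened or read.
inline bool read_edges(const std::string& path, std::size_t n,
                       std::uintmax_t small_limit, Edges& out) {
  assert(n > 0);
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  if (!in) return false;
  const std::streamoff end = in.tellg();  // opened at the end: this is the size
  if (end < 0) return false;
  const std::uintmax_t size = static_cast<std::uintmax_t>(end);

  out = Edges{};
  out.whole_file = size <= small_limit;
  const std::size_t tail_len =
      static_cast<std::size_t>(std::min<std::uintmax_t>(n, size));
  const std::size_t head_len =
      out.whole_file ? static_cast<std::size_t>(size) : tail_len;

  out.head.resize(head_len);
  in.seekg(0);
  if (head_len > 0 &&
      !in.read(&out.head[0], static_cast<std::streamsize>(head_len)))
    return false;

  if (out.whole_file) {
    // The tail is already in memory, so no second read is needed.
    out.tail = out.head.substr(head_len - tail_len);
    return true;
  }

  out.tail.resize(tail_len);
  in.seekg(end - static_cast<std::streamoff>(tail_len));
  if (tail_len > 0 &&
      !in.read(&out.tail[0], static_cast<std::streamsize>(tail_len)))
    return false;
  return true;
}
```

A caller might invoke `read_edges(path, 4096, 1 << 20, edges)` and, when `edges.whole_file` is set, hash `edges.head` once and keep that digest for the final checksum stage. Both numbers are arbitrary illustrations, not values taken from rdfind.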
I tried rdfind with firmware update files. The problem with these is that the first 1000 bytes are identical, and even the last 64 (current default in Fileinfo.hh) don't differ that much between the files. So rdfind resorts to calculating checksums, which takes a long time (large files) in comparison. Could you maybe add a parameter to set the SomeByteSize value to higher values?