Feat: --skip-content-hash, --max-prefix-size, --max-suffix-size options #202

johnpyp · 2023-06-03T22:27:11Z

Partially fixes: #201

(Completely open to changes in the naming/wording/api/etc.)

Changes

Upgrade deps
Add --max-prefix-size and --max-suffix-size options
- These options set the maximum prefix and suffix sizes for the prescan. Larger values reduce the chance that files which are not true duplicates survive as candidates until the full hash scan.
Add --skip-content-hash option
- Skips the final content-hash stage and returns the result of the suffix stage (not implemented for --transform)

Potential Follow-up

Random chunk checks:

--random-chunk-checks=5
--random-chunk-size=16MiB

Though prefix and suffix checks are a great pre-filtering step, the start and end of a file are also the parts most likely to be identical across different files. At the same time, there are cases where fully hashing the file would take a prohibitively long time or be too expensive.

Instead of a full hash, we could use the file's byte size as a seed to pseudo-randomly select n chunks to read, and group files by those chunks in the same fashion as the prefix and suffix checks. Because files are already grouped by size, files of the same size would always be sampled at the same offsets. This should make false duplicates very unlikely while still being orders of magnitude faster than full content hashing. It also has the nice side effect of being a continuous tuning lever for balancing safety against speed.
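A minimal sketch of the offset selection described above, not an implementation from this PR. The function names `splitmix64` and `chunk_offsets` are hypothetical, and SplitMix64 is just one convenient deterministic generator chosen for the illustration; the proposal itself does not name one. The `count` and `chunk_size` parameters stand in for the proposed --random-chunk-checks and --random-chunk-size values.

```rust
/// SplitMix64 step: a small deterministic PRNG (an assumption of this sketch,
/// not something the proposal specifies).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper: pick up to `count` chunk start offsets for a file of
/// `file_len` bytes. The file length is the seed, so two files of equal size
/// are always sampled at the same positions.
fn chunk_offsets(file_len: u64, chunk_size: u64, count: usize) -> Vec<u64> {
    if count == 0 || chunk_size == 0 {
        return Vec::new();
    }
    // A file no larger than one chunk is fully covered by a read at offset 0.
    if file_len <= chunk_size {
        return vec![0];
    }
    let max_start = file_len - chunk_size; // last offset where a whole chunk fits
    let mut state = file_len; // seed = byte size of the file
    let mut offsets: Vec<u64> = (0..count)
        .map(|_| splitmix64(&mut state) % (max_start + 1))
        .collect();
    // Sorting gives sequential reads; dedup drops repeated offsets.
    offsets.sort_unstable();
    offsets.dedup();
    offsets
}
```

Calling `chunk_offsets(file_len, 16 << 20, 5)` would correspond to --random-chunk-checks=5 with --random-chunk-size=16MiB. Chunks may overlap in this sketch; a real implementation might prefer to spread them out or align them to block boundaries.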

- dirs 4.0 -> 5.0.1
- fallible-iterator 0.2 -> 0.3
- sysinfo 0.28 -> 0.29
- Required renaming DiskType to DiskKind in various places

--max-prefix-size - Configurable byte-size parameter for the max length of a file to hash for prefix checking
--max-suffix-size - Same as --max-prefix-size, but for the suffix check

--skip-content-hash will skip the final stage, returning the results from the previous groupings as the final result. This can speed up the checking by orders of magnitude on large files and, alongside --max-prefix-size and --max-suffix-size, can still provide reasonable guarantees on whether files are duplicates.
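To make the interaction of the three options concrete, here is a simplified in-memory sketch of the staged grouping (size, then prefix, then suffix, then optionally full content). It is an illustration under stated assumptions, not fclones' actual code: the `Options` struct, `refine`, `hash_bytes` and `find_duplicates` are hypothetical names, files are modelled as byte slices rather than read from disk, and the standard library's `DefaultHasher` stands in for whatever hash the tool really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical options mirroring the three flags; not fclones' real types.
struct Options {
    max_prefix_size: usize,
    max_suffix_size: usize,
    skip_content_hash: bool,
}

/// Hash a byte slice with the standard library hasher (a stand-in choice).
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Split every group by `key` and keep only subgroups with at least 2 members.
fn refine<K: Hash + Eq>(groups: Vec<Vec<usize>>, key: impl Fn(usize) -> K) -> Vec<Vec<usize>> {
    let mut out = Vec::new();
    for group in groups {
        let mut buckets: HashMap<K, Vec<usize>> = HashMap::new();
        for idx in group {
            buckets.entry(key(idx)).or_default().push(idx);
        }
        out.extend(buckets.into_values().filter(|g| g.len() > 1));
    }
    out
}

/// Return groups of indices into `files` that are considered duplicates.
fn find_duplicates(files: &[&[u8]], opts: &Options) -> Vec<Vec<usize>> {
    let all: Vec<usize> = (0..files.len()).collect();
    // Stage 1: only files of equal size can be duplicates.
    let by_size = refine(vec![all], |i| files[i].len());
    // Stage 2: hash at most `max_prefix_size` leading bytes.
    let by_prefix = refine(by_size, |i| {
        let f = files[i];
        hash_bytes(&f[..f.len().min(opts.max_prefix_size)])
    });
    // Stage 3: hash at most `max_suffix_size` trailing bytes.
    let by_suffix = refine(by_prefix, |i| {
        let f = files[i];
        hash_bytes(&f[f.len() - f.len().min(opts.max_suffix_size)..])
    });
    // With skip_content_hash, the suffix-stage groups are the final answer.
    if opts.skip_content_hash {
        return by_suffix;
    }
    // Stage 4: full content hash.
    refine(by_suffix, |i| hash_bytes(files[i]))
}
```

In this sketch, two equal-sized files that share their first and last bytes within the configured limits but differ in the middle are reported as duplicates when `skip_content_hash` is set, and are separated only by the final stage. Raising the prefix and suffix limits narrows that gap at the cost of more I/O.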

pkolaczk

Thank you so much! This is a very nice feature.

johnpyp added 3 commits June 3, 2023 14:17

Upgrade dirs, fallible-iterator, sysinfo

badd973

Add --max-prefix-size and --max-suffix-size options

8f6a9a9

pkolaczk approved these changes Jun 4, 2023

pkolaczk merged commit ccb4e18 into pkolaczk:main Jun 4, 2023

pkolaczk mentioned this pull request Jun 4, 2023

fclones 0.31.0 Homebrew/homebrew-core#132825

Merged
