question: --algo blake3 #6

Closed
spock opened this issue Dec 17, 2023 · 8 comments

Comments

@spock
Contributor

spock commented Dec 17, 2023

Hi, thank you for a promising-looking file bitrot/hash checker. I especially like the built-in logic of "modified content and date are fine, modified content alone is not" - this is exactly what I've been looking for!

For file integrity checking there is a rather new algorithm, BLAKE3, which is significantly faster than md5 (around 9x) and also claims to be better; its authors published an article with more details and benchmarks. It was designed specifically for file (content) hashing.

The primary (binary) implementation is in Rust (with parallelization), but there are also reference/educational non-parallel implementations in C and pure Python.

If you think this could be a nice --algo option, what could be the best way to integrate it? As you already have multi-worker support, I guess calling their single-threaded C library (or asking for single-thread processing from the main Rust library) would be the best? I haven't yet checked if Python bindings exist, but I'd assume they do.
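One way such an integration could look, as a minimal sketch and not chkbit's actual code: each worker hashes a whole file with a single-threaded hasher, reading it in fixed-size chunks. The names `new_hasher` and `hash_file` and the chunk size are assumptions made for illustration. `blake3` is the optional third-party package from PyPI; any other algorithm name is passed to the standard `hashlib.new`.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def new_hasher(algo):
    """Return a hasher object with update()/hexdigest() for the given name."""
    if algo == "blake3":
        # Optional dependency (pip install blake3); single-threaded by default.
        from blake3 import blake3
        return blake3()
    # Any algorithm hashlib knows, e.g. "md5" or "sha512".
    return hashlib.new(algo)


def hash_file(path, algo="md5"):
    """Hash a file chunk by chunk so large files never sit fully in memory."""
    hasher = new_hasher(algo)
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Because every call creates its own hasher, a function like this can be handed to an existing pool of workers without any shared state between them.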

@spock
Contributor Author

spock commented Dec 17, 2023

This is somewhat related to closed #3

@spock
Contributor Author

spock commented Dec 17, 2023

Python bindings for the Rust library do exist; they default to 1 thread and accept a max_threads parameter: https://pypi.org/project/blake3/

from blake3 import blake3

# Hash a large input using multiple threads. Note that this can be slower for
# inputs shorter than ~1 MB, and it's a good idea to benchmark it for your use
# case on your platform.
large_input = bytearray(1_000_000)
hash_single = blake3(large_input).digest()
hash_two = blake3(large_input, max_threads=2).digest()
hash_many = blake3(large_input, max_threads=blake3.AUTO).digest()
assert hash_single == hash_two == hash_many
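Following the note about benchmarking, a rough timing sketch might look like the one below. It is not taken from the blake3 documentation: the `throughput_mb_s` helper and the 16 MB input size are assumptions, and blake3 is only measured if the package happens to be installed.

```python
import hashlib
import time


def throughput_mb_s(make_hasher, data, repeats=3):
    """Return the best observed hashing throughput in MB/s over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        make_hasher(data).digest()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer.
    return len(data) / 1_000_000 / max(best, 1e-9)


if __name__ == "__main__":
    data = bytes(16_000_000)  # 16 MB of zeros, held in memory to exclude disk IO
    candidates = {"md5": hashlib.md5}
    try:
        from blake3 import blake3
        candidates["blake3 (1 thread)"] = blake3
        candidates["blake3 (auto)"] = lambda d: blake3(d, max_threads=blake3.AUTO)
    except ImportError:
        pass  # blake3 not installed; only md5 is measured
    for name, make_hasher in candidates.items():
        print(f"{name}: {throughput_mb_s(make_hasher, data):.0f} MB/s")
```

Hashing an in-memory buffer isolates the cost of the algorithm itself; real runs over files on disk also include read time, so their numbers can be lower.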

laktak added a commit that referenced this issue Dec 19, 2023
@laktak
Owner

laktak commented Dec 19, 2023

You can take a look in the blake3 branch. I have not had time to test it so please let me know how it performs.

@laktak
Owner

laktak commented Dec 21, 2023

Thank you for suggesting blake3. It really has a lot of improvements over md5 so I've made it the default.

@laktak laktak closed this as completed Dec 21, 2023
@spock
Contributor Author

spock commented Dec 21, 2023

Wow, thank you for such a quick integration! And sorry for not yet reacting to your request for testing - I would have done that within the next few days, as chkbit is my current favorite for collection hash checks.

Now I'm definitely going to try the new algorithm 😋 I did record hashing/checking times with md5 :)

Thank you!

@spock
Contributor Author

spock commented Dec 21, 2023

A small update on speeds:

  • with a single worker chkbit is now 6 minutes (20%) faster on my dataset (was: 30m, is: 24m)
  • with 6 workers, the speedup is more modest at about 20 seconds (was: 7m 4s, is: 6m 43s)
  • fancy output is nice and (with 6 workers) does not affect speed
  • I really like the new speed stats at the end!

Processed 48409 files in readonly mode.

  • 120.24 files/second
  • 1104.05 MB/second

@laktak
Owner

laktak commented Dec 22, 2023

Hmm, I forgot to include elapsed so I had to fix that first ;)

With md5 (10 workers)

Processed 41417 files in readonly mode.
- 0:02:21 elapsed
- 292.57 files/second
- 2439.38 MB/second

With blake3

Processed 41417 files in readonly mode.
- 0:01:59 elapsed
- 345.36 files/second
- 2879.54 MB/second

@spock I think your IO is not able to keep up.
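As a rough check of the elapsed times reported in this thread, the relative savings can be computed as below. The `speedup` helper is hypothetical and only restates the arithmetic; the inputs are the md5 and blake3 timings quoted in the comments above.

```python
def speedup(before_s, after_s):
    """Return (time saved in seconds, fraction of the original time saved)."""
    saved = before_s - after_s
    return saved, saved / before_s


# Elapsed times in seconds (md5, blake3), taken from the comments above.
runs = {
    "spock, 1 worker": (30 * 60, 24 * 60),
    "spock, 6 workers": (7 * 60 + 4, 6 * 60 + 43),
    "laktak, 10 workers": (2 * 60 + 21, 1 * 60 + 59),
}

for name, (md5_s, blake3_s) in runs.items():
    saved, fraction = speedup(md5_s, blake3_s)
    print(f"{name}: {saved} s saved ({fraction:.1%})")
```

This gives roughly 20% for spock's single-worker run, about 5% with 6 workers, and about 16% for the 10-worker run. A gain that shrinks as workers are added is consistent with reads, not hashing, limiting the multi-worker run.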

@spock
Contributor Author

spock commented Dec 27, 2023

I agree, it looks like IO is my bottleneck with blake3. Hopefully that is just a peculiarity of accessing a Windows (NTFS) encrypted folder from within WSL2 😄
