Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring and extending the Digest module with better hash functions #12307

Merged
merged 14 commits into from
Dec 18, 2023

Conversation

xavierleroy
Copy link
Contributor

The Digest standard library module provides functions to compute 128-bit digests (a.k.a. hashes) of strings, files, and other sequences of bytes. It's been used from the very beginning of OCaml to record dependencies over .cmi compiled interfaces in .cmi, .cmo and .cmx compiled files, supporting the infamous "make inconsistent assumptions on" link-time error.

The hashing function used is MD5. Back in 1996 it was the recommended cryptographic hash function, fast and reputedly secure. Today, it is known to be completely insecure, with easy collision attacks and chosen-prefix attacks. For many applications such as checksumming .cmi files, there is no security expectation and this is not a problem. But it can be a problem if users wrongly assume digests to be unforgeable. Also, any use of MD5 raises alarms in a security review, even if it is used for checksumming and not as a cryptographic hash function.

This PR does not (yet) propose to change the existing Digest functions to use another, more secure hashing function. Instead, it lays the ground work to do so.

The first commit introduces a signature Digest.S for hash functions, providing the same functions as Digest, and one implementation of this signature, Digest.MD5, which contains the same MD5 implementation that we have today for Digest, with a guarantee that it will always use MD5 in the future. This way, existing OCaml code that relies on Digest implementing MD5 can start using Digest.MD5 instead of Digest and be assured not to break in the near future.

The second commit adds modules Digest.BLAKE2_512, Digest.BLAKE2_256, and Digest.BLAKE2_128, using the BLAKE2b hash function specialized to 512, 256 or 128-bit hashes. BLAKE2b is an instance of a modern, cryptographically-safe hash function. (I'll explain later why I find it particularly interesting compared to other obvious choices like SHA-256.)

As soon as BLAKE2b is provided as submodules of Digest, we can start experimenting with it, e.g. try to use it for .cmi checksumming instead of MD5.

If everything goes well, in 2-3 major releases from now, we could consider changing the default Digest implementation from MD5 to, for instance, BLAKE2_128. (I have a hunch that most users of Digest don't care about the hashing algorithm used but do depend on hashes being 16-byte long.)

@c-cube
Copy link
Contributor

c-cube commented Jun 16, 2023

This seems lovely to me, as I looked at Digest in the past but never used it because its interface is so limited (and md5 is indeed quite obsolete now). However I think the interface Digest.S is still a bit restrictive.

It's good to have string and bytes and substring but they're still not enough for the general case. If channels were extensible, input would suffice, of course, but sadly they're not.

I think something like of_read : (bytes -> int -> int -> int) -> t would be very useful, and not just here. If you want to compress then encrypt/hash, that's probably what you're going to get from camlzip and the likes. Having to write to a file, or read entirely into a string, is both slower and more inconvenient than just having a stream like that.


(** {1 Generic interface} *)

module type S = sig
Copy link
Contributor

@dbuenzli dbuenzli Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to add a val id : string or val : name : string made to hold a lowercased ascii identifier for the digest function. This allows to easily index them by name via first-class modules. See for example here for something similar (for now not suggesting to add the lookup API, only the names).

stdlib/digest.ml Outdated
external update: state -> string -> int -> int -> unit = "caml_blake2_update"
external final: state -> int -> t = "caml_blake2_final"
external unsafe_string: int -> string -> string -> int -> int -> t
= "caml_blake2_string"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about tweaking that to a HASH interface without the blake specific ints. And provide Digest.Make that derives S from that. This could be useful for the larger eco-system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the implementation is structured like this, with a Make functor parameterized by a low-level hash function, but I haven't exported it yet, and I'm not sure it would really be usable.

stdlib/digest.mli Outdated Show resolved Hide resolved
@xavierleroy
Copy link
Contributor Author

Thanks for the quick feedback.

Re: incremental computation of hashes, I was planning to add a rolling hash API at some point. It's now done in af40ba6. I had to add more C stubs for MD5, which is annoying, though.

Re: of_hex vs from_hex, see 002a780.

testsuite/tests/lib-digest/digests.ml Outdated Show resolved Hide resolved
stdlib/digest.mli Outdated Show resolved Hide resolved
stdlib/digest.mli Outdated Show resolved Hide resolved
val get_digest : state -> t
(** [get_digest st] returns the digest of all the data that has
been added to [st].
The rolling digest [st] should not be used afterwards. *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this due to destructively padding to 128 bytes at the end?

Then this could be made "safe" by padding a copy of the state. A copy function might be useful by itself.

Alternatively, if we want to restrict ourselves to more basic scenarios, various tricks could be used to hint at or enforce this linear discipline. For example with_state : (state -> unit) -> t similar to In_channel.with_open_text etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be made safe for reuse, but I'm not sure there's a use case. Opinions welcome.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is interesting, it provides two things that I would like to have:

  • a path towards a non-MD5 Digest implementation (this PR does only a first step, for compatibility reasons)
  • a better OCaml-side API for hashes where input can be fed gradually to a hasher state

I broadly agree with the approach and would be in favor of merging eventually. I have not reviewed the Blake2 implementation in details -- I think that someone should check that it corresponds to its published pseudo-code.

stdlib/digest.mli Show resolved Hide resolved
runtime/blake2.c Show resolved Hide resolved
stdlib/digest.mli Outdated Show resolved Hide resolved
stdlib/digest.ml Outdated Show resolved Hide resolved
stdlib/digest.ml Outdated Show resolved Hide resolved
stdlib/digest.ml Outdated Show resolved Hide resolved
stdlib/digest.ml Outdated Show resolved Hide resolved
@xavierleroy
Copy link
Contributor Author

Note to reviewers: the documentation needs more work, I'm not happy with the wording in several places.

@hannesm
Copy link
Member

hannesm commented Oct 27, 2023

I do appreciate this work. Thanks a lot.

I only have one question / remark: why is the "incremental" API mutable (is that for performance reasons? if it has been measured, with what usage pattern)?

If it stays mutable, a function val copy : state -> state would help for use cases where you get data, have to compute a hash value, but keep on feeding to produce another digest with more data that was hashed.

@xavierleroy
Copy link
Contributor Author

I see some legitimate questions about the new incremental hash API, and they make me think twice about my API choice.

Originally, this PR did not have an incremental API, it was only about providing multiple implementations of the existing API, so that we can gradually move away from MD5. The incremental API was added following reviewer's comments, and I didn't think hard about the API, just using the low-level, imperative, linear-use API that is typical of C crypto libraries and that I used previously in my Cryptokit library. To me it's an obvious choice, but not to everyone, apparently.

Since PRs are the wrong place to evolve a new API, I decided to remove the incremental API from this PR. Now we're back to the multiple implementations of the old, nonincremental API that I'd like to push in 5.2. I'm asking reviewers to please focus on this API and not ask for more functions until we get something merged.

The original code with the questionable incremental API is kept at https://github.com/xavierleroy/ocaml/tree/blake2b-incremental .

@gasche gasche self-assigned this Nov 29, 2023
@dra27 dra27 added the stdlib label Nov 29, 2023
@OlivierNicole
Copy link
Contributor

During the last triaging meeting it was agreed that I would review this PR. I was told that more-than-basic cryptographic knowledge wasn’t necessary. Did I understand correctly @gasche?

The longest task will be checking the C implementation with regard to the BLAKE2 RFC. Since said RFC provides a reference implementation, I was wondering why that implementation was not the one used?

@xavierleroy
Copy link
Contributor Author

During the last triaging meeting it was agreed that I would review this PR.

Thanks a lot for your help!

Since said RFC provides a reference implementation, I was wondering why that implementation was not the one used?

I took the BLAKE2b implementation from my Cryptokit library https://github.com/xavierleroy/cryptokit/blob/master/src/blake2.c , but it's essentially the RFC 7693 reference code with a few style changes to match the style of the rest of Cryptokit.

The longest task will be checking the C implementation with regard to the BLAKE2 RFC.

For symmetric cryptographic functions, it's quite hard to make a honest mistake and still pass the test vectors, owing to the "pseudo-random" nature of these functions. Well, the (non-cryptographic) buffering code could trigger a small buffer overflow yet pass the tests, so maybe this needs special attention.

So, if I were you I'd focus on the test vectors, which I'm not completely happy about : the 512-bit vectors are taken from public sources and/or running openssl, but the 256-bit and 128-bit vectors are home-made from the RFC implementation. Also, there's a "self test" in the RFC that I didn't implement and may be worth adding somewhere.

Copy link
Contributor

@OlivierNicole OlivierNicole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As advised, I did not review each and every line of cryptographic computations, but I:

  1. looked particularly hard to convince myself of the absence of overflows of the buffer;
  2. checked that I obtained the same results for the test vectors on 256-bit and 128-bit with the GNU coreutils implementation of b2sum;
  3. propose a patch adding the BLAKE2b self-test from the RFC at OlivierNicole@117bedb.

runtime/caml/blake2.h Outdated Show resolved Hide resolved
runtime/blake2.c Show resolved Hide resolved
runtime/caml/blake2.h Outdated Show resolved Hide resolved
@xavierleroy
Copy link
Contributor Author

Thanks a million @OlivierNicole for the review and for the "self test", which adds significant confidence!

Your remark about unsigned vs signed integers is well taken. Going back to the RFC code, they use size_t for key sizes, hash sizes, and buffer occupancy; this sound like a good choice, since size_t is unsigned and large enough to hold any data block size. So I changed my code to use size_t as well. Let me know what you think.

@OlivierNicole
Copy link
Contributor

Thank you, the change for size_t looks good to me.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had previously reviewed the interface, but not the crypto details. Together with Olivier's review of the actual code (the hardest part), we now have a full review, maybe slightly more.

I would be keen to see this PR merged in 5.2 as it introduces the Digest. submodules that people can use for forward-compatibility if they want a specific implementation.

Other avenues of extensions have been discussed and could come in further PRs:

  • rolling/incremental hashes
  • a bit more runtime information in each module, for example the id/name of the algorithm used, as suggested by @dbuenzli

stdlib/digest.mli Outdated Show resolved Hide resolved
Changes Outdated
@@ -258,6 +258,9 @@ Working version
Guillaume Munch-Maccagnoni, KC Sivaramakrishnan, Stefan Muenzel,
Xavier Leroy)

- #12307: Add BLAKE2b hashing and an MD5 submodule to the Digest module.
(Xavier Leroy, review by Olivier Nicole)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I suspect that more reviewers deserve a mention: myself, wikku (Wiktor Kuchta), Daniel and, possibly Hannes.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Note that reviewing is one thing and just commenting is another.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @xavierleroy.

Copy link
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I checked that the implementation matches Cryptokit's modulo the nice modernisations and the API change works for me.

Given the effect on the PR's diff, a whitespace-only commit amending the indentation of the modules in stdlib/digest.ml could be direct-pushed after merging?


#ifdef CAML_INTERNALS

#include "mlvalues.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
#include "mlvalues.h"
#include "misc.h"

(just CAMLextern is needed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Done in d2a89ee

stdlib/digest.ml Outdated

module BLAKE2 (X: sig val hash_length : int end) : S = struct

type t = string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't L64-L133 be indented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no good excuse to keep this code un-indented at this point (but it makes editing simpler during development), so I re-indented in 715b727.

stdlib/digest.ml Outdated

(* MD5 hashing *)

module MD5 = struct

type t = string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, it's diff-noisy but this ought to be indented, too, at this final stage? (the GH interface at least makes space change clear, and there's patdiff and the --ignore- family of git-diff options elsewhere)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 715b727 as well.

xavierleroy and others added 2 commits December 18, 2023 15:40
Co-authored-by: David Allsopp <david.allsopp@metastack.com>
@xavierleroy
Copy link
Contributor Author

FTR: if I'm not the one who merges this PR, I'd like it to be squash-merged, as the history is too non-linear.

@dra27 dra27 added the merge-me label Dec 18, 2023
@dra27
Copy link
Member

dra27 commented Dec 18, 2023

I was about to squash, but it occurs to me that it might be nice to keep the git-blame of the test @OlivierNicole wrote intact, unless he's not bothered? It would just be a final rebase squashing all the other commits together into one, but leaving that commit separate.

@xavierleroy
Copy link
Contributor Author

I definitely want @OlivierNicole to be credited, but I though he could just appear as co-author, as the Github "squash-merge" button does by default.

@OlivierNicole
Copy link
Contributor

I'm don't mind either way. Squashing is fine (although I have the impression that it gives me too much credit, given that @xavierleroy did most of the work). But thanks for asking!

@dra27 dra27 merged commit b82413d into ocaml:trunk Dec 18, 2023
10 checks passed
@xavierleroy
Copy link
Contributor Author

Three cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants