Refactoring and extending the Digest module with better hash functions #12307

xavierleroy · 2023-06-16T15:28:11Z

The Digest standard library module provides functions to compute 128-bit digests (a.k.a. hashes) of strings, files, and other sequences of bytes. It's been used from the very beginning of OCaml to record dependencies over .cmi compiled interfaces in .cmi, .cmo and .cmx compiled files, supporting the infamous "make inconsistent assumptions on" link-time error.

The hashing function used is MD5. Back in 1996 it was the recommended cryptographic hash function, fast and reputedly secure. Today, it is known to be completely insecure, with easy collision attacks and chosen-prefix attacks. For many applications such as checksumming .cmi files, there is no security expectation and this is not a problem. But it can be a problem if users wrongly assume digests to be unforgeable. Also, any use of MD5 raises alarms in a security review, even if it is used for checksumming and not as a cryptographic hash function.

This PR does not (yet) propose to change the existing Digest functions to use another, more secure hashing function. Instead, it lays the ground work to do so.

The first commit introduces a signature Digest.S for hash functions, providing the same functions as Digest, and one implementation of this signature, Digest.MD5, which contains the same MD5 implementation that we have today for Digest, with a guarantee that it will always use MD5 in the future. This way, existing OCaml code that relies on Digest implementing MD5 can start using Digest.MD5 instead of Digest and be assured not to break in the near future.

The second commit adds modules Digest.BLAKE2_512, Digest.BLAKE2_256, and Digest.BLAKE2_128, using the BLAKE2b hash function specialized to 512, 256 or 128-bit hashes. BLAKE2b is an instance of a modern, cryptographically-safe hash function. (I'll explain later why I find it particularly interesting compared to other obvious choices like SHA-256.)

As soon as BLAKE2b is provided as submodules of Digest, we can start experimenting with it, e.g. try to use it for .cmi checksumming instead of MD5.

If everything goes well, in 2-3 major releases from now, we could consider changing the default Digest implementation from MD5 to, for instance, BLAKE2_128. (I have a hunch that most users of Digest don't care about the hashing algorithm used but do depend on hashes being 16-byte long.)

c-cube · 2023-06-16T16:04:17Z

This seems lovely to me, as I looked at Digest in the past but never used it because its interface is so limited (and md5 is indeed quite obsolete now). However I think the interface Digest.S is still a bit restrictive.

It's good to have string and bytes and substring but they're still not enough for the general case. If channels were extensible, input would suffice, of course, but sadly they're not.

I think something like of_read : (bytes -> int -> int -> int) -> t would be very useful, and not just here. If you want to compress then encrypt/hash, that's probably what you're going to get from camlzip and the likes. Having to write to a file, or read entirely into a string, is both slower and more inconvenient than just having a stream like that.

dbuenzli · 2023-06-16T16:05:23Z

stdlib/digest.mli

+
+(** {1 Generic interface} *)
+
+module type S = sig


I think it would be nice to add a val id : string or val : name : string made to hold a lowercased ascii identifier for the digest function. This allows to easily index them by name via first-class modules. See for example here for something similar (for now not suggesting to add the lookup API, only the names).

dbuenzli · 2023-06-16T16:11:46Z

stdlib/digest.ml

+external update: state -> string -> int -> int -> unit = "caml_blake2_update"
+external final: state -> int -> t = "caml_blake2_final"
+external unsafe_string: int -> string -> string -> int -> int -> t
+                      = "caml_blake2_string"


What about tweaking that to a HASH interface without the blake specific ints. And provide Digest.Make that derives S from that. This could be useful for the larger eco-system.

Currently, the implementation is structured like this, with a Make functor parameterized by a low-level hash function, but I haven't exported it yet, and I'm not sure it would really be usable.

stdlib/digest.mli

xavierleroy · 2023-06-18T09:19:51Z

Thanks for the quick feedback.

Re: incremental computation of hashes, I was planning to add a rolling hash API at some point. It's now done in af40ba6. I had to add more C stubs for MD5, which is annoying, though.

Re: of_hex vs from_hex, see 002a780.

testsuite/tests/lib-digest/digests.ml

stdlib/digest.mli

wikku · 2023-06-18T10:56:16Z

stdlib/digest.mli

+  val get_digest : state -> t
+    (** [get_digest st] returns the digest of all the data that has
+        been added to [st].
+        The rolling digest [st] should not be used afterwards. *)


Isn't this due to destructively padding to 128 bytes at the end?

Then this could be made "safe" by padding a copy of the state. A copy function might be useful by itself.

Alternatively, if we want to restrict ourselves to more basic scenarios, various tricks could be used to hint at or enforce this linear discipline. For example with_state : (state -> unit) -> t similar to In_channel.with_open_text etc.

It could be made safe for reuse, but I'm not sure there's a use case. Opinions welcome.

gasche

This PR is interesting, it provides two things that I would like to have:

a path towards a non-MD5 Digest implementation (this PR does only a first step, for compatibility reasons)
a better OCaml-side API for hashes where input can be fed gradually to a hasher state

I broadly agree with the approach and would be in favor of merging eventually. I have not reviewed the Blake2 implementation in details -- I think that someone should check that it corresponds to its published pseudo-code.

stdlib/digest.mli

runtime/blake2.c

stdlib/digest.mli

stdlib/digest.ml

xavierleroy · 2023-10-27T17:10:14Z

Note to reviewers: the documentation needs more work, I'm not happy with the wording in several places.

hannesm · 2023-10-27T17:32:17Z

I do appreciate this work. Thanks a lot.

I only have one question / remark: why is the "incremental" API mutable (is that for performance reasons? if it has been measured, with what usage pattern)?

If it stays mutable, a function val copy : state -> state would help for use cases where you get data, have to compute a hash value, but keep on feeding to produce another digest with more data that was hashed.

xavierleroy · 2023-10-29T10:14:32Z

I see some legitimate questions about the new incremental hash API, and they make me think twice about my API choice.

Originally, this PR did not have an incremental API, it was only about providing multiple implementations of the existing API, so that we can gradually move away from MD5. The incremental API was added following reviewer's comments, and I didn't think hard about the API, just using the low-level, imperative, linear-use API that is typical of C crypto libraries and that I used previously in my Cryptokit library. To me it's an obvious choice, but not to everyone, apparently.

Since PRs are the wrong place to evolve a new API, I decided to remove the incremental API from this PR. Now we're back to the multiple implementations of the old, nonincremental API that I'd like to push in 5.2. I'm asking reviewers to please focus on this API and not ask for more functions until we get something merged.

The original code with the questionable incremental API is kept at https://github.com/xavierleroy/ocaml/tree/blake2b-incremental .

OlivierNicole · 2023-12-04T16:38:59Z

During the last triaging meeting it was agreed that I would review this PR. I was told that more-than-basic cryptographic knowledge wasn’t necessary. Did I understand correctly @gasche?

The longest task will be checking the C implementation with regard to the BLAKE2 RFC. Since said RFC provides a reference implementation, I was wondering why that implementation was not the one used?

xavierleroy · 2023-12-06T15:57:57Z

During the last triaging meeting it was agreed that I would review this PR.

Thanks a lot for your help!

Since said RFC provides a reference implementation, I was wondering why that implementation was not the one used?

I took the BLAKE2b implementation from my Cryptokit library https://github.com/xavierleroy/cryptokit/blob/master/src/blake2.c , but it's essentially the RFC 7693 reference code with a few style changes to match the style of the rest of Cryptokit.

The longest task will be checking the C implementation with regard to the BLAKE2 RFC.

For symmetric cryptographic functions, it's quite hard to make a honest mistake and still pass the test vectors, owing to the "pseudo-random" nature of these functions. Well, the (non-cryptographic) buffering code could trigger a small buffer overflow yet pass the tests, so maybe this needs special attention.

So, if I were you I'd focus on the test vectors, which I'm not completely happy about : the 512-bit vectors are taken from public sources and/or running openssl, but the 256-bit and 128-bit vectors are home-made from the RFC implementation. Also, there's a "self test" in the RFC that I didn't implement and may be worth adding somewhere.

OlivierNicole

As advised, I did not review each and every line of cryptographic computations, but I:

looked particularly hard to convince myself of the absence of overflows of the buffer;
checked that I obtained the same results for the test vectors on 256-bit and 128-bit with the GNU coreutils implementation of b2sum;
propose a patch adding the BLAKE2b self-test from the RFC at OlivierNicole@117bedb.

runtime/caml/blake2.h

runtime/blake2.c

runtime/caml/blake2.h

…gest.MD5 Also: use In_channel.with_open_bin.

`Digest.from_hex` remains available as an alias for `Digest.of_hex`.

"length" makes it clearer that the unit is bytes, not bits.

… size As pointed out during review, an unsigned type makes more sense. RFC 7693 uses size_t throughout; this is safe and nicely consistent.

xavierleroy · 2023-12-14T13:46:15Z

Thanks a million @OlivierNicole for the review and for the "self test", which adds significant confidence!

Your remark about unsigned vs signed integers is well taken. Going back to the RFC code, they use size_t for key sizes, hash sizes, and buffer occupancy; this sound like a good choice, since size_t is unsigned and large enough to hold any data block size. So I changed my code to use size_t as well. Let me know what you think.

OlivierNicole · 2023-12-14T15:44:53Z

Thank you, the change for size_t looks good to me.

gasche

I had previously reviewed the interface, but not the crypto details. Together with Olivier's review of the actual code (the hardest part), we now have a full review, maybe slightly more.

I would be keen to see this PR merged in 5.2 as it introduces the Digest. submodules that people can use for forward-compatibility if they want a specific implementation.

Other avenues of extensions have been discussed and could come in further PRs:

rolling/incremental hashes
a bit more runtime information in each module, for example the id/name of the algorithm used, as suggested by @dbuenzli

stdlib/digest.mli

gasche · 2023-12-14T16:05:49Z

Changes

@@ -258,6 +258,9 @@ Working version
   Guillaume Munch-Maccagnoni, KC Sivaramakrishnan, Stefan Muenzel,
   Xavier Leroy)

+- #12307: Add BLAKE2b hashing and an MD5 submodule to the Digest module.
+  (Xavier Leroy, review by Olivier Nicole)


(I suspect that more reviewers deserve a mention: myself, wikku (Wiktor Kuchta), Daniel and, possibly Hannes.)

Updated. Note that reviewing is one thing and just commenting is another.

I agree with @xavierleroy.

dra27

LGTM. I checked that the implementation matches Cryptokit's modulo the nice modernisations and the API change works for me.

Given the effect on the PR's diff, a whitespace-only commit amending the indentation of the modules in stdlib/digest.ml could be direct-pushed after merging?

dra27 · 2023-12-18T09:33:28Z

runtime/caml/blake2.h

+
+#ifdef CAML_INTERNALS
+
+#include "mlvalues.h"


Nit:

Suggested change

#include "mlvalues.h"

#include "misc.h"

(just CAMLextern is needed)

Makes sense. Done in d2a89ee

dra27 · 2023-12-18T10:02:38Z

stdlib/digest.ml

+
+module BLAKE2 (X: sig val hash_length : int end) : S = struct
+
+type t = string


Shouldn't L64-L133 be indented?

I have no good excuse to keep this code un-indented at this point (but it makes editing simpler during development), so I re-indented in 715b727.

dra27 · 2023-12-18T10:07:44Z

stdlib/digest.ml

+
+(* MD5 hashing *)
+
+module MD5 = struct

 type t = string


Similarly, it's diff-noisy but this ought to be indented, too, at this final stage? (the GH interface at least makes space change clear, and there's patdiff and the --ignore- family of git-diff options elsewhere)

Done in 715b727 as well.

Co-authored-by: David Allsopp <david.allsopp@metastack.com>

xavierleroy · 2023-12-18T15:43:17Z

FTR: if I'm not the one who merges this PR, I'd like it to be squash-merged, as the history is too non-linear.

dra27 · 2023-12-18T17:47:46Z

I was about to squash, but it occurs to me that it might be nice to keep the git-blame of the test @OlivierNicole wrote intact, unless he's not bothered? It would just be a final rebase squashing all the other commits together into one, but leaving that commit separate.

xavierleroy · 2023-12-18T17:57:17Z

I definitely want @OlivierNicole to be credited, but I though he could just appear as co-author, as the Github "squash-merge" button does by default.

OlivierNicole · 2023-12-18T18:14:31Z

I'm don't mind either way. Squashing is fine (although I have the impression that it gives me too much credit, given that @xavierleroy did most of the work). But thanks for asking!

xavierleroy · 2023-12-18T18:16:39Z

Three cheers!

dbuenzli reviewed Jun 16, 2023

View reviewed changes

wikku reviewed Jun 18, 2023

View reviewed changes

gasche reviewed Aug 7, 2023

View reviewed changes

xavierleroy force-pushed the blake2b branch from 002a780 to 497cbce Compare October 27, 2023 15:24

xavierleroy force-pushed the blake2b branch from 75f602d to 5962244 Compare October 29, 2023 10:14

gasche self-assigned this Nov 29, 2023

dra27 added the stdlib label Nov 29, 2023

OlivierNicole approved these changes Dec 13, 2023

View reviewed changes

runtime/caml/blake2.h Outdated Show resolved Hide resolved

runtime/blake2.c Show resolved Hide resolved

runtime/caml/blake2.h Outdated Show resolved Hide resolved

xavierleroy and others added 11 commits December 14, 2023 12:04

Digest: add generic interface Digest.S and specific implementation Di…

d74e68e

…gest.MD5 Also: use In_channel.with_open_bin.

Digest: add BLAKE2b in 512, 256 and 128-bit variants

7e5911b

Use shorter names BLAKEnnn instead of BLAKE2_nnn

86b5967

Use of_hex instead of from_hex in the new API

14252d3

`Digest.from_hex` remains available as an alias for `Digest.of_hex`.

More test vectors

7107b2a

Documentation updates

d41f69a

Renaming: hash_length instead of hash_size

96f5e71

"length" makes it clearer that the unit is bytes, not bits.

Missing root registration + harden against excessively long keys

67cdd9f

Digest: Add BLAKE2b self-test

949c2e5

BLAKE2: use size_t instead of int for key length, hash length, buffer…

6d44ff2

… size As pointed out during review, an unsigned type makes more sense. RFC 7693 uses size_t throughout; this is safe and nicely consistent.

Changes entry for 12307

1e98a68

xavierleroy force-pushed the blake2b branch from 5962244 to 1e98a68 Compare December 14, 2023 13:42

gasche approved these changes Dec 14, 2023

View reviewed changes

BLAKE2: typos

0dfd90e

dra27 approved these changes Dec 18, 2023

View reviewed changes

xavierleroy and others added 2 commits December 18, 2023 15:40

Proper indentation for submodules

715b727

Simplify runtime/caml/blake2.h

d2a89ee

Co-authored-by: David Allsopp <david.allsopp@metastack.com>

dra27 added the merge-me label Dec 18, 2023

dra27 merged commit b82413d into ocaml:trunk Dec 18, 2023
10 checks passed

xavierleroy deleted the blake2b branch December 18, 2023 18:31

This was referenced Dec 18, 2023

Support for OCaml trunk ocsigen/js_of_ocaml#1487

Draft

Fix compilation with ocaml trunk ocaml/ocamlbuild#325

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactoring and extending the Digest module with better hash functions #12307

Refactoring and extending the Digest module with better hash functions #12307

xavierleroy commented Jun 16, 2023

c-cube commented Jun 16, 2023

dbuenzli Jun 16, 2023 •

edited

dbuenzli Jun 16, 2023

xavierleroy Jun 18, 2023

xavierleroy commented Jun 18, 2023

wikku Jun 18, 2023

xavierleroy Oct 27, 2023

gasche left a comment

xavierleroy commented Oct 27, 2023

hannesm commented Oct 27, 2023

xavierleroy commented Oct 29, 2023

OlivierNicole commented Dec 4, 2023

xavierleroy commented Dec 6, 2023

OlivierNicole left a comment

xavierleroy commented Dec 14, 2023

OlivierNicole commented Dec 14, 2023

gasche left a comment

gasche Dec 14, 2023

xavierleroy Dec 15, 2023

hannesm Dec 15, 2023

dra27 left a comment

dra27 Dec 18, 2023

xavierleroy Dec 18, 2023

dra27 Dec 18, 2023

xavierleroy Dec 18, 2023

dra27 Dec 18, 2023

xavierleroy Dec 18, 2023

xavierleroy commented Dec 18, 2023

dra27 commented Dec 18, 2023

xavierleroy commented Dec 18, 2023

OlivierNicole commented Dec 18, 2023

xavierleroy commented Dec 18, 2023


		module BLAKE2 (X: sig val hash_length : int end) : S = struct

		type t = string

Refactoring and extending the Digest module with better hash functions #12307

Refactoring and extending the Digest module with better hash functions #12307

Conversation

xavierleroy commented Jun 16, 2023

c-cube commented Jun 16, 2023

dbuenzli Jun 16, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy commented Jun 18, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gasche left a comment

Choose a reason for hiding this comment

xavierleroy commented Oct 27, 2023

hannesm commented Oct 27, 2023

xavierleroy commented Oct 29, 2023

OlivierNicole commented Dec 4, 2023

xavierleroy commented Dec 6, 2023

OlivierNicole left a comment

Choose a reason for hiding this comment

xavierleroy commented Dec 14, 2023

OlivierNicole commented Dec 14, 2023

gasche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dra27 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy commented Dec 18, 2023

dra27 commented Dec 18, 2023

xavierleroy commented Dec 18, 2023

OlivierNicole commented Dec 18, 2023

xavierleroy commented Dec 18, 2023

dbuenzli Jun 16, 2023 •

edited