Skip to content

Bring Uchar hashing on par with other base types like Int, Char, ...#13240

Merged
nojb merged 1 commit into
ocaml:trunkfrom
hyphenrf:hazem/hash-uchar
Jun 17, 2024
Merged

Bring Uchar hashing on par with other base types like Int, Char, ...#13240
nojb merged 1 commit into
ocaml:trunkfrom
hyphenrf:hazem/hash-uchar

Conversation

@hyphenrf

@hyphenrf hyphenrf commented Jun 17, 2024

Copy link
Copy Markdown
Contributor

Executive summary

Previously, Uchar.hash was effectively %identity and there was no seeded_hash.
hash now uses the caml_hash prim like the rest of the base types. Additionally, having seeded_hash allows passing the module as–is to Hashtbl.MakeSeeded so it was added to increase interface uniformity.

Design defails

Having hash be an identity or some simple bit manipulation is fine. It can however have a suboptimal distribution with a regular (easy to maliciously produce!) collision sequence. This is compared to a hashing function. Small ranges like char can have a perfect Char.code–indexed array, in essence an identity hash function, and everything else would be suboptimal in that case. Uchar isn't a small range however.
Consider if you will a hash table of 2n buckets where n is initially small and bucket index is determined by hash mod len, all uchars represented as m×2n would collide and cluster on index 0.
Even when the above, those downsides might be acceptable for the simplicity of implementation. This change offers more uniformity among base types without amassing any real implementation complexity—by making use of a good hashing function with a nice distribution that's already available to us as a primitive.

The change was motivated first by bringing the interface up to the constraints of SeededHashedType, but once that was done, it was only natural to adjust hash next. It is technically a breaking change. But three things to note:

  1. the previous interface made no promises about the kind of value hash returns, just that it's a positive integer associated with the uchar argument.
  2. the hashing algorithm itself changed halfway through OCaml's life (with plans to change it again). Although at the time of the change backwards compatibility was ensured by leaving the old implementation available and ensuring Hashtbl is aware of it, it still would've surely left a massive ripple effect if many users of the default Hashtbl.hash were sensitive to hash values like one might anticipate here. And in our case the backwards compatibility would just be aliasing hash to to_int again in the functor argument.
  3. usage survey on sherlocode & github code search shows that Uchar.t hasn't been used as a key to Hashtbl.Make yet as far as I can tell. I don't imagine the function is critical by any means. There are three known instances where it's a key to a Hashtbl, but in this it uses the polymorphic hash anyway.
    This last point doesn't cover proprietary users, but it's a notable indicator nonetheless. I imagine you're more likely to be using (normalized) strings as keys than you are single scalars for most Unicode needs. Two of the three above seem to have legitimate uses for uchar keys (description lookup table, and a bounded range of glyphs that are known to map 1:1 to scalars).

Implementation details

  • I parametrized the hash function with 1 for meaningful nodes and 1 for total nodes since Uchar.t is immediate (honestly I have no idea why it's 10 and 100 respectively for other immediates. If that's desirable anyway, I don't feel strongly about using the magic numbers). hash params are the same as other modules, I used 0 for the default seed as with Hashtbl.hash.
  • I've also added a test to ensure the function conforms to the stricter specification of its counterparts, namely that it returns the same output as Hashtbl.hash. I chose arbitrary values to test, as with equivalent tests in other modules.
  • While I was in the module I fixed a typo in one of the error messages.

Background info: #11246, #8878, #5225, #9763, #9764, #80 (comment) (the origin of the function).

@gasche

gasche commented Jun 17, 2024

Copy link
Copy Markdown
Member

That sounds very reasonable to me.

One other argument would be to restore the property that Hashtbl.hash u and Uchar.hash u return the same result. (Can you check that this works on one input?)

Regarding the constants that you are passing to the hash_param function, frankly I would rather just cargo-cult by reusing the same parameters as, say, Char.hash. We already use these constants all over the place, I don't think it makes any difference, and mixing two different conventions through the stdlib codebase is going to be more annoying than it is worth.

@gasche

gasche commented Jun 17, 2024

Copy link
Copy Markdown
Member

@dbuenzli is there a particular reason why you used the identity as a hash function?

@hyphenrf

Copy link
Copy Markdown
Contributor Author

One other argument would be to restore the property that Hashtbl.hash u and Uchar.hash u return the same result. (Can you check that this works on one input?)

174e520 checks for precisely this

@hyphenrf

hyphenrf commented Jun 17, 2024

Copy link
Copy Markdown
Contributor Author

frankly I would rather just cargo-cult by reusing the same parameters as, say, Char.hash. [...] mixing two different conventions through the stdlib codebase is going to be more annoying than it is worth.

Fair, I'm convinced. I'll change the magics to 10, 100

@dbuenzli

Copy link
Copy Markdown
Contributor

@dbuenzli is there a particular reason why you used the identity as a hash function?

No. It was suggested by @bobot see #80 (comment). So he may have a particular reason :-)

@gasche

gasche commented Jun 17, 2024

Copy link
Copy Markdown
Member

Approved. I don't think we really need the commit split here, so you could squash them in a single commit. (Or we could do this at merge-time.)

@nojb nojb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Can you please update the Changes entry? We can then merge it.

Comment thread Changes Outdated
@hyphenrf

Copy link
Copy Markdown
Contributor Author

Alright I'll squash, address the review, and add reviewer names, a minute please

- Previously, `Uchar.hash` was effectively `%identity` of the int repr
- testsuite: Add tests to ensure Hashtbl.hash u = Uchar.hash u for some
  arbitrary selection of u, with seeded variants as well.
- error messages: s/an Unicode/a Unicode/
- Add a (breaking) changes entry
@nojb nojb merged commit 9585cfe into ocaml:trunk Jun 17, 2024
@hyphenrf hyphenrf deleted the hazem/hash-uchar branch June 17, 2024 15:11
@shindere

Copy link
Copy Markdown
Contributor

See #13892 for a follow-up to this PR.

shindere added a commit to shindere/ocaml that referenced this pull request Mar 24, 2025
Given that this function is present in the module since it was
introduced, the convension is to not have any @SInCE attribute
at the function level, since the module-level one applies.

This commit reverts the addition of the attribute done in PR ocaml#13240, see
commit 9585cfe.
shindere added a commit to shindere/ocaml that referenced this pull request Mar 31, 2025
Given that this function is present in the module since it was
introduced, the convension is to not have any @SInCE attribute
at the function level, since the module-level one applies.

This commit reverts the addition of the attribute done in PR ocaml#13240, see
commit 9585cfe.
shindere added a commit to shindere/ocaml that referenced this pull request Mar 31, 2025
Given that this function is present in the module since it was
introduced, the convension is to not have any @SInCE attribute
at the function level, since the module-level one applies.

This commit reverts the addition of the attribute done in PR ocaml#13240, see
commit 9585cfe.
clementblaudeau pushed a commit to clementblaudeau/ocaml that referenced this pull request Mar 31, 2025
Given that this function is present in the module since it was
introduced, the convension is to not have any @SInCE attribute
at the function level, since the module-level one applies.

This commit reverts the addition of the attribute done in PR ocaml#13240, see
commit 9585cfe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants