Bring Uchar hashing on par with other base types like Int, Char, ... #13240
|
That sounds very reasonable to me. One other argument would be to restore the property that Regarding the constants that you are passing to the hash_param function, frankly I would rather just cargo-cult by reusing the same parameters as, say, Char.hash. We already use these constants all over the place, I don't think it makes any difference, and mixing two different conventions through the stdlib codebase is going to be more annoying than it is worth. |
|
@dbuenzli is there a particular reason why you used the identity as a hash function? |
174e520 checks for precisely this |
Fair, I'm convinced. I'll change the magics to 10, 100 |
No. It was suggested by @bobot see #80 (comment). So he may have a particular reason :-) |
|
Approved. I don't think we really need the commit split here, so you could squash them in a single commit. (Or we could do this at merge-time.) |
nojb
left a comment
LGTM
Can you please update the Changes entry? We can then merge it.
|
Alright I'll squash, address the review, and add reviewer names, a minute please |
304d817 to 17b2639
- Previously, `Uchar.hash` was effectively `%identity` of the int repr
- testsuite: Add tests to ensure `Hashtbl.hash u = Uchar.hash u` for some arbitrary selection of `u`, with seeded variants as well.
- error messages: s/an Unicode/a Unicode/
- Add a (breaking) changes entry
17b2639 to cd8fc66
|
See #13892 for a follow-up to this PR. |
Given that this function has been present in the module since it was introduced, the convention is to not have any @since attribute at the function level, since the module-level one applies. This commit reverts the addition of the attribute done in PR ocaml#13240, see commit 9585cfe.
Executive summary
Previously, `Uchar.hash` was effectively `%identity` and there was no `seeded_hash`. `hash` now uses the `caml_hash` prim like the rest of the base types. Additionally, having `seeded_hash` allows passing the module as-is to `Hashtbl.MakeSeeded`, so it was added to increase interface uniformity.
Design details
Having `hash` be an identity or some simple bit manipulation is fine. It can, however, have a suboptimal distribution with a regular (easy to maliciously produce!) collision sequence, compared to a proper hashing function. Small ranges like `char` can have a perfect `Char.code`-indexed array, in essence an identity hash function, and everything else would be suboptimal in that case. `Uchar` isn't a small range, however.

Consider, if you will, a hash table of 2^n buckets where n is initially small and the bucket index is determined by `hash mod len`: all uchars representable as m×2^n would collide and cluster on index 0.

Even given the above, those downsides might be acceptable for the simplicity of implementation. This change offers more uniformity among base types without amassing any real implementation complexity, by making use of a good hashing function with a nice distribution that's already available to us as a primitive.
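The clustering argument can be sketched in a few lines. This is only an illustration under stated assumptions: the `bucket` helper, the table size of 16 and the choice of keys are hypothetical and are not taken from the `Hashtbl` implementation.

```ocaml
(* Hypothetical bucket selection, as described in the text: hash mod len. *)
let bucket ~len h = h mod len

(* Assumed table size for illustration: 2^4 = 16 buckets. *)
let len = 16

(* Keys of the form m * 2^n for m = 1..8 (U+0010, U+0020, ..., U+0080). *)
let keys = List.init 8 (fun m -> Uchar.of_int ((m + 1) * len))

(* Identity hash (the old behaviour): every key maps to bucket 0. *)
let identity_buckets =
  List.map (fun u -> bucket ~len (Uchar.to_int u)) keys

(* Generic polymorphic hash: bucket indices for the same keys. *)
let hashed_buckets =
  List.map (fun u -> bucket ~len (Hashtbl.hash u)) keys
```

Printing `identity_buckets` and `hashed_buckets` side by side shows the difference in spread for this particular key pattern.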
The change was motivated first by bringing the interface up to the constraints of `SeededHashedType`, but once that was done, it was only natural to adjust `hash` next. It is technically a breaking change, but three things to note:

- [...] `hash` returns, just that it's a positive integer associated with the uchar argument.
- [...] `Hashtbl` is aware of it, it still would've surely left a massive ripple effect if many users of the default `Hashtbl.hash` were sensitive to hash values like one might anticipate here. And in our case the backwards compatibility would just be aliasing `hash` to `to_int` again in the functor argument.
- `Uchar.t` hasn't been used as a key to `Hashtbl.Make` yet as far as I can tell. I don't imagine the function is critical by any means. There are three known instances where it's a key to a `Hashtbl`, but in these it uses the polymorphic hash anyway.

This last point doesn't cover proprietary users, but it's a notable indicator nonetheless. I imagine you're more likely to be using (normalized) strings as keys than you are single scalars for most Unicode needs. Two of the three above seem to have legitimate uses for uchar keys (description lookup table, and a bounded range of glyphs that are known to map 1:1 to scalars).
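As a sketch of the `SeededHashedType` motivation, the following shows a uchar-keyed table built with `Hashtbl.MakeSeeded`. `Uchar_key` is a hypothetical stand-in module so that the example does not depend on a compiler that already ships `Uchar.seeded_hash`; with this change applied, passing `Uchar` itself should play the same role. The description table is illustrative only.

```ocaml
(* Hypothetical stand-in for the Uchar module after this change. *)
module Uchar_key = struct
  type t = Uchar.t
  let equal = Uchar.equal
  (* Same parameters as the generic Hashtbl.seeded_hash. *)
  let seeded_hash seed (u : t) = Hashtbl.seeded_hash_param 10 100 seed u
  (* Assumption for portability: older compilers name this field [hash]
     in SeededHashedType, so both names are provided. *)
  let hash = seeded_hash
end

module Tbl = Hashtbl.MakeSeeded (Uchar_key)

(* A small description lookup table keyed by scalar values. *)
let names : string Tbl.t = Tbl.create 16

let () =
  Tbl.replace names (Uchar.of_int 0x03BB) "GREEK SMALL LETTER LAMDA";
  Tbl.replace names (Uchar.of_int 0x1F42B) "BACTRIAN CAMEL"

let lookup u = Tbl.find_opt names u
```

A table built this way hashes keys through the module's own seeded function rather than the polymorphic hash.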
Implementation details
- The hash params are the same as in other modules. I originally used `1` for meaningful nodes and `1` for total nodes since `Uchar.t` is immediate (honestly I have no idea why it's `10` and `100` respectively for other immediates); I don't feel strongly about it, so per the review this now uses the magic numbers `10` and `100`. `0` is used for the default seed, as with `Hashtbl.hash`.
- The tests check that the results agree with `Hashtbl.hash`. I chose arbitrary values to test, as with equivalent tests in other modules.

Background info: #11246, #8878, #5225, #9763, #9764, #80 (comment) (the origin of the function).
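The property the tests aim for can be sketched without the patched `Uchar` module. Here `uchar_hash` and `uchar_seeded_hash` are hypothetical stand-ins written with the public `Hashtbl.seeded_hash_param`, using the parameters `10` and `100` and the default seed `0`. The sample values and the seed `42` are arbitrary and are not the ones in the test suite.

```ocaml
(* Hypothetical stand-ins for Uchar.hash and Uchar.seeded_hash. *)
let uchar_hash (u : Uchar.t) = Hashtbl.seeded_hash_param 10 100 0 u
let uchar_seeded_hash seed (u : Uchar.t) =
  Hashtbl.seeded_hash_param 10 100 seed u

(* Arbitrary scalar values, including both ends of the range. *)
let samples =
  List.map Uchar.of_int [ 0x00; 0x41; 0x3BB; 0xFFFD; 0x1F42B; 0x10FFFF ]

(* The stand-ins should agree with the polymorphic hash functions. *)
let agrees =
  List.for_all
    (fun u ->
      uchar_hash u = Hashtbl.hash u
      && uchar_seeded_hash 42 u = Hashtbl.seeded_hash 42 u)
    samples
```

Matching the polymorphic hash means a table keyed by `Uchar.t` places keys in the same buckets whether it goes through the functor or through the generic `Hashtbl` interface.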