having a sensible hash function in batInt, batString, etc. would make many functor applications easier (hashtbl and weak tables, for instance). Also, it can help make hashing specific to types:
let hash (i : int) = i land max_int
let hash (s : string) = (* function that hashes a string, or just Hashtbl.hash *)
(* ... *)
let hash (b : Big_int.big_int) = String.hash (Big_int.string_of_big_int b)
A very useful thing to have, also, is hash combinators (example) for lists, arrays, enums, similarly to printing functions.
Re. different hashing functions: Hashing is a subtle and difficult topic. People want speed, low number of conflicts, and (sometimes) conflict-attack security, and it's not possible to have all of them at once. Unless we have hashing experts that are really interested in the matter and can do something significantly better than just using Hashtbl.hash, I think we're better off following the standard library.
Re. hashing combinators for polymorphic data structures; that seems to be a reasonable idea, as it helps people that use specific hashing functions on their datatype of choice lift those to Batteries-provided datatypes. I am not quite sure which interface we would need for this and how it should work -- probably by exposing the "mixing" logic of the hash function that allows to add further data to be taken into account, but how resilient is that to changes in the hash function's design? The compiler libraries don't expose those primitives so we would need to add new external runtime dependencies for these.
I'm thinking of an interface such as, approximatively:
module type HashData = sig
val make : initial -> t
val mix_int32 : t -> int32 -> t
val mix_int64 : t -> int64 -> t
val mix_intnat : t -> nativeint -> t
val mix_float : t -> float -> t
val mix_string : t -> string -> t
val finalize : t -> final
The standard library would expose its hashing function with initial = int option (for using a seed or not) and final = unit (AFAIK the stdlib hash currently doesn't require any additional data/seeding to perform its final mix).
initial = int option
final = unit
I like the idea of having a "partial hash type", it's very useful for many hash functions anyway. I think the "Hashable" interface should be something like
module type Hashable = sig
val mix : HashData.t -> t -> HashData.t
val hash : t -> int
module HashData : sig
val mix_int : t -> int -> t
val mix_int32 : t -> int32 -> t
(* ..... *)
val make : int -> t
val finalize : t -> int
or something more general like what you said. The function Hashable.hash is a shortcut for creating a HashData.t, mixing t in, and finalizing it.
First, let me cancel "HashData" as a name for this interface. It's terrible and I think HashImpl would be more appropriate: it exposes the particular of a hashing function. Or PartialHash.
Then I'll note that I feel like we're opening a whole can of worms with this HashImpl design. If the goal is simply to let users say "I have a BatSplayTree of my own datatype HighlyOptimizedClauses.t wish has this domain-specific hash function", we should be able to compute a hash using this domain-specific hash function without telling the world about how we also know how to mix doubles and strings. We could just require the domain-specific hash function to return an integer (or a union type of the types we know how to mix, but isn't that premature complexity?), and mix that with the compiler-provided hash function.
The alternate view I was having above and you expose explicitly in your answer is to see the user's domain-specific hashable type not as "can be hashed to an integer" but "can be mixed into a partial hash value". Then we have to let the user know how to mix, and ask the user to do the mixing in our stead. This can give better hashing properties, but at the cost of an important increase in complexity.
Note that none of the above requires Batteries code to be parametrized over the hash functions it itself uses its own datastructure: the elements are hashed as the user says, but the final result is still combined with the standard hashing function. We can instead try to functorize everything over a module with the HashImpl signature, allowing people to also change the mixing function used to combine list element values, etc. But that's yet more complex -- and possibly of doubtful utility, as there has been little interest in changing the hash functions so far, only pressure to use a balanced tree of buckets to avoid collision attacks.
Finally, we didn't consider how to limit the depth on which the structure is traversed to collect a hash -- hash_param in the standard library.
I agree that using modules and abstract types make things very complicated. Again, a distinction has to be made between cryptographic/secure hashes (which require careful design), and a "simple" default hash function such as sdbm or dbj2 which could be used to compute hashes for Hashtbl, not for computing checksums or hashing passwords (cryptokit would be better).
Then, as you said, hash combinators would allow both the simple, default element hash function, or a user-provided one to be used. In this case, containers (e.g. List) could expose:
val hash_mix : (int -> 'a -> int) -> int -> 'a list -> int
val hash : (int -> 'a -> int) -> 'a list -> int
for respectively combine into an unfinalized hash, or compute a whole hash of the list.
I forgot to add that for some structures, a user-defined hash function is mandatory because the regular hash function isn't compatible with the equality. For instance, two Set/Map instances may have distinct internal structures but be structurally equal, and the correct way of hashing them is using a prefix traversal.
That's right. When replying to your first post, I considered this problem and wrongly thought that the two equivalent structures would fall into the same bucket, and thus be handled as collisions by running the user-provided (and equivalence-aware) equality function on them, but that doesn't work if they don't actually have the same hash.
For the name, why not HashItf or HashIntf since it is an interface?
Or Hashable even
So, a BatHash module with
type mix_int = int -> int -> int
type 'a t = mix_int -> 'a -> int
val mix1 : mix_int
val mix2 : mix_int
and other modules implementing the t BatHash.t, for instance
module List : sig
val hash : 'a BatHash.t -> 'a list BatHash.t
would satisfy everyone here? The same module could also include later a distinct, more complicated type for cryptographic hashes.
I think that I would like it better if we could keep the internal type of the hash function abstract. Not only would that help when programming, it would also make us able to support hash functions that really need 32 or 64 bits to be defined.
@thelema I agree. If required, a module could include several hash functions, that use distinct mixing functions, but overall simplicity beats genericity in this case. Hashing is mostly useful for hashtables, so it needs to be simple and easy to use (t -> int then).
t -> int