Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

hash functions in modules #492

Open
c-cube opened this Issue · 14 comments

4 participants

@c-cube
Owner

having a sensible hash function in batInt, batString, etc. would make many functor applications easier (hashtbl and weak tables, for instance). Also, it can help make hashing specific to types:

let hash (i : int) = i land max_int
let hash (s : string) = (* function that hashes a string, or just Hashtbl.hash *)

(* ... *)

let hash (b : Big_int.big_int) = String.hash (Big_int.string_of_big_int b)

A very useful thing to have, also, is hash combinators (example) for lists, arrays, enums, similarly to printing functions.

@gasche
Owner

Re. different hashing functions: Hashing is a subtle and difficult topic. People want speed, low number of conflicts, and (sometimes) conflict-attack security, and it's not possible to have all of them at once. Unless we have hashing experts that are really interested in the matter and can do something significantly better than just using Hashtbl.hash, I think we're better off following the standard library.

Re. hashing combinators for polymorphic data structures; that seems to be a reasonable idea, as it helps people that use specific hashing functions on their datatype of choice lift those to Batteries-provided datatypes. I am not quite sure which interface we would need for this and how it should work -- probably by exposing the "mixing" logic of the hash function that allows to add further data to be taken into account, but how resilient is that to changes in the hash function's design? The compiler libraries don't expose those primitives so we would need to add new external runtime dependencies for these.

I'm thinking of an interface such as, approximatively:

module type HashData = sig
  type t
  type initial
  type final
  val make : initial -> t
  val mix_int32 : t -> int32 -> t
  val mix_int64 : t -> int64 -> t
  val mix_intnat : t -> nativeint -> t
  val mix_float : t -> float -> t
  val mix_string : t -> string -> t
  val finalize : t -> final
end

The standard library would expose its hashing function with initial = int option (for using a seed or not) and final = unit (AFAIK the stdlib hash currently doesn't require any additional data/seeding to perform its final mix).

@c-cube
Owner

I like the idea of having a "partial hash type", it's very useful for many hash functions anyway. I think the "Hashable" interface should be something like

module type Hashable = sig
  type t
  val mix : HashData.t -> t -> HashData.t
  val hash : t -> int

and

module HashData : sig
  type t
  val mix_int : t -> int -> t
  val mix_int32 : t -> int32 -> t
  (* ..... *)

  val make : int -> t
  val finalize : t -> int
end

or something more general like what you said. The function Hashable.hash is a shortcut for creating a HashData.t, mixing t in, and finalizing it.

@gasche
Owner

First, let me cancel "HashData" as a name for this interface. It's terrible and I think HashImpl would be more appropriate: it exposes the particular of a hashing function. Or PartialHash.

Then I'll note that I feel like we're opening a whole can of worms with this HashImpl design. If the goal is simply to let users say "I have a BatSplayTree of my own datatype HighlyOptimizedClauses.t wish has this domain-specific hash function", we should be able to compute a hash using this domain-specific hash function without telling the world about how we also know how to mix doubles and strings. We could just require the domain-specific hash function to return an integer (or a union type of the types we know how to mix, but isn't that premature complexity?), and mix that with the compiler-provided hash function.

The alternate view I was having above and you expose explicitly in your answer is to see the user's domain-specific hashable type not as "can be hashed to an integer" but "can be mixed into a partial hash value". Then we have to let the user know how to mix, and ask the user to do the mixing in our stead. This can give better hashing properties, but at the cost of an important increase in complexity.

Note that none of the above requires Batteries code to be parametrized over the hash functions it itself uses its own datastructure: the elements are hashed as the user says, but the final result is still combined with the standard hashing function. We can instead try to functorize everything over a module with the HashImpl signature, allowing people to also change the mixing function used to combine list element values, etc. But that's yet more complex -- and possibly of doubtful utility, as there has been little interest in changing the hash functions so far, only pressure to use a balanced tree of buckets to avoid collision attacks.

Finally, we didn't consider how to limit the depth on which the structure is traversed to collect a hash -- hash_param in the standard library.

@c-cube
Owner

I agree that using modules and abstract types make things very complicated. Again, a distinction has to be made between cryptographic/secure hashes (which require careful design), and a "simple" default hash function such as sdbm or dbj2 which could be used to compute hashes for Hashtbl, not for computing checksums or hashing passwords (cryptokit would be better).

Then, as you said, hash combinators would allow both the simple, default element hash function, or a user-provided one to be used. In this case, containers (e.g. List) could expose:

val hash_mix : (int -> 'a -> int) -> int -> 'a list -> int
val hash : (int -> 'a -> int) -> 'a list -> int

for respectively combine into an unfinalized hash, or compute a whole hash of the list.

@c-cube
Owner

I forgot to add that for some structures, a user-defined hash function is mandatory because the regular hash function isn't compatible with the equality. For instance, two Set/Map instances may have distinct internal structures but be structurally equal, and the correct way of hashing them is using a prefix traversal.

@gasche
Owner

That's right. When replying to your first post, I considered this problem and wrongly thought that the two equivalent structures would fall into the same bucket, and thus be handled as collisions by running the user-provided (and equivalence-aware) equality function on them, but that doesn't work if they don't actually have the same hash.

@UnixJunkie
Collaborator

For the name, why not HashItf or HashIntf since it is an interface?

@UnixJunkie
Collaborator

Or Hashable even

@thelema
Owner
@c-cube
Owner

So, a BatHash module with

type mix_int = int -> int -> int

type 'a t = mix_int -> 'a -> int

val mix1 : mix_int
val mix2 : mix_int
...                      

and other modules implementing the t BatHash.t, for instance

module List : sig
  ...                   
    val hash : 'a BatHash.t -> 'a list BatHash.t
 end

would satisfy everyone here? The same module could also include later a distinct, more complicated type for cryptographic hashes.

@thelema
Owner
@gasche
Owner

I think that I would like it better if we could keep the internal type of the hash function abstract. Not only would that help when programming, it would also make us able to support hash functions that really need 32 or 64 bits to be defined.

@thelema
Owner
@c-cube
Owner

@thelema I agree. If required, a module could include several hash functions, that use distinct mixing functions, but overall simplicity beats genericity in this case. Hashing is mostly useful for hashtables, so it needs to be simple and easy to use (t -> int then).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.