Add an integer to integer hash function #1

sjackman · 2018-12-21T16:24:36Z

I like the idea of this library very much! Great idea!

Could you please add this invertible integer to integer hash function?
See https://gist.githubusercontent.com/lh3/974ced188be2f90422cc/raw/55fbbb63e489328fd9d1641897954ca997b65951/inthash.c
and https://gist.github.com/lh3/59882d6b96166dfc3d8d

A minor nit. dna4_hash is a bit of a misnomer. Calling something a hash function usually implies that the resulting numbers are roughly distributed uniformly, even when the input data is not distributed uniformly. See https://en.wikipedia.org/wiki/Avalanche_effect. I'd be inclined to call this function dna4_pack64, that is, pack an alphabet of four nucleotides into 64 bits. That integer can then be put through the above integer to integer hash function to produce a uniformly distributed hash value.
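A minimal sketch of what such a packing step could look like, assuming a 2-bit code of A=0, C=1, G=2, T=3. Both `base_code` and `pack64_sketch` are hypothetical names, and the library's actual encoding and signature may differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 2-bit code per base, assuming A=0, C=1, G=2, T=3. */
static uint64_t base_code(char c) {
    switch (c) {
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return 0; /* 'A'; this sketch does not validate its input */
    }
}

/* Pack the first k bases of seq (k <= 32) into one integer.
 * The first base ends up in the most significant bits that are used. */
static uint64_t pack64_sketch(const char *seq, size_t k) {
    uint64_t packed = 0;
    for (size_t i = 0; i < k; i++)
        packed = (packed << 2) | base_code(seq[i]);
    return packed;
}
```

Under this encoding a polyA k-mer packs to 0, the smallest possible value, so the packed integers inherit whatever skew the input has.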

When computing minimizers for example (as with minimap2), the result of the current dna4_hash would be a suboptimal choice for a minimizer, since it would preferentially select polyA k-mers. Mixing the bits with Thomas Wang's integer hash function produces good hash values for minimizers.
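The bit mixing described above can be sketched as follows. This is modeled on the `hash_64` / `hash_64i` pair from the gist linked earlier, without the mask parameter. The names `mix64` and `unmix64` are hypothetical, and the shift amounts and inverse constants are reproduced here as an assumption that should be checked against the gist:

```c
#include <stdint.h>

/* Forward mix: every step is a bijection on 64-bit integers. */
static uint64_t mix64(uint64_t key) {
    key = ~key + (key << 21);               /* key * (2^21 - 1) - 1 */
    key ^= key >> 24;
    key = key + (key << 3) + (key << 8);    /* key * 265 */
    key ^= key >> 14;
    key = key + (key << 2) + (key << 4);    /* key * 21 */
    key ^= key >> 28;
    key += key << 31;                       /* key * (2^31 + 1) */
    return key;
}

/* Inverse mix: undo the steps of mix64 in reverse order. */
static uint64_t unmix64(uint64_t key) {
    uint64_t tmp;

    tmp = key - (key << 31);                /* undo key += key << 31 */
    key = key - (tmp << 31);

    tmp = key ^ (key >> 28);                /* undo key ^= key >> 28 */
    key = key ^ (tmp >> 28);

    key *= 14933078535860113213ull;         /* inverse of 21 modulo 2^64 */

    tmp = key ^ (key >> 14);                /* undo key ^= key >> 14 */
    tmp = key ^ (tmp >> 14);
    tmp = key ^ (tmp >> 14);
    key = key ^ (tmp >> 14);

    key *= 15244667743933553977ull;         /* inverse of 265 modulo 2^64 */

    tmp = key ^ (key >> 24);                /* undo key ^= key >> 24 */
    key = key ^ (tmp >> 24);

    tmp = ~key;                             /* undo key = ~key + (key << 21) */
    tmp = ~(key - (tmp << 21));
    tmp = ~(key - (tmp << 21));
    key = ~(key - (tmp << 21));
    return key;
}
```

Since the packed value of a polyA k-mer is 0 under an A=0 encoding, it is worth noting that `mix64(0)` is nonzero, so polyA no longer sorts first automatically.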

Cheers!
Shaun


kloetzl · 2018-12-21T17:29:24Z

I really like the idea of renaming dna4_hash to dna4_pack. Along with this, one could then provide a function dna4_unpack as the reverse operation.

> Could you please add this invertible integer to integer hash function?

Such a function is a bit beyond the intended scope of this library. However, if you can convince me that a lot of bioinformatics programs will benefit I might add it.

sjackman · 2018-12-21T17:34:21Z

My thinking is that dna4_hash64 could then be redefined to be the composition of dna_hash_uint64(dna4_pack64(…)), which would then be a good hash function for DNA (for k ≤ 32) with good distribution (and invertible to boot).

kloetzl · 2018-12-21T17:44:03Z

Mash uses Murmur for the case you are describing. I am wondering if they chose it over Wang's method for a reason. Also, how does your scheme compare to ntHash, both in performance and in randomness? (I may have to do some reading on hash functions.)

sjackman · 2018-12-21T17:56:57Z

ntHash is from my group! =) When calculating the hash value of multiple overlapping k-mers from the same string, ntHash will be much faster, because it's a rolling hash function.
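To illustrate why a rolling hash is faster on overlapping k-mers, here is a minimal cyclic-polynomial sketch in the spirit of ntHash. It is not ntHash itself: the per-base seeds are arbitrary illustrative constants, and `rol`, `seed`, `cyclic_hash`, and `cyclic_roll` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left by r bits (r taken modulo 64). */
static uint64_t rol(uint64_t x, unsigned r) {
    r &= 63;
    return r ? (x << r) | (x >> (64 - r)) : x;
}

/* Arbitrary illustrative per-base seeds (NOT claimed to be ntHash's). */
static uint64_t seed(char c) {
    switch (c) {
    case 'A': return 0x9E3779B97F4A7C15ull;
    case 'C': return 0xC2B2AE3D27D4EB4Full;
    case 'G': return 0x165667B19E3779F9ull;
    default:  return 0x27D4EB2F165667C5ull; /* 'T' and anything else */
    }
}

/* Hash of s[0..k-1] from scratch: XOR of rol(seed(s[i]), k-1-i), O(k). */
static uint64_t cyclic_hash(const char *s, unsigned k) {
    uint64_t h = 0;
    for (unsigned i = 0; i < k; i++)
        h = rol(h, 1) ^ seed(s[i]);
    return h;
}

/* Slide the window by one base in O(1): drop `out`, append `in`. */
static uint64_t cyclic_roll(uint64_t h, unsigned k, char out, char in) {
    return rol(h, 1) ^ rol(seed(out), k) ^ seed(in);
}
```

Hashing every k-mer of a sequence of length n this way costs one `cyclic_hash` call plus n - k constant-time `cyclic_roll` calls, instead of O(k) work per k-mer.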

MurmurHash and CityHash are good hash functions. ABySS 1 (also from my group) uses CityHash for hash tables (it used to use MurmurHash, but we found CityHash was faster when we tested years ago), and ABySS 2 uses ntHash for Bloom filters.

Wang's method is great when k ≤ 32 and you want an invertible hash function. If you don't care if it's invertible, you're probably better off with ntHash, CityHash, or MurmurHash.

Minimap2 appears to use Wang's hash function, `kh_int_hash_func2`:
https://github.com/lh3/minimap2/blob/5b2fdfff9c6621ab30c0843319d0328c1f8301e7/khash.h#L400-L410

kloetzl · 2018-12-21T18:20:35Z

Since you seem to know more about the topic than me, I would be more than happy to accept a pull request. 😃

sjackman · 2018-12-21T18:24:59Z

I've got my plate full with writing up my thesis and graduating. I'll keep it in mind though! I've been wanting a fast, vector-optimized reverse complement function. I look forward to giving this one a go! Thanks again for this work.

kloetzl · 2018-12-21T18:29:22Z

> I've got my plate full with writing up my thesis and graduating.

Similar situation here. Guess I will just leave this issue open until someone finds the time to do it.

sjackman · 2018-12-21T18:41:14Z

I wrote…
https://twitter.com/sjackman/status/1076175172049104897

Can you please confirm for me that minimap2 uses Wang's invertible hash function for the minimizers? https://github.com/lh3/minimap2/blob/5b2fdfff9c6621ab30c0843319d0328c1f8301e7/khash.h#L400-L410
How does minimap2 handle reverse complement? Does it canonicalize the kmers, or does it index both forward and reverse seq of the reference?

@lh3 wrote… https://twitter.com/lh3lh3/status/1076184196085936128

Yes, but not in khash.h. It is here: https://github.com/lh3/minimap2/blob/master/sketch.c#L28-L38 Minimap2 chooses the smaller integer: https://github.com/lh3/minimap2/blob/master/sketch.c#L106-L114
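"Chooses the smaller integer" can be read as picking whichever of the packed forward k-mer and its reverse complement is numerically smaller. A minimal sketch of that idea, assuming the 2-bit encoding A=0, C=1, G=2, T=3 (so the complement of a code `c` is `3 - c`). `revcomp2bit` and `canonical_kmer` are hypothetical names, and this is not minimap2's actual code:

```c
#include <stdint.h>

/* Reverse complement of a packed k-mer (2 bits per base, k <= 32),
 * assuming A=0, C=1, G=2, T=3 so that complement(c) == 3 - c. */
static uint64_t revcomp2bit(uint64_t kmer, unsigned k) {
    uint64_t rc = 0;
    for (unsigned i = 0; i < k; i++) {
        rc = (rc << 2) | (3 - (kmer & 3)); /* last base becomes first */
        kmer >>= 2;
    }
    return rc;
}

/* Strand-independent representative: the smaller of the two integers. */
static uint64_t canonical_kmer(uint64_t kmer, unsigned k) {
    uint64_t rc = revcomp2bit(kmer, k);
    return kmer < rc ? kmer : rc;
}
```

With this choice a k-mer and its reverse complement map to the same integer, so only one of the two strands needs to be indexed.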


kloetzl · 2022-04-10T16:04:13Z

I was about to close this issue as "won't fix" just when I remembered that I recently added a noise function to libdna. It doesn't try to be a proper hash function nor is it invertible. Would this be useful to export?

lh3 · 2022-04-10T16:44:05Z

Invertible hash functions are slightly preferred when you put k-mers into a hash table. With a non-invertible hash function, you either have to calculate hashes on the fly when visiting buckets (which takes a bit more time), or need to save the hashes in the table (which takes more space). Invertible hash functions are immune to this problem and are more friendly to 2-level hash tables, as used in minimap2, bfc and yak. As I remember, jellyfish also employs invertible hash functions.
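The space argument can be sketched with a small open-addressing set that stores only the hash and recovers the k-mer through the inverse when needed. Everything here is an illustrative assumption: `toy_hash` is a deliberately simple invertible mixer (not the function from the gist), and `struct kmer_set` is a hypothetical layout, not the one used by minimap2, bfc, yak, or jellyfish.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MIX_MUL 0x9E3779B97F4A7C15ull /* any odd multiplier is invertible mod 2^64 */
#define TABLE_SIZE 16u                /* power of two, so masking selects a slot */

/* Toy invertible mixer: an odd multiply followed by an xor-shift. */
static uint64_t toy_hash(uint64_t x) {
    x *= MIX_MUL;
    x ^= x >> 32;
    return x;
}

/* Inverse of an odd multiplier modulo 2^64 via Newton's iteration. */
static uint64_t mul_inverse(uint64_t c) {
    uint64_t inv = c;               /* correct to 3 bits for any odd c */
    for (int i = 0; i < 5; i++)     /* each round doubles the correct bits */
        inv *= 2 - c * inv;
    return inv;
}

static uint64_t toy_unhash(uint64_t h) {
    h ^= h >> 32;                   /* x ^= x >> 32 is its own inverse */
    return h * mul_inverse(MIX_MUL);
}

struct kmer_set {
    uint64_t hash[TABLE_SIZE];      /* only hashes are stored, no key array */
    bool used[TABLE_SIZE];
};

/* Insert with linear probing; returns false if the table is full. */
static bool kmer_set_insert(struct kmer_set *s, uint64_t kmer) {
    uint64_t h = toy_hash(kmer);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        size_t i = (h + probe) & (TABLE_SIZE - 1);
        if (s->used[i] && s->hash[i] == h)
            return true;            /* equal hashes imply equal k-mers */
        if (!s->used[i]) {
            s->used[i] = true;
            s->hash[i] = h;
            return true;
        }
    }
    return false;
}

/* Recover the k-mer stored in slot i (caller checks s->used[i]). */
static uint64_t kmer_set_key_at(const struct kmer_set *s, size_t i) {
    return toy_unhash(s->hash[i]);
}
```

Because the mixer is a bijection, comparing stored hashes is an exact equality test on the k-mers, and the table needs neither a second array for keys nor a rehash when walking buckets.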

kloetzl · 2022-04-10T17:29:23Z

Interesting. I didn't know that invertibility could play such a significant role in the construction of hash tables. Invertible hash functions are also used more widely than I would have guessed. Seems useful then.

kloetzl · 2022-05-22T19:53:42Z

I plan to add an integer to integer hash function and its inverse in the next release. I will take a close look at the algorithms linked above.

Proposed API:

uint64_t dna_hash(uint64_t data);
uint64_t dna_hash_revert(uint64_t hash);

kloetzl · 2022-09-11T17:51:16Z

Good news: I have combined the work of @lh3, @SquirrelEiserloh, and Thomas Wang to create a new, decent invertible hash function (d3b613f). I also added some basic statistical tests to ensure it is usable as a hash function. As soon as there is documentation, v1.3 is ready to go.

kloetzl · 2022-10-30T15:22:18Z

Version 1.3 now includes an invertible hash function. Thanks again for the suggestion.

sjackman changed the title ~~Add a integer to integer hash function~~ Add an integer to integer hash function [feature request] Dec 21, 2018

kloetzl added the new functionality New feature or request label Dec 21, 2018

kloetzl added the help wanted Extra attention is needed label Dec 21, 2018

kloetzl added a commit that referenced this issue Dec 21, 2018

rename hash operation to pack

15101c4

Suggested by @sjackman in #1. This leaves the name `hash` for some actual hashing.

kloetzl mentioned this issue Apr 10, 2022

release v1.3 #19

Closed

kloetzl changed the title ~~Add an integer to integer hash function [feature request]~~ Add an integer to integer hash function May 22, 2022

kloetzl added this to the v1.3 milestone May 22, 2022

kloetzl closed this as completed Oct 30, 2022
