Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an integer to integer hash function #1

Closed
Tracked by #19
sjackman opened this issue Dec 21, 2018 · 14 comments
Closed
Tracked by #19

Add an integer to integer hash function #1

sjackman opened this issue Dec 21, 2018 · 14 comments
Labels
help wanted Extra attention is needed new functionality New feature or request
Milestone

Comments

@sjackman
Copy link

sjackman commented Dec 21, 2018

I like the idea of this library very much! Great idea!

Could you please add this invertible integer to integer hash function?
See https://gist.githubusercontent.com/lh3/974ced188be2f90422cc/raw/55fbbb63e489328fd9d1641897954ca997b65951/inthash.c
and https://gist.github.com/lh3/59882d6b96166dfc3d8d

A minor nit. dna4_hash is a bit of a misnomer. Calling something a hash function usually implies that the resulting numbers are roughly distributed uniformly, even when the input data is not distributed uniformly. See https://en.wikipedia.org/wiki/Avalanche_effect. I'd be inclined to call this function dna4_pack64, that is, pack an alphabet of four nucleotides into 64 bits. That integer can then be put through the above integer to integer hash function to produce a uniformly distributed hash value.

When computing minimizers for example (as with minimap2), the result of the current dna4_hash would be a suboptimal choice for a minimizer, since it would preferentially select polyA k-mers. Mixing the bits with Thomas Wang's integer hash function produces good hash values for minimizers.

Cheers!
Shaun

@sjackman sjackman changed the title Add a integer to integer hash function Add an integer to integer hash function [feature request] Dec 21, 2018
@kloetzl kloetzl added the new functionality New feature or request label Dec 21, 2018
@kloetzl
Copy link
Owner

kloetzl commented Dec 21, 2018

I really like the idea of renaming dna4_hash to dna4_pack. Along with this, one could then provide a function dna4_unpack as the reverse operation.

Could you please add this invertible integer to integer hash function?

Such a function is a bit beyond the intended scope of this library. However, if you can convince me that a lot of bioinformatics programs will benefit I might add it.

@sjackman
Copy link
Author

sjackman commented Dec 21, 2018

My thinking is that dna4_hash64 could then be redefined to be the composition of dna_hash_uint64(dna4_pack64(…)), which would then be a good hash function for DNA (for k ≤ 32) with good distribution (and invertible to boot).

@kloetzl
Copy link
Owner

kloetzl commented Dec 21, 2018

Mash uses Murmur for the case you are describing. I am wondering if they chose it over Wang's method for a reason. Also, how does your scheme compare to ntHash? Both performance and randomness-wise? (I may have to do some reading on hash functions.)

@sjackman
Copy link
Author

sjackman commented Dec 21, 2018

ntHash is from my group! =) When calculating the hash value of multiple overlapping k-mers from the same string, ntHash will be much faster, because it's a rolling hash function.

MurmurhHash and CityHash are good hash functions. ABySS 1 (also from my group) uses CityHash for hash tables (used to use MurmurHash, but we found CityHash was faster when we tested years ago), and ABySS 2 uses ntHash for Bloom filters.

Wang's method is great when k ≤ 32 and you want an invertible hash function. If you don't care if it's invertible, you're probably better off with ntHash, CityHash, or MurmurHash.

Minimap2 appears to use Wang's hash function. kh_int_hash_func2
https://github.com/lh3/minimap2/blob/5b2fdfff9c6621ab30c0843319d0328c1f8301e7/khash.h#L400-L410

@kloetzl kloetzl added the help wanted Extra attention is needed label Dec 21, 2018
@kloetzl
Copy link
Owner

kloetzl commented Dec 21, 2018

Since you seem to know more about the topic than me, I would be more than happy to accept a pull request. 😃

@sjackman
Copy link
Author

I've got my plate full with writing up my thesis and graduating. I'll keep it in mind though! I've been wanting a fast vector optimized reverse complement function. I look forward to give this one a go! Thanks again for this work.

@kloetzl
Copy link
Owner

kloetzl commented Dec 21, 2018

I've got my plate full with writing up my thesis and graduating.

Similar situation here. Guess I will just leave this issue open until someone finds the time to do it.

@sjackman
Copy link
Author

I wrote…
https://twitter.com/sjackman/status/1076175172049104897

Can you please confirm for me that minimap2 uses Wang's invertible hash function for the minimizers? https://github.com/lh3/minimap2/blob/5b2fdfff9c6621ab30c0843319d0328c1f8301e7/khash.h#L400-L410
How does minimap2 handle reverse complement? Does it canonicalize the kmers, or does it index both forward and reverse seq of the reference?

@lh3 wrote… https://twitter.com/lh3lh3/status/1076184196085936128

Yes, but not in khash.h. It is here: https://github.com/lh3/minimap2/blob/master/sketch.c#L28-L38 Minimap2 chooses the smaller integer: https://github.com/lh3/minimap2/blob/master/sketch.c#L106-L114

kloetzl added a commit that referenced this issue Dec 21, 2018
Suggested by @sjackman in #1. This leaves the name `hash` for some
actual hashing.
@kloetzl
Copy link
Owner

kloetzl commented Apr 10, 2022

I was about to close this issue as "won't fix" just when I remembered that I recently added a noise function to libdna. It doesn't try to be a proper hash function nor is it invertible. Would this be useful to export?

@lh3
Copy link

lh3 commented Apr 10, 2022

Invertible hash functions are slightly preferred when you put k-mers into a hash table. With a non-invertible hash function, you either have to calculate hashes on the fly when visiting buckets (which takes a bit more time), or need to save the hashes in the table (which takes more space). Invertible hash functions are immune to this problem and is more friendly to 2-level hash tables as are used in minimap2, bfc and yak. As I remember, jellyfish also employs invertible hash functions.

@kloetzl
Copy link
Owner

kloetzl commented Apr 10, 2022

Interesting. I didn't know that invertability could play such a significant role in the construction of hash tables. They are also used more widely that I would have guessed. Seems useful then.

@kloetzl kloetzl mentioned this issue Apr 10, 2022
5 tasks
@kloetzl
Copy link
Owner

kloetzl commented May 22, 2022

I plan to add an integer to integer hash function and its reverse in the next release. Will take a close look at the algorithms linked above.

Proposed API:

uint64_t dna_hash(uint64_t data);
uint64_t dna_hash_revert(uint64_t hash);

@kloetzl kloetzl changed the title Add an integer to integer hash function [feature request] Add an integer to integer hash function May 22, 2022
@kloetzl kloetzl added this to the v1.3 milestone May 22, 2022
@kloetzl
Copy link
Owner

kloetzl commented Sep 11, 2022

Good news, I have combined the work of @lh3, @SquirrelEiserloh, and Thomas Wang to create a new, decent invertable hash function (d3b613f). I also added some basic statistical tests to ensure it is usable as a hash function. As soon as there is documentation v1.3 is ready to go.

@kloetzl
Copy link
Owner

kloetzl commented Oct 30, 2022

Version 1.3 now includes an invertible hash function. Thanks again for the suggestion.

@kloetzl kloetzl closed this as completed Oct 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed new functionality New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants