Add an integer to integer hash function #1
Comments
I really like the idea of renaming
Such a function is a bit beyond the intended scope of this library. However, if you can convince me that a lot of bioinformatics programs will benefit I might add it.
My thinking is that |
Mash uses Murmur for the case you are describing. I am wondering if they chose it over Wang's method for a reason. Also, how does your scheme compare to ntHash? Both performance and randomness-wise? (I may have to do some reading on hash functions.)
ntHash is from my group! =) When calculating the hash values of multiple overlapping k-mers from the same string, ntHash will be much faster, because it's a rolling hash function. MurmurHash and CityHash are good hash functions. ABySS 1 (also from my group) uses CityHash for hash tables (it used to use MurmurHash, but we found CityHash was faster when we tested years ago), and ABySS 2 uses ntHash for Bloom filters. Wang's method is great when k ≤ 32 and you want an invertible hash function. If you don't care whether it's invertible, you're probably better off with ntHash, CityHash, or MurmurHash. Minimap2 appears to use Wang's hash function: kh_int_hash_func2
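For readers who want to see what "invertible" means here, the following is a minimal sketch of Thomas Wang's 64-bit integer mix and its inverse, along the lines of the gist linked in the issue but with the k-mer mask dropped so that it works on the full 64-bit range. The names wang_hash64, wang_hash64_inverse and wang_self_check are made up for this illustration and are not part of any library mentioned in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Forward mix: every step is a bijection on uint64_t, so the whole
 * function is one too. */
static uint64_t wang_hash64(uint64_t key)
{
    key = ~key + (key << 21);             /* key * (2^21 - 1) - 1 */
    key ^= key >> 24;
    key = key + (key << 3) + (key << 8);  /* key * 265 */
    key ^= key >> 14;
    key = key + (key << 2) + (key << 4);  /* key * 21 */
    key ^= key >> 28;
    key += key << 31;
    return key;
}

/* Inverse: undo the steps above in reverse order. */
static uint64_t wang_hash64_inverse(uint64_t key)
{
    uint64_t tmp;

    tmp = key - (key << 31);              /* undo key += key << 31 */
    key = key - (tmp << 31);

    tmp = key ^ (key >> 28);              /* undo key ^= key >> 28 */
    key = key ^ (tmp >> 28);

    key *= 14933078535860113213ULL;       /* inverse of 21 mod 2^64 */

    tmp = key ^ (key >> 14);              /* undo key ^= key >> 14 */
    tmp = key ^ (tmp >> 14);
    tmp = key ^ (tmp >> 14);
    key = key ^ (tmp >> 14);

    key *= 15244667743933553977ULL;       /* inverse of 265 mod 2^64 */

    tmp = key ^ (key >> 24);              /* undo key ^= key >> 24 */
    key = key ^ (tmp >> 24);

    tmp = ~key;                           /* undo key = ~key + (key << 21) */
    tmp = ~(key - (tmp << 21));
    tmp = ~(key - (tmp << 21));
    key = ~(key - (tmp << 21));
    return key;
}

/* Round-trip check on one packed value. */
static void wang_self_check(void)
{
    uint64_t x = 0x0123456789ABCDEFULL;
    assert(wang_hash64_inverse(wang_hash64(x)) == x);
}
```

Because a k-mer with k ≤ 32 fits in 64 bits at 2 bits per base, a bijection like this never collides on distinct k-mers, which is what makes it attractive for the k ≤ 32 case.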
Since you seem to know more about the topic than me, I would be more than happy to accept a pull request. 😃
I've got my plate full with writing up my thesis and graduating. I'll keep it in mind though! I've been wanting a fast vector-optimized reverse complement function. I look forward to giving this one a go! Thanks again for this work.
Similar situation here. Guess I will just leave this issue open until someone finds the time to do it.
I wrote…
@lh3 wrote… https://twitter.com/lh3lh3/status/1076184196085936128
I was about to close this issue as "won't fix" just when I remembered that I recently added a noise function to libdna. It doesn't try to be a proper hash function nor is it invertible. Would this be useful to export?
Invertible hash functions are slightly preferred when you put k-mers into a hash table. With a non-invertible hash function, you either have to calculate hashes on the fly when visiting buckets (which takes a bit more time), or need to save the hashes in the table (which takes more space). Invertible hash functions are immune to this problem and are more friendly to 2-level hash tables such as those used in minimap2, bfc and yak. As I remember, jellyfish also employs invertible hash functions.
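The point about hash tables can be sketched roughly as follows: when the hash is a bijection, the table can store only the hashed value, use it both for bucket selection and for equality tests, and still recover the original k-mer on demand. This is a toy open-addressing set written for illustration. The mixer is a simple stand-in (xorshift, odd multiply, xorshift) and is not the function used by minimap2, bfc, yak or jellyfish; all names (mix, unmix, kmer_set and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MUL 0x9E3779B97F4A7C15ULL  /* any odd constant is invertible mod 2^64 */
#define CAP 16                     /* toy capacity, must be a power of two */

/* Multiplicative inverse of an odd number mod 2^64 by Newton iteration. */
static uint64_t mul_inverse(uint64_t a)
{
    uint64_t inv = a;              /* a * a == 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 5; i++)
        inv *= 2 - a * inv;        /* each round doubles the correct bits */
    return inv;
}

/* Stand-in invertible mixer; x ^= x >> 32 is its own inverse. */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 32; x *= MUL; x ^= x >> 32;
    return x;
}

static uint64_t unmix(uint64_t h)
{
    h ^= h >> 32; h *= mul_inverse(MUL); h ^= h >> 32;
    return h;
}

typedef struct {
    uint64_t hash[CAP];            /* stores mix(key), never the key itself */
    unsigned char used[CAP];
} kmer_set;

/* Returns 1 if inserted, 0 if already present, -1 if the table is full. */
static int kmer_set_insert(kmer_set *s, uint64_t key)
{
    uint64_t h = mix(key);
    for (size_t n = 0, i = h & (CAP - 1); n < CAP; n++, i = (i + 1) & (CAP - 1)) {
        if (!s->used[i]) { s->hash[i] = h; s->used[i] = 1; return 1; }
        if (s->hash[i] == h) return 0;   /* equal hashes imply equal keys */
    }
    return -1;
}

static int kmer_set_contains(const kmer_set *s, uint64_t key)
{
    uint64_t h = mix(key);
    for (size_t n = 0, i = h & (CAP - 1); n < CAP; n++, i = (i + 1) & (CAP - 1)) {
        if (!s->used[i]) return 0;
        if (s->hash[i] == h) return 1;
    }
    return 0;
}

/* Recover the original key of an occupied slot without storing it. */
static uint64_t kmer_set_key_at(const kmer_set *s, size_t i)
{
    return unmix(s->hash[i]);
}
```

With a non-invertible hash the slot would need to hold the key (and either recompute the hash when probing or store it alongside), which is the time or space cost described above.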
Interesting. I didn't know that invertibility could play such a significant role in the construction of hash tables. They are also used more widely than I would have guessed. Seems useful then.
I plan to add an integer to integer hash function and its inverse in the next release. Will take a close look at the algorithms linked above. Proposed API:
uint64_t dna_hash(uint64_t data);
uint64_t dna_hash_revert(uint64_t hash);
Good news, I have combined the work of @lh3, @SquirrelEiserloh, and Thomas Wang to create a new, decent invertible hash function (d3b613f). I also added some basic statistical tests to ensure it is usable as a hash function. As soon as there is documentation, v1.3 is ready to go.
Version 1.3 now includes an invertible hash function. Thanks again for the suggestion.
I like the idea of this library very much! Great idea!
Could you please add this invertible integer to integer hash function?
See https://gist.githubusercontent.com/lh3/974ced188be2f90422cc/raw/55fbbb63e489328fd9d1641897954ca997b65951/inthash.c
and https://gist.github.com/lh3/59882d6b96166dfc3d8d
A minor nit: dna4_hash is a bit of a misnomer. Calling something a hash function usually implies that the resulting numbers are roughly uniformly distributed, even when the input data is not. See https://en.wikipedia.org/wiki/Avalanche_effect. I'd be inclined to call this function dna4_pack64, that is, pack an alphabet of four nucleotides into 64 bits. That integer can then be put through the above integer to integer hash function to produce a uniformly distributed hash value.

When computing minimizers, for example (as with minimap2), the result of the current dna4_hash would be a suboptimal choice for a minimizer, since it would preferentially select polyA k-mers. Mixing the bits with Thomas Wang's integer hash function produces good hash values for minimizers.

Cheers!
Shaun
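The polyA concern can be illustrated with a small sketch. It assumes the common 2-bit encoding A=0, C=1, G=2, T=3, which may not match the library's actual packing, and the names base_code, pack2bit and min_kmer_pos are hypothetical. With this encoding a run of A packs to 0, so picking the smallest raw packed value always favours polyA k-mers; ordering by a mixed value (here the forward half of Wang's 64-bit hash) removes that systematic bias.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed encoding: A=0, C=1, G=2, T=3 (anything else is treated as A). */
static uint64_t base_code(char c)
{
    switch (c) {
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return 0;
    }
}

/* Pack k <= 32 bases into a 64-bit integer, 2 bits per base. */
static uint64_t pack2bit(const char *s, size_t k)
{
    uint64_t x = 0;
    for (size_t i = 0; i < k; i++)
        x = (x << 2) | base_code(s[i]);
    return x;
}

/* Forward half of Thomas Wang's 64-bit integer hash. */
static uint64_t wang_hash64(uint64_t key)
{
    key = ~key + (key << 21);
    key ^= key >> 24;
    key = key + (key << 3) + (key << 8);
    key ^= key >> 14;
    key = key + (key << 2) + (key << 4);
    key ^= key >> 28;
    key += key << 31;
    return key;
}

/* Position of the k-mer with the smallest order value in seq.
 * use_hash == 0 orders by the raw packed value, otherwise by its hash.
 * Ties go to the leftmost position. */
static size_t min_kmer_pos(const char *seq, size_t k, int use_hash)
{
    size_t n = strlen(seq), best = 0;
    uint64_t best_val = UINT64_MAX;
    for (size_t i = 0; i + k <= n; i++) {
        uint64_t v = pack2bit(seq + i, k);
        if (use_hash)
            v = wang_hash64(v);
        if (v < best_val) { best_val = v; best = i; }
    }
    return best;
}
```

For example, min_kmer_pos("GATTACAAAAAGCT", 4, 0) returns the position of the first AAAA, because AAAA packs to 0. A real minimizer implementation would slide a window along the sequence and update the packed value incrementally instead of repacking each k-mer.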