Larger k and generic alphabets in SSHash #35

adamant-pwn · 2023-12-07T10:08:10Z

I'm currently working on adopting SSHash in MetaGraph, as we want to see if we can improve our performance using it as an internal structure for indexing and traversing DBGs.

As a part of the process, we want to generalize SSHash in a way that allows larger values of $k$ and potentially different alphabets. My general idea here is to rewrite the code base, so that the methods that previously relied on some properties of k-mers (the alphabet used, methods to convert between k-mers/individual characters and their bit representation, etc) would now be implemented as members of the kmer_t class, rather than utils:: functions, as they mostly are now.

So, I plan to write a simple base kmer_t class with virtual functions that provides contracts for how such methods should look like, and then we will inherit from this base class in MetaGraph to provide implementations for our kmer types. Then, all the methods that currently rely on kmer_t would take a template argument instead, which should be an implementation k-mer class derived from the base class described above. I think this way we can ensure that Metagraph and SSHash stay reasonably separated.

I started working in this direction in our fork of SSHash (9624612), but before going further I wanted to get in touch and collect some feedback on your side. That being said, I have a couple of questions:

Would it be of interest for SSHash to merge these changes upstream in the future?
Do you have any advice on how this should be approached? Does the outlined approach make sense to you?
It seemed to me that most of the stuff that heavily relies on the alphabet being ACGT is currently encapsulated in util.hpp, but there are also some methods in e.g. dictionary.cpp. Are there any other places where you rely on the alphabet being ACGT directly, rather than implicitly via calls to util:: or dictionary methods?

If merging it upstream in the future is of interest to you, I will try to take your opinion into consideration while working in the fork, so it'd be easier to merge afterwards. If it's not a priority for you, no problem, we can keep working on the fork, but still will highly appreciate any insight and advice you might have for these changes.

The text was updated successfully, but these errors were encountered:

jermp · 2023-12-07T12:10:11Z

Hi @adamant-pwn,
and thank you for reaching out! The generalization to arbitrary alphabets would be very interesting.

Would it be of interest for SSHash to merge these changes upstream in the future?

Yeah, why not? :) My focus would still be on indexing DNA, hence I would prefer the code to be optimized for this specific usecase rather than being general. Right now, there aren't specific optimizations when dealing with 2-bit alphabets.

Do you have any advice on how this should be approached? Does the outlined approach make sense to you?

It does make sense to me. In my experiments, I haven't seen much difference in query time when using k=31 and k=63 because I'm relying on built-in data types (uint64_t and uint128_t respectively). I suspect a user-defined struct with 64-bit words to be fairly slower than this approach, but it is needed to use k > 63.

I expect someone already implemented a generic kmer class in C++ that does so, but I never searched on GitHub for one.
Perhaps, you could first see if there is a good one out there.

It seemed to me that most of the stuff that heavily relies on the alphabet being ACGT is currently encapsulated in util.hpp, but there are also some methods in e.g. dictionary.cpp. Are there any other places where you rely on the alphabet being ACGT directly, rather than implicitly via calls to util:: or dictionary methods?

It is mostly in util.hpp but also in other place where, essentially, we process 1 symbol at a time and thus read 2 bits from a bitvector. See for example include/buckets.hpp and include/bit_vector_iterator.hpp. So making SSHash general would need to carefully review the entire codebase.

jermp added the question Further information is requested label Dec 7, 2023

adamant-pwn mentioned this issue Jan 16, 2024

Support larger alphabets and k via generic kmer_t #39

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Larger k and generic alphabets in SSHash #35

Larger k and generic alphabets in SSHash #35

adamant-pwn commented Dec 7, 2023 •

edited

Loading

jermp commented Dec 7, 2023

Larger k and generic alphabets in SSHash #35

Larger k and generic alphabets in SSHash #35

Comments

adamant-pwn commented Dec 7, 2023 • edited Loading

jermp commented Dec 7, 2023

adamant-pwn commented Dec 7, 2023 •

edited

Loading