Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Larger k and generic alphabets in SSHash #35

Open
adamant-pwn opened this issue Dec 7, 2023 · 1 comment
Open

Larger k and generic alphabets in SSHash #35

adamant-pwn opened this issue Dec 7, 2023 · 1 comment
Labels
question Further information is requested

Comments

@adamant-pwn
Copy link
Contributor

adamant-pwn commented Dec 7, 2023

Hi @jermp!

I'm currently working on adopting SSHash in MetaGraph, as we want to see if we can improve our performance using it as an internal structure for indexing and traversing DBGs.

As a part of the process, we want to generalize SSHash in a way that allows larger values of $k$ and potentially different alphabets. My general idea here is to rewrite the code base, so that the methods that previously relied on some properties of k-mers (the alphabet used, methods to convert between k-mers/individual characters and their bit representation, etc) would now be implemented as members of the kmer_t class, rather than utils:: functions, as they mostly are now.

So, I plan to write a simple base kmer_t class with virtual functions that provides contracts for how such methods should look like, and then we will inherit from this base class in MetaGraph to provide implementations for our kmer types. Then, all the methods that currently rely on kmer_t would take a template argument instead, which should be an implementation k-mer class derived from the base class described above. I think this way we can ensure that Metagraph and SSHash stay reasonably separated.

I started working in this direction in our fork of SSHash (9624612), but before going further I wanted to get in touch and collect some feedback on your side. That being said, I have a couple of questions:

  1. Would it be of interest for SSHash to merge these changes upstream in the future?
  2. Do you have any advice on how this should be approached? Does the outlined approach make sense to you?
  3. It seemed to me that most of the stuff that heavily relies on the alphabet being ACGT is currently encapsulated in util.hpp, but there are also some methods in e.g. dictionary.cpp. Are there any other places where you rely on the alphabet being ACGT directly, rather than implicitly via calls to util:: or dictionary methods?

If merging it upstream in the future is of interest to you, I will try to take your opinion into consideration while working in the fork, so it'd be easier to merge afterwards. If it's not a priority for you, no problem, we can keep working on the fork, but still will highly appreciate any insight and advice you might have for these changes.

@jermp jermp added the question Further information is requested label Dec 7, 2023
@jermp
Copy link
Owner

jermp commented Dec 7, 2023

Hi @adamant-pwn,
and thank you for reaching out! The generalization to arbitrary alphabets would be very interesting.

Would it be of interest for SSHash to merge these changes upstream in the future?

Yeah, why not? :) My focus would still be on indexing DNA, hence I would prefer the code to be optimized for this specific usecase rather than being general. Right now, there aren't specific optimizations when dealing with 2-bit alphabets.

Do you have any advice on how this should be approached? Does the outlined approach make sense to you?

It does make sense to me. In my experiments, I haven't seen much difference in query time when using k=31 and k=63 because I'm relying on built-in data types (uint64_t and uint128_t respectively). I suspect a user-defined struct with 64-bit words to be fairly slower than this approach, but it is needed to use k > 63.

I expect someone already implemented a generic kmer class in C++ that does so, but I never searched on GitHub for one.
Perhaps, you could first see if there is a good one out there.

It seemed to me that most of the stuff that heavily relies on the alphabet being ACGT is currently encapsulated in util.hpp, but there are also some methods in e.g. dictionary.cpp. Are there any other places where you rely on the alphabet being ACGT directly, rather than implicitly via calls to util:: or dictionary methods?

It is mostly in util.hpp but also in other place where, essentially, we process 1 symbol at a time and thus read 2 bits from a bitvector. See for example include/buckets.hpp and include/bit_vector_iterator.hpp. So making SSHash general would need to carefully review the entire codebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants