Add multi-context seeds #388

Open · wants to merge 13 commits into main
Conversation


@Itolstoganov Itolstoganov commented Jan 16, 2024

This replaces randstrobe hashes with multi-context hashes. Multi-context hashes are defined as follows

((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size))

Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller of the two. The (64 - digest_size)-bit prefix of the multi-context hash, which corresponds to hash1, is denoted the main hash, and the digest_size-bit suffix, which corresponds to hash2, is denoted the auxiliary hash.

The (64 - digest_size)-bit prefix corresponding to the smaller syncmer hash is used for partial search: if a full 64-bit query hash is not found in the index, seeds with the same main hash are looked up instead and added as matches. These partial matches are used for NAM construction in the same way as full matches.
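
For illustration, a partial lookup only needs to compare the upper 64 - digest_size bits of two hashes. A minimal sketch of that comparison (helper names are hypothetical, not the PR's code):

    #include <cstdint>

    // Drop the digest_size low bits that hold the auxiliary hash.
    inline uint64_t main_part(uint64_t multi_context_hash, unsigned digest_size) {
        return (multi_context_hash >> digest_size) << digest_size;
    }

    // Two seeds qualify as a partial match if their main parts agree.
    inline bool same_main_hash(uint64_t query_hash, uint64_t ref_hash, unsigned digest_size) {
        return main_part(query_hash, digest_size) == main_part(ref_hash, digest_size);
    }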

To calculate the reference range of a partial match correctly, we need to know which strobe was used as the main strobe. This information is stored in the m_packed field of RefRandstrobe, replacing one of the bits previously reserved for the reference index.
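
As a rough sketch of how such a flag can live in a packed field (the bit position follows the "9th right-hand bit" item in the change list below; everything else here is a simplified stand-in, not the actual RefRandstrobe layout):

    #include <cstdint>

    constexpr uint32_t FIRST_STROBE_IS_MAIN = uint32_t{1} << 8;  // 9th bit from the right

    // Set or clear the flag without touching the other bits of the packed field.
    inline uint32_t set_main_flag(uint32_t packed, bool first_is_main) {
        return first_is_main ? (packed | FIRST_STROBE_IS_MAIN) : (packed & ~FIRST_STROBE_IS_MAIN);
    }

    inline bool first_strobe_is_main(uint32_t packed) {
        return (packed & FIRST_STROBE_IS_MAIN) != 0;
    }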

List of changes:

  • The randstrobe hash was replaced with the multi-context hash.
  • Partial match search was added to the find_nams function; it is triggered when no full matches are found. Partial matches have a larger abundance threshold than full matches due to higher seed repetitiveness.
  • Partial hit processing added to add_to_hits_per_ref.
  • The partial_find function that looks for the seeds with the same main hash as the query was added to the StrobemerIndex.
  • Some utility functions (for working with the main part of the hash) were added to the StrobemerIndex.
  • The 9th right-hand bit of the m_packed field of RefRandstrobe is now set iff the first strobe was used as the main part of the multi-context seed.
  • Fields storing the start and end positions of the main strobe on the query, as well as an indicator of which strobe is the main one, were added to the QueryRandstrobe class.
  • The --digest parameter was added to set the length of the auxiliary part of the multi-context hash, with a default value of 24.

static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) {
    syncmer_hash_t main_hash = std::min(hash1, hash2);
    syncmer_hash_t aux_hash = std::max(hash1, hash2);
    return ((main_hash >> digest_size) << digest_size) ^ (aux_hash >> (64 - digest_size));
}
Owner

Need to pass information about which strobe (1 or 2) was saved as the base value. Probably here.

Author

Added this information to the last bit of the hash

src/nam.cpp Outdated
@@ -14,12 +15,14 @@ inline void add_to_hits_per_ref(
    int query_start,
    int query_end,
    const StrobemerIndex& index,
-   size_t position
+   size_t position,
+   bool is_partial
) {
    int min_diff = std::numeric_limits<int>::max();
    for (const auto hash = index.get_hash(position); index.get_hash(position) == hash; ++position) {
        int ref_start = index.get_strobe1_position(position);
Owner

If using multi-context seeds, we are not sure whether index.get_strobe1_position(position) or the position of the second strobe is the right one here. This will yield inconsistent NAMs.
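
One way to make the coordinate choice explicit (a sketch with invented names, not necessarily how the PR resolves it): the reference start of a partial hit should come from whichever strobe is the main one.

    #include <cstdint>

    struct RefSeedView {
        uint32_t strobe1_position;
        uint32_t strobe2_position;
        bool first_strobe_is_main;  // the flag stored in the packed field
    };

    inline uint32_t partial_ref_start(const RefSeedView& seed) {
        return seed.first_strobe_is_main ? seed.strobe1_position : seed.strobe2_position;
    }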

Author

Fixed

@marcelm marcelm (Collaborator) left a comment

> This replaces randstrobes with multi-context seeds.

For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.

> Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash.

Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?

> The 64 - digest_size prefix of the multi-context hash corresponding to hash1 is denoted as the main hash, and the digest_size suffix corresponding to hash2 is denoted as auxiliary hash.

This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.

  • main hash
  • auxiliary hash
  • digest, digest size (inconsistency: command-line option is --digest, but what is meant is actually "digest length"). What is a digest anyway?
  • subhash (used in a comment somewhere)
  • multi-context hash
  • digest hash
  • partial hit

I guess that some of this will resolve itself if we switch to asymmetric randstrobe hashes, so it is not super important to fix this at the moment.

src/index.cpp Outdated
@@ -150,6 +150,8 @@ void StrobemerIndex::populate(float f, unsigned n_threads) {
    }
    stats.tot_strobemer_count = total_randstrobes;

+   logger.info() << "Digest parameter is : " << parameters.randstrobe.digest;
Collaborator

Suggested change:
-logger.info() << "Digest parameter is : " << parameters.randstrobe.digest;
+logger.info() << "Digest size: " << parameters.randstrobe.digest;

Author

Replaced "digest" with "aux_len" here and in other places.

@@ -125,7 +125,8 @@ int run_dumpstrobes(int argc, char **argv) {
        l_set ? l : IndexParameters::DEFAULT,
        u_set ? u : IndexParameters::DEFAULT,
        c_set ? c : IndexParameters::DEFAULT,
-       max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT
+       max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT,
+       digest ? digest: IndexParameters::DEFAULT
Collaborator

Suggested change:
-digest ? digest: IndexParameters::DEFAULT
+digest ? digest : IndexParameters::DEFAULT

Author

Fixed

src/index.hpp Outdated
@@ -80,18 +83,66 @@ struct StrobemerIndex {
        return end();
    }

+   //Returns the first entry that matches the first strobe subhash (if using multi-context seeds)
Collaborator

Is using multi-context seeds optional with this PR? If not, then please remove the text in parentheses

Author

It is not optional, but I wanted to leave open the possibility of switching between hash types at some point. However, that does not seem to have any benefits. Fixed the comment.

Comment on lines 40 to 46
Profile{ 50, 70, 18, -4, -2, 1, 24},
Profile{ 75, 90, 20, -4, -3, 2, 24},
Profile{100, 110, 20, -4, -2, 2, 24},
Profile{125, 135, 20, -4, -1, 4, 24},
Profile{150, 175, 20, -4, 1, 7, 24},
Profile{250, 375, 22, -4, 2, 12, 24},
Profile{400, max, 23, -6, 2, 12, 24},
Collaborator

Since it’s always 24, it doesn’t need to be part of the profile.

Author

Removed

src/nam.cpp Outdated
if (diff <= min_diff) {
hits_per_ref[index.reference_index(position)].push_back(Hit{query_start, query_end, ref_start, ref_end});
min_diff = diff;
if (not is_partial) {
Collaborator

Did you notice you were using a digraph?

Suggested change:
-if (not is_partial) {
+if (!is_partial) {

Collaborator

Please split this up into two functions: One for the partial and one for the non-partial case. (A function that has completely different behavior based on a boolean parameter should be split up into two functions.)
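
For illustration, the split could look roughly like this (simplified types and names, not the PR's actual code), with one entry point per case instead of a boolean parameter:

    #include <vector>

    struct Hit { int query_start, query_end, ref_start, ref_end; };

    // Full match: the hit spans both strobes on query and reference.
    void add_full_hit(std::vector<Hit>& hits, int query_start, int query_end, int ref_start, int ref_end) {
        hits.push_back(Hit{query_start, query_end, ref_start, ref_end});
    }

    // Partial match: the hit spans only the main strobe on query and reference.
    void add_partial_hit(std::vector<Hit>& hits, int query_start, int query_end, int main_ref_start, int main_ref_end) {
        hits.push_back(Hit{query_start, query_end, main_ref_start, main_ref_end});
    }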

Author

Good point! Split the function.

@@ -16,6 +16,7 @@

using syncmer_hash_t = uint64_t;
using randstrobe_hash_t = uint64_t;
+using digest_hash_t = uint64_t;
Collaborator

I don’t think this typedef is necessary. digest_hash_t is only used once as the return type of the digest_hash function, but everywhere else, it is still called randstrobe_hash_t. If there was a reason to use a name other than randstrobe_hash_t, we would rename randstrobe_hash_t.

Author

Removed

src/index.hpp Outdated

    bool is_filtered(bucket_index_t position) const {
        return get_hash(position) == get_hash(position + filter_cutoff);
    }

    bool is_partial_filtered(bucket_index_t position) const {
        uint shift = parameters.randstrobe.digest;
Collaborator

Prefer explicitly sized types, that is, don’t use uint, use whichever type the digest parameter has.

Author

Replaced all uint occurrences.

@@ -49,6 +49,13 @@ static inline randstrobe_hash_t randstrobe_hash(syncmer_hash_t hash1, syncmer_hash_t hash2) {
    return hash1 + hash2;
}

+static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) {
Collaborator

Instead of adding a digest_hash function, the randstrobe_hash function should be modified. (Currently, the randstrobe_hash function is unused.)

Author

Fixed

@@ -220,9 +238,11 @@ QueryRandstrobeVector randstrobes_query(const std::string_view seq, const IndexParameters& parameters) {
    RandstrobeIterator randstrobe_fwd_iter{syncmers, parameters.randstrobe};
    while (randstrobe_fwd_iter.has_next()) {
        auto randstrobe = randstrobe_fwd_iter.next();
+       uint partial_start = randstrobe.is_first_main ? randstrobe.strobe1_pos : randstrobe.strobe2_pos;
Collaborator

uint


private:
    SyncmerIterator syncmer_iterator;
    const unsigned w_min;
    const unsigned w_max;
    const uint64_t q;
    const unsigned int max_dist;
    uint digest;
Collaborator

uint

@Itolstoganov (Author)

> For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.

Sure, edited the PR description.

> Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash.
> Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?

Yes, the only reason for selecting the minimal hash is to keep the hash symmetric.
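
A quick self-contained check of that property (aux_len stands in for digest_size; this is an illustration, not the project's code):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    uint64_t multi_context_hash(uint64_t hash1, uint64_t hash2, unsigned aux_len) {
        uint64_t main_hash = std::min(hash1, hash2);
        uint64_t aux_hash = std::max(hash1, hash2);
        return ((main_hash >> aux_len) << aux_len) ^ (aux_hash >> (64 - aux_len));
    }

    int main() {
        // Swapping the two syncmer hashes gives the same multi-context hash.
        assert(multi_context_hash(0x1234abcdULL, 0x9876fedcULL, 24) == multi_context_hash(0x9876fedcULL, 0x1234abcdULL, 24));
    }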

> This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.

I tried to alleviate that by removing all mentions of "digest" (by which I meant the same thing as the "auxiliary" part of the multi-context hash) and "subhash".

src/index.cpp Outdated (resolved)
@ksahlin (Owner) commented Mar 15, 2024

Is this PR ready for larger-scale benchmarking?

@ksahlin (Owner) commented Apr 17, 2024

The randstrobe iterator for queries stops when there are still w_min syncmers left in the read, by checking

    bool has_next() {
        return strobe1_index + w_min < syncmers.size();
    }

This is expected behaviour for our current seeds. For example, w_min=1 for read lengths 50 and 75, and w_min=4 for 100.

However, I think mcs (multi-context seeds) can be boosted further by adding the remaining syncmers at the ends of reads as seeds. This means 2*w_min more seeds per read (fw and rc ends).

I haven't thought carefully about the best way to change this in the code, but perhaps change bool has_next() to return strobe1_index < syncmers.size();, together with adding a case in RandstrobeIterator that returns randstrobes with strobe1.hash as the base/main hash and an auxiliary hash of 0 when i < w_start?

@ksahlin (Owner) commented Apr 17, 2024

I implemented and tested this briefly:

  1. Changed to
    bool has_next() {
        return strobe1_index < syncmers.size();
    }
  2. Added in RandstrobeIterator::get
    if (syncmers.size() < w_start) {
        return Randstrobe{
                randstrobe_hash(strobe1.hash, strobe2.hash, aux_len),
                static_cast<uint32_t>(strobe1.position),
                static_cast<uint32_t>(strobe2.position), true
        };
    }

The results were only very slightly improved (nearly unchanged) on a 'no variation' drosophila genome in SE mapping mode, but they improve substantially for a high-variation simulation (see below for read lengths 100 and 150). This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.

ref,read_length,tool,mode,%-aligned,accuracy,time(sec),Mem(MB)
droso_above10k_variation,100,strobealign-mcs_SE,align,48.815,44.2065,704,1.45,778.4
droso_above10k_variation,100,strobealign-mcs-more-seeds_SE,align,48.9885,44.2975,704,1.55,778.36

droso_above10k_variation,150,strobealign-mcs_SE,align,49.0685,44.9805,839,2.47,781.87
droso_above10k_variation,150,strobealign-mcs-more-seeds_SE,align,49.258,45.1925,839,2.8,787.6

Needs to be tested on larger genome(s), obviously.

@marcelm (Collaborator) commented Apr 30, 2024

> This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.

I am a bit worried about this. Doesn’t this fit the pattern that we often see worse variant detection rates for "optimized" mapping parameters? Shall I perhaps run the parameter optimization on data with higher variation?

@ksahlin (Owner) commented Apr 30, 2024

> Shall I perhaps run the parameter optimization on data with higher variation?

Sounds like a good suggestion to me! From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Below were the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

rule mason_simulate_variants:
    input:  ref = config["HG38"]
    output: sim_vcf =  config["ROOT_OUT"] + "/reads/{dataset}/variations.vcf",
    run:
        if wildcards.dataset == "SIM1":
            shell("mason_variator -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM2":
            shell("mason_variator --sv-indel-rate 0.000001 --snp-rate 0.001 --small-indel-rate 0.00001 --max-small-indel-size 20  -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM3":
            shell("mason_variator --sv-indel-rate 0.000005 --snp-rate 0.001 --small-indel-rate 0.0001 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM4":
            shell("mason_variator --sv-indel-rate 0.00001 --snp-rate 0.005 --small-indel-rate 0.0005 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")


@marcelm (Collaborator) commented May 2, 2024

> From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Sorry if you got the wrong impression, but since I don’t store the index on disk, everything has to be recomputed. I’ll run this on a cluster somewhere to get the results faster this time.

> Below were the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

Good, I’ll use that and would like to call it SIM5; is that ok?

@ksahlin (Owner) commented May 4, 2024

> Good, I’ll use that and would like to call it SIM5; is that ok?

Yes, sounds good!
