Multi-context seeds plus fixes and optimized parameters #426
f4ec683 to 7ae48f2
Increases the number of seeds per read by 2*w_min seeds
…he -b parameter value
… try modifying this at a later stage
…l hits may benefit from larger syncmers reverting to only change shortest read lengths before starting benchmarks - will wait for proper parameter optimization instead"
Several full seeds can have the same partial base seed (identical query and ref coordinates). Such partial seeds got added to the same NAM and thus increased the score (through incrementing n_hits several times).
…we did not sort on seed length, they could still be added if there was a full hit with the same base hash value. This commit also sorts on seed length and checks whether we have already added a full hit with the same base hash value.
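The last two commit messages describe partial seeds being counted more than once. A minimal sketch of that idea follows, assuming a hypothetical `Hit` record and a hypothetical helper `drop_redundant_partial_hits`; neither name comes from the branch, and the actual code works on NAMs rather than on a flat list of hits. Hits are sorted so that entries with the same base hash and coordinates are adjacent, longest first, and a partial hit is skipped when a hit with the same base hash and coordinates has already been kept.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical hit record; field names are illustrative only.
struct Hit {
    uint64_t base_hash;  // main hash shared by a full seed and its partial seed
    int query_start;
    int ref_start;
    int length;          // assumption: full seeds are longer than partial seeds
    bool is_partial;
};

// Keeps every full hit. Keeps a partial hit only if no hit with the same
// base hash and the same start coordinates has been kept before it.
std::vector<Hit> drop_redundant_partial_hits(std::vector<Hit> hits) {
    // Sort by (hash, coordinates), then by decreasing length so that
    // full hits come before partial hits within each group.
    std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
        return std::make_tuple(a.base_hash, a.query_start, a.ref_start, -a.length)
             < std::make_tuple(b.base_hash, b.query_start, b.ref_start, -b.length);
    });
    std::vector<Hit> kept;
    for (const Hit& h : hits) {
        const bool same_as_last = !kept.empty()
            && kept.back().base_hash == h.base_hash
            && kept.back().query_start == h.query_start
            && kept.back().ref_start == h.ref_start;
        if (h.is_partial && same_as_last) {
            continue;  // already represented; do not count it again
        }
        kept.push_back(h);
    }
    return kept;
}
```

With two full hits and two identical partial hits sharing one base hash and position, only the two full hits survive; a partial hit with a different base hash is kept.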
Here are a couple of comments, mostly directed at myself.
@@ -80,18 +82,66 @@ struct StrobemerIndex {
        return end();
    }

    //Returns the first entry that matches the main hash
    size_t partial_find(randstrobe_hash_t key) const {
Code duplication: find is a special case of partial_find with aux_len=0.
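To illustrate the point about duplication, here is a minimal sketch in which both lookups delegate to one shared function. `ToyIndex` and `find_masked` are hypothetical names, the index is reduced to a sorted vector of hashes, and the sketch assumes the auxiliary hash occupies the lowest `aux_len` bits; the real `StrobemerIndex` is organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using randstrobe_hash_t = uint64_t;

// Hypothetical, simplified stand-in for StrobemerIndex.
struct ToyIndex {
    std::vector<randstrobe_hash_t> hashes;  // must be sorted ascending
    unsigned aux_len;                       // assumed: low bits hold the auxiliary hash (< 64)

    size_t end() const { return hashes.size(); }

    // Shared lookup: the lowest `ignored_bits` bits do not take part in the comparison.
    // Returns the position of the first matching entry, or end() if there is none.
    size_t find_masked(randstrobe_hash_t key, unsigned ignored_bits) const {
        const randstrobe_hash_t mask = ~randstrobe_hash_t(0) << ignored_bits;
        const randstrobe_hash_t masked_key = key & mask;
        // Masking low bits preserves the sort order, so binary search is still valid.
        auto it = std::lower_bound(
            hashes.begin(), hashes.end(), masked_key,
            [mask](randstrobe_hash_t h, randstrobe_hash_t k) { return (h & mask) < k; });
        if (it == hashes.end() || (*it & mask) != masked_key) {
            return end();
        }
        return static_cast<size_t>(it - hashes.begin());
    }

    // Full lookup: every bit must match, i.e. zero ignored bits.
    size_t find(randstrobe_hash_t key) const { return find_masked(key, 0); }

    // Partial lookup: only the main hash (the bits above aux_len) must match.
    size_t partial_find(randstrobe_hash_t key) const { return find_masked(key, aux_len); }
};
```

In this form `find` is literally the `aux_len=0` case, so any fix to the search logic only needs to be made once.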
7ae48f2 to 1702c10
Commit 21599d9 "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the output for rescuable.43 changes:
-rescuable.43 83 NC_001422.1 3137 60 120S14=1X4=1X9=1X4=1X4=1X9=1X4=1X4=1X4=1X9=1X19=1X4=1X4=1X4=1X4=1X4=1X4=1X9=1X4=1X9=1X4=1X9=1X4=1X4=1X4=1X = 2955 -362 (sequence) (qual) NM:i:25 AS:i:120 RG:Z:1
+rescuable.43 69 NC_001422.1 2955 0 * = 2955 0 (sequence) (qual) RG:Z:1
By adding get_aux_len() to StrobemerIndex and using that.
Reduces code duplication a little bit.
65240de to 2c90913
Randstrobes that have no downstream partner at least w_min syncmers away get their second hash set to 0, which means in the case of multi-context seeds that the primary/main hash is also zero (because it is the smaller of the two). When this is done both for the reference and for queries, we get spurious hits for all randstrobes towards the ends of queries (they get mapped to the end of the reference). Using the hash of the primary syncmer also as hash for the second syncmer gets rid of the problem.
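The effect described in this commit message can be reproduced with a small sketch. `SyncmerPair` and `main_hash` are hypothetical names, and the main hash is modelled simply as the smaller of the two syncmer hashes, as the message states; the real hash construction in the branch has more to it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical record for a randstrobe's two syncmers.
struct SyncmerPair {
    uint64_t first_hash;
    uint64_t second_hash;  // only meaningful if has_partner is true
    bool has_partner;      // false near the end of a sequence
};

// Main hash of a multi-context seed, modelled as the smaller syncmer hash.
// With fallback_to_first == false, a missing partner contributes hash 0
// (the old behaviour); with true, the first syncmer's hash is reused.
uint64_t main_hash(const SyncmerPair& p, bool fallback_to_first) {
    uint64_t second = p.second_hash;
    if (!p.has_partner) {
        second = fallback_to_first ? p.first_hash : 0;
    }
    return std::min(p.first_hash, second);
}
```

Under the old behaviour every partnerless randstrobe gets main hash 0, so query ends match reference ends regardless of sequence. With the fallback, the main hash still depends on the first syncmer.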
I looked into this. The problem is that this heuristic check now returns false:
Line 482 in 3a97f6b
This happens because I think the easiest fix for now is to pass
Yes, sounds good to reconsider the heuristic or make it independent of k.
I realized only now that #388 is incomplete and that @ksahlin’s updates were made in a separate branch. We need a place to review this mcs-optimized-parameters branch and somewhere with a "Merge" button to press, so this is a separate PR that supersedes #388.
I’m making quite a few changes to this branch (squashing commits, changing commit messages etc.). If anyone wants the original branch without my modifications, it is available as mcs-optimized-parameters-backup.
Original parameter optimization was done with the commit that has the description "Fix so that partial rescue hits are added properly". The commit hash has changed due to history rewriting.
To Do
- rescuable.43 is no longer rescuable
- const unsigned int aux_len = 24; in find_nams()