Multi-context seeds plus fixes and optimized parameters #426

Open
wants to merge 27 commits into
base: main
from

Conversation

marcelm
Collaborator

@marcelm marcelm commented May 22, 2024

I realized only now that #388 is incomplete and that @ksahlin’s updates were made in a separate branch. We need a place to review this mcs-optimized-parameters branch and somewhere with a "Merge" button to press, so this is a separate PR that supersedes #388.

I’m making quite a few changes to this branch (squashing commits, changing commit messages, etc.). If anyone wants the original branch without my modifications, it is available as mcs-optimized-parameters-backup.

The original parameter optimization was done with the commit that has the description "Fix so that partial rescue hits are added properly". The commit hash changed due to history rewriting.

To Do

  • Changelog entry
  • Document the concept (take from Add multi-context seeds #388)
  • Document the data structures
  • Document terminology: main, partial, auxiliary, ...
  • Fix tests
    • rescuable.43 is no longer rescuable
  • Get rid of hardcoded const unsigned int aux_len = 24; in find_nams()

@marcelm marcelm force-pushed the mcs-optimized-parameters branch 2 times, most recently from f4ec683 to 7ae48f2, May 22, 2024 12:07
Itolstoganov and others added 10 commits May 22, 2024 14:58
Increases the number of seeds per read by 2*w_min seeds
…l hits may benefit from larger syncmers

reverting to only change shortest read lengths before starting benchmarks - will wait for proper parameter optimization instead"
several full seeds can have the same partial base seed (identical query
and ref coordinates. Such partial seeds got added to the same NAM and
thus increased the score (through incrementing n_hits several times)
…we did not sort on seed length, they could still be added if there was a full hit with the same base hash value. This commit sorts also on seed length and check if we have already added a full hit with the same base hash value
Collaborator Author

@marcelm marcelm left a comment

Here are a couple of comments, mostly directed at myself.

src/arguments.hpp (outdated)
src/nam.cpp (outdated)
src/nam.cpp (outdated)
src/randstrobes.cpp (outdated)
@@ -80,18 +82,66 @@ struct StrobemerIndex {
return end();
}

//Returns the first entry that matches the main hash
size_t partial_find(randstrobe_hash_t key) const {
Collaborator Author

Code duplication: find is a special case of partial_find with aux_len=0.
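
To illustrate the point about duplication, here is a minimal sketch of how a full lookup could delegate to a partial one. It is not the code in this PR: `SortedHashes`, `kNotFound` and the binary search over a plain sorted vector are assumptions made for the example, and it assumes that the auxiliary part occupies the lowest `aux_len` bits of the hash.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = std::uint64_t;  // stand-in for randstrobe_hash_t

constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Hypothetical container: hashes sorted in ascending order.
struct SortedHashes {
    std::vector<Hash> hashes;

    // Index of the first entry that equals `key` when the lowest
    // `aux_len` bits are ignored, or kNotFound if there is none.
    std::size_t partial_find(Hash key, unsigned aux_len) const {
        const Hash mask = aux_len >= 64 ? Hash{0} : ~Hash{0} << aux_len;
        const Hash masked_key = key & mask;
        // Every entry sharing the masked prefix is >= masked_key,
        // so the first such entry is found by lower_bound.
        auto it = std::lower_bound(hashes.begin(), hashes.end(), masked_key);
        if (it != hashes.end() && (*it & mask) == masked_key) {
            return static_cast<std::size_t>(it - hashes.begin());
        }
        return kNotFound;
    }

    // With aux_len = 0 nothing is masked, so this is an exact lookup.
    std::size_t find(Hash key) const { return partial_find(key, 0); }
};
```

With this shape, only `partial_find` carries the search logic and `find` is a one-line wrapper.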

src/index.hpp (outdated)
src/index.hpp (outdated)
src/nam.cpp (outdated)
src/randstrobes.cpp (outdated)
@marcelm
Collaborator Author

marcelm commented May 23, 2024

Commit 21599d9 "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the rescuable.43 read in the tiny test dataset is no longer rescuable. Shortened diff of read 1:

-rescuable.43   83  NC_001422.1  3137  60  120S14=1X4=1X9=1X4=1X4=1X9=1X4=1X4=1X4=1X9=1X19=1X4=1X4=1X4=1X4=1X4=1X4=1X9=1X4=1X9=1X4=1X9=1X4=1X4=1X4=1X      =       2955    -362    (sequence)       (qual)       NM:i:25 AS:i:120        RG:Z:1
+rescuable.43   69  NC_001422.1  2955  0  *  =  2955  0       (sequence)  (qual)  RG:Z:1

src/randstrobes.cpp (outdated)
Randstrobes that have no downstream partner at least w_min syncmers away get
their second hash set to 0, which means in the case of multi-context seeds
that the primary/main hash is also zero (because it is the smaller of the
two).

When this is done both for the reference and for queries, we get spurious
hits for all randstrobes towards the ends of queries (they get mapped to the
end of the reference).

Using the hash of the primary syncmer also as hash for the second syncmer
gets rid of the problem.
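
The effect described in this commit message can be sketched as follows. This is only an illustration under the assumption stated in the message, namely that the main hash is the smaller of the two syncmer hashes; all function names are hypothetical and do not come from the code base.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

using Hash = std::uint64_t;

// Assumption from the commit message: the main hash is the smaller
// of the two syncmer hashes.
Hash main_hash(Hash first, Hash second) { return std::min(first, second); }

// Behaviour before the fix: a missing downstream partner yields 0,
// which then always wins the min() above.
Hash second_hash_before(std::optional<Hash> partner) {
    return partner.value_or(0);
}

// Behaviour after the fix: a missing partner falls back to the hash
// of the first syncmer, so the main hash stays specific to the seed.
Hash second_hash_after(Hash first, std::optional<Hash> partner) {
    return partner.value_or(first);
}
```

Before the fix, every partnerless randstrobe ends up with main hash 0, in the reference as well as in the query, so they all match each other. After the fix, the main hash of such a randstrobe equals the hash of its first syncmer.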
@marcelm
Collaborator Author

marcelm commented Jun 7, 2024

Commit 21599d9 "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the rescuable.43 read in the tiny test dataset is no longer rescuable. [...]

I looked into this. The problem is that this heuristic check now returns false:

if (!has_shared_substring(r_tmp, ref_segm, k)) {

This happens because $k$ was increased from 22 to 25. With $k=22$, there were shared $k$-mers, but there are no longer any with $k=25$.

I think the easiest fix for now is to pass -k 22 on the command line when running that particular test, but it may also make sense to reconsider how the heuristic works (perhaps it could depend less on $k$). That would be independent of this PR, though.

@ksahlin
Owner

ksahlin commented Jun 7, 2024

Yes, sounds good to reconsider the heuristic or make it independent of k.

3 participants