Skip to content

Chaining#504

Merged
marcelm merged 27 commits intomainfrom
chaining
Sep 1, 2025
Merged

Chaining#504
marcelm merged 27 commits intomainfrom
chaining

Conversation

@NicolasBuchin
Copy link
Copy Markdown
Collaborator

@NicolasBuchin NicolasBuchin commented Jun 4, 2025

This PR introduces collinear chaining as the new default method for mapping/aligning in Strobealign, replacing NAMs. NAMs are still supported and can be enabled using the --nams flag.

It adds new CLI parameters to fine-tune the chaining behavior:
--nams Use NAMs instead of collinear chaining
-H [INT] Chaining look-back heuristic (default: 50)
--gd=[FLOAT] Diagonal gap cost (default: 0.1)
--gl=[FLOAT] Gap length cost (default: 0.05)
--vp=[FLOAT] Best chain score threshold (default: 0.7)
--sg=[INT] Skip distance allowed on the reference

I should add that the CI pipeline currently fails because the output format and alignments produced by collinear chaining differ from those of NAMs, causing test assertions to break, and should be handled at some point.

Next Steps

  • Update test suite to reflect new chaining behavior
  • Document changes usage in CHANGES.md
  • Benchmark performances comparing chaining vs NAMs
  • Fix the reporting of chains that cover the same region of a reference

@marcelm
Copy link
Copy Markdown
Collaborator

marcelm commented Jun 10, 2025

Awesome! Let’s get this into shape so it can be merged.

I’m looking at the failing tests at the moment. When I run tests/run.sh, I get:

Failure running 'diff tests/phix.se.sam phix.se.sam'

When I look at the diff, I can see that the alignments are actually all identical before and after but that the mapping quality changes. Whereas before, we had 42 alignments with quality 60 and 3 with quality 0, afterwards we get 24 with quality 60 and 21 with quality zero. Alignment quality 0 means that there are multiple equally good alignments. This may be an indication that multiple chains are found that result in the same alignment. Either we need to detect these duplicates and remove them or the chaining algorithm needs to be changed to not produce them in the first place. The latter would be preferable because computing alignments is expensive and these duplicates seem to happen quite often.

@NicolasBuchin
Copy link
Copy Markdown
Collaborator Author

You are right, we do report chains that cover the same region on a reference.

I fixed it, by preventing reporting chains with an already used anchor. The tests should pass now.

Copy link
Copy Markdown
Collaborator

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve done a first round of code review. Most of it is about code style, but there’s also a potential bug with int vs unsigned.

Comment thread src/aln.hpp Outdated
Comment thread src/aln.hpp Outdated
Comment thread src/aln.hpp Outdated
Comment thread src/chain.cpp Outdated
Comment thread ext/zstr/src/zstr.hpp Outdated
Comment thread src/chain.cpp Outdated
Comment thread src/chain.cpp Outdated
Comment thread src/chain.cpp Outdated
Comment thread src/chain.hpp Outdated
struct Anchor {
int query_start;
int ref_start;
int ref_id;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsigned

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also changed the Nam struct to use unsigned ?

Comment thread src/chain.cpp Outdated
@marcelm
Copy link
Copy Markdown
Collaborator

marcelm commented Jun 12, 2025

You are right, we do report chains that cover the same region on a reference.

I fixed it, by preventing reporting chains with an already used anchor.

Nice, good that was a small fix.

The tests should pass now.

They don’t but the failure is somewhere else (in the PAF output). Don’t you run the tests locally? You should do that before every push.

@ksahlin
Copy link
Copy Markdown
Owner

ksahlin commented Jun 12, 2025

Nice, good that was a small fix.

Communicated with Nicolas offline. The fix does not guarantee that the highest scoring chain is chosen. For example, if we have anchors [a_1,a_2,a_3,a_4,a_5,a_6] with the DP score [1,2,3,3,5,4], the current solution will report [a_3,a_4,a_6] (with score 4; over reporting threshold), but will then skip [a_3,a_4,a_5] with score 5. Nicolas is looking into solutions. One hack is to always start with reporting the optimal chain solution (we should have the DP index), then do a pass reporting all other non overlapping solutions.

Comment thread CHANGES.md Outdated
Comment thread CHANGES.md
Comment thread src/chain.cpp Outdated
Comment thread src/chain.hpp
Comment thread src/chain.cpp Outdated
Comment thread src/chain.cpp Outdated
Comment thread src/chain.cpp Outdated
// Rescue if requested and needed
if (map_param.rescue_level > 1 && (nonrepetitive_hits == 0 || nonrepetitive_fraction < 0.7)) {
for (int is_revcomp : {0, 1}) {
auto [n_rescue_hits_oriented, n_partial_hits_oriented] = find_anchors_rescue(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: structured binding declaration set but not used [-Wunused-but-set-variable]

NicolasBuchin and others added 23 commits August 30, 2025 11:22
rescure, fwd+rev, simple O(N²) algo

chaining inside strboemer in O(N*h)
... in one place (where sorting seem to takes most time). Unclear if pdqsort
is faster on my data though (ChrX only), perhaps will be for larger references?
…tion.

This commit factors the function collinear_chaining into first chaining (still named collinear_chaining), then to traceback (named extract_chains_from_dp).

This enables finding the global optimal chain score (both FW and RC) before backtrack, which was not done in previous commits.

This commit however does not use the global score, but keeps the previous individual scores using best_score[is_revcomp] (instead of float max_score = std::max(best_score[0], best_score[1]);)

The reason is that, while faster, the global score reduce the alignment score significantly on the (one) dataset I am testing on. However, I beleive one could potentially intorduce the glodal score but relaxing --vp instead to compensate. Further analysis needed

Lastly, while this commit should only be a refactoring (keeping identical results), it still changes results. This is because previous commits had `int new_score = dp[j] + score;` while I believe it should be `float new_score = dp[j] + score;`. I verified that this change has a non-negligible effect on chains returned.
Instead of --chain, there is now --nams
(no sharing of anchors)

This fixes also the issue that mapping quality is set to 0 much more often
than with NAMs.

updated strobeealign tests to be in sync with the chaining results for paf files (different number of matches)

Is-new-baseline: yes
@marcelm
Copy link
Copy Markdown
Collaborator

marcelm commented Sep 1, 2025

I have now rebased this PR on top of main and cleaned up the commit history as much as I could. I also ran all the tests for each commit and set the Is-new-baseline: yes trailer if necessary.

It is unfortunate that GitHub only shows my avatar next to the commits that Nicolas has authored, making it appear as if I were the author (maybe it is because Nicolas uses the default avatar). However, the commits themselves are correctly attributed to Nicolas as author (while I am the committer).

There are some unaddressed review comments that I will look into next before merging the PR.

The original history for this PR is in the branch chaining-original. I’ll delete that some time after merging if no one needs it.

@ksahlin
Copy link
Copy Markdown
Owner

ksahlin commented Sep 1, 2025

very nice!

@marcelm
Copy link
Copy Markdown
Collaborator

marcelm commented Sep 1, 2025

I think this deserves to be merged now.

The chaining code as it is here is what we are running in our current benchmarks, so it makes sense to mark that somehow. I’ll open separate, smaller PRs for the few remaining issues.

@NicolasBuchin Great work, congratulations!

@marcelm marcelm merged commit b3821bf into main Sep 1, 2025
17 checks passed
@marcelm marcelm deleted the chaining branch September 1, 2025 13:41
@ksahlin
Copy link
Copy Markdown
Owner

ksahlin commented Sep 2, 2025

Great stuff @NicolasBuchin!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants