Chaining by NicolasBuchin · Pull Request #504 · ksahlin/strobealign

NicolasBuchin · 2025-06-04T09:08:04Z

This PR introduces collinear chaining as the new default method for mapping/aligning in Strobealign, replacing NAMs. NAMs are still supported and can be enabled using the --nams flag.

It adds new CLI parameters to fine-tune the chaining behavior:
--nams Use NAMs instead of collinear chaining
-H [INT] Chaining look-back heuristic (default: 50)
--gd=[FLOAT] Diagonal gap cost (default: 0.1)
--gl=[FLOAT] Gap length cost (default: 0.05)
--vp=[FLOAT] Best chain score threshold (default: 0.7)
--sg=[INT] Skip distance allowed on the reference
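
To make the roles of -H, --gd, --gl, --vp and --sg more concrete, here is a minimal sketch of a look-back chaining recurrence. It is not the code from this PR: the `Anchor` fields mirror the struct shown in the review, but `ChainParams`, `chain_scores`, `passes_threshold`, the constant per-anchor weight `k` and the linear gap penalty are assumptions made purely for illustration (later commits in this PR experiment with a concave gap cost instead).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Fields follow the Anchor struct shown in the review diff.
struct Anchor {
    int query_start;
    int ref_start;
    int ref_id;
};

// Hypothetical parameter bundle; defaults are the ones listed in the PR description.
struct ChainParams {
    int look_back = 50;       // -H
    float diag_cost = 0.1f;   // --gd
    float gap_cost = 0.05f;   // --gl
    float valid_frac = 0.7f;  // --vp
    int max_ref_gap = 10000;  // --sg (default taken from a commit message in this PR)
};

// Returns the best chain score ending at each anchor and fills pred with the
// chosen predecessor (-1 if the anchor starts a chain).
// Anchors must be sorted by (ref_id, ref_start, query_start).
// k is an assumed constant weight per anchor, not the PR's scoring function.
std::vector<float> chain_scores(const std::vector<Anchor>& anchors,
                                const ChainParams& p, float k,
                                std::vector<int>& pred) {
    assert(p.look_back > 0);
    const int n = static_cast<int>(anchors.size());
    std::vector<float> dp(n, k);
    pred.assign(n, -1);
    for (int i = 0; i < n; ++i) {
        const Anchor& b = anchors[i];
        const int first = std::max(0, i - p.look_back);
        for (int j = i - 1; j >= first; --j) {
            const Anchor& a = anchors[j];
            if (a.ref_id != b.ref_id) break;   // sorted input: earlier anchors cannot match either
            const int dr = b.ref_start - a.ref_start;
            const int dq = b.query_start - a.query_start;
            if (dr > p.max_ref_gap) break;     // earlier anchors are even further away
            if (dr <= 0 || dq <= 0) continue;  // not collinear
            const float penalty =
                p.diag_cost * static_cast<float>(std::abs(dr - dq)) +
                p.gap_cost * static_cast<float>(std::min(dr, dq));
            const float new_score = dp[j] + k - penalty;
            if (new_score > dp[i]) {
                dp[i] = new_score;
                pred[i] = j;
            }
        }
    }
    return dp;
}

// --vp: a chain is kept only if its score reaches this fraction of the best score.
bool passes_threshold(float score, float best, const ChainParams& p) {
    return score >= p.valid_frac * best;
}
```

The look-back bound is what turns the O(N²) recurrence into O(N*h): each anchor inspects at most h predecessors.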

Note that the CI pipeline currently fails: the output and alignments produced by collinear chaining differ from those of NAMs, which breaks the test assertions. This still needs to be handled.

Next Steps

Update test suite to reflect new chaining behavior
Document usage changes in CHANGES.md
Benchmark performance of chaining vs. NAMs
Fix the reporting of chains that cover the same region of a reference

marcelm · 2025-06-10T19:01:13Z

Awesome! Let’s get this into shape so it can be merged.

I’m looking at the failing tests at the moment. When I run tests/run.sh, I get:

Failure running 'diff tests/phix.se.sam phix.se.sam'

When I look at the diff, I can see that the alignments themselves are all identical before and after, but the mapping quality changes. Before, we had 42 alignments with quality 60 and 3 with quality 0; afterwards we get 24 with quality 60 and 21 with quality 0. Mapping quality 0 means that there are multiple equally good alignments. This may be an indication that multiple chains are found that result in the same alignment. Either we need to detect these duplicates and remove them, or the chaining algorithm needs to be changed so that it does not produce them in the first place. The latter would be preferable because computing alignments is expensive and these duplicates seem to happen quite often.

NicolasBuchin · 2025-06-11T13:38:09Z

You are right, we do report chains that cover the same region on a reference.

I fixed it by no longer reporting chains that contain an already used anchor. The tests should pass now.

marcelm

I’ve done a first round of code review. Most of it is about code style, but there’s also a potential bug with int vs unsigned.

marcelm · 2025-06-12T06:42:04Z

+struct Anchor {
+    int query_start;
+    int ref_start;
+    int ref_id;


Should we also change the Nam struct to use unsigned?
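
The concern about mixing `int` and `unsigned` is easiest to see with subtraction and comparison, both of which show up when comparing anchor positions. The sketch below is illustrative only; all three helpers are hypothetical and not part of strobealign.

```cpp
#include <cassert>
#include <cstdint>

// Difference of two positions in unsigned arithmetic.
// If b > a, the result wraps around modulo 2^32 instead of going negative.
std::uint32_t position_diff_unsigned(std::uint32_t a, std::uint32_t b) {
    return a - b;
}

// Same computation after widening to a signed 64-bit type: no wrap-around.
std::int64_t position_diff_signed(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int64_t>(a) - static_cast<std::int64_t>(b);
}

// Defensive variant: state the precondition instead of wrapping silently.
std::uint32_t position_diff_checked(std::uint32_t a, std::uint32_t b) {
    assert(a >= b && "unsigned subtraction would wrap");
    return a - b;
}

// Mixed comparison pitfall. Writing `i < u` converts the int to unsigned
// implicitly; the cast below spells out that conversion. A negative i
// becomes a huge value, so the comparison is false.
bool less_mixed(int i, unsigned u) {
    return static_cast<unsigned>(i) < u;
}
```

Whichever type is chosen, using it consistently in both `Anchor` and `Nam` avoids these implicit conversions at the points where the two structs meet.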

marcelm · 2025-06-12T07:08:03Z

> You are right, we do report chains that cover the same region on a reference.
>
> I fixed it, by preventing reporting chains with an already used anchor.

Nice, good that was a small fix.

> The tests should pass now.

They don’t, but the failure is somewhere else (in the PAF output). Don’t you run the tests locally? You should do that before every push.

ksahlin · 2025-06-12T07:43:00Z

> Nice, good that was a small fix.

Communicated with Nicolas offline. The fix does not guarantee that the highest-scoring chain is chosen. For example, if we have anchors [a_1,a_2,a_3,a_4,a_5,a_6] with the DP scores [1,2,3,3,5,4], the current solution will report [a_3,a_4,a_6] (with score 4; above the reporting threshold), but will then skip [a_3,a_4,a_5] with score 5. Nicolas is looking into solutions. One hack is to always start by reporting the optimal chain (we should have the DP index), then do a pass reporting all other non-overlapping solutions.
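
The example above can be reproduced with a small sketch. It assumes a predecessor array chosen so that traceback yields exactly the two chains named in the comment, and it assumes the order-dependent variant visits chain ends from the last anchor backwards; neither is confirmed by the PR. `trace`, `extract`, `extract_in_index_order` and `extract_best_first` are hypothetical names, and indices are 0-based (a_3 is index 2).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

using Chain = std::vector<int>;  // anchor indices in chain order

// Follow predecessor links back from `end`. Returns an empty chain if any
// anchor on the path is already used by a previously reported chain.
static Chain trace(int end, const std::vector<int>& pred, const std::vector<bool>& used) {
    Chain chain;
    for (int i = end; i != -1; i = pred[i]) {
        if (used[i]) return {};
        chain.push_back(i);
    }
    std::reverse(chain.begin(), chain.end());
    return chain;
}

// Report chains in the given order of end anchors, never sharing an anchor.
static std::vector<Chain> extract(const std::vector<int>& order, const std::vector<float>& dp,
                                  const std::vector<int>& pred, float vp) {
    assert(dp.size() == pred.size());
    if (dp.empty()) return {};
    const float best = *std::max_element(dp.begin(), dp.end());
    std::vector<bool> used(dp.size(), false);
    std::vector<Chain> chains;
    for (int end : order) {
        if (dp[end] < vp * best) continue;  // below the reporting threshold
        Chain chain = trace(end, pred, used);
        if (chain.empty()) continue;        // shares an anchor with a reported chain
        for (int i : chain) used[i] = true;
        chains.push_back(std::move(chain));
    }
    return chains;
}

// Order-dependent variant: end anchors visited from last to first.
std::vector<Chain> extract_in_index_order(const std::vector<float>& dp,
                                          const std::vector<int>& pred, float vp) {
    std::vector<int> order(dp.size());
    std::iota(order.rbegin(), order.rend(), 0);  // n-1, ..., 1, 0
    return extract(order, dp, pred, vp);
}

// Suggested fix: visit end anchors by decreasing DP score, so the optimal
// chain is always reported first.
std::vector<Chain> extract_best_first(const std::vector<float>& dp,
                                      const std::vector<int>& pred, float vp) {
    std::vector<int> order(dp.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return dp[a] > dp[b]; });
    return extract(order, dp, pred, vp);
}
```

With `dp = {1, 2, 3, 3, 5, 4}`, `pred = {-1, -1, -1, 2, 3, 3}` and a threshold fraction of 0.7, the index-order variant reports only the chain ending at a_6, while the best-first variant reports the chain ending at a_5.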

marcelm · 2025-08-22T09:31:20Z

+    // Rescue if requested and needed
+    if (map_param.rescue_level > 1 && (nonrepetitive_hits == 0 || nonrepetitive_fraction < 0.7)) {
+        for (int is_revcomp : {0, 1}) {
+            auto [n_rescue_hits_oriented, n_partial_hits_oriented] = find_anchors_rescue(


warning: structured binding declaration set but not used [-Wunused-but-set-variable]
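
One common way to address this kind of warning is the C++17 `[[maybe_unused]]` attribute, which applies to the structured binding declaration as a whole. The sketch below assumes the bound values are only needed by a debug check; `count_hits` and `check_both_strands` are hypothetical stand-ins, not the functions from the diff. If the values are not needed at all, simply not binding the return value is the simpler fix.

```cpp
#include <cassert>
#include <initializer_list>
#include <utility>

// Hypothetical stand-in for a function that returns two counters.
std::pair<int, int> count_hits(bool is_revcomp) {
    return is_revcomp ? std::make_pair(3, 1) : std::make_pair(5, 2);
}

// Calls count_hits for both orientations and returns how many were processed.
int check_both_strands() {
    int checked = 0;
    for (bool is_revcomp : {false, true}) {
        // The bindings are read only by the assert, which disappears when
        // NDEBUG is defined. The attribute marks that as intentional so the
        // compiler does not warn about unused bindings in release builds.
        [[maybe_unused]] auto [n_hits, n_partial] = count_hits(is_revcomp);
        assert(n_partial <= n_hits);
        ++checked;
    }
    return checked;
}
```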

…tion. This commit factors the function collinear_chaining into two steps: chaining (still named collinear_chaining) and traceback (named extract_chains_from_dp). This makes it possible to find the globally optimal chain score (across both FW and RC) before backtracking, which previous commits did not do. However, this commit does not use the global score; it keeps the previous per-orientation scores via best_score[is_revcomp] (instead of float max_score = std::max(best_score[0], best_score[1]);). The reason is that, while faster, using the global score reduces the alignment score significantly on the (one) dataset I am testing on. I believe one could potentially introduce the global score and relax --vp to compensate. Further analysis is needed.

Lastly, while this commit should only be a refactoring (keeping results identical), it still changes results. This is because previous commits had `int new_score = dp[j] + score;` while I believe it should be `float new_score = dp[j] + score;`. I verified that this change has a non-negligible effect on the chains returned.
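
A small sketch of why the `int` versus `float` declaration matters: storing the sum in an `int` drops the fractional part, which can change which predecessor wins the comparison. `best_predecessor` and the `link_score` input are hypothetical and only isolate the effect; they are not the PR's code.

```cpp
#include <cassert>
#include <vector>

// Picks the predecessor j that maximizes dp[j] + link_score[j].
// The template parameter selects the type of the intermediate new_score,
// mimicking `int new_score = ...` versus `float new_score = ...`.
template <typename Score>
int best_predecessor(const std::vector<float>& dp, const std::vector<float>& link_score) {
    assert(dp.size() == link_score.size());
    int best_j = -1;
    float best = 0.0f;
    for (int j = 0; j < static_cast<int>(dp.size()); ++j) {
        // With Score = int, the fractional part of the sum is discarded here.
        const Score new_score = static_cast<Score>(dp[j] + link_score[j]);
        if (static_cast<float>(new_score) > best) {
            best = static_cast<float>(new_score);
            best_j = j;
        }
    }
    return best_j;
}
```

For `dp = {10, 10}` and `link_score = {0.2, 0.9}`, the `float` version prefers the second predecessor (10.9 over 10.2), whereas the `int` version sees 10 for both and keeps the first one it encountered. Since the gap costs in this PR are fractional (0.1 and 0.05 by default), such ties would be frequent under truncation.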

(no sharing of anchors) This also fixes the issue that mapping quality is set to 0 much more often than with NAMs. Updated strobealign tests to be in sync with the chaining results for PAF files (different number of matches).

Is-new-baseline: yes

marcelm · 2025-09-01T11:02:47Z

I have now rebased this PR on top of main and cleaned up the commit history as much as I could. I also ran all the tests for each commit and set the Is-new-baseline: yes trailer if necessary.

It is unfortunate that GitHub only shows my avatar next to the commits that Nicolas has authored, making it appear as if I were the author (maybe it is because Nicolas uses the default avatar). However, the commits themselves are correctly attributed to Nicolas as author (while I am the committer).

There are some unaddressed review comments that I will look into next before merging the PR.

The original history for this PR is in the branch chaining-original. I’ll delete that some time after merging if no one needs it.

ksahlin · 2025-09-01T12:44:48Z

very nice!

marcelm · 2025-09-01T13:40:39Z

I think this deserves to be merged now.

The chaining code as it is here is what we are running in our current benchmarks, so it makes sense to mark that somehow. I’ll open separate, smaller PRs for the few remaining issues.

@NicolasBuchin Great work, congratulations!

ksahlin · 2025-09-02T04:50:28Z

Great stuff @NicolasBuchin!

marcelm reviewed Jun 12, 2025

marcelm reviewed Aug 22, 2025

NicolasBuchin and others added 23 commits August 30, 2025 11:22

simple collinear chaining with matches O(N²)

be3b273

rescue, fwd+rev, simple O(N²) algo chaining inside strobemer in O(N*h)

sorted and non repetitive anchors

67ef667

valid chain threshold as a param + no need for sorting option

0cd7a32

chaining upgrade!

a35b39c

Implements a vector keeping anchors from all ref_ids (sorted and deduplicated) instead of the HashMap and Set DS.

9e51db0

Removed returning heavy DS and instead pass by reference.

9e5fd1a

Removed redundant call to std::sort and replaced std::sort with pdqsort

e77f516

... in one place (where sorting seems to take most time). Unclear if pdqsort is faster on my data though (ChrX only), perhaps will be for larger references?

Typo fix. Default value is set to 0.7 internally

e475c4b

Removed unnecessary sort and changed to pdqsort for rescue

aa89958

skipping diff ref ids

c7b4f0b

skipping by ref distance

58f2595

scoring with distances + small refactor

58e4251

skip distance is now param as an int, set to 10 000 by default

efffc42

Remove whitespace and comments in chain.cpp/hpp

666ef1f

Switch from NAMs by default to chains by default

949b218

Instead of --chain, there is now --nams

trying a concave gap cost function

4e53892

some minor edits for the PR

fefde81

chain overlapping candidate selection now by score

6f36aa4

Is-new-baseline: yes

better log2 for concave gap function

ad472a6

Is-new-baseline: yes

names changed

098329d

scoring function commented + conversion error fixed (?)

8cf9de6

NicolasBuchin added 4 commits September 1, 2025 11:51

n_matches correctly counted + chain score multiplied by number of anchors

4e96037

Is-new-baseline: yes

alternate scoring for chains (score + c * epsilon)

3696e54

Is-new-baseline: yes

Changelog entry and readme section

1570163

minor edits

36e2deb

marcelm force-pushed the chaining branch from e465a90 to be3b273 September 1, 2025 10:04

marcelm merged commit b3821bf into main Sep 1, 2025
17 checks passed

marcelm deleted the chaining branch September 1, 2025 13:41
