Conversation
|
Awesome! Let’s get this into shape so it can be merged. I’m looking at the failing tests at the moment. When I run When I look at the diff, I can see that the alignments are actually all identical before and after but that the mapping quality changes. Whereas before, we had 42 alignments with quality 60 and 3 with quality 0, afterwards we get 24 with quality 60 and 21 with quality zero. Alignment quality 0 means that there are multiple equally good alignments. This may be an indication that multiple chains are found that result in the same alignment. Either we need to detect these duplicates and remove them or the chaining algorithm needs to be changed to not produce them in the first place. The latter would be preferable because computing alignments is expensive and these duplicates seem to happen quite often. |
|
You are right, we do report chains that cover the same region on a reference. I fixed it, by preventing reporting chains with an already used anchor. The tests should pass now. |
marcelm
left a comment
There was a problem hiding this comment.
I’ve done a first round of code review. Most of it is about code style, but there’s also a potential bug with int vs unsigned.
| struct Anchor { | ||
| int query_start; | ||
| int ref_start; | ||
| int ref_id; |
There was a problem hiding this comment.
should we also changed the Nam struct to use unsigned ?
Nice, good that was a small fix.
They don’t but the failure is somewhere else (in the PAF output). Don’t you run the tests locally? You should do that before every push. |
Communicated with Nicolas offline. The fix does not guarantee that the highest scoring chain is chosen. For example, if we have anchors [a_1,a_2,a_3,a_4,a_5,a_6] with the DP score [1,2,3,3,5,4], the current solution will report [a_3,a_4,a_6] (with score 4; over reporting threshold), but will then skip [a_3,a_4,a_5] with score 5. Nicolas is looking into solutions. One hack is to always start with reporting the optimal chain solution (we should have the DP index), then do a pass reporting all other non overlapping solutions. |
| // Rescue if requested and needed | ||
| if (map_param.rescue_level > 1 && (nonrepetitive_hits == 0 || nonrepetitive_fraction < 0.7)) { | ||
| for (int is_revcomp : {0, 1}) { | ||
| auto [n_rescue_hits_oriented, n_partial_hits_oriented] = find_anchors_rescue( |
There was a problem hiding this comment.
warning: structured binding declaration set but not used [-Wunused-but-set-variable]
rescure, fwd+rev, simple O(N²) algo chaining inside strboemer in O(N*h)
…plicated) instead of the HashMap and Set DS.
... in one place (where sorting seem to takes most time). Unclear if pdqsort is faster on my data though (ChrX only), perhaps will be for larger references?
…tion. This commit factors the function collinear_chaining into first chaining (still named collinear_chaining), then to traceback (named extract_chains_from_dp). This enables finding the global optimal chain score (both FW and RC) before backtrack, which was not done in previous commits. This commit however does not use the global score, but keeps the previous individual scores using best_score[is_revcomp] (instead of float max_score = std::max(best_score[0], best_score[1]);) The reason is that, while faster, the global score reduce the alignment score significantly on the (one) dataset I am testing on. However, I beleive one could potentially intorduce the glodal score but relaxing --vp instead to compensate. Further analysis needed Lastly, while this commit should only be a refactoring (keeping identical results), it still changes results. This is because previous commits had `int new_score = dp[j] + score;` while I believe it should be `float new_score = dp[j] + score;`. I verified that this change has a non-negligible effect on chains returned.
Instead of --chain, there is now --nams
(no sharing of anchors) This fixes also the issue that mapping quality is set to 0 much more often than with NAMs. updated strobeealign tests to be in sync with the chaining results for paf files (different number of matches) Is-new-baseline: yes
Is-new-baseline: yes
Is-new-baseline: yes
…hors Is-new-baseline: yes
Is-new-baseline: yes
|
I have now rebased this PR on top of main and cleaned up the commit history as much as I could. I also ran all the tests for each commit and set the It is unfortunate that GitHub only shows my avatar next to the commits that Nicolas has authored, making it appear as if I were the author (maybe it is because Nicolas uses the default avatar). However, the commits themselves are correctly attributed to Nicolas as author (while I am the committer). There are some unaddressed review comments that I will look into next before merging the PR. The original history for this PR is in the branch |
|
very nice! |
|
I think this deserves to be merged now. The chaining code as it is here is what we are running in our current benchmarks, so it makes sense to mark that somehow. I’ll open separate, smaller PRs for the few remaining issues. @NicolasBuchin Great work, congratulations! |
|
Great stuff @NicolasBuchin! |
This PR introduces collinear chaining as the new default method for mapping/aligning in Strobealign, replacing NAMs. NAMs are still supported and can be enabled using the
--namsflag.It adds new CLI parameters to fine-tune the chaining behavior:
--namsUse NAMs instead of collinear chaining-H [INT]Chaining look-back heuristic (default: 50)--gd=[FLOAT]Diagonal gap cost (default: 0.1)--gl=[FLOAT]Gap length cost (default: 0.05)--vp=[FLOAT]Best chain score threshold (default: 0.7)--sg=[INT]Skip distance allowed on the referenceI should add that the CI pipeline currently fails because the output format and alignments produced by collinear chaining differ from those of NAMs, causing test assertions to break, and should be handled at some point.
Next Steps