Alignment with ksw2/WFA2 - Take 2 #251

Open — wants to merge 7 commits into main
Conversation

@marcelm (Collaborator) commented Mar 6, 2023

This supersedes #242 and is a much smaller PR because the refactoring has already been done in #249. As before:

  • Use WFA2 to align the NAM itself
  • Use ksw2 to extend the alignment once towards the 5' end ("left extension") and once towards the 3' end ("right extension") of the read.

Compared to #242, this ensures that indels are left-aligned even in the left extension. To do the left extension, we reverse query and reference, do a right extension (this is how ksw2 is intended to be used), and then reverse the resulting CIGAR. When doing so, we need to set a flag telling ksw2 to right-align indels so that they end up left-aligned after the CIGAR is reversed. The version in #242 did not do this.
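The reverse-then-extend trick described above can be sketched in a few lines (a minimal Python sketch, not strobealign's actual implementation; `reverse_cigar` is a hypothetical helper):

```python
import re

def reverse_cigar(cigar: str) -> str:
    """Reverse the order of CIGAR operations.

    After running a right extension on the reversed query and
    reference, the resulting CIGAR must be reversed so that it
    describes the original (unreversed) left extension.
    """
    ops = re.findall(r"\d+[MIDNSHP=X]", cigar)
    return "".join(reversed(ops))

# A right-extension CIGAR computed on reversed sequences ...
print(reverse_cigar("3S10M2I5M"))  # -> 5M2I10M3S
```

Because the operations change order, an indel that ksw2 placed as far right as possible ends up as far left as possible in the reversed CIGAR, which is why the right-align flag is needed.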

This should help with getting rid of the (likely incorrect) increased indel precision/recall in #245.

To Do

  • Run benchmarks
  • Score computation is buggy: the score is off by exactly 7 for some alignments. It is strange that the function returns an odd score at all, given that the match and mismatch scores are even and no gaps are involved.
  • The fixops and count_edits functions should possibly be moved somewhere else.

@ksahlin (Owner) commented Mar 6, 2023

Great that you found the cause!

@marcelm (Collaborator, Author) commented Mar 13, 2023

I investigated the unexpected ksw2 score computation. As I could have guessed, "it’s not a bug, it’s a feature": I noticed the weird scores have to do with the N characters in our test reads. As it turns out, any character that is not one of ACGTacgt is treated as a wildcard character by ksw2. I couldn’t really find where this is done in the code, but to explain the numbers that I saw, any match involving a wildcard character must be getting a score of -1. This appears to be hardcoded.

This behavior is different from SSW, but it makes sense. The only slight problem is that -1 is less bad in our scoring scheme (with +4 for match, -8 for mismatch) than in the BWA-MEM scoring scheme (+1 for match, -4 for mismatch), but I don’t see how this would have any more than a very minor influence on mapping results.
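The inferred behavior can be modeled as follows (a minimal sketch, assuming the hardcoded -1 wildcard score described above; the function name is hypothetical and the +4/-8 values are our scoring scheme, not ksw2's API):

```python
def alignment_score(query: str, ref: str, match: int = 4, mismatch: int = 8) -> int:
    """Score an ungapped alignment the way ksw2 appears to:
    any column involving a non-ACGT character scores -1,
    regardless of the configured match/mismatch scores."""
    score = 0
    for q, r in zip(query.upper(), ref.upper()):
        if q not in "ACGT" or r not in "ACGT":
            score -= 1          # wildcard column, hardcoded in ksw2
        elif q == r:
            score += match
        else:
            score -= mismatch
    return score

# Ten matching columns would give 40; one N column gives -1
# instead of +4, so the total becomes odd: 9*4 - 1 = 35
print(alignment_score("ACGTNACGTA", "ACGTAACGTA"))  # -> 35
```

This would also account for scores that are off by exactly 7: an N column that we expected to be counted as a mismatch (-8) instead scores -1.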

@ksahlin (Owner) commented Mar 13, 2023

I see. I guess it makes sense to have a lower penalty on N compared to actual mismatches, as there is higher uncertainty for N.

@marcelm (Collaborator, Author) commented Mar 17, 2023

Here are some accuracy measurements.

Paired-end

| dataset | before | after (8311650) | difference |
|---|---|---|---|
| drosophila/2x50-1M | 90.0774 | 90.0718 | -0.0056 |
| drosophila/2x100-1M | 92.3898 | 92.38455 | -0.00525 |
| drosophila/2x150-1M | 93.21915 | 93.20835 | -0.0108 |
| drosophila/2x200-1M | 93.5103 | 93.50025 | -0.01005 |
| drosophila/2x300-1M | 95.36705 | 95.3561 | -0.01095 |
| CHM13/2x50-1M | 90.4725 | 90.46865 | -0.00385 |
| CHM13/2x100-1M | 93.2267 | 93.2158 | -0.0109 |
| CHM13/2x150-1M | 94.13815 | 94.1189 | -0.01925 |
| CHM13/2x200-1M | 94.417 | 94.405 | -0.012 |
| CHM13/2x300-1M | 95.62605 | 95.6148 | -0.01125 |
| rye/2x50-1M | 68.88975 | 68.88125 | -0.0085 |
| rye/2x100-1M | 85.55745 | 85.55015 | -0.0073 |
| rye/2x150-1M | 90.1907 | 90.18255 | -0.00815 |
| rye/2x200-1M | 91.4566 | 91.45025 | -0.00635 |
| rye/2x300-1M | 94.56905 | 94.5592 | -0.00985 |
| maize/2x50-1M | 71.2683 | 71.26195 | -0.00635 |
| maize/2x100-1M | 87.12395 | 87.11465 | -0.0093 |
| maize/2x150-1M | 91.68675 | 91.67655 | -0.0102 |
| maize/2x200-1M | 92.90785 | 92.8948 | -0.01305 |
| maize/2x300-1M | 96.72045 | 96.71095 | -0.0095 |

Average difference: -0.0094225

Single-end

| dataset | before | after (8311650) | difference |
|---|---|---|---|
| drosophila/2x50-1M | 82.033 | 82.0313 | -0.0017 |
| drosophila/2x100-1M | 89.1891 | 89.1865 | -0.0026 |
| drosophila/2x150-1M | 90.9077 | 90.8999 | -0.0078 |
| drosophila/2x200-1M | 91.9535 | 91.9437 | -0.0098 |
| drosophila/2x300-1M | 93.2254 | 93.2126 | -0.0128 |
| CHM13/2x50-1M | 81.1605 | 81.1602 | -0.0003 |
| CHM13/2x100-1M | 90.5476 | 90.5425 | -0.0051 |
| CHM13/2x150-1M | 92.3761 | 92.3628 | -0.0133 |
| CHM13/2x200-1M | 93.2074 | 93.1941 | -0.0133 |
| CHM13/2x300-1M | 94.1505 | 94.1366 | -0.0139 |
| rye/2x50-1M | 44.4244 | 44.4244 | 0 |
| rye/2x100-1M | 69.2576 | 69.2554 | -0.0022 |
| rye/2x150-1M | 80.3999 | 80.391 | -0.0089 |
| rye/2x200-1M | 85.8734 | 85.866 | -0.0074 |
| rye/2x300-1M | 90.4879 | 90.4764 | -0.0115 |
| maize/2x50-1M | 47.1744 | 47.1755 | 0.0011 |
| maize/2x100-1M | 70.483 | 70.4802 | -0.0028 |
| maize/2x150-1M | 81.1613 | 81.1519 | -0.0094 |
| maize/2x200-1M | 86.6911 | 86.6827 | -0.0084 |
| maize/2x300-1M | 91.8587 | 91.8464 | -0.0123 |

Average difference: -0.00712

Runtime

I have only looked at the runtime for the above accuracy measurements (so sample size per dataset is 1). Roughly, read lengths 200 and 300 bp on the repetitive genomes benefit the most: rye/2x300 and maize/2x300 reduce mapping runtime by 17 and 19%, respectively. Mapping time on 100 bp reads seems to be quite unaffected.

The decrease in accuracy is unfortunate, but maybe not too bad. At the same time, the speedup isn’t that great either. I’m not sure what to think of this.

@ksahlin (Owner) commented Mar 17, 2023

In these benchmarks, is the 'before' a partial (piecewise) alignment approach, or is it the old 'semi-global' one with SSW that I implemented?

I strongly prefer a partial alignment approach so that we could eventually integrate split (supplementary) alignments as well as long read alignment. What is your opinion on this?

While I am perfectly happy paying this very small decrease in mapping accuracy, I care more about what happens with alignments around SNPs and indels. Could you point me to two commits that you want benchmarked against each other for this? I would be happy to set off an evaluation (it will just have to wait until the end of next week).

@marcelm (Collaborator, Author) commented Mar 17, 2023

> In these benchmarks, is the 'before' a partial (piecewise) alignment approach, or is it the old 'semi-global' one with SSW that I implemented?

"before" is essentially v0.9.0, so it’s SSW.

> I strongly prefer a partial alignment approach so that we could eventually integrate split (supplementary) alignments as well as long read alignment. What is your opinion on this?

I guess it would be a good first step towards split alignments.

I forgot to mention that the decrease in accuracy is smaller than the improvement we got from introducing the soft-clipping end bonus. So accuracy would still be higher than in 0.8.0.

> While I am perfectly happy paying this very small decrease in mapping accuracy, I care more about what happens with alignments around SNPs and indels. Could you point me to two commits that you want benchmarked against each other for this?

I just rebased this PR on top of main and force-pushed it. The commits to compare against each other would be v0.9.0 (0771d1c) and 8311650.
