Alignment with ksw2/WFA2 - Take 2 #251

Open — wants to merge 7 commits into main
Conversation

@marcelm (Collaborator) commented Mar 6, 2023

This supersedes #242 and is a much smaller PR because the refactoring has already been done in #249. As before:

  • Use WFA2 to align the NAM itself
  • Use ksw2 to extend the alignment once towards the 5' end ("left extension") and once towards the 3' end ("right extension") of the read.

Compared to #242, this ensures that indels are left-aligned even in the left extension. To do the left extension, we reverse query and reference, do a right extension (this is how ksw2 is intended to be used), and then reverse the resulting CIGAR. When doing so, we need to set a flag telling ksw2 to right-align indels so that they end up left-aligned after the CIGAR is reversed. The version in #242 did not do this.
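The reverse-then-extend trick described above can be sketched in a few lines (a minimal Python sketch, not strobealign's actual implementation; `reverse_cigar` is a hypothetical helper):

```python
import re

def reverse_cigar(cigar: str) -> str:
    """Reverse the order of CIGAR operations.

    After running a right extension on the reversed query and
    reference, the resulting CIGAR must be reversed so that it
    describes the original (unreversed) left extension.
    """
    ops = re.findall(r"\d+[MIDNSHP=X]", cigar)
    return "".join(reversed(ops))

# A right-extension CIGAR computed on reversed sequences ...
print(reverse_cigar("3S10M2I5M"))  # -> 5M2I10M3S
```

Because the operations change order, an indel that ksw2 placed as far right as possible ends up as far left as possible in the reversed CIGAR, which is why the right-align flag is needed.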

This should help with getting rid of the (likely incorrect) increased indel precision/recall in #245.

To Do

  • Run benchmarks
  • Score computation is buggy: the score is off by exactly 7 for some alignments. It is strange that the function returns an odd score at all, given that the match and mismatch scores are even and no gaps are involved.
  • The fixops and count_edits functions should possibly be moved somewhere else.

@ksahlin (Owner) commented Mar 6, 2023

Great that you found the cause!

@marcelm (Collaborator, Author) commented Mar 13, 2023

I investigated the unexpected ksw2 score computation. As I could have guessed, "it’s not a bug, it’s a feature": I noticed the weird scores have to do with the N characters in our test reads. As it turns out, any character that is not one of ACGTacgt is treated as a wildcard character by ksw2. I couldn’t really find where this is done in the code, but to explain the numbers that I saw, any match involving a wildcard character must be getting a score of -1. This appears to be hardcoded.

This behavior is different from SSW, but it makes sense. The only slight problem is that -1 is less bad in our scoring scheme (with +4 for match, -8 for mismatch) than in the BWA-MEM scoring scheme (+1 for match, -4 for mismatch), but I don’t see how this would have any more than a very minor influence on mapping results.
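The inferred behavior can be modeled as follows (a minimal sketch, assuming the hardcoded -1 wildcard score described above; the function name is hypothetical and the +4/-8 values are our scoring scheme, not ksw2's API):

```python
def alignment_score(query: str, ref: str, match: int = 4, mismatch: int = 8) -> int:
    """Score an ungapped alignment the way ksw2 appears to:
    any column involving a non-ACGT character scores -1,
    regardless of the configured match/mismatch scores."""
    score = 0
    for q, r in zip(query.upper(), ref.upper()):
        if q not in "ACGT" or r not in "ACGT":
            score -= 1          # wildcard column, hardcoded in ksw2
        elif q == r:
            score += match
        else:
            score -= mismatch
    return score

# Ten matching columns would give 40; one N column gives -1
# instead of +4, so the total becomes odd: 9*4 - 1 = 35
print(alignment_score("ACGTNACGTA", "ACGTAACGTA"))  # -> 35
```

This would also account for scores that are off by exactly 7: an N column that we expected to be counted as a mismatch (-8) instead scores -1.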

@ksahlin (Owner) commented Mar 13, 2023

I see. I guess it makes sense to have a lower penalty on N compared to actual mismatches, as there is higher uncertainty for N.

@marcelm (Collaborator, Author) commented Mar 17, 2023

Here are some accuracy measurements.

Paired-end

| dataset | before | after (8311650) | difference |
|---|---|---|---|
| drosophila/2x50-1M | 90.0774 | 90.0718 | -0.0056 |
| drosophila/2x100-1M | 92.3898 | 92.38455 | -0.00525 |
| drosophila/2x150-1M | 93.21915 | 93.20835 | -0.0108 |
| drosophila/2x200-1M | 93.5103 | 93.50025 | -0.01005 |
| drosophila/2x300-1M | 95.36705 | 95.3561 | -0.01095 |
| CHM13/2x50-1M | 90.4725 | 90.46865 | -0.00385 |
| CHM13/2x100-1M | 93.2267 | 93.2158 | -0.0109 |
| CHM13/2x150-1M | 94.13815 | 94.1189 | -0.01925 |
| CHM13/2x200-1M | 94.417 | 94.405 | -0.012 |
| CHM13/2x300-1M | 95.62605 | 95.6148 | -0.01125 |
| rye/2x50-1M | 68.88975 | 68.88125 | -0.0085 |
| rye/2x100-1M | 85.55745 | 85.55015 | -0.0073 |
| rye/2x150-1M | 90.1907 | 90.18255 | -0.00815 |
| rye/2x200-1M | 91.4566 | 91.45025 | -0.00635 |
| rye/2x300-1M | 94.56905 | 94.5592 | -0.00985 |
| maize/2x50-1M | 71.2683 | 71.26195 | -0.00635 |
| maize/2x100-1M | 87.12395 | 87.11465 | -0.0093 |
| maize/2x150-1M | 91.68675 | 91.67655 | -0.0102 |
| maize/2x200-1M | 92.90785 | 92.8948 | -0.01305 |
| maize/2x300-1M | 96.72045 | 96.71095 | -0.0095 |

Average difference: -0.0094225

Single-end

| dataset | before | after (8311650) | difference |
|---|---|---|---|
| drosophila/2x50-1M | 82.033 | 82.0313 | -0.0017 |
| drosophila/2x100-1M | 89.1891 | 89.1865 | -0.0026 |
| drosophila/2x150-1M | 90.9077 | 90.8999 | -0.0078 |
| drosophila/2x200-1M | 91.9535 | 91.9437 | -0.0098 |
| drosophila/2x300-1M | 93.2254 | 93.2126 | -0.0128 |
| CHM13/2x50-1M | 81.1605 | 81.1602 | -0.0003 |
| CHM13/2x100-1M | 90.5476 | 90.5425 | -0.0051 |
| CHM13/2x150-1M | 92.3761 | 92.3628 | -0.0133 |
| CHM13/2x200-1M | 93.2074 | 93.1941 | -0.0133 |
| CHM13/2x300-1M | 94.1505 | 94.1366 | -0.0139 |
| rye/2x50-1M | 44.4244 | 44.4244 | 0 |
| rye/2x100-1M | 69.2576 | 69.2554 | -0.0022 |
| rye/2x150-1M | 80.3999 | 80.391 | -0.0089 |
| rye/2x200-1M | 85.8734 | 85.866 | -0.0074 |
| rye/2x300-1M | 90.4879 | 90.4764 | -0.0115 |
| maize/2x50-1M | 47.1744 | 47.1755 | 0.0011 |
| maize/2x100-1M | 70.483 | 70.4802 | -0.0028 |
| maize/2x150-1M | 81.1613 | 81.1519 | -0.0094 |
| maize/2x200-1M | 86.6911 | 86.6827 | -0.0084 |
| maize/2x300-1M | 91.8587 | 91.8464 | -0.0123 |

Average difference: -0.00712

Runtime

I have only looked at the runtime for the above accuracy measurements (so sample size per dataset is 1). Roughly, read lengths 200 and 300 bp on the repetitive genomes benefit the most: rye/2x300 and maize/2x300 reduce mapping runtime by 17 and 19%, respectively. Mapping time on 100 bp reads seems to be quite unaffected.

The decrease in accuracy is unfortunate, but maybe not too bad. At the same time, the speedup isn’t that great either. I’m not sure what to think of this.

@ksahlin (Owner) commented Mar 17, 2023

In these benchmarks, is the 'before' a partial (piecewise) alignment approach, or is it the old 'semi-global' one with SSW that I implemented?

I strongly prefer a partial alignment approach so that we could eventually integrate split (supplementary) alignments as well as long read alignment. What is your opinion on this?

While I am perfectly happy paying this very small decrease in mapping accuracy, I care more about what happens with alignments around SNPs and indels. Could you point me to two commits that you want benchmarked against each other for this? I would be happy to set off an evaluation (it will just have to wait until the end of next week).

@marcelm (Collaborator, Author) commented Mar 17, 2023

> In these benchmarks, is the 'before' a partial (piecewise) alignment approach, or is it the old 'semi-global' one with SSW that I implemented?

"before" is essentially v0.9.0, so it’s SSW.

> I strongly prefer a partial alignment approach so that we could eventually integrate split (supplementary) alignments as well as long read alignment. What is your opinion on this?

I guess it would be a good first step towards split alignments.

I forgot to mention that the decrease in accuracy is smaller than the improvement we got from introducing the soft-clipping end bonus. So accuracy would still be higher than in 0.8.0.

> While I am perfectly happy paying this very small decrease in mapping accuracy, I care more about what happens with alignments around SNPs and indels. Could you point me to two commits that you want benchmarked against each other for this?

I just rebased this PR on top of main and force-pushed it. The commits to compare against each other would be v0.9.0 (0771d1c) and 8311650.
