Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Imperfect read alignment, changing parameters does not help #357

Open
tprodanov opened this issue Oct 31, 2023 · 9 comments · May be fixed by #382
Open

Imperfect read alignment, changing parameters does not help #357

tprodanov opened this issue Oct 31, 2023 · 9 comments · May be fixed by #382

Comments

@tprodanov
Copy link

tprodanov commented Oct 31, 2023

Mapping a read

>A00297:159:HV2LVDSXX:3:1439:25572:32503
AGGCGGCCCCTGCAGTGGTCGCTGATGCAGGCGTTGAAGAATGGGCCCGGGGGCACAAGGTTGTGGCACTCAGCAAAGACCCTGTGGGGCAAAGCGATCAGGACGGAGGGCCCCCAGCGTGCATGCGGGCGGGCGGGGGCGCCCCAGGGC

to the reference

>ref
CCCATGCCCACCACAGCCGCTCTGTGATCTGATGCTGAGCCAGTGAGTCCTCCCCTCGGG
GGTTGCAGGCCCTGGGGCGCCCCCGCCCGCCCGCATGCACGCTGGGGGCCCTCCGTCCTG
ATCGCTTTGCCCCACAGGGTCTTTGCTGAGTGCCACAACCTTGTGCCCCCGGGCCCATTC
TTCAACGCCTGCATCAGCGACCACTGCAGGGCCGCCTTGAGGTGCCCTGCCAGAGCCTGA
GGCTTACGCAGAGCTCTGCCGCGCCCGGGGAGTGTGCAGTGACTGGCGAGGTGCAACCGG
TGGCCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGACCCCTCGGCCTCTCTGAGT
GTCCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGAC

using Strobealign v0.11.0 produces alignment 143=7S with score 296.
Mapping the same read using BWA MEM finds a better alignment 140=1I9= with score 143 (x2 = 286).
Setting high penalty for soft clipping -L 1000 produces a bad alignment 143=1X1=2X1=1X1= with score 2260, for some reason. Decreasing gap opening (-O) penalty and increasing match bonus/mismatch penalty (-A/B) did not help.
I tried changing seeding arguments -m/k/c/u/l, but with no luck for now.

Maybe part of the strobemer aligns in place of the 1 bp insert and this breaks the rest of the alignment, but it is strange that alignment cannot be fixed by modifying seeding arguments.
Also, extremely high score with high soft clipping penalty without the actual soft clipping seems like a bug.

@ksahlin
Copy link
Owner

ksahlin commented Oct 31, 2023

Excellent bug report, I can reproduce your results.

I will discuss with @marcelm. Thanks for letting us know!

@marcelm
Copy link
Collaborator

marcelm commented Nov 6, 2023

Also, extremely high score with high soft clipping penalty without the actual soft clipping seems like a bug.

The soft clipping penalty is not actually a penalty, but an "end bonus". It is added to the score for each end that reaches the end. This is also why you get the score 2260 when you set -L 1000.

(I know this is not documented correctly, and we need to fix this. – And maybe the better fix would be to make it actually be a soft-clipping penalty.)

I believe the issue may be that the alignment library we are using (SSW) does not directly support soft clipping penalties. We just let it find an alignment that is allowed to start and end anywhere within the read, and then afterwards try to extend it towards the end of the reads and adjust the score accordingly.

When we run SSW, we get 143=7S with a score of 286 (143 times match score). The end bonus for the 5' end is then added afterwards and gives a total score of 296.

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches). So although the total score would be 305 because we can add the end bonus twice, that variant is never considered.

@ksahlin
Copy link
Owner

ksahlin commented Nov 6, 2023

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches).

Shouldn’t lowering -O adress the issue then, given your explanation? (I tried lowering -O and it still produces 143=7S).

(mentioned in anther discussion: I haven’t thought it through properly yet, but is it sound that the end bonus/penalty is part in determining the best location for PE-reads?)

@drtconway
Copy link
Contributor

Reflecting some direct emails, there seem to be some examples where the alignments have slightly odd indel behaviour, where bwa is happy to soft clip.

I made a recipe for reproducing some cases at https://gist.github.com/drtconway/962697804acd1d89c753b1d9a0378fc6

@marcelm
Copy link
Collaborator

marcelm commented Nov 29, 2023

I believe to have found the problem.

First, a correction to what I wrote:

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches)

Unlike BWA, our gap costs are computed as gap_open + (length - 1) * gap_extend, so the single insertion costs 12, not 12 + 1, making the total score 286. My argument would still hold (both alignments are equally good, so SSW can arbitrarily decide which one to pick), but the actual issue is a different one.

The actual problem is that SSW is never called due to an optimization in strobealign that skips computing a gapped alignment if the ungapped alignment looks "good enough". This is a part of the relevant code:

strobealign/src/aln.cpp

Lines 226 to 229 in 7fe2934

auto hamming_dist = hamming_distance(query, ref_segm_ham);
if (hamming_dist >= 0 && (((float) hamming_dist / query.size()) < 0.05) ) { //Hamming distance worked fine, no need to ksw align
info = hamming_align(query, ref_segm_ham, aligner.parameters.match, aligner.parameters.mismatch, aligner.parameters.end_bonus);

To reduce runtime, strobealign first computes an ungapped alignment when extending a seed. At this stage, soft clipping is not allowed and no CIGAR string is computed, we only count the number of mismatches. If that is below 5% of the length of the query, we skip computing a full gapped alignment with SSW.

Here, the ungapped alignment is 143= 1X 1= 2X 1= 1X 1= with four mismatches. Since 4/150=0.03, we skip SSW.

When SSW has been skipped, the ungapped alignment is re-done to get the CIGAR, this time taking soft-clipping into account. That is why the final output is 143=7S.

I was able to get the 140=1I9= alignment by 1) disabling the optimization and 2) reducing gap open penalty to 11 (-O 11).

@ksahlin
Copy link
Owner

ksahlin commented Nov 30, 2023

Perfect that you could pin this down. Easy to fix, but requires a bit of though how to do this without introducing too much runtime.

Perhaps hamming only over the span of the NAMs, but it may lead to a lot of extra ksw calls. We can discuss this offline unless you already have something in mind.

@marcelm
Copy link
Collaborator

marcelm commented Nov 30, 2023

(I wrote the above yesterday, I notice only now I haven’t actually posted it.)

It’s clear that just disabling the optimization is undesirable. On a test dataset (drosophila-50), runtime increases by more than 50%.

I think it makes sense to first ensure that the initial ungapped alignment is also done with soft clipping. The only reason for the current, inconsistent behavior is that I forgot this when I added soft clipping.

With that in place, we would get 143=7S as the first ungapped alignment in the example. If we count the soft-clipped bases as mismatches, that would give us 7/150 = 4.7%. That is still below the currently allowed threshold of 5%, so we would still skip SSW, but it’s a bit closer. Lowering the threshold to 4% (for example) is IMO not the way to go as it doesn’t help for longer reads.

I wonder whether we should then just run ksw on the ends that the ungapped alignment soft-clips. ksw can do extension alignment (unlike SSW) and because the soft-clipped ends are typically short, it should be quite fast.

@ksahlin
Copy link
Owner

ksahlin commented Nov 30, 2023

I wonder whether we should then just run ksw on the ends that the ungapped alignment soft-clips. ksw can do extension alignment (unlike SSW) and because the soft-clipped ends are typically short, it should be quite fast.

Sounds good. So we need to start around _k_nt in w.r.t. the softclipp. So for the above example we would do extension alignment with the k+7 rightmost nt in the read to detect the indel.

@ksahlin
Copy link
Owner

ksahlin commented Nov 30, 2023

Btw, what does strobealign currently do when the region with the NAM has the same length on the query and the reference and Hamming distance of the NAM region is high?

Do we fully realign such cases with SSW? If so, an optimization would be to run ksw on the ends only. I remember we have discussed similar scenarios when we tried out partitioning the alignments and use WFA2 but I don't remember the conclusions.

Clarification: I meant when Hamming distance is high -- possibly because regions outside the NAM region do not fit (e.g. indels). Then it might be inefficient to realign the whole read. One approach would be to try hamming of the NAM hit only, then extension of the ends.

marcelm added a commit that referenced this issue Dec 15, 2023
marcelm added a commit that referenced this issue Dec 17, 2023
It is possible that gapped alignment gives a better score than soft-clipping
based on ungapped alignment.

Closes #357
marcelm added a commit that referenced this issue Feb 22, 2024
It is possible that gapped alignment gives a better score than soft-clipping
based on ungapped alignment.

Closes #357
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants