Imperfect read alignment, changing parameters does not help #357

tprodanov · 2023-10-31T10:58:20Z

Mapping a read

>A00297:159:HV2LVDSXX:3:1439:25572:32503
AGGCGGCCCCTGCAGTGGTCGCTGATGCAGGCGTTGAAGAATGGGCCCGGGGGCACAAGGTTGTGGCACTCAGCAAAGACCCTGTGGGGCAAAGCGATCAGGACGGAGGGCCCCCAGCGTGCATGCGGGCGGGCGGGGGCGCCCCAGGGC

to the reference

>ref
CCCATGCCCACCACAGCCGCTCTGTGATCTGATGCTGAGCCAGTGAGTCCTCCCCTCGGG
GGTTGCAGGCCCTGGGGCGCCCCCGCCCGCCCGCATGCACGCTGGGGGCCCTCCGTCCTG
ATCGCTTTGCCCCACAGGGTCTTTGCTGAGTGCCACAACCTTGTGCCCCCGGGCCCATTC
TTCAACGCCTGCATCAGCGACCACTGCAGGGCCGCCTTGAGGTGCCCTGCCAGAGCCTGA
GGCTTACGCAGAGCTCTGCCGCGCCCGGGGAGTGTGCAGTGACTGGCGAGGTGCAACCGG
TGGCCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGACCCCTCGGCCTCTCTGAGT
GTCCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGAC

using Strobealign v0.11.0 produces alignment 143=7S with score 296.
Mapping the same read using BWA MEM finds a better alignment 140=1I9= with score 143 (x2 = 286).
Setting high penalty for soft clipping -L 1000 produces a bad alignment 143=1X1=2X1=1X1= with score 2260, for some reason. Decreasing gap opening (-O) penalty and increasing match bonus/mismatch penalty (-A/B) did not help.
I tried changing seeding arguments -m/k/c/u/l, but with no luck for now.

Maybe part of the strobemer aligns in place of the 1 bp insert and this breaks the rest of the alignment, but it is strange that alignment cannot be fixed by modifying seeding arguments.
Also, extremely high score with high soft clipping penalty without the actual soft clipping seems like a bug.

The text was updated successfully, but these errors were encountered:

ksahlin · 2023-10-31T12:38:34Z

Excellent bug report, I can reproduce your results.

I will discuss with @marcelm. Thanks for letting us know!

marcelm · 2023-11-06T10:44:45Z

Also, extremely high score with high soft clipping penalty without the actual soft clipping seems like a bug.

The soft clipping penalty is not actually a penalty, but an "end bonus". It is added to the score for each end that reaches the end. This is also why you get the score 2260 when you set -L 1000.

(I know this is not documented correctly, and we need to fix this. – And maybe the better fix would be to make it actually be a soft-clipping penalty.)

I believe the issue may be that the alignment library we are using (SSW) does not directly support soft clipping penalties. We just let it find an alignment that is allowed to start and end anywhere within the read, and then afterwards try to extend it towards the end of the reads and adjust the score accordingly.

When we run SSW, we get 143=7S with a score of 286 (143 times match score). The end bonus for the 5' end is then added afterwards and gives a total score of 296.

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches). So although the total score would be 305 because we can add the end bonus twice, that variant is never considered.

ksahlin · 2023-11-06T11:51:14Z

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches).

Shouldn’t lowering -O adress the issue then, given your explanation? (I tried lowering -O and it still produces 143=7S).

(mentioned in anther discussion: I haven’t thought it through properly yet, but is it sound that the end bonus/penalty is part in determining the best location for PE-reads?)

drtconway · 2023-11-20T04:40:03Z

Reflecting some direct emails, there seem to be some examples where the alignments have slightly odd indel behaviour, where bwa is happy to soft clip.

I made a recipe for reproducing some cases at https://gist.github.com/drtconway/962697804acd1d89c753b1d9a0378fc6

marcelm · 2023-11-29T19:16:57Z

I believe to have found the problem.

First, a correction to what I wrote:

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches)

Unlike BWA, our gap costs are computed as gap_open + (length - 1) * gap_extend, so the single insertion costs 12, not 12 + 1, making the total score 286. My argument would still hold (both alignments are equally good, so SSW can arbitrarily decide which one to pick), but the actual issue is a different one.

The actual problem is that SSW is never called due to an optimization in strobealign that skips computing a gapped alignment if the ungapped alignment looks "good enough". This is a part of the relevant code:

strobealign/src/aln.cpp

Lines 226 to 229 in 7fe2934

    
           auto hamming_dist = hamming_distance(query, ref_segm_ham); 
        
           if (hamming_dist >= 0 && (((float) hamming_dist / query.size()) < 0.05) ) { //Hamming distance worked fine, no need to ksw align 
        
               info = hamming_align(query, ref_segm_ham, aligner.parameters.match, aligner.parameters.mismatch, aligner.parameters.end_bonus);

To reduce runtime, strobealign first computes an ungapped alignment when extending a seed. At this stage, soft clipping is not allowed and no CIGAR string is computed, we only count the number of mismatches. If that is below 5% of the length of the query, we skip computing a full gapped alignment with SSW.

Here, the ungapped alignment is 143= 1X 1= 2X 1= 1X 1= with four mismatches. Since 4/150=0.03, we skip SSW.

When SSW has been skipped, the ungapped alignment is re-done to get the CIGAR, this time taking soft-clipping into account. That is why the final output is 143=7S.

I was able to get the 140=1I9= alignment by 1) disabling the optimization and 2) reducing gap open penalty to 11 (-O 11).

ksahlin · 2023-11-30T10:40:37Z

Perfect that you could pin this down. Easy to fix, but requires a bit of though how to do this without introducing too much runtime.

Perhaps hamming only over the span of the NAMs, but it may lead to a lot of extra ksw calls. We can discuss this offline unless you already have something in mind.

marcelm · 2023-11-30T11:05:38Z

(I wrote the above yesterday, I notice only now I haven’t actually posted it.)

It’s clear that just disabling the optimization is undesirable. On a test dataset (drosophila-50), runtime increases by more than 50%.

I think it makes sense to first ensure that the initial ungapped alignment is also done with soft clipping. The only reason for the current, inconsistent behavior is that I forgot this when I added soft clipping.

With that in place, we would get 143=7S as the first ungapped alignment in the example. If we count the soft-clipped bases as mismatches, that would give us 7/150 = 4.7%. That is still below the currently allowed threshold of 5%, so we would still skip SSW, but it’s a bit closer. Lowering the threshold to 4% (for example) is IMO not the way to go as it doesn’t help for longer reads.

I wonder whether we should then just run ksw on the ends that the ungapped alignment soft-clips. ksw can do extension alignment (unlike SSW) and because the soft-clipped ends are typically short, it should be quite fast.

ksahlin · 2023-11-30T11:31:39Z

I wonder whether we should then just run ksw on the ends that the ungapped alignment soft-clips. ksw can do extension alignment (unlike SSW) and because the soft-clipped ends are typically short, it should be quite fast.

Sounds good. So we need to start around _k_nt in w.r.t. the softclipp. So for the above example we would do extension alignment with the k+7 rightmost nt in the read to detect the indel.

ksahlin · 2023-11-30T17:16:16Z

Btw, what does strobealign currently do when the region with the NAM has the same length on the query and the reference and Hamming distance of the NAM region is high?

Do we fully realign such cases with SSW? If so, an optimization would be to run ksw on the ends only. I remember we have discussed similar scenarios when we tried out partitioning the alignments and use WFA2 but I don't remember the conclusions.

Clarification: I meant when Hamming distance is high -- possibly because regions outside the NAM region do not fit (e.g. indels). Then it might be inefficient to realign the whole read. One approach would be to try hamming of the NAM hit only, then extension of the ends.

Closes #357

It is possible that gapped alignment gives a better score than soft-clipping based on ungapped alignment. Closes #357

marcelm mentioned this issue Dec 11, 2023

Try extension alignment if seed appears to be ungapped #377

Open

marcelm added a commit that referenced this issue Dec 15, 2023

If ungapped align is softclipped, use ksw_extend

82611d8

Closes #357

marcelm added a commit that referenced this issue Dec 17, 2023

If ungapped alignment is softclipped, ksw_extend soft-clipped ends

85030cf

It is possible that gapped alignment gives a better score than soft-clipping based on ungapped alignment. Closes #357

marcelm linked a pull request Dec 17, 2023 that will close this issue

Use ksw2 to align soft-clipped ends from ungapped alignment #382

Open

marcelm added a commit that referenced this issue Feb 22, 2024

If ungapped alignment is softclipped, ksw_extend soft-clipped ends

26b472e

It is possible that gapped alignment gives a better score than soft-clipping based on ungapped alignment. Closes #357

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Imperfect read alignment, changing parameters does not help #357

Imperfect read alignment, changing parameters does not help #357

tprodanov commented Oct 31, 2023 •

edited

Loading

ksahlin commented Oct 31, 2023

marcelm commented Nov 6, 2023

ksahlin commented Nov 6, 2023

drtconway commented Nov 20, 2023

marcelm commented Nov 29, 2023 •

edited

Loading

ksahlin commented Nov 30, 2023

marcelm commented Nov 30, 2023

ksahlin commented Nov 30, 2023

ksahlin commented Nov 30, 2023 •

edited

Loading

Imperfect read alignment, changing parameters does not help #357

Imperfect read alignment, changing parameters does not help #357

Comments

tprodanov commented Oct 31, 2023 • edited Loading

ksahlin commented Oct 31, 2023

marcelm commented Nov 6, 2023

ksahlin commented Nov 6, 2023

drtconway commented Nov 20, 2023

marcelm commented Nov 29, 2023 • edited Loading

ksahlin commented Nov 30, 2023

marcelm commented Nov 30, 2023

ksahlin commented Nov 30, 2023

ksahlin commented Nov 30, 2023 • edited Loading

tprodanov commented Oct 31, 2023 •

edited

Loading

marcelm commented Nov 29, 2023 •

edited

Loading

ksahlin commented Nov 30, 2023 •

edited

Loading