Fix for sometimes too short sequence trimmed #485

marcelm · 2020-10-22T13:27:09Z

When a read contains a 5' adapter with a mismatch in the last nucleotide(s), the current alignment algorithm prefers to report these as deletions, resulting in the mismatched nucleotides not being trimmed from the read.

An example for how an alignment currently looks like (1 error allowed, 5' adapter):

adapter:  GCACATCT
read:     GCACATC- GGAA

With this, the read would be trimmed to GGAA, but that doesn’t make sense because this seems more likely:

adapter:  GCACATCT
read:     GCACATCG GAA

With this, the read would be trimmed to GAA.

The above example works with both anchored and regular 5' adapters, and something similar happens when n errors are allowed and the last n nucleotides are mismatches.

The reason for this is that both alignments have the same number of matches, the same costs (number of edits), and that the undesired version is encountered first and then kept.

(One workaround is to use --no-indels, which is probably a good thing in many cases anyway.)

To reproduce, run

echo -e ">read\nGCACATCGGAA\n" | cutadapt -g ^GCACATCT -e 0.15 -

to get a result of

>read
GGAA

This is the DP matrix that is output with --debug=trace:

      G  C  A  C  A  T  C  G  G  A  A  T  T  T  A  A
   0  1  2  3  4  5  6  7  8  9
G  1  0  1  2  3  4  5  6  7  8
C  2  1  0  1  2  3  4  5  6  7
A  3     1  0  1  2  3  4  5  6
C  4        1  0  1  2  3  4  5
A  5           1  0  1  2  3  4
T  6              1  0  1  2  3
C  7                 1  0  1  2
T  8                    1  1  2

The numbers are the edit distances. There are two cells in the last row with the value 1, which correspond to the two alignment possibilities listed above. The left "1" corresponds to the undesired alignment with an indel, which is found first.

The fix in this PR changes the way in which the best alignment is found: If there are multiple possible alignment end positions leading to alignments with the same number of matches, the end position with the fewest indels is chosen.

marcelm added 7 commits October 22, 2020 13:12

Add an xfailing test for an unexpected alignment result

a366e45

Another test for two adapter occurrences

33e82bb

Prefer alignments with mismatches over those with indels

998f0d1

Document a case in the alignment algorithm

494c326

Rename matches to score

695db50

Use score only to compare alignments that have the same origin

1009200

Add alignment algorithm fix to changelog

15848bd

marcelm merged commit 2214184 into master Oct 22, 2020

marcelm deleted the alignalgo branch October 22, 2020 14:07

marcelm mentioned this pull request Dec 28, 2020

error rate and trimmed bases #502

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for sometimes too short sequence trimmed #485

Fix for sometimes too short sequence trimmed #485

marcelm commented Oct 22, 2020

Fix for sometimes too short sequence trimmed #485

Fix for sometimes too short sequence trimmed #485

Conversation

marcelm commented Oct 22, 2020