Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for sometimes too short sequence trimmed #485

Merged
merged 7 commits into from
Oct 22, 2020
Merged

Fix for sometimes too short sequence trimmed #485

merged 7 commits into from
Oct 22, 2020

Conversation

marcelm
Copy link
Owner

@marcelm marcelm commented Oct 22, 2020

When a read contains a 5' adapter with a mismatch in the last nucleotide(s), the current alignment algorithm prefers to report these as deletions, resulting in the mismatched nucleotides not being trimmed from the read.

An example for how an alignment currently looks like (1 error allowed, 5' adapter):

adapter:  GCACATCT
read:     GCACATC- GGAA

With this, the read would be trimmed to GGAA, but that doesn’t make sense because this seems more likely:

adapter:  GCACATCT
read:     GCACATCG GAA

With this, the read would be trimmed to GAA.

The above example works with both anchored and regular 5' adapters, and something similar happens when n errors are allowed and the last n nucleotides are mismatches.

The reason for this is that both alignments have the same number of matches, the same costs (number of edits), and that the undesired version is encountered first and then kept.

(One workaround is to use --no-indels, which is probably a good thing in many cases anyway.)

To reproduce, run

echo -e ">read\nGCACATCGGAA\n" | cutadapt -g ^GCACATCT -e 0.15 -

to get a result of

>read
GGAA

This is the DP matrix that is output with --debug=trace:

      G  C  A  C  A  T  C  G  G  A  A  T  T  T  A  A
   0  1  2  3  4  5  6  7  8  9
G  1  0  1  2  3  4  5  6  7  8
C  2  1  0  1  2  3  4  5  6  7
A  3     1  0  1  2  3  4  5  6
C  4        1  0  1  2  3  4  5
A  5           1  0  1  2  3  4
T  6              1  0  1  2  3
C  7                 1  0  1  2
T  8                    1  1  2

The numbers are the edit distances. There are two cells in the last row with the value 1, which correspond to the two alignment possibilities listed above. The left "1" corresponds to the undesired alignment with an indel, which is found first.

The fix in this PR changes the way in which the best alignment is found: If there are multiple possible alignment end positions leading to alignments with the same number of matches, the end position with the fewest indels is chosen.

@marcelm marcelm merged commit 2214184 into master Oct 22, 2020
@marcelm marcelm deleted the alignalgo branch October 22, 2020 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant