Fix for sometimes too short sequence trimmed #485
Merged
When a read contains a 5' adapter with a mismatch in the last nucleotide(s), the current alignment algorithm prefers to report these as deletions, resulting in the mismatched nucleotides not being trimmed from the read.
An example of what an alignment currently looks like (1 error allowed, 5' adapter):
With this, the read would be trimmed to `GGAA`, but that doesn’t make sense because this seems more likely:
With this, the read would be trimmed to `GAA`.
The above example works with both anchored and regular 5' adapters, and something similar happens when n errors are allowed and the last n nucleotides are mismatches.
The reason for this is that both alignments have the same number of matches, the same costs (number of edits), and that the undesired version is encountered first and then kept.
(One workaround is to use `--no-indels`, which is probably a good thing in many cases anyway.)
To reproduce, run
to get a result of
This is the DP matrix that is output with `--debug=trace`:
The numbers are the edit distances. There are two cells in the last row with the value 1, which correspond to the two alignment possibilities listed above. The left "1" corresponds to the undesired alignment with an indel, which is found first.
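To make the "two cells with the value 1" situation concrete, here is a minimal sketch in plain Python of the last row of an edit-distance matrix for an anchored 5' adapter. It is not the aligner's actual code, and the adapter `ACGT` and read `ACGGGAA` are made-up sequences (not the ones from this PR), chosen only so that the two tied end positions would leave `GGAA` and `GAA` after trimming.

```python
def edit_distance_last_row(adapter: str, read: str) -> list[int]:
    """Return D[len(adapter)][j] for j = 0..len(read).

    D[i][j] is the edit distance between adapter[:i] and read[:j].
    Both sequences start at position 0, as for an anchored 5' adapter.
    """
    prev = list(range(len(read) + 1))  # row 0: empty adapter prefix
    for i, a in enumerate(adapter, start=1):
        cur = [i]  # column 0: empty read prefix
        for j, r in enumerate(read, start=1):
            cur.append(min(
                prev[j - 1] + (a != r),  # match or mismatch
                prev[j] + 1,             # adapter character without partner (indel)
                cur[j - 1] + 1,          # read character without partner (indel)
            ))
        prev = cur
    return prev


# Hypothetical example sequences, for illustration only.
adapter = "ACGT"
read = "ACGGGAA"
row = edit_distance_last_row(adapter, read)
print(row)  # [4, 3, 2, 1, 1, 2, 3, 4]

# Both end positions with cost 1 are candidates; they trim differently.
ends = [j for j, cost in enumerate(row) if cost == 1]
print([read[j:] for j in ends])  # ['GGAA', 'GAA']
```

In this toy case, end position 3 corresponds to an alignment with one indel and end position 4 to an alignment with one mismatch. Both have three matches and one edit, so a scan from left to right that keeps the first best candidate would settle on the indel version.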
The fix in this PR changes the way in which the best alignment is found: If there are multiple possible alignment end positions leading to alignments with the same number of matches, the end position with the fewest indels is chosen.
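The selection rule described above can be sketched as follows. This is only an illustration of the tie-breaking idea, not the code changed in this PR: the `Candidate` record, its field names, and the `pick_first_best` / `pick_fewest_indels` helpers are hypothetical, and the two candidates reuse made-up numbers (adapter `ACGT`, read `ACGGGAA`) rather than the PR's real example.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """A possible alignment end position (hypothetical record type)."""
    read_end: int  # the read is trimmed up to this position
    matches: int
    errors: int    # total edits: mismatches + indels
    indels: int


def pick_first_best(cands: Sequence[Candidate], max_errors: int) -> Optional[Candidate]:
    """Behaviour described as problematic: the first candidate with the most matches wins."""
    allowed = [c for c in cands if c.errors <= max_errors]
    # max() returns the first maximal element, so ties go to the earlier one.
    return max(allowed, key=lambda c: c.matches, default=None)


def pick_fewest_indels(cands: Sequence[Candidate], max_errors: int) -> Optional[Candidate]:
    """Tie-break as described in the fix: same matches, then fewest indels."""
    allowed = [c for c in cands if c.errors <= max_errors]
    return max(allowed, key=lambda c: (c.matches, -c.indels), default=None)


read = "ACGGGAA"  # made-up read, for illustration only
candidates = [
    Candidate(read_end=3, matches=3, errors=1, indels=1),  # indel version, seen first
    Candidate(read_end=4, matches=3, errors=1, indels=0),  # mismatch version
]

before = pick_first_best(candidates, max_errors=1)
after = pick_fewest_indels(candidates, max_errors=1)
print(read[before.read_end:])  # GGAA
print(read[after.read_end:])   # GAA
```

Negating the indel count inside the sort key keeps the number of matches as the primary criterion, so the indel count only decides between candidates that would otherwise tie.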