Demultiplexing does not find the longest best partial 3' adapter #399

Closed
AlexanderBartholomaeus opened this issue Sep 4, 2019 · 4 comments

@AlexanderBartholomaeus AlexanderBartholomaeus commented Sep 4, 2019

I'm using cutadapt 2.4 and Python 3.6.8 installed with pip3.

When demultiplexing paired-end reads with linked adapters, I noticed that cutadapt favors a shorter partial overlap that includes a mismatch over a longer partial overlap without any mismatch.

The full cutadapt command is the following:
cutadapt --no-indels --no-trim -g file:adaptersFR_FRrevComp_test.fasta -o trimmed-{name}_R1.fastq -p trimmed-{name}_R2.fastq test_R1.fastq test_R2.fastq > output.txt

The adapters (which include barcodes) are identical except for the last 5 nucleotides. They look like this:

1a
AACGCGGTGCCAGCMGCCGCGGTAA...ATTAGAWACCCBDGTAGTCCCGCGTT
17a
AACGCGGTGCCAGCMGCCGCGGTAA...ATTAGAWACCCBDGTAGTCCCCGAGG

However, the 1a results (file: trimmed-1a_R1.fastq) contain sequences like this:
@DE18INS60515:38:000000000-BJ6GC:1:1101:11645:1732 1:N:0:ATTCAGAACTACTGAC
AACGCGGTGCCAGCAGCCGCGGTAATACGGATGGTCCAAGCGTTATCCGTAATCATTGGGTTTAAAGTGTCCGCAGGCGGTCTTTTAAGTCAGAGGTTAAATCCCGTCTCTCAACGACTGACCTGCCTTTGATACTGGTTGACTTGAGTCATATGGATGTAGATAGAATGTCTAGTTTAGCGGTGAAATGCTTAGAGATTACACTTAATACCGATTTCGAAGGCAGTCTACTACGTATTTTCTGACCCTTAGGTACGAATGCCTGGTGAGCGATCCGTATTAGATACCCCTGTAGTCCCC

I marked the interesting regions in bold. As you can see, this read fits 17a better but is assigned to 1a. I know that 1a is not wrong because I allow mismatches, but 17a fits much better because it is an exact match.

Attached you can find all input and output files.
bugreport_cutadapt.zip

Owner

@marcelm marcelm commented Sep 6, 2019

Hi, sorry that I somehow missed your bug report. Thanks for attaching all the necessary files, this makes it easy to reproduce.

At least part of the problem is that you are encountering issue #394: the --no-indels option was ignored for linked adapters. I released Cutadapt 2.5 yesterday, which fixes this, and when I run it on your files, the read above, which was previously assigned to the 1a file, now ends up in the 17a file.

However, I’ll still need to look into this further because even without --no-indels, the second linked adapter should be preferred.
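To illustrate why the read belongs to 17a once indels are disallowed, here is a minimal Python sketch that compares the end of the reported read with the start of each 3' adapter, position by position. It is not Cutadapt's implementation: the helper `ungapped_overlap` and the `IUPAC` table are hypothetical names introduced for illustration, and the sketch assumes that IUPAC wildcard characters in the adapter (W, B, D, ...) match any of the bases they stand for.

```python
# IUPAC nucleotide codes: each adapter character stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def ungapped_overlap(adapter, read, overlap):
    """Compare the last `overlap` bases of the read with the first
    `overlap` bases of the adapter, without indels.

    Hypothetical helper for illustration only. Returns (matches, errors).
    """
    read_end = read[-overlap:]
    adapter_start = adapter[:overlap]
    matches = sum(r in IUPAC[a] for a, r in zip(adapter_start, read_end))
    return matches, overlap - matches


# Last 22 bases of the read shown in the report
READ_END = "ATTAGATACCCCTGTAGTCCCC"
# 3' parts of the two linked adapters from the report
ADAPTER_1A = "ATTAGAWACCCBDGTAGTCCCGCGTT"
ADAPTER_17A = "ATTAGAWACCCBDGTAGTCCCCGAGG"

for name, adapter in [("1a", ADAPTER_1A), ("17a", ADAPTER_17A)]:
    matches, errors = ungapped_overlap(adapter, READ_END, len(READ_END))
    print(f"{name}: {matches} matches, {errors} errors")
```

Under these assumptions, the 22-base overlap gives 21 matches and 1 mismatch for 1a, and 22 matches with no mismatch for 17a, so an ungapped comparison already separates the two adapters by match count.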

@marcelm marcelm closed this in d5cf13a Sep 6, 2019

@marcelm marcelm commented Sep 6, 2019

The criterion that determines which adapter is the best-matching one is simply the number of matches in the alignment.

When allowing indels in the above example, the problem was that the alignments for 1a and 17a were considered equivalent because they both contain 17 matches. In that case, the rule was simply that the first one found wins. Since 1a was listed before 17a in your FASTA file of adapter sequences, 1a was chosen.

Alignment for 1a:

ATTAGAWACCCBDGTAGTCCCGCGTT  (1a adapter)
     ================X=
  ...ATACCCCTGTAGTCCC-C     (read)

Alignment for 17a:

ATTAGAWACCCBDGTAGTCCCCGAGG  (17a adapter)
     =================
  ...ATACCCCTGTAGTCCCC      (read)

I have now fixed this by using the number of errors in the alignment as a tie breaker. That is, if two adapters get the same number of matches in their alignments, the one with the lower number of errors wins. In the above case, this would then correctly prefer 17a over 1a.
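The difference between the old and the new rule can be sketched in a few lines of Python. This is only an illustration of the selection logic described above, not the actual Cutadapt code: `Match`, `best_match_first_wins` and `best_match_with_tie_breaker` are hypothetical names, and the match and error counts are taken from the two alignments shown in this comment.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """Hypothetical summary of one adapter alignment."""
    name: str
    matches: int
    errors: int


def best_match_first_wins(candidates):
    """Old rule: most matches wins; on a tie, the first candidate is kept."""
    best = None
    for m in candidates:
        if best is None or m.matches > best.matches:
            best = m
    return best


def best_match_with_tie_breaker(candidates):
    """New rule: most matches wins; on a tie, fewer errors wins."""
    best = None
    for m in candidates:
        if best is None or (m.matches, -m.errors) > (best.matches, -best.errors):
            best = m
    return best


# Order as in the adapter FASTA file: 1a is listed before 17a.
candidates = [Match("1a", matches=17, errors=1), Match("17a", matches=17, errors=0)]

print(best_match_first_wins(candidates).name)        # 1a
print(best_match_with_tie_breaker(candidates).name)  # 17a
```

Comparing the tuple `(matches, -errors)` keeps the number of matches as the primary criterion and only consults the error count when the match counts are equal.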

Thanks for finding this! This part of Cutadapt has not been changed in a long time, so this behavior has been in place for a while. This change also applies to any other adapter type, by the way, not only linked adapters.

Author

@AlexanderBartholomaeus AlexanderBartholomaeus commented Sep 6, 2019

Thank you!!! That was a fast fix!

I noticed the "first wins" behaviour when I put duplicates in the adapter file. I will open a separate issue suggesting a warning for this (but I think it has low priority).

I also have some more ideas for improving amplicon-related cutting, which I will post later.

Thank you again!

marcelm added a commit that referenced this issue Sep 9, 2019
See #399

@marcelm marcelm commented Sep 9, 2019

Good suggestion about the duplicate adapter warning – I’ve added this now.
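For anyone who wants to check an adapter file for duplicates before running a tool, a check of this kind could look roughly like the following Python sketch. It does not show how the warning is implemented in Cutadapt: the function `find_duplicate_adapters` is a hypothetical helper, the example records are made up, and the parser assumes a simple FASTA layout with exactly one sequence line per record.

```python
import warnings
from collections import defaultdict


def find_duplicate_adapters(fasta_text):
    """Return {sequence: [names]} for sequences that occur more than once.

    Hypothetical helper; assumes one sequence line per FASTA record.
    """
    names_by_sequence = defaultdict(list)
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
        elif name is not None:
            names_by_sequence[line.upper()].append(name)
    return {seq: names for seq, names in names_by_sequence.items() if len(names) > 1}


# Made-up example records; "1a_copy" repeats the sequence of "1a".
example = """>1a
AACG...ATTA
>17a
AACG...CCGA
>1a_copy
AACG...ATTA
"""

for sequence, names in find_duplicate_adapters(example).items():
    warnings.warn(f"Adapter sequence {sequence} is listed more than once: {names}")
```

With a "first one found wins" rule, only the first of the duplicated names would ever receive reads, which is why a warning is useful here.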
