Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why is cutadapt discarding some paired end reads? #556

Closed
gabrielet opened this issue Aug 18, 2021 · 5 comments
Closed

Why is cutadapt discarding some paired end reads? #556

gabrielet opened this issue Aug 18, 2021 · 5 comments
Labels
waiting waiting for a reply

Comments

@gabrielet
Copy link

I asked this same question in StackExchange Bioinformatics (link here), but I am posting it since I am not seeing any reply.

Sorry for the very long post.

So problem is that cutadapt is removing a read (actually more than one) but I found something unexpected.

So, I am using cutadapt to remove two primers from some reads.

I have 2 files for each sample, i.e. sample_R1.fastq and sample_R2.fastq since the reads are paired end.

I have primer_1 and primer_2 and both of them can be found in sample_R1.fastq and sample_R2.fastq. Usually, I have half sample_R1.fastq reads starting with primer_1 and half starting with primer_2. Same for sample_R2.fastq, i.e. I can find both primers.

To properly trim all the primers what I do is the following, through R:

setOne <- paste("-g", primer_1, "-a", primer_1)
setTwo <- paste("-G", primer_2, "-A", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

The command works fine except for one thing. On the one hand cutadapt should remove those sequences which, after the trimming, will have length < 1, as specified by the command. However, some reads are removed even though they are longer than 1 and I would like to know why this happens, since I can't find a specific reference mentioning that cutadapt removes sequences using different parameters rather than the length.

If I am getting this straight, cutadapt only removes the primer sequence from the beginning of the read, then removes the equivalent characters in the quality field of the read that was trimmed and, finally, it returns the trimmed read with the quality field adjusted to the new length. If the length is lower than the threshold, the read is discarded. If it is discarded in R1, it will be discarded in R2 and viceversa (since we are dealing with paired end reads and since I specified two inputs and two outputs in the command)

I give an example of read that is removed, this one from R1:

@some:sample:number 1:N:0:1
ACCGTCACGTCATCAGCAGCATCGACTCCTGTTTGATCCCCGTGCCTGTACTCAAGTCAACCAGTACACGCTTTTCAAGCTGCCTTCTGCCGTATTTCACACTCAGCGCATTTGCAATTGGTGTTCTTTGCACCACATATCTATCAGTTAAGCCACCGCTACTCCGCCTACCTCGTACTGACTTAATCGACCGCCTACGCACCCTTTAAACCCAATTCGAGACCGTCAATTCAATCATACTTTAGTAAGTG
+
BBBBBBBBBBBBGGGGHEGHFHA@EGGGHHHHHHHHHGHHHHHHHHGGGGGHGHGHGGGHHHHHGDGGGFHGGGGHGGGGGGHGHGFHHHHHHHHHHGGGHHHGHGHHHHHHHGFFGGGGGGGGGGGGGGGGGFFBDFDFFFFFFFBFFFFFFFFFHGHGGHHHHHEGGGGGHHHHGHHHHHGHHHHHHHHHHHHGGGHGHHGGGGEFGGHHHHHHHHHHHHHGHHHHHHHHHHHFHFGHHHHGGGGGHHG

this one from R2:

@some:sample:number 2:N:0:1
CTGCTGCTATCACTAGCTAAAGCTTTAACGTAGGGCAATTGACGAAGGTGAAATGCATAGATATGTGATGACGGTGTAGCGTTGGCGCTGGGGTTTAAAGGGTGCGTAGTGACTTCACAGAACACCAATTGGCAGCTTACTAAAGCATGAGTACAGACGAGGTAGGCGGAATTGCGGTCGATTAAGTCAGTTGTGAAATACGGCAGCTTAACTGTTGGCAAGCGTTGTCCGGCACTGGTTTTATTCGAGGT
+
AAAAA@BFFFFFGFGCGGFDDC/CDHF2FGFHHG//<<CCC1FGHH1<1>11C<FF/0BFFFFFFF/:DEEEFFHGG0AEG/B9B./-.@9GHGEFC>EE<FHEHEGHDHGGFHGGBGGGHGGDGH0CDDGFFHHCC--.:CGCFG??GGAEGGHFHHBFFFEFB./B..;/;:/999=AAA../BB/;//////9/EFFCBDAA..990ADD=99/;/BFGGGGGGGGGFHHHHHHGHDGHHFFDHHHHG

and these are the primers that I am removing:

  • ACCGTCACGTCATCAGCAGCATCGAC
  • CTGCTGCTATCACTAGCTAAAGCTTTA

matching the reads I provided. Clearly, if I trim them manually, the reads length is way greater than 1 so, why are these reads removed from the final cutadapt output file?


EDIT to the StackExchange Bioinformatics question:

I have the feeling that the command I posted above is not the one I should have used in the first place.

What I am doing now is to perform the trimming in two steps:

# here we remove primer_1 from R1 and primer_2 from R2
setOne <- paste("-g", primer_1)
setTwo <- paste("-G", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

Then, using the output of the above command, I am simply switching the primers and then let cutadapt remove them:

# here we remove primer_2 from R1 and primer_1 from R2
setOne <- paste("-g", primer_2)
setTwo <- paste("-G", primer_1)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

If I follow such pipeline, I am able to keep the above mentioned read, i.e. it is not discarded anymore and I guess this is the right approach to solve my problem, i.e. dealing with two primers that can be both found in R1 and R2.

I just wanted to make sure I am getting this right, since I am quite new to these kind of analyses.

Also, I am wondering what happens when cutadapt searches for a primer, like in my first command, the reverse complement of primer_1 (or primer_2) but it can't find a match. Is cutadapt discarding reads if a match is not found?
I guess it is: the read that is discarded, as mentioned above, in the example I provided contains primer_2 but, if I run cutadapt in its first (and I believe wrong) version then primer_2 is not found in sample_R1.fastq. Indeed, cutadapt is searching for primer_1 and reverseComplement(primer_2) in sample_R1.fastq. But, cutadapt can't find any of them (the read in question contains primer_2 but belongs to sample_R1.fastq). At this point I have the feeling cutadapt is removing the reads since it does not match anything. Is this what is going on?

But if this what is going on...why does it not happen the same way for the two other commands, i.e. the two-step analysis?

@marcelm
Copy link
Owner

marcelm commented Aug 24, 2021

Hi,

in summary: the second command you gave is not quite correct, but should mostly work, and it can be simplified.

So, I am using cutadapt to remove two primers from some reads.

you didn’t quite describe how your sequencing library is structured, but I think I can deduce that you have a single amplicon to which the sequencing adapters were attached in a random orientation.

Sorry if this is already obvious to you, but if the above is correct, then 50% of the molecules in the sequencing library look like this:

5'sequencing_adapter | primer_1 | target_dna | revcomp_of_primer2 | 3'sequencing_adapter

After paired-end sequencing, you get reads that are truncated versions of the following:

R1: primer1 | target_dna | revcomp_of_primer2 | 3'sequencing_adapter
R2: primer2 | revcomp_of_target_dna | revcomp_of_primer1 | revcomp_of_5'_sequencing_adapter

The truncation is of course due to read length. Possibly you only see the first primer and then parts of the target DNA.

And the other 50% of the molecules look like this:

5'sequencing_adapter | primer2 | revcomp_of_target_dna | revcomp_of_primer1 | 3'sequencing_adapter

Which after sequencing results in some truncated version of this:

R1: primer2 | revcomp_of_target_dna | revcomp_of_primer1 | 3'sequencing_adapter
R2: primer1 | target_dna | revcomp_of_primer2 | revcomp_of_5'sequencing_adapter

For trimming, I’ll assume that the DNA sequence of interest is so long that you only see the primer at the beginning and then part of the DNA sequence of interest. That makes everything simpler, and you also assume that in the commands that you gave.

Looking only at R1 for now, you want Cutadapt to remove primer1 or primer2. You can achieve that like this:

cutadapt -g ^primer1 -g ^primer2 ...

If you provide multiple adapters, Cutadapt searches for both in each read and removes only the best matching one.

Note the leading ^, which is used to provide an anchored 5' adapter. If you know that your reads start with the primer sequence, you should use it to reduce the number of false positives.

To also trim R2, you just add the same options in their uppercase versions:

cutadapt -g ^primer1 -g ^primer2 -G ^primer1 -G ^primer2 ...

You can then add --minimum-length as desired. You shouldn’t use -n 2 as it’ll try to remove more than one adapter/primer from each read.

To properly trim all the primers what I do is the following, through R:

setOne <- paste("-g", primer_1, "-a", primer_1)
setTwo <- paste("-G", primer_2, "-A", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

The command works fine except for one thing. On the one hand cutadapt should remove those sequences which, after the trimming, will have length < 1, as specified by the command. However, some reads are removed even though they are longer than 1 and I would like to know why this happens, since I can't find a specific reference mentioning that cutadapt removes sequences using different parameters rather than the length.

The command above uses -a to provide a 3' adapter. When Cutadapt finds a 3' adapter, it removes the adapter and everything following it. Since in your case the adapters/primers actually occur at the beginning of each read, the read is empty afterwards.

If the length is lower than the threshold, the read is discarded. If it is discarded in R1, it will be discarded in R2 and viceversa (since we are dealing with paired end reads and since I specified two inputs and two outputs in the command)

Yes. If a filtering criterion applies to R1 or R2, the pair is discarded (you can change that with --pair-filter=... if you wanted, see docs).

[...]

matching the reads I provided. Clearly, if I trim them manually, the reads length is way greater than 1 so, why are these reads removed from the final cutadapt output file?

The example matches what I said above: The reads start with the primer, so -a trims the read to length zero.

EDIT to the StackExchange Bioinformatics question:

I have the feeling that the command I posted above is not the one I should have used in the first place.

What I am doing now is to perform the trimming in two steps:

# here we remove primer_1 from R1 and primer_2 from R2
setOne <- paste("-g", primer_1)
setTwo <- paste("-G", primer_2)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

Then, using the output of the above command, I am simply switching the primers and then let cutadapt remove them:

# here we remove primer_2 from R1 and primer_1 from R2
setOne <- paste("-g", primer_2)
setTwo <- paste("-G", primer_1)
system2(cutadapt, args = c(setOne, setTwo, "-n", 2, 
                           "-o", outOne, "-p", outTwo,
                            inputOne, inputTwo, "--minimum-length", 1))

If I follow such pipeline, I am able to keep the above mentioned read, i.e. it is not discarded anymore and I guess this is the right approach to solve my problem, i.e. dealing with two primers that can be both found in R1 and R2.

As I described above, you can simplify this to a single command, and you should omit -n 2.

I just wanted to make sure I am getting this right, since I am quite new to these kind of analyses.

Also, I am wondering what happens when cutadapt searches for a primer, like in my first command, the reverse complement of primer_1 (or primer_2)

Cutadapt does not search for reverse complements by default.

but it can't find a match. Is cutadapt discarding reads if a match is not found?

The read remains unchanged. Unless you provide filtering options (such as --discard-trimmed, --minimum-length etc.), all reads are kept.

@gabrielet
Copy link
Author

gabrielet commented Aug 26, 2021

Thank you for your reply. I am working on it but I need to wait for some info about the data-set since what you said is a bit unexpected. Once I have all the info, I will get back to the thread. I want to make a useful post for other people, too but I need more info. If I may...I consider this thread not closed yet.

@marcelm
Copy link
Owner

marcelm commented Aug 26, 2021

Keep in mind I was only guessing your library layout, and it may very well be different.

@marcelm marcelm added the waiting waiting for a reply label Sep 8, 2021
@gabrielet
Copy link
Author

There we go. I finally got an idea about how the whole library was built and as far as I understood it should be the way you described it. The problem is that I am basically finding primers everywhere. All the sequences have their primer at the beginning, either forward or reverse. But, it is also possible to find a primer, either in its forward and/or reverseComplement version, inside some of the sequences.

What I did was to run several steps of cutatapt, in order to get rid of all of them since, as you mentioned, cutadapt removes only the best matching one. So I needed to run a first round to remove the expected primers from the beginning of the sequences. Then I removed primers in their original form from the inside of the sequences by doing the following:

# step two: remove remaining forward or reverse primers
for (smpl in c(1:nrow(infoSample))) {

	print(paste0("STEP TWO - processing sample ", smpl))

	# get samples name from step one
	sNameROne <- paste0(cutaOne, infoSample$sampleID[smpl], "_CD18_L000_R1_001.fastq")
	sNameRTwo <- paste0(cutaOne, infoSample$sampleID[smpl], "_CD18_L000_R2_001.fastq")

	# create filenames for the cutadapt-step-two outputs
	fwdCut <- paste0(cutaTwo, paste0(infoSample$sampleID[smpl], "_CD18_L000_R1_001.fastq"))
	revCut <- paste0(cutaTwo, paste0(infoSample$sampleID[smpl], "_CD18_L000_R2_001.fastq"))
	
	# remove remaining forward or reverse primers
	system2(cutadapt, args = c("-g", "ACGTAGTGTGCCAGCMGCCGCGGTAA", "-g", "TATGGTGTGCCAGCMGCCGCGGTAA", "-g", "ACACGGTGTGCCAGCMGCCGCGGTAA", "-g", "GTAGGTGTGCCAGCMGCCGCGGTAA", "-g", "CTGATAGTGTGCCAGCMGCCGCGGTAA", "-g", "CGATACGTGTGCCAGCMGCCGCGGTAA", "-g", "TGCTGTGTGCCAGCMGCCGCGGTAA", "-g", "CGCAGAGTGTGCCAGCMGCCGCGGTAA", "-g", "GACGCAGTGTGCCAGCMGCCGCGGTAA", "-g", "AGATGTGTGCCAGCMGCCGCGGTAA", "-g", "ACTACAGTGTGCCAGCMGCCGCGGTAA", "-g", "GATCTGTGTGCCAGCMGCCGCGGTAA", "-g", "ACGTACCGGACTACHVGGGTWTCTAAT", "-g", "TATGCCGGACTACHVGGGTWTCTAAT", "-g", "ACACGCCGGACTACHVGGGTWTCTAAT", "-g", "GTAGCCGGACTACHVGGGTWTCTAAT", "-g", "CTGATACCGGACTACHVGGGTWTCTAAT", "-g", "TGCTCCGGACTACHVGGGTWTCTAAT", "-g", "CGCAGACCGGACTACHVGGGTWTCTAAT", "-g", "CGATACCCGGACTACHVGGGTWTCTAAT", "-g", "GACGCACCGGACTACHVGGGTWTCTAAT", "-g", "AGATCCGGACTACHVGGGTWTCTAAT", "-g", "ACTACACCGGACTACHVGGGTWTCTAAT", "-g", "GATCTCCGGACTACHVGGGTWTCTAAT", "-G", "ACGTAGTGTGCCAGCMGCCGCGGTAA", "-G", "TATGGTGTGCCAGCMGCCGCGGTAA", "-G", "ACACGGTGTGCCAGCMGCCGCGGTAA", "-G", "GTAGGTGTGCCAGCMGCCGCGGTAA", "-G", "CTGATAGTGTGCCAGCMGCCGCGGTAA", "-G", "CGATACGTGTGCCAGCMGCCGCGGTAA", "-G", "TGCTGTGTGCCAGCMGCCGCGGTAA", "-G", "CGCAGAGTGTGCCAGCMGCCGCGGTAA", "-G", "GACGCAGTGTGCCAGCMGCCGCGGTAA", "-G", "AGATGTGTGCCAGCMGCCGCGGTAA", "-G", "ACTACAGTGTGCCAGCMGCCGCGGTAA", "-G", "GATCTGTGTGCCAGCMGCCGCGGTAA", "-G", "ACGTACCGGACTACHVGGGTWTCTAAT", "-G", "TATGCCGGACTACHVGGGTWTCTAAT", "-G", "ACACGCCGGACTACHVGGGTWTCTAAT", "-G", "GTAGCCGGACTACHVGGGTWTCTAAT", "-G", "CTGATACCGGACTACHVGGGTWTCTAAT", "-G", "TGCTCCGGACTACHVGGGTWTCTAAT", "-G", "CGCAGACCGGACTACHVGGGTWTCTAAT", "-G", "CGATACCCGGACTACHVGGGTWTCTAAT", "-G", "GACGCACCGGACTACHVGGGTWTCTAAT", "-G", "AGATCCGGACTACHVGGGTWTCTAAT", "-G", "ACTACACCGGACTACHVGGGTWTCTAAT", "-G", "GATCTCCGGACTACHVGGGTWTCTAAT", "-o", fwdCut, "-p", revCut, sNameROne, sNameRTwo, "--minimum-length", 200, "--quiet"))
}

Finally, I did the same but using the reverse complement of the sequences above, to get rid of reverse complement primers that I also found inside the sequences.

I guess these problems are a consequence of the fact that the amplification was done many years ago and probably the technique worked quite poorly.

Anyway, thank you for your suggestions!

@marcelm
Copy link
Owner

marcelm commented Sep 24, 2021

Great, it appears you figured it out now :-) (I cannot judge whether your script is correct, but it looks fine.)

The first library ever from which I needed to remove primers, which actually prompted me to start work on Cutadapt, was also of quite low quality. The 5' adapters were found at the 3' end, there were multiple consecutive adapter occurrences (primer dimers) and stuff like that. So I can relate a bit.

Thank you for getting back to me, and good luck with your project.

@marcelm marcelm closed this as completed Sep 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting waiting for a reply
Projects
None yet
Development

No branches or pull requests

2 participants