New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make --discard-untrimmed work with linked adapters #256

Closed
GrahamHamilton opened this Issue Aug 31, 2017 · 11 comments

Comments

Projects
None yet
3 participants
@GrahamHamilton

GrahamHamilton commented Aug 31, 2017

I am trying to find and trim two adapters from a set of single end reads. I only want reads that have both of the primers present. I am using this "-a adapt1...adapt2$" to define the two adapters and to only trim reads that have both adapters present. However, this results in none of the reads being trimmed. When I omit the "$" at the end of adapt2 the reads are trimmed but reads with only the 5' adapter are included in the output. Am I doing something wrong? Here is the command.
cutadapt -a ATAGCGTTGCTCGAGTACTAACTGGTACCTCTTCTTTTTTTTCTGCAG...CCCTGGTGAATCGCATCG$ -o HiTTs_test_barcodes.fastq HiTTs_test.fastq
And here is my test fasta file
@NS500227:67:HTVFWBGX2:2:12104:24081:17311:ATTACTCG+TAAGATTA ATAGCGTTGCTCGAGTACTAACTGGTACCTCTTCTTTTTTTTCTGCAGNNNSNNNRNNNWCCCTGGTGAATCGCATCGTAGACGCGTATCGAATTCCGCTCGAGATAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTG + AAAAAEEEEEEEEEEEEEEAEAEEEEEEEEEEEEAEEEEEEE6EEEEAEEEEE/EE6/EEEEAEEEEE/EEEBBBEEEEEAEEEEEEEEAEAA/EEEEEEEEEE/EE/EEEEEEEEEEEEEEEEE/EEEEEEEEEAEEAAEAAAAA @NS500227:66:HTVFWBGX2:2:12104:24081:17311:ATTACTCG+TAAGATTA ATAGCGTTGCTCGAGTACTAACTGGTACCTCTTCTTTTTTTTCTGCAGNNNSNNNRNNNWCCCTGGTGGGTCGCATCGTAGACGCGTATCGAATTCCGCTCGAGATAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTG + AAAAAEEEEEEEEEEEEEEAEAEEEEEEEEEEEEAEEEEEEE6EEEEAEEEEE/EE6/EEEEAEEEEE/EEEBBBEEEEEAEEEEEEEEAEAA/EEEEEEEEEE/EE/EEEEEEEEEEEEEEEEE/EEEEEEEEEAEEAAEAAAAA

@marcelm

This comment has been minimized.

Owner

marcelm commented Aug 31, 2017

Hm, that’s how it’s designed at the moment: Only the anchored adapters are required to exist.

(You are also using --discard-untrimmed, right? Otherwise no reads would be filtered.)

I’ll briefly reiterate what the $ character does: It enforces that the adapter (in this case the 3' half of the linked adapter) is only found if it is at the end of the read. In your example, there are nucleotides after that adapter sequence in both reads, so it is not found. (The 5' half is anchored by default, but then your reads actually start with it, so everything is fine here.)

If you omit the $, the 3' adapter may occur anywhere in the read. For linked adapters, anywhere can also mean outside of the read. In other words: When the 3' half of a linked adapter is not found, the assumption is that the sequencer is not seeing it because the DNA fragment is too long, and then - as long as the 5' adapter was found - the read is still considered to be trimmed (and therefore not thrown away by --discard-untrimmed).

I think you have at least two options:

  • Are the nucleotides after the 3' adapter constant? Then you can just add them to the 3' adapter sequence and keep the $.
  • Use two passes over the data: One with --discard-untrimmed -g ^ATAGCGTTGCTCGAGTACTAACTGGTACCTCTTCTTTTTTTTCTGCAG (the 5' adapter) and one with --discard-untrimmed -a CCCTGGTGAATCGCATCG. You can combine that into a pipe to make it more efficient:
    cutadapt --discard-untrimmed -g ^ATAGCGTTGCTCGAGTACTAACTGGTACCTCTTCTTTTTTTTCTGCAG in.fastq | \
    cutadapt --discard-untrimmed -a CCCTGGTGAATCGCATCG -o out.fastq -
    
@DarioS

This comment has been minimized.

DarioS commented Nov 14, 2017

This is a duplicate request as #224 is wishing for the same feature. Hopefully, it can be executed as a single command in version 1.15.

@marcelm

This comment has been minimized.

Owner

marcelm commented Nov 14, 2017

I’m in the middle of finishing #270, so I need to do that first. So 1.15 will focus on multi-core, and I will probably not be able to do anything about this issue or #224 in that version.

@marcelm

This comment has been minimized.

Owner

marcelm commented Nov 14, 2017

One question: At the end of a read, should partial matches of the 3' adapter be allowed? That is, if the adapter is MYADAPTER and the end of the read is ...SEQUENCEMYADAP, do you want that to count as an adapter match? (As it does for normal 3' adapters.)

The biggest problem I usually have with changes like the one in this isse is to decide how the user interface should look. In this case, I think it would be best to not add a command-line option, but to extend the syntax in which adapters are specified.

If partial matches are not necessary, one could allow something like this: -a ACGT...TTGGAA%
The % would be mark the sequence to its left as "required for a match".

@DarioS

This comment has been minimized.

DarioS commented Nov 14, 2017

For my use case of extracting CRISPR gRNAs, the partial matches at the 3' side are necessary. I don't know what kind of data Graham has.

@GrahamHamilton

This comment has been minimized.

GrahamHamilton commented Nov 16, 2017

@marcelm

This comment has been minimized.

Owner

marcelm commented Nov 18, 2017

Yes, that is possible. Exactly the cases you describe are covered if you specify a “non-anchored linked adapter” in the following way: -g MyAdapter1...MyAdapter2. The trimming will work as you want and leave only TargetSequence.

However, what does not work at the moment is to use --discard-untrimmed with that option. No reads are discarded at all.

@GrahamHamilton

This comment has been minimized.

GrahamHamilton commented Nov 20, 2017

@marcelm marcelm changed the title from Linked adapters issue to Make --discard-untrimmed work with linked adapters Nov 20, 2017

@marcelm marcelm closed this in be388e4 Nov 20, 2017

@marcelm

This comment has been minimized.

Owner

marcelm commented Nov 20, 2017

I’ve just committed a fix: When you use -g to specify a linked adapter, both adapters are now required to be found in the read. If only one of them is found, 1) the read is not trimmed at all, and 2) --discard-untrimmed will throw the read away.

@DarioS @GrahamHamilton Please let me know if this is what you wanted. Thanks for the feedback; I hope this way cutadapt becomes more useful in CRISPR screening studies!

This will be in cutadapt 1.15, which I plan to release this week.

@DarioS

This comment has been minimized.

DarioS commented Nov 24, 2017

I have tested the new release of cutadapt and it works as expected on my CRISPR guides dataset.

@marcelm

This comment has been minimized.

Owner

marcelm commented Nov 24, 2017

Perfect, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment