handling partial adapter spanning 5' to 3' #152

Closed
multicode opened this Issue Oct 6, 2015 · 6 comments

multicode commented Oct 6, 2015

Hi,

I have a case where I want to trim the adapter CGCCTTGGCCGTACAGCAG from the 3' end of the read. The complete form of the adapter is <adaptor><barcode-sequence><otherP2adaptor>. Cutadapt didn't handle sequences that look like this:

106:TACAGCAG CCTCTTACACAGAGATAGACCTGTCAGTTGCGACCGGTCATCCAGGACTTACC
174:TACAGCAG CCTCTTACACAGAGAAGCTCCCTGAATGCAGCACGGGGACCTGTCTATACCATCAT

I intentionally left a gap in the sequences above so that you can easily spot the adapter portion. In this case, I want the entire read to be removed. The command I used doesn't handle this:

time zcat F3.fastq.gz | fastx_clipper -a NNNNN -M 1 -C | cutadapt \
-a CGCCTTGGCCGTACAGCAG \
-a "A{10}" -a "T{10}" -g "A{10}" -g "T{10}" \
--minimum-length 25 --prefix DA:FC1:Lib17:Lane1: --output F3_aptrRmvd.fastq.gz -

As you can see in the command above, I am using fastx_clipper to remove any reads that contain Ns. It would be great if cutadapt had this feature built in.

I would be very happy if you could help me with a solution to handle this case.

Thanks a lot! :)

Owner

marcelm commented Oct 7, 2015

Interesting, do you have an idea why the adapters are degraded in such a way? Just to make sure: Is it correct that the untrimmed read starts with TACAGCAG? Or has that already been trimmed in some way?

I’m travelling at the moment and cannot test it, but if the degraded adapter is always TACAGCAG, then, as a workaround, you could add -a TACAGCAG to your options. Cutadapt will search for all given adapter sequences in each read and remove the longest one it finds, so this could work.

An option to discard reads that contain N characters should be easy to add and sounds useful.

multicode commented Oct 7, 2015

> do you have an idea why the adapters are degraded in such a way?

They are the reads from SOLiD 5500XL 😉

> An option to discard reads that contain N characters should be easy to add and sounds useful.

I just saw that cutadapt already has this feature. Using --max-n 1 should probably fix my case, but I suppose this will leave sequences containing one N. If I want no Ns at all, would --max-n 0 work?

> Is the degraded adapter always TACAGCAG?

No, it could be any of these. I've tested these scenarios.

CGCCTTGGCCGTACAGCAG
 GCCTTGGCCGTACAGCAG
  CCTTGGCCGTACAGCAG
   CTTGGCCGTACAGCAG
    TTGGCCGTACAGCAG
     TGGCCGTACAGCAG
      GGCCGTACAGCAG
       GCCGTACAGCAG
        CCGTACAGCAG
         CGTACAGCAG
          GTACAGCAG
           TACAGCAG
... and so on ...
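
To make the list above concrete, here is a minimal Python sketch that enumerates the adapter suffixes and checks whether a read begins with one of them. It is only an illustration and not how Cutadapt works internally: it uses exact string matching, whereas Cutadapt tolerates mismatches, and the function names and the `min_length=8` cutoff are hypothetical choices made for this example.

```python
ADAPTER = "CGCCTTGGCCGTACAGCAG"


def adapter_suffixes(adapter, min_length=8):
    """Return all suffixes of `adapter` with at least `min_length` bases, longest first.

    `min_length` is an assumed cutoff for this sketch, not a Cutadapt default.
    """
    return [adapter[i:] for i in range(len(adapter) - min_length + 1)]


def leading_partial_adapter(read, adapter, min_length=8):
    """Return the longest adapter suffix that `read` starts with, or None.

    Exact matching only; no mismatches or indels are allowed here.
    """
    for suffix in adapter_suffixes(adapter, min_length):
        if read.startswith(suffix):
            return suffix
    return None


# First example read from the report, with the visual gap removed.
read = "TACAGCAGCCTCTTACACAGAGATAGACCTGTCAGTTGCGACCGGTCATCCAGGACTTACC"
print(leading_partial_adapter(read, ADAPTER))
```

With the default cutoff, `adapter_suffixes(ADAPTER)` yields the twelve sequences listed above, from the full 19-base adapter down to TACAGCAG.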

When you use the -g option for PREFIX+ADAPTER+MYSEQUENCE, everything that is PREFIX+ADAPTER will be trimmed. To handle the above case, you could provide an option like --gs ADAPTER that trims ADAPTER+SUFFIX from the 5' end, because in this special case the suffix can consist of the barcode and other adapter sequences.

To complete this case, you could also add the --ap ADAPTER option to trim PREFIX+ADAPTER from the 3' end.

This essentially means that you are (almost) removing the entire read, which can then be filtered with cutadapt's --minimum-length option.
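
The idea can be sketched in a few lines of Python. This is a hypothetical illustration of the proposed behavior, not Cutadapt's implementation: matching is exact, `min_overlap=8` is an assumed threshold, and the function names are made up for this example. The minimum length of 25 is the value from my command above.

```python
ADAPTER = "CGCCTTGGCCGTACAGCAG"


def trim_3prime_adapter(read, adapter, min_overlap=8):
    """Remove a 3' adapter and everything after it (exact matches only).

    Besides the usual cases, a read that *starts* with a suffix of the
    adapter is treated as a hit: the adapter began upstream of the read,
    so nothing of the insert is left and the read is trimmed to "".
    """
    # Case 1: the full adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read starts with a partial adapter (a proper suffix of it).
    for i in range(1, len(adapter) - min_overlap + 1):
        if read.startswith(adapter[i:]):
            return ""
    # Case 3: the read ends with a partial adapter (a proper prefix of it).
    for n in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read


def keep_read(read, minimum_length=25):
    """Mimic a minimum-length filter applied after trimming."""
    return len(read) >= minimum_length


read = "TACAGCAGCCTCTTACACAGAGATAGACCTGTCAGTTGCGACCGGTCATCCAGGACTTACC"
trimmed = trim_3prime_adapter(read, ADAPTER)
print(repr(trimmed), keep_read(trimmed))
```

For the example read, trimming leaves an empty sequence, so the length filter discards it, which is the outcome I am after.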

What are your comments about this?

Owner

marcelm commented Oct 7, 2015

> I just saw that cutadapt already has this feature. Using --max-n 1 should probably fix my case, but I suppose this will leave sequences containing one N. If I want no Ns at all, would --max-n 0 work?

You are right, that functionality already exists. I forgot about it because it was contributed externally. --max-n 0 will work for what you want to accomplish.
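
As a rough illustration of the difference between the two settings, the following Python sketch filters reads by their number of N bases. It assumes the option discards reads with more than the given count of Ns and handles only whole-number thresholds; the function names are hypothetical, and the Cutadapt documentation remains the reference for the option's exact semantics.

```python
def n_count(sequence):
    """Count ambiguous bases (N or n) in a read."""
    return sequence.upper().count("N")


def filter_max_n(reads, max_n):
    """Keep only reads with at most `max_n` N bases.

    Simplified stand-in for an N-count filter; integer thresholds only.
    """
    return [read for read in reads if n_count(read) <= max_n]


reads = ["ACGTACGT", "ACGNACGT", "ANGNACGT"]
print(filter_max_n(reads, 1))  # reads with zero or one N survive
print(filter_max_n(reads, 0))  # only reads without any N survive
```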

> What are your comments about this?

I won’t use the exact syntax you suggested (because options are allowed to be only single characters), but the functionality will be what you described.

multicode commented Oct 7, 2015

Yeah, true. My mistake. The syntax should use double hyphens when the option name is longer than a single character.
So, when can we expect a release with this functionality 😉?
Thanks!

Owner

marcelm commented Oct 7, 2015

Patches welcome if you are in a hurry ...

marcelm closed this in 4e1a1e0 on Aug 29, 2018

Owner

marcelm commented Aug 29, 2018

Thank you for your suggestion! It is late, and possibly no longer relevant for you, but it is now possible to specify that a 3' adapter should also be found when it occurs partially at the 5' end. Please see the new section in the documentation.

This will be in Cutadapt 1.18.
