Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-anchored linked adapters #224

Closed
DarioS opened this issue Dec 15, 2016 · 12 comments
Closed

Non-anchored linked adapters #224

DarioS opened this issue Dec 15, 2016 · 12 comments

Comments

@DarioS
Copy link

DarioS commented Dec 15, 2016

I have CRISPR reads where there is a 5' promoter sequence and a 3' plasmid sequence. I was interested in the --untrimmed-output option, but wasn't sure how it works exactly from the information in the user guide. I want to retain only reads which had both 5' and 3' trimming done, not one flanking sequence but not the other . If both -g and -a are provided, what does --untrimmed-output really mean? Could there also be a way for -a to enforce that both adapters must be found if the linked adapter pattern is used? Currently, "... the 3’ adapter is optional".

@marcelm
Copy link
Owner

marcelm commented Dec 15, 2016

--untrimmed-output is for reads in which no adapters were found and which consequently remained untrimmed. I have now changed the help text from

Write reads that do not contain the adapter to FILE

to

Write reads that do not contain any adapter to FILE

The next version of cutadapt will allow to anchor the 3'-half of a linked adapter by appending a $ character. In that case, it is no longer optional.

@DarioS
Copy link
Author

DarioS commented Dec 16, 2016

That's good, but creates an inconsistency in the user experience. For the 5' adapter, it is anchored without any special characters because;

Note that FIRST is always an anchored 5’ adapter (see the previous section) although there is no ^ character in the beginning.

Would a better default be to have unanchored adapters by default, unless the user specifies ^ for the 5' adapter and $ for the 3' adapter, to make it consistently optional for both of them, rather than anchoring by default for one and optional for the other, as it now is in the development version?

Also, have you considered when both of the adapters are not anchored to the beginning or end of a read and still having the ability to filter by both adapters being matched? I notice that cutadapt is becoming popular for CRISPR data preprocessing and this would be a convenient option to provide users with.

@marcelm
Copy link
Owner

marcelm commented Dec 16, 2016

I cannot change the default because the program needs to be backwards compatible. Linked adapters were added to cutadapt in version 1.10 already and some people surely rely on the current behavior.

Regarding the inconsistency, there is a secret plan behind it: I intended to move away from -g and use only -a for all adapter types. Already now, you can write an anchored 5' adapter as -a ADAPTER... (no ^). (This is not really documented anywhere.) If you do it this way, then it’s consistent again!

I have to admit I still don’t know why non-anchored 5' adapters appear at all. I also have no experience with CRISPR data. Perhaps you can write a few sentences about how CRISPR reads looks like? For example, when you want a non-anchored 5' adapter, is it because the 5' adapter has some bases preceding it, or is it because it may appear in a degraded form?

@marcelm
Copy link
Owner

marcelm commented Dec 16, 2016

Here is an idea that is backwards compatible: I could add the syntax -g ADAPTER...ADAPTER for linked adapters where by default neither the 5' nor the 3' adapter is anchored. -a ADAPTER...ADAPTER would be equivalent to -g ^ADAPTER...ADAPTER.

@marcelm marcelm changed the title Clarify Functionality of --untrimmed-output Option With Two Adapters Non-anchored linked adapters Dec 16, 2016
@DarioS
Copy link
Author

DarioS commented Dec 16, 2016

I like that idea.

The reason both can be unancheroed, is because they're not really adapters but some complicated plasmid construct. The basic structure of a CRISPR GeCKO library short read is SOMETHINGELSEU6PROMOTERguideRNAPLASMIDSOMETHINGELSE.

So, the flanking sequences (i.e. adapters in cutadapt terminology) don't acutually occur at the beginning nor the end of the read. The most popular pipeline, MAGeCK, recommends using cutatapt for extracting guideRNA which is in between U6PROMOTER and PLASMID. The purpose of CRISPR analysis is to count the number of times each guideRNA sequence occurs in each sample and the statistically compare them.

So, you might get a lot more of these requests next year. It is not the purpose that cutadapt was intended for, but it could result in lots of citations as CRISPR increases in popularity. Currently, others use grep before cutadapt to filter out the sequences without both flanking sequences present, but grep can't handle indels and 1 mismatch at any position which cutadapt already can, so this could be made into a cutadapt one-liner.

@marcelm
Copy link
Owner

marcelm commented Dec 19, 2016

Thanks for the explanation! I need a handful of reads from a real dataset for testing. Do you know of a suitable (publicly available) dataset or would you be willing yourself to "donate" a few reads? They would become part of cutadapt’s test suite so it must be ok for them to become public.

@DarioS
Copy link
Author

DarioS commented Dec 20, 2016

I have sent the reads by e-mail.

@marcelm
Copy link
Owner

marcelm commented Jan 24, 2017

I’ve implemented this now. The syntax is as described above (-g ADAPTER1...ADAPTER2) . I tested it on the reads you sent me, but have not included them in cutadapt’s test data set. I just made up a few reads that look like yours, and a few other variations.

This will be part of cutadapt 1.13, but it would be great if you could test it before the release. Let me know if you don’t know how to install from the Git repo. I’ve also marked the feature as 'tentative', which means that I may change it in a backwards-incompatible way in 1.14. But after that, it will be backwards compatible.

Let me know what you think!

P.S: I just saw that the statistics are a bit weird. I’ll fix that later.

@DarioS
Copy link
Author

DarioS commented Jan 25, 2017

Yes, it works as expected on a complete input file. I don't mind updating my scripts for software interface changes.

@DarioS
Copy link
Author

DarioS commented Mar 17, 2017

But, when I also use -e, it doesn't work as expected. If only one of the two adapters is below the error threshold, the read is reported as having the adapters. It should require both segments in the read to have less differences to the provided sequences to be considered a match.

@marcelm
Copy link
Owner

marcelm commented Mar 17, 2017

I see. This is not caused be -e, but comes from the way non-anchored adapters work: There is no requirement that they must occur in full in the read. Bases can be missing at the 5' end in the case of 5' adapters, or at the 3' end in the case of 3' adapters, and it is even fine that all bases are missing - that is, that the adapter is missing entirely. This carries over to linked adapters: Only the anchored parts of linked adapters are required, the others are optional.

There is no way to change this at the command-line level, yet. One possibility to deal with this - until I have implemented a solution - is to run cutadapt twice: In the first run, use -a PLASMID --discard-untrimmed and set the minimum overlap (-O) option to something appropriate. Then use the output of that command and run cutadapt again on it with -g U6PROMOTER, this time setting the minimum overlap to the length of that sequence.

@marcelm
Copy link
Owner

marcelm commented Nov 14, 2017

Let’s continue discussion in #256 since the original issue here - as I understood it originally - was to implement non-anchored linked adapters, which has been done.

@marcelm marcelm closed this as completed Nov 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants