Convert adapter matches to lowercase instead of trimming #166

Closed
Poshi opened this issue Dec 17, 2015 · 6 comments
Comments


Poshi commented Dec 17, 2015

One more idea, and another thing that we need.
In keeping with the goal of retaining all the information in the output, it would be nice to be able to clip the sequences instead of trimming them, similar to what RepeatMasker does:

  • no trimming at all: In that case, input equals output. No need to implement, just don't run cutadapt XD
  • soft clipping: Transform bases that should be cut for any reason to lower case
  • hard clipping: Transform bases that should be cut for any reason to 'N'
  • trimming: Eliminate bases that should be cut for any reason
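
The four modes above can be illustrated with a small sketch. This is not cutadapt code; the function name `apply_mode`, the mode labels, and the half-open `start`/`end` coordinates are assumptions made only for this example.

```python
def apply_mode(seq, start, end, mode):
    """Return seq with the region seq[start:end] handled according to mode.

    Hypothetical helper for illustration; start/end are 0-based, end exclusive.
    """
    region = seq[start:end]
    if mode == "none":   # no trimming at all: output equals input
        return seq
    if mode == "soft":   # soft clipping: region becomes lowercase
        return seq[:start] + region.lower() + seq[end:]
    if mode == "hard":   # hard clipping: region becomes 'N'
        return seq[:start] + "N" * len(region) + seq[end:]
    if mode == "trim":   # trimming: region is removed
        return seq[:start] + seq[end:]
    raise ValueError(f"unknown mode: {mode}")


read = "ACGTTTGA"
for mode in ("none", "soft", "hard", "trim"):
    print(mode, apply_mode(read, 4, 8, mode))
```

Only the "trim" mode changes the read length, which is why the other three keep enough information to reconstruct the original read.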

In our case, since we need this functionality, we currently have to store the sequence and qualities in the ID line, run the trimming process, and then postprocess the output to restore the original sequence and qualities, keeping in uppercase only the bases that survived trimming.

This allows us to regenerate the original FASTQ file from the resulting aligned BAM file, which reduces the space we need to store all the data.

Owner

marcelm commented Jan 4, 2016

Masking bases that should be trimmed with N characters is already possible by using the --mask-adapter option. The no trimming at all option is also implemented (--no-trim) and actually has a use case even when running cutadapt: It allows you to redirect reads with and without adapters to different files.

What is currently not possible is to transform bases to lowercase and I agree it would be useful to have.

Have you seen that cutadapt can output an "info file" with the --info-file option? It makes it quite easy to extract all the information you need; even the lowercase transformation should be quite easy to do with a short awk script or similar.
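
As a rough sketch of that postprocessing step (in Python rather than awk), the following turns one info-file line into a FASTQ record with the adapter match lowercased. It assumes the tab-separated column layout described in the cutadapt documentation (read name, error count, match start, match end, sequence left of the match, matched sequence, sequence right of the match, adapter name, then the three corresponding quality strings, with a shorter `-1` line for reads without a match), and it assumes a single 3' adapter, so the match and everything after it is lowercased. Please check the layout against your cutadapt version; `mask_info_line` is a name made up for this example.

```python
def mask_info_line(line):
    """Convert one info-file line to a FASTQ record with soft-masked adapter.

    Assumed columns (verify against your cutadapt version):
    name, errors, start, end, left, match, right, adapter, q_left, q_match, q_right
    Reads without an adapter match are assumed to be: name, -1, sequence, qualities
    """
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    if fields[1] == "-1":
        # No adapter found: pass the read through unchanged
        seq, qual = fields[2], fields[3]
    else:
        left, match, right = fields[4], fields[5], fields[6]
        # Assumption: 3' adapter, so the match and what follows would be trimmed
        seq = left + (match + right).lower()
        qual = fields[8] + fields[9] + fields[10]
    return f"@{name}\n{seq}\n+\n{qual}\n"


example = "read1\t0\t4\t8\tACGT\tTTGA\t\tadapter1\tIIII\tHHHH\t\n"
print(mask_info_line(example), end="")
```

A read that matches several adapters may appear on more than one line of the info file; this sketch does not merge such lines.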

I’ll leave this report open and change the title to remind me that a --lowercase option or similar would be useful.

marcelm changed the title from "Clipping instead of trimming" to "Convert adapter matches to lowercase instead of trimming" on Jan 4, 2016
Author

Poshi commented Jan 4, 2016

Good to know that we can recreate the requested behavior by using the info file. But this comes with a trade-off: we have to do some extra computation (CPU time, not really important), and we have to write the intermediate file and then read it back in. That prevents us from piping the results into other tools and doubles the disk I/O needed for the processing, which is quite important to us.

Owner

marcelm commented Jan 7, 2016

Yes, some extra processing is needed, but wouldn’t you have to do that in any case? Even if cutadapt already had a --lowercase option, you’d have to create a FASTQ file for the aligner in which the lowercase bases are removed.

I’m happy to implement this, but I’m just not sure whether it really helps in your case.

Author

Poshi commented Jan 7, 2016

Our aligner can handle the situation. For lowercase input bases, it outputs a soft-clipping operation in the SAM file (CIGAR operation S, e.g. 1S90M10S when the first base and the last ten bases are clipped).
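
To make the CIGAR example concrete, here is a minimal sketch of how leading and trailing lowercase runs map to S operations. It is not the aligner's actual logic: it assumes an ungapped alignment in which all remaining bases are reported as M, and `softclip_cigar` is a hypothetical name.

```python
def softclip_cigar(seq):
    """Build an idealized CIGAR string from a case-masked read.

    Leading/trailing lowercase runs become soft clips (S); everything in
    between is reported as M. Assumes no indels, for illustration only.
    """
    lower = "acgtn"
    lead = len(seq) - len(seq.lstrip(lower))
    rest = seq[lead:]
    trail = len(rest) - len(rest.rstrip(lower))
    aligned = len(seq) - lead - trail
    parts = []
    if lead:
        parts.append(f"{lead}S")
    if aligned:
        parts.append(f"{aligned}M")
    if trail:
        parts.append(f"{trail}S")
    return "".join(parts)


read = "a" + "C" * 90 + "g" * 10
print(softclip_cigar(read))  # 1S90M10S
```

Lowercase bases in the interior of the read are counted as M here, since soft clipping can only occur at the ends of an alignment.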

Author

Poshi commented Feb 28, 2019

Thanks!
We love you too much! :-)

Owner

marcelm commented Feb 28, 2019

It’s taken a while, but I don’t forget :-)
