Convert adapter matches to lowercase instead of trimming #166

Poshi · 2015-12-17T16:13:12Z

One more idea, and something that we need.
In keeping with the idea of retaining all the information in the output, it would be nice to be able to clip the sequences instead of trimming them, similar to what RepeatMasker does:

no trimming at all: In that case, input equals output. No need to implement, just don't run cutadapt XD
soft clipping: Transform bases that should be cut for any reason to lower case
hard clipping: Transform bases that should be cut for any reason to 'N'
trimming: Eliminate bases that should be cut for any reason
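
To make the four modes concrete, here is a minimal Python sketch that applies each of them to one read. The function name `apply_action`, the mode names, and the example read are all hypothetical, chosen only for illustration; none of this is cutadapt code. The span to be cut is given as a half-open interval `[start, end)`.

```python
def apply_action(seq, start, end, action):
    """Apply one of the four modes to the span seq[start:end].

    The mode names ("none", "soft", "hard", "trim") are illustrative
    labels for the list above, not options of any real tool.
    """
    if action == "none":
        # No trimming at all: output equals input.
        return seq
    if action == "soft":
        # Soft clipping: bases that would be cut become lowercase.
        return seq[:start] + seq[start:end].lower() + seq[end:]
    if action == "hard":
        # Hard clipping: bases that would be cut become 'N'.
        return seq[:start] + "N" * (end - start) + seq[end:]
    if action == "trim":
        # Trimming: bases that would be cut are removed.
        return seq[:start] + seq[end:]
    raise ValueError(f"unknown action: {action}")


read = "ACGTACGTTTAGGC"
# Assume positions 8..13 (the last six bases) were flagged for removal.
for mode in ("none", "soft", "hard", "trim"):
    print(mode, apply_action(read, 8, len(read), mode))
```

Only the "trim" mode changes the read length, which is why the other three allow the original read to be recovered or at least aligned back position by position.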

Since we need this functionality, we currently have to store the sequence and qualities in the ID line, run the trimming process, and postprocess the output to put back the original sequence and qualities, keeping in uppercase only the bases that survived the process.

This allows us to regenerate the original FASTQ file from the resulting aligned BAM, thus reducing the space we need to store all the data.

marcelm · 2016-01-04T16:18:55Z

Masking bases that should be trimmed with N characters is already possible by using the --mask-adapter option. The no trimming at all option is also implemented (--no-trim) and actually has a use case even when running cutadapt: It allows you to redirect reads with and without adapters to different files.

What is currently not possible is to transform bases to lowercase and I agree it would be useful to have.

Have you seen that cutadapt can output an "info file" with the --info-file option? It makes it quite easy to extract all the information you need, even the lowercase transformation should be quite easy to do with a short awk script or similar.
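
As a rough illustration of that postprocessing step, here is a Python sketch rather than awk. It assumes the tab-separated info file layout described in the cutadapt documentation: read name, number of errors, match start, match end, sequence left of the match, matched sequence, sequence right of the match, adapter name, then the three corresponding quality strings, with `-1` in the second column followed by the untouched sequence and qualities when no adapter was found. It also assumes a 3' adapter, so the match and everything after it are lowercased. The function name is hypothetical, and the column layout should be checked against the cutadapt version in use.

```python
def lowercase_from_info_line(line):
    """Rebuild a full-length read from one info-file line.

    Assumed layout (tab-separated), to be verified against the docs:
      0 name, 1 errors, 2 start, 3 end, 4 left, 5 match, 6 right,
      7 adapter name, 8 qual left, 9 qual match, 10 qual right
    Lines for reads without a match: name, -1, sequence, qualities.
    """
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    if fields[1] == "-1":
        # No adapter found: the read is returned unchanged.
        return name, fields[2], fields[3]
    left, match, right = fields[4], fields[5], fields[6]
    # Assumption: 3' adapter, so the match and what follows would be cut.
    seq = left + match.lower() + right.lower()
    quals = fields[8] + fields[9] + fields[10]
    return name, seq, quals


example = "read1\t0\t4\t8\tACGT\tTTAG\tGC\tadapter1\tIIII\tHHHH\tGG"
print(lowercase_from_info_line(example))
```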

I’ll leave this report open and change the title to remind me that a --lowercase option or similar would be useful.

Poshi · 2016-01-04T16:31:42Z

Good to know that we can recreate the requested behavior by using the info file. But this comes with a trade-off: we have to do some extra computation (CPU time, not really important), and we have to write the intermediate file and read it back in. That prevents us from piping the results into other tools and doubles the disk I/O needed for the processing, which is quite important to us.

marcelm · 2016-01-07T09:28:39Z

Yes, some extra processing is needed, but wouldn’t you have to do that in any case? Even if cutadapt already had a --lowercase option, you’d have to create a FASTQ file for the aligner in which the lowercase bases are removed.

I’m happy to implement this, but I’m just not sure whether it really helps in your case.

Poshi · 2016-01-07T10:18:01Z

Our aligner can handle the situation. For lowercase input bases, it outputs a soft-clipping operation in the SAM file (CIGAR operation S; for example, 1S90M10S when the first base and the last ten bases are clipped).
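
The mapping from lowercase bases to soft clips can be sketched as follows. This is only an illustration under simplifying assumptions: the helper name `soft_clip_cigar` is hypothetical, lowercase runs are considered only at the two ends of the read, and the uppercase part is reported as a single match block, whereas a real aligner would also emit insertions, deletions, and other operations.

```python
def soft_clip_cigar(seq):
    """Build a simplified CIGAR string from a lowercase-masked read.

    Leading and trailing lowercase runs become soft clips (S); the
    rest is reported as one match block (M). Indels are ignored,
    so this is not what a real aligner would compute in general.
    """
    n = len(seq)
    lead = 0
    while lead < n and seq[lead].islower():
        lead += 1
    trail = 0
    # Stop before re-counting bases already in the leading run.
    while trail < n - lead and seq[n - 1 - trail].islower():
        trail += 1
    middle = n - lead - trail
    parts = []
    if lead:
        parts.append(f"{lead}S")
    if middle:
        parts.append(f"{middle}M")
    if trail:
        parts.append(f"{trail}S")
    return "".join(parts)


# One lowercase base, 90 uppercase bases, ten lowercase bases.
read = "a" + "ACGT" * 22 + "AC" + "t" * 10
print(soft_clip_cigar(read))
```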

Poshi · 2019-02-28T13:47:01Z

Thanks!
We love you too much! :-)

marcelm · 2019-02-28T14:26:12Z

It’s taken a while, but I don’t forget :-)

marcelm changed the title ~~Clipping instead of trimming~~ Convert adapter matches to lowercase instead of trimming Jan 4, 2016

marcelm added Type- feature-request and removed Type- labels Aug 30, 2018

marcelm closed this as completed in 5057184 Feb 28, 2019
