overtrimming of reads #42

jdidion · 2018-03-16T16:32:18Z

I am benchmarking fastp against other read trimmers using the workflow I developed for the Atropos paper (https://github.com/jdidion/atropos/tree/master/paper/workflow). I find that fastp has a high rate of read overtrimming. Example fastq input and output are attached. The command I used is:

fastp
-i {fastq1} -I {fastq2} -o {prefix}.1.fq.gz -O {prefix}.2.fq.gz
--adapter_sequence {adapter1} --adapter_sequence_r2 {adapter2}
--thread {threads} --length_required 25 --disable_quality_filtering

Nearly all of these overtrimming events involve the spurious removal of up to 10 bases from one or both reads.

I suspect this might be due to overzealous alignment of the reads to each other, and could probably be fixed with an option to require a minimum insert overlap before trimming. Another approach (which is offered as an option in Atropos) is to compute the random match probability of each alignment and compare against a user-specified threshold value.
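To illustrate the random match probability idea, here is a minimal sketch that assumes uniform, independent bases and a simple binomial tail. It is not necessarily the exact formula Atropos uses, and the function names and the `max_prob` threshold are hypothetical.

```python
from math import comb


def random_match_probability(matches: int, size: int, p_base: float = 0.25) -> float:
    """Chance of at least `matches` matching bases in `size` aligned positions.

    Assumes each position matches independently with probability `p_base`
    (0.25 for uniform A/C/G/T). This is a simplification of real read content.
    """
    return sum(
        comb(size, k) * p_base**k * (1 - p_base) ** (size - k)
        for k in range(matches, size + 1)
    )


def accept_alignment(matches: int, size: int, max_prob: float = 1e-6) -> bool:
    """Accept an adapter alignment only if it is unlikely to arise by chance."""
    return random_match_probability(matches, size) <= max_prob


if __name__ == "__main__":
    # A perfect 4-base match is far too likely by chance to pass a strict threshold.
    print(random_match_probability(4, 4), accept_alignment(4, 4))
    # A perfect 20-base match passes easily.
    print(random_match_probability(20, 20), accept_alignment(20, 20))
```

With a threshold like this, short matches are rejected automatically, while long matches are accepted even with a few mismatches.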

example.zip

jdidion · 2018-03-16T16:44:51Z

To follow up on this: I tried trimming the example data again without specifying the adapter sequences, and the reads were not trimmed. So, to clarify, the issue is spurious matching of the adapter to the read. I think you need to add the equivalent of the -O parameter in Atropos and Cutadapt. Also note that, with the alternative option I describe above (using random match probability), Atropos uses a heuristic (copied from SeqPurge) that requires both adapter sequences to match their respective reads when the match length is at or below some threshold (9 bp by default), regardless of the probability. This seems to be necessary to achieve good performance with short adapter matches.
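A minimal sketch of what a minimum-overlap option and the "both adapters must match" heuristic could look like. It assumes exact matching of an adapter prefix at the 3' end of the read only; real trimmers also tolerate mismatches and find adapters inside the read. All function and parameter names (`min_overlap`, `short_match`) are hypothetical and not taken from fastp, Atropos, or Cutadapt.

```python
def suffix_prefix_overlap(read: str, adapter: str) -> int:
    """Length of the longest read suffix that exactly equals an adapter prefix."""
    for k in range(min(len(read), len(adapter)), 0, -1):
        if read.endswith(adapter[:k]):
            return k
    return 0


def trim_3prime(read: str, adapter: str, min_overlap: int) -> str:
    """Trim the adapter only when the overlap reaches `min_overlap` bases."""
    k = suffix_prefix_overlap(read, adapter)
    return read[:-k] if k >= min_overlap else read


def trim_pair(r1: str, r2: str, a1: str, a2: str,
              min_overlap: int, short_match: int = 9) -> tuple[str, str]:
    """Paired variant: a match of `short_match` bases or fewer is trusted
    only if the mate also matches its own adapter."""
    k1 = suffix_prefix_overlap(r1, a1)
    k2 = suffix_prefix_overlap(r2, a2)
    ok1, ok2 = k1 >= min_overlap, k2 >= min_overlap
    trust1 = ok1 and (k1 > short_match or ok2)
    trust2 = ok2 and (k2 > short_match or ok1)
    return (r1[:-k1] if trust1 else r1, r2[:-k2] if trust2 else r2)


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"
    read = "ACGTACGTAGAT"  # ends in AGAT, which happens to match the adapter start
    print(trim_3prime(read, adapter, min_overlap=4))  # 4-base match is trimmed
    print(trim_3prime(read, adapter, min_overlap=5))  # read is left intact
```

Raising `min_overlap` trades a few untrimmed adapter remnants for fewer spurious trims, which is the trade-off under discussion here.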

sfchen · 2018-03-18T14:23:56Z

Thank you John.

In the current implementation, fastp trims a read if its 3' end matches the adapter over 4 or more bases. This may introduce a bit of overtrimming (possibly <0.1% of the data for 150 bp reads), but will definitely give cleaner data.

It's a good idea to expose a parameter to change this setting. I will implement this in the next release.
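As a rough check on how often a 4-base minimum match fires by chance, the sketch below computes simple upper bounds. It assumes uniform, independent bases and exact matching only. Real reads are not uniform, and mismatch-tolerant matching raises the rate, so treat the numbers as a back-of-the-envelope estimate. The function names and the 13 bp adapter length in the example are hypothetical.

```python
def spurious_read_fraction_bound(min_match: int, max_len: int,
                                 p_base: float = 0.25) -> float:
    """Upper bound (union bound) on the fraction of reads whose 3' end
    matches an adapter prefix of length min_match..max_len by chance."""
    return sum(p_base**k for k in range(min_match, max_len + 1))


def spurious_base_fraction_bound(min_match: int, max_len: int, read_len: int,
                                 p_base: float = 0.25) -> float:
    """Upper bound on the fraction of bases removed by chance matches."""
    expected_bases = sum(k * p_base**k for k in range(min_match, max_len + 1))
    return expected_bases / read_len


if __name__ == "__main__":
    for min_match in (3, 4, 6, 8, 10):
        reads = spurious_read_fraction_bound(min_match, 13)
        bases = spurious_base_fraction_bound(min_match, 13, 150)
        print(f"min_match={min_match}: reads <= {reads:.4%}, bases <= {bases:.4%}")
```

Under these assumptions, a 4-base minimum affects at most about half a percent of reads and well under 0.1% of bases for 150 bp reads, and each additional required base cuts the rate roughly fourfold.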

komaljain3 · 2022-08-22T18:16:24Z

When can we expect this new feature to be released? The -O option in Cutadapt is really useful.

sfchen closed this as completed Jun 27, 2018
