overtrimming of reads #42

Closed
jdidion opened this issue Mar 16, 2018 · 3 comments

@jdidion

jdidion commented Mar 16, 2018

I am benchmarking fastp against other read trimmers using the workflow I developed for the Atropos paper (https://github.com/jdidion/atropos/tree/master/paper/workflow). I find that fastp has a high rate of read overtrimming. Example fastq input and output are attached. The command I used is:

fastp
-i {fastq1} -I {fastq2} -o {prefix}.1.fq.gz -O {prefix}.2.fq.gz
--adapter_sequence {adapter1} --adapter_sequence_r2 {adapter2}
--thread {threads} --length_required 25 --disable_quality_filtering

Nearly all of these overtrimming events involve the spurious removal of up to 10 bases from one or both reads:

image

I suspect this might be due to overzealous alignment of the reads to each other, and could probably be fixed with an option to require a minimum insert overlap before trimming. Another approach (which is offered as an option in Atropos) is to compute the random match probability of each alignment and compare against a user-specified threshold value.
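To illustrate the random match probability idea, here is a minimal sketch (not Atropos's actual code). It assumes bases are independent and uniformly distributed, so the chance that an alignment of a given size shows at least the observed number of matches is a binomial tail. The function names and the `max_prob` threshold are hypothetical.

```python
from math import comb


def random_match_probability(matches: int, size: int, p: float = 0.25) -> float:
    """Probability of seeing at least `matches` matching bases in an
    alignment of `size` bases purely by chance (binomial tail).

    Assumption: each position matches independently with probability p.
    """
    if not 0 <= matches <= size:
        raise ValueError("matches must be between 0 and size")
    return sum(
        comb(size, i) * p**i * (1 - p) ** (size - i)
        for i in range(matches, size + 1)
    )


def accept_match(matches: int, size: int, max_prob: float = 1e-6) -> bool:
    """Keep an adapter match only if it is unlikely to be random.

    `max_prob` is a hypothetical user-specified threshold.
    """
    return random_match_probability(matches, size) <= max_prob
```

Under these assumptions a perfect 4-base match has a random match probability of 0.25**4, which is far too likely to be treated as evidence of an adapter on its own.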

example.zip

@jdidion
Author

jdidion commented Mar 16, 2018

To follow up on this: I tried trimming the example data again without specifying the adapter sequences, and the reads were not trimmed. So to clarify, the issue is with spurious matches of the adapter to the read. I think you need to add the equivalent of the -O parameter in Atropos and Cutadapt. Also note that, with the alternate option I describe above (using random match probability), Atropos uses a heuristic (copied from SeqPurge) that requires both adapter sequences to match their respective reads when the match length is <= some threshold (9 bp by default), regardless of the probability. This seems to be necessary to achieve good performance with short adapter matches.
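A minimal sketch of both ideas, assuming exact matching only (real trimmers allow mismatches) and an adapter that starts somewhere in the read and runs toward its 3' end. The function names, `min_overlap`, and `short_match` are hypothetical and are not taken from fastp, Atropos, or Cutadapt.

```python
from typing import Optional, Tuple


def find_adapter_start(read: str, adapter: str, min_overlap: int = 3) -> Optional[int]:
    """Return the leftmost position where the rest of the read equals a
    prefix of the adapter, or None if the overlap is below min_overlap."""
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # overlaps only get shorter from here on
        if read[start:start + overlap] == adapter[:overlap]:
            return start
    return None


def trim_pair(read1: str, read2: str, adapter1: str, adapter2: str,
              min_overlap: int = 3, short_match: int = 9) -> Tuple[str, str]:
    """Trim adapters from a read pair. A match of <= short_match bases is
    accepted only if the mate also has an adapter match."""
    s1 = find_adapter_start(read1, adapter1, min_overlap)
    s2 = find_adapter_start(read2, adapter2, min_overlap)
    both = s1 is not None and s2 is not None

    def accept(read: str, adapter: str, start: Optional[int]) -> bool:
        if start is None:
            return False
        overlap = min(len(adapter), len(read) - start)
        return both if overlap <= short_match else True

    out1 = read1[:s1] if accept(read1, adapter1, s1) else read1
    out2 = read2[:s2] if accept(read2, adapter2, s2) else read2
    return out1, out2
```

With this sketch, a read ending in a 4-base adapter prefix is left untouched unless its mate also shows an adapter match, while a 10-base match is trimmed regardless of the mate.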

@sfchen
Member

sfchen commented Mar 18, 2018

Thank you John.

In the current implementation, fastp trims a read if its 3' end matches the adapter over 4 or more bases. This may introduce a bit of overtrimming (it may overtrim <0.1% of the data for 150 bp reads), but will definitely give cleaner data.
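For a back-of-envelope feel for how the minimum overlap affects spurious trimming, here is a rough sketch. It assumes uniformly random, independent bases, exact matching only, and a union bound over overlap lengths. It does not model fastp's actual matching rules (for example, any mismatch tolerance), and the function names and the 33 bp adapter length are hypothetical.

```python
def spurious_trim_rate(min_overlap: int, adapter_len: int, p: float = 0.25) -> float:
    """Upper bound on the fraction of adapter-free reads whose 3' end
    exactly equals an adapter prefix of at least min_overlap bases."""
    return sum(p**k for k in range(min_overlap, adapter_len + 1))


def spurious_base_fraction(min_overlap: int, adapter_len: int,
                           read_len: int, p: float = 0.25) -> float:
    """Approximate fraction of all bases removed by such spurious matches
    (k bases are lost when the spurious overlap has length k)."""
    lost = sum(k * p**k for k in range(min_overlap, adapter_len + 1))
    return lost / read_len


if __name__ == "__main__":
    # Compare a few hypothetical minimum-overlap settings for 150 bp reads.
    for k in (3, 4, 6, 8, 10):
        print(k, spurious_trim_rate(k, 33), spurious_base_fraction(k, 33, 150))
```

Under these simplified assumptions, each extra base of required overlap cuts the spurious rate by roughly a factor of four, which is why an adjustable threshold is useful.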

It's a good idea to expose a parameter to change this setting. I will implement this in the next release.

@sfchen sfchen closed this as completed Jun 27, 2018
@komaljain3

When can we expect this new feature to be released? The -O option in Cutadapt is really useful.
