Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling #33
Thank you Brad! Actually fastp only implemented polyG trimming at 3' ends, since I thought polyG was a much more serious problem than polyA/T/C for NextSeq/NovaSeq. It's not difficult to support polyX trimming based on the polyG trimming implementation, and I will implement it, following your good suggestion. Since paired-end processing outputs two files, I haven't found a good solution for streaming the output to aligners. Thank you again for your support and good suggestions. I will continue to improve this tool.
Shifu, thank you so much -- we are one of the NovaSeq sites that would positively benefit from a fast trimming implementation, particularly if it can handle polyX sequences. A +1 on the roadmap from us.
Hi Brad and Oliver, I'd like to ask some questions about polyX trimming:
Thanks
Shifu; So, I'd suggest having these as separate options. Some users might want only polyG trimming and to keep the other poly stretches. I think having both as optional flags is probably the best approach for now, until trimming these becomes more standard, accepted practice. Regarding streaming for paired-end reads, there are a couple of use cases where we could take advantage of this:
Thanks again for considering all these options and all the work on fastp.
Shifu, I think both options are useful. polyG is a 2-color-chemistry-specific problem and causes variant callers to fail depending on coverage depth (or at least stall for a long time). That in itself is valuable, but based on the preliminary results, removing other polyX regions helps with both speed and precision, something I didn't anticipate. For two-color chemistry I'd suggest polyG trimming as a default, with everything else up to the user.
Okay. Another question: do you want to discard reads containing a long polyX in the middle of the read, i.e. where the polyX is not at the 3' or 5' ends? This function may have side effects on MSI detection.
Hi @chapmanb, streaming both ends, interleaved, directly to STDOUT is a good suggestion. I have just confirmed that minimap2 also supports interleaved FASTQ, as bwa does. I will add this feature to the roadmap. Thanks
In RNA-seq one has polyA due to polyadenylation of mRNA, and therefore one would want to trim the polyA ends from the reads before aligning them. Some RNA-seq datasets, which come from non-stranded library preparation, will have the polyA showing up as polyT in the reads.
Hi all, I am just starting to implement 3' end polyX trimming. Another important question: although I will make these options configurable on the command line, I still need this information to set the default configuration. Thanks
https://s3-ap-southeast-2.amazonaws.com/umccr/umccr/qc/polyg/ipmn2219-2_33_tumor_hotspot_fastqc.html#M9 is a FASTQC report including just one problematic region. I can only speak for poly-G tracks, but even removing just the reads that consist of 50 consecutive Gs would already help. I think at least for the poly-G issue a conservative setting would work. Not sure what the Atropos defaults are, though.
fastp's current polyG implementation is a bit aggressive: 10 consecutive G bases will be categorised as polyG, with one mismatch allowed per 8 bp.
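For concreteness, the existing polyG trimming can be invoked roughly as follows. This is an illustrative sketch only, with option names taken from the current fastp documentation rather than quoted from this thread:

```
# Illustrative sketch -- option names assumed from current fastp docs, not from this thread.
# Enable 3' polyG trimming; --poly_g_min_len corresponds to the 10 bp threshold described above.
fastp -i in_R1.fastq.gz -I in_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
      --trim_poly_g --poly_g_min_len 10
```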
Shifu;
I just implemented this feature. Please try building fastp from the latest code on master, or download http://opengene.org/fastp/fastp to test.
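As a rough sketch of how the new trimming could be enabled once built from master (the flag names, including `--trim_poly_x` and the `--poly_x_min_len` parameter discussed below, are assumed from the current fastp documentation, not quoted from this comment):

```
# Illustrative sketch -- flag names assumed from current fastp docs.
# polyG trimming for 2-color chemistry plus general 3' polyX trimming.
fastp -i in_R1.fastq.gz -I in_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
      --trim_poly_g \
      --trim_poly_x --poly_x_min_len 10
```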
Thanks so much for the quick work integrating this, we really appreciate it. I've re-run a test on the same dataset used above and this does provide some nice improvements in removing these stretches. In comparison to what we were seeing before, here are the top 3' ends for the first 10 million reads:
The low-complexity dinucleotide repeats are still expected with our current trimming, but the last two, polyT stretches ending with a different base, are ones I'd expect to be removed. I dug into them and here are some example reads with these ends left after trimming:
Happy to provide more examples if it would help. Thanks again for the work on this.
@chapmanb how about also trimming the last base in cases like these? Anyway, I can make a minor revision to trim polyX stretches that end with a single different base.
Shifu -- thanks for this. Generally I thought these should fall into the logic of being > poly_x_min_len (which I left at 10) and the logic of 1 allowed mismatch per 8 bp with a maximum of 5 mismatches. Are you requiring that the polyX stretch start at the 3'-most base and implicitly disallowing mismatches there? MSIs should primarily be in more complex di-nucleotide+ repeats (like the first 4 examples in the remaining trim), and I agree the polyX trimming shouldn't touch those to avoid messing with MSI detection. Exploring low-complexity filters and the impact of this type of detection would be a useful secondary filter, but something more long term. We're hoping to isolate the smaller set of trimming changes which help most with runtimes as a first pass, so clearing out the remaining noisy polyT and other reads would be helpful. Thanks again for the discussion and help.
@chapmanb I just made an update to trim polyX stretches like these.
Shifu; https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5 I'm continuing to try to understand which reads specifically help with downstream speed improvements, and appreciate all the help in being able to be more aggressive about removing this polyX noise. Thanks again.
Thank you Brad, I will do a test with the file you provided.
Hi @chapmanb I cannot download the file you uploaded at https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5 Would you please attach it to this issue? I think it's small enough to be attached here. Thanks
Shifu -- what issues are you running into downloading it? This is the direct link for the raw version for download if that helps: |
This direct link works, thanks.
Hi @chapmanb With the latest build, I have implemented low-complexity filtering, which can filter out most of these reads. You can test it by enabling the new low-complexity filtering option.
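The exact option was not preserved in this thread; a hedged sketch, assuming the `--low_complexity_filter` and `--complexity_threshold` options described in the current fastp documentation:

```
# Illustrative sketch -- option names assumed from current fastp docs, not from this comment.
# Discard reads whose complexity (fraction of bases differing from the following base)
# falls below the threshold (30% by default).
fastp -i in_R1.fastq.gz -I in_R2.fastq.gz \
      -o filtered_R1.fastq.gz -O filtered_R2.fastq.gz \
      --low_complexity_filter --complexity_threshold 30
```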
Shifu; It looks in the end like the primary difference is due to quality trimming differences, and the polyX at the 3' ends is largely a reflection of that: https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim I'm trying to harmonize fastp and atropos trimming to better understand the differences, but am having trouble replicating the runtime improvements we find with atropos trimming. For fastp I use settings that give 3'-only quality trimming, and with atropos the corresponding quality cutoff, which I think should be roughly equivalent. I've tried increasing the trimming thresholds as well. I'll continue to dig but welcome any thoughts about how best to synchronize these, or if I'm missing anything obvious. Thank you again for all the help and happy to have two quality options to test with.
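The original commands did not survive in this thread; a rough sketch of what approximately equivalent 3'-only quality trimming might look like in the two tools, with option names assumed from their respective documentation rather than from the original comment:

```
# Illustrative sketch only -- not the commands from the original comment.
# fastp: sliding-window quality trimming from the 3' end only.
fastp -i in_R1.fastq.gz -I in_R2.fastq.gz \
      -o fastp_R1.fastq.gz -O fastp_R2.fastq.gz \
      --cut_tail --cut_window_size 4 --cut_mean_quality 20

# atropos: cutadapt-style quality trimming; a single -q value trims the 3' end only.
atropos trim -q 20 \
      -pe1 in_R1.fastq.gz -pe2 in_R2.fastq.gz \
      -o atropos_R1.fastq.gz -p atropos_R2.fastq.gz
```

Note that the two tools use different trimming algorithms (a sliding-window mean in fastp versus cutadapt's running-sum approach in atropos), so identical cutoffs will not necessarily remove the same bases.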
Hi @chapmanb A quick update: fastp can now stream its output to STDOUT.
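(The exact command from this comment was not preserved; a sketch of how the streaming output might be combined with alignment, assuming fastp's `--stdout` option and bwa's `-p` interleaved/smart-pairing mode from the respective documentation:)

```
# Illustrative sketch -- flags assumed from fastp and bwa documentation, not from this comment.
# For paired-end input, --stdout emits interleaved FASTQ, which bwa mem -p can consume from stdin.
fastp -i in_R1.fastq.gz -I in_R2.fastq.gz --stdout \
      --trim_poly_g --trim_poly_x 2> fastp.log \
  | bwa mem -p -t 8 ref.fa - \
  | samtools sort -o aligned.bam -
```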
I hope you will like this new feature. BTW, I think it's you who maintains the bioconda recipe for fastp?
Shifu; For the bioconda package, you don't need to be a collaborator or anything special. If you update the recipe, send a PR and cc me; I'd be happy to merge it: https://bioconda.github.io/contributing.html or you can also ask to become a contributor to the project to merge them yourself: https://bioconda.github.io/contrib-setup.html#request-to-be-added-to-the-bioconda-team-optional It's pretty lightweight, meant to enable as many contributors as we can. Note that right now we're in the middle of a huge transition to a new compiler system and rebuild, so things are bogged down on new recipes. Hopefully that backlog will get cleared soon. Thanks again.
Hi @chapmanb I sent you an email yesterday. Just wondering whether you have received it? Thanks
Shifu;
I sent the email to your Harvard mailbox, and probably it was filtered to the junk box. So I just resent it to your fastmail.com mailbox :)
Shifu;
Congrats on the paper in bioRxiv and thanks for all the great work on fastp. We've been working on improving the runtimes for somatic variant calling workflows and exploring quality and polyX trimming. We did a test run with fastp and atropos and found that the major improvements in runtime were due to removal of polyX sequences at the 3' ends of reads:
https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim
We'd used the new polyG trimming functionality (thank you) plus a crude method of 3' polyA/T/C adapter removal, which appears to be less effective with fastp than with atropos trimming. When the additional polyX stretches get removed, we see much better runtimes for alignment and variant calling.
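(As one illustration of what such a crude 3' polyA/T/C removal can look like -- not necessarily the exact approach used here -- homopolymer runs can simply be supplied as 3' adapter sequences to a cutadapt-style trimmer:)

```
# Illustrative sketch -- one common crude approach, not the exact command used in this work.
# Treat homopolymer runs as 3' adapters so atropos clips them from read ends.
atropos trim \
      -a AAAAAAAAAAAAAAAAAAAA -a TTTTTTTTTTTTTTTTTTTT -a CCCCCCCCCCCCCCCCCCCC \
      -A AAAAAAAAAAAAAAAAAAAA -A TTTTTTTTTTTTTTTTTTTT -A CCCCCCCCCCCCCCCCCCCC \
      -pe1 in_R1.fastq.gz -pe2 in_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz
```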
I saw general polyX and low complexity trimming are on the roadmap for fastp and would like to express my support for this. We've been making great use of fastp for adapter conversion and would like to offer trimming as part of an effort to speed up alignment and variant calling both on NovaSeqs and more generally.
As a secondary help for integration, is streaming trimming a possibility for paired-end reads? To help improve preparation runtimes I'd been thinking of including trimming and streaming directly into bwa/minimap2 alignment, or streaming outputs into bgzip so we can index and parallelize variant calling.
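(A sketch of the bgzip branch of this idea; hypothetical at the time of this post, assuming a fastp `--stdout` streaming mode like the one discussed elsewhere in the thread:)

```
# Illustrative sketch -- assumes fastp can stream interleaved output to STDOUT.
# The bgzipped output can then be indexed and split for parallel downstream processing.
fastp -i in_R1.fastq.gz -I in_R2.fastq.gz --stdout 2> fastp.log \
  | bgzip -@ 4 -c > trimmed_interleaved.fastq.gz
```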
Thanks again for all the work on fastp.