Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling #33

Closed
chapmanb opened this issue Mar 3, 2018 · 29 comments

Comments

@chapmanb
Copy link

chapmanb commented Mar 3, 2018

Shifu;
Congrats on the paper in bioRxiv and thanks for all the great work on fastp. We've been working on improving the runtimes for somatic variant calling workflows and exploring quality and polyX trimming. We did a test run with fastp and atropos and found that the major improvements in runtime were due to removal of polyX sequences at the 3' ends of reads:

https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim

We'd used the new polyG trimming functionality (thank you), but a crude method of 3' polyA/T/C adapter removal, which appears to be less effective with fastp compared to atropos trimming. When additional polyX stretches get removed we get much better runtimes for alignment and variant calling.

I saw general polyX and low complexity trimming are on the roadmap for fastp and would like to express my support for this. We've been making great use of fastp for adapter conversion and would like to offer trimming as part of an effort to speed up alignment and variant calling both on NovaSeqs and more generally.

As a secondary help for integration, is streaming trimming a possibility for paired ends? To help improve preparation runtimes I'd been thinking of including trimming and streaming directly into bwa/minimap2 alignment, or being able to stream outputs into bgzip so we can index and parallelize variant calling.

Thanks again for all the work on fastp.

@sfchen
Copy link
Member

sfchen commented Mar 3, 2018

Thank you Brad!

Actually fastp only implemented polyG trimming in 3' ends since I thought polyG is much more serious than poly A/T/C for NextSeq/NovaSeq. It's not difficult to support polyX trimming based on the polyG trimming implementation. And I will implement it, according to your good suggestion.

Since the paired-end processing will output two files, I cannot find a good solution to stream the output to aligners.

Thank you again for your support and good suggestions. I will continue to improve this tool.

@ohofmann
Copy link

ohofmann commented Mar 3, 2018

Shifu,

thank you so much -- we are one of the NovaSeq sites that would positively benefit from a fast trimming implementation, particularly if it can handle polyX sequences. A +1 on the roadmap from us.

@sfchen
Copy link
Member

sfchen commented Mar 4, 2018

Hi Brad and Oliver,

I'd like to ask some questions about polyX trimming:

  1. We know that polyG is caused by signal attenuation of SBS. When the signal is very weak, the base will be detected as G in a two-colour system like NovaSeq or NextSeq. But what causes poly A/T/C?
  2. If polyX trimming is implemented, do we still need polyG trimming options?
  3. Do you think polyX trimming should be enabled by default for preprocessing NovSeq / NextSeq data?

Thanks
Shifu

@chapmanb
Copy link
Author

chapmanb commented Mar 4, 2018

Shifu;
Thanks so much for considering this feature. In terms of motivation, I think the polyX trimming is separate from NovaSeq polyG trims. There isn't a NovaSeq mechanism contributing to polyA/T/C since this is also present in non-NovaSeq (HiSeq and friends) reads. Rather this is due to some kind of slippage issue or is sequencing real signal. But either way what happens with this during variant calling is that these reads pile up on A stretches of the genome, creating very deep pileups of noise which take a long time to variant call through especially when identifying low frequency somatic variants.

So, I'd suggest having these as separate options. Some users might want only polyG and to keep the other poly stretches. I think having both as optional flags is probably the best option for now until trimming these becomes more standard accepted practice.

Regarding streaming for paired end reads, there are a couple of use cases where we could take advantage of this;

  • Streaming interleaved reads directly into mappers like bwa and minimap2. So instead of -o and -O, stream both ends, interleaved, directly to standard out.
  • Streaming into bgzip preparation of reads, doing something like -o >(bgzip --threads 8 > out_R1.fq.gz) -O >(bgzip --threads 8 > out_R2.fq.gz). This would help us prepare bgzipped outputs which are indexable for looks like grabix (https://github.com/arq5x/grabix) so we could parallelize subsequent alignment over regions of the fastq file.

Thanks again for considering all these options and all the work on fastp.

@ohofmann
Copy link

ohofmann commented Mar 4, 2018

Shifu,

I think both options are useful. polyG is a 2-color-chemistry specific problem and causes variant callers to fail depending on coverage depth (or at least stall for a long time). That in itself is valuable, but based on the preliminary results removing other polyX regions helps with both speed and precision, something I didn't anticipate.

For two color chemistry I'd suggest polyG as a default, with everything else up to the user.

@sfchen
Copy link
Member

sfchen commented Mar 5, 2018

Okay.

Another question: do you want to discard reads containing long polyX in the middle of reads. I mean the polyX is not in 3' or 5' ends.

This function may have side effects on MSI detection.

@sfchen
Copy link
Member

sfchen commented Mar 5, 2018

Hi @chapmanb ,

Streaming both ends, interleaved, directly into STDOUT is a good suggestion. I have just confirmed that minimap2 also supports interleaved FASTQ, as well as bwa does.

I will add this feature to the roadmap.

Thanks
Shifu

@ndaniel
Copy link

ndaniel commented Mar 5, 2018

@sfchen

But what causes poly A/T/C?

In RNA-seq one has polyA due to polyadenylation of mRNA, and therefore one would want to trim the polyA ends from the reads before aligning the reads. Some RNA-seq datasets, which come from non-stranded library preparation, will have the polyA showing up as polyT in the reads.

@sfchen
Copy link
Member

sfchen commented Mar 6, 2018

Hi all, I am just starting to implement 3' end polyX trimming.

Another important question:
1, in your opinion, how many consecutive bases must be detected as a polyX? How about 10 bp?
2, to be error tolerant, how many mismatches are allowed? How about 1 mismatch per 10bp?

Although I will make these options configurable by command line, I still need these information to set the default configuration.

Thanks
Shifu

@ohofmann
Copy link

ohofmann commented Mar 6, 2018

https://s3-ap-southeast-2.amazonaws.com/umccr/umccr/qc/polyg/ipmn2219-2_33_tumor_hotspot_fastqc.html#M9 is a FASTQC report including just one problematic region. I can only speak for poly-G tracks, but even removing just reads that consist of 50 consecutive Gs would allow most variant callers to finish the region in a reasonable amount of time; allowing one mismatch in 50 would get rid of all but the a small minority.

I think at least for the poly-G issue a conservative setting would work. Not sure what the Atropos defaults are though.

@sfchen
Copy link
Member

sfchen commented Mar 6, 2018

fastp's current polyG implementation is a bit aggressive: 10 bp consecutive G will be categorised as polyG, with one mismatch allowed in per 8 bp.

@chapmanb
Copy link
Author

chapmanb commented Mar 6, 2018

Shifu;
The defaults you use for polyG -- 10 Gs with 1 mismatch per 8bp -- also seem reasonable to me for polyX. Having them consistent as defaults seems like a worthwhile strategy. Thanks again for working on this.

sfchen added a commit that referenced this issue Mar 6, 2018
@sfchen
Copy link
Member

sfchen commented Mar 6, 2018

I just implemented this feature.

Please try to build fastp with latest code on master. Or download http://opengene.org/fastp/fastp to test.

@chapmanb
Copy link
Author

chapmanb commented Mar 7, 2018

Thanks so much for the quick work integrating this, we really appreciate it. I've re-run a test on the same dataset used above and this does provide some nice improvements in removing these stretches. In comparison to what we were seeing before, here are the top 3' ends for the first 10 million reads:

GTGTGTGTGT 1217
TGTGTGTGTG 1175
CACACACACA 1122
ACACACACAC 1089
TTTTTTTTTG 1083
TTTTTTTTTA 1009

The low complexity dinucleotide repeats are still expected with our current trimming, but the last two polyT with a different base end are ones I'd expect to remove. I dug into them and here are some example reads with these ends left after trimming:

AGGAATTCTGCAGCTTTTTCTTTTCTTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTGTTTTTTTTTATTTTTTTGTTTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTTA
CCCTTCTTTACGGTGAAGCTTATTCTGATTAAGCCTAGACTGTGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA
TGAAGGCCTGGGGATGGTGACTGAAGAAGGAACACGTAAGTAACTAATGAATGTGAAGGCCATTCTCTTCCTGATTAAAATCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTG
TGGGTGTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTG

Happy to provide more examples if it would help. Thanks again for the work on this.

@sfchen
Copy link
Member

sfchen commented Mar 7, 2018

@chapmanb how about trim the last base with -t 1 option? Are they from the 151st cycle?

Anyway, I can make a minor revision to trim polyX like TTTTTTTTTTTTTTTTTTTTTTTTA, but we should consider that whether it will cause over trimming. Will it impact MSI detection?

@chapmanb
Copy link
Author

chapmanb commented Mar 7, 2018

Shifu -- thanks for this. Generally I thought these should fall into the logic of being > poly_x_min_len (which I left at 10) and the logic of 1 allowed mismatch per 8bp with a maximum of 5 mismatches. Are you requiring that the polyX stretch initiate with the 3' most end and implicitly disallowing mismatches there?

MSIs should primarily by in more complex di-nucleotide+ repeats (like the first 4 examples in the remaining trim) and I agree the polyX shouldn't touch those to avoid messing with these. Exploring low complexity filters and the impact of this type of detection would be a useful secondary filter but something more long term. We're hoping to isolate the smaller set of trimming changes which help most with runtimes as a first pass, so clearing out the remaining noisy polyT and other reads would be helpful.

Thanks again for the discussion and help.

@sfchen
Copy link
Member

sfchen commented Mar 7, 2018

@chapmanb I just made an update to trim the polyX like TTTTTTTTTTTTTTTTTTTTTTTTA . Could you please try the latest build?

@chapmanb
Copy link
Author

chapmanb commented Mar 9, 2018

Shifu;
Thanks much for the fix and continuing to work on this. I've been testing this and comparing with atropos trimming and finding some reads where atropos trims and fastp does not. This is a list of ~1000 that atropos will trim after being fed the fastp trimmed outputs. I don't think atropos does the right thing in all of these cases but would especially like to remove a lot of these problematic reads that are almost all polyX. The file is two columns, with the first the remaining read and the second the part that gets trimmed, so adding those first two columns gives the original post-fastp read:

https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5

I'm continuing to try to understand which reads specifically help with downstream speed improvements, and appreciate all the help trying to be able to be more aggressive in removing this polyX noise. Thanks again.

@sfchen
Copy link
Member

sfchen commented Mar 9, 2018

Thank you Brad, I will do a test with the file you provided.

@sfchen
Copy link
Member

sfchen commented Mar 13, 2018

Hi @chapmanb I cannot download the file you uploaded at https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5

Would you please attach it in this issue? I think it's small enough to be attached here.

Thanks
Shifu

@chapmanb
Copy link
Author

Shifu -- what issues are you running into downloading it? This is the direct link for the raw version for download if that helps:

https://gist.githubusercontent.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5/raw/9867e4ccb5c8feef5eecc14b3499dc2a36cdf318/untrimmed_fastp.txt

@sfchen
Copy link
Member

sfchen commented Mar 13, 2018

This direct link works, thanks.

@sfchen
Copy link
Member

sfchen commented Mar 14, 2018

Hi @chapmanb

With the latest build, I implemented low complexity filtering, which can filter out most of these reads. You can test it by specifying -y

@chapmanb
Copy link
Author

Shifu;
Thanks so much for the implementation. I've been using this and working through different trimming comparisons with both fastp and atropos to try and find a good minimal combination of quality and polyX trimming that improves calling runtimes and helps with senstivity/specificity.

It looks in the end like the primary different is due to quality trimming differences, and polyX 3' ends up being a reflection of that:

https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim

I'm trying to harmonize fastp and atropos trimming to better understand the differences but am having trouble replicating the runtime improvements we find with atropos trimming. For fastp I use:

--cut_by_quality3 --cut_mean_quality 5 --disable_quality_filtering

to get 3' only quality trimming and with atropos use:

--quality-cutoff 5

which I think should be roughly equivalent. I've tried increasing --cut_mean_quality without much change, so must be missing something.

I'll continue to dig but welcome any thoughts about how best to synchronize these, or if I'm missing anything obvious. Thank you again for all the help and happy to have two quality options to test with.

@sfchen
Copy link
Member

sfchen commented Jun 20, 2018

Hi @chapmanb

A quick update, fastp 0.16.0 supports streaming to STDOUT, and also supports interleaved input. Please see the update on README: https://github.com/OpenGene/fastp#input-and-output

output to STDOUT

fastp supports streaming the passing-filter reads to STDOUT, so that it can be passed to other compressors like bzip2, or be passed to aligners like bwa and bowtie2.

  • specify --stdout to enable this mode to stream output to STDOUT
  • for PE data, the output will be interleaved FASTQ, which means the files will contain records like record1-R1 -> record1-R2 --> record2-R1 -> record2-R2 --> record3-R1 -> record3-R2 ...

I hope you will like this new feature.

BTW, I think it's you that maintain fastp on BioConda, am I right? If yes, can you please add me as one collaborator so that I can update it after new version released?

@chapmanb
Copy link
Author

Shifu;
Thanks so much for these improvements, that is really helpful. I'll look at incorporating this into bcbio's use of fastp.

For the bioconda package, you don't need to be a collaborator or anything special. If you update the recipe, send a PR and cc me I'd be happy to merge:

https://bioconda.github.io/contributing.html

or you can also ask to become a contributor to the project to merge them yourself:

https://bioconda.github.io/contrib-setup.html#request-to-be-added-to-the-bioconda-team-optional

It's pretty lightweight meant to enable as many contributors as we can. Note that right nowe we're in the middle of a huge transition to a new compiler system and rebuild so things are bogged down on new recipes. Hopefully that backlog will get cleared soon. Thanks again.

@sfchen sfchen closed this as completed Jul 23, 2018
@sfchen
Copy link
Member

sfchen commented Aug 30, 2018

Hi @chapmanb

I sent u an email yesterday. Just wondering that whether you have received it?

Thanks
Shifu

@chapmanb
Copy link
Author

Shifu;
Sorry, I don't think I got this. I'm not sure what happened, but could you resend or we can discuss here? My e-mail is in my GitHub profile. Thanks again.

@sfchen
Copy link
Member

sfchen commented Aug 31, 2018

I sent the email to your Harvard mailbox, and probably it was filtered to the junk box. So I just resent it to your fastmail.com mailbox :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants