Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling #33

chapmanb · 2018-03-03T11:28:00Z

Shifu;
Congrats on the paper in bioRxiv and thanks for all the great work on fastp. We've been working on improving the runtimes for somatic variant calling workflows and exploring quality and polyX trimming. We did a test run with fastp and atropos and found that the major improvements in runtime were due to removal of polyX sequences at the 3' ends of reads:

https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim

We'd used the new polyG trimming functionality (thank you), but a crude method of 3' polyA/T/C adapter removal, which appears to be less effective with fastp compared to atropos trimming. When additional polyX stretches get removed we get much better runtimes for alignment and variant calling.

I saw general polyX and low complexity trimming are on the roadmap for fastp and would like to express my support for this. We've been making great use of fastp for adapter conversion and would like to offer trimming as part of an effort to speed up alignment and variant calling both on NovaSeqs and more generally.

As a secondary help for integration, is streaming trimming a possibility for paired ends? To help improve preparation runtimes I'd been thinking of including trimming and streaming directly into bwa/minimap2 alignment, or being able to stream outputs into bgzip so we can index and parallelize variant calling.

Thanks again for all the work on fastp.

sfchen · 2018-03-03T14:12:16Z

Thank you Brad!

Actually fastp only implemented polyG trimming in 3' ends since I thought polyG is much more serious than poly A/T/C for NextSeq/NovaSeq. It's not difficult to support polyX trimming based on the polyG trimming implementation. And I will implement it, according to your good suggestion.

Since the paired-end processing will output two files, I cannot find a good solution to stream the output to aligners.

Thank you again for your support and good suggestions. I will continue to improve this tool.

ohofmann · 2018-03-03T22:22:59Z

Shifu,

thank you so much -- we are one of the NovaSeq sites that would positively benefit from a fast trimming implementation, particularly if it can handle polyX sequences. A +1 on the roadmap from us.

sfchen · 2018-03-04T05:45:10Z

Hi Brad and Oliver,

I'd like to ask some questions about polyX trimming:

We know that polyG is caused by signal attenuation of SBS. When the signal is very weak, the base will be detected as G in a two-colour system like NovaSeq or NextSeq. But what causes poly A/T/C?
If polyX trimming is implemented, do we still need polyG trimming options?
Do you think polyX trimming should be enabled by default for preprocessing NovSeq / NextSeq data?

Thanks
Shifu

chapmanb · 2018-03-04T11:14:20Z

Shifu;
Thanks so much for considering this feature. In terms of motivation, I think the polyX trimming is separate from NovaSeq polyG trims. There isn't a NovaSeq mechanism contributing to polyA/T/C since this is also present in non-NovaSeq (HiSeq and friends) reads. Rather this is due to some kind of slippage issue or is sequencing real signal. But either way what happens with this during variant calling is that these reads pile up on A stretches of the genome, creating very deep pileups of noise which take a long time to variant call through especially when identifying low frequency somatic variants.

So, I'd suggest having these as separate options. Some users might want only polyG and to keep the other poly stretches. I think having both as optional flags is probably the best option for now until trimming these becomes more standard accepted practice.

Regarding streaming for paired end reads, there are a couple of use cases where we could take advantage of this;

Streaming interleaved reads directly into mappers like bwa and minimap2. So instead of -o and -O, stream both ends, interleaved, directly to standard out.
Streaming into bgzip preparation of reads, doing something like -o >(bgzip --threads 8 > out_R1.fq.gz) -O >(bgzip --threads 8 > out_R2.fq.gz). This would help us prepare bgzipped outputs which are indexable for looks like grabix (https://github.com/arq5x/grabix) so we could parallelize subsequent alignment over regions of the fastq file.

Thanks again for considering all these options and all the work on fastp.

ohofmann · 2018-03-04T23:56:07Z

Shifu,

I think both options are useful. polyG is a 2-color-chemistry specific problem and causes variant callers to fail depending on coverage depth (or at least stall for a long time). That in itself is valuable, but based on the preliminary results removing other polyX regions helps with both speed and precision, something I didn't anticipate.

For two color chemistry I'd suggest polyG as a default, with everything else up to the user.

sfchen · 2018-03-05T00:01:17Z

Okay.

Another question: do you want to discard reads containing long polyX in the middle of reads. I mean the polyX is not in 3' or 5' ends.

This function may have side effects on MSI detection.

sfchen · 2018-03-05T00:07:42Z

Hi @chapmanb ,

Streaming both ends, interleaved, directly into STDOUT is a good suggestion. I have just confirmed that minimap2 also supports interleaved FASTQ, as well as bwa does.

I will add this feature to the roadmap.

Thanks
Shifu

ndaniel · 2018-03-05T12:14:22Z

@sfchen

But what causes poly A/T/C?

In RNA-seq one has polyA due to polyadenylation of mRNA, and therefore one would want to trim the polyA ends from the reads before aligning the reads. Some RNA-seq datasets, which come from non-stranded library preparation, will have the polyA showing up as polyT in the reads.

sfchen · 2018-03-06T07:24:43Z

Hi all, I am just starting to implement 3' end polyX trimming.

Another important question:
1, in your opinion, how many consecutive bases must be detected as a polyX? How about 10 bp?
2, to be error tolerant, how many mismatches are allowed? How about 1 mismatch per 10bp?

Although I will make these options configurable by command line, I still need these information to set the default configuration.

Thanks
Shifu

ohofmann · 2018-03-06T07:32:03Z

https://s3-ap-southeast-2.amazonaws.com/umccr/umccr/qc/polyg/ipmn2219-2_33_tumor_hotspot_fastqc.html#M9 is a FASTQC report including just one problematic region. I can only speak for poly-G tracks, but even removing just reads that consist of 50 consecutive Gs would allow most variant callers to finish the region in a reasonable amount of time; allowing one mismatch in 50 would get rid of all but the a small minority.

I think at least for the poly-G issue a conservative setting would work. Not sure what the Atropos defaults are though.

sfchen · 2018-03-06T07:37:17Z

fastp's current polyG implementation is a bit aggressive: 10 bp consecutive G will be categorised as polyG, with one mismatch allowed in per 8 bp.

chapmanb · 2018-03-06T11:05:45Z

Shifu;
The defaults you use for polyG -- 10 Gs with 1 mismatch per 8bp -- also seem reasonable to me for polyX. Having them consistent as defaults seems like a worthwhile strategy. Thanks again for working on this.

#33

sfchen · 2018-03-06T14:38:11Z

I just implemented this feature.

Please try to build fastp with latest code on master. Or download http://opengene.org/fastp/fastp to test.

chapmanb · 2018-03-07T02:31:58Z

Thanks so much for the quick work integrating this, we really appreciate it. I've re-run a test on the same dataset used above and this does provide some nice improvements in removing these stretches. In comparison to what we were seeing before, here are the top 3' ends for the first 10 million reads:

GTGTGTGTGT 1217
TGTGTGTGTG 1175
CACACACACA 1122
ACACACACAC 1089
TTTTTTTTTG 1083
TTTTTTTTTA 1009

The low complexity dinucleotide repeats are still expected with our current trimming, but the last two polyT with a different base end are ones I'd expect to remove. I dug into them and here are some example reads with these ends left after trimming:

AGGAATTCTGCAGCTTTTTCTTTTCTTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTGTTTTTTTTTATTTTTTTGTTTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTTA
CCCTTCTTTACGGTGAAGCTTATTCTGATTAAGCCTAGACTGTGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA
TGAAGGCCTGGGGATGGTGACTGAAGAAGGAACACGTAAGTAACTAATGAATGTGAAGGCCATTCTCTTCCTGATTAAAATCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTG
TGGGTGTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTG

Happy to provide more examples if it would help. Thanks again for the work on this.

sfchen · 2018-03-07T02:42:23Z

@chapmanb how about trim the last base with -t 1 option? Are they from the 151st cycle?

Anyway, I can make a minor revision to trim polyX like TTTTTTTTTTTTTTTTTTTTTTTTA, but we should consider that whether it will cause over trimming. Will it impact MSI detection?

chapmanb · 2018-03-07T02:51:16Z

Shifu -- thanks for this. Generally I thought these should fall into the logic of being > poly_x_min_len (which I left at 10) and the logic of 1 allowed mismatch per 8bp with a maximum of 5 mismatches. Are you requiring that the polyX stretch initiate with the 3' most end and implicitly disallowing mismatches there?

MSIs should primarily by in more complex di-nucleotide+ repeats (like the first 4 examples in the remaining trim) and I agree the polyX shouldn't touch those to avoid messing with these. Exploring low complexity filters and the impact of this type of detection would be a useful secondary filter but something more long term. We're hoping to isolate the smaller set of trimming changes which help most with runtimes as a first pass, so clearing out the remaining noisy polyT and other reads would be helpful.

Thanks again for the discussion and help.

#33

sfchen · 2018-03-07T03:38:46Z

@chapmanb I just made an update to trim the polyX like TTTTTTTTTTTTTTTTTTTTTTTTA . Could you please try the latest build?

chapmanb · 2018-03-09T17:51:18Z

Shifu;
Thanks much for the fix and continuing to work on this. I've been testing this and comparing with atropos trimming and finding some reads where atropos trims and fastp does not. This is a list of ~1000 that atropos will trim after being fed the fastp trimmed outputs. I don't think atropos does the right thing in all of these cases but would especially like to remove a lot of these problematic reads that are almost all polyX. The file is two columns, with the first the remaining read and the second the part that gets trimmed, so adding those first two columns gives the original post-fastp read:

https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5

I'm continuing to try to understand which reads specifically help with downstream speed improvements, and appreciate all the help trying to be able to be more aggressive in removing this polyX noise. Thanks again.

sfchen · 2018-03-09T23:02:19Z

Thank you Brad, I will do a test with the file you provided.

sfchen · 2018-03-13T07:06:57Z

Hi @chapmanb I cannot download the file you uploaded at https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5

Would you please attach it in this issue? I think it's small enough to be attached here.

Thanks
Shifu

chapmanb · 2018-03-13T09:59:36Z

Shifu -- what issues are you running into downloading it? This is the direct link for the raw version for download if that helps:

https://gist.githubusercontent.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5/raw/9867e4ccb5c8feef5eecc14b3499dc2a36cdf318/untrimmed_fastp.txt

sfchen · 2018-03-13T10:01:39Z

This direct link works, thanks.

sfchen · 2018-03-14T14:48:00Z

Hi @chapmanb

With the latest build, I implemented low complexity filtering, which can filter out most of these reads. You can test it by specifying -y

chapmanb · 2018-03-21T13:35:10Z

Shifu;
Thanks so much for the implementation. I've been using this and working through different trimming comparisons with both fastp and atropos to try and find a good minimal combination of quality and polyX trimming that improves calling runtimes and helps with senstivity/specificity.

It looks in the end like the primary different is due to quality trimming differences, and polyX 3' ends up being a reflection of that:

https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim

I'm trying to harmonize fastp and atropos trimming to better understand the differences but am having trouble replicating the runtime improvements we find with atropos trimming. For fastp I use:

--cut_by_quality3 --cut_mean_quality 5 --disable_quality_filtering

to get 3' only quality trimming and with atropos use:

--quality-cutoff 5

which I think should be roughly equivalent. I've tried increasing --cut_mean_quality without much change, so must be missing something.

I'll continue to dig but welcome any thoughts about how best to synchronize these, or if I'm missing anything obvious. Thank you again for all the help and happy to have two quality options to test with.

sfchen · 2018-06-20T07:41:55Z

Hi @chapmanb

A quick update, fastp 0.16.0 supports streaming to STDOUT, and also supports interleaved input. Please see the update on README: https://github.com/OpenGene/fastp#input-and-output

output to STDOUT

fastp supports streaming the passing-filter reads to STDOUT, so that it can be passed to other compressors like bzip2, or be passed to aligners like bwa and bowtie2.

specify --stdout to enable this mode to stream output to STDOUT
for PE data, the output will be interleaved FASTQ, which means the files will contain records like record1-R1 -> record1-R2 --> record2-R1 -> record2-R2 --> record3-R1 -> record3-R2 ...

I hope you will like this new feature.

BTW, I think it's you that maintain fastp on BioConda, am I right? If yes, can you please add me as one collaborator so that I can update it after new version released?

chapmanb · 2018-06-20T20:05:21Z

Shifu;
Thanks so much for these improvements, that is really helpful. I'll look at incorporating this into bcbio's use of fastp.

For the bioconda package, you don't need to be a collaborator or anything special. If you update the recipe, send a PR and cc me I'd be happy to merge:

https://bioconda.github.io/contributing.html

or you can also ask to become a contributor to the project to merge them yourself:

https://bioconda.github.io/contrib-setup.html#request-to-be-added-to-the-bioconda-team-optional

It's pretty lightweight meant to enable as many contributors as we can. Note that right nowe we're in the middle of a huge transition to a new compiler system and rebuild so things are bogged down on new recipes. Hopefully that backlog will get cleared soon. Thanks again.

sfchen · 2018-08-30T02:07:51Z

Hi @chapmanb

I sent u an email yesterday. Just wondering that whether you have received it?

Thanks
Shifu

chapmanb · 2018-08-31T01:38:30Z

Shifu;
Sorry, I don't think I got this. I'm not sure what happened, but could you resend or we can discuss here? My e-mail is in my GitHub profile. Thanks again.

sfchen · 2018-08-31T03:15:15Z

I sent the email to your Harvard mailbox, and probably it was filtered to the junk box. So I just resent it to your fastmail.com mailbox :)

sfchen added a commit that referenced this issue Mar 6, 2018

support polyX trimming in 3' ends

986bd76

#33

sfchen added a commit that referenced this issue Mar 7, 2018

trim the polyX having a different last base in 3' end

1d20ad1

#33

sfchen closed this as completed Jul 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling #33

Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling #33

chapmanb commented Mar 3, 2018

sfchen commented Mar 3, 2018

ohofmann commented Mar 3, 2018

sfchen commented Mar 4, 2018

chapmanb commented Mar 4, 2018

ohofmann commented Mar 4, 2018

sfchen commented Mar 5, 2018 •

edited

Loading

sfchen commented Mar 5, 2018

ndaniel commented Mar 5, 2018 •

edited

Loading

sfchen commented Mar 6, 2018 •

edited

Loading

ohofmann commented Mar 6, 2018

sfchen commented Mar 6, 2018 •

edited

Loading

chapmanb commented Mar 6, 2018

sfchen commented Mar 6, 2018

chapmanb commented Mar 7, 2018 •

edited

Loading

sfchen commented Mar 7, 2018 •

edited

Loading

chapmanb commented Mar 7, 2018

sfchen commented Mar 7, 2018

chapmanb commented Mar 9, 2018

sfchen commented Mar 9, 2018

sfchen commented Mar 13, 2018

chapmanb commented Mar 13, 2018

sfchen commented Mar 13, 2018

sfchen commented Mar 14, 2018

chapmanb commented Mar 21, 2018

sfchen commented Jun 20, 2018 •

edited

Loading

chapmanb commented Jun 20, 2018

sfchen commented Aug 30, 2018

chapmanb commented Aug 31, 2018

sfchen commented Aug 31, 2018

Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling #33

Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling #33

Comments

chapmanb commented Mar 3, 2018

sfchen commented Mar 3, 2018

ohofmann commented Mar 3, 2018

sfchen commented Mar 4, 2018

chapmanb commented Mar 4, 2018

ohofmann commented Mar 4, 2018

sfchen commented Mar 5, 2018 • edited Loading

sfchen commented Mar 5, 2018

ndaniel commented Mar 5, 2018 • edited Loading

sfchen commented Mar 6, 2018 • edited Loading

ohofmann commented Mar 6, 2018

sfchen commented Mar 6, 2018 • edited Loading

chapmanb commented Mar 6, 2018

sfchen commented Mar 6, 2018

chapmanb commented Mar 7, 2018 • edited Loading

sfchen commented Mar 7, 2018 • edited Loading

chapmanb commented Mar 7, 2018

sfchen commented Mar 7, 2018

chapmanb commented Mar 9, 2018

sfchen commented Mar 9, 2018

sfchen commented Mar 13, 2018

chapmanb commented Mar 13, 2018

sfchen commented Mar 13, 2018

sfchen commented Mar 14, 2018

chapmanb commented Mar 21, 2018

sfchen commented Jun 20, 2018 • edited Loading

output to STDOUT

chapmanb commented Jun 20, 2018

sfchen commented Aug 30, 2018

chapmanb commented Aug 31, 2018

sfchen commented Aug 31, 2018

sfchen commented Mar 5, 2018 •

edited

Loading

ndaniel commented Mar 5, 2018 •

edited

Loading

sfchen commented Mar 6, 2018 •

edited

Loading

sfchen commented Mar 6, 2018 •

edited

Loading

chapmanb commented Mar 7, 2018 •

edited

Loading

sfchen commented Mar 7, 2018 •

edited

Loading

sfchen commented Jun 20, 2018 •

edited

Loading