Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative trimming step for polyA/T removal #663

Closed
Reda94 opened this issue Jun 25, 2021 · 11 comments
Closed

Alternative trimming step for polyA/T removal #663

Reda94 opened this issue Jun 25, 2021 · 11 comments
Milestone

Comments

@Reda94
Copy link

Reda94 commented Jun 25, 2021

Hello, it looks like TrimGalore does not automatically perform polyA/T removal (see post-trimming fastqc screenshot below)
). They have an experimental option to do so (https://github.com/FelixKrueger/TrimGalore/blob/e9b8fd847f4da01fa3b886d134bc2ecd447a8068/trim_galore#L3230-L3257) but it would require running the nf-core/rnaseq pipeline twice (https://github.com/FelixKrueger/TrimGalore/blob/e9b8fd847f4da01fa3b886d134bc2ecd447a8068/trim_galore#L3248-L3254) (@drpatelh).

Would be great to have e.g. bbduk (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/) as it allows simultaneous base quality, adapter and polyA/T trimming.

Thanks!

Capture d’écran 2021-06-25 à 11 17 43

@FelixKrueger
Copy link

To just quickly chip in here. It is correct that the option --polyA in Trim Galore is designed to work with an RNA kit which specifically produces reads from polyA labelled transcripts (Thermo Fisher, Collibri) https://www.thermofisher.com/uk/en/home/life-science/sequencing/next-generation-sequencing/ngs-library-preparation-illumina-systems/collibri-stranded-rna-library-prep-kits.html?gclid=EAIaIQobChMIv7mB-tXn5QIVibHtCh3dTg7gEAAYASAAEgI7KPD_BwE&ef_id=EAIaIQobChMIv7mB-tXn5QIVibHtCh3dTg7gEAAYASAAEgI7KPD_BwE:G:s&s_kwcid=AL!3652!3!303512728519!p!!g!!collibri.

Standard removal of PolyA sequences could be accomplished using e.g. -a {A}10.

Trim Galore is does indeed not trim several things at the same time, and intentionally so. One could probably argue that you for the sequences you showed above, PolyA/PolyT sequence would remove all full homo-polymers entirely, and would truncate the PolyT sequences to a few nucleotides (certainly <15), which are either removed by a size-selection step, or will not map (uniquely). So in other words: these sequences are useless from a library point of view.

In your case, mapping will have the exact same effect though: PolyX sequences will probably either not map properly in the first place, or be soft-clipped (and then not map properly). The result should be pretty much the same: These sequences will not result in useful RNA-seq counts.

By all means, add another trimming option for the pipeline, but I don't think that it'll have a noteworthy advantage over ignoring PolyX sequences (and let the aligner take care of it), or even running an adapter trimming process first, and then using that output as the input for PolyA/T trimming (which should be just a matter of copy paste with DSL2).

It is still good to know if you have these contaminants in your library, as they may dramatically impact the mapping efficiency. You could add these sequences to a custom contaminants.txt file, and then run:

fastqc -a contaminants.txt *fastq.gz

to see the extent of PolyX sequences in your library.

Happy to discuss :)
Felix

Illumina Universal Adapter			AGATCGGAAGAG
Illumina Small RNA Adapter			ATGGAATTCTCG
Nextera Transposase Sequence		        CTGTCTCTTAT
#SOLiD Small RNA Adapter			CGCCTTGGCCGTA
Illumina Paired End PCR Primer 2		CTACACGACGC
PolyA						AAAAAAAAAAAA
PolyT						TTTTTTTTTTTT
PolyG						GGGGGGGGGGGG
PolyC						CCCCCCCCCCCC

@drpatelh
Copy link
Member

drpatelh commented Jun 25, 2021

Thank you for the detailed info @FelixKrueger ! I agree that most of these sequences won't align but I would personally prefer for these to removed at the "adapter" trimming step because it gives you an obvious assessment as to the fact that you may have such artifacts in the library. Trying to figure out why things haven't mapped can be quite painful.

I didn't know you could use -a {A}10 like that. Neat! In which case, you may be able to just tweak the pipeline options and run it once. Can you create a file called custom.config with the contents below @Reda94 and add -c custom.config to the command you are running?

params {
  modules {
    'trimgalore' {
      args = '--fastqc -a {A}10'
    }
  }
}

So would you need an additional -a {T}10 for polyT's @FelixKrueger or will the reverse complement be auto-trimmed?

@FelixKrueger
Copy link

Hi Harshil,

yes, args = '--fastqc -a {A}10' should work straight out of the box, but again, this will not be on top the adapter (auto-detection) and trimming, but this will be used as the adapter sequence itself. And if you look for -a AAAAAAAAAA it will trim - drumroll - PolyA! We are deliberately trying to keep things simple, and not do any obscure and undocumented reverse complement trimming as well...

@drpatelh
Copy link
Member

drpatelh commented Jun 25, 2021

Ah, I see! So it won't trim the conventional adapters if you use args = '--fastqc -a {A}10' in which case you would have to perform the adapter trimming twice anyway. Ok, maybe not the appropriate solution then unless you run the pipeline twice or if you are prepared to let the mapping take care of this.

@FelixKrueger
Copy link

FelixKrueger commented Jun 25, 2021

It wouldn't necessarily be the entire pipeline twice, but just add an extra trimming step. So something like the nf-core equivalent of:

TRIM_GALORE                     (file_ch, params.outdir, params.trim_galore_args, params.verbose)
if (PolyX_trimming_will_make_my_day){
     params.trim_galore += " -a {10} "
     TRIM_GALORE                     (TRIM_GALORE.out.reads, params.outdir, params.trim_galore_args, params.verbose)
}

I would also split out FastQC as it can then be run in parallel rather then as part of Trim Galore:

FASTQC2                         (TRIM_GALORE.out.reads, params.outdir, params.fastqc_args, params.verbose)

@FelixKrueger
Copy link

but yea, in the end the results will be pretty much the same however you implement it :P

@drpatelh
Copy link
Member

Sorry, I meant running only the trimming (and other mandatory) steps first time around and then running the pipeline full blown with either --polyA or -a {10} the second time around. Running the trimming twice isn't ideal 😅 Just out of interest why do you need to run the trimming twice for the polyA stuff?

@FelixKrueger
Copy link

For --polyA with the Collibri kit, it was actually a little complicated. But the bottom line was that you need to remove poor qualities and sequencing adapters first, and then as a second step find and remove poly-T sequences (I believe it was different for R1 and R2 as well, and write the found PolyT bases into the read IDs of both reads, and then use only one read for mapping (and use the readID for downstream processing). This protocol was specifically designed to identify alternative transcription end and PolyA initiation sites, and yea sequences with tons of AAAAA or TTTT are not exactly great for anything really...

Regarding your earlier quote:

Trying to figure out why things haven't mapped can be quite painful.

This is certainly true, but if you are running pipelines that will magically do lots of things at the same time you might just look at the mapping efficiency, see that it's great and move on - and potentially miss that you had 50% of sequences lost because of a technical issue that results in homo-polymers (which you should probably try and address on the wet lab side...)

@drpatelh
Copy link
Member

you might just look at the mapping efficiency, see that it's great and move on

Good point! We also do have the cutadapt logs in the MultiQC report where this sort of info is easily accessible. But I guess it will come down to what is actually looked at further e.g. users may not even look at the mapping rates 🤷🏽 Another advantage of trimming beforehand is that there will most likely be a diversity of sequences with a variable count of these polyA bases - assuming that the aligner will handle these appropriately could be a little dangerous and even if it may be minimal this could actually impact the quantification.

@tomsing1
Copy link
Contributor

Just because this came up in a thread on slack: STAR can clip 3' polyA sequences, too, e.g. by adding the following STAR argument: --clip3pAdapterSeq AAAAAAAA

@drpatelh
Copy link
Member

Added fastp support in #970 which will allow you to use --trimmer fastp --extra_fastp_args '--trim_poly_x' to trim the polyA tails by default along with any adapter sequences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants