Add 'auto' option to samplesheet to automatically detect strandedness for samples #729

ojziff · 2021-12-03T12:07:55Z

Dear nfcore/rnaseq team,

The fetchngs > rnaseq pipline is simply awesome!

Because sample strandedness is almost never documented in GEO or ENA metadata, one thing that could be improved is adding an optional argument to let infer_experiment.py decide the strandedness - especially when the strandedness is clear cut.

This would save a lot of time - currently i take a complete stab in the dark as to whether it is forward/reverse/unstranded. More often than not (66% of the time to be precise) I have to restart the pipeline from scratch because my initial guess at the strandedness was wrong. This typically wastes ~ 5 hours per GEO accession.

In cases when strandedness is borderline, it would still be helpful to allow infer_experiment.py to decide as this is far more informed than my guess. This is also already flags up in the current multiqc output and would allow the user to be aware / make changes if necessary - although from my experience (i have run nfcore/rnaseq on public data ~ 100 times) the borderlines cases are typically still clear cut e.g. 80% rather than 90% reverse .

The current samplesheet output from fetchngs does not include a column for strandedness. For beginners particularly, this is a problem as your cant smoothly run the pipeline without adding the strandedness variable to the samplesheet. This has led to unnecessary supplemental scripts that add strandedness e.g. https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py . Ideally, i would like to see that if you specify the argument to allow infer_experiment.py decide the strandedness, then the samplesheet.csv should not require a strandedness column.

Another confusing issue is nomenclature around strandedness. Some tools refer to sense / antisense; others forward / reverse and others first / second.

For comparison, many aligners have a default argument that decide on strandedness themselves and I believe nfcore/rnaseq would benefit from doing the same.

Many thanks!
Oliver

The text was updated successfully, but these errors were encountered:

drpatelh · 2021-12-13T15:17:12Z

Hi @ojziff ! I agree this would be an awesome feature but I suspect we would need a separate sub-workflow to get a small sub-sample of reads, align and then run infer_experiment.py. The main reason is that some of the aligners themselves (as you can see in the list of modules below) require a strandedness value and so it becomes a bit of a 🐔 and 🥚 situation where you need to map to figure out the strandedness.

$ find ./modules -type f -exec grep -H 'strandedness' {} \; | cut -f 1 -d' ' | grep "main.nf" | sort | uniq
./modules/nf-core/modules/hisat2/align/main.nf:
./modules/nf-core/modules/rsem/calculateexpression/main.nf:
./modules/nf-core/modules/salmon/quant/main.nf:
./modules/nf-core/modules/stringtie/stringtie/main.nf:
./modules/nf-core/modules/qualimap/rnaseq/main.nf:
./modules/nf-core/modules/subread/featurecounts/main.nf:

I think this is going to be too much work to add in the 3.5 release but I will bump it to the 3.6 release milestones.

ojziff · 2021-12-13T15:40:27Z

Thanks @drpatelh! Yes this approach seems the best way forwards as infer_experiment.py needs an aligned bam file. Although from your list i cant see STAR so one possibility is to run STAR upfront > infer_experiment.py > Salmon quant | RSEM (not sure how this would work with HISAT2 though).

Thanks for prioritising this - anecdotally i've spoken with quite a few others (mostly at Crick) who will greatly benefit from this.

Many thanks!
Oliver

ChiaraF32 · 2022-02-10T03:20:42Z

I have had a similar experience to @ojziff regarding not knowing the strandedness of the data prior to executing the pipeline. I set the strandedness as a default to "unstranded" for a set of data, and noticed in the multiqc output that all of my data is actually "reverse" stranded. I thought that this would not be an issue after reading the following in the documentation for the infer_experiment.py: "You don’t need to know the RNA sequencing protocol before mapping your reads to the reference genome. Mapping your RNA-seq reads as if they were non-strand specific, this script can “guess” how RNA-seq reads were stranded."

My question is thus: should I be concerned that I did not specify the strandedness correctly? Will the data in the BAM file be incorrect regarding strandedness? I am using the BAMs in another pipeline to do the counts and detect expression and splicing outliers in the data. This pipeline requires strandedness as input too, so I was hoping that if I specify "reverse" here that it would work properly, but I am not 100% sure.

Many thanks in advance for your help with clarifying this.
Chiara

ojziff · 2022-02-10T09:27:49Z

Hi @ChiaraF32

As many of the processes require the correct strandedness, unfortunately, you need to re-run the pipeline with the correct strandedness in the samplesheet.csv.

You can sometimes see in the publication methods associated with the accession what the strandedness should be. If it says that TruSeq Stranded library prep was used, then strandedness should be reverse. Almost all the other library preps e.g. NEBNext Ultra, produce unstranded libraries. I am yet to see a forward-stranded library.

Good luck,
Oliver

ChiaraF32 · 2022-02-14T04:14:18Z

Hi Oliver,

Thanks for your advice!

Kind regards,
Chiara

drpatelh · 2022-04-26T15:05:03Z

I have made a start to solve this here but it still needs a little more work and discussion.

Some linked comments on Slack here

bounlu · 2022-11-28T03:36:29Z

Here is a list of library kits with their strandedness info and parameters to be used in the tools:
https://chipster.csc.fi/manual/library-type-summary.html

drpatelh · 2022-12-19T17:34:38Z

Implemented in #910

If you set the strandedness value to auto in the input samplesheet, the pipeline will sub-sample the input FastQ files to 1 million reads, use Salmon Quant to infer the strandedness automatically and then propagate this information to the remainder of the pipeline. If the strandedness has been inferred or provided incorrectly a warning will be present at the top of the MultiQC report so please be sure to check when looking at the QC for your samples.

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto

ojziff added the enhancement label Dec 3, 2021

drpatelh added this to the 3.5 milestone Dec 13, 2021

drpatelh modified the milestones: 3.5, 3.6 Dec 13, 2021

drpatelh modified the milestones: 3.6, 3.7 Feb 20, 2022

drpatelh modified the milestones: 3.7, 3.8 Apr 26, 2022

drpatelh modified the milestones: 3.9, 3.10 Sep 25, 2022

drpatelh changed the title ~~add optional argument to let infer_experiment.py decide strandedness~~ Add 'auto' option to samplesheet to automatically detect strandedness for samples Dec 19, 2022

drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Dec 19, 2022

Fix nf-core#729

897efde

drpatelh mentioned this issue Dec 19, 2022

Add 'auto' option to auto-detect strandedness #910

Merged

drpatelh closed this as completed Dec 19, 2022

ryanyord mentioned this issue Mar 14, 2023

Add 'auto' option to fastq_dir_to_samplesheet.py and set as default argument #956

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add 'auto' option to samplesheet to automatically detect strandedness for samples #729

Add 'auto' option to samplesheet to automatically detect strandedness for samples #729

ojziff commented Dec 3, 2021

drpatelh commented Dec 13, 2021

ojziff commented Dec 13, 2021

ChiaraF32 commented Feb 10, 2022

ojziff commented Feb 10, 2022

ChiaraF32 commented Feb 14, 2022

drpatelh commented Apr 26, 2022

bounlu commented Nov 28, 2022

drpatelh commented Dec 19, 2022

Add 'auto' option to samplesheet to automatically detect strandedness for samples #729

Add 'auto' option to samplesheet to automatically detect strandedness for samples #729

Comments

ojziff commented Dec 3, 2021

drpatelh commented Dec 13, 2021

ojziff commented Dec 13, 2021

ChiaraF32 commented Feb 10, 2022

ojziff commented Feb 10, 2022

ChiaraF32 commented Feb 14, 2022

drpatelh commented Apr 26, 2022

bounlu commented Nov 28, 2022

drpatelh commented Dec 19, 2022