Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'auto' option to samplesheet to automatically detect strandedness for samples #729

Closed
ojziff opened this issue Dec 3, 2021 · 8 comments
Milestone

Comments

@ojziff
Copy link

ojziff commented Dec 3, 2021

Dear nfcore/rnaseq team,

The fetchngs > rnaseq pipline is simply awesome!

Because sample strandedness is almost never documented in GEO or ENA metadata, one thing that could be improved is adding an optional argument to let infer_experiment.py decide the strandedness - especially when the strandedness is clear cut.

This would save a lot of time - currently i take a complete stab in the dark as to whether it is forward/reverse/unstranded. More often than not (66% of the time to be precise) I have to restart the pipeline from scratch because my initial guess at the strandedness was wrong. This typically wastes ~ 5 hours per GEO accession.

In cases when strandedness is borderline, it would still be helpful to allow infer_experiment.py to decide as this is far more informed than my guess. This is also already flags up in the current multiqc output and would allow the user to be aware / make changes if necessary - although from my experience (i have run nfcore/rnaseq on public data ~ 100 times) the borderlines cases are typically still clear cut e.g. 80% rather than 90% reverse .

The current samplesheet output from fetchngs does not include a column for strandedness. For beginners particularly, this is a problem as your cant smoothly run the pipeline without adding the strandedness variable to the samplesheet. This has led to unnecessary supplemental scripts that add strandedness e.g. https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py . Ideally, i would like to see that if you specify the argument to allow infer_experiment.py decide the strandedness, then the samplesheet.csv should not require a strandedness column.

Another confusing issue is nomenclature around strandedness. Some tools refer to sense / antisense; others forward / reverse and others first / second.

For comparison, many aligners have a default argument that decide on strandedness themselves and I believe nfcore/rnaseq would benefit from doing the same.

Many thanks!
Oliver

@drpatelh drpatelh added this to the 3.5 milestone Dec 13, 2021
@drpatelh
Copy link
Member

Hi @ojziff ! I agree this would be an awesome feature but I suspect we would need a separate sub-workflow to get a small sub-sample of reads, align and then run infer_experiment.py. The main reason is that some of the aligners themselves (as you can see in the list of modules below) require a strandedness value and so it becomes a bit of a 🐔 and 🥚 situation where you need to map to figure out the strandedness.

$ find ./modules -type f -exec grep -H 'strandedness' {} \; | cut -f 1 -d' ' | grep "main.nf" | sort | uniq
./modules/nf-core/modules/hisat2/align/main.nf:
./modules/nf-core/modules/rsem/calculateexpression/main.nf:
./modules/nf-core/modules/salmon/quant/main.nf:
./modules/nf-core/modules/stringtie/stringtie/main.nf:
./modules/nf-core/modules/qualimap/rnaseq/main.nf:
./modules/nf-core/modules/subread/featurecounts/main.nf:

I think this is going to be too much work to add in the 3.5 release but I will bump it to the 3.6 release milestones.

@drpatelh drpatelh modified the milestones: 3.5, 3.6 Dec 13, 2021
@ojziff
Copy link
Author

ojziff commented Dec 13, 2021

Thanks @drpatelh! Yes this approach seems the best way forwards as infer_experiment.py needs an aligned bam file. Although from your list i cant see STAR so one possibility is to run STAR upfront > infer_experiment.py > Salmon quant | RSEM (not sure how this would work with HISAT2 though).

Thanks for prioritising this - anecdotally i've spoken with quite a few others (mostly at Crick) who will greatly benefit from this.

Many thanks!
Oliver

@ChiaraF32
Copy link

I have had a similar experience to @ojziff regarding not knowing the strandedness of the data prior to executing the pipeline. I set the strandedness as a default to "unstranded" for a set of data, and noticed in the multiqc output that all of my data is actually "reverse" stranded. I thought that this would not be an issue after reading the following in the documentation for the infer_experiment.py: "You don’t need to know the RNA sequencing protocol before mapping your reads to the reference genome. Mapping your RNA-seq reads as if they were non-strand specific, this script can “guess” how RNA-seq reads were stranded."

My question is thus: should I be concerned that I did not specify the strandedness correctly? Will the data in the BAM file be incorrect regarding strandedness? I am using the BAMs in another pipeline to do the counts and detect expression and splicing outliers in the data. This pipeline requires strandedness as input too, so I was hoping that if I specify "reverse" here that it would work properly, but I am not 100% sure.

Many thanks in advance for your help with clarifying this.
Chiara

@ojziff
Copy link
Author

ojziff commented Feb 10, 2022

Hi @ChiaraF32

As many of the processes require the correct strandedness, unfortunately, you need to re-run the pipeline with the correct strandedness in the samplesheet.csv.

You can sometimes see in the publication methods associated with the accession what the strandedness should be. If it says that TruSeq Stranded library prep was used, then strandedness should be reverse. Almost all the other library preps e.g. NEBNext Ultra, produce unstranded libraries. I am yet to see a forward-stranded library.

Good luck,
Oliver

@ChiaraF32
Copy link

Hi Oliver,

Thanks for your advice!

Kind regards,
Chiara

@drpatelh drpatelh modified the milestones: 3.6, 3.7 Feb 20, 2022
@drpatelh
Copy link
Member

I have made a start to solve this here but it still needs a little more work and discussion.

Some linked comments on Slack here

@drpatelh drpatelh modified the milestones: 3.7, 3.8 Apr 26, 2022
@drpatelh drpatelh modified the milestones: 3.9, 3.10 Sep 25, 2022
@bounlu
Copy link

bounlu commented Nov 28, 2022

Here is a list of library kits with their strandedness info and parameters to be used in the tools:
https://chipster.csc.fi/manual/library-type-summary.html

@drpatelh drpatelh changed the title add optional argument to let infer_experiment.py decide strandedness Add 'auto' option to samplesheet to automatically detect strandedness for samples Dec 19, 2022
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Dec 19, 2022
@drpatelh
Copy link
Member

Implemented in #910

If you set the strandedness value to auto in the input samplesheet, the pipeline will sub-sample the input FastQ files to 1 million reads, use Salmon Quant to infer the strandedness automatically and then propagate this information to the remainder of the pipeline. If the strandedness has been inferred or provided incorrectly a warning will be present at the top of the MultiQC report so please be sure to check when looking at the QC for your samples.

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants