Add 'auto' option to samplesheet to automatically detect strandedness for samples #729
Comments
Hi @ojziff ! I agree this would be an awesome feature but I suspect we would need a separate sub-workflow to get a small sub-sample of reads, align and then run
I think this is going to be too much work to add in the 3.5 release but I will bump it to the 3.6 release milestones. |
Thanks @drpatelh! Yes this approach seems the best way forwards as Thanks for prioritising this. Anecdotally, I've spoken with quite a few others (mostly at Crick) who will greatly benefit from this. Many thanks!
I have had a similar experience to @ojziff regarding not knowing the strandedness of the data prior to executing the pipeline. I set the strandedness as a default to "unstranded" for a set of data, and noticed in the multiqc output that all of my data is actually "reverse" stranded. I thought that this would not be an issue after reading the following in the documentation for the
My question is thus: should I be concerned that I did not specify the strandedness correctly? Will the data in the BAM file be incorrect regarding strandedness?
I am using the BAMs in another pipeline to do the counts and detect expression and splicing outliers in the data. This pipeline requires strandedness as input too, so I was hoping that if I specify "reverse" here that it would work properly, but I am not 100% sure.
Many thanks in advance for your help with clarifying this.
Hi @ChiaraF32 As many of the processes require the correct strandedness, unfortunately, you need to re-run the pipeline with the correct strandedness in the samplesheet.csv. You can sometimes see in the publication methods associated with the accession what the strandedness should be. If it says that Good luck, |
Hi Oliver, Thanks for your advice! Kind regards, |
Here is a list of library kits with their strandedness info and parameters to be used in the tools: |
Implemented in #910. If you set the strandedness value to
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
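For anyone starting from a samplesheet that has no strandedness column yet (for example one produced by fetchngs, as described in the original post), here is a minimal sketch of adding the column with the value auto. The function name `add_auto_strandedness` and the choice of required columns are my own assumptions for illustration; this is not a script from the pipeline.

```python
import csv
import io

# Assumption: at minimum these columns must be present, as in the example above.
REQUIRED_COLUMNS = ["sample", "fastq_1"]


def add_auto_strandedness(samplesheet_text: str, default: str = "auto") -> str:
    """Return CSV text with a strandedness column.

    Rows that already carry a strandedness value keep it; missing or
    blank values are set to `default`.
    """
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    fieldnames = list(reader.fieldnames or [])

    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")

    if "strandedness" not in fieldnames:
        fieldnames.append("strandedness")

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, lineterminator="\n", extrasaction="ignore"
    )
    writer.writeheader()
    for row in reader:
        if not row.get("strandedness"):
            row["strandedness"] = default
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    example = (
        "sample,fastq_1,fastq_2\n"
        "CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz\n"
    )
    print(add_auto_strandedness(example), end="")
```

Existing values such as reverse are left untouched, so the sketch can be run over a mixed samplesheet without overwriting anything that was set by hand.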
Dear nf-core/rnaseq team,
The fetchngs > rnaseq pipeline is simply awesome!
Because sample strandedness is almost never documented in GEO or ENA metadata, one thing that could be improved is adding an optional argument to let infer_experiment.py decide the strandedness, especially when the strandedness is clear cut.
This would save a lot of time: currently I take a complete stab in the dark as to whether a dataset is forward, reverse or unstranded. More often than not (66% of the time, to be precise) I have to restart the pipeline from scratch because my initial guess at the strandedness was wrong. This typically wastes ~5 hours per GEO accession.
In cases when strandedness is borderline, it would still be helpful to allow infer_experiment.py to decide, as this is far more informed than my guess. This is also already flagged up in the current multiqc output, which would allow the user to be aware and make changes if necessary, although in my experience (I have run nf-core/rnaseq on public data ~100 times) the borderline cases are typically still clear cut, e.g. 80% rather than 90% reverse.
The current samplesheet output from fetchngs does not include a column for strandedness. For beginners particularly, this is a problem, as you can't smoothly run the pipeline without adding the strandedness variable to the samplesheet. This has led to unnecessary supplemental scripts that add strandedness, e.g. https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py . Ideally, I would like to see that if you specify the argument to allow infer_experiment.py to decide the strandedness, then the samplesheet.csv should not require a strandedness column.
Another confusing issue is nomenclature around strandedness. Some tools refer to sense / antisense; others forward / reverse and others first / second.
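To make the nomenclature point concrete, here is a small lookup sketch that translates forward / reverse / unstranded into the values a few common tools expect. The values reflect my reading of each tool's documentation and assume paired-end, inward-facing reads; treat them as a starting point and check the documentation for the tool version you use. The names `STRAND_PARAMS` and `tool_params` are hypothetical.

```python
# Assumption: paired-end, inward-facing libraries. Single-end libraries use
# different values for some tools (e.g. Salmon SF/SR/U, HISAT2 F/R).
STRAND_PARAMS = {
    "unstranded": {
        "featureCounts -s": "0",
        "htseq-count --stranded": "no",
        "hisat2 --rna-strandness": None,  # option is simply omitted
        "salmon --libType": "IU",
        "tophat --library-type": "fr-unstranded",
    },
    "forward": {
        "featureCounts -s": "1",
        "htseq-count --stranded": "yes",
        "hisat2 --rna-strandness": "FR",
        "salmon --libType": "ISF",
        "tophat --library-type": "fr-secondstrand",
    },
    "reverse": {
        "featureCounts -s": "2",
        "htseq-count --stranded": "reverse",
        "hisat2 --rna-strandness": "RF",
        "salmon --libType": "ISR",
        "tophat --library-type": "fr-firststrand",
    },
}


def tool_params(strandedness: str) -> dict:
    """Look up per-tool settings for a forward/reverse/unstranded library."""
    try:
        return STRAND_PARAMS[strandedness]
    except KeyError:
        raise ValueError(
            f"unknown strandedness {strandedness!r}; "
            f"expected one of {sorted(STRAND_PARAMS)}"
        ) from None


if __name__ == "__main__":
    for option, value in tool_params("reverse").items():
        print(f"{option}: {value}")
```

Note that "first" and "second" run opposite to what one might guess: in the TopHat-style naming, fr-firststrand corresponds to reverse and fr-secondstrand to forward.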
For comparison, many aligners have a default argument that decides on strandedness themselves and I believe nf-core/rnaseq would benefit from doing the same.
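As a rough illustration of what such an automatic decision could look like, here is a sketch that takes the fractions of reads supporting the forward and reverse orientations (the kind of numbers infer_experiment.py reports) and returns a call. The thresholds are assumptions chosen to match the 80% borderline example above, not values used by the pipeline, and `classify_strandedness` is a hypothetical name.

```python
def classify_strandedness(
    forward_fraction: float,
    reverse_fraction: float,
    stranded_threshold: float = 0.8,    # assumption, not a pipeline default
    unstranded_tolerance: float = 0.1,  # assumption, not a pipeline default
) -> str:
    """Return 'forward', 'reverse', 'unstranded' or 'undetermined'."""
    for value in (forward_fraction, reverse_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must lie between 0 and 1")
    if forward_fraction + reverse_fraction > 1.0 + 1e-9:
        raise ValueError("fractions must not sum to more than 1")
    if stranded_threshold <= 0.5:
        raise ValueError("stranded_threshold must be above 0.5")

    # A clear majority in one orientation decides the call.
    if forward_fraction >= stranded_threshold:
        return "forward"
    if reverse_fraction >= stranded_threshold:
        return "reverse"
    # Roughly equal support for both orientations looks unstranded.
    if abs(forward_fraction - reverse_fraction) <= unstranded_tolerance:
        return "unstranded"
    # Anything else is left for the user to inspect.
    return "undetermined"


if __name__ == "__main__":
    print(classify_strandedness(0.05, 0.90))
    print(classify_strandedness(0.48, 0.50))
    print(classify_strandedness(0.30, 0.65))
```

Returning an explicit "undetermined" rather than forcing a call keeps the genuinely ambiguous samples visible, so they can still be flagged for manual review.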
Many thanks!
Oliver