Overhaul strandedness detection / comparison #1306

pinin4fjords · 2024-05-29T12:45:46Z

I'm not 100% convinced by the way the RSeQC results are used to check strandedness, and it leads to confusion.

To illustrate, unstranded data in RSeQC looks like:

This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925

... not:

This is PairEnd Data
Fraction of reads failed to determine: 0.6742
Fraction of reads explained by "1++,1--,2+-,2-+": 0.2654
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0604

The latter case is the consequence of reads aligning to regions where strand cannot be determined. This would include:

1. regions with genes annotated on both strands, and
2. (I think, not 100% sure) reads aligning outside of annotations completely.

(2) might be quite common with either genomic DNA contamination or intronic reads.

As is, the supplied strandedness must account for at least 70% of reads for the check to pass. Where there is a high level of undetermined reads, the check can easily fail, even where the strandedness was generated automatically by the Salmon-based check, which is confusing.

The more important statistic in determining strandedness is whether the two 'Fraction of reads explained by' lines are similar, or not. The undetermined section might make you worry about why, but it shouldn't concern you if you're just checking the strand bias.

What I'm proposing here is that one of the 'Fraction of reads explained by' lines should be at least some multiple (e.g. 5X) of the other. This would be consistent with the Salmon check and reduce confusion.
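To make the fold-difference idea concrete, here is a minimal Python sketch (not the pipeline's Groovy code). The function names `parse_infer_experiment` and `is_strand_biased`, the `min_fold` parameter, and the choice to treat the "1++,1--,2+-,2-+" line as the forward fraction are assumptions made for illustration only.

```python
import re

# Labels RSeQC prints for paired-end and single-end data, respectively.
_FORWARD = r'"(?:1\+\+,1--,2\+-,2-\+|\+\+,--)"'
_REVERSE = r'"(?:1\+-,1-\+,2\+\+,2--|\+-,-\+)"'


def parse_infer_experiment(text):
    """Return (undetermined, forward, reverse) fractions from infer_experiment.py output."""
    def grab(label):
        match = re.search(label + r":\s*([0-9.]+)", text)
        if match is None:
            raise ValueError(f"could not find a line matching {label!r}")
        return float(match.group(1))

    return (
        grab(r"Fraction of reads failed to determine"),
        grab(r"Fraction of reads explained by " + _FORWARD),
        grab(r"Fraction of reads explained by " + _REVERSE),
    )


def is_strand_biased(forward, reverse, min_fold=5.0):
    """True when the larger fraction is at least min_fold times the smaller one.

    The undetermined fraction is deliberately ignored (hypothetical helper).
    """
    larger, smaller = max(forward, reverse), min(forward, reverse)
    return larger > 0 and larger >= min_fold * smaller


UNSTRANDED = '''This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925
'''

HIGH_UNDETERMINED = '''This is PairEnd Data
Fraction of reads failed to determine: 0.6742
Fraction of reads explained by "1++,1--,2+-,2-+": 0.2654
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0604
'''

for name, report in [("unstranded", UNSTRANDED), ("high undetermined", HIGH_UNDETERMINED)]:
    undetermined, forward, reverse = parse_infer_experiment(report)
    print(name, undetermined, is_strand_biased(forward, reverse))
```

With the example 5X threshold, neither of the two reports quoted above counts as strand-biased: the second has a fold difference of about 4.4 (0.2654 / 0.0604), so the exact multiple matters for borderline samples.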

Hoping for some discussion / agreement!

Phase 2: the bigger picture

I thought harder about this, and realised that a lot of issues stemmed from comparing Salmon's internal strand inference (over which we have little control) and the bespoke way we were inferring strandedness from RSeQC results.

My proposal is now as follows:

Don't use Salmon's decision making on strandedness. Use its lib_format_counts.json output to derive our own strandedness from its numbers.
Base strandedness on the proportion of stranded reads (as above)
Use the same logic to infer strandedness from Salmon and from RSeQC. Mark as 'undetermined' anything not convincingly stranded OR unstranded. Parameterise the threshold used.
Display the strandedness inferences as standard, rather than just in the event of an error (I think they're useful).
Mark as an error anything where strandedness supplied or predicted via Salmon does not match with strandedness inferred in the same way from RSeQC results, or where strandedness is 'undetermined'.
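The steps above can be sketched roughly as follows, in Python rather than the pipeline's Groovy. This is an illustration under stated assumptions, not the merged implementation: the function names are hypothetical, the grouping of Salmon library-type keys into forward/reverse/unstranded is my reading of Salmon's library-type codes, and the way the two thresholds are applied (defaults taken from the `params.stranded_threshold` and `params.unstranded_threshold` values in the lint output) is an assumption.

```python
import json

# Assumed grouping of Salmon library-type codes found in lib_format_counts.json.
FORWARD_KEYS = ("SF", "ISF", "MSF", "OSF")
REVERSE_KEYS = ("SR", "ISR", "MSR", "OSR")
UNSTRANDED_KEYS = ("IU", "U", "MU", "OU")


def classify_strandedness(forward, reverse, unstranded=0.0,
                          stranded_threshold=0.8, unstranded_threshold=0.1):
    """Infer strandedness from counts or fractions, ignoring undetermined reads."""
    total = forward + reverse + unstranded
    if total <= 0:
        return "undetermined"
    fwd, rev = forward / total, reverse / total
    if fwd >= stranded_threshold:
        return "forward"
    if rev >= stranded_threshold:
        return "reverse"
    if abs(fwd - rev) <= unstranded_threshold:
        return "unstranded"
    return "undetermined"  # neither convincingly stranded nor unstranded


def salmon_strand_counts(lib_format_counts):
    """Sum per-library-type counts; missing keys count as zero."""
    def total(keys):
        return sum(lib_format_counts.get(key, 0) for key in keys)
    return total(FORWARD_KEYS), total(REVERSE_KEYS), total(UNSTRANDED_KEYS)


def strand_check_passes(expected, rseqc_call):
    """Fail on any mismatch, or when the RSeQC-based call is undetermined."""
    return rseqc_call != "undetermined" and expected == rseqc_call


# Hypothetical Salmon counts, plus the second RSeQC example from this PR.
salmon = json.loads('{"ISF": 100, "ISR": 900, "expected_format": "IU"}')
salmon_call = classify_strandedness(*salmon_strand_counts(salmon))
rseqc_call = classify_strandedness(0.2654, 0.0604)
print(salmon_call, rseqc_call, strand_check_passes(salmon_call, rseqc_call))
```

Because the same `classify_strandedness` function handles both tools, a disagreement reflects the data rather than two different decision rules. Under these assumptions the high-undetermined RSeQC example is called forward, since 0.2654 / (0.2654 + 0.0604) is about 0.81.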

I think that by doing this we serve the nuances alluded to by @tdanhorn without having to engineer a variety of error messages.

The result currently looks like this:

PR checklist

github-actions · 2024-05-29T12:51:27Z

`nf-core lint` overall result: Passed ✅ ⚠️

Posted for pipeline commit 4dd03ca

+| ✅ 173 tests passed       |+
#| ❔   9 tests were ignored |#
!| ❗   7 tests had warnings |!

❗ Test warnings:

files_exist - File not found: assets/multiqc_config.yml
files_exist - File not found: .github/workflows/awstest.yml
files_exist - File not found: .github/workflows/awsfulltest.yml
pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

files_exist - File is ignored: conf/modules.config
nextflow_config - Config default ignored: params.ribo_database_manifest
files_unchanged - File ignored due to lint config: assets/email_template.html
files_unchanged - File ignored due to lint config: assets/email_template.txt
files_unchanged - File ignored due to lint config: .gitignore or .prettierignore
actions_ci - actions_ci
actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
multiqc_config - multiqc_config
modules_config - modules_config

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-rnaseq_logo_light.png
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/images/nf-core-rnaseq_logo_light.png
files_exist - File found: docs/images/nf-core-rnaseq_logo_dark.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: main.nf
files_exist - File found: conf/base.config
files_exist - File found: conf/igenomes.config
files_exist - File found: modules.json
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: docs/images/nf-core-rnaseq_logo.png
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/NfcoreTemplate.groovy
files_exist - File not found check: lib/Utils.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: lib/WorkflowMain.groovy
files_exist - File not found check: lib/WorkflowRnaseq.groovy
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: Singularity
files_exist - File not found check: lib/nfcore_external_java_deps.jar
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: params.validationShowHiddenParams
nextflow_config - Config variable found: params.validationSchemaIgnoreParams
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: 3.15.0dev
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
nextflow_config - Config default value correct: params.hisat2_build_memory= 200.GB
nextflow_config - Config default value correct: params.gtf_extra_attributes= gene_name
nextflow_config - Config default value correct: params.gtf_group_features= gene_id
nextflow_config - Config default value correct: params.featurecounts_group_type= gene_biotype
nextflow_config - Config default value correct: params.featurecounts_feature_type= exon
nextflow_config - Config default value correct: params.igenomes_base= s3://ngi-igenomes/igenomes/
nextflow_config - Config default value correct: params.trimmer= trimgalore
nextflow_config - Config default value correct: params.min_trimmed_reads= 10000
nextflow_config - Config default value correct: params.umitools_extract_method= string
nextflow_config - Config default value correct: params.umitools_grouping_method= directional
nextflow_config - Config default value correct: params.aligner= star_salmon
nextflow_config - Config default value correct: params.pseudo_aligner_kmer_size= 31
nextflow_config - Config default value correct: params.min_mapped_reads= 5.0
nextflow_config - Config default value correct: params.kallisto_quant_fraglen= 200
nextflow_config - Config default value correct: params.kallisto_quant_fraglen_sd= 200
nextflow_config - Config default value correct: params.stranded_threshold= 0.8
nextflow_config - Config default value correct: params.unstranded_threshold= 0.1
nextflow_config - Config default value correct: params.deseq2_vst= true
nextflow_config - Config default value correct: params.rseqc_modules= bam_stat,inner_distance,infer_experiment,junction_annotation,junction_saturation,read_distribution,read_duplication
nextflow_config - Config default value correct: params.skip_bbsplit= true
nextflow_config - Config default value correct: params.skip_preseq= true
nextflow_config - Config default value correct: params.custom_config_version= master
nextflow_config - Config default value correct: params.custom_config_base= https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Config default value correct: params.max_cpus= 16
nextflow_config - Config default value correct: params.max_memory= 128.GB
nextflow_config - Config default value correct: params.max_time= 240.h
nextflow_config - Config default value correct: params.publish_dir_mode= copy
nextflow_config - Config default value correct: params.max_multiqc_email_size= 25.MB
nextflow_config - Config default value correct: params.validate_params= true
nextflow_config - Config default value correct: params.pipelines_testdata_base_path= https://raw.githubusercontent.com/nf-core/test-datasets/7f1614baeb0ddf66e60be78c3d9fa55440465ac8/
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - .github/workflows/branch.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - .github/workflows/linting.yml matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - assets/nf-core-rnaseq_logo_light.png matches the template
files_unchanged - docs/images/nf-core-rnaseq_logo_light.png matches the template
files_unchanged - docs/images/nf-core-rnaseq_logo_dark.png matches the template
files_unchanged - docs/README.md matches the template
readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
readme - README Zenodo placeholder was replaced with DOI.
pipeline_name_conventions - Name adheres to nf-core convention
template_strings - Did not find any Jinja template strings (538 files)
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: cloud_tests_full.yml
actions_schema_validation - Workflow validation passed: release-announcements.yml
actions_schema_validation - Workflow validation passed: download_pipeline.yml
actions_schema_validation - Workflow validation passed: cloud_tests_small.yml
actions_schema_validation - Workflow validation passed: ci.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
base_config - conf/base.config found and not ignored.
nfcore_yml - Repository type in .nf-core.yml is valid: pipeline
nfcore_yml - nf-core version in .nf-core.yml is set to the latest version: 2.14.1

Run details

nf-core/tools version 2.14.1
Run at 2024-06-19 16:14:12

tdanhorn · 2024-06-04T17:40:53Z

I agree that the ratio between "forward" and "reverse" is more meaningful than some fixed percentage of either. Ideally I would like to see a different message, though, because there are basically three scenarios:

The strandedness is consistent with what was either in the sample sheet or inferred by Salmon (auto), and the "wrong strand fraction" is reasonably low (in a perfect world <2-3%, but we can tolerate 5-10%). In this case, no warning is needed.
The strandedness is not what was declared, but is obvious (either a 50:50 [±5%] split -> unstranded, or >90% on the "other" strand -> stranded [forward or reverse]). This is typically a mixup in the sample sheet, which should result in a warning to that effect, e.g. "The strandedness in the sample sheet may be wrong!"
The strand distribution is in the "no man's land" between stranded and unstranded, possibly with a high rate of "undetermined", but more importantly, the ratio of forward/reverse is not close to 50:50, and the "wrong strand fraction" is >10%. This hints at serious QC issues, e.g. DNA contamination, and should result in a warning that reflects that, e.g. "The sample is neither clearly stranded nor unstranded, which indicates QC problems, including possible DNA contamination."

It would be nice if we could differentiate between those cases.
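A rough Python sketch of how those three cases might be told apart is below. The function names, and the exact cut-offs (within 5 points of 50:50 for unstranded, at least 90% on one strand for stranded), are assumptions read off the numbers in this comment; this is not code from the pipeline.

```python
SAMPLE_SHEET_WARNING = "The strandedness in the sample sheet may be wrong!"
QC_WARNING = (
    "The sample is neither clearly stranded nor unstranded, which indicates "
    "QC problems, including possible DNA contamination."
)


def observed_strandedness(forward, reverse):
    """Call strandedness from forward/reverse fractions, ignoring undetermined reads."""
    total = forward + reverse
    if total <= 0:
        return "ambiguous"
    forward_share = forward / total
    if abs(forward_share - 0.5) <= 0.05:
        return "unstranded"
    if forward_share >= 0.9:
        return "forward"
    if forward_share <= 0.1:
        return "reverse"
    return "ambiguous"  # the "no man's land" between stranded and unstranded


def strand_warning(declared, forward, reverse):
    """Return None (scenario 1), a sample sheet warning (2), or a QC warning (3)."""
    observed = observed_strandedness(forward, reverse)
    if observed == "ambiguous":
        return QC_WARNING
    if observed != declared:
        return SAMPLE_SHEET_WARNING
    return None


print(strand_warning("reverse", 0.02, 0.95))   # consistent: no warning
print(strand_warning("forward", 0.02, 0.95))   # clear, but not what was declared
print(strand_warning("forward", 0.70, 0.30))   # neither stranded nor unstranded
```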

maxulysse

minor typos, once fixed, ok to me

maxulysse · 2024-06-17T17:40:37Z

@nf-core-bot fix linting pretty please 🙏

Check RSeQC strandedness without reference to undetermined

0cfb662

pinin4fjords linked an issue May 29, 2024 that may be closed by this pull request

Salmon strand inference is often wrong #1185

Closed

drpatelh reviewed May 29, 2024

subworkflows/local/utils_nfcore_rnaseq_pipeline/main.nf Outdated

Update subworkflows/local/utils_nfcore_rnaseq_pipeline/main.nf

5ce89ae

Co-authored-by: Harshil Patel <drpatelh@users.noreply.github.com>

pinin4fjords added this to the 3.15.0 milestone May 30, 2024

pinin4fjords added 5 commits June 12, 2024 12:10

fix strand message

26a329b

Merge branch 'dev' into improve_rseqc_strandedness

7a20206

Update Salmon

4c4afd0

Add strandedness detection threshold parameter

2a5414c

Add a consistent library type check between Salmon and RSeQC. Make th…

8ccffc9

…e strand check results standard

pinin4fjords changed the title ~~Check RSeQC strandedness without reference to undetermined~~ Overhaul strandedness detection / comparison Jun 14, 2024

Amend for undetermined

c6b91a9

maxulysse reviewed Jun 14, 2024

subworkflows/local/utils_nfcore_rnaseq_pipeline/main.nf Outdated

maxulysse reviewed Jun 14, 2024

subworkflows/local/utils_nfcore_rnaseq_pipeline/main.nf Outdated

maxulysse reviewed Jun 14, 2024

workflows/rnaseq/main.nf Outdated

pinin4fjords added 7 commits June 14, 2024 13:49

Constraint strand detection threshold

ab0c62b

Consistent defaults

38bdb9e

Merge branch 'dev' into improve_rseqc_strandedness

cb54dc2

Fix up after merge

9aadd4a

Fix for linting

82724b8

update CHANGELOG

c4d3535

Fix the fix

9dd6d6d

maxulysse reviewed Jun 17, 2024

workflows/rnaseq/assets/multiqc/multiqc_config.yml Outdated

maxulysse reviewed Jun 17, 2024

workflows/rnaseq/assets/multiqc/multiqc_config.yml Outdated

maxulysse approved these changes Jun 17, 2024

pinin4fjords and others added 2 commits June 17, 2024 18:41

fix typos

451cb64

[automated] Fix linting with Prettier

3df6ec8

pinin4fjords added 12 commits June 17, 2024 18:58

Fix conditionality

1977f6d

Clarify stranded/ unstraded at library level, include other library t…

bc6189f

…ypes in strand totals

Allow for missing keys in salmon library counts for testing

62ed423

Fix function tests

b4760ce

Auto updating snapshot broke things for unaffected tests. So manually…

2ded8c0

… insert snapshots from the only tests I do want to update

Merge branch 'dev' into improve_rseqc_strandedness

3733031

Merge branch 'dev' into improve_rseqc_strandedness

231fba6

Merge branch 'dev' into improve_rseqc_strandedness

29265f0

Return percentages for strandedness

ac9abda

Explicitly publish lib_format_counts

3ad69a7

update docs

4ce478d

Fix typo

3583d2c

maxulysse reviewed Jun 18, 2024

docs/output.md Outdated

Prettier

eef388e

maxulysse reviewed Jun 18, 2024

docs/images/mqc_strand_check.png Outdated

maxulysse approved these changes Jun 18, 2024

pinin4fjords and others added 10 commits June 18, 2024 17:10

Fix typo

c95c489

Co-authored-by: Maxime U Garcia <max.u.garcia@gmail.com>

Set maximum on bars per column

035e85f

Pass/fail in separate mqc column

cdbc971

Merge branch 'dev' into improve_rseqc_strandedness

0b16071

Merge branch 'improve_rseqc_strandedness' of github.com:nf-core/rnase…

6eb5b06

…q into improve_rseqc_strandedness

Fix multiqc config for cell colors to highlight strandedness

dc22231

Fix strand status conditional

bd48aa7

update strand check image

36c6658

Fix multiqc yaml

e45e521

fix up params + usage

4dd03ca

pinin4fjords merged commit bd64ab6 into dev Jun 19, 2024
37 checks passed

pinin4fjords deleted the improve_rseqc_strandedness branch June 19, 2024 16:30

This was referenced Jun 19, 2024

Salmon strand inference is often wrong #1185

Closed

Show Salmon-based strand inference in MultiQC report #1318

Closed
