Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul strandedness detection / comparison #1306

Merged
merged 43 commits into from
Jun 19, 2024

Conversation

pinin4fjords
Copy link
Member

@pinin4fjords pinin4fjords commented May 29, 2024

I'm not 100% convinced by the way the RSeQC results are used to check strandedness, and it leads to confusion.

To illustrate, unstranded data in RSeQC looks like:

This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925

... not:

This is PairEnd Data
Fraction of reads failed to determine: 0.6742
Fraction of reads explained by "1++,1--,2+-,2-+": 0.2654
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0604

The latter case is the consequence of reads aligning to regions where strand cannot be determined. This would include:

  1. regions with genes annotated on both strands and
  2. I think, not 100% sure), reads aligning outside of annotations completely.

(2) might be quite common with either genomic DNA contamination or intronic reads.

As is, the supplied strandedness must be correct for 70% of reads for the check to pass. The problem with this is that where there is a high level of undetermined reads, this check can fail easily, even where the Salmon-based check generated the strandedness automatically, which is confusing.

The more important statistic in determining strandedness is whether the two 'Fraction of reads explained by' lines are similar, or not. The undetermined section might make you worry about why, but it shouldn't concern you if you're just checking the strand bias.

That's what I'm proposing here, that one of the 'Fraction of reads explained by' lines should be at least e.g. 5X the magnitude of the other. This would be consistent with the Salmon check and reduce confusion.

Hoping for some discussion / agreement!

Phase 2: the bigger picture

I thought harder about this, and realised that a lot of issues stemmed from comparing Salmon's internal strand inference (over which we have little control) and the bespoke way we were inferring strandedness from RSeQC results.

My proposal is now as follows:

  • Don't use Salmon's decision making on strandedness. Use its lib_format_counts.json output to derive our own strandedess from its numbers.
  • Base strandedness on the proportion of stranded reads (as above)
  • Use the same logic to infer strandedness from Salmon and from RSeQC. Mark as 'undetermined' anything not convincingly stranded OR unstranded. Parameterise the threshold used.
  • Display the strandedness inferences as standard, rather than just in the event of an error (I think they're useful).
  • Mark as an error anything where strandedness supplied or predicted via Salmon does not match with strandedness inferred in the same way from RSeQC results, or where strandedness is 'undetermined'.

I think that by doing this we serve the nuances alluded to by @tdanhorn without having to engineer a variety of error messages.

The result currently looks like this:

Screenshot 2024-06-17 at 18 38 56

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@pinin4fjords pinin4fjords linked an issue May 29, 2024 that may be closed by this pull request
Copy link

github-actions bot commented May 29, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 4dd03ca

+| ✅ 173 tests passed       |+
#| ❔   9 tests were ignored |#
!| ❗   7 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.14.1
  • Run at 2024-06-19 16:14:12

Co-authored-by: Harshil Patel <drpatelh@users.noreply.github.com>
@pinin4fjords pinin4fjords added this to the 3.15.0 milestone May 30, 2024
@tdanhorn
Copy link

tdanhorn commented Jun 4, 2024

I agree that the ratio between "forward" and "reverse" is more meaningful that some fixed percentage of either. Ideally I would like to see a different message, though, because there are basically three scenarios:

  • The strandedness is consistent with what was either in the sample sheet or inferred by Salmon (auto), and the "wrong strand fraction" is reasonably low (in a perfect world <2-3%, but we can tolerate 5-10%). In this case, no warning is needed.
  • The strandedness is not what was declared, but is obvious (either a 50:50 [±5%] split -> unstranded, or >90% on the "other" strand -> stranded [forward or reverse]). This is typically a mixup in the sample sheet, which should result in a warning to that effect, e.g. "The strandedness in the sample sheet may be wrong!"
  • The strand distribution is in the "no man's land" between stranded and unstranded, possibly a rate of "undetermined", but more importantly, the ratio of forward/reverse is not close to 50:50, and the "wrong strand fraction" is >10%. This hints at serious QC issues, e.g. DNA contamination, and should result in a warning that reflects that, e.g. "The samples is neither clearly stranded nor unstranded, which indicates QC problems, including possible DNA contamination."

It would be nice if we could differentiate between those cases.

@pinin4fjords pinin4fjords changed the title Check RSeQC strandedness without reference to undetermined Overhaul strandedness detection / comparison Jun 14, 2024
workflows/rnaseq/main.nf Outdated Show resolved Hide resolved
Copy link
Member

@maxulysse maxulysse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor typos, once fixed, ok to me

@maxulysse
Copy link
Member

@nf-core-bot fix linting pretty please 🙏

docs/output.md Outdated Show resolved Hide resolved
@pinin4fjords pinin4fjords merged commit bd64ab6 into dev Jun 19, 2024
37 checks passed
@pinin4fjords pinin4fjords deleted the improve_rseqc_strandedness branch June 19, 2024 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Salmon strand inference is often wrong
5 participants