
Salmon strand inference is often wrong #1185

Closed
zeehio opened this issue Jan 9, 2024 · 4 comments · Fixed by #1306

Comments

@zeehio

zeehio commented Jan 9, 2024

Description of feature

Hi,

Thanks for all the work on this pipeline.

I have had to analyse several public datasets, and I plan to analyse more. Since the strandedness on such datasets is usually not provided, I use the "strandedness: auto" option in the pipeline to guess it.

Quite often (apologies for not having statistics on that; I could try to gather some if needed) I get "WARNING: Fail Strand Check" messages, and I find that Salmon has set the strandedness to "reverse" while infer_experiment.py finds it to be "unstranded".

When this happens, I set the strandedness to "unstranded" and rerun the pipeline.
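
For anyone else doing this by hand, here is a minimal sketch of that manual workaround, assuming the usual nf-core/rnaseq samplesheet layout with a `strandedness` column; the sample IDs and file names below are hypothetical:

```python
import csv

# Samples flagged by the "Fail Strand Check" warning (hypothetical IDs).
failed_samples = {"SAMPLE_01", "SAMPLE_07"}

with open("samplesheet.csv", newline="") as fin:
    rows = list(csv.DictReader(fin))

# Override the auto-detected strandedness for the flagged samples only.
for row in rows:
    if row["sample"] in failed_samples:
        row["strandedness"] = "unstranded"

with open("samplesheet.fixed.csv", "w", newline="") as fout:
    writer = csv.DictWriter(fout, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# Then rerun the pipeline with the corrected samplesheet, e.g.:
#   nextflow run nf-core/rnaseq --input samplesheet.fixed.csv ... -resume
```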

Would it make sense for the pipeline to just reset the strandedness and rerun automatically?

Thanks again

@mvheetve

Hi there,

I'm seeing behaviour that could be similar to your issue in data I received from several collaborators. The logs show, for example:
Please check MultiQC report: 17/47 samples failed strandedness check

When I look in detail at the infer_experiment files, I get:

This is PairEnd Data
Fraction of reads failed to determine: 0.2379
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0050
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7571

This is PairEnd Data
Fraction of reads failed to determine: 0.2594
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0084
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7323

This is PairEnd Data
Fraction of reads failed to determine: 0.2518
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0050
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7432

This is PairEnd Data
Fraction of reads failed to determine: 0.2775
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0047
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7178

This is PairEnd Data
Fraction of reads failed to determine: 0.2672
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0040
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7288

Results like these are not covered by the examples in the rseqc docs on infer_experiment.py. I was wondering:

  1. Are you seeing similar results, or are your data clear-cut unstranded, like in example 1 of the rseqc docs on infer_experiment.py?
  2. How would anyone reading this proceed with, for example, StringTie2? My idea would be to treat this as fr-firststrand, but maybe this is incorrect because of the high fraction of reads with undetermined strandedness (see the sketch after this list)...
  3. Does anyone have an idea where the 25% of reads with undetermined strandedness could come from and if this can be prevented in the future?
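
Not a definitive answer to question 2, but a minimal sketch of how those fractions can be read once the undetermined reads are set aside; the numbers are taken from the first sample above, and the normalisation step is an assumption here rather than anything the pipeline does:

```python
# Fractions from the first infer_experiment.py output above.
failed, sense, antisense = 0.2379, 0.0050, 0.7571

# Look only at reads whose strand could be determined.
determined = sense + antisense
print(f"antisense among determined reads: {antisense / determined:.1%}")  # ~99.3%

# With essentially all determinable reads antisense ("1+-,1-+,2++,2--"),
# this looks like a dUTP / reverse-stranded library, which for StringTie
# corresponds to fr-firststrand (the --rf flag); forward-stranded data
# would instead be fr-secondstrand (--fr).
```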

Regards
Mattias

@mvheetve mvheetve reopened this Jan 30, 2024
@drpatelh drpatelh added this to the 3.15.0 milestone May 13, 2024
@me-orlov

Hi, thank you for posting these questions! I would love to hear from either of you if you have any further thoughts on the issue. I too am getting similar warnings across the board. In all cases, Salmon marks the samples as "reverse", while infer_experiment.py reports something like 28% reverse and 75% undetermined. Do you think I should be concerned, or should I proceed trusting the Salmon identification? Thank you!

@pinin4fjords
Member

This comes up because the auto strand setting comes from Salmon, based on its pseudo-alignment against transcript sequences, while the final strandedness check is based on genomic alignments and RSeQC's assessment.

The main source of the discrepancy is the reads of undetermined strand in RSeQC, which play a part in the assessment the pipeline makes based on those statistics, and (possibly) shouldn't. I've opened the above PR to discuss and/or address this.
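
To make that point about the undetermined reads concrete, here is a sketch only; the 0.8 cutoff and function names are illustrative, not the pipeline's actual implementation:

```python
def call_from_raw_fractions(sense, antisense, cutoff=0.8):
    # Compare the raw fractions against the cutoff. With ~25% of reads
    # undetermined, neither fraction can reach the cutoff, so the sample
    # gets labelled "unstranded" even though the determinable reads are
    # overwhelmingly antisense.
    if sense >= cutoff:
        return "forward"
    if antisense >= cutoff:
        return "reverse"
    return "unstranded"


def call_excluding_undetermined(sense, antisense, cutoff=0.8):
    # Normalise away the undetermined reads before applying the cutoff.
    determined = sense + antisense
    if sense / determined >= cutoff:
        return "forward"
    if antisense / determined >= cutoff:
        return "reverse"
    return "unstranded"


# First sample from the thread: 0.0050 sense, 0.7571 antisense.
print(call_from_raw_fractions(0.0050, 0.7571))       # unstranded -> conflicts with Salmon's "reverse"
print(call_excluding_undetermined(0.0050, 0.7571))   # reverse -> agrees with Salmon
```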

@pinin4fjords
Member

Tackled in #1306
