Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rdata object "deseq2.dds.RData" and gene count matrix "salmon.merged.gene_counts.tsv" not consistent between 2 runs on same data #585

Closed
Solyris83 opened this issue Mar 23, 2021 · 10 comments
Labels
bug Something isn't working
Milestone

Comments

@Solyris83
Copy link

I have successfully ran the nfcore/rnaseq v3.0 with docker profile on a dataset with 4 samples twice to check the reproducibility of the result.

I am checking the

  1. RData object produced in result/star_salmon/deseq2_qc/deseq2.dds.RData
  2. Gene Count in result/star_salmon/salmon.merged.gene_counts.tsv

the comparison gene by gene versus run1 vs run2 produces 81 genes which are not pseudo-autosomal gene (PAR keyword in ensembl ID) and has more than 1 read count difference between the 2 runs. Some of these has over a hundred read count differences between the 2 run.

May I know if that is expected and how can I "seed" this to make the result in this run reproducible when I re-run this?

Regards
Zhihui

@Solyris83 Solyris83 added the bug Something isn't working label Mar 23, 2021
@drpatelh
Copy link
Member

Any idea what is going on here @lpantano @j-andrews7? Going to try and get a release together soonish so be great if we can mop up, fix and test all of these Salmon issues.

@j-andrews7
Copy link

j-andrews7 commented Apr 11, 2021

My guess is this is related to #553. I'm not sure which counts table is eventually used for DESeq2, but I think it's the lengthScaled one (which on the main branch is incorrect anyway). I don't have the know-how to actually figure out what the counts being passed here are:

I still consider that a major issue, there should be a disclaimer indicating that there are issues with the merged tables and that the DESeq2 object may not be totally kosher. @Solyris83, can you confirm you used the exact same commands/inputs on the two different runs?

@Solyris83
Copy link
Author

Solyris83 commented Apr 12, 2021

rnaseq_2runs.zip
Hi @j-andrews7 , yes, the run command can be found in the execution_report and the output (salmon.merged.gene_counts.tsv) which I am looking at can be found attached herein.

The 2nd run I am using the STAR index created in the first run.

@j-andrews7
Copy link

j-andrews7 commented Apr 12, 2021

PARs are where X and Y share homologous sequence, so the PAR genes are going to be very similar if not identical to a non-PAR gene (e.g. ENSG00000002586.20 and ENSG00000002586.20_PAR_Y). In cases where salmon can't differentiate between two genes for a given read, there is some amount of randomness in terms of count assignment.

@rob-p may be able to give us a little insight into that process. I'd be inclined to just go remove all the PAR transcripts from my annotation file.

@lpantano
Copy link
Contributor

Hi, thanks for the detailed report. Can you check the salmon output files ( for those genes that you see with a lot of difference) are the same or not across runs? I am trying to figure out if the issue is the processing of the salmon files or upstream. As well, 3.0 has a bug (as mentioned), so the merged files are not totally correct anyway, but it would be good to determine when the difference happens exactly. Thanks!

@j-andrews7
Copy link

I believe this is expected behavior due to the EM used by salmon to deal with multimapping to the transcriptome. In this case, my guess is that these sequences are very slightly different so that they don't get collapsed during the salmon indexing but the vast majority of mappings to each are impossible to distinguish, hence the differences in each run from the model. See relevant bit from the salmon FAQ:

Q: The results in salmon’s quant.sf have fewer transcripts than the transcriptome I indexed. Why is this?
A When salmon indexes transcripts it collapses exact duplicates (i.e. transcripts with perfectly identical sequence). For each group of sequence-identical transcripts, it will select only one representative to quantify. Such sequence identical transcripts often arise from transcripts annotated on ALT contigs which nonetheless do not contain any variants with respect to the reference transcripts. While salmon does collapse these exact duplicates, it records all of the decisions it makes.

In the salmon index directory, there is a file called “duplicate_clusters.tsv”. This is a 2-column tsv file where the first column lists a retained transcript and the right column lists a discarded duplicate. A transcript can represent multiple discarded sequences, in which case it appears multiple times in the first column. This will allow you to see the sequence duplicate clusters. If quantified, all transcripts in these clusters would obtain identical quantification estimates.

I'm no expert, but I don't believe EM is deterministic, so this is going to happen.

@Solyris83
Copy link
Author

Hi @j-andrews7 , regarding the PAR genes, I initially thought those were the culprits causing the problem and tried to remove those with the PAR keyword along with their non-PAR keyword genes as well. But there is still a small population of genes (i remember counting 80 genes ) with this problem and that is when I started this thread.

So with this known salmon behavior, is it possible to allow featureCount to just count on the STAR output and GTF coordinates instead of salmon estimating gene count?

@lpantano sorry I am not exactly sure what would be a good check. Maybe can you name a few files which you think are good candidates?

@j-andrews7
Copy link

j-andrews7 commented Apr 13, 2021

featureCount has its own issues and just tosses multi-mappers. So rather than those genes having slightly variable counts, they'll be dramatically undercounted. The other genes you see exhibiting this behavior are likely pseudogenes.

As far as I know, there is no way to "seed" runs such that they are exactly reproducible, but those run to run variations should be quite small.

@j-andrews7
Copy link

j-andrews7 commented Apr 13, 2021

These is a nice discussion of this here as well: COMBINE-lab/salmon#613

Rob Patro confirms that my above comment is correct, there is no way to ensure completely identical runs.

@drpatelh drpatelh added this to the 3.1 milestone Apr 14, 2021
@drpatelh
Copy link
Member

Description added in #598

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants