salmon output must not be read in individually when countsFromAbundance = "lengthScaledTPM" #553

j-andrews7 · 2021-01-20T23:26:53Z

This is related to #499. In short, it appears that somewhere along the way, the length scaled gene counts are somehow lost and the output for the salmon.merged.gene_counts_length_scaled.rds file is practically identical to the salmon.merged.gene_counts.rds (or tsv) file with the exception of minor rounding differences. These counts are then used in the eventual DESeq2 dds output as well.

This can be seen with the full test output from the docs:

> head(assay(salmon.merged.gene_counts, "counts"))
                GM12878_R1 GM12878_R2    H1_R1    H1_R2  K562_R1  K562_R2   MCF7_R1   MCF7_R2
ENSG00000000003      8.000      0.000 8591.780 4593.953    3.000    2.000  6736.203  4243.113
ENSG00000000005      0.000      0.000   71.000   35.000    0.000    0.000     3.000     0.000
ENSG00000000419   4301.001   3755.001 2015.892  867.519 6695.001 9274.999 11453.747 11043.871
ENSG00000000457   2083.850   1690.065  680.152  469.025 1040.569 1531.306  2280.989  2335.835
ENSG00000000460   2016.010   1746.025 3981.889 1951.693 3122.168 3222.999  4750.221  3933.307
ENSG00000000938  11030.999  12004.000  146.000  111.000    1.000    2.000     0.000     4.000
> head(assay(salmon.merged.gene_counts_length_scaled, "counts"))
                  GM12878_R1 GM12878_R2      H1_R1      H1_R2      K562_R1     K562_R2     MCF7_R1      MCF7_R2
ENSG00000000003     7.999946      0.000 8591.77937 4593.95369    2.9999472    1.999947  6736.20254  4243.111634
ENSG00000000005     0.000000      0.000   71.00003   34.99998    0.0000000    0.000000     2.99996     0.000000
ENSG00000000419  4300.998986   3755.000 2015.89212  867.51786 6695.0011421 9275.001886 11453.74482 11043.870057
ENSG00000000457  2083.850215   1690.064  680.15315  469.02601 1040.5683967 1531.305074  2280.98881  2335.834200
ENSG00000000460  2016.010505   1746.025 3981.89072 1951.69264 3122.1690993 3222.999149  4750.22201  3933.306967
ENSG00000000938 11030.999426  12004.000  146.00007  110.99995    0.9999902    2.000012     0.00000     3.999931
> load("./deseq2.dds.RData")
> head(counts(dds))
                GM12878_R1 GM12878_R2 H1_R1 H1_R2 K562_R1 K562_R2 MCF7_R1 MCF7_R2
ENSG00000000003          8          0  8592  4594       3       2    6736    4243
ENSG00000000005          0          0    71    35       0       0       3       0
ENSG00000000419       4301       3755  2016   868    6695    9275   11454   11044
ENSG00000000457       2084       1690   680   469    1041    1531    2281    2336
ENSG00000000460       2016       1746  3982  1952    3122    3223    4750    3933
ENSG00000000938      11031      12004   146   111       1       2       0       4

This bug is also present at the sample-specific level, so it's not occurring when all the samples are merged:

The same issue is present in the salmon.merged.gene_tpm and salmon.merged.gene_tpm_length_scaled files:

Expected behaviour

The raw salmon output is fine, and importing via tximport(files, countsFromAbundance = "lengthScaledTPM", tx2gene = tx2gene) results in slightly different output as expected. The tx2gene file isn't available on the website for the test results, but I have confirmed this with my own data:

> head(assay(salmon.merged.gene_counts, "counts"))
                    K27M_R1  K27M_R2  K27M_R3  K27M_R4  K27M_R5 KD_R1    KD_R2   KD_R3    KD_R4
ENSG00000000003.10  783.001  906.860  898.568  711.133  657.000   317  648.484 538.000 1008.001
ENSG00000000005.5     0.000    0.000    0.000    1.000    0.000     0    0.000   2.000    0.000
ENSG00000000419.8  1308.670 1190.424 1284.727 1246.347 1055.150   579 1443.286 822.784 1640.176
ENSG00000000457.9   594.000  588.120  703.999  408.008  414.047   239  878.540 378.830 1003.270
ENSG00000000460.12  748.141  638.450  634.001  604.869  615.159   220  496.491 233.461  934.138
ENSG00000000938.8     1.000    3.000    0.000    0.000    0.000     0    1.000   2.000    2.000
> head(assay(salmon.merged.gene_counts_length_scaled, "counts"))
                        K27M_R1     K27M_R2   K27M_R3      K27M_R4   K27M_R5    KD_R1       KD_R2      KD_R3       KD_R4
ENSG00000000003.10  782.9998502  906.859827  898.5681  711.1319753  657.0003 317.0001  648.484301 537.999938 1008.001419
ENSG00000000005.5     0.0000000    0.000000    0.0000    0.9999984    0.0000   0.0000    0.000000   1.999978    0.000000
ENSG00000000419.8  1308.6701712 1190.422889 1284.7275 1246.3459179 1055.1498 579.0002 1443.285853 822.782891 1640.176996
ENSG00000000457.9   593.9994689  588.119107  703.9999  408.0082260  414.0473 239.0004  878.540269 378.829605 1003.270107
ENSG00000000460.12  748.1405163  638.450246  634.0000  604.8687300  615.1595 220.0001  496.491738 233.462204  934.138402
ENSG00000000938.8     0.9999992    3.000011    0.0000    0.0000000    0.0000   0.0000    1.000015   2.000029    2.000001
> head(counts(dds))
                   K27M_R1 K27M_R2 K27M_R3 K27M_R4 K27M_R5 KD_R1 KD_R2 KD_R3 KD_R4
ENSG00000000003.10     783     907     899     711     657   317   648   538  1008
ENSG00000000005.5        0       0       0       1       0     0     0     2     0
ENSG00000000419.8     1309    1190    1285    1246    1055   579  1443   823  1640
ENSG00000000457.9      594     588     704     408     414   239   879   379  1003
ENSG00000000460.12     748     638     634     605     615   220   496   233   934
ENSG00000000938.8        1       3       0       0       0     0     1     2     2

txi <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
> head(txi$counts)
                       K27M_R1     K27M_R2   K27M_R3     K27M_R4   K27M_R5    KD_R1       KD_R2      KD_R3      KD_R4
ENSG00000000003.10  775.014127  933.264040  857.8735  715.940320  677.1829 308.8780  663.383646 538.527775  980.54925
ENSG00000000005.5     0.000000    0.000000    0.0000    1.838131    0.0000   0.0000    0.000000   1.185171    0.00000
ENSG00000000419.8  1295.680498 1187.477353 1260.9515 1246.370597 1067.9605 570.1271 1458.904672 828.630685 1631.09717
ENSG00000000457.9   624.634291  583.297353  692.4547  384.683372  411.1735 238.4277  853.707047 372.682467 1079.93698
ENSG00000000460.12  771.648841  608.617888  629.9666  614.663943  652.6226 202.9262  515.992348 223.970674  950.17393
ENSG00000000938.8     1.252449    1.373209    0.0000    0.000000    0.0000   0.0000    1.277896   2.973765    2.52064

System

Hardware: HPC, Cloud
Executor: lsf, awsbatch

Nextflow Installation

nextflow-20.12.0-edge-all

Container engine

Engine: Singularity
Image tag: nfcore/rnaseq:3.0

The text was updated successfully, but these errors were encountered:

j-andrews7 · 2021-01-22T16:03:46Z

Reading is good:

From the tximport docs:

countsFromAbundance
character, either "no" (default), "scaledTPM", or "lengthScaledTPM", for whether to generate estimated counts using abundance estimates scaled up to library size (scaledTPM) or additionally scaled using the average transcript length over samples and the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM, then the counts are no longer correlated with average transcript length, and so the length offset matrix should not be used.

So the lengthScaledTPM values depend on the other samples and should not be read in individually as is currently being done.

@j-andrews7

see #553. Thanks @j-andrews7!

drpatelh · 2021-01-25T19:45:16Z

Fixed in #555 Thanks @lpantano !

j-andrews7 added the bug Something isn't working label Jan 20, 2021

lpantano pushed a commit that referenced this issue Jan 22, 2021

fix counts files. tximport needs all samples at once.

9927b15

see #553. Thanks @j-andrews7!

lpantano mentioned this issue Jan 22, 2021

Fix merged files having wrong values #555

Merged

6 tasks

j-andrews7 changed the title ~~salmon.merged.gene_counts and salmon.merged.gene_counts_length_scaled output practically identical~~ salmon output must not be read in individually when countsFromAbundance = "lengthScaledTPM" Jan 24, 2021

drpatelh closed this as completed Jan 25, 2021

grst mentioned this issue Jan 26, 2021

Read in files in parallel thelovelab/tximport#46

Closed

j-andrews7 mentioned this issue Apr 11, 2021

Rdata object "deseq2.dds.RData" and gene count matrix "salmon.merged.gene_counts.tsv" not consistent between 2 runs on same data #585

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

salmon output must not be read in individually when countsFromAbundance = "lengthScaledTPM" #553

salmon output must not be read in individually when countsFromAbundance = "lengthScaledTPM" #553

j-andrews7 commented Jan 20, 2021

j-andrews7 commented Jan 22, 2021

drpatelh commented Jan 25, 2021

salmon output must not be read in individually when countsFromAbundance = "lengthScaledTPM" #553

salmon output must not be read in individually when countsFromAbundance = "lengthScaledTPM" #553

Comments

j-andrews7 commented Jan 20, 2021

Expected behaviour

System

Nextflow Installation

Container engine

j-andrews7 commented Jan 22, 2021

drpatelh commented Jan 25, 2021