all salmon.merged.gene_tpm*.tsv files contain the same exact values #847

Closed
jacorvar opened this issue Jul 14, 2022 · 4 comments
Labels: bug (Something isn't working)

Comments

jacorvar commented Jul 14, 2022

Description of the bug

The salmon_tximport.r script produces three files with gene-level TPM values:

  • salmon.merged.gene_tpm.tsv: stores the raw abundances produced by salmon at transcript level and summarized at gene level.
  • salmon.merged.gene_tpm_scaled.tsv: stores the same abundances as in salmon.merged.gene_tpm.tsv but normalized by library size.
  • salmon.merged.gene_tpm_length_scaled.tsv: stores the same abundances as in salmon.merged.gene_tpm.tsv but normalized by library size AND average transcript length.

As far as I understand, these three files should provide different values, but they are all identical.

$ cut -f2-8 salmon.merged.gene_tpm.tsv | head
gene_name	X04S	X04T	X10S	X10T	X11S.FIDIS	X11T.FIDIS
TSPAN6	24.132338	2.917993	54.918612	5.67429	19.277248	2.545522
TNMD	0.207567	0	0.105188	0	0	0
DPM1	18.345738	25.45351	18.982674	15.297568	23.218046	33.745858
SCYL3	3.834507	2.412431	4.602108	1.664488	3.152804	3.360443
C1orf112	1.694871	2.447275	1.708524	1.110848	1.143803	2.322585
FGR	2.92266	7.166522	0.787274	1.529363	0.639491	1.055823
CFH	24.313129	20.593207	24.559955	24.067773	62.227481	7.129656
FUCA2	6.561708	8.532212	5.228672	8.413182	8.060727	5.354021
GCLC	59.804869	31.302783	14.608898	10.564397	81.377122	12.485694

$ cut -f2-8 salmon.merged.gene_tpm_scaled.tsv | head
gene_name	X04S	X04T	X10S	X10T	X11S.FIDIS	X11T.FIDIS
TSPAN6	24.132338	2.917993	54.918612	5.67429	19.277248	2.545522
TNMD	0.207567	0	0.105188	0	0	0
DPM1	18.345738	25.45351	18.982674	15.297568	23.218046	33.745858
SCYL3	3.834507	2.412431	4.602108	1.664488	3.152804	3.360443
C1orf112	1.694871	2.447275	1.708524	1.110848	1.143803	2.322585
FGR	2.92266	7.166522	0.787274	1.529363	0.639491	1.055823
CFH	24.313129	20.593207	24.559955	24.067773	62.227481	7.129656
FUCA2	6.561708	8.532212	5.228672	8.413182	8.060727	5.354021
GCLC	59.804869	31.302783	14.608898	10.564397	81.377122	12.485694

$ cut -f2-8 salmon.merged.gene_tpm_length_scaled.tsv | head
gene_name	X04S	X04T	X10S	X10T	X11S.FIDIS	X11T.FIDIS
TSPAN6	24.132338	2.917993	54.918612	5.67429	19.277248	2.545522
TNMD	0.207567	0	0.105188	0	0	0
DPM1	18.345738	25.45351	18.982674	15.297568	23.218046	33.745858
SCYL3	3.834507	2.412431	4.602108	1.664488	3.152804	3.360443
C1orf112	1.694871	2.447275	1.708524	1.110848	1.143803	2.322585
FGR	2.92266	7.166522	0.787274	1.529363	0.639491	1.055823
CFH	24.313129	20.593207	24.559955	24.067773	62.227481	7.129656
FUCA2	6.561708	8.532212	5.228672	8.413182	8.060727	5.354021
GCLC	59.804869	31.302783	14.608898	10.564397	81.377122	12.485694
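
The listings above compare only the first rows. One way to confirm that the three TPM tables are identical byte for byte is to hash each file. This is a minimal sketch: `file_digest` and `all_identical` are hypothetical helper names, and the file names are the ones listed in this issue, assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def all_identical(paths: Iterable[Path]) -> bool:
    """True if every file in `paths` has the same content digest."""
    return len({file_digest(p) for p in paths}) <= 1


if __name__ == "__main__":
    # File names as reported in this issue; the directory is an assumption.
    names = [
        "salmon.merged.gene_tpm.tsv",
        "salmon.merged.gene_tpm_scaled.tsv",
        "salmon.merged.gene_tpm_length_scaled.tsv",
    ]
    paths = [Path(n) for n in names]
    if all(p.exists() for p in paths):
        print("identical" if all_identical(paths) else "different")
    else:
        print("one or more files not found; adjust the paths")
```

Hashing the whole file avoids drawing a conclusion from the first ten rows alone.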

The values in the salmon.merged.gene_counts*.tsv files, however, do differ:

$ cut -f2-8 salmon.merged.gene_counts.tsv | head
gene_name	X04S	X04T	X10S	X10T	X11S.FIDIS	X11T.FIDIS
TSPAN6	993	122	2761	398	1504	221
TNMD	3	0	2	0	0	0
DPM1	267.001	423.999	325	340	496.999	708.001
SCYL3	158	112.999	250	149	243	232
C1orf112	65	89.001	59	61.999	65	143
FGR	55.999	228.001	27	91	38	38.001
CFH	1233.141	1183.906	1511.945	2007.725	4892.099	547.579
FUCA2	199	310	217	469	422	278
GCLC	1471.001	839	477.999	394.001	3622	350.999

$ cut -f2-8 salmon.merged.gene_counts_scaled.tsv | head
gene_name	X04S	X04T	X10S	X10T	X11S.FIDIS	X11T.FIDIS
TSPAN6	226.442607359473	27.325455413908	680.982329517792	64.2363448394888	254.038754538895	33.225507361704
TNMD	1.94767753881881	0	1.30431499756981	0	0	0
DPM1	172.144810281281	238.358609027664	235.382233640515	173.17758754902	305.971240742805	440.468891412456
SCYL3	35.9805901532686	22.5911356640093	57.0654303231929	18.8429962425592	41.5481712672496	43.862289791676
C1orf112	15.9035982496995	22.9174312268986	21.1854344308093	12.5754614572495	15.0732246406671	30.31561503522
FGR	27.4243942226085	67.110674146165	9.76207633377169	17.3133007041859	8.42731790237031	13.781163493836
CFH	228.139035837605	192.844451548677	304.539658954821	272.461535442591	820.044011019246	93.060063088992
FUCA2	61.5709206564035	79.8996360128384	64.8347274116187	95.2422347376289	106.225590279175	69.883530431172
GCLC	561.171092048839	293.13394555701	181.148084946645	119.595270723432	1072.40113945931	162.969920477208

$ cut -f2-8 salmon.merged.gene_counts_length_scaled.tsv | head
gene_name	X04S	X04T	X10S	X10T	X11S.FIDIS	X11T.FIDIS
TSPAN6	1009.89538613071	145.674915501123	3036.13600233514	425.158554183428	1412.3339507741	179.663001697246
TNMD	2.8139289243869	0	1.88384728463668	0	0	0
DPM1	234.747151082247	388.540561304232	320.883372483089	350.469267458502	520.123153074086	728.266655861836
SCYL3	166.298457316012	124.81222537306	263.669668052293	129.247383249636	239.381748242847	245.798951903383
C1orf112	51.9171282120011	89.4294020564928	69.1384720045292	60.9243711627601	61.339554864032	119.991487594531
FGR	74.6100360692731	218.248331939976	26.550306790444	69.9023760823351	28.5804305459881	45.4584769004731
CFH	1111.04633324021	1122.63568145619	1482.66680176865	1969.19909647884	4978.38996909827	549.496208248179
FUCA2	200.825423561577	311.520742893486	211.406273589949	461.025004293187	431.908131185053	276.367308907589
GCLC	1470.36420604633	918.11032098626	474.493771440154	465.0455540331	3502.72531314	517.733030792211

Therefore, either I misunderstand what salmon does (most likely) or there is a bug.
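
The numbers above already hint at how the count tables relate to the TPM table. The sketch below takes the first three genes of sample X04S from the listings in this issue and divides each scaled count by the corresponding TPM. The helper names `ratios` and `spread` are hypothetical. The interpretation in the closing note is a reader's observation from the data, not something confirmed by the pipeline authors in this thread.

```python
# Values copied from the tables in this issue (sample X04S, first three genes).
tpm = {"TSPAN6": 24.132338, "TNMD": 0.207567, "DPM1": 18.345738}
counts_scaled = {
    "TSPAN6": 226.442607359473,
    "TNMD": 1.94767753881881,
    "DPM1": 172.144810281281,
}
counts_length_scaled = {
    "TSPAN6": 1009.89538613071,
    "TNMD": 2.8139289243869,
    "DPM1": 234.747151082247,
}


def ratios(numerators, denominators):
    """Per-gene ratio numerator / denominator."""
    return {gene: numerators[gene] / denominators[gene] for gene in denominators}


def spread(values):
    """Relative spread (max - min) / min of a collection of positive numbers."""
    vals = list(values)
    return (max(vals) - min(vals)) / min(vals)


scaled_ratio = ratios(counts_scaled, tpm)
length_ratio = ratios(counts_length_scaled, tpm)

# A spread near zero means one per-sample factor maps TPM to the counts.
print("gene_counts_scaled / TPM:", scaled_ratio, "spread:", spread(scaled_ratio.values()))
print("gene_counts_length_scaled / TPM:", length_ratio, "spread:", spread(length_ratio.values()))
```

For these three genes the first ratio is essentially constant while the second varies by gene. That pattern is consistent with the scaled count tables being derived from the TPM values, with the TPM values themselves left untouched.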

Command used and terminal output

$ nextflow -Dnxf.pool.type=sync -log nf.log run -with-timeline -with-trace -with-report -w work -dump-hashes nf-core/rnaseq -r 3.8.1 \
        -profile singularity \
        --outdir results \
        --input "data/sample_data.csv" \
        --fasta $(realpath data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz) \
        --gtf $(realpath data/Homo_sapiens.GRCh38.106.gtf.gz) \
        --remove_ribo_rna \
        --save_reference \
        -process.cache='lenient' \
        --ribo_database_manifest $(realpath data/rrna_db.txt)

Relevant files

No response

System information

  • Nextflow version: 22.04.3
  • Hardware: HPC
  • Executor: Slurm
  • Container engine: Singularity
  • OS: CentOS 7.9
  • Version of nf-core/rnaseq: 3.8.1

jacorvar added the bug label on Jul 14, 2022
jacorvar changed the title from "all salmon.merged.gene_tpm* contain the same exact values" to "all salmon.merged.gene_tpm* files contain the same exact values" on Jul 14, 2022
jacorvar changed the title from "all salmon.merged.gene_tpm* files contain the same exact values" to "all salmon.merged.gene_tpm*.tsv files contain the same exact values" on Jul 14, 2022
drpatelh modified the milestones: 3.9, 3.10 on Sep 25, 2022
drpatelh modified the milestones: 3.10, 3.11 on Dec 22, 2022

drpatelh (Member) commented May 7, 2023

Hi @rob-p! Would love your input on this, please.

rob-p commented May 7, 2023

Pinging @mikelove, who might have some insight into the tximport behavior here (or into whether the issue likely resides elsewhere).

mikelove commented May 8, 2023

tximport doesn’t modify abundances. So when you write out those matrices, they are (as far as tximport is concerned) unchanged from the Salmon TPM column.
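
To make that concrete, here is a small Python sketch of how tximport's `countsFromAbundance` option (values `"scaledTPM"` and `"lengthScaledTPM"`) derives counts from abundances. It reflects my reading of the tximport logic and is an assumption, not code taken from this thread or from the pipeline. The function names and the toy matrices are invented for illustration. Rows are features, columns are samples.

```python
import copy


def column_sums(matrix):
    """Sum each column of a row-major matrix (list of rows)."""
    return [sum(col) for col in zip(*matrix)]


def counts_from_abundance(counts, abundance, lengths, mode):
    """Derive counts from abundances, rescaled to the original library sizes.

    Assumed behavior, mirroring tximport's countsFromAbundance option:
      - "scaledTPM": start from the abundances as they are.
      - "lengthScaledTPM": first multiply each feature's abundance by its
        average length across samples.
    In both cases each column is then rescaled so that it sums to the
    original count total. The abundance matrix itself is never modified.
    """
    if mode == "scaledTPM":
        new = [list(row) for row in abundance]
    elif mode == "lengthScaledTPM":
        new = [
            [a * (sum(len_row) / len(len_row)) for a in ab_row]
            for ab_row, len_row in zip(abundance, lengths)
        ]
    else:
        raise ValueError(f"unsupported mode: {mode}")
    counts_sum = column_sums(counts)
    new_sum = column_sums(new)
    return [
        [value * counts_sum[j] / new_sum[j] for j, value in enumerate(row)]
        for row in new
    ]


# Toy data, invented for illustration: 2 features x 2 samples.
toy_counts = [[100.0, 50.0], [300.0, 150.0]]
toy_abundance = [[10.0, 20.0], [30.0, 20.0]]
toy_lengths = [[1000.0, 1000.0], [2000.0, 2000.0]]

before = copy.deepcopy(toy_abundance)
for mode in ("scaledTPM", "lengthScaledTPM"):
    derived = counts_from_abundance(toy_counts, toy_abundance, toy_lengths, mode)
    print(mode, derived, "abundance unchanged:", toy_abundance == before)
```

Under this reading, only the counts matrix differs between modes, so writing out the abundance matrix for each mode would produce the same TPM table each time.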

pinin4fjords (Member) commented

Resolved in #1304
