Use of salmon counts for DESeq2 #499

tamuanand · 2020-11-17T20:30:50Z

Question/Suggestion on this - https://github.com/nf-core/rnaseq/blob/master/bin/salmon_tximport.r

I believe that if one has to use Salmon quant.sf counts for downstream DESeq2, the counts have to be "non-normalized".

See below (different links and my thread with Mike Love on tximport with salmon counts for DESeq2)

https://hbctraining.github.io/DGE_workshop_salmon/lessons/01_DGE_setup_and_overview.html

The pseudocounts generated by Salmon are represented as normalized TPM (transcripts per million) counts and map to transcripts. These need to be converted into non-normalized count estimates for performing DESeq2 analysis.

Note that although there is a column in our quant.sf files that corresponds to the estimated count value for each transcript, those values are correlated by effective length. What we want to do is use the countsFromAbundance=“lengthScaledTPM” argument.

https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor

Do not manually pass the original gene-level counts to downstream methods without an offset. The only case where this would make sense is if there is no length bias to the counts, as happens in 3’ tagged RNA-seq data (see section below). The original gene-level counts are in txi$counts when tximport was run with countsFromAbundance="no". This is simply passing the summed estimated transcript counts, and does not correct for potential differential isoform usage (the offset), which is the point of the tximport methods (Soneson, Love, and Robinson 2015) for gene-level analysis.

My discussions with Mike Love

The text was updated successfully, but these errors were encountered:

drpatelh · 2020-11-18T17:16:48Z

Thank you for reporting this @tamuanand :) Yes, I included the Salmon PCA plots as a bonus really because I got carried away with the DSL2 modularisation of the pipeline. I am guilty of not using Salmon very much. I think @olgabot @lpantano added this functionality to the pipeline. How would you propose we fix this?

lpantano · 2020-11-18T18:52:06Z

we can use a parameter to the pipeline and this passed to the R script? that would be the countsFromAbundance option?

tamuanand · 2020-11-18T19:11:40Z

Hi @drpatelh @lpantano

I would propose that the pipeline can give all possible outputs from Salmon+tximport, but only use the countsFromAbundance=lengthScaledTPM part for DESeq

this way, the end user can use other output files for other needs (say find out Differential Transcript usage, etc)

This is how I have implemented within my code - In my case, I have quant.sf within foldernames salmon_Sample1, salmon_Sample2, and so on. Feel free to use/repurpose as needed.

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)

# test if there is at least one argument: if not, return an error
if (length(args) < 1) {
      stop("This script requires at least 2 arguments\nArg1=path to directory that has sample info which then have quant.sf file in
their subfolders\nArg2=Folder path to a directory that has a 2 column file that ends in tx2gene.tsv - column1 is transcript, column2 is gene\n",
call.=FALSE)
}

get_quants <- function(path1, ...) {
    additionalPath = list(...)

    suppressMessages(library(tximport))
    suppressMessages(library(readr))

    salmon_filepaths=file.path(path=path1,list.files(path1,recursive=TRUE, pattern="quant.sf"))

    samples = data.frame(samples = gsub(".*?quant/salmon_(.*?)/quant.sf", "\\1", salmon_filepaths) )
    row.names(samples)=samples[,1]
    names(salmon_filepaths)=samples$samples


    # no tx2Gene available
    salmon_tx_data_1 = tximport(salmon_filepaths, type="salmon",
                              txOut = TRUE)

    salmon_tx_data_2 = tximport(salmon_filepaths, type="salmon",
                              txOut = TRUE, countsFromAbundance = "lengthScaledTPM")

    salmon_tx_data_3 = tximport(salmon_filepaths, type="salmon",
                              txOut = TRUE, countsFromAbundance = "scaledTPM")
    
    ## Get transcript count summarization
    write.csv(as.data.frame(salmon_tx_data_1$counts),
              file = "tx_NumReads.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_2$counts),
              file = "lengthScaledTPM_tx_NumReads.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_3$counts),
              file = "scaledTPM_tx_NumReads.csv", quote=FALSE)

    ## Get TPM
    write.csv(as.data.frame(salmon_tx_data_1$abundance),
              file  =  "tx_TPM_Abundance.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_2$abundance),
              file  =  "lengthScaledTPM_tx_TPM_Abundance.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_3$abundance),
              file  =  "scaledTPM_tx_TPM_Abundance.csv", quote=FALSE)


    rm(salmon_tx_data_1)
    rm(salmon_tx_data_2)
    rm(salmon_tx_data_3)

   #tx2gene available
    
   if(length(additionalPath > 0)) {
        tx2geneFile = additionalPath[[1]]
        my_tx2gene=read.csv(tx2geneFile,sep = "\t",stringsAsFactors = F, header=F)
        salmon_tx2gene_data_1 = tximport(salmon_filepaths, type="salmon",
                                       txOut = FALSE, tx2gene=my_tx2gene)

        salmon_tx2gene_data_2 = tximport(salmon_filepaths, type="salmon",
                                       txOut = FALSE, tx2gene=my_tx2gene,
                                       countsFromAbundance = "lengthScaledTPM")

        salmon_tx2gene_data_3 = tximport(salmon_filepaths, type="salmon",
                                       txOut = FALSE, tx2gene=my_tx2gene,
                                       countsFromAbundance = "scaledTPM")

        salmon_tx2gene_data_4 = tximport(salmon_filepaths, type="salmon",
                                       txOut = TRUE, tx2gene=my_tx2gene,
                                       countsFromAbundance = "dtuScaledTPM")

        ## Get Gene count summarization
        write.csv(as.data.frame(salmon_tx2gene_data_1$counts),
                  file = "tx2gene_NumReads.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_2$counts),
                  file = "lengthScaledTPM_tx2gene_NumReads.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_3$counts),
                  file = "scaledTPM_tx2gene_NumReads.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_4$counts),
                  file = "dtuScaledTPM_NumReads.csv", quote=FALSE)

      
        ## Get TPM
        write.csv(as.data.frame(salmon_tx2gene_data_1$abundance),
                  file  =  "tx2gene_TPM_Abundance.csv", quote=FALSE)
        write.csv(as.data.frame(salmon_tx2gene_data_2$abundance),
                  file  =  "lengthScaledTPM_tx2gene_TPM_Abundance.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_3$abundance),
                  file  =  "scaledTPM_tx2gene_TPM_Abundance.csv", quote=FALSE)
        write.csv(as.data.frame(salmon_tx2gene_data_4$abundance),
                  file  =  "dtuScaledTPM_TPM_Abundance.csv", quote=FALSE)
        
        rm(salmon_tx2gene_data_1)
        rm(salmon_tx2gene_data_2)
        rm(salmon_tx2gene_data_3)
        rm(salmon_tx2gene_data_4)
        
    } 
}

if(length(args) == 1){
    quant_dirpath =   gsub("/$", '', args[1])

    get_quants(quant_dirpath)
}else if(length(args) == 2){

    quant_dirpath =   gsub("/$", '', args[1])

    tx2gene_File =  gsub("/$", '', args[2])

    suppressWarnings(get_quants(quant_dirpath, tx2gene_File))
    
}else{
    stop("This script requires at least 2 arguments\nArg1=path to directory that has sample info which then have quant.sf file in
      their subfolders\nArg2=Folder path to a directory that has a 2 column file - column1 is transcript, column2 is gene\n",
call.=FALSE)
}

lpantano · 2020-11-18T19:17:07Z

this is fine with me, if this version works, you could just do the PR with this update? not sure if the output section in the pipeline needs to be updated to store all these new files into the final folder. I think is good to have all and explained in the output documentation.

tamuanand · 2020-11-18T19:28:02Z

@lpantano @olgabot @drpatelh

I am not really comfortable with me doing the PR part as there are so many intricacies with this workflow and I am afraid I would inadvertently break some key functionality.

I can exchange messages on this thread about my code above -- I can vouch that I have pressure tested all facets on the code I shared.

Code aside - the key thing to be noted in the nf-core/rnaseq documentation is that if salmon is used, the counts have to be non-normalized before DESeq and the above code does that.

Using the code example above, I would then use the file lengthScaledTPM_tx2gene_NumReads.csv with DESeqDataSetFromMatrix after reading in the file, rounding the counts, etc

drpatelh · 2020-11-23T09:57:40Z

I am planning on doing a minor release soon so be good if we can fix this. If someone is able to submit a PR to fix the salmon_tximport.r R script that generates the counts then I am happy to do the rest. Does that sound reasonable?

lpantano · 2020-11-30T14:13:49Z

I can try this week, I am planning on Wednesday, if that is ok?

drpatelh · 2020-11-30T14:46:12Z

Absolutely 👌

@tamuanand

thanks to @tamuanand issue #499

drpatelh · 2020-12-09T15:15:42Z

I believe this has now been addressed @tamuanand. Mostly by the work from @lpantano in #519. I updated the main script to use these counts for DESeq2 here. This will use the length scaled counts generated by Salmon as initially created here. How does this look now? Are we ok to close this issue?

tamuanand · 2020-12-09T16:10:48Z

Thanks @lpantano

@drpatelh - I think it looks good and this issue can be closed.

Thanks a lot everyone.

tamuanand · 2020-12-09T16:11:58Z

@drpatelh I closed this - probably you should remove the 'bug' label on this issue?

drpatelh · 2020-12-09T17:42:43Z

Thanks @tamuanand 👍🏽 We can leave the label as it is for posterity.

mikelove · 2020-12-11T19:29:36Z

I got pointed here by Rob, thanks everyone here for the effort to include Salmon and tximport in the nf-core.

Just to say, I typically use:

counts + offset for use with DESeq(). This is what you get with basic calls of tximport, DESeqDataSetFromTximport, DESeq without additional arguments.
counts alone with countsFromAbundance="lengthScaledTPM" if I am prevented from passing an offset matrix for some reason. Maybe if you want a single matrix for use with DESeq2, edgeR, limma then this is a good compromise. You can still use DESeqDataSetFromTximport, it will figure out to not build an offset.
scaledTPM or dtuScaledTPM for DTU analysis.

Happy to answer any remaining questions as they come up.

drpatelh · 2020-12-14T15:55:00Z

Thanks @mikelove ! I should be releasing the shiny new version of the pipeline tomorrow! I have pointed users to this issue in the documentation so hopefully, that will help us refine the docs within the pipeline once we reach a consensus on exactly what is missing.

tamuanand · 2020-12-14T16:43:07Z

@drpatelh Often people just use the default kmer size of 31 from salmon

Here is a post from @rob-p on k-mer selection - COMBINE-lab/salmon#482 (comment)

Suggestion: If ok with you, probably the bolded/italicized note below can be added to the nf-core documentation - where this section is listed

NB: The default Salmon parameters and a k-mer size of 31 are used to create the index. As discussed here), a k-mer size off 31 works well with reads that are 75bp or longer. Please also see this

drpatelh · 2020-12-14T16:51:14Z

Done 🙂 ccd21ea

drpatelh added the bug Something isn't working label Nov 18, 2020

lpantano pushed a commit that referenced this issue Dec 2, 2020

Update salmon tximport

1ca2fb1

thanks to @tamuanand issue #499

lpantano mentioned this issue Dec 2, 2020

Update salmon tximport #519

Merged

4 tasks

tamuanand closed this as completed Dec 9, 2020

drpatelh mentioned this issue Dec 9, 2020

Replace featureCounts with Salmon quant #526

Merged

j-andrews7 mentioned this issue Jan 20, 2021

salmon output must not be read in individually when countsFromAbundance = "lengthScaledTPM" #553

Closed

ggrimes mentioned this issue Feb 17, 2022

Export txi object gi from https://github.com/nf-core/rnaseq/blob/master/bin/salmon_tximport.r script #768

Closed

pmoris mentioned this issue Aug 29, 2024

Clarify docs on different tximport count files #1366

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use of salmon counts for DESeq2 #499

Use of salmon counts for DESeq2 #499

tamuanand commented Nov 17, 2020 •

edited

Loading

drpatelh commented Nov 18, 2020 •

edited

Loading

lpantano commented Nov 18, 2020

tamuanand commented Nov 18, 2020 •

edited by ewels

Loading

lpantano commented Nov 18, 2020

tamuanand commented Nov 18, 2020

drpatelh commented Nov 23, 2020

lpantano commented Nov 30, 2020

drpatelh commented Nov 30, 2020

drpatelh commented Dec 9, 2020

tamuanand commented Dec 9, 2020

tamuanand commented Dec 9, 2020

drpatelh commented Dec 9, 2020

mikelove commented Dec 11, 2020

drpatelh commented Dec 14, 2020 •

edited

Loading

tamuanand commented Dec 14, 2020

drpatelh commented Dec 14, 2020

Use of salmon counts for DESeq2 #499

Use of salmon counts for DESeq2 #499

Comments

tamuanand commented Nov 17, 2020 • edited Loading

drpatelh commented Nov 18, 2020 • edited Loading

lpantano commented Nov 18, 2020

tamuanand commented Nov 18, 2020 • edited by ewels Loading

lpantano commented Nov 18, 2020

tamuanand commented Nov 18, 2020

drpatelh commented Nov 23, 2020

lpantano commented Nov 30, 2020

drpatelh commented Nov 30, 2020

drpatelh commented Dec 9, 2020

tamuanand commented Dec 9, 2020

tamuanand commented Dec 9, 2020

drpatelh commented Dec 9, 2020

mikelove commented Dec 11, 2020

drpatelh commented Dec 14, 2020 • edited Loading

tamuanand commented Dec 14, 2020

drpatelh commented Dec 14, 2020

tamuanand commented Nov 17, 2020 •

edited

Loading

drpatelh commented Nov 18, 2020 •

edited

Loading

tamuanand commented Nov 18, 2020 •

edited by ewels

Loading

drpatelh commented Dec 14, 2020 •

edited

Loading