Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of salmon counts for DESeq2 #499

Closed
tamuanand opened this issue Nov 17, 2020 · 16 comments
Closed

Use of salmon counts for DESeq2 #499

tamuanand opened this issue Nov 17, 2020 · 16 comments
Labels
bug Something isn't working

Comments

@tamuanand
Copy link

tamuanand commented Nov 17, 2020

Hi @drpatelh

Question/Suggestion on this - https://github.com/nf-core/rnaseq/blob/master/bin/salmon_tximport.r

I believe that if one has to use Salmon quant.sf counts for downstream DESeq2, the counts have to be "non-normalized".

See below (different links and my thread with Mike Love on tximport with salmon counts for DESeq2)

The pseudocounts generated by Salmon are represented as normalized TPM (transcripts per million) counts and map to transcripts. These need to be converted into non-normalized count estimates for performing DESeq2 analysis.

Note that although there is a column in our quant.sf files that corresponds to the estimated count value for each transcript, those values are correlated by effective length. What we want to do is use the countsFromAbundance=“lengthScaledTPM” argument.

Do not manually pass the original gene-level counts to downstream methods without an offset. The only case where this would make sense is if there is no length bias to the counts, as happens in 3’ tagged RNA-seq data (see section below). The original gene-level counts are in txi$counts when tximport was run with countsFromAbundance="no". This is simply passing the summed estimated transcript counts, and does not correct for potential differential isoform usage (the offset), which is the point of the tximport methods (Soneson, Love, and Robinson 2015) for gene-level analysis.

My discussions with Mike Love

@drpatelh
Copy link
Member

drpatelh commented Nov 18, 2020

Thank you for reporting this @tamuanand :) Yes, I included the Salmon PCA plots as a bonus really because I got carried away with the DSL2 modularisation of the pipeline. I am guilty of not using Salmon very much. I think @olgabot @lpantano added this functionality to the pipeline. How would you propose we fix this?

@drpatelh drpatelh added the bug Something isn't working label Nov 18, 2020
@lpantano
Copy link
Contributor

we can use a parameter to the pipeline and this passed to the R script? that would be the countsFromAbundance option?

@tamuanand
Copy link
Author

tamuanand commented Nov 18, 2020

Hi @drpatelh @lpantano

I would propose that the pipeline can give all possible outputs from Salmon+tximport, but only use the countsFromAbundance=lengthScaledTPM part for DESeq

  • this way, the end user can use other output files for other needs (say find out Differential Transcript usage, etc)

This is how I have implemented within my code - In my case, I have quant.sf within foldernames salmon_Sample1, salmon_Sample2, and so on. Feel free to use/repurpose as needed.

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)

# test if there is at least one argument: if not, return an error
if (length(args) < 1) {
      stop("This script requires at least 2 arguments\nArg1=path to directory that has sample info which then have quant.sf file in
their subfolders\nArg2=Folder path to a directory that has a 2 column file that ends in tx2gene.tsv - column1 is transcript, column2 is gene\n",
call.=FALSE)
}

get_quants <- function(path1, ...) {
    additionalPath = list(...)

    suppressMessages(library(tximport))
    suppressMessages(library(readr))

    salmon_filepaths=file.path(path=path1,list.files(path1,recursive=TRUE, pattern="quant.sf"))

    samples = data.frame(samples = gsub(".*?quant/salmon_(.*?)/quant.sf", "\\1", salmon_filepaths) )
    row.names(samples)=samples[,1]
    names(salmon_filepaths)=samples$samples


    # no tx2Gene available
    salmon_tx_data_1 = tximport(salmon_filepaths, type="salmon",
                              txOut = TRUE)

    salmon_tx_data_2 = tximport(salmon_filepaths, type="salmon",
                              txOut = TRUE, countsFromAbundance = "lengthScaledTPM")

    salmon_tx_data_3 = tximport(salmon_filepaths, type="salmon",
                              txOut = TRUE, countsFromAbundance = "scaledTPM")
    
    ## Get transcript count summarization
    write.csv(as.data.frame(salmon_tx_data_1$counts),
              file = "tx_NumReads.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_2$counts),
              file = "lengthScaledTPM_tx_NumReads.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_3$counts),
              file = "scaledTPM_tx_NumReads.csv", quote=FALSE)

    ## Get TPM
    write.csv(as.data.frame(salmon_tx_data_1$abundance),
              file  =  "tx_TPM_Abundance.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_2$abundance),
              file  =  "lengthScaledTPM_tx_TPM_Abundance.csv", quote=FALSE)
    write.csv(as.data.frame(salmon_tx_data_3$abundance),
              file  =  "scaledTPM_tx_TPM_Abundance.csv", quote=FALSE)


    rm(salmon_tx_data_1)
    rm(salmon_tx_data_2)
    rm(salmon_tx_data_3)

   #tx2gene available
    
   if(length(additionalPath > 0)) {
        tx2geneFile = additionalPath[[1]]
        my_tx2gene=read.csv(tx2geneFile,sep = "\t",stringsAsFactors = F, header=F)
        salmon_tx2gene_data_1 = tximport(salmon_filepaths, type="salmon",
                                       txOut = FALSE, tx2gene=my_tx2gene)

        salmon_tx2gene_data_2 = tximport(salmon_filepaths, type="salmon",
                                       txOut = FALSE, tx2gene=my_tx2gene,
                                       countsFromAbundance = "lengthScaledTPM")

        salmon_tx2gene_data_3 = tximport(salmon_filepaths, type="salmon",
                                       txOut = FALSE, tx2gene=my_tx2gene,
                                       countsFromAbundance = "scaledTPM")

        salmon_tx2gene_data_4 = tximport(salmon_filepaths, type="salmon",
                                       txOut = TRUE, tx2gene=my_tx2gene,
                                       countsFromAbundance = "dtuScaledTPM")

        ## Get Gene count summarization
        write.csv(as.data.frame(salmon_tx2gene_data_1$counts),
                  file = "tx2gene_NumReads.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_2$counts),
                  file = "lengthScaledTPM_tx2gene_NumReads.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_3$counts),
                  file = "scaledTPM_tx2gene_NumReads.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_4$counts),
                  file = "dtuScaledTPM_NumReads.csv", quote=FALSE)

      
        ## Get TPM
        write.csv(as.data.frame(salmon_tx2gene_data_1$abundance),
                  file  =  "tx2gene_TPM_Abundance.csv", quote=FALSE)
        write.csv(as.data.frame(salmon_tx2gene_data_2$abundance),
                  file  =  "lengthScaledTPM_tx2gene_TPM_Abundance.csv", quote=FALSE)

        write.csv(as.data.frame(salmon_tx2gene_data_3$abundance),
                  file  =  "scaledTPM_tx2gene_TPM_Abundance.csv", quote=FALSE)
        write.csv(as.data.frame(salmon_tx2gene_data_4$abundance),
                  file  =  "dtuScaledTPM_TPM_Abundance.csv", quote=FALSE)
        
        rm(salmon_tx2gene_data_1)
        rm(salmon_tx2gene_data_2)
        rm(salmon_tx2gene_data_3)
        rm(salmon_tx2gene_data_4)
        
    } 
}

if(length(args) == 1){
    quant_dirpath =   gsub("/$", '', args[1])

    get_quants(quant_dirpath)
}else if(length(args) == 2){

    quant_dirpath =   gsub("/$", '', args[1])

    tx2gene_File =  gsub("/$", '', args[2])

    suppressWarnings(get_quants(quant_dirpath, tx2gene_File))
    
}else{
    stop("This script requires at least 2 arguments\nArg1=path to directory that has sample info which then have quant.sf file in
      their subfolders\nArg2=Folder path to a directory that has a 2 column file - column1 is transcript, column2 is gene\n",
call.=FALSE)
}

@lpantano
Copy link
Contributor

this is fine with me, if this version works, you could just do the PR with this update? not sure if the output section in the pipeline needs to be updated to store all these new files into the final folder. I think is good to have all and explained in the output documentation.

@tamuanand
Copy link
Author

@lpantano @olgabot @drpatelh

I am not really comfortable with me doing the PR part as there are so many intricacies with this workflow and I am afraid I would inadvertently break some key functionality.

I can exchange messages on this thread about my code above -- I can vouch that I have pressure tested all facets on the code I shared.

Code aside - the key thing to be noted in the nf-core/rnaseq documentation is that if salmon is used, the counts have to be non-normalized before DESeq and the above code does that.

Using the code example above, I would then use the file lengthScaledTPM_tx2gene_NumReads.csv with DESeqDataSetFromMatrix after reading in the file, rounding the counts, etc

@drpatelh
Copy link
Member

I am planning on doing a minor release soon so be good if we can fix this. If someone is able to submit a PR to fix the salmon_tximport.r R script that generates the counts then I am happy to do the rest. Does that sound reasonable?

@lpantano
Copy link
Contributor

I can try this week, I am planning on Wednesday, if that is ok?

@drpatelh
Copy link
Member

Absolutely 👌

lpantano pushed a commit that referenced this issue Dec 2, 2020
@lpantano lpantano mentioned this issue Dec 2, 2020
4 tasks
@drpatelh
Copy link
Member

drpatelh commented Dec 9, 2020

I believe this has now been addressed @tamuanand. Mostly by the work from @lpantano in #519. I updated the main script to use these counts for DESeq2 here. This will use the length scaled counts generated by Salmon as initially created here. How does this look now? Are we ok to close this issue?

@tamuanand
Copy link
Author

Thanks @lpantano

@drpatelh - I think it looks good and this issue can be closed.

Thanks a lot everyone.

@tamuanand
Copy link
Author

@drpatelh I closed this - probably you should remove the 'bug' label on this issue?

@drpatelh
Copy link
Member

drpatelh commented Dec 9, 2020

Thanks @tamuanand 👍🏽 We can leave the label as it is for posterity.

@mikelove
Copy link

I got pointed here by Rob, thanks everyone here for the effort to include Salmon and tximport in the nf-core.

Just to say, I typically use:

  • counts + offset for use with DESeq(). This is what you get with basic calls of tximport, DESeqDataSetFromTximport, DESeq without additional arguments.
  • counts alone with countsFromAbundance="lengthScaledTPM" if I am prevented from passing an offset matrix for some reason. Maybe if you want a single matrix for use with DESeq2, edgeR, limma then this is a good compromise. You can still use DESeqDataSetFromTximport, it will figure out to not build an offset.
  • scaledTPM or dtuScaledTPM for DTU analysis.

Happy to answer any remaining questions as they come up.

@drpatelh
Copy link
Member

drpatelh commented Dec 14, 2020

Thanks @mikelove ! I should be releasing the shiny new version of the pipeline tomorrow! I have pointed users to this issue in the documentation so hopefully, that will help us refine the docs within the pipeline once we reach a consensus on exactly what is missing.

@tamuanand
Copy link
Author

@drpatelh Often people just use the default kmer size of 31 from salmon

Here is a post from @rob-p on k-mer selection - COMBINE-lab/salmon#482 (comment)

Suggestion: If ok with you, probably the bolded/italicized note below can be added to the nf-core documentation - where this section is listed

NB: The default Salmon parameters and a k-mer size of 31 are used to create the index. As discussed here), a k-mer size off 31 works well with reads that are 75bp or longer. Please also see this

@drpatelh
Copy link
Member

Done 🙂 ccd21ea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants