Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DESeq2 with phyloseq issue with tutorial code #1556

Closed
paddyhooper opened this issue Feb 22, 2022 · 2 comments
Closed

DESeq2 with phyloseq issue with tutorial code #1556

paddyhooper opened this issue Feb 22, 2022 · 2 comments

Comments

@paddyhooper
Copy link

Hi @joey711 and all,

Version info:
R Version: 4.1.0
RStudio Version 1.4.1717

library("DESeq2"); packageVersion("DESeq2")
[1] ‘1.32.0’
library("phyloseq"); packageVersion("phyloseq")
[1] ‘1.36.0’

I am hoping to use the DESeq2 plugin extension in phyloseq for my data. I am following this tutorial: https://joey711.github.io/phyloseq-extensions/DESeq2.html

When I try run the script I am able to access the data and create the phyloseq object. However, when I run:

diagdds = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)

I get this warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors

diagdds = DESeq(diagdds, test="Wald", fitType="parametric")
I get this error:
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means

I have seen on a few forums (https://www.biostars.org/p/440379/) that people have had similar issues and suggest adding a pseudocount to the data which I think may come from previous discussions on issue 445?

estimateSizeFactors(dds_PvsN, type = 'iterate')

However, I wanted to check there is not a more fundamental issue I am making (quite likely!) as I was surprised to have this issue using the sample data for the walkthrough.

Many thanks for your amazing work,

Paddy

@ycl6
Copy link

ycl6 commented Feb 24, 2022

Hi @paddyhooper

As shown in the tutorial, the DESeq2 used then was v1.2.10, released many years ago, much have changed since then.

This issue #642 have discussion about adding pseudocount vs. alternative geometric mean
And this issue #445 have in-depth discussion on when to use the geometric mean and if you need to look into doing data pruning as part of the pre-processing steps (which is often necessary).

If we choose the geometric mean approach on the demo dataset kostic as shown in the tutorial:

library(phyloseq)
library(DESeq2)

filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", package="phyloseq")
kostic = microbio_me_qiime(filepath)
kostic = subset_samples(kostic, DIAGNOSIS != "None")

# Calculate geometric means prior to estimate size factors
gm_mean = function(x, na.rm = TRUE){
    exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

diagdds = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
geoMeans = apply(counts(diagdds), 1, gm_mean)
diagdds = estimateSizeFactors(diagdds, geoMeans = geoMeans)
diagdds = estimateDispersions(diagdds, fitType = "parametric")
diagdds <- nbinomWaldTest(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.01
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kostic)[rownames(sigtab), ], "matrix"))
head(sigtab)
> head(sigtab)
         baseMean log2FoldChange     lfcSE      stat       pvalue        padj  Kingdom         Phylum
340615   2.264433      -1.247421 0.3180791 -3.921733 8.791441e-05 0.003890873 Bacteria     Firmicutes
194648   8.561293      -1.266032 0.3170883 -3.992679 6.533111e-05 0.003716614 Bacteria     Firmicutes
180285 158.588600      -1.214061 0.2928066 -4.146289 3.379067e-05 0.002471546 Bacteria     Firmicutes
168181   3.106247      -1.707317 0.4275008 -3.993715 6.504595e-05 0.003716614 Bacteria  Bacteroidetes
9778     2.175787       2.417279 0.5756902  4.198924 2.681866e-05 0.002288526 Bacteria Proteobacteria
470993   7.579506       2.619388 0.6766002  3.871397 1.082134e-04 0.003890873 Bacteria     Firmicutes
                     Class             Order             Family            Genus               Species
340615          Clostridia     Clostridiales    Lachnospiraceae      Coprococcus                  <NA>
194648          Clostridia     Clostridiales               <NA>             <NA>                  <NA>
180285          Clostridia     Clostridiales    Ruminococcaceae Faecalibacterium                  <NA>
168181         Bacteroidia     Bacteroidales      Rikenellaceae        Alistipes                  <NA>
9778   Gammaproteobacteria Enterobacteriales Enterobacteriaceae       Salmonella    Salmonella bongori
470993          Clostridia     Clostridiales    Veillonellaceae      Selenomonas Selenomonas sputigena
 [1] DESeq2_1.34.0               SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [4] MatrixGenerics_1.6.0        matrixStats_0.61.0          GenomicRanges_1.46.0       
 [7] GenomeInfoDb_1.30.0         IRanges_2.28.0              S4Vectors_0.32.0           
[10] BiocGenerics_0.40.0         phyloseq_1.38.0            

@paddyhooper
Copy link
Author

Hi there,

Thanks very much for your response! I hadn't considered the package date on the tutorial. From your response it sounds like generally it could be a better option to use the alternative geometric mean on 'sparse' ASV datasets.

Thanks for pointing me in the way of these responses, I will look into these.

Kind regards,
Paddy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants