Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using transformed count data from DEseq2 #283

Closed
kfontanez opened this issue Jan 7, 2014 · 10 comments
Closed

Using transformed count data from DEseq2 #283

kfontanez opened this issue Jan 7, 2014 · 10 comments

Comments

@kfontanez
Copy link

Joey-

In your "Waste not, want not" paper you suggest that researchers interested in PCA and clustering ought to consider using variance stabilized data rather than proportions or rarified data.

DEseq2 allows output of the transformed count matrix derived from the variancestabilizingTransformation function using getVarianceStabilizedData(object). Do you have a convenient way to change that back into a phyloseq data object so that all of the available graphical and analytical functions of phyloseq can be used (barplots, heatmaps, etc)?

Thanks,
Kristina

@joey711
Copy link
Owner

joey711 commented Jan 9, 2014

Adapting Negative Binomial Tutorial for using Variance Stabilized of Counts Only

The following example uses additional commands from DESeq2. An additional wrapper in phyloseq is not needed. The otu_table<- assignment method is adequate.

In the phyloseq-included vignette example I use the publicly available data from a study on colorectal cancer:

Genomic analysis identifies association of Fusobacterium with colorectal carcinoma.
Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., et al. (2012). Genome research, 22(2), 292-298.

Study ID: 1457

Project Name: Kostic_colorectal_cancer_fusobacterium

Import data with phyloseq, convert to DESeq2

Start by loading phyloseq.

library("phyloseq")
packageVersion("phyloseq")
## [1] '1.7.12'

Defined file path, and import the published OTU count data into R.

filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", 
    package = "phyloseq")
kostic = microbio_me_qiime(filepath)
## Found biom-format file, now parsing it... 
## Done parsing biom... 
## Importing Sample Metdadata from mapping file...
## Merging the imported objects... 
## Successfully merged, phyloseq-class created. 
##  Returning...

Convert to DESeq2's DESeqDataSet class

In this example I'm using the major sample covariate, DIAGNOSIS, as the study design factor. The focus of this study was to compare the microbiomes of pairs of healthy and cancerous tissues, so this makes sense. Your study could have a more complex or nested design, and you should think carefully about the study design formula, because this is critical to the test results and their meaning. You might even need to define a new factor if none of the variables in your current table appropriately represent your study's design. See the DESeq2 home page for more details.

Here is the summary of the data variable kostic that we are about to use, as well as the first few entries of the DIAGNOSIS factor.

kostic
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2505 taxa and 190 samples ]
## sample_data() Sample Data:       [ 190 samples by 71 sample variables ]
## tax_table()   Taxonomy Table:    [ 2505 taxa by 7 taxonomic ranks ]

DESeq2 conversion

First load DESeq2.

library("DESeq2")
packageVersion("DESeq2")
## [1] '1.2.6'

The following two lines actually do all the complicated DESeq2 work. The function phyloseq_to_deseq2 converts your phyloseq-format microbiome data into a DESeqDataSet with dispersions estimated, using the experimental design formula, also shown (the ~DIAGNOSIS term). The DESeq function does the rest of the testing, in this case with default testing framework, but you can actually use alternatives.

diagdds = phyloseq_to_deseq2(kostic, ~DIAGNOSIS)
diagdds
## class: DESeqDataSet 
## dim: 2505 190 
## exptData(0):
## assays(1): counts
## rownames(2505): 304309 469478 ... 206906 298806
## rowData metadata column names(0):
## colnames(190): C0333.N.518126 C0333.T.518046 ... 32I9UNA9.518098
##   BFJMKNMP.518102
## colData names(71): X.SampleID BarcodeSequence ... HOST_TAXID
##   Description

New Part Not Shown in Original Vignette: DESeq2 Variance Stabilization

You must step through the size factor and dispersion estimates prior to calling the getVarianceStabilizedData function.

diagdds = estimateSizeFactors(diagdds)
diagdds = estimateDispersions(diagdds)
diagvst = getVarianceStabilizedData(diagdds)
dim(diagvst)
## [1] 2505  190
kostic
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2505 taxa and 190 samples ]
## sample_data() Sample Data:       [ 190 samples by 71 sample variables ]
## tax_table()   Taxonomy Table:    [ 2505 taxa by 7 taxonomic ranks ]

As you can see, the dimensions of the variance stabilized count table, diagvst, are the same as the OTU table. A one-line construction and replacement call can replace the original counts with variance stabilized counts.

# Save the untransformed data as a separate variable so you can go back to
# it
kostic0 = kostic
otu_table(kostic) <- otu_table(diagvst, taxa_are_rows = TRUE)

This modified kostic variable now has the results of DESeq2 variance-stabilization of counts instead of the original counts. I want to compare the values in both using a heatmap, but there are 2505 OTUs, enough to make the graphic rather unreadable. I will first filter using the top most differentially abundant OTUs from the original vignette example.

diagdds = DESeq(diagdds)  #, fitType='local')
res = results(diagdds)
res = res[order(res$padj, na.last = NA), ]
alpha = 0.01
keepOTUs = rownames(res[res$padj > alpha, ])[1:50]
kosticTrimvs = prune_taxa(keepOTUs, kostic)
kosticTrim0 = prune_taxa(keepOTUs, kostic0)
plot_heatmap(kosticTrimvs, taxa.order = "Phylum", taxa.label = "Genus", sample.label = "DIAGNOSIS", 
    sample.order = "DIAGNOSIS")

discrim-otus1

plot_heatmap(kosticTrim0, taxa.order = "Phylum", taxa.label = "Genus", sample.label = "DIAGNOSIS", 
    sample.order = "DIAGNOSIS")

discrim-otus2

Comparing these two heatmaps gives you a rough idea of what the variance stabilizing transformation does to the counts. In this case it looks very similar to a log transformation, and probably is similar.

@joey711 joey711 closed this as completed Jan 9, 2014
@kfontanez
Copy link
Author

Joey-

The below method worked to export the otu_table from the kostic variable containing the variance-stabilized data, so I could visually check the values in excel. Thanks for providing the detailed method for adding the vst counts to the OTU table in the kostic variable which will allow me to utilize the full range of phyloseq's graphical options.

exportkostic<-otu_table(kostic)
write.csv(exportkostic,file="kosticcounts_vst.csv")

Thank you!
Kristina

@barbara1
Copy link

Thanks for the programming work and documentation on how to obtain the VSD in deseq2 from a phyoseq object. I did a test run with a subset of my data yesterday, and everything worked out fine, but when I try using the phyloseq_to_deseq2 with my larger data set (pruned to minimum taxa sum >0) I receive this error message:
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means

I am not sure what to make of it. When looking on the web, there was a comment that this warning is a result of a zero-matrix being supplied. but when I look at the original DT (phyloseq object) and the dds_DT, the correct information seems to be there (see paste below). As with most species observation matrix, taxa are absent from some samples. I am wondering if this is a problem for generating VSD?

Any help would be great!
Thanks,
Barbara

DT
phyloseq-class experiment-level object
otu_table() OTU Table: [ 7723 taxa and 180 samples ]
sample_data() Sample Data: [ 180 samples by 9 sample variables ]
tax_table() Taxonomy Table: [ 7723 taxa by 7 taxonomic ranks ]
dds_DT=phyloseq_to_deseq2(DT, ~Horizon+Stand)
converting counts to integer mode
dds_DT
class: DESeqDataSet
dim: 7723 180
exptData(0):
assays(1): counts
rownames(7723): sp1 sp2 ... sp10086 sp10091
rowData metadata column names(0):
colnames(180): L101 L103 ... S644 S646
colData names(9): SampleID Replicated ... x_coord_UTM y_coord_UTM

dds_DT = estimateSizeFactors(dds_DT)
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means

@joey711
Copy link
Owner

joey711 commented Nov 26, 2014

Hi Barbara,

One of vignettes in the latest version of phyloseq (1.10.0+) includes an example for dealing with this. you can find it here:

http://www.bioconductor.org/packages/release/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html

The quick answer is that you need to provide you own geometric means (calculated in a manner that is tolerant of zeros)... or include pseudocount at the step that calculates geometric means. The strict definition of geometric mean includes a product of its terms, so one zero in the bunch will result in a zero value. For real data, and this data especially, we expect occasional zeros, so one of these approaches is necessary for the geometric mean to be useful.

Hope that helps!

joey

@barbara1
Copy link

Joey,

Yes, that is of great help. Thanks for the information and also the script
to adjust the zeros when calculating the geometric means. Now I just have
to decide when to prune the data set. I realized after I had transformed
that data, that I am not sure how to prune based on the VST data as I have
always used relative abundance transformation which can be easily pruned
after transformation. Are there standard practices in place for how and
when to prune taxa when using the Variance Stabilized transformation?

Thanks again for the prompt reply and useful information.

Cheers,
Barbara

Barbara D Bahnmann MSc | Marie Curie PhD Fellow
Trainbiodiverse ITN | Soil Microbial Ecology
http://www.trainbiodiverse.com/people/barbara-bahnmann
barbara.bahnmann@ualberta.ca | +420 607 398 692

On Wed, Nov 26, 2014 at 10:12 PM, Paul J. McMurdie <notifications@github.com

wrote:

Hi Barbara,

One of vignettes in the latest version of phyloseq (1.10.0+) includes an
example for dealing with this. you can find it here:

http://www.bioconductor.org/packages/release/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html

The quick answer is that you need to provide you own geometric means
(calculated in a manner that is tolerant of zeros)... or include
pseudocount at the step that calculates geometric means. The strict
definition of geometric mean includes a product of its terms, so one zero
in the bunch will result in a zero value. For real data, and this data
especially, we expect occasional zeros, so one of these approaches is
necessary for the geometric mean to be useful.

Hope that helps!

joey


Reply to this email directly or view it on GitHub
#283 (comment).

@barbara1
Copy link

barbara1 commented Feb 2, 2015

Hi Joey,

I wanted to do some further analyses on a phyloseq object I created in
December, but now I get an error message.

Error in phyloseq(DT) :
Problem with OTU/taxa indices among those you provided.
Check using intersect() and taxa_names()

I updated my phyloseq package in January and I thought this might be the
problem, so I recreated the phyloseq object from the original files
(sample, otu_table and taxa_table) and I continue to get the error message.
I had no issues in December with the set up.

I checked that the row_names matched between the otu_table and taxa_table
and assigned the row_names from the taxa_table to the otu_table, but this
did not change anything.

rownames(taxa_table)=taxa_names(otu_table)

I did not see any information on how to use the intersect() to check the
issue further. If you have any suggestions on where I should be looking for
the issue in my files or any known issues with updating the package, it
would be really helpful!

Thanks in advance,

Barbara

Barbara D Bahnmann MSc | Marie Curie PhD Fellow
Trainbiodiverse ITN | Soil Microbial Ecology
http://www.trainbiodiverse.com/people/barbara-bahnmann
barbara.bahnmann@ualberta.ca | +420 607 398 692

@joey711
Copy link
Owner

joey711 commented Feb 2, 2015

Hi Barbara,

Sorry you are having this issue. This sounds like a completely unrelated issue to the now-closed issue to which this comment has been added.

Please re-post as a new issue. Copy and paste should be fine.

Also, please note that your issue post is incomplete, as you have not indicated which version of phyloseq you were using before and after, and you did not post the code that you used to create the "new" phyloseq object from the original files. This latter case is more important, because it should have fixed your issue right away. Please indicate what the file formats are as well when you provide that code.

Thanks in advance for your updated comment, and your continued interest in phyloseq!

joey

@JamieMcDevittIrwin
Copy link

JamieMcDevittIrwin commented Jan 14, 2017

Hello,

After going through this post and trying to save my variance stabilized counts in my original otu table, I am having problems with the conversion from a dseq object (after variance stabilizing) back to a phyloseq object. I am using phyloseq version 1.16.2.

Below is my code:

#load the data with low read samples pruned out
load("data/secondmito/lowreads_pruned/otutable_top10removed_coral_866_may.Rdata") #otu_tablef_no10_coralf_sm_866f_mayf
ls()

print(otu_tablef_no10_coralf_sm_866f_mayf) # 39 samples, 6037 taxa
class(otu_tablef_no10_coralf_sm_866f_mayf) # "phyloseq"

#Try with updated gm_means, because my data has such a high prevalence of sparsely sampled OTUS
test = phyloseq_to_deseq2(otu_tablef_no10_coralf_sm_866f_mayf, ~ 1)
#calculate geometric means prior to estimate size factors (zero tolerant version of geomeans)
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(test), 1, gm_mean)
diagdds = estimateSizeFactors(test, geoMeans = geoMeans)
diagdds = estimateDispersions(diagdds)
diagdds.vst <- varianceStabilizingTransformation(diagdds)
dim(diagdds.vst) # 6037, 39 (same as original OTU table)
class(diagdds.vst)

#save the untransformed data as a separate variable so you can go back to it
otu_tablef_no10_coralf_sm_866f_mayf0 <- otu_tablef_no10_coralf_sm_866f_mayf

#replace the counts with variance stabilized counts
otu_table(otu_tablef_no10_coralf_sm_866f_mayf0) <- otu_table(diagdds.vst, taxa_are_rows=TRUE)

Error Outputed:

otu_table(otu_tablef_no10_coralf_sm_866f_mayf0) <- otu_table(diagdds.vst, t
<_coralf_sm_866f_mayf0) <- otu_table(diagdds.vst, ta xa_are_rows=TRUE)
Error in access(object, "otu_table", errorIfNULL) :
otu_table slot is empty.

Thank you very much for your help! All of these issues and vignettes on phyloseq are incredibly helpful.

Jamie

@LCzaplickiD
Copy link

I'm having the same problem:
#DESeq2 conversion
fungi.DESeq=phyloseq_to_deseq2(fprunedpsF1spooled, ~ Rep)
#DESeq2 Variance Stabilization
fungi.DESeq = estimateSizeFactors(fungi.DESeq, geoMeans = geoMeansFs)
fungi.DESeq = estimateDispersions(fungi.DESeq, fitType="local")
fungi.vst<-varianceStabilizingTransformation(fungi.DESeq,blind=FALSE,fitType = "local")
dim(fungi.vst)
#Save the untransformed data as separate variable
vst.fungi.phyloseq = fprunedpsF1spooled
#Add the variance stabilized otu numbers into the dataset
otu_table(vst.fungi.phyloseq)<-otu_table(fungi.vst, taxa_are_rows = TRUE)

error:
otu_table(vst.fungi.phyloseq)<-otu_table(fungi.vst, taxa_are_rows = TRUE)
Error in access(object, "otu_table", errorIfNULL) :
otu_table slot is empty.

Thank you again for all these helpful threads and the vignettes!

@tierrasoledad
Copy link

I had the same problem when I tried this code:
"error:
otu_table(vst.fungi.phyloseq)<-otu_table(fungi.vst, taxa_are_rows = TRUE)
Error in access(object, "otu_table", errorIfNULL) :
otu_table slot is empty."

So I tried it this way...
So I still ran the variance Stabilizing Tranformation on fungi.DESeq object.
fungi.vst<-varianceStabilizingTransformation(fungi.DESeq,blind=FALSE,fitType = "local")

But when I try to get the matrix use the getVarianceStabilizedData func and use that as my otu table it seems to work.
fungi.vst.1<- getVarianceStabilizedData(fungi.DESeq)
otu_table(vst.fungi.phyloseq)<-otu_table(fungi.vst.1, taxa_are_rows = TRUE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants