
When to set geoMeans in estimateSizeFactors #445

Closed
btemperton opened this issue Mar 16, 2015 · 13 comments
@btemperton

Hi there,
I am currently dealing with a dataset of 16S OTUs from sediment of two Antarctic cruises. I notice that in some of the phyloseq tutorials, the following is used:

dds1 = phyloseq_to_deseq2(phyloseq_object, ~some_factor)
gm_mean = function(x, na.rm=TRUE){
    exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(dds1), 1, gm_mean)
dds1 = estimateSizeFactors(dds1, geoMeans=geoMeans)

Other times, a standard call to estimateSizeFactors is used:

dds2 = phyloseq_to_deseq2(phyloseq_object, ~some_factor)
dds2 = estimateSizeFactors(dds2)

According to the help associated with estimateSizeFactors, geoMeans is used to provide a frozen, pre-calculated set of geometric means. If this is not provided, then the function calculates geometric means from the data. I ran estimateSizeFactors with and without specifying geoMeans and it makes a major difference to the results:

> head(dds1$sizeFactor)
ANT_C1_1 ANT_C1_10 ANT_C1_11 ANT_C1_12 ANT_C1_13  ANT_C1_2 
1.837425  1.943064  1.943064  1.037106  1.943064  1.856124 
> head(dds2$sizeFactor)
 ANT_C1_1 ANT_C1_10 ANT_C1_11 ANT_C1_12 ANT_C1_13  ANT_C1_2 
3.6321766 4.1681911 2.0902822 0.6115556 2.3290326 2.5426859

I'm not quite sure why they are giving different values. From what I can see, the method gm_mean is just a standard calculation of geometric means. However, it has a major effect on normalization by rlog, as can be seen in the two attached meanSdPlot figures, generated with the following:

rld2<-rlogTransformation(dds2, blind=FALSE)
rlogMat2<-assay(rld2)
vst2<-varianceStabilizingTransformation(dds2, blind=FALSE)
vstMat2<-assay(vst2)
library('vsn')
op<-par(mfrow=c(1,3))
meanSdPlot(log2(counts(dds2, normalized=TRUE)+1), main='shifted log')
meanSdPlot(rlogMat2, main='rlog transformed')
meanSdPlot(vstMat2, main='VST transformed')
par(op)

My question is two-fold:

  1. Why does pre-calculating the geometric mean give different values?
  2. When is it appropriate to pre-calculate geometric means as in the first example?

Many thanks,

Ben

[Figures: meanSdPlots with geoMeans set (not blind) and with automatic geoMeans (not blind)]

@joey711
Owner

joey711 commented Mar 17, 2015

Hi @btemperton

This is a good question, and you did some careful work to arrive at this point. I forget if there are special tweaks to the definition of the geometric mean in the DESeq2 implementation, but the example gm_mean function provided in the tutorials has an important deviation from the formal mathematical definition: zeroes (and NAs) are ignored. The "true" geometric mean is undefined for NA and equal to 0.0 for any vector that contains even one zero. Of course this kind of sensitivity to zeros is a precarious situation for microbiome data, where zeroes are the norm rather than the exception. Hence, I provided an alternative.
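To make the zero-sensitivity concrete, here is a minimal numeric sketch (in Python, purely illustrative; `gm_true` and `gm_tweak` are hypothetical names, with `gm_tweak` mirroring the R function from the tutorials):

```python
import math

def gm_true(xs):
    # Classical geometric mean: zero as soon as any entry is zero.
    return math.prod(xs) ** (1 / len(xs))

def gm_tweak(xs):
    # The tutorial variant: zeros are skipped in the log-sum,
    # but the divisor is still the full vector length.
    s = sum(math.log(x) for x in xs if x > 0)
    return math.exp(s / len(xs))

row = [0, 2, 8]
print(gm_true(row))   # 0.0 -- a single zero collapses the mean
print(gm_tweak(row))  # ~2.52 -- the zero is ignored
```

Because one estimator returns 0 for any row containing a zero while the other does not, the pseudoreference (and hence every downstream size factor) can differ substantially on zero-heavy data.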

My quick answer is:
If DESeq2's size factor estimation appears to run normally without my custom tweak, use that. If you get an error related to DESeq2's geometric-mean calculation, use my tweak.

Notice that you are comparing standard deviations between different transformations, and they don't have to agree in scale to be useful when applied independently. If you were to plot one vector of scaling factors against the other, they should be highly correlated. My biggest concern in those plots is how different the two rlog panels appear to be; you might want to double-check that.

Most importantly, your inferences should be very similar between the two choices of geometric mean:
(1) differential abundance tests should be detecting the same taxa among the very-confident results
(2) sample-clustering structure and other whole-microbiome comparisons that rely on the rlog-transformed data should be very similar. You can do a procrustes test to compare ordinations, or compare PERMANOVA (vegan::adonis) results on your most important experimental variable(s).
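The scaling-factor comparison suggested above can be checked numerically; a sketch in Python (illustrative only; `pearson` is a hypothetical helper, and the vectors are the first six size factors printed earlier in the thread):

```python
import math

def pearson(x, y):
    # Plain Pearson correlation, no external libraries.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Size factors reported above, for the two geoMeans choices:
sf_custom = [1.837425, 1.943064, 1.943064, 1.037106, 1.943064, 1.856124]
sf_default = [3.6321766, 4.1681911, 2.0902822, 0.6115556, 2.3290326, 2.5426859]
print(pearson(sf_custom, sf_default))
```

If the two estimators disagreed only in overall scale, this value would be close to 1; a low value signals a genuinely different normalization.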

I will close for now, but please post back if you find that the choice did have a meaningful inferential difference so that we can re-open and explore this a bit further.

Thanks for the interesting feedback and for helping to support phyloseq!

joey

@joey711 joey711 closed this as completed Mar 17, 2015
@btemperton
Author

Hi Joey,
Thanks for the answer - I have followed this up by checking the correlation between estimated size factors with a custom and default implementation of geoMeans. It would appear in this data that there is little correlation between the two (see attached figure). Indeed, when using rlog-transformed data, it has a major effect on ordination. The figures attached were produced with the following:

dds1 <- phyloseq_to_deseq2(ant_phylo, ~ cruise)
gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(dds1), 1, gm_mean)
dds1 = estimateSizeFactors(dds1, geoMeans=geoMeans)
dds1 = estimateDispersions(dds1)
rld1<-rlogTransformation(dds1, blind=FALSE)
rlogMat1<-assay(rld1)

#now let's do the same for default geoMeans implementation
dds2 <- phyloseq_to_deseq2(ant_phylo, ~ cruise)
dds2 = estimateSizeFactors(dds2)
dds2 = estimateDispersions(dds2)
rld2<-rlogTransformation(dds2, blind=FALSE)
rlogMat2<-assay(rld2)

#Let's plot the correlation of size factors for each method
qplot(dds1$sizeFactor, dds2$sizeFactor, xlab='Size Factors with custom geoMean', 
    ylab='Size Factors with default geoMean')

# Now let's plot an NMDS for each rlog-transformed dataset,
# replacing negatives with 0 and using Bray-Curtis
rlogMat1[rlogMat1<0]<-0
rlogMat2[rlogMat2<0]<-0
rlog.otu.counts.1<-otu_table(rlogMat1, taxa_are_rows=TRUE)
rlog.otu.counts.2<-otu_table(rlogMat2, taxa_are_rows=TRUE)

iDist.1<-phyloseq::distance(rlog.otu.counts.1, method='bray')
iDist.2<-phyloseq::distance(rlog.otu.counts.2, method='bray')

MDS.1<-ordinate(rlog.otu.counts.1, 'NMDS', distance=iDist.1)
MDS.2<-ordinate(rlog.otu.counts.2, 'NMDS', distance=iDist.2)

p1 <- plot_ordination(ant_phylo, MDS.1, color='cruise') + 
    ggtitle('NMDS of rlog data with custom geoMeans')
p2 <- plot_ordination(ant_phylo, MDS.2, color='cruise') + 
    ggtitle('NMDS of rlog data with default geoMeans')
multiplot(p1,p2, cols=2)

Any advice on how to proceed would be most welcome! The default geoMeans appears to significantly increase the spread of the cruise 2 data, while clustering the cruise 1 data more tightly. Unsurprisingly, a procrustes test between the two ordinations was significant.

[Figures: size factor correlation; comparative NMDS ordinations]

@joey711
Owner

joey711 commented Mar 19, 2015

Maybe @mikelove you can help us out on this thread? It ultimately comes down to a question regarding how DESeq2 is using the geometric means, both in feature selection and in the variance stabilizing (or rlog) transformations. Any help would be greatly appreciated.

Also, this brings up a larger question about quality control that keeps coming up. In our microbiome count data we tend to encounter many more zeroes than in human gene expression data, and this thread is one of the side effects. Do you have any suggestions for things to check when you know ahead of time to expect a somewhat zero-inflated dataset? I've seen some preliminary exploration that leads me to believe the NB implemented in DESeq2 still gives a pretty reasonable fit and overall performance, but any cautionary suggestions might help users avoid pitfalls on especially zero-laden data.

Thanks @mikelove !

@mikelove

The geometric means are used as a pseudoreference for calculating sample-wise ratios, and the median of these ratios gives the size factor. The size factors are then used throughout the model: in fitting the estimated means through the GLM, and so in estimating dispersion. I'm still not sure what the best option for size factor estimation is when the data is so sparse that you have at least one zero in every row. The geometric mean of a row is already zero when a single sample has a zero, so if every row has a sample with a zero, an alternate estimator must be used. I don't work regularly on such sparse datasets, so I hesitate to say what's best. We are trying out an iterative, model-based size factor estimator in the devel branch (see 'type' in ?estimateSizeFactors), but this is still development work.
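The median-of-ratios idea described here can be sketched in a few lines (Python, purely illustrative; this is not DESeq2's implementation, and `size_factors` is a hypothetical helper):

```python
import math

def size_factors(counts, geo_means=None):
    # Median-of-ratios sketch. `counts` is a list of rows (features),
    # each row a list of per-sample counts.
    if geo_means is None:
        # Strict geometric mean: zero as soon as the row contains a zero.
        geo_means = [
            math.exp(sum(math.log(c) for c in row) / len(row))
            if all(c > 0 for c in row) else 0.0
            for row in counts
        ]
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        # Ratios to the pseudoreference, skipping rows with zero geomean.
        ratios = sorted(
            counts[i][j] / geo_means[i]
            for i in range(len(counts)) if geo_means[i] > 0
        )
        mid = len(ratios) // 2
        median = (ratios[mid] if len(ratios) % 2 == 1
                  else 0.5 * (ratios[mid - 1] + ratios[mid]))
        factors.append(median)
    return factors

print(size_factors([[10, 20], [5, 10], [100, 200]]))
```

Passing zero-tolerant geometric means via `geo_means=` mirrors the `geoMeans` argument: rows containing zeros then still contribute ratios (including zero ratios), which is one reason the resulting size factors differ from the default estimator on sparse data.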

On the other hand, very sparse count datasets, with large counts in single samples per row and the rest at zero, do not fit the negative binomial distribution well. Here, the VST or simply a shifted log, log(count + k), might be a safer choice than the rlog. One way I test for sparsity is to plot the row sum of counts against the proportion of that sum contributed by a single sample:

rs <- rowSums(counts(dds))
rmx <- apply(counts(dds), 1, max)
plot(rs+1, rmx/rs, log="x")
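The same diagnostic can be computed without plotting; a Python sketch (illustrative only, `sparsity_diagnostic` is a hypothetical name):

```python
def sparsity_diagnostic(counts):
    # For each row: (row sum, fraction of that sum held by the single
    # largest sample) -- the quantities plotted above as rs and rmx/rs.
    out = []
    for row in counts:
        rs = sum(row)
        rmx = max(row)
        out.append((rs, rmx / rs if rs > 0 else 0.0))
    return out

# A row like [100, 0, 0] scores 1.0: all of its count sits in one sample.
print(sparsity_diagnostic([[100, 0, 0], [10, 12, 9]]))
```

Rows whose max fraction is near 1.0 are the "one big count, rest zeros" cases that fit the negative binomial poorly.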

@joey711
Owner

joey711 commented Mar 19, 2015

Thanks, @mikelove! That's very helpful to hear. Let me know if you want some real-world examples of microbiome data. Indeed, occasionally a microbiome dataset has at least one zero in every row, which is why I updated the phyloseq documentation to include this modification to the geometric-mean estimate.

Also very helpful to hear suggestions for sparse-ish data.

Any thoughts/plans for a zero-inflated NB?

@btemperton
Author

Thanks for the comments! In our dataset, 219094 out of 219400 taxa (99.9%) have a count of 0 in at least one sample, and 181484 taxa appear in only 1 sample, so it's fair to say that ours is an example of very sparse data (I've included the plot suggested by @mikelove). I am slightly confused about the recommendation to use the VST on very sparse data: does the VST not take into account the estimated size factors? That in turn requires a calculation of geometric means (possibly with @joey711's tweak). Thus, if geometric means are not suitable for such sparse data, presumably the resulting VST is also suspect? It would also rule out using log(counts(dds, normalized=TRUE)+1), as this normalization also uses the calculated size factors.

If using geometric means is not suitable for very sparse data, the question I'm wrestling with is: is it better to apply a normalization with known biases (i.e. rarefying data, using proportions, etc.), or to apply transformations whose effects on sparse data are unknown?

[Figure: sparsity plot of row sums vs. max-sample proportion]

@mikelove

Hi Ben, I think there are sparse datasets, in which row sums are often concentrated in one or a few samples, and then there are datasets where every row has a zero, which pose a problem for the standard size factor estimator. I think the technique of using non-zero counts to calculate size factors is a good workaround in the second case. From the looks of this plot, there are still many rows where the max count is a small proportion of the total sum, so I think a good size factor could be estimated with this custom geometric-means approach. Another note: data transformed with varianceStabilizingTransformation() does correct for size factors; it just doesn't take the size factors directly into account when the transformation function is being calculated. The VST is calculated for counts in the range s_j ~= 1 and then applied to K_ij/s_j, so there is correction for size factors through division.

@spholmes
Contributor

Ben,
Our standard workflow is to filter out low-prevalence taxa. We usually consider singletons to be potential noise, and keep taxa that appear with at least 2 or 3 reads in at least 2 samples (this is the least conservative choice). You have to honestly ask yourself how many taxa there really are: 219,000 seems unrealistic. Of course some zeros occur, and we do not filter samples with zeros, but we do not consider taxa real unless they appear in a few samples.
Susan


@btemperton
Author

Thanks Susan, I've used a similar approach in the past with 454 data. My issue with such an approach is two-fold:

  1. It is dependent on sequencing effort: if you sequence deeply enough, the probability of errors being replicated goes up. So at what point do you require 3 or 4 or 5 reads per sample, or require a taxon to be in at least 3 or 4 samples? Do you have to change your criteria depending on whether all your samples come from the same site at different times, or whether you are looking at totally different biomes (in which case the probability of the same taxa appearing in multiple samples decreases)?
  2. When rare events happen, they are excluded due to arbitrary cutoffs on what we deem rare versus what we deem sequencing error. These cutoffs would need to fluctuate based on tests against standard communities (and yet we appear to still be using the same cutoffs for Illumina data as we did for 454 data).

I always felt that the whole point of using DESeq on microbiome data was to step away from arbitrary cutoffs. In practice, I doubt either method would have a particularly large effect on ordination of samples (provided the geoMeans are appropriately calculated!); I guess it's more a question of which is more statistically robust.

I would be interested to know if you perform such pruning before or after computing alpha and beta diversities?

@joey711
Owner

joey711 commented Mar 20, 2015

There are lots of details to comment on in this thread now.

I just wanted to add that I share @spholmes's suspicion that this data has not been optimally denoised, and you probably do not have 200,000 species, unless this is some large pan-geographic survey.
Furthermore, and by definition, you are performing these transformations in order to apply cross-sample statistical learning methods, most or all of which are doing very little with a species that shows up in just one or two samples.

I would clarify that the "point" of using a method like DESeq is to do more with the data that you have, not to absolve you of the responsibility of making any choices about your analysis, nor especially de-noising of the raw sequence data, which is a separate and necessary step.

Regarding alpha diversity, I just recently closed an issue reminding a user that "garbage in, garbage out". Many alpha-diversity estimation methods include singletons and doubletons in their models, so you wouldn't want to bluntly remove all singletons. However, if your data has been poorly denoised, the singletons and doubletons will also be composed of much more noise (sequences that are the result of PCR/sequencing errors rather than true genotypes). This sounds like a dilemma, and it is at the heart of your point 2 above. If this appears to be the case and you're not sure how to better denoise your data, then you should use estimates of alpha diversity that are not so sensitive to these lowest-count features.

@btemperton
Author

OK, consider me suitably chastised :). I hadn't forgotten the need to denoise data beyond the standard protocols of mothur and QIIME. I was, however (mistakenly, it seems), under the impression that normalization via DESeq would aid identification of suitable cutoffs; I thought this was happening at the stage where negative post-normalization values are set to 0 prior to ordination. I now understand that standard denoising needs to be run before pushing data through the phyloseq/DESeq pipelines.

@joey711
Owner

joey711 commented Mar 20, 2015

@btemperton you're not alone, and I didn't mean that to be chastising. The fact is that the "standard protocols" for denoising are insufficient if you want to use a method that is sensitive to singletons and doubletons in the data.

Similarly (and you're far from alone here, too), it's important to keep the concept of denoising appropriately separate from the concept of standardization. They have been conflated mostly by purveyors of methods that use rarefying, wrongly insisting that rarefying somehow addresses both problems and that the matter is settled. Unfortunately, rarefying is a very inefficient, noise-creating method that addresses neither problem well. DESeq2 and related solutions help you deal with the standardization (differing library sizes) problem and make efficient inferences from your data. The denoising problem is best addressed at the sequence-processing level, and the best general-purpose options currently available are

  • The dada2 algorithm, if your data works well with it, current support is mainly Illumina, or
  • UPARSE in the usearch package, if you don't have sequencing data that works well with dada2

See benchmarking comparisons of "the state of the art"

Sometimes post-denoising filtering is still prudent, especially removing off-target amplicons, obvious contaminants, and detectable chimeras. This is where replicates in your experiment can help a lot, because many types of errors should be poorly replicated or absent between replicates. A few clear, documented, justifiable filters based on repeated observations can help bridge this "gap" in the denoising methods space, and hopefully get your "number of OTUs" down to something more believable, and ultimately more useful for biological interpretation.

Thanks for your feedback! I think this is a useful thread for many other users. You are far from the only one to have these sorts of questions. There are lots of misconceptions floating around. Also this issue/decision regarding zeroes and rlog is something we should try to better address, and perhaps worthy of a new feature request issue on the tracker. Something for us to improve.

@shreyaskumbhare

shreyaskumbhare commented Jun 10, 2020

Hi @joey711 @spholmes @btemperton. I know this post is almost 5 years old now, but I am commenting here because I found this thread very informative and relevant to what I am dealing with. I have been following workflows and reading articles on dealing with such zero-inflated data, and I have the following specific questions on which I would love to hear your suggestions:

Some prior information about the data: I had an ASV table with almost 89% sparsity. I used the DADA2 workflow, did some further filtering, and the phyloseq object that I have now has 68% sparsity (after removing singletons and low-read ASVs). The library size varies a lot, but I haven't rarefied the data.

  1. Ordination analysis: I have been struggling to choose a suitable transformation prior to ordination. I performed a rank-threshold transformation instead of relative abundance (proportions) as suggested in the workflow, but my question is: can I do this:
    rank-transform the counts --> Bray-Curtis/weighted UniFrac --> PCoA? Is this an appropriate choice?

  2. Also, I have tried other transformations such as CLR and VST using DESeq2, but as discussed in this thread and elsewhere on the forum, dealing with the negative values when using Bray-Curtis/weighted UniFrac with a PCoA is a bit difficult, right? Is replacing the negative values with 0s prior to ordination appropriate?
