When to set geoMeans in estimateSizeFactors #445
Hi @btemperton, this is a good question, and you did some careful work to arrive at this point. I forget whether there are special tweaks to the definition of the geometric mean in the DESeq2 implementation. My quick answer is: notice that you are comparing standard deviations between different transformations, and they don't have to agree in their scale to be useful when they are independently applied. If you were to plot one vector of scaling factors against the other, they should be highly correlated. My biggest concern in those plots is how different the two rlog plots appear to be; you might want to double-check that. Most importantly, your inferences should be very similar between the two choices of geometric mean. I will close this for now, but please post back if you find that the choice did make a meaningful inferential difference, so that we can re-open and explore this a bit further. Thanks for the interesting feedback and for helping to support phyloseq! joey |
Hi Joey,
The default geoMeans appears to significantly increase the spread of the cruise 2 data, while clustering the cruise 1 data more tightly. Unsurprisingly, a Procrustes test between the two ordinations was significant. Any advice on how to proceed would be most welcome! |
Maybe @mikelove can help us out on this thread? It ultimately comes down to a question of how DESeq2 uses the geometric means, both in feature selection and in the variance-stabilizing (or rlog) transformations. Any help would be greatly appreciated. This also brings up a larger question about quality control that comes up often: in our microbiome count data we tend to encounter many more zeroes than in human gene expression, and this thread is one of the side effects. Do you have any suggestions for things to check when you know ahead of time to expect a somewhat zero-inflated dataset? I've seen some preliminary exploration suggesting that the NB model implemented in DESeq2 still gives a pretty reasonable fit and overall performance, but any cautionary suggestions might help users avoid pitfalls on especially zero-laden data. Thanks @mikelove! |
The geometric means are used as a pseudoreference for calculating sample-wise ratios, and the median of these ratios gives the size factor. The size factors are then used throughout the model: in fitting the estimated means through the GLM, and hence in estimating dispersion. I'm still not sure what the best option for size factor estimation is when every row contains a zero. On the other hand, very sparse count datasets, with large counts for single samples per row and the rest at 0, don't fit the negative binomial distribution well. Here, the VST or simply a shifted log, log(count+k), might be a safer choice than the rlog. One way I test for sparsity is to plot the row sum of counts against the proportion of the count that is in a single sample:
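As a rough illustration of the median-of-ratios scheme described above, here is a Python sketch of the arithmetic (DESeq2's actual implementation is in R, and rows containing a zero drop out of the calculation because their geometric mean is zero):

```python
import numpy as np

def size_factors_median_of_ratios(counts):
    """Sketch of DESeq-style size factors for a (features x samples) array.
    Each row's geometric mean serves as a pseudoreference; a sample's size
    factor is the median, over usable rows, of its ratio to that reference.
    Rows with any zero count have geometric mean 0 and are excluded."""
    log_counts = np.log(counts.astype(float))       # -inf wherever count == 0
    log_geo_means = log_counts.mean(axis=1)         # row-wise log geometric mean
    usable = np.isfinite(log_geo_means)             # rows with no zero counts
    log_ratios = log_counts[usable] - log_geo_means[usable, None]
    return np.exp(np.median(log_ratios, axis=0))
```

For example, if one sample's counts are exactly double another's, its size factor comes out sqrt(2) times larger, reflecting the doubled library size.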
|
Thanks, @mikelove! That's very helpful to hear. Let me know if you want some real-world examples of microbiome data. Indeed, occasionally a microbiome dataset has at least one zero in every row, which is why I updated the phyloseq documentation to include this modification to the geometric means estimate. Also very helpful to hear suggestions for sparse-ish data. Any thoughts/plans for a zero-inflated NB? |
Thanks for the comments! In our dataset, 219,094 out of 219,400 taxa (99.9%) have a count of 0 in at least one sample, and 181,484 taxa appear in only one sample, so it's fair to say that ours is an example of very sparse data (I've included the plot suggested by @mikelove). I am slightly confused about the recommendation to use the VST on very sparse data: doesn't the VST take the estimated size factors into account? That in turn requires a calculation of geometric means (possibly with @joey711's tweak). Thus, if geometric means are not suitable for such sparse data, presumably the resultant VST is also suspect? It would also rule out using log(counts(dds, normalized=TRUE)+1), as this normalization also uses the calculated size factors. I guess the question I'm wrestling with is: is it better to apply a normalization with known biases (i.e. rarefying data, using proportions, etc.) or to apply transformations whose effects on sparse data are unknown? |
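The sparsity diagnostic suggested earlier (the proportion of each row's total count that sits in a single sample, plotted against the row sum) can be computed along these lines; this is a Python sketch of the metric, not code from either package:

```python
import numpy as np

def single_sample_proportion(counts):
    """For each feature (row) of a (features x samples) count array, return
    the proportion of its total count contributed by the single sample with
    the largest count. Values near 1 flag rows dominated by one sample."""
    row_sums = counts.sum(axis=1)
    # Guard against all-zero rows to avoid division by zero.
    return counts.max(axis=1) / np.where(row_sums > 0, row_sums, 1)
```

Scattering this value against the row sums (e.g. on a log scale) reproduces the kind of plot described above: a dataset is worryingly sparse when high-count rows still have proportions near 1.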
Hi Ben, I think there are sparse datasets, in which row sums are often concentrated in one or a few samples, and then there are datasets in which every row has a zero, which pose a problem for the standard size factor estimator. I think the technique of using non-zero counts to calculate size factors is a good workaround in the second case. From the looks of this plot, there are still many rows where the max count is a small proportion of the row sum, so I think a good size factor could be estimated with this custom geometric means approach. Another note: data transformed with varianceStabilizingTransformation() does correct for size factors; it just doesn't take the size factors directly into account when the transformation function itself is being calculated. The VST is calculated for counts in the range s_j ≈ 1 and then applied to K_ij/s_j, so there is correction for size factor through the division. |
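The point about correction through division can be seen most simply with the shifted log mentioned earlier; this is a Python sketch of the general idea (counts divided by size factors before the transform), not DESeq2's actual VST, whose functional form is fit from the data:

```python
import numpy as np

def shifted_log_transform(counts, size_factors, k=1.0):
    """log2(K_ij / s_j + k): divide each sample's counts by its size factor,
    then apply a shifted log. The division by s_j is what corrects for
    differing library sizes, exactly as described for the VST above."""
    return np.log2(counts / size_factors[None, :] + k)
```

Two samples that differ only by library depth end up with identical transformed values, which is the whole purpose of the size-factor correction.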
Susan Holmes |
Thanks Susan, I've used a similar approach in the past with 454 data. My issues with such an approach are two-fold:
I always felt that the whole point of using DESeq on microbiome data was to step away from arbitrary cutoffs. In practice, I doubt whether either method would have a particularly large effect on the ordination of samples (provided geoMeans is appropriately calculated!); it's more a question of which is more statistically robust. I would be interested to know whether you perform such pruning before or after computing alpha and beta diversities? |
There are lots of details to comment on in this thread now. I just wanted to add that I share @spholmes's suspicion that this data has not been optimally denoised, and you probably do not have 200,000 species... unless this is some large pan-geographic survey. I would clarify that the "point" of using a method like DESeq is to do more with the data that you have, not to absolve you of the responsibility of making choices about your analysis, nor especially of de-noising the raw sequence data, which is a separate and necessary step. Regarding alpha diversity, I just recently closed an issue reminding a user that "garbage in, garbage out". Many alpha-diversity estimation methods include singletons and doubletons in their models, so you wouldn't want to bluntly remove all singletons. However, if your data has been poorly de-noised, the singletons and doubletons will also contain much more noise (sequences that are the result of PCR/sequencing errors rather than true genotypes). This sounds like a dilemma, and it is at the heart of your Point 2 above. If this appears to be the case and you're not sure how you would better de-noise your data, then you should use estimates of alpha diversity that are not so sensitive to these lowest-count features. |
OK, consider me suitably chastised :). I hadn't forgotten the need to denoise data beyond the standard protocols of mothur and QIIME. I was, however (mistakenly, it seems), under the impression that normalization via DESeq would aid identification of suitable cutoffs; I thought this was being done at the stage where negative values are set to 0 post-normalization, prior to ordination. I now understand that standard denoising needs to be run before pushing data through the phyloseq/DESeq pipelines. |
@btemperton you're not alone, and I didn't mean that to be chastising. The fact is that the "standard protocols" for denoising are insufficient if you want to use a method that is sensitive to singletons and doubletons in the data. Similarly (and you're far from alone here, too), it's important to keep the concept of denoising appropriately separate from the concept of standardization. They have been conflated mostly by purveyors of methods that use rarefying, wrongly insisting that rarefying somehow addresses both problems and that the matter is settled. Unfortunately it is a very inefficient, noise-creating method that addresses neither well. DESeq2 and related solutions help you deal with the standardization (differing library sizes) problem and make efficient inferences from your data. The denoising problem is best addressed at the sequence-processing level; see benchmarking comparisons of the state of the art. Sometimes post-denoising filtering is still prudent, especially removing off-target amplicons, obvious contaminants, and detectable chimeras. This is where replicates in your experiment can help a lot, because many types of errors should be poorly replicated or absent between replicates. Some clear, documented, justifiable filters based on repeated observations can help bridge this "gap" in the denoising methods space, and hopefully get your "number of OTUs" down to something more believable, and ultimately more useful for biological interpretation. Thanks for your feedback! I think this is a useful thread for many other users; you are far from the only one to have these sorts of questions, and there are lots of misconceptions floating around. Also, this issue/decision regarding zeroes and rlog is something we should try to address better, perhaps in a new feature-request issue on the tracker. Something for us to improve. |
Hi @joey711 @spholmes @btemperton. I know this post is almost 5 years old now, but I am commenting here because I found this thread very informative and relevant to what I am dealing with. I have been following workflows and reading articles on handling such zero-inflated data, and I have some specific questions on which I would love to hear your suggestions. Some prior information about the data: I had an ASV table with almost 89% sparsity. I used the DADA2 workflow and did some further filtering; the phyloseq object I have now has 68% sparsity (after removing singletons and low-read ASVs). The library size varies a lot, but I haven't rarefied the data.
|
Hi there,
I am currently dealing with a dataset of 16S OTUs from sediment of two Antarctic cruises. I notice that in some of the phyloseq tutorials, geometric means are pre-computed with a custom gm_mean function and passed as the geoMeans argument to estimateSizeFactors. Other times, a standard call to estimateSizeFactors is used. According to the help for estimateSizeFactors, geoMeans is used to provide a frozen, pre-calculated set of geometric means; if it is not provided, the function calculates geometric means from the data. I ran estimateSizeFactors with and without specifying geoMeans, and it makes a major difference to the results. I'm not quite sure why they give different values: from what I can see, gm_mean is just a standard calculation of geometric means. However, it has a major effect on normalization by rlog, as can be seen in the two attached meanSdPlot figures. My question is two-fold:
Many thanks,
Ben
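For readers wondering why a custom geometric mean behaves differently from the standard one on zero-containing data, here is a hypothetical re-implementation in Python (the names gm_mean_modified and gm_mean_standard are illustrative; the actual phyloseq snippet is R, and the exact definition used in the tutorials may differ):

```python
import numpy as np

def gm_mean_modified(x):
    """Zero-tolerant geometric mean in the spirit of the tweak discussed in
    this thread: take logs only over the positive entries, but divide by the
    length of the full vector, so zeros pull the result toward (but not to)
    zero."""
    x = np.asarray(x, dtype=float)
    positive = x[x > 0]
    if positive.size == 0:
        return 0.0
    return float(np.exp(np.log(positive).sum() / len(x)))

def gm_mean_standard(x):
    """Plain geometric mean: collapses to zero whenever any entry is zero."""
    x = np.asarray(x, dtype=float)
    return float(np.exp(np.log(x).mean())) if np.all(x > 0) else 0.0
```

The practical consequence: on a row like (0, 1, 100) the standard geometric mean is 0 (making that row useless as a size-factor reference), while the modified version still returns a positive value, which is why the two calls to estimateSizeFactors can differ so much on zero-heavy data.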