Robustness to Null Dataset Problem #20
Many papers have been written about the detection of overclustering, so it's a pretty well-studied problem. I daresay that most of these papers miss the mark, though, because they don't consider the real scientific question. tl;dr: a homogeneous cell type can still have interesting subclusters.

Consider a cell type whose members are MVN-distributed in the expression space (or PC space, or whatever space you care to think of). I think we could both agree that this could be described as "homogeneous" - there aren't any clear subclusters, just a smooth gradient of density in any direction of travel. However, I would argue that the structure inside this cluster could very well be biologically interesting if, say, an axis of significant variation was associated with some relevant pathway. In such cases, it would make sense to at least try to subcluster and see what you find. If you stop at "oh, it's homogeneous", you would never be able to interrogate the heterogeneity within each cell type.

(One could say that it would be better to use trajectory inference for these continuous changes. Fair enough, but it's sometimes hard to figure out when to switch from clusters to trajectories if you don't already know the variation is continuous. So you usually need at least one subclustering step before you can decide that it's continuous enough to switch.)

A long time ago, I decided to use some metrics (WCSS, the Rand index, modularity ratios) to see if I could automatically determine the appropriate number of clusters. I don't remember the exact results, but I do remember being disappointed: I ended up with too-broad clusters, as those were the only things that the various methods were confident in. Moreover, each of the methods had its own tunable parameters and thresholds, so in the end I was just trading one parameter (the number of clusters) for some other parameters without any clear benefit in interpretation.
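To make the WCSS point concrete, here is a toy numeric sketch (my own illustration, not code from bluster; the simulated data and all parameters are made up): even for a single MVN blob with no true subclusters, the k-means distortion keeps shrinking as k grows, so the metric alone never tells you where to stop without introducing yet another threshold.

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)
# One "homogeneous" MVN cell type: 500 cells x 10 dims, with extra
# variance on axis 0 standing in for a hypothetical pathway axis.
cov = np.diag([5.0] + [1.0] * 9)
cells = rng.multivariate_normal(np.zeros(10), cov, size=500)

# WCSS-style distortion (mean within-cluster distance) for k = 1..5.
distortions = [kmeans(cells, k, seed=1)[1] for k in range(1, 6)]

# The distortion shrinks with k even though there are no true
# subclusters, so the curve alone never says "stop here".
print([round(float(d), 3) for d in distortions])
```

Any rule for picking k off this curve (elbow detection, a ratio cutoff, etc.) just reintroduces a tunable threshold, which is the trade-off described above.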
I think the fundamental issue is that there isn't a clean mathematical way of expressing that some level of heterogeneity is biologically uninteresting, in order to stop the subclustering. I might stop if my subclusters are all related to the cell cycle, or metabolic activity, or some other boring thing, but others might get very excited by those same partitions, so who am I to say whether they should use those subclusters?

A true "hard limit" of overclustering is when you start dropping below technical variation (e.g., the Poisson noise from sequencing), at which point you can confidently say that you've jumped the shark. But it takes a lot - like, a lot - of overclustering to get to that point, so it's mostly a useless threshold. In practice, people will always overcluster to see if there's anything interesting as they keep digging. Which is fine; it's all exploratory anyway, and no one's really making quantitative claims here.

Nonetheless, if you want to implement this method, I'd suggest making your own package; it seems pretty involved and I don't want to be on the hook to maintain it.
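That Poisson "hard limit" can be sketched as a crude dispersion check (a hypothetical helper written for illustration, not a bluster function; the tolerance and simulated data are made up): for raw counts, pure Poisson sequencing noise implies per-gene variance ≈ mean, so a (sub)cluster whose variance/mean ratios sit near 1 has nothing left to split beyond technical noise.

```python
import numpy as np

def below_technical_noise(counts, tol=1.1):
    """Hypothetical check: under Poisson sequencing noise, per-gene
    variance ~ mean, so a median variance/mean ratio near 1 means any
    further split is chasing technical noise. `tol` adds some slack."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    keep = mean > 0
    ratio = var[keep] / mean[keep]
    return float(np.median(ratio)) <= tol

rng = np.random.default_rng(0)
# 200 cells x 100 genes of pure Poisson counts (technical noise only).
pure_noise = rng.poisson(lam=5.0, size=(200, 100))
# Gamma-mixed rates add overdispersion, standing in for real biology.
overdisp = rng.poisson(rng.gamma(2.0, 2.5, size=(200, 100)))

print(below_technical_noise(pure_noise))  # True: nothing beyond noise
print(below_technical_noise(overdisp))    # False: heterogeneity remains
```

Consistent with the comment above, most real clusters look like the second case for a long time, which is why this threshold rarely fires in practice.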
Interesting perspective.
Is there any summary statistic in bluster that is known to be robust to the null-dataset problem?
Source: Anti-correlated Feature Selection Prevents False Discovery of Subpopulations in scRNA-seq, Nature Communications
Perhaps it would be nice to have a SingleCellExperiment-friendly version of this algorithm somewhere in Bioconductor.