Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

compute gap statistic on (user-specified) ordination axes #99

joey711 opened this Issue Apr 17, 2012 · 7 comments


None yet
1 participant

joey711 commented Apr 17, 2012

Should optionally create a plot summarizing the results.

This will be borrowed

An implementation of the gap statistic algorithm from Tibshirani, Walther, and Hastie's "Estimating the number of clusters in a data set via the gap statistic". A description of the algorithm can be found here

I wrote to the author of the version on GitHub that I found (and adapted for a private example), in the form of a license issue on his (Edwin Chen's) gap statistic development page.

As soon as this is clear, it should be possible to write an adaptation for ordination results, and formally assessing numbers of clusters.

@joey711 joey711 was assigned Apr 17, 2012


joey711 commented Jun 6, 2012

This was cleared by the author, but, there are several alternative implementations already available in common R packages, most notably the clusGap function in the cluster package. cluster has many reverse dependencies, and has Martin Maechler as its lead author, so is probably reliable.

Need to test that it works well for this purpose, can be adapted to make the nice ggplot2 output like with Edwin Chen's implementation.

If so, make a nice wrapper for it to use with phyloseq data.


joey711 commented Jun 6, 2012

(footnote: Might have also considered lga::gap, but the lga package is officially listed as ORPHANED on CRAN at the moment. Not a good idea to include it as a dependency...


joey711 commented Jun 7, 2012

This is straight-forward enough it makes sense just to show some example code. Maybe even adapt the ggplot2 plotting used by Edwin Chen. Post the example on the wiki and in a vignette, and consider it done. Unclear why a formal wrapper is necessary.


joey711 commented Jul 13, 2012

Here is an example:

I will try to work this into a fancier Rmd/html example and post the link. The following code gives you plots that are useful and shows a relatively straightforward result when two clusters is a pretty good answer.

# Load data
# ordination
ent.dca <- ordinate(enterotype)
# Plot
plot_ordination(enterotype, ent.dca, color="Enterotype")
plot_ordination(enterotype, ent.dca, color="SeqTech")

# Gap Statistic: How many clusters are there?
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))
x <- scores(ent.dca)
# gskmn <- clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn <- clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 500)
plot(gskmn, main = "Gap statistic for the 'Enterotypes' data")
mtext("k = 2 is best ... but  k = 3  pretty close")

The clusGap function does a pretty good job implementing this. Unclear why a formal wrapper is needed, especially considering that you already support the scores function, extended from vegan-package, to expose the coordinates of an ordination.


joey711 commented Jul 16, 2012

Actually, the scores extensions are hidden as internal functions to avoid conflicts with vegan, now and in the future. You could name a different function that wraps scores, but is a convoluted answer. Alternatively, could include a simple wrapper of the above code (already written). Although, there are many other reasons a user might want the coordinates. Maybe at least show an example in which plot_ordination returns the plot object, say p1, and show how to get the coordinates (plus covariates) from p1$data.


joey711 commented May 28, 2013

An updated version of this gap statistic example is shown on the gap statistic tutorial page and in the more elaborate phyloseq demo.

Will add another dependency to phyloseq to include it, but as stated earlier, the "cluster" package is a good one.


joey711 commented May 30, 2013

Closed by pull 217

@joey711 joey711 closed this May 30, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment