Cutting with HDBSCAN, get membership probability matrix for each observation? #31

Closed
helske opened this issue May 17, 2019 · 2 comments

Comments

@helske

helske commented May 17, 2019

I noticed that it is possible to extract an arbitrary number of clusters with hdbscan by using the cutree function on the hc component of the hdbscan output. But is there a simple way to get the membership probabilities for each element given a fixed number of clusters? I.e., a matrix which gives the cluster probabilities for each element and cluster (such as `fanny` in `cluster`)?

@peekxc
Collaborator

peekxc commented May 17, 2019

The clusters given by any 'flat' cut through the HDBSCAN hierarchy correspond to a DBSCAN* clustering with a non-normalized KNN density estimate given by 1/core_dist(x) for each point.

The so-called 'membership probabilities' are effectively just a ratio: the difference between a cluster's maximum core distance and the point's core distance, divided by that maximum core distance.

So to get these values, all one needs is the core distance.
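Written out as a formula, this is roughly the following. The symbols are notation introduced here for illustration, not taken from the package: C is the cluster that point x is assigned to, and d_core is the core distance.

$$
p_C(x) = \frac{\max_{y \in C} d_{\mathrm{core}}(y) - d_{\mathrm{core}}(x)}{\max_{y \in C} d_{\mathrm{core}}(y)}
$$

For example, if the largest core distance in a cluster is 4 and a point in it has core distance 1, its value is (4 - 1) / 4 = 0.75. The point with the largest core distance in each cluster gets 0.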

library(dbscan)
data("DS3", package = "dbscan")
minPts <- 25L
hcl <- hdbscan(DS3, minPts, gen_hdbscan_tree = TRUE)
# plot(DS3, col = hcl$cluster + 1L)

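## Core distance is needed to calculate membership probabilities.
## Note: in newer dbscan releases kNNdist() returns a plain vector by default,
## in which case drop the `[, minPts - 1]` subsetting.
core_dist <- kNNdist(DS3, k = minPts - 1)[, minPts - 1]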

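## Substitute k / h for whatever you want
cl <- cutree(hcl$hc, k = 5L)
## cutree() labels start at 1; dropping 0 only matters if 0 is used to mark noise
cluster_ids <- setdiff(unique(cl), 0L)
prob <- rep(0, length(cl))
for (cid in cluster_ids) {
  in_cluster <- cl == cid
  max_f <- max(core_dist[in_cluster])
  prob[in_cluster] <- (max_f - core_dist[in_cluster]) / max_f
}
## prob holds per-point values in [0, 1]; dividing by sum(prob) rescales them to sum to 1 over all points
membership_prob <- prob/sum(prob) ## membership probabilities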

Created on 2019-05-17 by the reprex package (v0.3.0)
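To get the matrix form asked about, the loop above can be wrapped in a small helper. This is only a sketch: `membership_matrix` is a hypothetical name, not part of the dbscan package, and it assumes a label of 0 marks noise. Because the values come from a hard assignment, each row has at most one non-zero entry (the column of the point's own cluster), so this is not a soft clustering in the sense of `fanny`.

```r
## Hypothetical helper (not part of dbscan): builds an n x k matrix of
## membership values from per-point core distances and flat cluster labels.
membership_matrix <- function(core_dist, cl) {
  stopifnot(length(core_dist) == length(cl))
  cluster_ids <- sort(unique(cl[cl != 0L]))  # assumption: 0 marks noise
  m <- matrix(0, nrow = length(cl), ncol = length(cluster_ids),
              dimnames = list(NULL, as.character(cluster_ids)))
  for (j in seq_along(cluster_ids)) {
    idx <- which(cl == cluster_ids[j])
    max_f <- max(core_dist[idx])
    ## Skip clusters whose maximum core distance is 0 to avoid 0/0
    if (max_f > 0) m[idx, j] <- (max_f - core_dist[idx]) / max_f
  }
  m
}

## Toy input with made-up core distances: five points in two clusters
toy_core_dist <- c(1, 2, 4, 3, 6)
toy_cl <- c(1L, 1L, 1L, 2L, 2L)
toy_m <- membership_matrix(toy_core_dist, toy_cl)
```

With the vectors from the reprex above, the call would be `membership_matrix(core_dist, cl)`.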

Note that the KNN density estimate is not smooth, and the derived membership 'probabilities' are very coarse.

The values given by default for HDBSCAN are the probabilities for the salient clusters only, which are created by several non-global cuts to the hierarchy. The membership probabilities were more-or-less meant to describe the degree to which each point contributes to the stability of its corresponding cluster.

@helske
Author

helske commented May 17, 2019

Thanks a lot, these should be sufficient for my purposes!
