predict for HDBSCAN #32

Closed
moredatapls opened this issue May 22, 2019 · 9 comments

@moredatapls

For a trained HDBSCAN object, I would like to predict the cluster for new data points, similar to what is described here. I see that such functionality exists for DBSCAN in the function predict.dbscan_fast(), but it is missing for hdbscan.

Would it be possible to implement a predict.hdbscan() function similar to the one for dbscan_fast? Is there any technical reason why this function doesn't exist? Otherwise, I'd be happy to try to create a PR for that.

@peekxc
Collaborator

peekxc commented May 22, 2019

Predicting which clusters new points belong to can be done simply with the cluster membership probabilities, for either the default clustering returned or the clusters returned by cutree-ing the hierarchy (see this issue).

One small technical issue is that, since both DBSCAN and HDBSCAN are unsupervised clustering frameworks, the predicted clusters won't necessarily match the result of running DBSCAN/HDBSCAN on the original data set combined with the new data, i.e. cluster(X) + predict(new X) != cluster(X + new X). But if people are fine with this for DBSCAN, then I don't see why this functionality shouldn't be added to HDBSCAN.
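
A minimal sketch of the prediction idea in base R, assuming a simple nearest-neighbor rule rather than the package's actual implementation: each new point takes the cluster label of its closest training point, or 0 (noise) when that point is farther away than a threshold. The function name `predict_nearest`, its `eps` parameter, and the toy data are hypothetical and used only for illustration.

```r
# Hypothetical helper (not part of the dbscan package): label each new point
# with the cluster of its nearest training point; use 0 (noise) when the
# nearest training point is farther away than `eps`.
predict_nearest <- function(train, labels, newdata, eps = Inf) {
  train <- as.matrix(train)
  newdata <- as.matrix(newdata)
  vapply(seq_len(nrow(newdata)), function(i) {
    # Euclidean distance from new point i to every training point
    d <- sqrt(rowSums(sweep(train, 2, newdata[i, ])^2))
    nn <- which.min(d)
    if (d[nn] <= eps) labels[nn] else 0L
  }, integer(1))
}

# Toy data: two well-separated groups with known cluster labels
train <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
labels <- c(1L, 1L, 2L, 2L)
newdata <- rbind(c(0.2, 0.5), c(9.8, 10.4), c(50, 50))

pred <- predict_nearest(train, labels, newdata, eps = 2)
print(pred)
```

Because the labels are copied from the existing clustering, adding the new points never changes the clusters themselves, which is exactly why the result can differ from re-clustering the combined data.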

@mhahsler
Owner

Predicting cluster membership on new data is a useful thing and should be added.

@moredatapls
Author

Sounds good, I will try to create a PR.

> Predicting which clusters new points belong be done simply w/ the cluster membership probabilities for either the default clustering returned or for the clusters returned by cutree-ing the hierarchy (see this issue).

@peekxc could you clarify what you mean by that? I'm not entirely sure how to implement your suggestions. What do you mean by the "default clustering"?

@peekxc
Collaborator

peekxc commented May 23, 2019

@moredatapls What I mean is that HDBSCAN is not a singular clustering algorithm per se. If you run hdbscan, it creates a hierarchy and then optimizes a mass-sensitive criterion to generate a set of local 'cuts' in the hierarchy. The clusters resulting from these cuts are what I refer to as the 'default' clustering.

But HDBSCAN isn't limited to just those local cuts; you can also use it as you would a more traditional cluster hierarchy, e.g.:

library(dbscan)
data("DS3")
res <- hdbscan(DS3, minPts = 50)
# cut the HDBSCAN hierarchy (an hclust object) into 8 clusters
cutree(res$hc, k = 8)

For the prediction though, I think the default clustering is fine.

@mhahsler
Owner

I think the default clustering is fine. I have now extracted the predict functions into their own file, predict.R. Please put the code for HDBSCAN there.

@mhahsler
Owner

@peekxc: Please review the code.

@jwijffels

+1 for a predict.hdbscan, it is something we need if we want to implement https://github.com/michalovadek/top2vecr and put a package on CRAN for that.

@mhahsler
Owner

hdbscan now has a predict function.
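
For readers who want to try it, here is a hedged usage sketch. It assumes an installed version of the dbscan package that includes the predict method for hdbscan objects, and it assumes the call takes the fitted object, `newdata`, and the original training `data`, as the package's other predict methods do; check `?hdbscan` for the exact signature in your version. The train/new split and the seed are arbitrary choices for illustration.

```r
library(dbscan)

data("DS3")

# Hold out 100 random points to act as "new" data (arbitrary split)
set.seed(1)
idx <- sample(nrow(DS3), 100)
train <- DS3[-idx, ]
newdata <- DS3[idx, ]

# Fit HDBSCAN on the training points only
res <- hdbscan(train, minPts = 50)

# Assumed signature: predict(object, newdata, data); cluster 0 denotes noise
pred <- predict(res, newdata = newdata, data = train)
table(pred)
```

As noted earlier in the thread, these labels need not match what hdbscan would return if it were rerun on all of DS3.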

@jwijffels

thanks!
