predict for HDBSCAN #32

moredatapls · 2019-05-22T13:32:11Z

For a trained HDBSCAN object, I would like to predict the cluster for new data points, similar to what is described here. I see that such functionality exists for DBSCAN in the function predict.dbscan_fast(), but it is missing for hdbscan.

Would it be possible to implement a predict.hdbscan() function similar to the one for dbscan_fast? Is there any technical reason why this function doesn't exist? Otherwise, I'd be happy to try to create a PR for that.


peekxc · 2019-05-22T14:48:19Z

Predicting which clusters new points belong to can be done simply with the cluster membership probabilities, either for the default clustering returned or for the clusters returned by cutree-ing the hierarchy (see this issue).

One small technical issue: both DBSCAN and HDBSCAN are unsupervised clustering frameworks, so the predicted clusters won't necessarily match the result of running DBSCAN/HDBSCAN on the original data set combined with the new data, i.e. cluster(X) + predict(new X) != cluster(X + new X). But if people are fine with this for DBSCAN, then I don't see why this functionality shouldn't be added to HDBSCAN.
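To make the caveat above concrete, here is a minimal sketch that compares the two routes for DBSCAN. The toy data, `eps`, and `minPts` values are made up for illustration, and the call assumes the `predict(object, newdata, data)` form that the dbscan package documents for `dbscan_fast` objects; the two label vectors may or may not agree, depending on the data.

```r
library(dbscan)

set.seed(42)
# Toy training data: two Gaussian blobs (illustrative only)
X <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2)
)
# Three new points: one near each blob, one in between
new_X <- rbind(c(0.1, 0.1), c(2.9, 3.1), c(1.5, 1.5))

fit <- dbscan(X, eps = 0.5, minPts = 5)

# Route A: assign new points to the clusters of the existing fit
pred <- predict(fit, newdata = new_X, data = X)

# Route B: re-cluster old and new data together, then read off
# the labels of the rows that correspond to the new points
refit <- dbscan(rbind(X, new_X), eps = 0.5, minPts = 5)
relabel <- tail(refit$cluster, nrow(new_X))

# Label 0 means noise; cluster ids of the two fits are not
# guaranteed to line up, so compare groupings, not raw ids
print(data.frame(predicted = pred, reclustered = relabel))
```

Route A never changes the labels of the original points, whereas in Route B the new points can change the density and so merge, split, or create clusters.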

mhahsler · 2019-05-22T17:09:29Z

Predicting cluster membership on new data is a useful thing and should be added.

moredatapls · 2019-05-23T11:56:14Z

Sounds good, I will try to create a PR.

> Predicting which clusters new points belong be done simply w/ the cluster membership probabilities for either the default clustering returned or for the clusters returned by cutree-ing the hierarchy (see this issue).

@peekxc could you clarify what you mean by that? I'm not entirely sure how to implement your suggestions. What do you mean by the "default clustering"?

peekxc · 2019-05-23T15:26:09Z

@moredatapls What I mean is that HDBSCAN is not a single clustering algorithm per se. If you run hdbscan, it creates a hierarchy and then optimizes a mass-sensitive criterion to generate a set of local 'cuts' in the hierarchy. The clusters resulting from these cuts are what I refer to as the 'default' clustering.

But HDBSCAN isn't limited to just those local cuts; you can also use it as you would a more traditional cluster hierarchy, e.g.:

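library(dbscan)                    # provides hdbscan() and the DS3 data set
data("DS3")
res <- hdbscan(DS3, minPts = 50)   # builds the hierarchy and the default clustering
cutree(res$hc, k = 8)              # res$hc is an hclust object; cut it into 8 flat clusters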

For the prediction though, I think the default clustering is fine.
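As a rough illustration of the distinction drawn above, the sketch below puts the default clustering next to a flat cut of the same hierarchy. It relies on the `cluster`, `membership_prob`, and `hc` components of the object returned by `hdbscan()`; the choice of `k = 8` is simply carried over from the snippet above.

```r
library(dbscan)

data("DS3")
res <- hdbscan(DS3, minPts = 50)

# Default clustering: labels from the optimized local cuts (0 = noise)
default_labels <- res$cluster

# Per-point membership strength for the default clustering
probs <- res$membership_prob

# Alternative: a single global cut of the hierarchy into k groups,
# as with any hclust object (every point gets a group, no noise label)
flat_labels <- cutree(res$hc, k = 8)

# Compare how points are distributed under the two views
print(table(default = default_labels))
print(table(flat = flat_labels))
```

The default labels can leave points unassigned as noise, while `cutree()` always assigns every point to one of the `k` groups, so the two results are not directly interchangeable.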

mhahsler · 2019-05-23T16:29:49Z

I think the default clustering is fine. I have now extracted the predict functions into their own file, predict.R. Please put the code for HDBSCAN there.

mhahsler · 2020-05-18T16:48:01Z

@peekxc: Please review the code.

jwijffels · 2021-02-15T09:39:34Z

+1 for a predict.hdbscan, it is something we need if we want to implement https://github.com/michalovadek/top2vecr and put a package on CRAN for that.

mhahsler · 2022-02-16T00:05:16Z

hdbscan now has a predict function.
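For readers landing here, a minimal usage sketch follows. It assumes the `predict(object, newdata, data)` form documented for `hdbscan` objects in recent dbscan releases; the toy data and `minPts` value are invented for illustration, so check `?hdbscan` in your installed version.

```r
library(dbscan)

set.seed(1)
# Toy training data: two well-separated Gaussian blobs (illustrative only)
X <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2)
)
fit <- hdbscan(X, minPts = 10)

# New points: one near each blob and one far from both
new_X <- rbind(c(0, 0), c(3, 3), c(10, 10))

# The original data must be passed again, since the fitted object
# does not store the training points
pred <- predict(fit, newdata = new_X, data = X)
print(pred)  # cluster ids from the default clustering; 0 = noise
```

As discussed earlier in the thread, these labels refer to the default clustering of the fitted object and need not match what re-running hdbscan on the combined data would give.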

jwijffels · 2022-02-18T08:32:14Z

thanks!

mhahsler added the enhancement label May 22, 2019

moredatapls mentioned this issue May 26, 2019: Added predict() for HDBSCAN #33 (Closed)

mhahsler assigned peekxc May 18, 2020

jwijffels mentioned this issue Feb 15, 2021: test bnosac/doc2vec#10 (Open)

mhahsler closed this as completed Feb 16, 2022

jwijffels mentioned this issue Feb 18, 2022: predict.top2vec bnosac/doc2vec#23 (Open)
