Implement GPU based DBSCAN clustering algo #239

teju85 · 2017-10-17T05:38:05Z

Original serial algo: https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf
GPU based implementations:

CUDA-DClust: http://www.dbs.ifi.lmu.de/Publikationen/Boehm/CIKM_09.pdf. It uses a collision matrix approach to name clusters. also has an index data structure on GPU that helps reduce computational complexity of eps-neighborhood detection.
G-DBSCAN: https://pdfs.semanticscholar.org/31df/abb8d1085ac468b60a83d32af2a558407c95.pdf. Simpler implementation. Generates a proximity graph out of the eps-neighborhood info and then performs BFS traversal to name the clusters. Thus exposing more parallelism than the previous approach.
IMHO, I think we should start with an implementation based on G-DBSCAN and the refine the implementation as per performance profile.

en-casa · 2017-11-05T23:29:39Z

I'm new to this repository but I'm interested in helping to implement G-DBSCAN as a part of a class project at NCSU (I'm an MS student.) I noticed these two repos: DBSCAN and CudaDBClustering, that may be good starting points. The latter is a little messy.

Has this issue made progress? Is there anything I can help with?

thanks,
nico

mdymczyk · 2017-11-06T00:11:47Z

From what I know @teju85 will not be implementing this for the time being (think they decided to work on Kalman Filters) so you'd have to do the whole implementation yourself (which of course you're more than welcome to do). This would mean:

First a CUDA backend - you can either code it yourself based on the papers @teju85 posted or use an existing repo but in such a case you have to make sure the licensing is appropriate. I don't see any license in the repo you posted so you'd have to reach out to the original committer. We require Apache 2.0 compatible licensing.

You can follow the KMeans/tSVD file structure. All our GPU code is here https://github.com/h2oai/h2o4gpu/tree/master/src/gpu

A Python wrapper - you can have a look at existing examples for KMeans (kmeans.py) or truncated SVD. Examples here https://github.com/h2oai/h2o4gpu/tree/master/src/interface_py/h2o4gpu/solvers
Benchmarks against existing implementations (i.e. sklearn).
If you have time (and will) you can work on C++ implementation.

Even only a CUDA backend (without Python wrappers and C++ impl) would be of great help!

For starters, it would be good if you just cloned and built the repo, maybe ran tests, check if everything is working. Then go through one existing algo (I think truncated SVD would be easiest) and if you think you're up to it fork the repo and start working on your impl.

If you have any questions ping us here or on gitter.

mdymczyk · 2017-11-06T05:23:39Z

@n-casale my bad, misunderstood @teju85 - they already started working on DBSCAN implementation. Not sure if they need any help with it but probably won't be very parallelizable.

teju85 · 2017-11-06T05:27:19Z

Thanks Mateusz for the clarification.

@n-casale Thanks for showing interest and also for the links!
I did look at DBSCAN which implements the G-DBSCAN algo and I believe we can improve the distance-matrix evaluation kernel (which is the biggest contributor towards the runtime). However, it will still be a few weeks before we get to the part of implementing the connected-component-labelling stage (BFS traversal part, I mean). Hence, if you are still interested, that can be an aspect you can contribute to.

Regards,
Thejaswi

tomaszdudek7 · 2018-02-20T18:10:29Z

Hello @teju85
is there already a PR regarding this issue we could follow?

mdymczyk · 2018-02-20T22:28:37Z

Czesc @spaszek :-)

From what I remember @teju85 got a bit sidetracked with other work related issues so I wouldn't count on getting DBSCAN in the nearest future but let me follow up with him and see what we can do :-) Are you using DBSCAN in production? What would be your use case - how many GPUs are you guys using? What's your data size? We're trying to figure out what the users might need.

tomaszdudek7 · 2018-02-21T14:05:02Z

Oh, okay - I was under the impression you guys would have DBScan ready at Q1 2018 (as said in the post here https://blog.h2o.ai/2017/12/h2o4gpu-hands-lab-video-updates/ ) :P

To be honest, we are not using DBScan anywhere(yet) - I am just looking around for alternatives to scikit-learn in case its implementation would be not sufficient for us. It is kinda hard to measure our data size and GPU requirements right now though.

Do you plan to port the implementation to h2o-core? I would love to call DBScan from Scala easily, even without the GPUs. I work at a heavily JVM based company and we replace Python with Scala/Clojure whenever possible.

ąęśćóźłż - if you needed them in Japan :D

teju85 · 2018-02-21T14:30:56Z

Hi @spaszek ,
Sorry, there are currently no PR's for this ready from my side, yet. Like @mdymczyk mentioned, I got distracted with other items and couldn't get to this after the implementation of neighborhood calculation. :(

I'm hopeful of getting back to this soon, though.
Regards

mdymczyk · 2018-02-22T03:32:50Z

@spaszek our roadmap is more of a blueprint than anything set in stone unfortunately as we're heavily understaffed :(

I'm actually working on Java bindings (using SWIG) as we speak and hope to have it integrated with h2o-3/flow before GTC next month (so you'll be able to use it through h2o-3 or just as standalone h2o4gpu4java lib:-). This will have certain limitations, though, as our GPU implementations are not node-distributed (something we're looking into but still not sure should we go with distribution on the CPU or GPU level and whether it is actually necessary).

tomaszdudek7 · 2018-02-22T09:21:41Z

@mdymczyk Would there be a way for me to contribute to h2o4gpu4java? :)

I'm already working on PR at sparkling water project. Should I just search for another open issue in your Jira?

mdymczyk · 2018-02-22T10:33:09Z

@spaszek getting Java bindings should be easy and I will probably have it working by the end of the week/next week. But if you want to use it through H2O-3 or Sparkling Water we'll have to somehow integrate that and there are 2 ways to do that:

write Java model classes for each of the h2o4gpu algorithms, there's some example classes for that here https://github.com/h2oai/h2o-3/tree/master/h2o-algos/src/main/java/hex/example. One "trick" will has to be used, our models have map and reduce functions, as originally H2O was based on M/R. Here, we wouldn't be using them but instead run all the modeling in a special setupLocal method.
go through the existing model classes for H2O-3 KMeans, GLM, PCA etc. and somehow add this as one of the backends.

Will have to have a chat with the h2o-3/sw teams, see which path to choose and then we'll make some JIRAs or github issues here. Will keep you posted :-)

tomaszdudek7 · 2018-02-22T11:07:33Z

great, thanks @mdymczyk

I would love to contribute to that if possible - please don't forget to mention me when you guys settle for a solution. I just started using h2o (and sparkling water) extensively. Contributing to the project would be a good way of learning how to actually use the library (as well as speeding you guys up a little in the process).

voycey · 2018-11-09T05:56:35Z

Hey @teju85 - are there any updates on this? We have tried several other parallelism methods (Spark etc) but I think that CUDA is the way to go - I also think that some of the other items we are currently doing would be a really good fit for the RTX technology (Spatial joins etc)

teju85 · 2018-11-09T11:40:02Z

Hi @voycey ,
I forgot to update this thread. We now have DBSCAN as part of cuML repo here: https://github.com/rapidsai/cuml.

voycey · 2018-11-12T00:36:57Z

This seems to have problems crashing unfortunately! Ill keep an eye on the issue that is debugging it!

teju85 · 2018-11-12T04:01:39Z

@voycey Fixes for crash issue should now be at the HEAD of rapidsai/cuml. Can you check and provide feedback, please? If you still find problems, file an issue in that repo and I'll look at it sooner.

teju85 · 2018-11-12T04:02:40Z

@mdymczyk Do you think we still need to keep this issue open?

voycey · 2018-11-12T04:35:56Z

I will have an attempt at it this week sometime if I can!

blackvvine · 2021-06-23T23:51:15Z

Sorry for necro-bumping, but has there been any progress or alternatives to this?

teju85 · 2021-06-24T02:45:53Z

@blackvvine please refer to my earlier message above. We implemented dbscan inside cuML. Not sure if we ever want to provide a wrapper for h2o4gpu, though. I'll leave this decision to the project admins. @pseudotensor ?

kanchanchy · 2022-12-06T04:10:31Z

@teju85 Which algorithm has been used to implement DBSCAN in cuML? CUDA-DClust or G-DBSCAN or something else?

mdymczyk added CUDA new algo labels Oct 17, 2017

mdymczyk assigned teju85 Nov 6, 2017

mdymczyk added this to the Greener grass milestone Feb 8, 2018

sh1ng mentioned this issue Apr 1, 2020

Lightgbm hangs sometimes #830

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement GPU based DBSCAN clustering algo #239

Implement GPU based DBSCAN clustering algo #239

teju85 commented Oct 17, 2017

en-casa commented Nov 5, 2017

mdymczyk commented Nov 6, 2017 •

edited

Loading

mdymczyk commented Nov 6, 2017

teju85 commented Nov 6, 2017

tomaszdudek7 commented Feb 20, 2018

mdymczyk commented Feb 20, 2018

tomaszdudek7 commented Feb 21, 2018 •

edited

Loading

teju85 commented Feb 21, 2018

mdymczyk commented Feb 22, 2018

tomaszdudek7 commented Feb 22, 2018 •

edited

Loading

mdymczyk commented Feb 22, 2018

tomaszdudek7 commented Feb 22, 2018

voycey commented Nov 9, 2018

teju85 commented Nov 9, 2018

voycey commented Nov 12, 2018

teju85 commented Nov 12, 2018 •

edited

Loading

teju85 commented Nov 12, 2018

voycey commented Nov 12, 2018

blackvvine commented Jun 23, 2021

teju85 commented Jun 24, 2021

kanchanchy commented Dec 6, 2022

Implement GPU based DBSCAN clustering algo #239

Implement GPU based DBSCAN clustering algo #239

Comments

teju85 commented Oct 17, 2017

en-casa commented Nov 5, 2017

mdymczyk commented Nov 6, 2017 • edited Loading

mdymczyk commented Nov 6, 2017

teju85 commented Nov 6, 2017

tomaszdudek7 commented Feb 20, 2018

mdymczyk commented Feb 20, 2018

tomaszdudek7 commented Feb 21, 2018 • edited Loading

teju85 commented Feb 21, 2018

mdymczyk commented Feb 22, 2018

tomaszdudek7 commented Feb 22, 2018 • edited Loading

mdymczyk commented Feb 22, 2018

tomaszdudek7 commented Feb 22, 2018

voycey commented Nov 9, 2018

teju85 commented Nov 9, 2018

voycey commented Nov 12, 2018

teju85 commented Nov 12, 2018 • edited Loading

teju85 commented Nov 12, 2018

voycey commented Nov 12, 2018

blackvvine commented Jun 23, 2021

teju85 commented Jun 24, 2021

kanchanchy commented Dec 6, 2022

mdymczyk commented Nov 6, 2017 •

edited

Loading

tomaszdudek7 commented Feb 21, 2018 •

edited

Loading

tomaszdudek7 commented Feb 22, 2018 •

edited

Loading

teju85 commented Nov 12, 2018 •

edited

Loading