Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement GPU based DBSCAN clustering algo #239

Open
teju85 opened this issue Oct 17, 2017 · 21 comments
Open

Implement GPU based DBSCAN clustering algo #239

teju85 opened this issue Oct 17, 2017 · 21 comments
Assignees
Milestone

Comments

@teju85
Copy link

teju85 commented Oct 17, 2017

Original serial algo: https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf
GPU based implementations:

  1. CUDA-DClust: http://www.dbs.ifi.lmu.de/Publikationen/Boehm/CIKM_09.pdf. It uses a collision matrix approach to name clusters. also has an index data structure on GPU that helps reduce computational complexity of eps-neighborhood detection.
  2. G-DBSCAN: https://pdfs.semanticscholar.org/31df/abb8d1085ac468b60a83d32af2a558407c95.pdf. Simpler implementation. Generates a proximity graph out of the eps-neighborhood info and then performs BFS traversal to name the clusters. Thus exposing more parallelism than the previous approach.
    IMHO, I think we should start with an implementation based on G-DBSCAN and the refine the implementation as per performance profile.
@en-casa
Copy link

en-casa commented Nov 5, 2017

I'm new to this repository but I'm interested in helping to implement G-DBSCAN as a part of a class project at NCSU (I'm an MS student.) I noticed these two repos: DBSCAN and CudaDBClustering, that may be good starting points. The latter is a little messy.

Has this issue made progress? Is there anything I can help with?

thanks,
nico

@mdymczyk
Copy link
Contributor

mdymczyk commented Nov 6, 2017

From what I know @teju85 will not be implementing this for the time being (think they decided to work on Kalman Filters) so you'd have to do the whole implementation yourself (which of course you're more than welcome to do). This would mean:

  1. First a CUDA backend - you can either code it yourself based on the papers @teju85 posted or use an existing repo but in such a case you have to make sure the licensing is appropriate. I don't see any license in the repo you posted so you'd have to reach out to the original committer. We require Apache 2.0 compatible licensing.

You can follow the KMeans/tSVD file structure. All our GPU code is here https://github.com/h2oai/h2o4gpu/tree/master/src/gpu

  1. A Python wrapper - you can have a look at existing examples for KMeans (kmeans.py) or truncated SVD. Examples here https://github.com/h2oai/h2o4gpu/tree/master/src/interface_py/h2o4gpu/solvers

  2. Benchmarks against existing implementations (i.e. sklearn).

  3. If you have time (and will) you can work on C++ implementation.

Even only a CUDA backend (without Python wrappers and C++ impl) would be of great help!

For starters, it would be good if you just cloned and built the repo, maybe ran tests, check if everything is working. Then go through one existing algo (I think truncated SVD would be easiest) and if you think you're up to it fork the repo and start working on your impl.

If you have any questions ping us here or on gitter.

@mdymczyk
Copy link
Contributor

mdymczyk commented Nov 6, 2017

@n-casale my bad, misunderstood @teju85 - they already started working on DBSCAN implementation. Not sure if they need any help with it but probably won't be very parallelizable.

@teju85
Copy link
Author

teju85 commented Nov 6, 2017

Thanks Mateusz for the clarification.

@n-casale Thanks for showing interest and also for the links!
I did look at DBSCAN which implements the G-DBSCAN algo and I believe we can improve the distance-matrix evaluation kernel (which is the biggest contributor towards the runtime). However, it will still be a few weeks before we get to the part of implementing the connected-component-labelling stage (BFS traversal part, I mean). Hence, if you are still interested, that can be an aspect you can contribute to.

Regards,
Thejaswi

@mdymczyk mdymczyk added this to the Greener grass milestone Feb 8, 2018
@tomaszdudek7
Copy link

Hello @teju85
is there already a PR regarding this issue we could follow?

@mdymczyk
Copy link
Contributor

Czesc @spaszek :-)

From what I remember @teju85 got a bit sidetracked with other work related issues so I wouldn't count on getting DBSCAN in the nearest future but let me follow up with him and see what we can do :-) Are you using DBSCAN in production? What would be your use case - how many GPUs are you guys using? What's your data size? We're trying to figure out what the users might need.

@tomaszdudek7
Copy link

tomaszdudek7 commented Feb 21, 2018

Oh, okay - I was under the impression you guys would have DBScan ready at Q1 2018 (as said in the post here https://blog.h2o.ai/2017/12/h2o4gpu-hands-lab-video-updates/ ) :P

To be honest, we are not using DBScan anywhere(yet) - I am just looking around for alternatives to scikit-learn in case its implementation would be not sufficient for us. It is kinda hard to measure our data size and GPU requirements right now though.

Do you plan to port the implementation to h2o-core? I would love to call DBScan from Scala easily, even without the GPUs. I work at a heavily JVM based company and we replace Python with Scala/Clojure whenever possible.

ąęśćóźłż - if you needed them in Japan :D

@teju85
Copy link
Author

teju85 commented Feb 21, 2018

Hi @spaszek ,
Sorry, there are currently no PR's for this ready from my side, yet. Like @mdymczyk mentioned, I got distracted with other items and couldn't get to this after the implementation of neighborhood calculation. :(

I'm hopeful of getting back to this soon, though.
Regards

@mdymczyk
Copy link
Contributor

@spaszek our roadmap is more of a blueprint than anything set in stone unfortunately as we're heavily understaffed :(

I'm actually working on Java bindings (using SWIG) as we speak and hope to have it integrated with h2o-3/flow before GTC next month (so you'll be able to use it through h2o-3 or just as standalone h2o4gpu4java lib:-). This will have certain limitations, though, as our GPU implementations are not node-distributed (something we're looking into but still not sure should we go with distribution on the CPU or GPU level and whether it is actually necessary).

@tomaszdudek7
Copy link

tomaszdudek7 commented Feb 22, 2018

@mdymczyk Would there be a way for me to contribute to h2o4gpu4java? :)

I'm already working on PR at sparkling water project. Should I just search for another open issue in your Jira?

@mdymczyk
Copy link
Contributor

@spaszek getting Java bindings should be easy and I will probably have it working by the end of the week/next week. But if you want to use it through H2O-3 or Sparkling Water we'll have to somehow integrate that and there are 2 ways to do that:

  1. write Java model classes for each of the h2o4gpu algorithms, there's some example classes for that here https://github.com/h2oai/h2o-3/tree/master/h2o-algos/src/main/java/hex/example. One "trick" will has to be used, our models have map and reduce functions, as originally H2O was based on M/R. Here, we wouldn't be using them but instead run all the modeling in a special setupLocal method.

  2. go through the existing model classes for H2O-3 KMeans, GLM, PCA etc. and somehow add this as one of the backends.

Will have to have a chat with the h2o-3/sw teams, see which path to choose and then we'll make some JIRAs or github issues here. Will keep you posted :-)

@tomaszdudek7
Copy link

great, thanks @mdymczyk

I would love to contribute to that if possible - please don't forget to mention me when you guys settle for a solution. I just started using h2o (and sparkling water) extensively. Contributing to the project would be a good way of learning how to actually use the library (as well as speeding you guys up a little in the process).

@voycey
Copy link

voycey commented Nov 9, 2018

Hey @teju85 - are there any updates on this? We have tried several other parallelism methods (Spark etc) but I think that CUDA is the way to go - I also think that some of the other items we are currently doing would be a really good fit for the RTX technology (Spatial joins etc)

@teju85
Copy link
Author

teju85 commented Nov 9, 2018

Hi @voycey ,
I forgot to update this thread. We now have DBSCAN as part of cuML repo here: https://github.com/rapidsai/cuml.

@voycey
Copy link

voycey commented Nov 12, 2018

This seems to have problems crashing unfortunately! Ill keep an eye on the issue that is debugging it!

@teju85
Copy link
Author

teju85 commented Nov 12, 2018

@voycey Fixes for crash issue should now be at the HEAD of rapidsai/cuml. Can you check and provide feedback, please? If you still find problems, file an issue in that repo and I'll look at it sooner.

@teju85
Copy link
Author

teju85 commented Nov 12, 2018

@mdymczyk Do you think we still need to keep this issue open?

@voycey
Copy link

voycey commented Nov 12, 2018

I will have an attempt at it this week sometime if I can!

@blackvvine
Copy link

Sorry for necro-bumping, but has there been any progress or alternatives to this?

@teju85
Copy link
Author

teju85 commented Jun 24, 2021

@blackvvine please refer to my earlier message above. We implemented dbscan inside cuML. Not sure if we ever want to provide a wrapper for h2o4gpu, though. I'll leave this decision to the project admins. @pseudotensor ?

@kanchanchy
Copy link

@teju85 Which algorithm has been used to implement DBSCAN in cuML? CUDA-DClust or G-DBSCAN or something else?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants