Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBSCAN isn't using PointSelectionPolicy #1625

Closed
rcurtin opened this issue Dec 30, 2018 · 6 comments
Closed

DBSCAN isn't using PointSelectionPolicy #1625

rcurtin opened this issue Dec 30, 2018 · 6 comments

Comments

@rcurtin
Copy link
Member

rcurtin commented Dec 30, 2018

It was pointed out in IRC today that the DBSCAN class, which has a template parameter PointSelectionPolicy, isn't actually using that template parameter class. We should fix that. This would be a good first issue for someone looking to contribute. Here are the steps to fix the issue...

  1. Read and understand what DBSCAN is and how it works. Wikipedia or the original paper should suffice here, but in either case you should be familiar with DBSCAN.

  2. Take a look at the code at the bottom of dbscan_impl.hpp, at line 193:

  // Now loop over all points.
  for (size_t i = 0; i < data.n_cols; ++i)
  {
    // Union to all neighbors.
    for (size_t j = 0; j < neighbors[i].size(); ++j)
      uf.Union(i, neighbors[i][j]);
  }

That happens after the range search is done, and basically we are building the clusters at this point by using a UnionFind structure. The original intention of the PointSelctionPolicy was that instead of selecting points linearly (i.e. index 0, then 1, then 2, like the for loop above does), that another strategy could be used. We should aim to replace that code with this:

  // Now loop over all points.
  for (size_t i = 0; i < data.n_cols; ++i)
  {
    // Get the next index.
    const size_t index = pointSelector.Select(i, data);
    for (size_t j = 0; j < neighbors[index].size(); ++j)
      uf.Union(index, neighbors[index][j]);
  }
  1. Take a look at random_point_selection.hpp. You'll note that its signature does not match what I just proposed above, so it will need to be adapted. Additionally, the class will need to internally hold the list of which points have been selected and which haven't, so that when Select() is called it returns a random index that has not yet been selected.

  2. Implemented ordered_point_selection.hpp, which when Select(i, data) is called just returns i---so this will imitate the functionality from before.

  3. Set the default PointSelectionPolicy to OrderedPointSelection from the previous step, so that it still behaves the same as previous versions of mlpack.

  4. Modify dbscan_main.cpp to add an option like PARAM_FLAG("random_selection", ...) that will control whether or not OrderedPointSelection or RandomPointSelection is used.

  5. Modify HISTORY.md to point out this change.

  6. Add a test for DBSCAN using OrderedPointSelection to src/mlpack/tests/dbscan_test.cpp. Also make sure that RandomPointSelection is properly tested in that file.

  7. Add a test to ensure that the random_selection flag is being properly used to src/mlpack/tests/main_tests/dbscan_test.cpp.

Hopefully the clarity in what to do here is helpful. Like I said I think this would be a nice straightforward task for someone looking to get involved and contribute to the library.

@Yugandhartripathi
Copy link

I would like to work on this.

@rcurtin
Copy link
Member Author

rcurtin commented Dec 31, 2018

Hi there @Yugandhartripathi, you are more than welcome to. When you have a working PR, I'll review it and we can get it merged.

@KimSangYeon-DGU
Copy link
Member

KimSangYeon-DGU commented Jan 2, 2019

Hi @rcurtin, @Yugandhartripathi, I read this issue. The specifications were so detailed that I wanted to fix and contribute too. As a result, I made a PR #1627.

And I think the existing RandomPointSelection algorithm that uses boost::dynamic_bitmask is very efficient in finding unvisited points. So I wanted to keep this algorithm instead of using other's.

Could you review this code??

@rcurtin
Copy link
Member Author

rcurtin commented Jan 2, 2019

@KimSangYeon-DGU thanks! I reviewed it. It seems like I should try and make more issues with detailed specifications like this in the future. 👍

@Yugandhartripathi
Copy link

Yugandhartripathi commented Jan 2, 2019

Hi @rcurtin Just made a PR #1628 . This is my first contribution in any C++ repo.
Thanks for such a detailed issue, it helped me take initiative and I learnt a lot about best practices,styles and standards required.
Looking forward to learn and contribute more.

rcurtin added a commit that referenced this issue Jan 7, 2019
Fixed DBSCAN isn't using PointSelectionPolicy issue #1625
@rcurtin
Copy link
Member Author

rcurtin commented Jan 7, 2019

With #1628 merged, this can be closed now. Thanks again @KimSangYeon-DGU!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants