Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBSCAN isn't using `PointSelectionPolicy` #1625

Closed
rcurtin opened this issue Dec 30, 2018 · 6 comments

Comments

Projects
None yet
3 participants
@rcurtin
Copy link
Member

commented Dec 30, 2018

It was pointed out in IRC today that the DBSCAN class, which has a template parameter PointSelectionPolicy, isn't actually using that template parameter class. We should fix that. This would be a good first issue for someone looking to contribute. Here are the steps to fix the issue...

  1. Read and understand what DBSCAN is and how it works. Wikipedia or the original paper should suffice here, but in either case you should be familiar with DBSCAN.

  2. Take a look at the code at the bottom of dbscan_impl.hpp, at line 193:

  // Now loop over all points.
  for (size_t i = 0; i < data.n_cols; ++i)
  {
    // Union to all neighbors.
    for (size_t j = 0; j < neighbors[i].size(); ++j)
      uf.Union(i, neighbors[i][j]);
  }

That happens after the range search is done, and basically we are building the clusters at this point by using a UnionFind structure. The original intention of the PointSelctionPolicy was that instead of selecting points linearly (i.e. index 0, then 1, then 2, like the for loop above does), that another strategy could be used. We should aim to replace that code with this:

  // Now loop over all points.
  for (size_t i = 0; i < data.n_cols; ++i)
  {
    // Get the next index.
    const size_t index = pointSelector.Select(i, data);
    for (size_t j = 0; j < neighbors[index].size(); ++j)
      uf.Union(index, neighbors[index][j]);
  }
  1. Take a look at random_point_selection.hpp. You'll note that its signature does not match what I just proposed above, so it will need to be adapted. Additionally, the class will need to internally hold the list of which points have been selected and which haven't, so that when Select() is called it returns a random index that has not yet been selected.

  2. Implemented ordered_point_selection.hpp, which when Select(i, data) is called just returns i---so this will imitate the functionality from before.

  3. Set the default PointSelectionPolicy to OrderedPointSelection from the previous step, so that it still behaves the same as previous versions of mlpack.

  4. Modify dbscan_main.cpp to add an option like PARAM_FLAG("random_selection", ...) that will control whether or not OrderedPointSelection or RandomPointSelection is used.

  5. Modify HISTORY.md to point out this change.

  6. Add a test for DBSCAN using OrderedPointSelection to src/mlpack/tests/dbscan_test.cpp. Also make sure that RandomPointSelection is properly tested in that file.

  7. Add a test to ensure that the random_selection flag is being properly used to src/mlpack/tests/main_tests/dbscan_test.cpp.

Hopefully the clarity in what to do here is helpful. Like I said I think this would be a nice straightforward task for someone looking to get involved and contribute to the library.

@Yugandhartripathi

This comment has been minimized.

Copy link

commented Dec 31, 2018

I would like to work on this.

@rcurtin

This comment has been minimized.

Copy link
Member Author

commented Dec 31, 2018

Hi there @Yugandhartripathi, you are more than welcome to. When you have a working PR, I'll review it and we can get it merged.

@KimSangYeon-DGU

This comment has been minimized.

Copy link
Contributor

commented Jan 2, 2019

Hi @rcurtin, @Yugandhartripathi, I read this issue. The specifications were so detailed that I wanted to fix and contribute too. As a result, I made a PR #1627.

And I think the existing RandomPointSelection algorithm that uses boost::dynamic_bitmask is very efficient in finding unvisited points. So I wanted to keep this algorithm instead of using other's.

Could you review this code??

@rcurtin

This comment has been minimized.

Copy link
Member Author

commented Jan 2, 2019

@KimSangYeon-DGU thanks! I reviewed it. It seems like I should try and make more issues with detailed specifications like this in the future. 👍

@Yugandhartripathi

This comment has been minimized.

Copy link

commented Jan 2, 2019

Hi @rcurtin Just made a PR #1628 . This is my first contribution in any C++ repo.
Thanks for such a detailed issue, it helped me take initiative and I learnt a lot about best practices,styles and standards required.
Looking forward to learn and contribute more.

rcurtin added a commit that referenced this issue Jan 7, 2019

Merge pull request #1627 from KimSangYeon-DGU/develop
Fixed DBSCAN isn't using PointSelectionPolicy issue #1625
@rcurtin

This comment has been minimized.

Copy link
Member Author

commented Jan 7, 2019

With #1628 merged, this can be closed now. Thanks again @KimSangYeon-DGU!

@rcurtin rcurtin closed this Jan 7, 2019

@rcurtin rcurtin added the s: fixed label Jan 7, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.