Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hdbscan implementation #1169

Closed
wants to merge 11 commits into from
Closed

Hdbscan implementation #1169

wants to merge 11 commits into from

Conversation

s1998
Copy link
Contributor

@s1998 s1998 commented Dec 13, 2017

No description provided.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sudhanshu,

First of all I apologize for how long it has taken me to get to properly reviewing this. For now I have given it a first pass, and we can work on these issues. It took a very long time for me to find the time to read the HDBSCAN paper---it is time-consuming to come up to speed with new techniques to be able to review new code.

I left a number of comments that I hope are helpful in improving the code. If I can clarify any of them, please just let me know. Overall I think the code should be included in mlpack and it would be nice to have HDBSCAN, but I think before we get to that point there are some simplifications and improvements we can make.

The only two things that I think are missing, for now, is a comparison with https://github.com/scikit-learn-contrib/hdbscan to see if this implementation is faster, and possibly a more comprehensive test. Right now the tests look like the DBSCAN test, but I would love if we could think of some test that is specific to HDBSCAN. I will try to put some time into thinking about what a good test might be, and if you can also, maybe we can come up with something good.

Anyway, in conclusion, thanks for your hard work on this, and I am sorry the review has taken so long. I hope we can get this moving towards a merge now. :)

namespace mlpack {
namespace metric {

class HdbscanMetric
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a little documentation to this class please? To the best of my understanding you are doing a really clever trick here---the vectors that are given to this by HDBSCAN have d_core appended to them. Whatever the "trick" ends up being, we should document it here because it will make it clear how the class is meant to be used.

{
typename VecTypeA::elem_type dcoreA = a(a.n_rows-1);
typename VecTypeB::elem_type dcoreB = b(b.n_rows-1);
typename VecTypeB::elem_type distSq = arma::accu(arma::square(a - b));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider calling out to SquaredEuclideanDistance::Evaluate() here.

typename VecTypeA::elem_type dcoreA = a(a.n_rows-1);
typename VecTypeB::elem_type dcoreB = b(b.n_rows-1);
typename VecTypeB::elem_type distSq = arma::accu(arma::square(a - b));
typename VecTypeB::elem_type lastEle = (dcoreA - dcoreB)*(dcoreA - dcoreB);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to read to use std::pow(dcoreA - dcoreB, 2.0) here.

typename VecTypeB::elem_type dcoreB = b(b.n_rows-1);
typename VecTypeB::elem_type distSq = arma::accu(arma::square(a - b));
typename VecTypeB::elem_type lastEle = (dcoreA - dcoreB)*(dcoreA - dcoreB);
typename VecTypeB::elem_type dist = sqrt(distSq - lastEle);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we have to specify std::sqrt() here otherwise some other compilers may have a problem with it.

namespace mlpack {
namespace hdbscan {

template<typename NeighborSearch = neighbor::NeighborSearch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add some documentation discussing what the class does? This can be similar to the comment at the top of the file, but it should also detail each of the template parameters and what they do.

Log::Warn << "--minimumClusterSize is not specified. Default value is 10!"
<< endl;
minimumClusterSize = 10;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another place where the code has changed somewhat. Now we can use functions like CheckParamValue() and ReportIgnoredParam() and RequireAtLeastOnePassed(), etc., to do the input checks. I can help refactor this; just let me know if you'd like me to.

{
// Generate 5000 points on 0.1 radius circle.
arma::mat points(2, 5000);
double pi = 3.14156;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use M_PI here.

arma::Row<size_t> assignments;
h1.Cluster(points, assignments);

// Some points in the circle will get classified as noise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to mesh with the test condition; is the comment correct?

Copy link
Contributor Author

@s1998 s1998 Feb 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unfortunately it is.

Initially I thought there was an error and added a rounding term eps_error, but then I realized eps_error dpends on distance between points in dataset and is not the correct way to go.

I tried to dig into it. Here is what I found (I don't know if it helps but this is the reason for some points getting classified as noise): When we are creating a MST, though all the points are equivalent, some points will be considered first and will be present with shorter edges in the MST.

Example : For minClusterSize = 5
Say point p6 has neighbors p1, p2, p3, p4, p5. Once all these 5 points are inserted, when p6 will be considered (for insertion) all edges that are considered with p6 will have distance ( > dCore(p6)) and p6 will be not equivalent to p1 in the MST, though we know it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are talking about something different... what I mean is this: the comment says that some points will be classified as noise, but the actual test condition (on line 40) tests that no points are classified as noise.

Unfortunately I don't understand the situation that the rest of your explanation is referring to; can you clarify please?

// Some points in the circle will get classified as noise
for (size_t i = 0; i < assignments.n_cols; i++)
{
BOOST_REQUIRE((assignments(i) == 0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest BOOST_REQUIRE_EQUAL(assignments[i], 0);.

*/
template<typename MatType>
void Cluster(const MatType& data,
arma::Row<size_t>& assignments);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but I wanted to see what you thought: do you think that a user might benefit from being able to return a full hierarchy of the clusters? Let me know what you think. Maybe it would be some nice functionality.

Copy link
Contributor Author

@s1998 s1998 Feb 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the idea behind Single Linkage Tree, merge the closest clusters to obtain the next cluster and in this way it goes on forming the hierarchy. Should I create a separate function inside the HDBSCAN class ?

getSLT(MatType& data, MatType& SLT) 

or should I implement SLT as a separate class and call this inside HDBSCAN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea, but how about we come back to this point when the rest of the code is ready? Then the design we should use should be a little clearer.

@s1998
Copy link
Contributor Author

s1998 commented Feb 11, 2018

@rcurtin
I need some help with the metric implementation (I was stuck here last time too).
I modified hdbscan_metric to have

  const MatType& data;
  const arma::mat& coreDistances;

and it's constructor to

HdbscanMetric(const MatType& dataInp,
    const arma::mat& dcore):
    data(dataInp),
    coreDistances(dcore)
    { }

I am getting this error
Entire error log : log

I am not sure but what it probably means is ballbound_impl is calling metric = new HDBSCANMetric() and this would mean the new metric created is not having the data (that we stored). Please let me know how to proceed with this. Also since the data members are constant type, there can't be a default constructor with no arguments. I can refactor the code in ballbound to include a constructor that uses passed metric.Will this be correct way to proceed ?

Regarding test : I tried HDBSCAN vs DBSCAN with different datasets, what I found was that DBSCAN after some parameter tuning gave results similar to HDBSCAN.

Run-time comparison and other issues: If it's okay, I will do them once metric implementation is complete (since metric implementation will make the implementation faster).

@rcurtin
Copy link
Member

rcurtin commented Feb 13, 2018

Yeah, I think you are right that some of the other tree types are going to try to call and instantiate the metric with a default constructor. I would consider adding a default constructor, but then when you are holding MatType& data and arma::mat& coreDistances internally, don't hold references, and consider using the advanced constructor to make an alias:

http://arma.sourceforge.net/docs.html#Mat

Basically I can make an alias like this:

arma::mat matrix(10, 10, arma::fill::randu);
arma::mat alias(matrix.memptr(), 10, 10, false, true);

You could do that if the metric is constructed with matrix arguments, or just set it to an empty matrix otherwise (and warn in the documentation that the metric will not be useful unless those are set). That should help get you closer to working.

@rcurtin
Copy link
Member

rcurtin commented Apr 2, 2018

I'll close this for inactivity for now, and when it is updated we can reopen it. (I am just trying to prune the list of open PRs.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants