Add k-means++ initialization. #2813
Conversation
```cpp
const double sampleValue = mlpack::math::Random();
double* elem = std::lower_bound(distribution.begin(), distribution.end(),
    sampleValue);
size_t position = (size_t) (elem - distribution.begin()) / sizeof(double);
```
Perhaps we should use `const` here as well, to help the compiler.
Good point; done in 79d2621.
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Oops, I forgot to run
Looks like it would make sense to slightly increase the threshold for the kmeans test.
You're right---I ran the test 1000 times locally and it failed a decent number of times. I reset the tolerance to something a good bit larger than the largest value I saw. So, hopefully, we should never see this test fail in practice. :)
Looks great to me.
Second approval provided automatically after 24 hours. 👍
I implemented the k-means++ initialization strategy about three years ago, but for some reason never got around to contributing it back upstream. So, here it is.
https://en.wikipedia.org/wiki/K-means%2B%2B
It's quite an effective initialization strategy, and is often used in practice. It often seems to outperform the Bradley/Fayyad refined start strategy that we have implemented.