Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add k-means++ initialization. #2813

Merged
merged 8 commits into from Jan 26, 2021
Merged

Add k-means++ initialization. #2813

merged 8 commits into from Jan 26, 2021

Conversation

rcurtin
Copy link
Member

@rcurtin rcurtin commented Jan 21, 2021

I implemented the k-means++ initialization strategy for k-means like three years ago but for some reason never got around to contributing it back upstream. So, here it is.

https://en.wikipedia.org/wiki/K-means%2B%2B

It's quite an effective initialization strategy, and is often used in practice. It seems to often outperform the Bradley/Fayyad refined start strategy that we have implemented.

@zoq zoq added this to Need Review in PR Tracking Jan 21, 2021
const double sampleValue = mlpack::math::Random();
double* elem = std::lower_bound(distribution.begin(), distribution.end(),
sampleValue);
size_t position = (size_t) (elem - distribution.begin()) / sizeof(double);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should use const here as well, to help the compiler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; done in 79d2621.

src/mlpack/tests/main.cpp Outdated Show resolved Hide resolved
@rcurtin
Copy link
Member Author

rcurtin commented Jan 21, 2021

Oops, I forgot to run [KMeansMainTest] locally, but the issue should be resolved now. 👍

Copy link
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it would make sense to slightly increase the threshold for the kmeans test.

@rcurtin
Copy link
Member Author

rcurtin commented Jan 24, 2021

You're right---I ran the test 1000 times locally and it failed a decent number of times. I reset the tolerance to be something a decent bit larger than the largest value I saw. So, hopefully, we should never see this test fail in practice. :)

Copy link
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me.

@zoq zoq moved this from Need Review to Done in PR Tracking Jan 25, 2021
Copy link

@mlpack-bot mlpack-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

@zoq zoq merged commit f1c5ed9 into mlpack:master Jan 26, 2021
This was referenced Oct 14, 2022
@rcurtin rcurtin mentioned this pull request Oct 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants