
Parallel implementation of NaiveKMeans #1024

Merged
merged 13 commits into mlpack:master on Jun 26, 2017

Conversation

@shikharbhardwaj
Contributor

shikharbhardwaj commented Jun 12, 2017

Add a parallel implementation of NaiveKMeans, with the same interface.

Testing

Right now, I just replicated the first test in the existing tests on KMeans, changing the LloydStepType from the default (NaiveKMeans) to ParallelNaiveKMeans.

Speedup

Around 2.33x speedup on 4 threads with a simple benchmark using the Adult dataset with 32560 instances. The measurements.
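
For reference, here is a minimal sketch of what the test change amounts to; the KMeans template signature and default policies below are written from memory of the mlpack 2.x headers, so treat the exact parameter list as an assumption rather than the PR's code:

```cpp
#include <mlpack/core.hpp>
#include <mlpack/methods/kmeans/kmeans.hpp>

// The Lloyd step is a template parameter of KMeans, so switching from the
// default NaiveKMeans to the PR's ParallelNaiveKMeans (later folded back into
// NaiveKMeans itself) is a one-token change in the test.
void RunNaiveKMeans(const arma::mat& data, const size_t clusters)
{
  arma::Row<size_t> assignments;
  arma::mat centroids;

  mlpack::kmeans::KMeans<mlpack::metric::EuclideanDistance,
                         mlpack::kmeans::SampleInitialization,
                         mlpack::kmeans::MaxVarianceNewCluster,
                         mlpack::kmeans::NaiveKMeans> kmeans;

  // Computes both the per-point assignments and the final centroids.
  kmeans.Cluster(data, clusters, assignments, centroids);
}
```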

@mentekid
Contributor

mentekid commented Jun 13, 2017

Thanks for opening the PR. These are some nice speedups.

I think, since the code remains simple enough after the parallelization, we could get away with directly modifying naive_kmeans_impl.hpp instead of adding another file to the folder.

If we do it like that, people will benefit from the parallelism without explicitly specifying the parallel module. It might even speed up the other implementations such as dual trees, which you mentioned use the naive method.
What do you think?

@rcurtin, @zoq would that make the codebase too complicated in your opinions?

@rcurtin

Member

rcurtin commented Jun 13, 2017

I believe that there is no difference between NaiveKMeans and ParallelNaiveKMeans other than the OpenMP bits, so in my opinion it's better to just modify NaiveKMeans and add OpenMP support there. If a user doesn't want OpenMP to parallelize it for some weird reason, they can either specify OMP_NUM_THREADS=1 or they can simply disable OpenMP in the CMake configuration (hmm, I'm not sure if that's actually an option that's currently implemented. If it's not, I'm not too worried about it though, because I doubt anyone will want no OpenMP at all, unless it's not installed, in which case CMake already ignores it. Sorry for the tangent!).

@shikharbhardwaj

Contributor

shikharbhardwaj commented Jun 13, 2017

Great. I'll delete the extra file and test case.

@shikharbhardwaj

Contributor

shikharbhardwaj commented Jun 13, 2017

I was going through the open issues related to k-means. I will test whether this implementation is fast enough to solve #514.

@rcurtin

Member

rcurtin commented Jun 13, 2017

I think Armadillo's kmeans() implementation is only parallel via OpenBLAS, meaning only some of the operations will be parallelized, so it's possible that the OpenMP work you've done here will outperform that, which would be great (I love when old bugs can be closed!).

@shikharbhardwaj

Contributor

shikharbhardwaj commented Jun 14, 2017

I tried the benchmark code given in the issue. This implementation is still ~2x slower than Armadillo with OpenMP. This is faster than the previous implementation, which was 4x slower with OpenMP enabled.

@shikharbhardwaj

Contributor

shikharbhardwaj commented Jun 14, 2017

I tried a few optimizations and have some more speedups.

  1. Using unsafe_col in metric.Evaluate speeds things up quite a bit, but this function is not available for the sparse matrix type, so a template specialisation for sparse matrices is needed (see the sketch below).

  2. The copy made by constructing an arma::vec at line 80 can be avoided; that construction is there because the class accepts an arbitrary matrix type.

Overall, these two changes make the mlpack implementation around 1.2x slower than the Armadillo one.
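
As an illustration of point 1, here is a minimal sketch of the unsafe_col() idea; the function is hypothetical and arma::norm() stands in for metric.Evaluate():

```cpp
#include <armadillo>

// unsafe_col() returns a vector that aliases the matrix memory instead of
// copying the column, so no allocation happens per distance evaluation.  It is
// only available on dense arma::mat, which is why a sparse specialisation
// falling back to .col() is needed.
double PointToCentroidDistance(arma::mat& dataset,
                               arma::mat& centroids,
                               const size_t point,
                               const size_t cluster)
{
  return arma::norm(dataset.unsafe_col(point) - centroids.unsafe_col(cluster));
}
```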

Optimization to NaiveKMeans implementation
Avoid extra copies and use unsafe_col for owned matrices
@mentekid

Minor comments on this.

I am trying to think of a way to get around the critical section, which I suspect could give us a nice speed bump.

src/mlpack/methods/kmeans/naive_kmeans_impl.hpp:
localCounts(closestCluster)++;
}
// Combine calculated state from each thread
#pragma omp critical


@mentekid

mentekid Jun 14, 2017

Contributor

I haven't tested this and you probably have, so what I'm saying may be dumb, but: have you tried expressing these as an OpenMP reduction? Does it work with the overloaded + operator of Armadillo matrices?
If it did, we could remove the critical section, right?


@shikharbhardwaj

shikharbhardwaj Jun 14, 2017

Contributor

I had checked for this. :) Yes, that would be enough to remove the critical section, but OpenMP as of now does not support reductions on non-POD types (or custom user-specified reductions): https://stackoverflow.com/questions/16176155/openmp-reduction-with-overloaded-operator
I think this adds only minimal synchronisation overhead. Armadillo's kmeans does this the hard way, splitting the data among the threads, keeping track of it, and reducing it outside the parallel region, which makes for quite a lot of code.
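
To make the pattern under discussion concrete, here is a minimal illustrative sketch of per-thread accumulation merged inside a critical section; the variable names mirror the snippet above, but the function itself is an assumption, not the PR's exact code:

```cpp
#include <armadillo>
#include <limits>

void AccumulateCentroids(const arma::mat& dataset,
                         const arma::mat& centroids,
                         arma::mat& newCentroids,    // running sum per cluster
                         arma::Col<size_t>& counts)  // points per cluster
{
  newCentroids.zeros(centroids.n_rows, centroids.n_cols);
  counts.zeros(centroids.n_cols);

  #pragma omp parallel
  {
    // Thread-local accumulators: no synchronisation while filling them.
    arma::mat localCentroids(centroids.n_rows, centroids.n_cols,
        arma::fill::zeros);
    arma::Col<size_t> localCounts(centroids.n_cols, arma::fill::zeros);

    #pragma omp for
    for (long long i = 0; i < (long long) dataset.n_cols; ++i)
    {
      // Find the closest centroid for point i.
      size_t closestCluster = 0;
      double minDistance = std::numeric_limits<double>::max();
      for (size_t j = 0; j < centroids.n_cols; ++j)
      {
        const double distance = arma::norm(dataset.col(i) - centroids.col(j));
        if (distance < minDistance)
        {
          minDistance = distance;
          closestCluster = j;
        }
      }

      localCentroids.col(closestCluster) += dataset.col(i);
      localCounts(closestCluster)++;
    }

    // Combine calculated state from each thread.  This is the critical
    // section being discussed; a reduction would remove it, but reducing over
    // arma::mat is not portably expressible as an OpenMP reduction clause.
    #pragma omp critical
    {
      newCentroids += localCentroids;
      counts += localCounts;
    }
  }
}
```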


@shikharbhardwaj

shikharbhardwaj Jun 14, 2017

Contributor

One thing I had in mind: we need not do both operations in the same critical section. I'll try giving each operation its own critical section to see if there is any effect on the runtime.


@rcurtin

rcurtin Jun 14, 2017

Member

I am not sure if this is helpful, but you can get the memory pointer of a matrix with .memptr(), which might allow you to write a reduction. Though even if you did, I am not sure how much benefit would be seen.


@shikharbhardwaj

shikharbhardwaj Jun 14, 2017

Contributor

I am not too sure how that would make a reduction possible. Any variable in the reduction clause is restricted to be a scalar (OpenMP does not define what "scalar" means, but it most probably means numeric POD types).
Each thread calculates its share of the centroids and counts, so the reduction must be able to sum up the centroids and counts, which are themselves vectors. I can't think of a way to achieve this in the absence of user-defined reductions.
I'll measure how much time is spent waiting for the critical section to unlock.


@shikharbhardwaj

shikharbhardwaj Jun 14, 2017

Contributor

I made a few measurements and found that the time a thread spends in the critical section varies quite a bit, but is small. With a sample of 8000 measurements, the average time was around 365 ns. The measurement is quite shaky, as the resolution of the clock on my system is around 20 ns. The maximum time spent was around 82 µs, which is too small to affect the runtime significantly.
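
For reference, an illustrative sketch (not the profiling code actually used) of timing a thread's pass through the critical section with omp_get_wtime(); `sharedSum` and `localSum` are hypothetical stand-ins for the merged centroid and count state:

```cpp
#include <omp.h>
#include <cstdio>

// The elapsed time includes any wait to acquire the critical section's lock,
// which is the quantity being measured above.
void MergeWithTiming(double& sharedSum, const double localSum)
{
  const double start = omp_get_wtime();
  #pragma omp critical
  {
    sharedSum += localSum;
  }
  std::printf("critical section took %g s\n", omp_get_wtime() - start);
}
```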


@mentekid

mentekid Jun 14, 2017

Contributor

Nice, thanks for doing the correct thing (profiling) and not what I did (guessing).

I'll take a closer look at the algorithm (this time, I solemnly swear to profile) and see if I can come up with any other potential optimizations.

However, 1.2x the Armadillo time is still a significant bump :)

@shikharbhardwaj

Contributor

shikharbhardwaj commented Jun 15, 2017

This final commit puts the mlpack implementation at most around 10% slower than the Armadillo implementation of k-means (when both are run with 4 threads), according to the benchmark given in #514, although the benchmark is not very stable (both implementations may start from random centroids).

The mlpack implementation simply does more work than arma::kmeans by computing the assignment of each point in the dataset. Without the assignment computation, the mlpack implementation finishes within 1% of arma::kmeans.

@rcurtin

Member

rcurtin commented Jun 15, 2017

Nice! I think that would be sufficient to close #514; Armadillo doesn't calculate the assignments and that definitely is going to affect the resulting runtime. The benchmarking system (https://github.com/mlpack/benchmarks) is set up to start from the same initial centroids when comparing against other libraries, so it's possible that when this code is run on that system we will get some more stable benchmark comparisons. Everything looks great to me; thank you for looking into this so effectively. Looking forward to the merge. :)

shikharbhardwaj added some commits Jun 15, 2017

Simplify OpenMP loops with unsigned loop variables.
omp_size_t gives the correct counter type for the platform.
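
For context, a rough sketch of the idea behind omp_size_t; the exact definition in mlpack's headers may differ from what is shown here:

```cpp
#include <cstddef>
#include <cstdint>

// OpenMP 2.0 (the only version MSVC supports) requires signed loop counters,
// while newer OpenMP versions also accept unsigned ones; the typedef picks a
// counter type the platform's OpenMP can handle so the loop is written once.
#ifdef _WIN32
typedef std::intmax_t omp_size_t;  // signed counter for OpenMP 2.0 compilers
#else
typedef std::size_t omp_size_t;    // unsigned counter where newer OpenMP exists
#endif

// Usage in the parallel loop over points:
//   #pragma omp parallel for
//   for (omp_size_t i = 0; i < (omp_size_t) data.n_cols; ++i) { ... }
```
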
@mentekid

Contributor

mentekid commented Jun 21, 2017

If there are no more comments I will go ahead and merge this one.

@zoq

zoq approved these changes Jun 21, 2017

Looks good to me; I just pointed out a minor style issue.

{
minDistance = distance;
closestCluster = j;
const double distance = metric.Evaluate(dataset.col(i),


@zoq

zoq Jun 21, 2017

Member

Pedantic style issue: when wrapping a line, the continuation should be indented twice; for more information, take a look at https://github.com/mlpack/mlpack/wiki/DesignGuidelines#line-length-and-wrapping. This also applies to localCentroids.
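
As a small illustration of the convention (assuming mlpack's two-space indentation), a wrapped call gets two extra levels of indentation on the continuation line:

```cpp
// The continuation line is indented by two extra levels (four spaces)
// relative to the first line of the statement.
const double distance = metric.Evaluate(dataset.col(i),
    centroids.col(j));
```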

@shikharbhardwaj

Contributor

shikharbhardwaj commented Jun 23, 2017

I have made the style fixes. Ready to merge. :)
Just a reminder, this merge closes #514.

@mentekid

Contributor

mentekid commented Jun 23, 2017

Thanks for noticing that, Marcus.
I think this is ready to merge, but I'll leave it open over the weekend so that anybody who has comments can pitch in.

@mentekid merged commit 980ba7e into mlpack:master on Jun 26, 2017

3 checks passed

Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed