Implementation of parallel SGD #1037

Merged
merged 42 commits into mlpack:master on Jul 28, 2017

Conversation

shikharbhardwaj (Contributor) commented Jun 24, 2017

Initial ideas for the implementation of parallel SGD based on HOGWILD! in mlpack.

The code is untested and may have a lot of style issues. I'm trying to find a reasonable way to test parallel SGD.

shikharbhardwaj commented Jun 26, 2017

I think the last commit fixes the issue with Visual Studio not being able to compile in-class member initializers with braces. I moved the initialization to the default constructor.

I have added a simple test for parallel SGD for now. Are there any other ways we could test this?

zoq commented Jun 26, 2017

I think it would be neat to test it on the matrix completion method. I don't think it would be feasible to replicate the results from the paper, at least not as a unit test. Perhaps we can create a small synthetic dataset?

shikharbhardwaj commented Jun 26, 2017

I was able to run this on the RCV1 dataset for training a sparse SVM classifier. The problem I had was defining exactly what metric would make a good test.
If we create a synthetic dataset for matrix completion, what property would signify correctness?
(A reduction in the objective function?)
Or do we create a synthetic dataset such that it is guaranteed to optimize to a particular solution, or to reach a particular threshold value of the objective function within a fixed number of iterations?

zoq commented Jun 27, 2017

Testing the optimizer on a synthetic dataset with a fixed number of iterations, where the resulting objective lies in a reasonable range, is definitely something we could do. Perhaps testing it on the max-cut problem is another option.

@rcurtin

I only took a quick look through so far. It looks nice! I am looking forward to adding it to the codebase.

Are you planning to add the sparse SVM function to the tests or anything? I guess it could also be adapted into a full-fledged mlpack method.

shikharbhardwaj commented Jul 2, 2017

Thanks for the initial thoughts, Ryan.

Regarding the tests, I am currently working out how to generate synthetic datasets for testing both the sparse SVM and matrix completion methods.

I'll look into the changes required in LogisticRegression and SoftmaxRegression.

rcurtin commented Jul 2, 2017

Sounds good, looking forward to future improvements. :) Thanks!

shikharbhardwaj added some commits Jul 3, 2017

Improve comments in sparse MC implementation.
Add another method for testing Parallel SGD.
shikharbhardwaj commented Jul 13, 2017

The final commit makes the implementation ready for merge from my side. The one thing remaining is a test for the sparse SVM function. I tried generating some random sparse data, but the run terminated with 0 RMSE; I don't think that makes for a good test.

The run on RCV1 could be added as a test, but the dataset is ~27 MB in Armadillo's binary format, a bit large to include in the repo.

I decided to remove the implementation of the sparse matrix completion loss function, as it was quite similar to Regularized SVD.

mentekid commented Jul 14, 2017

Thank you, Shikhar.

Have you run any benchmarks of HOGWILD! vs. "vanilla" SGD? How much of a speedup does plugging in HOGWILD! give us?

I'll play with this more extensively over the weekend and leave you any comments I have.

shikharbhardwaj commented Jul 14, 2017

The difference is pretty large, I think.

I ran HOGWILD! with 4 threads and StandardSGD on the RCV1 dataset (the RMSE here is computed over the training set only).

# HOGWILD!
λ mlpack_spike/hogwild ∴ time OMP_NUM_THREADS=4 ./a.out
RMSE : 0.797936
Initial loss : 70176.4
Final loss : 114.276
RMSE : 0.0496217
OMP_NUM_THREADS=4 ./a.out  16.12s user 0.18s system 392% cpu 4.155 total

# StandardSGD
λ mlpack_spike/hogwild ∴ time ./a.out
RMSE : 0.797936
Initial loss : 70176.4
Final loss : 5969.06
RMSE : 0.227869
./a.out  7.53s user 1.44s system 102% cpu 8.763 total

HOGWILD! finishes 2x faster, with a much better RMSE.

@mentekid

Looks good in general. Mostly style comments on my part.

}
// Output current objective function.
Log::Info << "Parallel SGD: iteration " << i << ", objective "

@mentekid

mentekid Jul 17, 2017

Contributor

I would print this to Log::Debug, I think, to avoid sending too much information to the screen. However, in SGD we also print this in Log::Info.

@shikharbhardwaj

shikharbhardwaj Jul 17, 2017

Contributor

I think the reason for using Log::Info here is to allow the user to choose to display the output (with the -v verbose flag to the executable) without recompiling the library in debug mode.

@zoq

zoq Jul 20, 2017

Member

I think that since Log::Info is only printed if the user asks for it, it's fine to use it here.

shikharbhardwaj commented Jul 17, 2017

Thanks for the reviews, @zoq and @mentekid. I pushed some changes according to the comments.

The Windows build is failing; I don't think it's related to the code.

Is there some way to run the style checks for mlpack locally?

zoq commented Jul 18, 2017

About the AppVeyor build: https://appveyor.statuspage.io/incidents/m2vdvw39kdk8, we can restart the build once the issue is resolved.

About the style checks, you can clone the jenkins repository (https://github.com/mlpack/jenkins-conf.git) and run

mkdir reports
./lint.sh --root . --reports reports/cpplint.xml

Let us know if we should clarify anything.

shikharbhardwaj commented Jul 20, 2017

Thanks for the suggestions. I will update the code ASAP.

A few things I noticed while running this algorithm on the Netflix dataset:

  1. Shuffling visitationOrder takes a large amount of time. The number of indices is on the order of 1e7 for that problem. arma::shuffle creates a copy of the vector, which can be avoided (with std::shuffle, I think). Could we use another approach to partition the data among the threads? I remember the authors mentioned running the algorithm on a 12-core machine, of which 2 cores were used just for handing out data to the worker threads.

  2. Performing the update as is done currently (iterating over the columns of the gradient and updating non-zero components) is wasteful when the gradient is very large (even running an empty loop is expensive when n ≈ 1e6). The existing regularized SVD function solves this by providing a specialisation of Optimize that updates only the needed components. I can't think of a less intrusive way for the function to relay this to the optimizer without introducing some extra bookkeeping.

zoq commented Jul 20, 2017

About 1.: what if we used #pragma omp task and ran the shuffle process alongside the gradient update process?

@zoq

Just made some minor comments; besides that, this looks ready to me.

zoq approved these changes Jul 24, 2017

No more comments from my side; this looks ready to me.

mentekid commented Jul 25, 2017

I agree, I'll merge it on Friday if there are no more comments.

@mentekid mentekid merged commit 00c746e into mlpack:master Jul 28, 2017

3 checks passed

Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
mentekid commented Jul 28, 2017

Merged. Thank you for the great contribution.

@zoq, thanks for the detailed code reviews, I was a bit out of my depth on this one.
