
Implementation of Stochastic Coordinate Descent #1075

Merged: 31 commits into mlpack:master on Sep 14, 2017

Conversation

shikharbhardwaj (Contributor):

Here is an implementation of another optimizer, SCD (stochastic coordinate descent). I have started with a serial implementation, and made some changes to LogisticRegressionFunction to test this optimizer. The optimizer adds another requirement to the function type to be optimized.

Currently, I am working on getting two things done:

  1. Implement the greedy descent policy, using ideas from http://www.maths.ed.ac.uk/~prichtar/Optimization_and_Big_Data_2015/slides/Schmidt.pdf.

  2. Optimize the FeatureGradient method in LogisticRegressionFunction to not use Gradient.

I'll update soon with more changes.
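For readers unfamiliar with the method, the core idea (a cheap update to a single coordinate per iteration) can be sketched in a standalone way. This toy example minimizes a separable quadratic and is not mlpack's actual SCD class; all names here are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimize f(x) = sum_j (x_j - target_j)^2 by stochastic coordinate
// descent: each iteration picks one coordinate and takes a gradient
// step on that coordinate alone.
std::vector<double> StochasticCoordinateDescent(
    const std::vector<double>& target,
    const double stepSize,
    const std::size_t maxIterations)
{
  std::vector<double> x(target.size(), 0.0);
  for (std::size_t it = 0; it < maxIterations; ++it)
  {
    // Random descent policy: pick one coordinate uniformly at random.
    const std::size_t j = std::rand() % x.size();
    // Partial gradient of f with respect to x_j only.
    const double partialGradient = 2.0 * (x[j] - target[j]);
    // Descend on the chosen coordinate; all others stay untouched.
    x[j] -= stepSize * partialGradient;
  }
  return x;
}
```

With stepSize 0.1 each visit to a coordinate contracts its error by a factor of 0.8, so after a few thousand iterations x sits very close to target.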

* @return The index of the coordinate to be descended.
*/

// TODO: Find a way to implement this.
Member:

I'm not sure I get the point of the TODO, can you elaborate on this?

Contributor Author:

I am planning to implement this descent policy, which gives the feature with the maximum guaranteed descent, based on http://www.maths.ed.ac.uk/~prichtar/Optimization_and_Big_Data_2015/slides/Schmidt.pdf and ideas from this implementation of SCD: http://ttic.uchicago.edu/~tewari/code/scd/

The above-mentioned implementation uses a parameter "rho", a property of the objective function (0.25 for logistic and hinge loss, 1 for quadratic loss), when calculating the predicted descent and finding the feature with the steepest descent.
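To make that selection rule concrete, here is a hedged standalone sketch (GreedyFeature and its signature are made up for illustration, not the PR's code): with a per-coordinate curvature bound rho, a step on coordinate j guarantees a decrease of roughly g_j^2 / (2 * rho), so the greedy policy picks the coordinate maximizing that predicted decrease.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy (Gauss-Southwell-style) coordinate selection: for a loss whose
// per-coordinate curvature is bounded by rho, a step of g_j / rho on
// coordinate j guarantees a decrease of about g_j^2 / (2 * rho), so the
// best coordinate is the one with the largest predicted decrease.
std::size_t GreedyFeature(const std::vector<double>& partialGradients,
                          const double rho)
{
  std::size_t best = 0;
  double bestDescent = 0.0;
  for (std::size_t j = 0; j < partialGradients.size(); ++j)
  {
    const double g = partialGradients[j];
    const double predictedDescent = (g * g) / (2.0 * rho);
    if (predictedDescent > bestDescent)
    {
      bestDescent = predictedDescent;
      best = j;
    }
  }
  return best;
}
```

Since rho scales every coordinate's prediction equally, it does not change the argmax here; it matters once the predicted descent is compared against a step-acceptance threshold.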

@rcurtin (Member) left a comment:

Looks good to me. I do think we need a formalized write-up of all the different FunctionTypes that we have now and what the differences between them are. Would you be willing to sketch up some documentation for that? I figured it could look like the documentation in doc/policies/.

Another question I have is, are you planning on making any of the other functions (like softmax regression) into ResolvableFunctionTypes?

And last question for now, have you done any timing simulations to compare SCD with, e.g., LBFGS or SGD for logistic regression? I am excited to see how it performs. :)

arma::mat predictors("0 0 0.4; 0 0 0.6; 0 0.3 0; 0.2 0 0; 0.2 -0.5 0;");
arma::Row<size_t> responses("1 1 0;");


Member:

Looks like there is an extra line here.

* NumFeatures() should return the number of features in the decision variable.
* Evaluate gives the value of the loss function at the current decision
* variable and FeatureGradient is used to evaluate the partial gradient with
* respect to the jth feature.
Member:

Thanks for documenting this. What happens if a user passes a FunctionType that isn't a ResolvableFunctionType? How difficult is the error message to understand?

Contributor Author:

I guess if someone passes in a FunctionType without the ResolvableFunctionType interface, the compiler will complain about the absence of the member functions NumFeatures and FeatureGradient. Are there ways we can make this more understandable to the user?

Member:

Sorry for the slow response here. We could use a static_assert() check to ensure that it has the proper methods. But for the sake of this PR I think that is not necessary. If you are interested afterwards in making the error reporting more robust for template parameters, I have some ideas. But if you are not interested that is ok too, you should not feel obligated. :)
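The static_assert() suggestion can be sketched with the standard member-detection idiom. This uses C++17's std::void_t; mlpack's actual traits machinery may differ, and every name below is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Detect whether T has a callable member function NumFeatures().
template <typename T, typename = void>
struct HasNumFeatures : std::false_type { };

template <typename T>
struct HasNumFeatures<T,
    std::void_t<decltype(std::declval<T>().NumFeatures())>>
    : std::true_type { };

// An optimizer entry point could then fail with a readable message
// instead of a wall of template errors.
template <typename FunctionType>
void CheckResolvable()
{
  static_assert(HasNumFeatures<FunctionType>::value,
      "FunctionType must implement NumFeatures() to be optimized by SCD.");
}

// Example types: one satisfying the check, one not.
struct GoodFunction { std::size_t NumFeatures() { return 3; } };
struct BadFunction { };
```

Calling CheckResolvable<BadFunction>() would then stop compilation at the static_assert with the quoted message.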

@shikharbhardwaj (Contributor Author):

Thanks for the suggestions, Ryan.
I am currently working on some more improvements. I'll add more functions with the Resolvable requirements soon.

Documenting the policies added is a nice improvement. I'll do that.

@mentekid (Contributor) left a comment:

Only minor comments from me.

This looks good. I am glad you went into the testing rabbit hole with HOGWILD!; now testing this feels more straightforward.

return coordinates[i] * coordinates[i] + bi[i] * coordinates[i] +
    intercepts[i];
}

//! Evaluate a function.
Contributor:

Minor - I think this comment and the one on the overload above should be swapped?

* The DescentFeature method is used to get the descent coordinate for the
* current iteration.
*
* @tparam ResolvableFunctionType The type of the function to be optimized.
Contributor:

Should we have a pointer to the more detailed description of this function's signature in scd.hpp? Something like "For more information, see the SCD implementation".

* @tparam DescentPolicy Descent policy to decide the order in which the
* coordinate for descent is selected.
*/
template <typename DescentPolicyType = RandomDescent>
Contributor:

I like that you have a default policy here.

size_t updateInterval;

//! The descent policy used to pick the coordinates for the update.
DescentPolicyType descentPolicy;
Contributor:

I would argue that the user would be unlikely to change the descent policy during the object's lifetime - would it make sense to make it const and remove the set function? I realise it's a very minor point though.

@shikharbhardwaj (Contributor Author) commented Aug 15, 2017:

I finished up the changes in SoftmaxRegressionFunction with the last commit. A minor inconsistency I noticed while working with the code in softmax and logistic regression is the way the decision variable is represented. Softmax regression has the features arranged column-wise, whereas logistic regression has them row-wise (so it gives out a single column vector as the output from Gradient).

I guess it would be nice to make this uniform across all the functions (SCD could then do the update on the relevant column instead of working on the entire decision variable), which would also ease parallelisation.

@zoq (Member) commented Aug 17, 2017:

I think we should see if we can work out the inconsistency. @rcurtin already looked into the logistic regression method, so he can probably provide some more insight regarding what needs to be done.

@shikharbhardwaj (Contributor Author):

I am working on the changes in logistic regression (using a rowvec for the decision variable and obtaining the submatrix views with tail_cols). I am done with the changes in the function, but they break some other tests (which assumed the gradient to be in the other shape).

@zoq (Member) commented Aug 18, 2017:

Can you refactor the test cases too? Don't feel obligated.

@shikharbhardwaj (Contributor Author):

Sure, I am already on it. :)

@mentekid (Contributor) left a comment:

Again only minor comments.

Thanks for normalizing the logistic regression interface as well, that was a good catch :)

const double regularization = lambda * (1.0 / (2.0 * predictors.n_cols)) *
arma::dot(parameters.col(0).subvec(1, parameters.n_elem - 1),
parameters.col(0).subvec(1, parameters.n_elem - 1));
norm * norm;
Contributor:

indent 2 more spaces here

@@ -205,3 +205,16 @@ void SoftmaxRegressionFunction::Gradient(const arma::mat& parameters,
lambda * parameters;
}
}

void SoftmaxRegressionFunction::FeatureGradient(const arma::mat& parameters,
const size_t j,
@mentekid (Contributor), Aug 21, 2017:

You could also tab this to align the consts in this line and the one above, but that's really minor.

@@ -54,10 +54,7 @@ double ParallelSGD<DecayPolicyType>::Optimize(
lastObjective = overallObjective;
overallObjective = 0;
Contributor:

Also, you can remove this line since you're now evaluating it all in one go.

@mentekid (Contributor):

This looks ready to merge on my end; the AppVeyor failure seems to be a random timeout.

I am leaving it open to any comments or reviews, and I'll merge this on Friday if nothing comes up.

@shikharbhardwaj nice work - thank you and well done :)

@zoq (Member) commented Aug 22, 2017:

Let's restart the Windows build; it looks like the build could not fetch the repo.

* @file cyclic_descent.hpp
* @author Shikhar Bhardwaj
*
* Cyclic descent policy for Stochastic Co ordinate Descent (SCD).
Member:

Looks like an unnecessary space?


} // namespace optimization
} // namespace mlpack
#endif
Member:

Do you mind adding a newline after #endif? I think that applies to almost all files in this PR.

SCD(const double stepSize = 0.01,
const size_t maxIterations = 100000,
const double tolerance = 1e-5,
const size_t updateInterval = 1e3,
Member:

Not sure, but it sounds like updateInterval is what we call batchSize in, e.g., MinibatchSGD. If it's the same, we should think about renaming the parameter. Let me know what you think.

Contributor Author:

updateInterval is different from batchSize. It is the number of iterations after which we print the diagnostic information to the logs. As printing requires a call to Evaluate (which may take time), we need to make sure that the diagnostic info does not slow down the iteration (each iteration in SCD is expected to be much faster than Evaluate).
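So the pattern is simply periodic evaluation: the cheap coordinate update runs every iteration, and the expensive full Evaluate() only fires every updateInterval iterations. A standalone sketch of that gating logic (names assumed, not mlpack's code):

```cpp
#include <cassert>
#include <cstddef>

// Count how many times a diagnostic evaluation would run over a given
// number of iterations, if it only fires every updateInterval steps.
std::size_t CountEvaluations(const std::size_t maxIterations,
                             const std::size_t updateInterval)
{
  std::size_t evaluations = 0;
  for (std::size_t i = 1; i <= maxIterations; ++i)
  {
    // The cheap per-coordinate update would happen on every iteration;
    // the expensive full Evaluate() only every updateInterval iterations.
    if (i % updateInterval == 0)
      ++evaluations;
  }
  return evaluations;
}
```

With the defaults quoted above (maxIterations = 100000, updateInterval = 1e3), only 100 full evaluations occur across 100000 coordinate updates.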

Member:

I see, thanks for the clarification.


// Calculate sigmoid.
const double exponent = parameters(0, 0) + arma::dot(predictors.col(i),
parameters.col(0).subvec(1, parameters.n_elem - 1));
parameters.tail_cols(parameters.n_elem - 1).t());
Member:

Nice!

const arma::rowvec sigmoids = (1 / (1 + arma::exp(-parameters(0, 0)
- parameters.tail_cols(parameters.n_elem - 1) * predictors)));

arma::mat diffs = responses - sigmoids;
Member:

I think we could inline the operation here, since it is only needed once.


BOOST_REQUIRE_EQUAL(descentPolicy.DescentFeature(0, point, f), 2);

point[1] = 10;
Member:

Can you elaborate on this (add a comment)?

@rcurtin (Member) left a comment:

Looks great to me. Thank you so much for adding the tutorial on FunctionType parameters, I think it is a great improvement. These are the last comments I have; I think they are all pretty minor. From my end the PR is good to go if you can address them. Great work!

@endcode

To evaluate the gradient at the given coordinates, where \c gradient is an
out-param for the required gradient.
Member:

It might be easier to read if you include this comment as part of the @code snippet:

// Evaluate the gradient at the given coordinates, where 'gradient' is an output
// parameter for the required gradient.
void Gradient(const arma::mat& coordinates, arma::mat& gradient);

There are lots of little @code blocks you could modify like this. What do you think?

Contributor Author:

Sure, I'll do that. I was thinking about improving the formatting but somehow it slipped my mind.

/**
 * Cyclic descent policy for Stochastic Co-ordinate Descent (SCD). This
 * descent scheme picks the co-ordinate for the descent in a cyclic manner
 * serially.
Member:

Is there an easy paper to cite here for SCD? It's always nice to provide references to the original paper.

out-param for the required gradient. The out-param is a sparse matrix(with
dimensions equal to the decision variable), for storing the gradient of the
jth feature. The \c gradient matrix is supposed to be non-zero in the jth
column, which contains the relavant partial gradient.
Member:

Minor misspelling: 'relevant' :)

\c FunctionType interface.

@code
void FeatureGradient(const arma::mat& coordinates, const size_t j, arma::sp_mat& gradient);
Member:

Do you think it would be better to name this function PartialGradient(), since it is the partial gradient with respect to one of the coordinates (features)?

Contributor Author:

I agree, PartialGradient describes the function better.

*
* @tparam ResolvableFunctionType The type of the function to be optimized.
* @param numEpoch The iteration number for which the feature is to be
* obtained.
Member:

To me, "epoch" means an entire pass over the whole dataset, so it seems like the wrong word to use here. I think it might be clearer to just name this parameter iteration.

The interface expects the following member functions from the function class

@code
size_t NumFeatures();
Member:

I am not sure on this---is it always true that NumFeatures() == coordinates.n_elem? If so, I am not sure this function is needed.

Contributor Author:

Yes, in the current scheme, NumFeatures() == coordinates.n_cols. I guess we can remove this function then (I kept it for consistency with SGD).

arma::dot(parameters.col(0).subvec(1, parameters.n_elem - 1),
parameters.col(0).subvec(1, parameters.n_elem - 1));
// term and take every term except the last one in the decision variable.
double norm = arma::norm(parameters.tail_cols(parameters.n_elem - 1));
Member:

I think the use of tail_cols() is a nice improvement, but is the switch of parameters from arma::vec to arma::rowvec necessary? That's an API change and might break some user's code. I think you could also just do tail_rows() and leave it as an arma::vec. Let me know what you think. Also I think it might be faster (very incrementally) to use arma::dot() instead of arma::norm() and then multiplying.
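The second suggestion rests on the identity dot(v, v) == norm(v)^2: arma::dot skips the square root that arma::norm computes and the multiplication that would follow. A dependency-free illustration of the identity (Dot and Norm here are stand-ins written for this sketch, not Armadillo's functions):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for arma::dot: the plain inner product.
double Dot(const std::vector<double>& a, const std::vector<double>& b)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i] * b[i];
  return sum;
}

// Stand-in for arma::norm: sqrt of the self inner product. So computing
// Norm(v) and then squaring it recomputes what Dot(v, v) gives directly.
double Norm(const std::vector<double>& v)
{
  return std::sqrt(Dot(v, v));
}
```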


gradient.set_size(arma::size(denseGrad));

gradient.col(j) = denseGrad.col(j);
Member:

Would it be faster to explicitly calculate only the part of the gradient that is needed here?

/**
* Test the greedy descent policy.
*/
BOOST_AUTO_TEST_CASE(GreedyDescentTest)
Member:

It might be also useful to add a simple test for RandomDescent and CyclicDescent; very simple tests can just make sure they give the expected result. These can be helpful later, to make sure that nothing is broken by later maintenance/refactorings of the code.
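Such tests can indeed be tiny. A cyclic policy, for instance, should just walk the coordinates in order modulo the feature count, which a few assertions pin down (standalone sketch with assumed names, not the PR's test code):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a CyclicDescent policy's DescentFeature(): visit the
// coordinates 0, 1, ..., numFeatures - 1 in order, then wrap around.
std::size_t CyclicDescentFeature(const std::size_t iteration,
                                 const std::size_t numFeatures)
{
  return iteration % numFeatures;
}
```

A RandomDescent test would instead only assert that the returned index is always in [0, numFeatures).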

}

/**
* Test changes to Softmax regression function.
Member:

I think this might be an inaccurate comment (or at least it won't make sense after merge), maybe something more like Test that SoftmaxRegressionFunction::FeatureGradient() works as expected.

@rcurtin (Member) commented Aug 28, 2017:

@shikharbhardwaj: thanks for the responses, everything looks good to me. I think there are still two comments worth addressing---SoftmaxRegressionFunction::PartialGradient() could be accelerated still, and tests could be added for RandomDescent and CyclicDescent, but those are up to you. I think it's fine to merge as-is regardless. Thanks for the hard work, I think this is a nice addition. 👍

@shikharbhardwaj (Contributor Author):

Sure, I had added tests for the descent policies (CyclicDescentTest and RandomDescentTest in the SCD test file).
I had started with optimizing PartialGradient in SoftmaxRegressionFunction but got confused. I'll start again with a clear head today.
I also looked into the idea of checking the function API for consistency at compile time. I guess this could be applied to almost all optimizers, so I'll do this as a separate PR.

@rcurtin (Member) commented Sep 14, 2017:

I think this is ready to merge; @mentekid: was there anything else you were waiting on for this one?

@mentekid (Contributor):

Sorry, I was planning to review this but forgot. I haven't looked at the latest push, but if you think it is ready to go just go ahead and merge.

Sorry for the delay, @shikharbhardwaj thanks for the contribution!

@rcurtin (Member) commented Sep 14, 2017:

Ok, sure, I'll go ahead and merge it then. I think the latest changes are good. Thanks! :)

@rcurtin rcurtin merged commit 21349b3 into mlpack:master Sep 14, 2017