
Implementation of Stochastic Coordinate Descent #1075

Merged: 31 commits into mlpack:master on Sep 14, 2017

Conversation

shikharbhardwaj (Contributor):

Here is an implementation of another optimizer, SCD (stochastic coordinate descent). I have started with a serial implementation, and made some changes to LogisticRegressionFunction to test this optimizer. The optimizer adds another requirement to the function type to be optimized.

Currently, I am working on getting two things done:

  1. Implement the greedy descent policy, using ideas from http://www.maths.ed.ac.uk/~prichtar/Optimization_and_Big_Data_2015/slides/Schmidt.pdf.

  2. Optimize the FeatureGradient method in LogisticRegressionFunction to not use Gradient.

I'll update soon with more changes.
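For readers unfamiliar with the method, the core idea (a cheap update to a single coordinate per iteration) can be sketched in a standalone way. This toy example minimizes a separable quadratic and is not mlpack's actual SCD class; all names here are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimize f(x) = sum_j (x_j - target_j)^2 by stochastic coordinate
// descent: each iteration picks one coordinate and takes a gradient
// step on that coordinate alone.
std::vector<double> StochasticCoordinateDescent(
    const std::vector<double>& target,
    const double stepSize,
    const std::size_t maxIterations)
{
  std::vector<double> x(target.size(), 0.0);
  for (std::size_t it = 0; it < maxIterations; ++it)
  {
    // Random descent policy: pick one coordinate uniformly at random.
    const std::size_t j = std::rand() % x.size();
    // Partial gradient of f with respect to x_j only.
    const double partialGradient = 2.0 * (x[j] - target[j]);
    // Descend on the chosen coordinate; all others stay untouched.
    x[j] -= stepSize * partialGradient;
  }
  return x;
}
```

With stepSize 0.1 each visit to a coordinate contracts its error by a factor of 0.8, so after a few thousand iterations x sits very close to target.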

* @return The index of the coordinate to be descended.
*/

// TODO: Find a way to implement this.
Member:

I'm not sure I get the point of the TODO, can you elaborate on this?

Contributor Author:

I am planning to implement this descent policy, which gives the feature with the maximum guaranteed descent, based on http://www.maths.ed.ac.uk/~prichtar/Optimization_and_Big_Data_2015/slides/Schmidt.pdf and ideas from this implementation of SCD: http://ttic.uchicago.edu/~tewari/code/scd/

The above-mentioned implementation uses a parameter "rho", a property of the objective function (0.25 for logistic and hinge loss, 1 for quadratic loss), when calculating the predicted descent and finding the feature with the steepest descent.
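To make that selection rule concrete, here is a hedged standalone sketch (GreedyFeature and its signature are made up for illustration, not the PR's code): with a per-coordinate curvature bound rho, a step on coordinate j guarantees a decrease of roughly g_j^2 / (2 * rho), so the greedy policy picks the coordinate maximizing that predicted decrease.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy (Gauss-Southwell-style) coordinate selection: for a loss whose
// per-coordinate curvature is bounded by rho, a step of g_j / rho on
// coordinate j guarantees a decrease of about g_j^2 / (2 * rho), so the
// best coordinate is the one with the largest predicted decrease.
std::size_t GreedyFeature(const std::vector<double>& partialGradients,
                          const double rho)
{
  std::size_t best = 0;
  double bestDescent = 0.0;
  for (std::size_t j = 0; j < partialGradients.size(); ++j)
  {
    const double g = partialGradients[j];
    const double predictedDescent = (g * g) / (2.0 * rho);
    if (predictedDescent > bestDescent)
    {
      bestDescent = predictedDescent;
      best = j;
    }
  }
  return best;
}
```

Since rho scales every coordinate's prediction equally, it does not change the argmax here; it matters once the predicted descent is compared against a step-acceptance threshold.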

@rcurtin (Member) left a comment:

Looks good to me. I do think we need a formalized write-up of all the different FunctionTypes that we have now and what the differences between them are. Would you be willing to sketch up some documentation for that? I figured it could look like the documentation in doc/policies/.

Another question I have is, are you planning on making any of the other functions (like softmax regression) into ResolvableFunctionTypes?

And last question for now, have you done any timing simulations to compare SCD with, e.g., LBFGS or SGD for logistic regression? I am excited to see how it performs. :)

arma::mat predictors("0 0 0.4; 0 0 0.6; 0 0.3 0; 0.2 0 0; 0.2 -0.5 0;");
arma::Row<size_t> responses("1 1 0;");


Member:

Looks like there is an extra line here.

* NumFeatures() should return the number of features in the decision variable.
* Evaluate gives the value of the loss function at the current decision
* variable and FeatureGradient is used to evaluate the partial gradient with
* respect to the jth feature.
Member:

Thanks for documenting this. What happens if a user passes a FunctionType that isn't a ResolvableFunctionType? How difficult is the error message to understand?

Contributor Author:

I guess if someone passes in a FunctionType without the ResolvableFunctionType interface, the compiler will complain about the absence of the member functions NumFeatures and FeatureGradient. Are there ways we can make this more understandable to the user?

Member:

Sorry for the slow response here. We could use a static_assert() check to ensure that it has the proper methods. But for the sake of this PR I think that is not necessary. If you are interested afterwards in making the error reporting more robust for template parameters, I have some ideas. But if you are not interested that is ok too, you should not feel obligated. :)
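The static_assert() suggestion can be sketched with the standard member-detection idiom. This uses C++17's std::void_t; mlpack's actual traits machinery may differ, and every name below is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Detect whether T has a callable member function NumFeatures().
template <typename T, typename = void>
struct HasNumFeatures : std::false_type { };

template <typename T>
struct HasNumFeatures<T,
    std::void_t<decltype(std::declval<T>().NumFeatures())>>
    : std::true_type { };

// An optimizer entry point could then fail with a readable message
// instead of a wall of template errors.
template <typename FunctionType>
void CheckResolvable()
{
  static_assert(HasNumFeatures<FunctionType>::value,
      "FunctionType must implement NumFeatures() to be optimized by SCD.");
}

// Example types: one satisfying the check, one not.
struct GoodFunction { std::size_t NumFeatures() { return 3; } };
struct BadFunction { };
```

Calling CheckResolvable<BadFunction>() would then stop compilation at the static_assert with the quoted message.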

@shikharbhardwaj (Contributor Author):

Thanks for the suggestions, Ryan.
I am currently working on some more improvements. I'll add more functions with the Resolvable requirements soon.

Documenting the policies added is a nice improvement. I'll do that.

@mentekid (Contributor) left a comment:

Only minor comments from me.

This looks good. I am glad you went into the testing rabbit hole with HOGWILD!; now testing this feels more straightforward.

return coordinates[i] * coordinates[i] + bi[i] * coordinates[i] +
    intercepts[i];
}

//! Evaluate a function.
Contributor:

Minor - I think this comment and the one on the overload above should be swapped?

* The DescentFeature method is used to get the descent coordinate for the
* current iteration.
*
* @tparam ResolvableFunctionType The type of the function to be optimized.
Contributor:

Should we have a pointer to the more detailed description of this function's signature in scd.hpp? Something like "For more information, see the SCD implementation".

* @tparam DescentPolicy Descent policy to decide the order in which the
* coordinate for descent is selected.
*/
template <typename DescentPolicyType = RandomDescent>
Contributor:

I like that you have a default policy here.

size_t updateInterval;

//! The descent policy used to pick the coordinates for the update.
DescentPolicyType descentPolicy;
Contributor:

I would argue that the user would be unlikely to change the descent policy during the object's lifetime - would it make sense to make it const and remove the set function? I realise it's a very minor point though.

@shikharbhardwaj (Contributor Author) commented Aug 15, 2017:

I finished up the changes in SoftmaxRegressionFunction with the last commit. A minor inconsistency I noticed while working with the code in softmax and logistic regression is the way the decision variable is represented. Softmax regression has the features arranged column-wise, whereas logistic regression has them row-wise (so it gives out a single column vector as the output from Gradient).

I guess it would be nice to make this uniform across all the functions (SCD could then do the update on the relevant column instead of working on the entire decision variable), which would also ease parallelisation.

@zoq (Member) commented Aug 17, 2017:

I think we should see if we can work out the inconsistency. @rcurtin already looked into the logistic regression method, so he can probably provide some more insight regarding what needs to be done.

@shikharbhardwaj (Contributor Author):

I am working on the changes in logistic regression (using a rowvec for the decision variable and obtaining the submatrix views with tail_cols). I am done with the changes in the function, but they break some other tests (which assumed the gradient to be in the other shape).

@zoq (Member) commented Aug 18, 2017:

Can you refactor the test cases too? Don't feel obligated.

@shikharbhardwaj (Contributor Author):

Sure, I am already on it. :)

@mentekid (Contributor) left a comment:

Again only minor comments.

Thanks for normalizing the logistic regression interface as well, that was a good catch :)

const double regularization = lambda * (1.0 / (2.0 * predictors.n_cols)) *
arma::dot(parameters.col(0).subvec(1, parameters.n_elem - 1),
parameters.col(0).subvec(1, parameters.n_elem - 1));
norm * norm;
Contributor:

indent 2 more spaces here

@@ -205,3 +205,16 @@ void SoftmaxRegressionFunction::Gradient(const arma::mat& parameters,
lambda * parameters;
}
}

void SoftmaxRegressionFunction::FeatureGradient(const arma::mat& parameters,
const size_t j,
@mentekid (Contributor), Aug 21, 2017:

You could also tab this to align the consts in this line and the one above, but that's really minor.

@@ -54,10 +54,7 @@ double ParallelSGD<DecayPolicyType>::Optimize(
lastObjective = overallObjective;
overallObjective = 0;
Contributor:

Also, you can remove this line since you're now evaluating it all in one go.

@mentekid (Contributor):

This looks ready to merge on my end; the AppVeyor failure seems to be a random timeout.

I am leaving it open to any comments or reviews, and I'll merge this on Friday if nothing comes up.

@shikharbhardwaj nice work - thank you and well done :)

@zoq (Member) commented Aug 22, 2017:

Let's restart the Windows build; it looks like the build could not fetch the repo.

* @file cyclic_descent.hpp
* @author Shikhar Bhardwaj
*
* Cyclic descent policy for Stochastic Co ordinate Descent (SCD).
Member:

Looks like an unnecessary space?


} // namespace optimization
} // namespace mlpack
#endif
Member:

Do you mind adding a newline after #endif? I think that applies to almost all files in this PR.

SCD(const double stepSize = 0.01,
const size_t maxIterations = 100000,
const double tolerance = 1e-5,
const size_t updateInterval = 1e3,
Member:

Not sure, but it sounds like updateInterval is what we call batchSize in, e.g., MinibatchSGD. If it's the same, we should think about renaming the parameter. Let me know what you think.

Contributor Author:

updateInterval is different from batchSize. It is the number of iterations after which we print the diagnostic information to the logs. As printing requires a call to Evaluate (which may take time), we need to make sure that the diagnostic info does not slow down the iteration (each iteration in SCD is expected to be much faster than Evaluate).
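So the pattern is simply periodic evaluation: the cheap coordinate update runs every iteration, and the expensive full Evaluate() only fires every updateInterval iterations. A standalone sketch of that gating logic (names assumed, not mlpack's code):

```cpp
#include <cassert>
#include <cstddef>

// Count how many times a diagnostic evaluation would run over a given
// number of iterations, if it only fires every updateInterval steps.
std::size_t CountEvaluations(const std::size_t maxIterations,
                             const std::size_t updateInterval)
{
  std::size_t evaluations = 0;
  for (std::size_t i = 1; i <= maxIterations; ++i)
  {
    // The cheap per-coordinate update would happen on every iteration;
    // the expensive full Evaluate() only every updateInterval iterations.
    if (i % updateInterval == 0)
      ++evaluations;
  }
  return evaluations;
}
```

With the defaults quoted above (maxIterations = 100000, updateInterval = 1e3), only 100 full evaluations occur across 100000 coordinate updates.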

Member:

I see, thanks for the clarification.


// Calculate sigmoid.
const double exponent = parameters(0, 0) + arma::dot(predictors.col(i),
parameters.col(0).subvec(1, parameters.n_elem - 1));
parameters.tail_cols(parameters.n_elem - 1).t());
Member:

Nice!

const arma::rowvec sigmoids = (1 / (1 + arma::exp(-parameters(0, 0)
- parameters.tail_cols(parameters.n_elem - 1) * predictors)));

arma::mat diffs = responses - sigmoids;
Member:

I think we could inline the operation here, since it is only needed once.


BOOST_REQUIRE_EQUAL(descentPolicy.DescentFeature(0, point, f), 2);

point[1] = 10;
Member:

Can you elaborate on this (add a comment)?

@rcurtin (Member) left a comment:

Looks great to me. Thank you so much for adding the tutorial on FunctionType parameters, I think it is a great improvement. These are the last comments I have; I think they are all pretty minor. From my end the PR is good to go if you can address them. Great work!

@endcode

To evaluate the gradient at the given coordinates, where \c gradient is an
out-param for the required gradient.
Member:

It might be easier to read if you include this comment as part of the @code snippet:

// Evaluate the gradient at the given coordinates, where 'gradient' is an output
// parameter for the required gradient.
void Gradient(const arma::mat& coordinates, arma::mat& gradient);

There are lots of little @code blocks you could modify like this. What do you think?

Contributor Author:

Sure, I'll do that. I was thinking about improving the formatting but somehow it slipped my mind.

/**
 * Cyclic descent policy for Stochastic Co-ordinate Descent (SCD). This
 * descent scheme picks the co-ordinate for the descent in a cyclic manner
 * serially.
Member:

Is there an easy paper to cite here for SCD? It's always nice to provide references to the original paper.

out-param for the required gradient. The out-param is a sparse matrix(with
dimensions equal to the decision variable), for storing the gradient of the
jth feature. The \c gradient matrix is supposed to be non-zero in the jth
column, which contains the relavant partial gradient.
Member:

Minor misspelling: 'relevant' :)

\c FunctionType interface.

@code
void FeatureGradient(const arma::mat& coordinates, const size_t j, arma::sp_mat& gradient);
Member:

Do you think it would be better to name this function PartialGradient(), since it is the partial gradient with respect to one of the coordinates (features)?

Contributor Author:

I agree, PartialGradient describes the function better.

*
* @tparam ResolvableFunctionType The type of the function to be optimized.
* @param numEpoch The iteration number for which the feature is to be
* obtained.
Member:

To me, "epoch" means an entire pass over the whole dataset, so it seems like the wrong word to use here. I think it might be clearer to just name this parameter iteration.

The interface expects the following member functions from the function class

@code
size_t NumFeatures();
Member:

I am not sure on this---is it always true that NumFeatures() == coordinates.n_elem? If so, I am not sure this function is needed.

Contributor Author:

Yes, in the current scheme, NumFeatures() == coordinates.n_cols. I guess we can remove this function then (I kept it for consistency with SGD).

arma::dot(parameters.col(0).subvec(1, parameters.n_elem - 1),
parameters.col(0).subvec(1, parameters.n_elem - 1));
// term and take every term except the last one in the decision variable.
double norm = arma::norm(parameters.tail_cols(parameters.n_elem - 1));
Member:

I think the use of tail_cols() is a nice improvement, but is the switch of parameters from arma::vec to arma::rowvec necessary? That's an API change and might break some user's code. I think you could also just do tail_rows() and leave it as an arma::vec. Let me know what you think. Also I think it might be faster (very incrementally) to use arma::dot() instead of arma::norm() and then multiplying.
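The second suggestion rests on the identity dot(v, v) == norm(v)^2: arma::dot skips the square root that arma::norm computes and the multiplication that would follow. A dependency-free illustration of the identity (Dot and Norm here are stand-ins written for this sketch, not Armadillo's functions):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for arma::dot: the plain inner product.
double Dot(const std::vector<double>& a, const std::vector<double>& b)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i] * b[i];
  return sum;
}

// Stand-in for arma::norm: sqrt of the self inner product. So computing
// Norm(v) and then squaring it recomputes what Dot(v, v) gives directly.
double Norm(const std::vector<double>& v)
{
  return std::sqrt(Dot(v, v));
}
```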


gradient.set_size(arma::size(denseGrad));

gradient.col(j) = denseGrad.col(j);
Member:

Would it be faster to explicitly calculate only the part of the gradient that is needed here?

/**
* Test the greedy descent policy.
*/
BOOST_AUTO_TEST_CASE(GreedyDescentTest)
Member:

It might be also useful to add a simple test for RandomDescent and CyclicDescent; very simple tests can just make sure they give the expected result. These can be helpful later, to make sure that nothing is broken by later maintenance/refactorings of the code.
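Such tests can indeed be tiny. A cyclic policy, for instance, should just walk the coordinates in order modulo the feature count, which a few assertions pin down (standalone sketch with assumed names, not the PR's test code):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a CyclicDescent policy's DescentFeature(): visit the
// coordinates 0, 1, ..., numFeatures - 1 in order, then wrap around.
std::size_t CyclicDescentFeature(const std::size_t iteration,
                                 const std::size_t numFeatures)
{
  return iteration % numFeatures;
}
```

A RandomDescent test would instead only assert that the returned index is always in [0, numFeatures).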

}

/**
* Test changes to Softmax regression function.
Member:

I think this might be an inaccurate comment (or at least it won't make sense after merge), maybe something more like Test that SoftmaxRegressionFunction::FeatureGradient() works as expected.

@rcurtin (Member) commented Aug 28, 2017:

@shikharbhardwaj: thanks for the responses, everything looks good to me. I think there are still two comments worth addressing---SoftmaxRegressionFunction::PartialGradient() could be accelerated still, and tests could be added for RandomDescent and CyclicDescent, but those are up to you. I think it's fine to merge as-is regardless. Thanks for the hard work, I think this is a nice addition. 👍

@shikharbhardwaj (Contributor Author):

Sure, I had added tests for the descent policies (CyclicDescentTest and RandomDescentTest in the SCD test file).
I had started with optimizing PartialGradient in SoftmaxRegressionFunction but got confused. I'll start again with a clear head today.
I also looked into the idea of checking the function API for consistency at compile time. I guess this could be applied to almost all optimizers, so I'll do this as a separate PR.

@rcurtin (Member) commented Sep 14, 2017:

I think this is ready to merge; @mentekid: was there anything else you were waiting on for this one?

@mentekid (Contributor):

Sorry, I was planning to review this but forgot. I haven't looked at the latest push, but if you think it is ready to go just go ahead and merge.

Sorry for the delay, @shikharbhardwaj thanks for the contribution!

@rcurtin (Member) commented Sep 14, 2017:

Ok, sure, I'll go ahead and merge it then. I think the latest changes are good. Thanks! :)

@rcurtin rcurtin merged commit 21349b3 into mlpack:master Sep 14, 2017