New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation of Stochastic Coordinate Descent #1075
Conversation
* @return The index of the coordinate to be descended. | ||
*/ | ||
|
||
// TODO: Find a way to implement this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I get the point of the TODO, can you elaborate on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am planning to implement this descent policy which gives the feature with the maximum guaranteed descent, based on : http://www.maths.ed.ac.uk/~prichtar/Optimization_and_Big_Data_2015/slides/Schmidt.pdf and ideas from this implementation of SCD : http://ttic.uchicago.edu/~tewari/code/scd/
The above mentioned implementation uses a parameter "rho", which is a part of the objective function, which is 0.25 for logistic and hinge loss and 1 for quadratic loss in the part for calculating the predicted descent and finding the feature with the steepest descent.
Use parallel SGD test function for testing SCD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I do think we need a formalized write-up of all the different FunctionType
s that we have now and what the differences between them are. Would you be willing to sketch up some documentation for that? I figured it could be look the documentation in doc/policies/
.
Another question I have is, are you planning on making any of the other functions (like softmax regression) into ResolvableFunctionType
s?
And last question for now, have you done any timing simulations to compare SCD with, e.g., LBFGS or SGD for logistic regression? I am excited to see how it performs. :)
src/mlpack/tests/scd_test.cpp
Outdated
arma::mat predictors("0 0 0.4; 0 0 0.6; 0 0.3 0; 0.2 0 0; 0.2 -0.5 0;"); | ||
arma::Row<size_t> responses("1 1 0;"); | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there is an extra line here.
* NumFeatures() should return the number of features in the decision variable. | ||
* Evaluate gives the value of the loss function at the current decision | ||
* variable and FeatureGradient is used to evaluate the partial gradient with | ||
* respect to the jth feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for documenting this. What happens if a user passes a FunctionType
that isn't a ResolvableFunctionType
? How difficult is the error message to understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if someone passes in a FunctionType
without the ResolvableFunctionType
interface, the compiler will complain about the absence of member functions NumFeatures
and FeatureGradient
. Are there ways we can make this more understandeable to the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the slow response here. We could use a static_assert()
check to ensure that it has the proper methods. But for the sake of this PR I think that is not necessary. If you are interested afterwards in making the error reporting more robust for template parameters, I have some ideas. But if you are not interested that is ok too, you should not feel obligated. :)
Thanks for the suggestions Ryan. Documenting the policies added is a nice improvement. I'll do that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only minor comments from me.
This looks good. I am glad you went into the testing rabbit hole with HOGWILD!, now testing this feels more straightforward.
return coordinates[i] * coordinates[i] + bi[i] * coordinates[i] + | ||
intercepts[i]; | ||
} | ||
|
||
//! Evaluate a function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor - I think this comment and the one on the overload above should be swapped?
* The DescentFeature method is used to get the descent coordinate for the | ||
* current iteration. | ||
* | ||
* @tparam ResolvableFunctionType The type of the function to be optimized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a pointer to the more detailed description of what the signature of this function is in scd.hpp
? Something like "For more information see the SCD implementation"
* @tparam DescentPolicy Descent policy to decide the order in which the | ||
* coordinate for descent is selected. | ||
*/ | ||
template <typename DescentPolicyType = RandomDescent> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that you have a default policy here.
size_t updateInterval; | ||
|
||
//! The descent policy used to pick the coordinates for the update. | ||
DescentPolicyType descentPolicy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that the user would be unlikely to change the descent policy during the object's lifetime - would it make sense to make it const and remove the set function? I realise it's a very minor point though.
ResolvableFunctionType API changes
Unify changes with test functions
I finished up the changes in I guess it would be nice to make this uniform across all the functions (SCD could then do the update on the relevant column instead of working on the entire decision variable), also easing up the parallelisation? |
I think we should see if we can work out the inconsistency, @rcurtin already looked into the logistic regression method so he can probably provide some more insights, regarding what needs to be done. |
I am working on the changes in Logistic regression (using a |
Can you refactor the test cases too, don't feel obligated. |
Sure, I am already on it. :) |
Combine simultaneous changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again only minor comments.
Thanks for normalizing the logistic regression interface as well, that was a good catch :)
const double regularization = lambda * (1.0 / (2.0 * predictors.n_cols)) * | ||
arma::dot(parameters.col(0).subvec(1, parameters.n_elem - 1), | ||
parameters.col(0).subvec(1, parameters.n_elem - 1)); | ||
norm * norm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent 2 more spaces here
@@ -205,3 +205,16 @@ void SoftmaxRegressionFunction::Gradient(const arma::mat& parameters, | |||
lambda * parameters; | |||
} | |||
} | |||
|
|||
void SoftmaxRegressionFunction::FeatureGradient(const arma::mat& parameters, | |||
const size_t j, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also tab this to align the const
s in this line and the one above, but that's really minor.
@@ -54,10 +54,7 @@ double ParallelSGD<DecayPolicyType>::Optimize( | |||
lastObjective = overallObjective; | |||
overallObjective = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you can remove this line since you're now evaluating it all in one go.
This looks ready to merge on my end - AppVeyor failure seems to be a random timeout. I am leaving it open to any comments or reviews, and I'll merge this on Friday if nothing comes up. @shikharbhardwaj nice work - thank you and well done :) |
Let's restart the windows build, looks like the build could not fetch the repo. |
* @file cyclic_descent.hpp | ||
* @author Shikhar Bhardwaj | ||
* | ||
* Cyclic descent policy for Stochastic Co ordinate Descent (SCD). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like an unnecessary space?
|
||
} // namespace optimization | ||
} // namespace mlpack | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind to add a newline after #endif
I think that applies for almost all files in this PR.
SCD(const double stepSize = 0.01, | ||
const size_t maxIterations = 100000, | ||
const double tolerance = 1e-5, | ||
const size_t updateInterval = 1e3, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but it sounds like updateInterval
is what we call batchSize
in e.g. MinibatchSGD
I guess if it's the same, we should think about renaming the parameter. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updateInterval
is different from batchSize
. It is the number of iterations after which we print the diagnostic information to the logs. As printing requires a call to Evaluate
(which may take time), we need to make sure that the diagnostic info does not slow down the iteration (each iteration in SCD is expected to be much faster than Evaluate
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the clarification.
|
||
// Calculate sigmoid. | ||
const double exponent = parameters(0, 0) + arma::dot(predictors.col(i), | ||
parameters.col(0).subvec(1, parameters.n_elem - 1)); | ||
parameters.tail_cols(parameters.n_elem - 1).t()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
const arma::rowvec sigmoids = (1 / (1 + arma::exp(-parameters(0, 0) | ||
- parameters.tail_cols(parameters.n_elem - 1) * predictors))); | ||
|
||
arma::mat diffs = responses - sigmoids; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could inline the operation here, since it only needed once.
src/mlpack/tests/scd_test.cpp
Outdated
|
||
BOOST_REQUIRE_EQUAL(descentPolicy.DescentFeature(0, point, f), 2); | ||
|
||
point[1] = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you eloborate on this (add a comment)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me. Thank you so much for adding the tutorial on FunctionType parameters, I think it is a great improvement. These are the last comments I have; I think they are all pretty minor. From my end the PR is good to go if you can address them. Great work!
doc/policies/functiontype.hpp
Outdated
@endcode | ||
|
||
To evaluate the gradient at the given coordinates, where \c gradient is an | ||
out-param for the required gradient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be easier to read if you include this comment as part of the @code
snippet:
// Evaluate the gradient at the given coordinates, where 'gradient' is an output
// parameter for the required gradient.
void Gradient(const arma::mat& coordinates, arma::mat& gradient);
There are lots of little @code
blocks you could modify like this. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll do that. I was thinking about improving the formatting but somehow it slipped my mind.
/** | ||
* Cyclic descent policy for Stochastic Co-ordinate Descent(SCD). This | ||
* descent scheme picks a the co-ordinate for the descent in a cyclic manner | ||
* serially. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an easy paper to cite here for SCD? It's always nice to provide references to the original paper.
doc/policies/functiontype.hpp
Outdated
out-param for the required gradient. The out-param is a sparse matrix(with | ||
dimensions equal to the decision variable), for storing the gradient of the | ||
jth feature. The \c gradient matrix is supposed to be non-zero in the jth | ||
column, which contains the relavant partial gradient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor misspelling: 'relevant' :)
doc/policies/functiontype.hpp
Outdated
\c FunctionType interface. | ||
|
||
@code | ||
void FeatureGradient(const arma::mat& coordinates, const size_t j, arma::sp_mat& gradient); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would be better to name this function PartialGradient()
since it is the partial gradient with respect to one of the coordinates
features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, PartialGradient
describes the function better.
* | ||
* @tparam ResolvableFunctionType The type of the function to be optimized. | ||
* @param numEpoch The iteration number for which the feature is to be | ||
* obtained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me an epoch
is a term used for an entire pass over the whole dataset, but it seems like the wrong word to use here. I think it might be clearer to just name this parameter iteration
.
doc/policies/functiontype.hpp
Outdated
The interface expects the following member functions from the function class | ||
|
||
@code | ||
size_t NumFeatures(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure on this---is it always true that NumFeatures() == coordinates.n_elem
? If so, I am not sure this function is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the current scheme, NumFeatures() == coordinates.n_cols
. I guess we can remove this function then (I kept it for consistency with SGD).
arma::dot(parameters.col(0).subvec(1, parameters.n_elem - 1), | ||
parameters.col(0).subvec(1, parameters.n_elem - 1)); | ||
// term and take every term except the last one in the decision variable. | ||
double norm = arma::norm(parameters.tail_cols(parameters.n_elem - 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the use of tail_cols()
is a nice improvement, but is the switch of parameters
from arma::vec
to arma::rowvec
necessary? That's an API change and might break some user's code. I think you could also just do tail_rows()
and leave it as an arma::vec
. Let me know what you think. Also I think it might be faster (very incrementally) to use arma::dot()
instead of arma::norm()
and then multiplying.
|
||
gradient.set_size(arma::size(denseGrad)); | ||
|
||
gradient.col(j) = denseGrad.col(j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be faster to explicitly calculate only the part of the gradient that is needed here?
/** | ||
* Test the greedy descent policy. | ||
*/ | ||
BOOST_AUTO_TEST_CASE(GreedyDescentTest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be also useful to add a simple test for RandomDescent
and CyclicDescent
; very simple tests can just make sure they give the expected result. These can be helpful later, to make sure that nothing is broken by later maintenance/refactorings of the code.
src/mlpack/tests/scd_test.cpp
Outdated
} | ||
|
||
/** | ||
* Test changes to Softmax regression function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be an inaccurate comment (or at least it won't make sense after merge), maybe something more like Test that SoftmaxRegressionFunction::FeatureGradient()
works as expected.
Add descriptive comments and inline linear algbra operations in LogisticRegressionFunction::FeatureGradient
Use arma::dot instead of arma::norm in LogisiticRegressionFunction::PartialGradient
@shikharbhardwaj: thanks for the responses, everything looks good to me. I think there are still two comments worth addressing--- |
Sure, I had added tests for the descent policies ( |
I think this is ready to merge; @mentekid: was there anything else you were waiting on for this one? |
Sorry, I was planning to review this but forgot. I haven't looked at the latest push, but if you think it is ready to go just go ahead and merge. Sorry for the delay, @shikharbhardwaj thanks for the contribution! |
Ok, sure, I'll go ahead and merge it then. I think the latest changes we good. Thanks! :) |
Here is an implementation of another optimizer, SCD. I have started with a serial implementation first, and made some changes to
LogisticRegressionFunction
to test this optimizer. The optimizer requires another addition to the requirements for the function type to be optimized.Currently, I am working on getting two things done :
Implement the greedy descent policy, using ideas from : http://www.maths.ed.ac.uk/~prichtar/Optimization_and_Big_Data_2015/slides/Schmidt.pdf.
Optimize the
FeatureGradient
method inLogisticRegressionFunction
to not useGradient
.I'll update soon with more changes.