AdaSqrt - Second-order Information in First-order Optimization Methods #234
Conversation
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍
@mlpack-jenkins test this please
Let's fix the static analysis build.
@mlpack-jenkins test this please
Alright, it's working again. |
This paper was a bit harder to read (I guess I read a preprint). I'm not sure I followed the idea behind the approach well, but the PR looks like it exactly implements the algorithm of the paper. I only have a couple of minor comments. Sorry this review took so long; it took forever to find some time to read the paper... :)
* [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
* [AdaGrad in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad)
* [AdaDelta](#adadelta)
* [Differentiable separable functions](#differentiable-separable-functions)
Nice catch, thanks for adding this 'see also' section.
// Instantiated parent class.
AdaSqrtUpdate& parent;

// The squared gradient matrix.
GradType squaredGradient;
Up to you, but it may be clearer to name this `sumSquaredGradients`. I got a little confused and had to double-check the paper to understand.
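For reference, here is a minimal sketch of how that accumulator drives the update step, as I understand the paper's rule (AdaSqrt scales the step by sqrt(t) and divides by the accumulated sum itself, where AdaGrad divides by the square root of the sum). The names and signature below are illustrative, not the PR's actual API:

```cpp
#include <armadillo>
#include <cmath>

// Hypothetical single AdaSqrt-style step. `sumSquaredGradients` persists
// across iterations, which is why the suggested name emphasizes the sum.
void AdaSqrtStep(arma::mat& iterate,
                 const arma::mat& gradient,
                 arma::mat& sumSquaredGradients,
                 const double stepSize,
                 const double epsilon,
                 const size_t iteration)
{
  // Accumulate element-wise squared gradients over all iterations so far.
  sumSquaredGradients += arma::square(gradient);

  // Assumed AdaSqrt rule: scale by sqrt(t) and divide element-wise by the
  // accumulated sum (not its square root, as AdaGrad would).
  iterate -= stepSize * std::sqrt((double) iteration) * gradient /
      (sumSquaredGradients + epsilon);
}
```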
tests/CMakeLists.txt
Outdated
WORKING_DIRECTORY ${CMAKE_BINARY_DIR})
🎉
tests/ada_grad_test.cpp
Outdated
@@ -1,8 +1,6 @@
 /**
- * @file ada_grad_test.cpp
- * @author Abhinav Moudgil
+ * @file ada_sqrt_test.cpp
Oops, did you mean to commit these changes to `ada_grad_test.cpp`, or just to the new test file?
tests/ada_grad_test.cpp
Outdated
@@ -33,4 +33,4 @@ TEST_CASE("AdaGradLogisticRegressionTestFMat", "[AdaGradTest]")
 {
   AdaGrad adagrad(0.99, 32, 1e-8, 5000000, 1e-9, true);
   LogisticRegressionFunctionTest<arma::fmat>(adagrad, 0.003, 0.006);
-}
+}
\ No newline at end of file
Maybe we should revert this? I don't see a reason to remove the trailing newline :)
Maybe I missed something, but I think we don't use an extra trailing newline in the other test files.
Nevermind, I see what you mean, reverted with the last commit.
Co-authored-by: Ryan Curtin <ryan@ratml.org>
Co-authored-by: Conrad Sanderson <conradsnicta@users.noreply.github.com>
#define ENSMALLEN_ADA_SQRT_ADA_SQRT_IMPL_HPP

// In case it hasn't been included yet.
#include "ada_sqrt.hpp"
Is this `#include` necessary? `ada_sqrt_impl.hpp` gets included from `ada_sqrt.hpp`, which is in turn included from `ensmallen.hpp`.

If I recall correctly, at the outset we decided not to support usage of only specific optimisers, as it's too messy (maintenance nightmare) and potentially buggy. Only usage via `#include <ensmallen.hpp>` is supported.
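For context, the convention described above looks roughly like this (a simplified sketch of the header layout, not the exact file contents):

```cpp
// ada_sqrt.hpp -- the declaration header.
#ifndef ENSMALLEN_ADA_SQRT_ADA_SQRT_HPP
#define ENSMALLEN_ADA_SQRT_ADA_SQRT_HPP

// ... AdaSqrt class declaration ...

// The implementation header is pulled in at the bottom of the declaration
// header, so nobody should ever need to include ada_sqrt_impl.hpp
// directly; users just do `#include <ensmallen.hpp>`.
#include "ada_sqrt_impl.hpp"

#endif
```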
Not at all; it's an artifact from the mlpack code base. I'll remove it and also adapt the existing optimizers in another PR.
Yeah, I originally started doing that years ago to be nice to users who strangely included an `_impl.hpp` only. That makes more sense in the context of mlpack, where you could choose multiple headers to include, but here in ensmallen it may not be needed, since users should only include the main ensmallen header.
Second approval provided automatically after 24 hours. 👍
Implementation of AdaSqrt, from "Second-order Information in First-order Optimization Methods" by Yuzheng Hu, Licong Lin, and Shange Tang.
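For anyone wanting to try the new optimizer, here is a hedged usage sketch in the usual ensmallen style. The constructor signature is assumed to mirror AdaGrad's (step size, batch size, epsilon, max iterations, tolerance, shuffle), and the separable objective below is written just for this example:

```cpp
#include <ensmallen.hpp>

// A tiny differentiable separable objective (sum of squared errors of a
// linear model), written only for this example, in the interface that
// ensmallen's SGD-style optimizers expect.
class LinearLeastSquares
{
 public:
  LinearLeastSquares(const arma::mat& data, const arma::rowvec& responses) :
      data(data), responses(responses) { }

  // Number of separable terms (one per data point).
  size_t NumFunctions() const { return data.n_cols; }

  // No-op for brevity; a real objective would permute its points here.
  void Shuffle() { }

  double Evaluate(const arma::mat& w, const size_t begin,
                  const size_t batchSize)
  {
    double loss = 0.0;
    for (size_t i = begin; i < begin + batchSize; ++i)
    {
      const double r = arma::dot(w, data.col(i)) - responses(i);
      loss += r * r;
    }
    return loss;
  }

  void Gradient(const arma::mat& w, const size_t begin, arma::mat& g,
                const size_t batchSize)
  {
    g.zeros(w.n_rows, w.n_cols);
    for (size_t i = begin; i < begin + batchSize; ++i)
      g += 2.0 * (arma::dot(w, data.col(i)) - responses(i)) * data.col(i);
  }

 private:
  const arma::mat& data;
  const arma::rowvec& responses;
};

int main()
{
  arma::mat data(5, 100, arma::fill::randn);
  arma::rowvec responses = arma::randn<arma::rowvec>(100);

  LinearLeastSquares f(data, responses);
  arma::mat w(5, 1, arma::fill::zeros);

  // Assumed constructor arguments, mirroring AdaGrad's: step size, batch
  // size, epsilon, max iterations, tolerance, shuffle.
  ens::AdaSqrt optimizer(0.01, 32, 1e-8, 100000, 1e-9, true);
  optimizer.Optimize(f, w);

  w.print("learned weights:");
}
```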