
EVE optimizer #45

Merged
merged 8 commits into mlpack:master on Dec 13, 2018

Conversation

@zoq (Member) commented Oct 31, 2018

Implementation of "Improving Stochastic Gradient Descent with Feedback" by Jayanth Koushik and Hiroaki Hayashi.

zoq added some commits Oct 31, 2018

@rcurtin (Member) left a comment

Actually I did not like this paper very much at all, although I read a different version than the Koushik2016 reference given in the comments (I read "Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates", arXiv 1611.01505v3).

I think the intuitive argument for their step size calculation is (kind of) reasonable, but the main problem is that in general the optimum value may not be known (although they are right that in some cases it is 0). In fact it seems that the optimizer doesn't really work if f^* is not known or is greater than 0; it's not clear what to set f^* to in this case (or how it will affect convergence). The results for their CNN are so astoundingly better than the other optimizers' that I can't help but suspect that either there was some problem in the code, or that Eve won't perform like this in most situations, or that at the very least the experiments warrant much further investigation.

Also, I suspect that in practice the effective value of $\hat{d}_t$ is going to be the clipped values of either 1/c or c, especially as the function converges to the optimum f^*. (I suppose maybe larger batch sizes could help with that?)
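To make that concrete, here is roughly the update I have in mind from the v3 paper (a sketch from memory, not this PR's code; FeedbackTerm, prevDt, beta3, and clip are placeholder names, and it assumes <cmath> and <algorithm> are included):

// Relative change of the objective between iterations, clipped to
// [1 / c, c] and then smoothed; the Adam-style step size gets divided by
// the returned value.
double FeedbackTerm(const double objective,      // f(theta_t)
                    const double lastObjective,  // f(theta_{t - 1})
                    const double prevDt,         // previous smoothed term
                    const double beta3,          // smoothing rate
                    const double clip)           // the clipping constant c
{
  double d = std::abs(objective - lastObjective) /
      std::min(objective, lastObjective);
  d = std::min(std::max(d, 1.0 / clip), clip);
  return beta3 * prevDt + (1.0 - beta3) * d;
}

Once the relative change settles below 1/c (or stays above c), the clipped value sits at the bound and the smoothed term drifts toward it, which is the saturation I mean above.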

Anyway, those are all just comments on the paper itself. I'd be curious if you had a different reaction to the paper. I am always interested in learning more. :) (And maybe the 2016 paper was more comprehensive, but I don't have any internet connection as I write this so I can't get a copy...)

Like the others, I think we should merge it, but only once we've added documentation.

 * the first point in the dataset (presumably, the dataset is held internally in
 * the DecomposableFunctionType).
 */
class EVE

@rcurtin (Member) commented Nov 19, 2018
Do you think we should just call it "Eve", not "EVE"? The paper doesn't suggest the name as an acronym, just a reference to 'Adam and Eve', I guess.

(Resolved review threads: include/ensmallen_bits/eve/eve_impl.hpp, include/ensmallen_bits/eve/eve.hpp.)
}
else
{
  lastObjective = objective;

@rcurtin (Member) commented Nov 19, 2018

Shouldn't this happen every iteration, not just when i == 0?

@zoq (Author, Member) commented Dec 10, 2018

You are right, nice catch!
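Something like this, then (a rough sketch of the intended control flow, not the final diff; the surrounding Optimize() loop and variable names are assumed from the snippets in this review):

// Only compute the relative change once a previous objective exists.
if (i > 0)
{
  const double d = std::abs(objective - lastObjective) /
      std::min(objective, lastObjective);
  // ... clip d and fold it into the smoothed feedback term ...
}

// Refresh lastObjective on every iteration, not only when i == 0, so the
// next iteration measures the change against the right value.
lastObjective = objective;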

zoq added some commits Dec 10, 2018

@rcurtin (Member) left a comment

Looks good to me once the test is fixed and once we add documentation to optimizers.md. 👍

(Resolved review thread: include/ensmallen_bits/eve/eve.hpp.)
if (i > 0)
{
  const double d = std::abs(objective - lastObjective) /
      std::min(objective, lastObjective);

@rcurtin (Member) commented Dec 10, 2018

I wonder if this is the source of the nan in the tests. Maybe it needs some correction in the denominator?

@zoq (Author, Member) commented Dec 12, 2018

Right, std::min(objective, lastObjective) could be zero.
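One way to guard it (just a sketch, not necessarily what ends up in the commit; an alternative would be to add a small epsilon to the denominator instead):

// If the denominator would be zero (or negative), skip the feedback and
// fall back to d = 1, i.e. no extra scaling of the step size.
const double denom = std::min(objective, lastObjective);
const double d = (denom > 0.0) ?
    std::abs(objective - lastObjective) / denom : 1.0;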

zoq added some commits Dec 12, 2018

@zoq merged commit 19e7c77 into mlpack:master on Dec 13, 2018

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)