Padam optimizer #46

Merged: 7 commits merged into mlpack:master on Dec 7, 2018

Conversation

@zoq (Member) commented Nov 6, 2018

Implementation of "Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks" by Jinghui Chen and Quanquan Gu.
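
For context, a minimal usage sketch (not taken from this PR) of how the new optimizer would be driven, assuming the usual ensmallen pattern of calling Optimize() with a function object and a coordinate matrix; MyFunction, the ens:: namespace spelling, and the default-constructed hyperparameters are placeholders to be checked against the merged headers:

// Illustrative only: assumes the standard ensmallen usage pattern.
// MyFunction stands in for any differentiable, separable function type.
#include <ensmallen.hpp>

MyFunction f;                                 // placeholder objective
arma::mat coordinates = f.GetInitialPoint();  // starting point

ens::Padam optimizer;                         // default hyperparameters
optimizer.Optimize(f, coordinates);           // coordinates now hold the result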

zoq added some commits Nov 3, 2018

@rcurtin (Member) left a comment

I enjoyed reading the paper and the results look compelling. I'm curious how they would generalize across different networks and problems.

I guess, since we have documentation in function_types.md and optimizers.md, we should probably update those with the information for Padam before we merge.

Also, do you want to add a note to HISTORY.md?

* Construct the Padam update policy with the given parameters.
*
* @param epsilon The epsilon value used to initialise the squared gradient
* parameter.

@rcurtin (Member) commented Nov 19, 2018
Would it be more accurate to say something like "Epsilon is the minimum allowed gradient"? My understanding of epsilon is that it's necessary to prevent dividing by very small values in the case where v_t is very small. (Perhaps that might be a documentation update to fix in the AdamUpdate and other similar optimizers---not sure?)
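
A toy (non-library) snippet illustrating the point, assuming nothing beyond Armadillo: when the second-moment estimate v is (near) zero, dividing by sqrt(v) alone blows the step up, while sqrt(v) + epsilon keeps it bounded.

// Toy illustration of epsilon's role; not part of the PR.
#include <armadillo>
#include <iostream>

int main()
{
  arma::mat m(1, 1); m.fill(1e-3);        // first-moment estimate
  arma::mat v(1, 1, arma::fill::zeros);   // (near-)zero second moment
  const double epsilon = 1e-8;

  arma::mat withoutEps = m / arma::sqrt(v);              // inf: division by zero
  arma::mat withEps    = m / (arma::sqrt(v) + epsilon);  // bounded: 1e-3 / 1e-8 = 1e5

  std::cout << withoutEps(0, 0) << " vs " << withEps(0, 0) << std::endl;
}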

vImproved = arma::max(vImproved, v);

iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
m / arma::pow(arma::sqrt(vImproved) + epsilon, partial * 2);

@rcurtin (Member) commented Nov 19, 2018

It's possible I have misunderstood something here, but my reading of Algorithm 1 in the paper suggests that this could be m / arma::pow(vImproved + epsilon, partial). But I think this depends on how the epsilon correction should be applied. Personally I don't think it makes much difference whether it is applied to the square root of vImproved or not (the suggestion I wrote would be a little quicker in practice). But maybe there is some reason I don't know of to write it the way you did?

(In the end I'm indifferent on how it gets implemented, mostly I am just curious here.)

@zoq (Author, Member) commented Dec 3, 2018

Good point; this is what we did for the AMSGrad update rule. It shouldn't make much of a difference, but I will change it.
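
For reference, a sketch of the simplified form being discussed, reusing the variable names from the snippet above (this is the suggestion as written, not necessarily the exact line that was merged):

// Epsilon applied to vImproved directly, with the exponent `partial` applied
// once; algebraically close to the original since (sqrt(v))^(2p) = v^p.
iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(vImproved + epsilon, partial);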

++iteration;

// And update the iterate.
m *= beta1;

@rcurtin (Member) commented Nov 19, 2018

Do we also need to update beta1 as beta1 *= lambda like the theory for Theorem 4.2 requires for convergence? I guess that Algorithm 1 does not include it, so perhaps it is not necessary. It also seems like they did not decay beta1 in the experiments.

@zoq (Author, Member) commented Dec 3, 2018

Just checked the reference implementation, and it looks like they don't use a decay rate for the beta value, so at this point I think it's fine to leave it as it is.

@rcurtin (Member) commented Dec 3, 2018

Sounds good to me. It can always be changed later if someone wants.
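
For anyone who does want the Theorem 4.2-style schedule later, a minimal sketch under the assumption of a new decay parameter lambda in (0, 1); this is hypothetical and not part of the merged code:

// Hypothetical beta1 decay, matching the convergence analysis rather than
// Algorithm 1 or the reference implementation. `lambda` would be a new
// member; `gradient` is the current gradient as in the update method.
m *= beta1;
m += (1 - beta1) * gradient;
beta1 *= lambda;  // beta1_t = beta1_0 * lambda^t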

@@ -186,6 +187,8 @@ using NadaMax = AdamType<NadaMaxUpdate>;

using OptimisticAdam = AdamType<OptimisticAdamUpdate>;

using Padam = AdamType<PadamUpdate>;

@rcurtin (Member) commented Nov 19, 2018

I guess that the user does not have any easy way through the constructor to set partial; do you think it is worth writing an extra constructor or something like this?

@zoq (Author, Member) commented Dec 3, 2018

Agreed, will do that.
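
A rough sketch of what such a constructor could look like; the parameter names, order, and defaults here are illustrative rather than the merged signature:

// Hypothetical constructor declaration exposing `partial` alongside the
// usual Adam-style hyperparameters; check the merged header for the real one.
Padam(const double stepSize = 0.001,
      const size_t batchSize = 32,
      const double beta1 = 0.9,
      const double beta2 = 0.999,
      const double partial = 0.25,
      const double epsilon = 1e-8,
      const size_t maxIterations = 100000,
      const double tolerance = 1e-5,
      const bool shuffle = true);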

@rcurtin (Member) commented Dec 3, 2018

Also, we should probably notify the authors of the paper that we have their optimizer implemented, just so that they know. (I guess we should probably do that in general!)

zoq added some commits Dec 4, 2018

@rcurtin (Member) approved these changes on Dec 4, 2018 and left a comment

Looks great to me. We should make this part of the 1.12.0 release, which can also include #60.

(Resolved, outdated review comment on doc/optimizers.md.)

zoq added some commits Dec 4, 2018

@zoq merged commit d067ce4 into mlpack:master on Dec 7, 2018

1 check passed: continuous-integration/travis-ci/pr (the Travis CI build passed)